Predicting class probabilities for Gradient Boosted Trees in Spark using the tree output

Question

May 18, 2016, 05:20 PM

tree apache-spark-mllib boosting probability prediction


As of now, GBTs in Spark MLlib only give you predicted labels, not class probabilities.

I was thinking of computing predicted probabilities for a class myself (say, for all instances that fall under a particular leaf).

The code used to build the GBTs:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

To keep things simple, this gives me 2 trees of depth 2, as follows:

 Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489
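
To make "leaf score" concrete, here is a minimal sketch that transcribes the two printed trees by hand into plain Scala functions. The object name `PrintedTrees` and the sample instance are hypothetical and only serve as illustration; thresholds and leaf values are copied from the debug string above. On the real model, `model.trees(i).predict(features)` should presumably return the same per-tree value, but that is an assumption not verified here.

```scala
// Hand-written transcription of the debug string above (not generated by Spark).
object PrintedTrees {
  // Tree 0: splits on feature 3, then on feature 2 or feature 0.
  def tree0(f: Array[Double]): Double =
    if (f(3) <= 2.0) {
      if (f(2) <= 1.25) -0.5752212389380531 else 0.07462686567164178
    } else {
      if (f(0) <= 30.17) 0.7272727272727273 else 1.0
    }

  // Tree 1: splits on feature 5, then on feature 4 or feature 2.
  def tree1(f: Array[Double]): Double =
    if (f(5) <= 67.0) {
      if (f(4) <= 100.0) 0.5739387416147804 else -0.550117566730937
    } else {
      if (f(2) <= 0.0) 3.0383669122382835 else 0.4332824083446489
    }
}

// Hypothetical instance with six features (indices 0 to 5).
val instance = Array(25.0, 0.0, 1.0, 3.0, 50.0, 10.0)
val score0 = PrintedTrees.tree0(instance) // leaf reached in Tree 0
val score1 = PrintedTrees.tree1(instance) // leaf reached in Tree 1
```

For this made-up instance, Tree 0 ends in the leaf with value 0.7272727272727273 and Tree 1 in the leaf with value 0.5739387416147804.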

My question is: can I use the trees above to compute predicted probabilities like this?

For each instance in the feature set used for prediction:

exp(leaf score from tree 0 + leaf score from tree 1) / (1 + exp(leaf score from tree 0 + leaf score from tree 1))
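
Written out as code, the proposed formula is a logistic function applied to the summed leaf scores. The sketch below is only an illustration of that proposal, not a confirmed description of how Spark interprets the scores. The object name `LeafScoreProbability` and the `weights` parameter are hypothetical; with all weights equal to 1.0 it reproduces the formula exactly. The weights hook is there because `GradientBoostedTreesModel` also exposes `treeWeights`, and whether those should enter the sum is part of what is being asked.

```scala
// Sketch of the formula proposed in the question; an assumption, not Spark's documented behavior.
object LeafScoreProbability {
  // Logistic function: 1 / (1 + exp(-s)) is algebraically equal to exp(s) / (1 + exp(s)).
  def sigmoid(s: Double): Double = 1.0 / (1.0 + math.exp(-s))

  // leafScores: one "Predict:" value per tree for a single instance.
  // weights: hypothetical per-tree weights, one per tree.
  def probability(leafScores: Seq[Double], weights: Seq[Double]): Double = {
    require(leafScores.length == weights.length, "need exactly one weight per tree")
    val margin = leafScores.zip(weights).map { case (s, w) => s * w }.sum
    sigmoid(margin)
  }

  // Unweighted variant: exactly exp(sum) / (1 + exp(sum)) as written above.
  def probability(leafScores: Seq[Double]): Double =
    probability(leafScores, Seq.fill(leafScores.length)(1.0))
}

// Example: an instance reaching the 0.7272... leaf of Tree 0 and the 0.5739... leaf of Tree 1.
val p = LeafScoreProbability.probability(Seq(0.7272727272727273, 0.5739387416147804))
```

A summed score of 0 maps to 0.5, positive sums map above 0.5 and negative sums below, so the result behaves like a probability for the positive class under this assumption.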

This gives me a kind of probability, but I am not sure whether it is the right way to do it. Also, if there is any document that explains how the leaf score (the "Predict" value) is calculated, I would be really grateful if someone could share it.

Any suggestions would be much appreciated.