Contents

This article gives a rough explanation of every LightGBM parameter. There is a lot of material, so I am translating it gradually over several days, and I will update the details from time to time in a separate article. If you find a mistake, I would appreciate it if you pointed it out. The official LightGBM repository is on GitHub: https://github.com/microsoft/LightGBM

Each parameter is described in the format: default = default value, type = data type, options = allowed values, constraints = restrictions on the value.

Core Parameters

config, default = "", type = string, aliases: config_file
- path to the config file
- Note: can be used only in CLI version

task, default = train, type = enum, options: train, predict, convert_model, refit, aliases: task_type
- train, for training, aliases: training
- predict, for prediction, aliases: prediction, test
- convert_model, for converting the model file into if-else format; see the IO Parameters section for more information
- refit, for refitting an existing model with new data, aliases: refit_tree
- Note: can be used only in CLI version; for language-specific packages, use the corresponding functions
objective , default = regression, type = enum, options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg, aliases: objective_type, app, application

- regression application
  - regression, L2 loss, aliases: regression_l2, l2, mean_squared_error, mse, l2_root, root_mean_squared_error, rmse
  - regression_l1, L1 loss, aliases: l1, mean_absolute_error, mae
  - huber, Huber loss
  - fair, Fair loss
  - poisson, Poisson regression
  - quantile, Quantile regression
  - mape, MAPE loss, aliases: mean_absolute_percentage_error
  - gamma, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be gamma-distributed
  - tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be [tweedie-distributed](https://en.wikipedia.org/wiki/Tweedie_distribution#Occurrence_and_applications)

- binary classification application
  - binary, binary log loss classification (or logistic regression)
  - requires labels in {0, 1}; for general probability labels in [0, 1], see the cross-entropy application (https://en.wikipedia.org/wiki/Cross_entropy)

- multi-class classification application
  - multiclass, softmax objective function, aliases: softmax
  - multiclassova, One-vs-All binary objective function, aliases: multiclass_ova, ova, ovr
  - ``num_class`` should be set as well

- cross-entropy application
  - cross_entropy, objective function for cross-entropy (with optional linear weights), aliases: xentropy
  - cross_entropy_lambda, alternative parameterization of cross-entropy, aliases: xentlambda
  - label is anything in interval [0, 1]

- ranking application
  - lambdarank, lambdarank objective. label_gain can be used to set the gain (weight) of each integer label, and every label value must be smaller than the number of elements in label_gain
  - rank_xendcg, XE_NDCG_MART ranking objective function, aliases: xendcg, xe_ndcg, xe_ndcg_mart, xendcg_mart
  - rank_xendcg is faster to compute than lambdarank and behaves similarly
  - label should be of type int, and a larger number means higher relevance (e.g. 0: bad, 1: fair, 2: good, 3: perfect)
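To make the role of label_gain in the ranking objectives more concrete, the sketch below computes a discounted cumulative gain (DCG) from integer relevance labels. It assumes a gain table of the form 2^i - 1 (0, 1, 3, 7, ...) and the usual log2 position discount. The helper names `default_label_gain` and `dcg` are hypothetical and are not part of LightGBM.

```python
import math


def default_label_gain(max_label: int) -> list[int]:
    """Gain table 2^i - 1 for labels 0..max_label (assumed for illustration)."""
    return [(1 << i) - 1 for i in range(max_label + 1)]


def dcg(labels_in_ranked_order: list[int], label_gain: list[int]) -> float:
    """Discounted cumulative gain of a ranked list of integer labels."""
    total = 0.0
    for position, label in enumerate(labels_in_ranked_order):
        # Every label must index into label_gain, as the parameter text requires.
        if not 0 <= label < len(label_gain):
            raise ValueError(f"label {label} has no entry in label_gain")
        total += label_gain[label] / math.log2(position + 2)
    return total


gains = default_label_gain(3)
good_order = dcg([3, 2, 0], gains)  # most relevant document first
bad_order = dcg([0, 2, 3], gains)   # most relevant document last
```

A ranking that places highly relevant documents first gets a larger DCG, which is why the labels must be ordered so that larger integers mean better results.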

boosting, default = gbdt, type = enum, options: gbdt, rf, dart, goss, aliases: boosting_type, boost
- gbdt, traditional Gradient Boosting Decision Tree, aliases: gbrt
- rf, Random Forest, aliases: random_forest
- dart, Dropouts meet Multiple Additive Regression Trees
- goss, Gradient-based One-Side Sampling

data, default = "", type = string, aliases: train, train_data, train_data_file, data_filename
- path of the training data; LightGBM will train from this data
- Note: can be used only in CLI version

valid, default = "", type = string, aliases: test, valid_data, valid_data_file, test_data, test_data_file, valid_filenames
- path(s) of the validation/test data; LightGBM will output metrics for these data
- supports multiple validation data, separated by ,
- Note: can be used only in CLI version

num_iterations, default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
- number of boosting iterations
- Note: internally, LightGBM constructs num_class * num_iterations trees for multi-class classification problems
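The note on tree counts for num_iterations can be checked with a small calculation. This is only a sketch; `total_trees` is a hypothetical helper, not a LightGBM function.

```python
def total_trees(num_iterations: int, num_class: int = 1) -> int:
    """Number of trees built internally: one per class per boosting iteration."""
    if num_iterations < 0:
        raise ValueError("num_iterations must be >= 0")
    if num_class < 1:
        raise ValueError("num_class must be >= 1")
    return num_class * num_iterations


trees_for_3_classes = total_trees(num_iterations=100, num_class=3)
```

For a three-class problem with the default 100 iterations this gives 300 trees, which is worth keeping in mind when estimating model size and prediction time.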

learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
- shrinkage rate
- in dart, it also affects the normalization weights of dropped trees

num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: 1 < num_leaves <= 131072
- max number of leaves in one tree

tree_learner, default = serial, type = enum, options: serial, feature, data, voting, aliases: tree, tree_type, tree_learner_type
- type of tree learner; the option names are technical terms, so they are given as-is
- serial, single machine tree learner
- feature, feature parallel tree learner, aliases: feature_parallel
- data, data parallel tree learner, aliases: data_parallel
- voting, voting parallel tree learner, aliases: voting_parallel
- refer to the Parallel Learning Guide for more details
num_threads , default = 0, type = int, aliases: num_thread, nthread, nthreads, n_jobs
- number of threads for LightGBM
- 0 means the default number of threads in OpenMP
- for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
- do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
- be aware that a task manager or any similar CPU monitoring tool might report that cores are not fully utilized. This is normal
- for parallel learning, do not use all CPU cores, because this will cause poor performance for the network communication
- Note: please don't change this during training, especially when running multiple jobs simultaneously through external packages, otherwise it may cause undesirable errors
device_type , default = cpu, type = enum, options: cpu, gpu, aliases: device
- device for the tree learning; you can use a GPU to achieve faster learning
- Note: it is recommended to use a smaller max_bin (e.g. 63) to get a better speed-up
- Note: for faster speed, the GPU uses 32-bit floating point to sum up by default, which may affect the accuracy for some tasks. You can set gpu_use_dp=true to enable 64-bit floating point, but it will slow down the training
- Note: refer to the Installation Guide to build LightGBM with GPU support
seed , default = None, type = int, aliases: random_seed, random_state
- this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.
- by default, this seed is unused in favor of the default values of the other seeds
- this seed has lower priority than the other seeds, which means that it will be overridden if you set other seeds explicitly

Learning Control Parameters

force_col_wise , default = false, type = bool
- used only with cpu device type
- set this to true to force col-wise histogram building
- enabling this is recommended when:
  - the number of columns is large, or the total number of bins is large
  - num_threads is large, e.g. > 20
  - you want to reduce memory cost
- Note: when both force_col_wise and force_row_wise are false, LightGBM will first try them both, and then use the faster one. To remove the overhead of testing, set the faster one to true manually
- Note: this parameter cannot be used at the same time as force_row_wise; choose only one of them
force_row_wise , default = false, type = bool
- used only with cpu device type
- set this to true to force row-wise histogram building
- enabling this is recommended when:
  - the number of data points is large, and the total number of bins is relatively small
  - num_threads is relatively small, e.g. <= 16
  - you want to use a small bagging_fraction or goss boosting to speed up
- Note: setting this to true will double the memory cost for the Dataset object. If you do not have enough memory, try setting force_col_wise=true
- Note: when both force_col_wise and force_row_wise are false, LightGBM will first try them both, and then use the faster one. To remove the overhead of testing, set the faster one to true manually
- Note: this parameter cannot be used at the same time as force_col_wise; choose only one of them
histogram_pool_size , default = -1.0, type = double, aliases: hist_pool_size
- max cache size in MB for historical histograms
- < 0 means no limit

max_depth , default = -1, type = int
- limit the max depth of the tree model. This is used to deal with over-fitting when the amount of data is small. The tree still grows leaf-wise
- <= 0 means no limit

min_data_in_leaf, default = 20, type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
- minimal number of data in one leaf. Can be used to deal with over-fitting
min_sum_hessian_in_leaf , default = 1e-3, type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
- minimal sum of hessians in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting

bagging_fraction, default = 1.0, type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
- like feature_fraction, but this will randomly select part of the data without resampling
- can be used to speed up training
- can be used to deal with over-fitting
- Note: to enable bagging, bagging_freq should be set to a non-zero value as well

pos_bagging_fraction, default = 1.0, type = double, aliases: pos_sub_row, pos_subsample, pos_bagging, constraints: 0.0 < pos_bagging_fraction <= 1.0
- used only in binary application
- used for imbalanced binary classification problems; will randomly sample #pos_samples * pos_bagging_fraction positive samples in bagging
- should be used together with neg_bagging_fraction
- set this to 1.0 to disable
- Note: to enable this, you need to set bagging_freq and neg_bagging_fraction as well
- Note: if both pos_bagging_fraction and neg_bagging_fraction are set to 1.0, balanced bagging is disabled
- Note: if balanced bagging is enabled, bagging_fraction will be ignored

neg_bagging_fraction, default = 1.0, type = double, aliases: neg_sub_row, neg_subsample, neg_bagging, constraints: 0.0 < neg_bagging_fraction <= 1.0
- used only in binary application
- used for imbalanced binary classification problems; will randomly sample #neg_samples * neg_bagging_fraction negative samples in bagging
- should be used together with pos_bagging_fraction
- set this to 1.0 to disable
- Note: to enable this, you need to set bagging_freq and pos_bagging_fraction as well
- Note: if both pos_bagging_fraction and neg_bagging_fraction are set to 1.0, balanced bagging is disabled
- Note: if balanced bagging is enabled, bagging_fraction will be ignored

bagging_freq, default = 0, type = int, aliases: subsample_freq
- frequency for bagging
- 0 means disable bagging; k means perform bagging at every k-th iteration
- Note: to enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well
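The interaction between bagging_freq, bagging_fraction and the two balanced-bagging fractions is easy to get wrong, so here is a minimal sketch of the rules stated in the notes. Both helpers are hypothetical, and the choice of iteration 0 as the first resampling point is an assumption made for illustration rather than documented behaviour.

```python
def bagging_enabled(params: dict) -> bool:
    """Return True if the given settings would switch bagging on."""
    if params.get("bagging_freq", 0) <= 0:
        return False  # bagging_freq = 0 disables bagging entirely
    pos = params.get("pos_bagging_fraction", 1.0)
    neg = params.get("neg_bagging_fraction", 1.0)
    # Balanced bagging is off only when both fractions are 1.0.
    balanced = pos < 1.0 or neg < 1.0
    # With balanced bagging on, bagging_fraction is ignored.
    return balanced or params.get("bagging_fraction", 1.0) < 1.0


def resampling_iterations(bagging_freq: int, num_iterations: int) -> list[int]:
    """Iterations (0-based) at which a new bag is drawn, assuming a start at 0."""
    if bagging_freq <= 0:
        return []
    return list(range(0, num_iterations, bagging_freq))


only_fraction = bagging_enabled({"bagging_fraction": 0.8})
fraction_and_freq = bagging_enabled({"bagging_fraction": 0.8, "bagging_freq": 5})
```

Setting only bagging_fraction has no effect, which is the most common mistake; both values have to be set together.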

bagging_seed, default = 3, type = int, aliases: bagging_fraction_seed
- random seed for bagging

feature_fraction, default = 1.0, type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
- LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree
- can be used to speed up training
- can be used to deal with over-fitting

feature_fraction_bynode, default = 1.0, type = double, aliases: sub_feature_bynode, colsample_bynode, constraints: 0.0 < feature_fraction_bynode <= 1.0
- LightGBM will randomly select a subset of features at each tree node if feature_fraction_bynode is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features at each tree node
- can be used to deal with over-fitting
- Note: unlike feature_fraction, this cannot speed up training
- Note: if both feature_fraction and feature_fraction_bynode are smaller than 1.0, the final fraction of features at each node is feature_fraction * feature_fraction_bynode
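When feature_fraction and feature_fraction_bynode are combined, the two fractions multiply. The sketch below works through that arithmetic; `features_per_node` is a hypothetical helper, and rounding to the nearest whole feature (with a minimum of one) is an assumption made for illustration.

```python
def features_per_node(
    num_features: int,
    feature_fraction: float = 1.0,
    feature_fraction_bynode: float = 1.0,
) -> tuple[float, int]:
    """Return the effective per-node fraction and an approximate feature count."""
    for value in (feature_fraction, feature_fraction_bynode):
        if not 0.0 < value <= 1.0:
            raise ValueError("fractions must be in (0.0, 1.0]")
    effective = feature_fraction * feature_fraction_bynode
    # Assumed rounding rule; the exact count LightGBM uses may differ.
    return effective, max(1, round(num_features * effective))


fraction, count = features_per_node(100, feature_fraction=0.8, feature_fraction_bynode=0.5)
```

With 100 features, 0.8 per tree and 0.5 per node, each node considers roughly 40 features.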
feature_fraction_seed , default = 2, type = int
- random seed for feature_fraction
extra_trees , default = false, type = bool
- use extremely randomized trees
- if set to true, when evaluating node splits LightGBM will check only one randomly chosen threshold for each feature
- can be used to deal with over-fitting
extra_seed , default = 6, type = int
- random seed for selecting thresholds when extra_trees is true

early_stopping_round, default = 0, type = int, aliases: early_stopping_rounds, early_stopping, n_iter_no_change
- will stop training if the validation metric does not improve in the last early_stopping_round rounds
- <= 0 means disable
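The stopping rule behind early_stopping_round can be sketched in a few lines. This is an illustration of the general idea on a list of validation scores, not LightGBM's actual implementation; `early_stopping_iteration` and its `higher_is_better` flag are hypothetical names.

```python
from typing import Optional


def early_stopping_iteration(
    scores: list[float],
    early_stopping_round: int,
    higher_is_better: bool = False,
) -> Optional[int]:
    """Return the 0-based iteration at which training would stop, or None."""
    if early_stopping_round <= 0:
        return None  # <= 0 disables early stopping
    best_score = None
    best_iteration = -1
    for iteration, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= early_stopping_round:
            return iteration  # no improvement for early_stopping_round rounds
    return None


validation_loss = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44]
stop_at = early_stopping_iteration(validation_loss, early_stopping_round=3)
```

Here the best loss is reached at iteration 1, and after three further rounds without improvement the loop stops at iteration 4.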

first_metric_only , default = false, type = bool
- set this to true if you want to use only the first metric for early stopping

max_delta_step, default = 0.0, type = double, aliases: max_tree_output, max_leaf_output
- used to limit the max output of tree leaves
- <= 0 means no constraint
- the final max output of leaves is learning_rate * max_delta_step
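The relation between max_delta_step and learning_rate can be illustrated as follows. This is a sketch under the assumption that the limit is applied symmetrically to the raw leaf value before shrinkage; `final_leaf_output` is a hypothetical helper.

```python
def final_leaf_output(
    raw_leaf_output: float,
    learning_rate: float = 0.1,
    max_delta_step: float = 0.0,
) -> float:
    """Clip the raw leaf value to +/- max_delta_step, then apply shrinkage."""
    if max_delta_step > 0.0:
        # Assumed symmetric clipping; <= 0 means no constraint.
        raw_leaf_output = max(-max_delta_step, min(max_delta_step, raw_leaf_output))
    return learning_rate * raw_leaf_output


limited = final_leaf_output(5.0, learning_rate=0.1, max_delta_step=2.0)
unlimited = final_leaf_output(5.0, learning_rate=0.1, max_delta_step=0.0)
```

With learning_rate = 0.1 and max_delta_step = 2.0, no leaf can contribute more than 0.2 in absolute value.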

lambda_l1, default = 0.0, type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
- L1 regularization
lambda_l2, default = 0.0, type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
- L2 regularization
min_gain_to_split, default = 0.0, type = double, aliases: min_split_gain, constraints: min_gain_to_split >= 0.0
- the minimal gain to perform a split

drop_rate, default = 0.1, type = double, aliases: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
- used only in dart
- dropout rate: a fraction of previous trees to drop during the dropout

max_drop , default = 50, type = int
- used only in dart
- max number of dropped trees during one boosting iteration
- <=0 means no limit
skip_drop , default = 0.5, type = double, constraints: 0.0 <= skip_drop <= 1.0
- used only in dart
- probability of skipping the dropout procedure during a boosting iteration
xgboost_dart_mode , default = false, type = bool
- used only in dart
- set this to true, if you want to use xgboost dart mode
uniform_drop , default = false, type = bool
- used only in dart
- set this to true, if you want to use uniform drop
drop_seed , default = 4, type = int
- used only in dart
- random seed to choose dropping models
top_rate , default = 0.2, type = double, constraints: 0.0 <= top_rate <= 1.0
- used only in goss
- the retain ratio of large gradient data
other_rate , default = 0.1, type = double, constraints: 0.0 <= other_rate <= 1.0
- used only in goss
- the retain ratio of small gradient data
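To show how top_rate and other_rate work together, here is a rough sketch of Gradient-based One-Side Sampling as it is usually described: keep the top_rate share of rows with the largest absolute gradients, draw an other_rate share of all rows at random from the remainder, and up-weight the drawn rows by (1 - top_rate) / other_rate. The function `goss_sample` is hypothetical and is not how LightGBM implements it internally.

```python
import random


def goss_sample(
    gradients: list[float],
    top_rate: float = 0.2,
    other_rate: float = 0.1,
    seed: int = 0,
) -> dict[int, float]:
    """Return {row index: weight} for the rows kept by the sampling step."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_count = int(n * top_rate)
    other_count = int(n * other_rate)
    top_rows = order[:top_count]
    rest = order[top_count:]
    rng = random.Random(seed)
    sampled_rows = rng.sample(rest, min(other_count, len(rest)))
    # Small-gradient rows are amplified to compensate for the sampling.
    amplify = (1.0 - top_rate) / other_rate if other_rate > 0.0 else 0.0
    weights = {i: 1.0 for i in top_rows}
    weights.update({i: amplify for i in sampled_rows})
    return weights


example_gradients = [i / 100 for i in range(100)]
kept = goss_sample(example_gradients, top_rate=0.2, other_rate=0.1)
```

With 100 rows and the default rates, 20 large-gradient rows are always kept and 10 of the remaining 80 are sampled, so each tree sees 30 rows.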
min_data_per_group , default = 100, type = int, constraints: min_data_per_group > 0
- minimal number of data per categorical group
max_cat_threshold , default = 32, type = int, constraints: max_cat_threshold > 0
- used for the categorical features
- limit the max threshold points in categorical features
cat_l2 , default = 10.0, type = double, constraints: cat_l2 >= 0.0
- used for the categorical features
- L2 regularization in categorical split
cat_smooth , default = 10.0, type = double, constraints: cat_smooth >= 0.0
- used for the categorical features
- this can reduce the effect of noises in categorical features, especially for categories with few data
max_cat_to_onehot , default = 4, type = int, constraints: max_cat_to_onehot > 0
- when the number of categories of one feature is smaller than or equal to max_cat_to_onehot, the one-vs-other split algorithm will be used
top_k , default = 20, type = int, aliases: topk, constraints: top_k > 0
- used only in voting tree learner, refer to Voting parallel <./Parallel-Learning-Guide.rst#choose-appropriate-parallel-algorithm>__
- set this to larger value for more accurate result, but it will slow down the training speed
monotone_constraints , default = None, type = multi-int, aliases: mc, monotone_constraint
- used for constraints of monotonic features
- 1 means increasing, -1 means decreasing, 0 means non-constraint
- you need to specify all features in order. For example, mc=-1,0,1 means decreasing for 1st feature, non-constraint for 2nd feature and increasing for the 3rd feature
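Because monotone_constraints must list every feature in order, it is convenient to build the list from feature names. The sketch below does that and formats the CLI-style string from the example above; `build_monotone_constraints` and the feature names are hypothetical.

```python
def build_monotone_constraints(
    feature_names: list[str],
    directions: dict[str, int],
) -> list[int]:
    """Return one entry per feature, in order: 1, -1, or 0 (unconstrained)."""
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    if any(d not in (-1, 0, 1) for d in directions.values()):
        raise ValueError("each direction must be -1, 0 or 1")
    # Features that are not mentioned get 0, i.e. no constraint.
    return [directions.get(name, 0) for name in feature_names]


features = ["price", "colour", "quality"]  # hypothetical feature names
constraints = build_monotone_constraints(features, {"price": -1, "quality": 1})
cli_argument = "mc=" + ",".join(str(c) for c in constraints)
```

The resulting list can be placed in a parameter dictionary under the key monotone_constraints, or passed on the command line in the string form.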
monotone_constraints_method , default = basic, type = string, aliases: monotone_constraining_method, mc_method
- used only if monotone_constraints is set
- monotone constraints method
  - basic, the most basic monotone constraints method. It does not slow the library at all, but over-constrains the predictions
  - intermediate, a more advanced method <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>__, which may slow the library very slightly. However, this method is much less constraining than the basic method and should significantly improve the results
monotone_penalty , default = 0.0, type = double, aliases: monotone_splits_penalty, ms_penalty, mc_penalty, constraints: monotone_penalty >= 0.0
- used only if monotone_constraints is set
- monotone penalty <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>__: a penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree. The penalty applied to monotone splits on a given depth is a continuous, increasing function of the penalization parameter
- if 0.0 (the default), no penalization is applied
feature_contri , default = None, type = multi-double, aliases: feature_contrib, fc, fp, feature_penalty
- used to control feature's split gain, will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of i-th feature
- you need to specify all features in order
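The gain adjustment performed by feature_contri is a simple element-wise product. The following sketch applies the formula gain[i] = max(0, feature_contri[i]) * gain[i] to a list of split gains; `apply_feature_contri` is a hypothetical helper.

```python
def apply_feature_contri(gains: list[float], feature_contri: list[float]) -> list[float]:
    """Scale each feature's split gain by its non-negative contribution factor."""
    if len(gains) != len(feature_contri):
        raise ValueError("feature_contri must list all features in order")
    return [max(0.0, c) * g for g, c in zip(gains, feature_contri)]


adjusted = apply_feature_contri([4.0, 2.0, 1.0], [1.0, 0.5, -3.0])
```

A factor below 1.0 makes a feature less attractive for splitting, and a negative factor is treated as 0, which removes the feature's gain altogether.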
forcedsplits_filename , default = "", type = string, aliases: fs, forced_splits_filename, forced_splits_file, forced_splits
- path to a .json file that specifies splits to force at the top of every decision tree before best-first learning commences
- .json file can be arbitrarily nested, and each split contains feature, threshold fields, as well as left and right fields representing subsplits
- categorical splits are forced in a one-hot fashion, with left representing the split containing the feature value and right representing other values
- Note: the forced split logic will be ignored, if the split makes gain worse
- see this file <https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification/forced_splits.json>__ as an example
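As an illustration of the nested structure described for forcedsplits_filename, the sketch below builds a small forced-splits document with the feature, threshold, left and right fields and serializes it to JSON. The feature indices and thresholds are made up, and `split_depth` is a hypothetical helper for inspecting the nesting.

```python
import json

# Hypothetical forced splits: split on feature 0 first, then refine both sides.
forced_splits = {
    "feature": 0,
    "threshold": 0.5,
    "left": {"feature": 2, "threshold": 10.0},
    "right": {"feature": 1, "threshold": -1.5},
}


def split_depth(node: dict) -> int:
    """Number of forced split levels below and including this node."""
    children = [node[key] for key in ("left", "right") if key in node]
    return 1 + max((split_depth(child) for child in children), default=0)


json_text = json.dumps(forced_splits, indent=2)
depth = split_depth(forced_splits)
```

Writing `json_text` to a file and passing its path as forcedsplits_filename would request these splits at the top of every tree, subject to the note that a forced split is ignored if it makes the gain worse.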
refit_decay_rate , default = 0.9, type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
- decay rate of refit task, will use leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output to refit trees
- used only in refit task in CLI version or as argument in refit function in language-specific package
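The refit formula for refit_decay_rate is a weighted average of the old and new leaf values. A minimal sketch, with `refit_leaf_output` as a hypothetical helper:

```python
def refit_leaf_output(
    old_leaf_output: float,
    new_leaf_output: float,
    refit_decay_rate: float = 0.9,
) -> float:
    """Blend the existing leaf value with the value fitted on the new data."""
    if not 0.0 <= refit_decay_rate <= 1.0:
        raise ValueError("refit_decay_rate must be in [0.0, 1.0]")
    return refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output


blended = refit_leaf_output(old_leaf_output=1.0, new_leaf_output=2.0)
```

With the default of 0.9 the old model dominates, so the leaf moves only a tenth of the way toward the new value; a rate of 0.0 replaces the leaf value entirely.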
cegb_tradeoff , default = 1.0, type = double, constraints: cegb_tradeoff >= 0.0
- cost-effective gradient boosting multiplier for all penalties
cegb_penalty_split , default = 0.0, type = double, constraints: cegb_penalty_split >= 0.0
- cost-effective gradient-boosting penalty for splitting a node
cegb_penalty_feature_lazy , default = 0,0,...,0, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied per data point
cegb_penalty_feature_coupled , default = 0,0,...,0, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied once per forest
path_smooth , default = 0, type = double, constraints: path_smooth >= 0.0
- controls smoothing applied to tree nodes
- helps prevent overfitting on leaves with few samples
- if set to zero, no smoothing is applied
- if path_smooth > 0 then min_data_in_leaf must be at least 2
- larger values give stronger regularisation
  - the weight of each node is ((n / path_smooth) * w + w_p) / (n / path_smooth + 1), where n is the number of samples in the node, w is the optimal node weight to minimise the loss (approximately -sum_gradients / sum_hessians), and w_p is the weight of the parent node
  - note that the parent output w_p itself has smoothing applied, unless it is the root node, so that the smoothing effect accumulates with the tree depth
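The effect of path_smooth is easier to see numerically. The sketch below assumes the smoothed weight is the weighted average ((n / path_smooth) * w + w_p) / (n / path_smooth + 1), using the symbols n, w and w_p as defined above; `smoothed_weight` is a hypothetical helper.

```python
def smoothed_weight(n: int, w: float, w_parent: float, path_smooth: float) -> float:
    """Pull a node's weight toward its parent's; small nodes are pulled hardest."""
    if path_smooth <= 0.0:
        return w  # zero means no smoothing is applied
    ratio = n / path_smooth
    return (ratio * w + w_parent) / (ratio + 1.0)


small_node = smoothed_weight(n=10, w=2.0, w_parent=0.0, path_smooth=10.0)
large_node = smoothed_weight(n=1000, w=2.0, w_parent=0.0, path_smooth=10.0)
```

A node with only 10 samples ends up halfway between its own optimum and the parent's weight, while a node with 1000 samples stays close to its own optimum.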
verbosity , default = 1, type = int, aliases: verbose
- controls the level of LightGBM's verbosity
- < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug
input_model , default = "", type = string, aliases: model_input, model_in
- filename of input model
- for prediction task, this model will be applied to prediction data
- for train task, training will be continued from this model
- Note: can be used only in CLI version
output_model , default = LightGBM_model.txt, type = string, aliases: model_output, model_out
- filename of output model in training
- Note: can be used only in CLI version
snapshot_freq , default = -1, type = int, aliases: save_period
- frequency of saving model file snapshot
- set this to positive value to enable this function. For example, the model file will be snapshotted at each iteration if snapshot_freq=1
- Note: can be used only in CLI version

IO Parameters

Dataset Parameters


-  ``max_bin`` , default = ``255``, type = int, constraints: ``max_bin > 1``

   -  max number of bins that feature values will be bucketed in

   -  small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)

   -  LightGBM will auto compress memory according to ``max_bin``. For example, LightGBM will use ``uint8_t`` for feature value if ``max_bin=255``

-  ``max_bin_by_feature`` , default = ``None``, type = multi-int

   -  max number of bins for each feature

   -  if not specified, will use ``max_bin`` for all features

-  ``min_data_in_bin`` , default = ``3``, type = int, constraints: ``min_data_in_bin > 0``

   -  minimal number of data inside one bin

   -  use this to avoid one-data-one-bin (potential over-fitting)

-  ``bin_construct_sample_cnt`` , default = ``200000``, type = int, aliases: ``subsample_for_bin``, constraints: ``bin_construct_sample_cnt > 0``

   -  number of data points sampled to construct histogram bins

   -  setting this to larger value will give better training result, but will increase data loading time

   -  set this to larger value if data is very sparse

-  ``data_random_seed`` , default = ``1``, type = int, aliases: ``data_seed``

   -  random seed for sampling data to construct histogram bins

-  ``is_enable_sparse`` , default = ``true``, type = bool, aliases: ``is_sparse``, ``enable_sparse``, ``sparse``

   -  used to enable/disable sparse optimization

-  ``enable_bundle`` , default = ``true``, type = bool, aliases: ``is_enable_bundle``, ``bundle``

   -  set this to ``false`` to disable Exclusive Feature Bundling (EFB), which is described in `LightGBM: A Highly Efficient Gradient Boosting Decision Tree <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree>`__

   -  **Note**: disabling this may slow down training for sparse datasets

-  ``use_missing`` , default = ``true``, type = bool

   -  set this to ``false`` to disable the special handle of missing value

-  ``zero_as_missing`` , default = ``false``, type = bool

   -  set this to ``true`` to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices)

   -  set this to ``false`` to use ``na`` for representing missing values

-  ``feature_pre_filter`` , default = ``true``, type = bool

   -  set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf``

   -  as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object

   -  **Note**: setting this to ``false`` may slow down the training

-  ``pre_partition`` , default = ``false``, type = bool, aliases: ``is_pre_partition``

   -  used for parallel learning (excluding the ``feature_parallel`` mode)

   -  ``true`` if training data are pre-partitioned, and different machines use different partitions

-  ``two_round`` , default = ``false``, type = bool, aliases: ``two_round_loading``, ``use_two_round_loading``

   -  set this to ``true`` if data file is too big to fit in memory

   -  by default, LightGBM will map data file to memory and load features from memory. This will provide faster data loading speed, but may cause an out-of-memory error when the data file is very big

   -  **Note**: works only in case of loading data directly from file

-  ``header`` , default = ``false``, type = bool, aliases: ``has_header``

   -  set this to ``true`` if input data has header

   -  **Note**: works only in case of loading data directly from file

-  ``label_column`` , default = ``""``, type = int or string, aliases: ``label``

   -  used to specify the label column

   -  use number for index, e.g. ``label=0`` means column\_0 is the label

   -  add a prefix ``name:`` for column name, e.g. ``label=name:is_click``

   -  **Note**: works only in case of loading data directly from file

-  ``weight_column`` , default = ``""``, type = int or string, aliases: ``weight``

   -  used to specify the weight column

   -  use number for index, e.g. ``weight=0`` means column\_0 is the weight

   -  add a prefix ``name:`` for column name, e.g. ``weight=name:weight``

   -  **Note**: works only in case of loading data directly from file

   -  **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0, and weight is column\_1, the correct parameter is ``weight=0``
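The index rule for integer column parameters (the label column is not counted) is a frequent source of off-by-one errors. The sketch below converts a column position in the file into the value to pass; `column_param_index` is a hypothetical helper.

```python
def column_param_index(file_column: int, label_column: int = 0) -> int:
    """Index to pass for weight/group/ignore columns, skipping the label column."""
    if file_column == label_column:
        raise ValueError("the label column cannot be used as another column")
    # Columns after the label shift down by one because the label is not counted.
    return file_column - 1 if file_column > label_column else file_column


weight_param = column_param_index(file_column=1, label_column=0)
```

With the label in column_0 and the weight in column_1, the value to pass is 0, matching the example in the note.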

-  ``group_column`` , default = ``""``, type = int or string, aliases: ``group``, ``group_id``, ``query_column``, ``query``, ``query_id``

   -  used to specify the query/group id column

   -  use number for index, e.g. ``query=0`` means column\_0 is the query id

   -  add a prefix ``name:`` for column name, e.g. ``query=name:query_id``

   -  **Note**: works only in case of loading data directly from file

   -  **Note**: data should be grouped by query\_id

   -  **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``

-  ``ignore_column`` , default = ``""``, type = multi-int or string, aliases: ``ignore_feature``, ``blacklist``

   -  used to specify some ignoring columns in training

   -  use number for index, e.g. ``ignore_column=0,1,2`` means column\_0, column\_1 and column\_2 will be ignored

   -  add a prefix ``name:`` for column name, e.g. ``ignore_column=name:c1,c2,c3`` means c1, c2 and c3 will be ignored

   -  **Note**: works only in case of loading data directly from file

   -  **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``

   -  **Note**: despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully

-  ``categorical_feature`` , default = ``""``, type = multi-int or string, aliases: ``cat_feature``, ``categorical_column``, ``cat_column``

   -  used to specify categorical features

   -  use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features

   -  add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features

   -  **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)

   -  **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``

   -  **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)

   -  **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero

   -  **Note**: all negative values will be treated as **missing values**

   -  **Note**: the output cannot be monotonically constrained with respect to a categorical feature

-  ``forcedbins_filename`` , default = ``""``, type = string

   -  path to a ``.json`` file that specifies bin upper bounds for some or all features

   -  ``.json`` file should contain an array of objects, each containing the word ``feature`` (integer feature index) and ``bin_upper_bound`` (array of thresholds for binning)

   -  see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json>`__ as an example
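A forced-bins file for ``forcedbins_filename`` can be generated programmatically. The sketch below builds the array of objects described above and checks its shape before serializing; the feature indices and thresholds are invented, and `validate_forced_bins` is a hypothetical helper.

```python
import json

# Hypothetical bin boundaries for two features.
forced_bins = [
    {"feature": 0, "bin_upper_bound": [0.3, 0.35, 0.4]},
    {"feature": 1, "bin_upper_bound": [-0.2, -0.15, -0.1]},
]


def validate_forced_bins(entries: list) -> bool:
    """Check that every entry has an integer feature and a list of thresholds."""
    for entry in entries:
        if not isinstance(entry.get("feature"), int):
            return False
        bounds = entry.get("bin_upper_bound")
        if not isinstance(bounds, list) or not bounds:
            return False
    return True


is_valid = validate_forced_bins(forced_bins)
json_text = json.dumps(forced_bins, indent=2)
```

Saving `json_text` to a file gives a document in the format that the parameter expects.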

-  ``save_binary`` , default = ``false``, type = bool, aliases: ``is_save_binary``, ``is_save_binary_file``

   -  if ``true``, LightGBM will save the dataset (including validation data) to a binary file. This speeds up data loading the next time

   -  **Note**: ``init_score`` is not saved in binary file

   -  **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent function

Predict Parameters
------------------

-  ``num_iteration_predict`` , default = ``-1``, type = int

   -  used only in ``prediction`` task

   -  used to specify how many trained iterations will be used in prediction

   -  ``<= 0`` means no limit

-  ``predict_raw_score`` , default = ``false``, type = bool, aliases: ``is_predict_raw_score``, ``predict_rawscore``, ``raw_score``

   -  used only in ``prediction`` task

   -  set this to ``true`` to predict only the raw scores

   -  set this to ``false`` to predict transformed scores

-  ``predict_leaf_index`` , default = ``false``, type = bool, aliases: ``is_predict_leaf_index``, ``leaf_index``

   -  used only in ``prediction`` task

   -  set this to ``true`` to predict with leaf index of all trees

-  ``predict_contrib`` , default = ``false``, type = bool, aliases: ``is_predict_contrib``, ``contrib``

   -  used only in ``prediction`` task

   -  set this to ``true`` to estimate `SHAP values <https://arxiv.org/abs/1706.06060>`__, which represent how each feature contributes to each prediction

   -  produces ``#features + 1`` values where the last value is the expected value of the model output over the training data

   -  **Note**: if you want to get more explanation for your model's predictions using SHAP values, like SHAP interaction values, you can install the `shap package <https://github.com/slundberg/shap>`__

   -  **Note**: unlike the shap package, with ``predict_contrib`` we return a matrix with an extra column, where the last column is the expected value

-  ``predict_disable_shape_check`` , default = ``false``, type = bool

   -  used only in ``prediction`` task

   -  control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

   -  if ``false`` (the default), a fatal error will be raised if the number of features in the dataset you predict on differs from the number seen during training

   -  if ``true``, LightGBM will attempt to predict on whatever data you provide. This is dangerous because you might get incorrect predictions, but you could use it in situations where it is difficult or expensive to generate some features and you are very confident that they were never chosen for splits in the model

   -  **Note**: be very careful setting this parameter to ``true``

-  ``pred_early_stop`` , default = ``false``, type = bool

   -  used only in ``prediction`` task

   -  if ``true``, will use early-stopping to speed up the prediction. May affect the accuracy

-  ``pred_early_stop_freq`` , default = ``10``, type = int

   -  used only in ``prediction`` task

   -  the frequency of checking early-stopping prediction

-  ``pred_early_stop_margin`` , default = ``10.0``, type = double

   -  used only in ``prediction`` task

   -  the threshold of margin in early-stopping prediction

-  ``output_result`` , default = ``LightGBM_predict_result.txt``, type = string, aliases: ``predict_result``, ``prediction_result``, ``predict_name``, ``prediction_name``, ``pred_name``, ``name_pred``

   -  used only in ``prediction`` task

   -  filename of prediction result

   -  **Note**: can be used only in CLI version
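
To make the ``predict_contrib`` output layout concrete, here is a small sketch in plain Python. The matrix is hand-written, hypothetical data standing in for a real prediction result (2 rows, 3 features), and the helper name is an assumption for illustration. It relies on the additive property of SHAP values: the per-feature contributions plus the expected value add up to the raw score.

```python
# Hypothetical predict_contrib output for 2 rows and 3 features:
# each row holds 3 per-feature contributions followed by the expected value.
contrib = [
    [0.20, -0.05, 0.10, 0.50],
    [-0.30, 0.00, 0.15, 0.50],
]

def split_contrib(row):
    """Separate per-feature contributions from the trailing expected value."""
    return row[:-1], row[-1]

raw_scores = []
for row in contrib:
    feature_contribs, expected_value = split_contrib(row)
    # SHAP values are additive: contributions + expected value = raw score.
    raw_scores.append(sum(feature_contribs) + expected_value)
```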

Convert Parameters
------------------

-  ``convert_model_language`` , default = ``""``, type = string

   -  used only in ``convert_model`` task

   -  only ``cpp`` is supported so far; to convert a model to other languages, consider using the `m2cgen <https://github.com/BayesWitnesses/m2cgen>`__ utility

   -  if ``convert_model_language`` is set and ``task=train``, the model will also be converted

   -  **Note**: can be used only in CLI version

-  ``convert_model`` , default = ``gbdt_prediction.cpp``, type = string, aliases: ``convert_model_file``

   -  used only in ``convert_model`` task

   -  output filename of converted model

   -  **Note**: can be used only in CLI version

Objective Parameters
--------------------

-  ``objective_seed`` , default = ``5``, type = int

   -  used only in ``rank_xendcg`` objective

   -  random seed for objectives, if random process is needed

-  ``num_class`` , default = ``1``, type = int, aliases: ``num_classes``, constraints: ``num_class > 0``

   -  used only in ``multi-class`` classification application

-  ``is_unbalance`` , default = ``false``, type = bool, aliases: ``unbalance``, ``unbalanced_sets``

   -  used only in ``binary`` and ``multiclassova`` applications

   -  set this to ``true`` if training data are unbalanced

   -  **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities

   -  **Note**: this parameter cannot be used at the same time with ``scale_pos_weight``, choose only **one** of them

-  ``scale_pos_weight`` , default = ``1.0``, type = double, constraints: ``scale_pos_weight > 0.0``

   -  used only in ``binary`` and ``multiclassova`` applications

   -  weight of labels with positive class

   -  **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities

   -  **Note**: this parameter cannot be used at the same time with ``is_unbalance``, choose only **one** of them
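
The text above does not say which value to pick for ``scale_pos_weight``. A common heuristic (an assumption here, not something this document prescribes) is the ratio of negative to positive samples. A minimal sketch with made-up labels:

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive samples (a common heuristic, not a rule)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos

# Hypothetical unbalanced binary labels: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Set only scale_pos_weight here; is_unbalance must not be set as well.
params = {
    "objective": "binary",
    "scale_pos_weight": suggested_scale_pos_weight(labels),
}
```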

-  ``sigmoid`` , default = ``1.0``, type = double, constraints: ``sigmoid > 0.0``

   -  used only in ``binary`` and ``multiclassova`` classification and in ``lambdarank`` applications

   -  parameter for the sigmoid function
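
As a rough sketch of what this parameter does, assume the binary objective maps a raw score to a probability with a logistic function whose slope is scaled by ``sigmoid``. The function name below is hypothetical, and the formula reflects my understanding rather than a quotation from the source.

```python
import math

def raw_to_probability(raw_score, sigmoid=1.0):
    """Logistic transform of a raw score; `sigmoid` scales the slope."""
    return 1.0 / (1.0 + math.exp(-sigmoid * raw_score))

# A raw score of 0 always maps to 0.5; a larger `sigmoid` sharpens the curve.
p_default = raw_to_probability(1.0)
p_sharper = raw_to_probability(1.0, sigmoid=2.0)
```

Under this assumption, it is also the difference between the raw scores and the transformed scores described for ``predict_raw_score``.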

-  ``boost_from_average`` , default = ``true``, type = bool

   -  used only in ``regression``, ``binary``, ``multiclassova`` and ``cross-entropy`` applications

   -  adjusts initial score to the mean of labels for faster convergence

-  ``reg_sqrt`` , default = ``false``, type = bool

   -  used only in ``regression`` application

   -  used to fit ``sqrt(label)`` instead of the original values; the prediction result will also be automatically converted to ``prediction^2``

   -  might be useful in case of large-range labels
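
The effect of the square-root transform on large-range labels can be sketched as follows. This is only an illustration with made-up, non-negative labels. The helper names are hypothetical, and LightGBM does the conversion internally when ``reg_sqrt`` is enabled.

```python
import math

def to_training_target(label):
    """Target fitted in place of the original label (non-negative labels assumed)."""
    return math.sqrt(label)

def to_final_prediction(model_output):
    """Convert a model output on the sqrt scale back to the original scale."""
    return model_output ** 2

labels = [1.0, 100.0, 10000.0]                        # range spans four orders of magnitude
targets = [to_training_target(y) for y in labels]     # compressed to 1, 10, 100
restored = [to_final_prediction(t) for t in targets]  # back on the original scale
```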

-  ``alpha`` , default = ``0.9``, type = double, constraints: ``alpha > 0.0``

   -  used only in ``huber`` and ``quantile`` ``regression`` applications

   -  parameter for `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__ and `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
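
For reference, here is a sketch of the two textbook losses that ``alpha`` parameterizes, following the definitions on the linked pages. The function names are hypothetical, and the per-sample form is a simplification of what a training objective actually optimizes.

```python
def huber_loss(y_true, y_pred, alpha=0.9):
    """Quadratic for small residuals, linear beyond the threshold alpha."""
    diff = abs(y_true - y_pred)
    if diff <= alpha:
        return 0.5 * diff * diff
    return alpha * (diff - 0.5 * alpha)

def quantile_loss(y_true, y_pred, alpha=0.9):
    """Pinball loss: under-prediction costs alpha, over-prediction 1 - alpha."""
    diff = y_true - y_pred
    if diff >= 0:
        return alpha * diff
    return (alpha - 1.0) * diff

# With alpha = 0.9, predicting too low costs about 9x more than too high.
under = quantile_loss(1.0, 0.0)
over = quantile_loss(0.0, 1.0)
```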

-  ``fair_c`` , default = ``1.0``, type = double, constraints: ``fair_c > 0.0``

   -  used only in ``fair`` ``regression`` application

   -  parameter for `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__

-  ``poisson_max_delta_step`` , default = ``0.7``, type = double, constraints: ``poisson_max_delta_step > 0.0``

   -  used only in ``poisson`` ``regression`` application

   -  parameter for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__ to safeguard optimization

-  ``tweedie_variance_power`` , default = ``1.5``, type = double, constraints: ``1.0 <= tweedie_variance_power < 2.0``

   -  used only in ``tweedie`` ``regression`` application

   -  used to control the variance of the tweedie distribution

   -  set this closer to ``2`` to shift towards a **Gamma** distribution

   -  set this closer to ``1`` to shift towards a **Poisson** distribution

-  ``lambdarank_truncation_level`` , default = ``20``, type = int, constraints: ``lambdarank_truncation_level > 0``

   -  used only in ``lambdarank`` application

   -  used for truncating the max DCG, refer to "truncation level" in the Sec. 3 of `LambdaMART paper <https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf>`__

-  ``lambdarank_norm`` , default = ``true``, type = bool

   -  used only in ``lambdarank`` application

   -  set this to ``true`` to normalize the lambdas for different queries, and improve the performance for unbalanced data

   -  set this to ``false`` to enforce the original lambdarank algorithm

-  ``label_gain`` , default = ``0,1,3,7,15,31,63,...,2^30-1``, type = multi-double

   -  used only in ``lambdarank`` application

   -  relevant gain for labels. For example, the gain of label ``2`` is ``3`` in case of default label gains

   -  separate by ``,``
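
The default gains follow the pattern ``2^i - 1`` for label ``i``. The short sketch below rebuilds that list and formats a custom, truncated version as the comma-separated string this parameter expects. Truncating to five labels is an arbitrary choice made for the example.

```python
# Default gains: label i has gain 2^i - 1, for i = 0 .. 30.
default_label_gain = [2 ** i - 1 for i in range(31)]

# Example: the gain of label 2 under the default gains.
gain_of_label_2 = default_label_gain[2]

# A custom setting covering labels 0..4 only, separated by ",".
label_gain_param = ",".join(str(g) for g in default_label_gain[:5])
```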

Metric Parameters
-----------------

-  ``metric`` , default = ``""``, type = multi-enum, aliases: ``metrics``, ``metric_types``

   -  metric(s) to be evaluated on the evaluation set(s)

      -  ``""`` (empty string or not specified) means that metric corresponding to specified ``objective`` will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added)

      -  ``"None"`` (string, **not** a ``None`` value) means that no metric will be registered, aliases: ``na``, ``null``, ``custom``

      -  ``l1``, absolute loss, aliases: ``mean_absolute_error``, ``mae``, ``regression_l1``

      -  ``l2``, square loss, aliases: ``mean_squared_error``, ``mse``, ``regression_l2``, ``regression``

      -  ``rmse``, root square loss, aliases: ``root_mean_squared_error``, ``l2_root``

      -  ``quantile``, `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__

      -  ``mape``, `MAPE loss <https://en.wikipedia.org/wiki/Mean_absolute_percentage_error>`__, aliases: ``mean_absolute_percentage_error``

      -  ``huber``, `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__

      -  ``fair``, `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__

      -  ``poisson``, negative log-likelihood for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__

      -  ``gamma``, negative log-likelihood for **Gamma** regression

      -  ``gamma_deviance``, residual deviance for **Gamma** regression

      -  ``tweedie``, negative log-likelihood for **Tweedie** regression

      -  ``ndcg``, `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__, aliases: ``lambdarank``, ``rank_xendcg``, ``xendcg``, ``xe_ndcg``, ``xe_ndcg_mart``, ``xendcg_mart``

      -  ``map``, `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__, aliases: ``mean_average_precision``

      -  ``auc``, `AUC <https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve>`__

      -  ``binary_logloss``, `log loss <https://en.wikipedia.org/wiki/Cross_entropy>`__, aliases: ``binary``

      -  ``binary_error``, for one sample: ``0`` for correct classification, ``1`` for incorrect classification

      -  ``auc_mu``, `AUC-mu <http://proceedings.mlr.press/v97/kleiman19a/kleiman19a.pdf>`__

      -  ``multi_logloss``, log loss for multi-class classification, aliases: ``multiclass``, ``softmax``, ``multiclassova``, ``multiclass_ova``, ``ova``, ``ovr``

      -  ``multi_error``, error rate for multi-class classification

      -  ``cross_entropy``, cross-entropy (with optional linear weights), aliases: ``xentropy``

      -  ``cross_entropy_lambda``, "intensity-weighted" cross-entropy, aliases: ``xentlambda``

      -  ``kullback_leibler``, `Kullback-Leibler divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`__, aliases: ``kldiv``

   -  support multiple metrics, separated by ``,``

-  ``metric_freq`` , default = ``1``, type = int, aliases: ``output_freq``, constraints: ``metric_freq > 0``

   -  frequency for metric output

   -  **Note**: can be used only in CLI version

-  ``is_provide_training_metric`` , default = ``false``, type = bool, aliases: ``training_metric``, ``is_training_metric``, ``train_metric``

   -  set this to ``true`` to output metric result over training dataset

   -  **Note**: can be used only in CLI version

-  ``eval_at`` , default = ``1,2,3,4,5``, type = multi-int, aliases: ``ndcg_eval_at``, ``ndcg_at``, ``map_eval_at``, ``map_at``

   -  used only with ``ndcg`` and ``map`` metrics

   -  `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__ and `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__ evaluation positions, separated by ``,``
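
To illustrate what an evaluation position means, here is a sketch of NDCG at position ``k`` for a single query, using the standard definition with ``2^label - 1`` gains and a ``log2`` position discount. The function names and the relevance labels are made up. Returning ``1.0`` when the ideal DCG is zero is a choice made for this sketch.

```python
import math

def dcg_at_k(ranked_labels, k, label_gain):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(
        label_gain[label] / math.log2(position + 2)
        for position, label in enumerate(ranked_labels[:k])
    )

def ndcg_at_k(ranked_labels, k, label_gain):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = sorted(ranked_labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k, label_gain)
    if ideal_dcg == 0:
        return 1.0  # no relevant items; treated as perfect in this sketch
    return dcg_at_k(ranked_labels, k, label_gain) / ideal_dcg

label_gain = [2 ** i - 1 for i in range(31)]
eval_at = [1, 2, 3]

# Hypothetical relevance labels of one query's items, ordered by model score.
ranked = [0, 1, 2]
ndcg = {k: ndcg_at_k(ranked, k, label_gain) for k in eval_at}
```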

-  ``multi_error_top_k`` , default = ``1``, type = int, constraints: ``multi_error_top_k > 0``

   -  used only with ``multi_error`` metric

   -  threshold for top-k multi-error metric

   -  the error on each sample is ``0`` if the true class is among the top ``multi_error_top_k`` predictions, and ``1`` otherwise

      -  more precisely, the error on a sample is ``0`` if there are at least ``num_classes - multi_error_top_k`` predictions strictly less than the prediction on the true class

   -  when ``multi_error_top_k=1`` this is equivalent to the usual multi-error metric
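
The precise rule above translates almost directly into code. The sketch below is a plain-Python restatement for a single sample, with a hypothetical function name and made-up scores. Note how a tie with the true class counts as an error for ``k = 1``, because tied predictions are not strictly lower.

```python
def top_k_error(scores, true_class, k=1):
    """0 if at least num_classes - k scores are strictly below the true class's score."""
    num_classes = len(scores)
    strictly_lower = sum(1 for s in scores if s < scores[true_class])
    return 0 if strictly_lower >= num_classes - k else 1

scores = [0.1, 0.6, 0.3]  # hypothetical per-class predictions for one sample

err_top1 = top_k_error(scores, true_class=2, k=1)   # class 2 is ranked second
err_top2 = top_k_error(scores, true_class=2, k=2)   # within the top 2
err_tie = top_k_error([0.5, 0.5], true_class=0, k=1)
```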

-  ``auc_mu_weights`` , default = ``None``, type = multi-double

   -  used only with ``auc_mu`` metric

   -  list representing flattened matrix (in row-major order) giving loss weights for classification errors

   -  list should have ``n * n`` elements, where ``n`` is the number of classes

   -  the matrix co-ordinate ``[i, j]`` should correspond to the ``i * n + j``-th element of the list

   -  if not specified, will use equal weights for all classes
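
A small sketch of the row-major flattening described above. The 3-class weight matrix is a made-up example, and the helper name is hypothetical.

```python
def flatten_row_major(matrix):
    """Flatten an n x n matrix so that element [i][j] lands at index i * n + j."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("auc_mu_weights needs a square n x n matrix")
    return [value for row in matrix for value in row]

# Hypothetical loss weights for 3 classes (zero on the diagonal).
weights = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
n = len(weights)
auc_mu_weights = flatten_row_major(weights)
```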

Network Parameters
------------------

-  ``num_machines`` , default = ``1``, type = int, aliases: ``num_machine``, constraints: ``num_machines > 0``

   -  the number of machines for parallel learning application

   -  this parameter needs to be set in both **socket** and **mpi** versions

-  ``local_listen_port`` , default = ``12400``, type = int, aliases: ``local_port``, ``port``, constraints: ``local_listen_port > 0``

   -  TCP listen port for local machines

   -  **Note**: don't forget to allow this port in firewall settings before training

-  ``time_out`` , default = ``120``, type = int, constraints: ``time_out > 0``

   -  socket time-out in minutes

-  ``machine_list_filename`` , default = ``""``, type = string, aliases: ``machine_list_file``, ``machine_list``, ``mlist``

   -  path of file that lists machines for this parallel learning application

   -  each line contains one IP and one port for one machine. The format is ``ip port`` (space as a separator)

-  ``machines`` , default = ``""``, type = string, aliases: ``workers``, ``nodes``

   -  list of machines in the following format: ``ip1:port1,ip2:port2``
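
The two ways of listing machines differ only in format. As an illustration, this sketch converts the contents of a machine list file (``ip port`` per line) into the string expected by ``machines``. The IP addresses are made up and the helper name is hypothetical.

```python
def machine_list_to_machines(text):
    """Turn 'ip port' lines into the 'ip1:port1,ip2:port2' format."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ip, port = line.split()
        entries.append(f"{ip}:{port}")
    return ",".join(entries)

# Hypothetical contents of a machine list file.
mlist_text = "192.168.0.1 12400\n192.168.0.2 12400\n"
machines = machine_list_to_machines(mlist_text)
```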

GPU Parameters
--------------

-  ``gpu_platform_id`` , default = ``-1``, type = int

   -  OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform

   -  ``-1`` means the system-wide default platform

   -  **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details

-  ``gpu_device_id`` , default = ``-1``, type = int

   -  OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID

   -  ``-1`` means the default device in the selected platform

   -  **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details

-  ``gpu_use_dp`` , default = ``false``, type = bool

   -  set this to ``true`` to use double precision math on GPU (by default single precision is used)

.. end params list

Others
------

Continued Training with Input Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following:

::

    0.5
    -0.1
    0.9
    ...

It means the initial score of the first data row is ``0.5``, second is ``-0.1``, and so on.
The initial score file corresponds with the data file line by line, and has one score per line.

If the name of the data file is ``train.txt``, the initial score file should be named ``train.txt.init`` and placed in the same folder as the data file.
In this case, LightGBM will load the initial score file automatically if it exists.
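
A minimal sketch of producing such a file from Python. The helper name is hypothetical and a temporary directory stands in for the real data folder. In practice ``train.txt`` would already exist next to the generated file.

```python
import os
import tempfile

def write_init_scores(data_path, scores):
    """Write one score per line to '<data_path>.init' and return that path."""
    init_path = data_path + ".init"
    with open(init_path, "w") as f:
        for score in scores:
            f.write(f"{score}\n")
    return init_path

# Hypothetical location of the data file.
data_path = os.path.join(tempfile.mkdtemp(), "train.txt")
init_path = write_init_scores(data_path, [0.5, -0.1, 0.9])
```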

Weight Data
~~~~~~~~~~~

LightGBM supports weighted training. It uses an additional file to store weight data, like the following:

::

    1.0
    0.5
    0.8
    ...

It means the weight of the first data row is ``1.0``, second is ``0.5``, and so on.
The weight file corresponds with the data file line by line, and has one weight per line.

And if the name of data file is ``train.txt``, the weight file should be named as ``train.txt.weight`` and placed in the same folder as the data file.
In this case, LightGBM will load the weight file automatically if it exists.

You can also include a weight column in your data file. Please refer to the ``weight_column`` `parameter <#weight_column>`__ above.

Query Data
~~~~~~~~~~

Learning to rank needs query information for the training data.
LightGBM uses an additional file to store query data, like the following:

::

    27
    18
    67
    ...

It means the first ``27`` lines (samples) belong to one query, the next ``18`` lines belong to another, and so on.

**Note**: data should be ordered by the query.

If the name of data file is ``train.txt``, the query file should be named as ``train.txt.query`` and placed in the same folder as the data file.
In this case, LightGBM will load the query file automatically if it exists.

You can also include a query/group id column in your data file. Please refer to the ``group_column`` `parameter <#group_column>`__ above.
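
If the data comes with one query id per row, the per-query counts for the query file can be derived as sketched below. The query ids are made up and the helper name is hypothetical. ``itertools.groupby`` counts consecutive runs, so it relies on the rows already being ordered by query, as the note above requires.

```python
from itertools import groupby

def query_sizes(query_ids):
    """Number of consecutive rows per query, in order of appearance."""
    return [len(list(group)) for _, group in groupby(query_ids)]

# Hypothetical per-row query ids, already ordered by query.
query_ids = ["q1", "q1", "q1", "q2", "q2", "q3"]
sizes = query_sizes(query_ids)

# Contents of the query file: one count per line.
query_file_text = "\n".join(str(size) for size in sizes) + "\n"
```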

.. _Laurae++ Interactive Documentation: https://sites.google.com/view/lauraepp/parameters
