I will explain all of LightGBM's parameters in broad strokes. Since there is a lot of material, I will translate it gradually over several days, and I will cover details in separate articles from time to time. If I have made a mistake, I would appreciate you pointing it out. The official LightGBM GitHub is here.
The basic description format is: default = default value, type = type, options = options, constraints = constraints.
- ``config`` , default = ``""``, type = string, aliases: ``config_file``
- path of config file
- ``task`` , default = ``train``, type = enum, options: ``train``, ``predict``, ``convert_model``, ``refit``, aliases: ``task_type``
- ``train``, for training, aliases: ``training``
- ``predict``, for prediction, aliases: ``prediction``, ``test``
- ``convert_model``, converts the model file into if-else format; see Convert Parameters for more information
- ``refit``, refits existing models with new data, aliases: ``refit_tree``
- **Note**: can be used only in CLI version; for language-specific packages, the supported features are exposed through the corresponding functions
- ``objective`` , default = ``regression``, type = enum, options: ``regression``, ``regression_l1``, ``huber``, ``fair``, ``poisson``, ``quantile``, ``mape``, ``gamma``, ``tweedie``, ``binary``, ``multiclass``, ``multiclassova``, ``cross_entropy``, ``cross_entropy_lambda``, ``lambdarank``, ``rank_xendcg``, aliases: ``objective_type``, ``app``, ``application``
- regression application
- ``regression``, L2 loss, aliases: ``regression_l2``, ``l2``, ``mean_squared_error``, ``mse``, ``l2_root``, ``root_mean_squared_error``, ``rmse``
- ``regression_l1``, L1 loss, aliases: ``l1``, ``mean_absolute_error``, ``mae``
- ``huber``, `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__
- ``fair``, `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson``, `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__
- ``quantile``, `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``mape``, `MAPE loss <https://en.wikipedia.org/wiki/Mean_absolute_percentage_error>`__, aliases: ``mean_absolute_percentage_error``
- ``gamma``, Gamma regression with log-link. Example use cases: modeling insurance claim severity, or any other target that may be `gamma-distributed <https://en.wikipedia.org/wiki/Gamma_distribution#Occurrence_and_applications>`__
- ``tweedie``, Tweedie regression with log-link. Example use cases: modeling total loss in insurance, or any other target that may be `tweedie-distributed <https://en.wikipedia.org/wiki/Tweedie_distribution#Occurrence_and_applications>`__
- binary classification application
- ``binary``, binary log loss classification (or logistic regression)
- labels must be 0 or 1; see `cross entropy <https://en.wikipedia.org/wiki/Cross_entropy>`__ for general probability labels in [0, 1]
- multi-class classification application
- ``multiclass``, softmax objective function, aliases: ``softmax``
- ``multiclassova``, One-vs-All binary objective function, aliases: ``multiclass_ova``, ``ova``, ``ovr``
- ``num_class`` should be set as well
- cross-entropy application
- ``cross_entropy``, objective function for cross-entropy (with optional linear weights), aliases: ``xentropy``
- ``cross_entropy_lambda``, alternative parameterization of cross-entropy, aliases: ``xentlambda``
- label is anything in interval [0, 1]
- ranking application
- ``lambdarank``, lambdarank objective. ``label_gain`` can be used to set the gain (weight) of integer labels, and every label value must be smaller than the number of elements in ``label_gain``
- ``rank_xendcg``, XE_NDCG_MART ranking objective function, aliases: ``xendcg``, ``xe_ndcg``, ``xe_ndcg_mart``, ``xendcg_mart``
- ``rank_xendcg`` is fast and behaves similarly to ``lambdarank``
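As a quick illustration of how an objective from the list above is selected, here is a minimal sketch using the Python package (the toy data and parameter values are my own, not from the official docs):

::

    import lightgbm as lgb
    import numpy as np

    # Random toy data: 500 rows, 10 features, 3 classes (illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = rng.integers(0, 3, size=500)

    # "objective" selects the loss from the options listed above;
    # "multiclass" additionally requires "num_class".
    params = {"objective": "multiclass", "num_class": 3}
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
    print(booster.predict(X[:5]).shape)  # (5, 3): one probability per class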
- ``boosting`` , default = ``gbdt``, type = enum, options: ``gbdt``, ``rf``, ``dart``, ``goss``, aliases: ``boosting_type``, ``boost``
- ``gbdt``, traditional gradient boosting decision tree, aliases: ``gbrt``
- ``rf``, random forest, aliases: ``random_forest``
- ``dart``, Dropouts meet Multiple Additive Regression Trees
- ``goss``, Gradient-based One-Side Sampling
- ``data`` , default = ``""``, type = string, aliases: ``train``, ``train_data``, ``train_data_file``, ``data_filename``
- path of training data; LightGBM will train using this data
- **Note**: can be used only in CLI version
- ``valid`` , default = ``""``, type = string, aliases: ``test``, ``valid_data``, ``valid_data_file``, ``test_data``, ``test_data_file``, ``valid_filenames``
- path(s) of validation/test data; LightGBM will output metrics for these data
- multiple validation datasets can be used, separated by ``,``
- **Note**: can be used only in CLI version
- ``num_iterations`` , default = ``100``, type = int, aliases: ``num_iteration``, ``n_iter``, ``num_tree``, ``num_trees``, ``num_round``, ``num_rounds``, ``num_boost_round``, ``n_estimators``, constraints: ``num_iterations >= 0``
- number of boosting iterations
- **Note**: internally, LightGBM constructs ``num_class * num_iterations`` trees for multi-class classification problems
- ``learning_rate`` , default = ``0.1``, type = double, aliases: ``shrinkage_rate``, ``eta``, constraints: ``learning_rate > 0.0``
- shrinkage rate
- in ``dart``, it also affects the normalization weights of dropped trees
- ``num_leaves`` , default = ``31``, type = int, aliases: ``num_leaf``, ``max_leaves``, ``max_leaf``, constraints: ``1 < num_leaves <= 131072``
- max number of leaves in one tree
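These three core parameters interact: more leaves and more iterations fit the training data more closely, while a smaller learning rate usually needs more iterations to compensate. A hedged sketch with toy data and illustrative values:

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)  # noisy linear target

    params = {
        "objective": "regression",
        "learning_rate": 0.05,  # smaller shrinkage ...
        "num_leaves": 31,       # max leaves per tree (the default)
    }
    # num_boost_round is the Python-package counterpart of num_iterations.
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)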
- ``tree_learner`` , default = ``serial``, type = enum, options: ``serial``, ``feature``, ``data``, ``voting``, aliases: ``tree``, ``tree_type``, ``tree_learner_type``
- specifies how trees are learned
- ``serial``, single machine tree learner
- ``feature``, feature parallel tree learner, aliases: ``feature_parallel``
- ``data``, data parallel tree learner, aliases: ``data_parallel``
- ``voting``, voting parallel tree learner, aliases: ``voting_parallel``
- please refer to the Parallel Learning Guide for details
- ``num_threads`` , default = ``0``, type = int, aliases: ``num_thread``, ``nthread``, ``nthreads``, ``n_jobs``
- number of threads used by LightGBM
- ``0`` means the default number of threads in OpenMP
- for best speed, set this to the number of **real CPU cores**, not the number of threads, so be careful (most CPUs use hyper-threading to generate 2 threads per CPU core)
- do not set it large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
- Task Manager and other CPU monitoring tools may show that not all cores are used. **This is normal**
- for parallel learning, do not use all CPU cores, so that network communication performance does not degrade
- **Note**: please **do not change this parameter during training**, especially when running multiple jobs simultaneously through external packages; unexpected errors may occur
- ``device_type`` , default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, aliases: ``device``
- specifies the device used for tree learning; you can speed up training by using a GPU
- **Note**: you can get a further speed-up by using a smaller ``max_bin`` (e.g. 63)
- **Note**: by default, the GPU accumulates in 32-bit floating point for speed. This may affect the accuracy of some tasks; it can be switched to 64-bit floating point by setting ``gpu_use_dp=true``, at the cost of slower training
- **Note**: if you want to use the GPU with LightGBM, please refer to the Installation Guide
- ``seed`` , default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
- this seed generates the other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
- by default, this seed is unused, in favor of the default values of the other seeds
- this seed has lower priority than the other seeds; that is, it will be overridden if you explicitly specify another seed
- ``force_col_wise`` , default = ``false``, type = bool
- can be used only with ``cpu`` device
- it is recommended to set this parameter in the following cases:
- ``num_threads`` is large, e.g. ``> 20``
- you want to reduce memory cost
- **Note**: when both ``force_col_wise`` and ``force_row_wise`` are ``false``, LightGBM first tries both and then uses the faster one. To remove that testing overhead, manually set the faster one to ``true``
- **Note**: cannot be used together with ``force_row_wise``; please choose only one of the two
- ``force_row_wise`` , default = ``false``, type = bool
- can be used only with ``cpu`` device
- it is recommended to set this parameter in the following cases:
- the number of data points is large, or the number of bins is relatively small
- ``num_threads`` is relatively small, e.g. ``<= 16``
- you want to speed up training with a small ``bagging_fraction`` or with ``goss``
- **Note**: setting this to ``true`` doubles the memory usage for the dataset. If you do not have enough memory, use ``force_col_wise=true``
- **Note**: when both ``force_col_wise`` and ``force_row_wise`` are ``false``, LightGBM first tries both and then uses the faster one. To remove that testing overhead, manually set the faster one to ``true``
- **Note**: cannot be used together with ``force_col_wise``; please choose only one of the two
- ``histogram_pool_size`` , default = ``-1.0``, type = double, aliases: ``hist_pool_size``
- max cache size (in MB) for historical histograms
- ``< 0`` means no limit
- ``max_depth`` , default = ``-1``, type = int
- limits the max depth of the tree model. This is used to deal with over-fitting when the amount of data is small. The tree still grows leaf-wise
- ``<= 0`` means no limit
- ``min_data_in_leaf`` , default = ``20``, type = int, aliases: ``min_data_per_leaf``, ``min_data``, ``min_child_samples``, constraints: ``min_data_in_leaf >= 0``
- minimal number of data in one leaf. Used to deal with over-fitting
- ``min_sum_hessian_in_leaf`` , default = ``1e-3``, type = double, aliases: ``min_sum_hessian_per_leaf``, ``min_sum_hessian``, ``min_hessian``, ``min_child_weight``, constraints: ``min_sum_hessian_in_leaf >= 0.0``
- minimal sum of the Hessian in one leaf. Like ``min_data_in_leaf``, used to deal with over-fitting
- ``bagging_fraction`` , default = ``1.0``, type = double, aliases: ``sub_row``, ``subsample``, ``bagging``, constraints: ``0.0 < bagging_fraction <= 1.0``
- like ``feature_fraction``, but this randomly selects a subset of the data, without resampling
- used to speed up training
- used to deal with over-fitting
- **Note**: ``bagging_freq`` must also be set to a non-zero value for bagging to take effect
- ``pos_bagging_fraction`` , default = ``1.0``, type = double, aliases: ``pos_sub_row``, ``pos_subsample``, ``pos_bagging``, constraints: ``0.0 < pos_bagging_fraction <= 1.0``
- used only in ``binary`` application
- should be used together with ``neg_bagging_fraction``
- setting this to ``1.0`` disables it
- **Note**: you must also set ``bagging_freq`` and ``neg_bagging_fraction`` for it to take effect
- **Note**: if both ``pos_bagging_fraction`` and ``neg_bagging_fraction`` are ``1.0``, balanced bagging is disabled
- **Note**: ``bagging_fraction`` is ignored if balanced bagging is enabled
- ``neg_bagging_fraction`` , default = ``1.0``, type = double, aliases: ``neg_sub_row``, ``neg_subsample``, ``neg_bagging``, constraints: ``0.0 < neg_bagging_fraction <= 1.0``
- used only in ``binary`` application
- should be used together with ``pos_bagging_fraction``
- setting this to ``1.0`` disables it
- **Note**: you must also set ``bagging_freq`` and ``pos_bagging_fraction`` for it to take effect
- **Note**: if both ``pos_bagging_fraction`` and ``neg_bagging_fraction`` are ``1.0``, balanced bagging is disabled
- **Note**: ``bagging_fraction`` is ignored if balanced bagging is enabled
- ``bagging_freq`` , default = ``0``, type = int, aliases: ``subsample_freq``
- frequency for bagging
- ``0`` means no bagging; ``k`` means bagging is performed once every ``k`` iterations
- **Note**: the value of ``bagging_fraction`` must also be smaller than ``1.0`` for bagging to take effect
- ``bagging_seed`` , default = ``3``, type = int, aliases: ``bagging_fraction_seed``
- random seed for bagging
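For example, to actually enable bagging both ``bagging_fraction`` and ``bagging_freq`` must be set, as the notes above say. A minimal sketch (values are illustrative) that can be passed to ``lgb.train`` like the earlier examples:

::

    params = {
        "objective": "binary",
        "bagging_fraction": 0.8,  # use 80% of the rows, without resampling
        "bagging_freq": 5,        # re-draw the bag every 5 iterations
        "bagging_seed": 3,        # the default seed, shown for completeness
    }
    # bagging_fraction alone has no effect: bagging_freq must be non-zero.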
- ``feature_fraction`` , default = ``1.0``, type = double, aliases: ``sub_feature``, ``colsample_bytree``, constraints: ``0.0 < feature_fraction <= 1.0``
- if ``feature_fraction`` is smaller than ``1.0``, LightGBM will randomly select a subset of the features each time. For example, with ``0.8``, LightGBM selects 80% of the features before training each tree
- can be used to speed up training
- can be used to deal with over-fitting
- ``feature_fraction_bynode`` , default = ``1.0``, type = double, aliases: ``sub_feature_bynode``, ``colsample_bynode``, constraints: ``0.0 < feature_fraction_bynode <= 1.0``
- if ``feature_fraction_bynode`` is smaller than ``1.0``, LightGBM will randomly select a subset of the features at each tree node. For example, with ``0.8``, LightGBM selects 80% of the features at each tree node
- can be used to deal with over-fitting
- **Note**: unlike ``feature_fraction``, this does not speed up training
- **Note**: if both ``feature_fraction`` and ``feature_fraction_bynode`` are smaller than ``1.0``, the final fraction at each node is ``feature_fraction * feature_fraction_bynode``
- ``feature_fraction_seed`` , default = ``2``, type = int
- random seed for ``feature_fraction``
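The two column-sampling parameters compose multiplicatively, as noted above. A sketch:

::

    params = {
        "feature_fraction": 0.8,         # 80% of features per tree
        "feature_fraction_bynode": 0.5,  # of those, 50% considered per node
        # effective fraction per node: 0.8 * 0.5 = 0.4 of all features
        "feature_fraction_seed": 2,
    }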
- ``extra_trees`` , default = ``false``, type = bool
- used for extremely randomized trees
- if ``true``, LightGBM will check only one randomly-chosen threshold for each feature when evaluating node splits
- used to deal with over-fitting
- ``extra_seed`` , default = ``6``, type = int
- random seed used for selecting thresholds when ``extra_trees`` is ``true``
- ``early_stopping_round`` , default = ``0``, type = int, aliases: ``early_stopping_rounds``, ``early_stopping``, ``n_iter_no_change``
- stop training if performance has not improved in the last ``early_stopping_round`` rounds
- ``<= 0`` means disabled
- ``first_metric_only`` , default = ``false``, type = bool
- set this to ``true`` if you want to use only the first metric for early stopping
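In the Python package, the same behavior can be requested through the parameter dict as long as a validation set is passed to ``lgb.train`` (depending on the package version, it may also be passed as an argument or callback instead). A hedged sketch on toy data; with random labels, early stopping will simply trigger quickly:

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1200, 8))
    y = rng.integers(0, 2, size=1200)
    train_set = lgb.Dataset(X[:1000], label=y[:1000])
    valid_set = lgb.Dataset(X[1000:], label=y[1000:], reference=train_set)

    params = {
        "objective": "binary",
        "metric": ["auc", "binary_logloss"],
        "early_stopping_round": 50,  # stop after 50 rounds without improvement
        "first_metric_only": True,   # judge improvement on "auc" only
    }
    booster = lgb.train(params, train_set, num_boost_round=1000,
                        valid_sets=[valid_set])
    print("best iteration:", booster.best_iteration)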
- ``max_delta_step`` , default = ``0.0``, type = double, aliases: ``max_tree_output``, ``max_leaf_output``
- used to limit the max output of tree leaves
- ``<= 0`` means no limit
- ``lambda_l1`` , default = ``0.0``, type = double, aliases: ``reg_alpha``, constraints: ``lambda_l1 >= 0.0``
- L1 regularization
- ``lambda_l2`` , default = ``0.0``, type = double, aliases: ``reg_lambda``, ``lambda``, constraints: ``lambda_l2 >= 0.0``
- L2 regularization
- ``min_gain_to_split`` , default = ``0.0``, type = double, aliases: ``min_split_gain``, constraints: ``min_gain_to_split >= 0.0``
- the minimal gain required to perform a split
- ``drop_rate`` , default = ``0.1``, type = double, aliases: ``rate_drop``, constraints: ``0.0 <= drop_rate <= 1.0``
- used only in ``dart``
- dropout rate: the fraction of previous trees to drop during the dropout (dropout randomly disables part of the ensemble during training, as a regularizer)
- ``max_drop`` , default = ``50``, type = int
- used only in ``dart``
- max number of dropped trees during one boosting iteration
- ``<= 0`` means no limit
- ``skip_drop`` , default = ``0.5``, type = double, constraints: ``0.0 <= skip_drop <= 1.0``
- used only in ``dart``
- probability of skipping the dropout procedure during a boosting iteration
- ``xgboost_dart_mode`` , default = ``false``, type = bool
- used only in ``dart``
- set this to ``true``, if you want to use xgboost dart mode
- ``uniform_drop`` , default = ``false``, type = bool
- used only in ``dart``
- set this to ``true``, if you want to use uniform drop
- ``drop_seed`` , default = ``4``, type = int
- used only in ``dart``
- random seed to choose dropping models
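Putting the ``dart``-only parameters together (illustrative values; these keys only matter when ``boosting`` is ``dart``):

::

    params = {
        "boosting": "dart",
        "drop_rate": 0.1,       # fraction of previous trees dropped per iteration
        "max_drop": 50,         # never drop more than 50 trees at once
        "skip_drop": 0.5,       # 50% chance of skipping the dropout entirely
        "uniform_drop": False,  # set True for uniform drop
        "drop_seed": 4,
    }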
- ``top_rate`` , default = ``0.2``, type = double, constraints: ``0.0 <= top_rate <= 1.0``
- used only in ``goss``
- the retain ratio of large gradient data
- ``other_rate`` , default = ``0.1``, type = double, constraints: ``0.0 <= other_rate <= 1.0``
- used only in ``goss``
- the retain ratio of small gradient data
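And similarly for ``goss``; as I understand it, the two retain ratios together should not exceed ``1.0``:

::

    params = {
        "boosting": "goss",
        "top_rate": 0.2,    # keep the 20% of rows with the largest gradients
        "other_rate": 0.1,  # plus a random 10% of the remaining rows
    }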
- ``min_data_per_group`` , default = ``100``, type = int, constraints: ``min_data_per_group > 0``
- minimal number of data per categorical group
- ``max_cat_threshold`` , default = ``32``, type = int, constraints: ``max_cat_threshold > 0``
- used for the categorical features
- limits the max number of threshold points in categorical features
- ``cat_l2`` , default = ``10.0``, type = double, constraints: ``cat_l2 >= 0.0``
- used for the categorical features
- L2 regularization in categorical splits
- ``cat_smooth`` , default = ``10.0``, type = double, constraints: ``cat_smooth >= 0.0``
- used for the categorical features
- this can reduce the effect of noise in categorical features, especially for categories with few data
- ``max_cat_to_onehot`` , default = ``4``, type = int, constraints: ``max_cat_to_onehot > 0``
- when the number of categories of one feature is smaller than or equal to ``max_cat_to_onehot``, the one-vs-other split algorithm will be used
- ``top_k`` , default = ``20``, type = int, aliases: ``topk``, constraints: ``top_k > 0``
- used only in ``voting`` tree learner, refer to `Voting parallel <./Parallel-Learning-Guide.rst#choose-appropriate-parallel-algorithm>`__
- set this to a larger value for more accurate results, but it will slow down training
- ``monotone_constraints`` , default = ``None``, type = multi-int, aliases: ``mc``, ``monotone_constraint``
- used for constraints of monotonic features
- ``1`` means increasing, ``-1`` means decreasing, ``0`` means non-constraint
- you need to specify all features in order. For example, ``mc=-1,0,1`` means decreasing for the 1st feature, non-constraint for the 2nd feature and increasing for the 3rd feature
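In the Python package the constraint vector can be passed as a list, one entry per feature in column order. A sketch on toy data whose monotone relationships are artificial:

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(800, 3))
    y = -1.5 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=800)

    params = {
        "objective": "regression",
        # decreasing in feature 0, unconstrained in feature 1,
        # increasing in feature 2 -- same meaning as mc=-1,0,1 above
        "monotone_constraints": [-1, 0, 1],
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)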
- ``monotone_constraints_method`` , default = ``basic``, type = string, aliases: ``monotone_constraining_method``, ``mc_method``
- used only if ``monotone_constraints`` is set
- monotone constraints method
- ``basic``, the most basic monotone constraints method. It does not slow the library at all, but over-constrains the predictions
- ``intermediate``, a `more advanced method <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>`__, which may slow the library very slightly. However, this method is much less constraining than the basic method and should significantly improve the results
- ``monotone_penalty`` , default = ``0.0``, type = double, aliases: ``monotone_splits_penalty``, ``ms_penalty``, ``mc_penalty``, constraints: ``monotone_penalty >= 0.0``
- used only if ``monotone_constraints`` is set
- `monotone penalty <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>`__: a penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree. The penalty applied to monotone splits on a given depth is a continuous, increasing function of the penalization parameter
- if ``0.0`` (the default), no penalization is applied
- ``feature_contri`` , default = ``None``, type = multi-double, aliases: ``feature_contrib``, ``fc``, ``fp``, ``feature_penalty``
- used to control a feature's split gain; ``gain[i] = max(0, feature_contri[i]) * gain[i]`` will be used to replace the split gain of the i-th feature
- you need to specify all features in order
- ``forcedsplits_filename`` , default = ``""``, type = string, aliases: ``fs``, ``forced_splits_filename``, ``forced_splits_file``, ``forced_splits``
- path to a ``.json`` file that specifies splits to force at the top of every decision tree before best-first learning commences
- the ``.json`` file can be arbitrarily nested, and each split contains ``feature`` and ``threshold`` fields, as well as ``left`` and ``right`` fields representing subsplits
- categorical splits are forced in a one-hot fashion, with ``left`` representing the split containing the feature value and ``right`` representing other values
- **Note**: the forced split logic will be ignored if the split makes the gain worse
- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification/forced_splits.json>`__ as an example
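The linked example file has roughly the shape sketched below: nested objects with ``feature``/``threshold`` fields and optional ``left``/``right`` subsplits. The feature indices and thresholds here are made up for illustration:

::

    import json

    # Hypothetical forced splits: the root splits on feature 2 at 10.5,
    # and its left child is further split on feature 5 at 0.0.
    forced = {
        "feature": 2,
        "threshold": 10.5,
        "left": {"feature": 5, "threshold": 0.0},
    }
    with open("forced_splits.json", "w") as f:
        json.dump(forced, f)

    # Then set forcedsplits_filename="forced_splits.json" in the params.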
- ``refit_decay_rate`` , default = ``0.9``, type = double, constraints: ``0.0 <= refit_decay_rate <= 1.0``
- decay rate of the ``refit`` task; ``leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output`` will be used to refit trees
- used only in the ``refit`` task in the CLI version, or as an argument of the ``refit`` function in language-specific packages
- ``cegb_tradeoff`` , default = ``1.0``, type = double, constraints: ``cegb_tradeoff >= 0.0``
- cost-effective gradient boosting multiplier for all penalties
- ``cegb_penalty_split`` , default = ``0.0``, type = double, constraints: ``cegb_penalty_split >= 0.0``
- cost-effective gradient boosting penalty for splitting a node
- ``cegb_penalty_feature_lazy`` , default = ``0,0,...,0``, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied per data point
- ``cegb_penalty_feature_coupled`` , default = ``0,0,...,0``, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied once per forest
- ``path_smooth`` , default = ``0``, type = double, constraints: ``path_smooth >= 0.0``
- controls smoothing applied to tree nodes
- helps prevent overfitting on leaves with few samples
- if set to zero, no smoothing is applied
- if ``path_smooth > 0``, then ``min_data_in_leaf`` must be at least ``2``
- larger values give stronger regularisation
- the weight of each node is ``(n / path_smooth * w + w_p) / (n / path_smooth + 1)``, where ``n`` is the number of samples in the node, ``w`` is the optimal node weight to minimise the loss (approximately ``-sum_gradients / sum_hessians``), and ``w_p`` is the weight of the parent node
- note that the parent output ``w_p`` itself has smoothing applied, unless it is the root node, so that the smoothing effect accumulates with the tree depth
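A small worked check of the smoothing formula above (written assuming the form ``(n / path_smooth * w + w_p) / (n / path_smooth + 1)``; purely illustrative numbers):

::

    def smoothed_leaf_weight(n, w, w_p, path_smooth):
        """Blend the raw leaf weight w with the parent weight w_p."""
        alpha = n / path_smooth
        return (alpha * w + w_p) / (alpha + 1)

    # Many samples: result stays close to the raw weight w.
    print(smoothed_leaf_weight(n=1000, w=0.8, w_p=0.2, path_smooth=10))  # ~0.794
    # Few samples: result is pulled towards the parent weight w_p.
    print(smoothed_leaf_weight(n=2, w=0.8, w_p=0.2, path_smooth=10))     # 0.3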
- ``verbosity`` , default = ``1``, type = int, aliases: ``verbose``
- controls the level of LightGBM's verbosity
- ``< 0``: Fatal, ``= 0``: Error (Warning), ``= 1``: Info, ``> 1``: Debug
- ``input_model`` , default = ``""``, type = string, aliases: ``model_input``, ``model_in``
- filename of input model
- for ``prediction`` task, this model will be applied to prediction data
- for ``train`` task, training will be continued from this model
- **Note**: can be used only in CLI version
- ``output_model`` , default = ``LightGBM_model.txt``, type = string, aliases: ``model_output``, ``model_out``
- filename of output model in training
- **Note**: can be used only in CLI version
- ``snapshot_freq`` , default = ``-1``, type = int, aliases: ``save_period``
- frequency of saving model file snapshots
- set this to a positive value to enable this function. For example, the model file will be snapshotted at each iteration if ``snapshot_freq=1``
- **Note**: can be used only in CLI version
Dataset Parameters
------------------
- ``max_bin`` , default = ``255``, type = int, constraints: ``max_bin > 1``
- max number of bins that feature values will be bucketed in
- small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)
- LightGBM will auto compress memory according to ``max_bin``. For example, LightGBM will use ``uint8_t`` for feature value if ``max_bin=255``
- ``max_bin_by_feature`` , default = ``None``, type = multi-int
- max number of bins for each feature
- if not specified, will use ``max_bin`` for all features
- ``min_data_in_bin`` , default = ``3``, type = int, constraints: ``min_data_in_bin > 0``
- minimal number of data inside one bin
- use this to avoid one-data-one-bin (potential over-fitting)
- ``bin_construct_sample_cnt`` , default = ``200000``, type = int, aliases: ``subsample_for_bin``, constraints: ``bin_construct_sample_cnt > 0``
- number of data that sampled to construct histogram bins
- setting this to larger value will give better training result, but will increase data loading time
- set this to larger value if data is very sparse
- ``data_random_seed`` , default = ``1``, type = int, aliases: ``data_seed``
- random seed for sampling data to construct histogram bins
- ``is_enable_sparse`` , default = ``true``, type = bool, aliases: ``is_sparse``, ``enable_sparse``, ``sparse``
- used to enable/disable sparse optimization
- ``enable_bundle`` , default = ``true``, type = bool, aliases: ``is_enable_bundle``, ``bundle``
- set this to ``false`` to disable Exclusive Feature Bundling (EFB), which is described in `LightGBM: A Highly Efficient Gradient Boosting Decision Tree <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree>`__
- **Note**: disabling this may slow down training for sparse datasets
- ``use_missing`` , default = ``true``, type = bool
- set this to ``false`` to disable the special handle of missing value
- ``zero_as_missing`` , default = ``false``, type = bool
- set this to ``true`` to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices)
- set this to ``false`` to use ``na`` for representing missing values
- ``feature_pre_filter`` , default = ``true``, type = bool
- set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf``
- as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object
- **Note**: setting this to ``false`` may slow down the training
- ``pre_partition`` , default = ``false``, type = bool, aliases: ``is_pre_partition``
- used for parallel learning (excluding the ``feature_parallel`` mode)
- ``true`` if training data are pre-partitioned, and different machines use different partitions
- ``two_round`` , default = ``false``, type = bool, aliases: ``two_round_loading``, ``use_two_round_loading``
- set this to ``true`` if data file is too big to fit in memory
- by default, LightGBM will map the data file to memory and load features from memory. This provides faster data loading, but may cause an out-of-memory error when the data file is very big
- **Note**: works only in case of loading data directly from file
- ``header`` , default = ``false``, type = bool, aliases: ``has_header``
- set this to ``true`` if input data has header
- **Note**: works only in case of loading data directly from file
- ``label_column`` , default = ``""``, type = int or string, aliases: ``label``
- used to specify the label column
- use number for index, e.g. ``label=0`` means column\_0 is the label
- add a prefix ``name:`` for column name, e.g. ``label=name:is_click``
- **Note**: works only in case of loading data directly from file
- ``weight_column`` , default = ``""``, type = int or string, aliases: ``weight``
- used to specify the weight column
- use number for index, e.g. ``weight=0`` means column\_0 is the weight
- add a prefix ``name:`` for column name, e.g. ``weight=name:weight``
- **Note**: works only in case of loading data directly from file
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0, and weight is column\_1, the correct parameter is ``weight=0``
- ``group_column`` , default = ``""``, type = int or string, aliases: ``group``, ``group_id``, ``query_column``, ``query``, ``query_id``
- used to specify the query/group id column
- use number for index, e.g. ``query=0`` means column\_0 is the query id
- add a prefix ``name:`` for column name, e.g. ``query=name:query_id``
- **Note**: works only in case of loading data directly from file
- **Note**: data should be grouped by query\_id
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
- ``ignore_column`` , default = ``""``, type = multi-int or string, aliases: ``ignore_feature``, ``blacklist``
- used to specify some ignoring columns in training
- use number for index, e.g. ``ignore_column=0,1,2`` means column\_0, column\_1 and column\_2 will be ignored
- add a prefix ``name:`` for column name, e.g. ``ignore_column=name:c1,c2,c3`` means c1, c2 and c3 will be ignored
- **Note**: works only in case of loading data directly from file
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
- **Note**: despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully
- ``categorical_feature`` , default = ``""``, type = multi-int or string, aliases: ``cat_feature``, ``categorical_column``, ``cat_column``
- used to specify categorical features
- use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features
- add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features
- **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
- **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
- **Note**: all negative values will be treated as **missing values**
- **Note**: the output cannot be monotonically constrained with respect to a categorical feature
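For instance, with integer-encoded categories in the Python package (a sketch; when data is passed as an in-memory array, the indices are plain column indices of the array):

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(600, 4))
    X[:, 0] = rng.integers(0, 8, size=600)  # integer-encoded categorical column
    y = rng.integers(0, 2, size=600)

    # Column 0 is treated as categorical: splits partition category values
    # instead of using a numeric threshold.
    train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
    booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=20)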
- ``forcedbins_filename`` , default = ``""``, type = string
- path to a ``.json`` file that specifies bin upper bounds for some or all features
- ``.json`` file should contain an array of objects, each containing the word ``feature`` (integer feature index) and ``bin_upper_bound`` (array of thresholds for binning)
- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json>`__ as an example
- ``save_binary`` , default = ``false``, type = bool, aliases: ``is_save_binary``, ``is_save_binary_file``
- if ``true``, LightGBM will save the dataset (including validation data) to a binary file. This speeds up data loading the next time
- **Note**: ``init_score`` is not saved in binary file
- **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent function
Predict Parameters
------------------
- ``num_iteration_predict`` , default = ``-1``, type = int
- used only in ``prediction`` task
- used to specify how many trained iterations will be used in prediction
- ``<= 0`` means no limit
- ``predict_raw_score`` , default = ``false``, type = bool, aliases: ``is_predict_raw_score``, ``predict_rawscore``, ``raw_score``
- used only in ``prediction`` task
- set this to ``true`` to predict only the raw scores
- set this to ``false`` to predict transformed scores
- ``predict_leaf_index`` , default = ``false``, type = bool, aliases: ``is_predict_leaf_index``, ``leaf_index``
- used only in ``prediction`` task
- set this to ``true`` to predict with leaf index of all trees
- ``predict_contrib`` , default = ``false``, type = bool, aliases: ``is_predict_contrib``, ``contrib``
- used only in ``prediction`` task
- set this to ``true`` to estimate `SHAP values <https://arxiv.org/abs/1706.06060>`__, which represent how each feature contributes to each prediction
- produces ``#features + 1`` values where the last value is the expected value of the model output over the training data
- **Note**: if you want to get more explanation for your model's predictions using SHAP values, like SHAP interaction values, you can install the `shap package <https://github.com/slundberg/shap>`__
- **Note**: unlike the shap package, with ``predict_contrib`` we return a matrix with an extra column, where the last column is the expected value
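In the Python package the same option is exposed as the ``pred_contrib`` argument of ``predict``. A sketch on toy data:

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 6))
    y = rng.integers(0, 2, size=500)
    booster = lgb.train({"objective": "binary"}, lgb.Dataset(X, label=y),
                        num_boost_round=20)

    contrib = booster.predict(X[:10], pred_contrib=True)
    print(contrib.shape)  # (10, 7): one column per feature + expected value
    # Per-row contributions sum to the raw score of each sample.
    print(np.allclose(contrib.sum(axis=1),
                      booster.predict(X[:10], raw_score=True)))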
- ``predict_disable_shape_check`` , default = ``false``, type = bool
- used only in ``prediction`` task
- control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
- if ``false`` (the default), a fatal error will be raised if the number of features in the dataset you predict on differs from the number seen during training
- if ``true``, LightGBM will attempt to predict on whatever data you provide. This is dangerous because you might get incorrect predictions, but you could use it in situations where it is difficult or expensive to generate some features and you are very confident that they were never chosen for splits in the model
- **Note**: be very careful setting this parameter to ``true``
- ``pred_early_stop`` , default = ``false``, type = bool
- used only in ``prediction`` task
- if ``true``, will use early-stopping to speed up the prediction. May affect the accuracy
- ``pred_early_stop_freq`` , default = ``10``, type = int
- used only in ``prediction`` task
- the frequency of checking early-stopping prediction
- ``pred_early_stop_margin`` , default = ``10.0``, type = double
- used only in ``prediction`` task
- the threshold of margin in early-stopping prediction
- ``output_result`` , default = ``LightGBM_predict_result.txt``, type = string, aliases: ``predict_result``, ``prediction_result``, ``predict_name``, ``prediction_name``, ``pred_name``, ``name_pred``
- used only in ``prediction`` task
- filename of prediction result
- **Note**: can be used only in CLI version
Convert Parameters
------------------
- ``convert_model_language`` , default = ``""``, type = string
- used only in ``convert_model`` task
- only ``cpp`` is supported yet; for conversion model to other languages consider using `m2cgen <https://github.com/BayesWitnesses/m2cgen>`__ utility
- if ``convert_model_language`` is set and ``task=train``, the model will be also converted
- **Note**: can be used only in CLI version
- ``convert_model`` , default = ``gbdt_prediction.cpp``, type = string, aliases: ``convert_model_file``
- used only in ``convert_model`` task
- output filename of converted model
- **Note**: can be used only in CLI version
Objective Parameters
--------------------
- ``objective_seed`` , default = ``5``, type = int
- used only in ``rank_xendcg`` objective
- random seed for objectives, if random process is needed
- ``num_class`` , default = ``1``, type = int, aliases: ``num_classes``, constraints: ``num_class > 0``
- used only in ``multi-class`` classification application
- ``is_unbalance`` , default = ``false``, type = bool, aliases: ``unbalance``, ``unbalanced_sets``
- used only in ``binary`` and ``multiclassova`` applications
- set this to ``true`` if training data are unbalanced
- **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
- **Note**: this parameter cannot be used at the same time with ``scale_pos_weight``, choose only **one** of them
- ``scale_pos_weight`` , default = ``1.0``, type = double, constraints: ``scale_pos_weight > 0.0``
- used only in ``binary`` and ``multiclassova`` applications
- weight of labels with positive class
- **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
- **Note**: this parameter cannot be used at the same time with ``is_unbalance``, choose only **one** of them
- ``sigmoid`` , default = ``1.0``, type = double, constraints: ``sigmoid > 0.0``
- used only in ``binary`` and ``multiclassova`` classification and in ``lambdarank`` applications
- parameter for the sigmoid function
- ``boost_from_average`` , default = ``true``, type = bool
- used only in ``regression``, ``binary``, ``multiclassova`` and ``cross-entropy`` applications
- adjusts initial score to the mean of labels for faster convergence
- ``reg_sqrt`` , default = ``false``, type = bool
- used only in ``regression`` application
- used to fit ``sqrt(label)`` instead of original values and prediction result will be also automatically converted to ``prediction^2``
- might be useful in case of large-range labels
- ``alpha`` , default = ``0.9``, type = double, constraints: ``alpha > 0.0``
- used only in ``huber`` and ``quantile`` ``regression`` applications
- parameter for `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__ and `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``fair_c`` , default = ``1.0``, type = double, constraints: ``fair_c > 0.0``
- used only in ``fair`` ``regression`` application
- parameter for `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson_max_delta_step`` , default = ``0.7``, type = double, constraints: ``poisson_max_delta_step > 0.0``
- used only in ``poisson`` ``regression`` application
- parameter for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__ to safeguard optimization
- ``tweedie_variance_power`` , default = ``1.5``, type = double, constraints: ``1.0 <= tweedie_variance_power < 2.0``
- used only in ``tweedie`` ``regression`` application
- used to control the variance of the tweedie distribution
- set this closer to ``2`` to shift towards a **Gamma** distribution
- set this closer to ``1`` to shift towards a **Poisson** distribution
- ``lambdarank_truncation_level`` , default = ``20``, type = int, constraints: ``lambdarank_truncation_level > 0``
- used only in ``lambdarank`` application
- used for truncating the max DCG, refer to "truncation level" in the Sec. 3 of `LambdaMART paper <https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf>`__
- ``lambdarank_norm`` , default = ``true``, type = bool
- used only in ``lambdarank`` application
- set this to ``true`` to normalize the lambdas for different queries, and improve the performance for unbalanced data
- set this to ``false`` to enforce the original lambdarank algorithm
- ``label_gain`` , default = ``0,1,3,7,15,31,63,...,2^30-1``, type = multi-double
- used only in ``lambdarank`` application
- relevant gain for labels. For example, the gain of label ``2`` is ``3`` in case of default label gains
- separate by ``,``
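With the default gains, label ``k`` maps to a gain of ``2^k - 1``; a custom ``label_gain`` can also be passed explicitly (hypothetical values; in the Python package a list is accepted for multi-value parameters):

::

    params = {
        "objective": "lambdarank",
        # gain of label 0 is 0, label 1 is 1, label 2 is 3, label 3 is 7;
        # all labels in the data must then be smaller than 4
        "label_gain": [0, 1, 3, 7],
    }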
Metric Parameters
-----------------
- ``metric`` , default = ``""``, type = multi-enum, aliases: ``metrics``, ``metric_types``
- metric(s) to be evaluated on the evaluation set(s)
- ``""`` (empty string or not specified) means that metric corresponding to specified ``objective`` will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added)
- ``"None"`` (string, **not** a ``None`` value) means that no metric will be registered, aliases: ``na``, ``null``, ``custom``
- ``l1``, absolute loss, aliases: ``mean_absolute_error``, ``mae``, ``regression_l1``
- ``l2``, square loss, aliases: ``mean_squared_error``, ``mse``, ``regression_l2``, ``regression``
- ``rmse``, root square loss, aliases: ``root_mean_squared_error``, ``l2_root``
- ``quantile``, `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``mape``, `MAPE loss <https://en.wikipedia.org/wiki/Mean_absolute_percentage_error>`__, aliases: ``mean_absolute_percentage_error``
- ``huber``, `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__
- ``fair``, `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson``, negative log-likelihood for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__
- ``gamma``, negative log-likelihood for **Gamma** regression
- ``gamma_deviance``, residual deviance for **Gamma** regression
- ``tweedie``, negative log-likelihood for **Tweedie** regression
- ``ndcg``, `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__, aliases: ``lambdarank``, ``rank_xendcg``, ``xendcg``, ``xe_ndcg``, ``xe_ndcg_mart``, ``xendcg_mart``
- ``map``, `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__, aliases: ``mean_average_precision``
- ``auc``, `AUC <https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve>`__
- ``binary_logloss``, `log loss <https://en.wikipedia.org/wiki/Cross_entropy>`__, aliases: ``binary``
- ``binary_error``, for one sample: ``0`` for correct classification, ``1`` for error classification
- ``auc_mu``, `AUC-mu <http://proceedings.mlr.press/v97/kleiman19a/kleiman19a.pdf>`__
- ``multi_logloss``, log loss for multi-class classification, aliases: ``multiclass``, ``softmax``, ``multiclassova``, ``multiclass_ova``, ``ova``, ``ovr``
- ``multi_error``, error rate for multi-class classification
- ``cross_entropy``, cross-entropy (with optional linear weights), aliases: ``xentropy``
- ``cross_entropy_lambda``, "intensity-weighted" cross-entropy, aliases: ``xentlambda``
- ``kullback_leibler``, `Kullback-Leibler divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`__, aliases: ``kldiv``
- support multiple metrics, separated by ``,``
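For example, several of the metrics above can be requested at once; in the Python package a list can stand in for the comma-separated string (a sketch):

::

    params = {
        "objective": "binary",
        "metric": ["auc", "binary_logloss"],  # same as "auc,binary_logloss"
    }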
- ``metric_freq`` , default = ``1``, type = int, aliases: ``output_freq``, constraints: ``metric_freq > 0``
- frequency for metric output
- **Note**: can be used only in CLI version
- ``is_provide_training_metric`` , default = ``false``, type = bool, aliases: ``training_metric``, ``is_training_metric``, ``train_metric``
- set this to ``true`` to output metric result over training dataset
- **Note**: can be used only in CLI version
- ``eval_at`` , default = ``1,2,3,4,5``, type = multi-int, aliases: ``ndcg_eval_at``, ``ndcg_at``, ``map_eval_at``, ``map_at``
- used only with ``ndcg`` and ``map`` metrics
- `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__ and `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__ evaluation positions, separated by ``,``
- ``multi_error_top_k`` , default = ``1``, type = int, constraints: ``multi_error_top_k > 0``
- used only with ``multi_error`` metric
- threshold for top-k multi-error metric
- the error on each sample is ``0`` if the true class is among the top ``multi_error_top_k`` predictions, and ``1`` otherwise
- more precisely, the error on a sample is ``0`` if there are at least ``num_classes - multi_error_top_k`` predictions strictly less than the prediction on the true class
- when ``multi_error_top_k=1`` this is equivalent to the usual multi-error metric
- ``auc_mu_weights`` , default = ``None``, type = multi-double
- used only with ``auc_mu`` metric
- list representing flattened matrix (in row-major order) giving loss weights for classification errors
- list should have ``n * n`` elements, where ``n`` is the number of classes
- the matrix co-ordinate ``[i, j]`` should correspond to the ``i * n + j``-th element of the list
- if not specified, will use equal weights for all classes
Network Parameters
------------------
- ``num_machines`` , default = ``1``, type = int, aliases: ``num_machine``, constraints: ``num_machines > 0``
- the number of machines for parallel learning application
- this parameter is needed to be set in both **socket** and **mpi** versions
- ``local_listen_port`` , default = ``12400``, type = int, aliases: ``local_port``, ``port``, constraints: ``local_listen_port > 0``
- TCP listen port for local machines
- **Note**: don't forget to allow this port in firewall settings before training
- ``time_out`` , default = ``120``, type = int, constraints: ``time_out > 0``
- socket time-out in minutes
- ``machine_list_filename`` , default = ``""``, type = string, aliases: ``machine_list_file``, ``machine_list``, ``mlist``
- path of file that lists machines for this parallel learning application
- each line contains one IP and one port for one machine. The format is ``ip port`` (space as a separator)
- ``machines`` , default = ``""``, type = string, aliases: ``workers``, ``nodes``
- list of machines in the following format: ``ip1:port1,ip2:port2``
GPU Parameters
--------------
- ``gpu_platform_id`` , default = ``-1``, type = int
- OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform
- ``-1`` means the system-wide default platform
- **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details
- ``gpu_device_id`` , default = ``-1``, type = int
- OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID
- ``-1`` means the default device in the selected platform
- **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details
- ``gpu_use_dp`` , default = ``false``, type = bool
- set this to ``true`` to use double precision math on GPU (by default single precision is used)
Others
------
Continued Training with Input Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following:
::
0.5
-0.1
0.9
...
It means the initial score of the first data row is ``0.5``, second is ``-0.1``, and so on.
The initial score file corresponds with the data file line by line, with one score per line.
And if the name of data file is ``train.txt``, the initial score file should be named as ``train.txt.init`` and placed in the same folder as the data file.
In this case, LightGBM will auto load initial score file if it exists.
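In the Python package, the counterpart of the ``.init`` file is the ``init_score`` argument of ``Dataset`` (a hedged sketch; the scores here are arbitrary raw scores):

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(11)
    X = rng.normal(size=(300, 4))
    y = rng.integers(0, 2, size=300)

    init_score = np.full(300, 0.5)  # one initial (raw) score per data row
    train_set = lgb.Dataset(X, label=y, init_score=init_score)
    booster = lgb.train({"objective": "binary"}, train_set, num_boost_round=10)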
Weight Data
~~~~~~~~~~~
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
::
1.0
0.5
0.8
...
It means the weight of the first data row is ``1.0``, second is ``0.5``, and so on.
The weight file corresponds with the data file line by line, with one weight per line.
And if the name of data file is ``train.txt``, the weight file should be named as ``train.txt.weight`` and placed in the same folder as the data file.
In this case, LightGBM will load the weight file automatically if it exists.
Also, you can include weight column in your data file. Please refer to the ``weight_column`` `parameter <#weight_column>`__ in above.
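The in-memory counterpart in the Python package is the ``weight`` argument of ``Dataset`` (a sketch):

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(13)
    X = rng.normal(size=(300, 4))
    y = rng.integers(0, 2, size=300)

    weight = rng.uniform(0.5, 1.0, size=300)  # one weight per data row
    train_set = lgb.Dataset(X, label=y, weight=weight)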
Query Data
~~~~~~~~~~
For learning to rank, LightGBM needs query information for the training data.
LightGBM uses an additional file to store query data, like the following:
::
27
18
67
...
It means the first ``27`` lines of samples belong to one query, the next ``18`` lines belong to another, and so on.
**Note**: data should be ordered by the query.
If the name of data file is ``train.txt``, the query file should be named as ``train.txt.query`` and placed in the same folder as the data file.
In this case, LightGBM will load the query file automatically if it exists.
Also, you can include query/group id column in your data file. Please refer to the ``group_column`` `parameter <#group_column>`__ in above.
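In the Python package the same information is passed as the ``group`` argument, a list of query sizes rather than a per-row id column. A sketch mirroring the query file above:

::

    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(17)
    X = rng.normal(size=(112, 5))     # 27 + 18 + 67 rows
    y = rng.integers(0, 4, size=112)  # integer relevance labels

    # Rows must already be ordered by query: first 27, then 18, then 67.
    train_set = lgb.Dataset(X, label=y, group=[27, 18, 67])
    booster = lgb.train({"objective": "lambdarank"}, train_set,
                        num_boost_round=10)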