Korean Society for Technology of Plasticity (KSTP) Hands-on Session 2

Predict Bulk Modulus with PyCaret

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. AI in Materials Science

1.1. The Fourth Paradigm in Materials Science

Image: Ankit Agrawal and Alok Choudhary, "Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science," APL Materials 4, 053208 (2016); https://doi.org/10.1063/1.4946894

1.2. Fast Materials Screening

AI-based methods serve as a fast pre-screening tool ahead of expensive traditional methods such as density functional theory (DFT)

Image: Park, H., Bartel, C. J., Ceder, G., Zapol, P., "Layered Transition Metal Oxides as Ca Intercalation Cathodes: A Systematic First-Principles Evaluation," Adv. Energy Mater, 2021, 11, 2101698. https://doi.org/10.1002/aenm.202101698

1.3. Materials Database

The Materials Project is a database of material properties predicted with density functional theory (DFT).
It provides structural information and property data for inorganic materials.
https://materialsproject.org/

2. PyCaret

PyCaret is ideal for:
- Experienced Data Scientists who want to increase productivity.
- Citizen Data Scientists who prefer a low code machine learning solution.
- Data Science Professionals who want to build rapid prototypes.
- Data Science and Machine Learning students and enthusiasts.

Documentation: https://pycaret.org/ and https://pycaret.readthedocs.io/en/latest/index.html

AutoML
- Automates the workflow from data preprocessing to model validation

#!pip install pycaret

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import pycaret
import pycaret.regression
import pycaret.classification
import numpy as np
import pandas as pd

# filter warnings messages from the notebook
import warnings
warnings.filterwarnings('ignore')

3. Regression

3.1. Feature Extraction

Descriptors

Extract features using matminer
Covalent radius, s$\cdot$p$\cdot$d$\cdot$f orbital valence, oxidation state, space group, density, etc.
Bulk modulus is directly related to the interatomic potential and the volume per atom
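
As a rough illustration of what a column such as "MagpieData mean MeltingT" encodes, the sketch below computes composition-weighted statistics of one elemental property in plain Python. It is not matminer's implementation: the helper name composition_stats and the equal-fraction Li-Mg composition are assumptions made for illustration. The melting points (453.69 K for Li and 923 K for Mg) are values that also appear in the MeltingT columns of the dataset.

```python
def composition_stats(fractions, prop):
    """Fraction-weighted statistics of one elemental property.

    fractions: dict mapping element -> atomic fraction (should sum to 1)
    prop:      dict mapping element -> property value
    """
    mean = sum(f * prop[el] for el, f in fractions.items())
    # Average deviation: fraction-weighted mean absolute deviation from the mean
    avg_dev = sum(f * abs(prop[el] - mean) for el, f in fractions.items())
    values = [prop[el] for el in fractions]
    return {
        "mean": mean,
        "avg_dev": avg_dev,
        "minimum": min(values),
        "maximum": max(values),
    }

melting_t = {"Li": 453.69, "Mg": 923.0}   # melting points in K
fractions = {"Li": 0.5, "Mg": 0.5}        # hypothetical composition, for illustration only
stats = composition_stats(fractions, melting_t)
print(stats)
```

Each statistic becomes one descriptor column, so a single elemental property contributes several features per material.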

Dataset

Input: Descriptors
Output: Bulk modulus
The bulk modulus of a substance measures how strongly it resists uniform compression
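
The definition can be made concrete with the standard relation $K = -V\,dP/dV$. The sketch below evaluates it by finite differences. The function names and the numbers are illustrative assumptions, not values from the dataset. The target column name k_vrh suggests the Voigt-Reuss-Hill average, which is the arithmetic mean of the Voigt and Reuss bounds.

```python
def bulk_modulus(volume, delta_pressure, delta_volume):
    """Finite-difference estimate of K = -V * dP/dV."""
    return -volume * delta_pressure / delta_volume

def vrh_average(k_voigt, k_reuss):
    """Voigt-Reuss-Hill average: mean of the Voigt (upper) and Reuss (lower) bounds."""
    return 0.5 * (k_voigt + k_reuss)

# Illustrative numbers: raising the pressure by 1 GPa shrinks a 100 A^3 cell by 0.5 A^3.
k = bulk_modulus(volume=100.0, delta_pressure=1.0, delta_volume=-0.5)
print(k)                          # 200.0, in GPa for these units
print(vrh_average(210.0, 190.0))  # 200.0
```

A larger K means that a given pressure change produces a smaller relative volume change.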

df_reg = pd.read_csv("/content/drive/MyDrive/kstp/data_files/df_reg.csv", index_col = 0)
df_reg

3.2. Pipeline Setup

setup() initializes the training environment and creates the transformation pipeline
It must be called before any other PyCaret function
It takes two mandatory parameters: "data" and "target"

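# fold = 5 sets the default number of cross-validation folds; session_id fixes the random seed.
# silent = True skips the interactive data-type confirmation prompt (PyCaret 2.x argument).
pipeline = pycaret.regression.setup(data = df_reg,
                                    target = 'k_vrh',
                                    train_size = 0.9,
                                    fold = 5,
                                    silent = True,
                                    session_id = 123)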

INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=False, features_todrop=[],
                                      id_columns=[], ml_usecase='regression',
                                      numerical_features=[], target='k_vrh',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_strategy=...
                ('scaling', 'passthrough'), ('P_transform', 'passthrough'),
                ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                ('cluster_all', 'passthrough'),
                ('dummy', Dummify(target='k_vrh')),
                ('fix_perfect', Remove_100(target='k_vrh')),
                ('clean_names', Clean_Colum_Names()),
                ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                ('dfs', 'passthrough'), ('pca', 'passthrough')],
         verbose=False)
INFO:logs:setup() succesfully completed......................................
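
The effect of train_size = 0.9 can be checked by hand. The sketch below is a plain-Python stand-in for a shuffled hold-out split. It does not reproduce PyCaret's actual row assignment, and holdout_split is a hypothetical helper. With the 7142 rows of df_reg it yields the 6427/715 split reported in the setup summary.

```python
import math
import random

def holdout_split(n_rows, train_size, seed):
    """Shuffle row indices and cut them into a training part and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = math.floor(train_size * n_rows)  # the training share is rounded down
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = holdout_split(n_rows=7142, train_size=0.9, seed=123)
print(len(train_idx), len(test_idx))  # 6427 715
```

The test part is held back during training and is only used for the final evaluation.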

Normalization

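setup() can also apply preprocessing steps such as PCA, feature selection, and normalization

Setting "normalize = True" in "setup()" normalizes the features; the setup summary reports the z-score method. Calling setup() again replaces the previous experiment, so the models trained below use this normalized pipeline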

pipeline1 = pycaret.regression.setup(data = df_reg, 
                                     target = 'k_vrh', 
                                     train_size = 0.9, 
                                     fold = 5, 
                                     normalize = True, 
                                     silent = True, 
                                     session_id = 123)

INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=False, features_todrop=[],
                                      id_columns=[], ml_usecase='regression',
                                      numerical_features=[], target='k_vrh',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_strategy=...
                                                  target='k_vrh')),
                ('P_transform', 'passthrough'), ('binn', 'passthrough'),
                ('rem_outliers', 'passthrough'), ('cluster_all', 'passthrough'),
                ('dummy', Dummify(target='k_vrh')),
                ('fix_perfect', Remove_100(target='k_vrh')),
                ('clean_names', Clean_Colum_Names()),
                ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                ('dfs', 'passthrough'), ('pca', 'passthrough')],
         verbose=False)
INFO:logs:setup() succesfully completed......................................
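
A minimal sketch of z-score normalization, assuming the usual convention that the mean and standard deviation are estimated on the training data only and then reused for new data. The five training values are the density entries of the first rows of df_reg. The unseen value 10.0 and the helper names are illustrative assumptions.

```python
import statistics

def fit_zscore(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_zscore(values, mean, std):
    """Scale values with statistics learned from the training data."""
    return [(v - mean) / std for v in values]

train_density = [13.988541, 6.519289, 17.027465, 3.312854, 5.439796]
mean, std = fit_zscore(train_density)

train_scaled = apply_zscore(train_density, mean, std)
new_scaled = apply_zscore([10.0], mean, std)   # unseen data reuses the training statistics

print(train_scaled)
print(new_scaled)
```

After scaling, the training values have zero mean and unit standard deviation, which puts features with very different units on a comparable scale.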

3.3. Training

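compare_models() trains the available models and returns the top performer according to the metric given in the "sort" parameter

The reported scores are cross-validated; 5 folds are used here, because fold = 5 was passed to setup()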

top_model = pycaret.regression.compare_models(sort = 'R2')

INFO:logs:create_model_container: 18
INFO:logs:master_model_container: 18
INFO:logs:display_container: 2
INFO:logs:ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                    max_depth=None, max_features='auto', max_leaf_nodes=None,
                    max_samples=None, min_impurity_decrease=0.0,
                    min_impurity_split=None, min_samples_leaf=1,
                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                    n_estimators=100, n_jobs=-1, oob_score=False,
                    random_state=123, verbose=0, warm_start=False)
INFO:logs:compare_models() succesfully completed......................................
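
For reference, R2 (the coefficient of determination) compares a model's squared error with that of always predicting the mean. The sketch below uses made-up numbers and a hypothetical helper r2_score_manual. It shows why a constant predictor, like the Dummy Regressor at the bottom of the leaderboard, scores approximately zero.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [30.0, 75.0, 160.0, 235.0, 300.0]            # illustrative bulk moduli
good_pred = [40.0, 70.0, 150.0, 240.0, 290.0]         # close to the true values
mean_pred = [sum(y_true) / len(y_true)] * len(y_true) # always predict the mean

print(r2_score_manual(y_true, good_pred))
print(r2_score_manual(y_true, mean_pred))   # 0.0: no better than predicting the mean
```

R2 can also become negative when a model does worse than the mean predictor on the evaluation data.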

lasso = pycaret.regression.create_model('lasso', fold = 5)

INFO:logs:create_model_container: 19
INFO:logs:master_model_container: 19
INFO:logs:display_container: 3
INFO:logs:Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)
INFO:logs:create_model() succesfully completed......................................
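
The 5-fold scheme used above can be sketched without any library. The helper kfold_indices is hypothetical and only mimics the idea (contiguous folds, no shuffling): every sample is used for validation exactly once and for training in the other four rounds. With the 6427 training rows, the folds hold 1286 or 1285 samples.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)   # the first `extra` folds get one more sample
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

fold_sizes = [len(val) for _, val in kfold_indices(n_samples=6427, n_splits=5)]
print(fold_sizes)  # [1286, 1286, 1285, 1285, 1285]
```

The per-fold rows in the create_model() output correspond to these five validation rounds.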

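Turning off k-fold cross-validation

With cross_validation = False, each model is trained once and scored on the hold-out test set. A single split gives a noisier estimate than the cross-validated mean, so the reported performance may be overestimated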

models = pycaret.regression.compare_models(sort = 'R2', cross_validation = False)

INFO:logs:create_model_container: 19
INFO:logs:master_model_container: 19
INFO:logs:display_container: 4
INFO:logs:LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
INFO:logs:compare_models() succesfully completed......................................

lasso = pycaret.regression.create_model('lasso', cross_validation = False)

INFO:logs:display_container: 5
INFO:logs:Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)
INFO:logs:create_models() succesfully completed......................................

3.4. Model Analysis

pycaret.regression.evaluate_model(lasso)

INFO:logs:Initializing evaluate_model()
INFO:logs:evaluate_model(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False), fold=None, fit_kwargs=None, plot_kwargs=None, feature_name=None, groups=None, use_train_data=False)

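3.5. Hyperparameter Tuning

tune_model() searches for better hyperparameters; grid search, random search, and Bayesian optimization are among the available strategies

Here tuning changes alpha from 1.0 to 0.19 and raises the mean 5-fold R2 of the lasso model from 0.7068 to 0.7158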

tuned_model = pycaret.regression.tune_model(lasso, 
                                            optimize = 'R2', 
                                            choose_better = True, 
                                            n_iter = 50, 
                                            fold = 5)

INFO:logs:create_model_container: 21
INFO:logs:master_model_container: 21
INFO:logs:display_container: 6
INFO:logs:Lasso(alpha=0.19, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)
INFO:logs:tune_model() succesfully completed......................................
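
To illustrate the idea behind the tuning step, the sketch below runs a random search over the regularization strength alpha of a one-feature ridge-style model in plain Python. It is a simplified stand-in, not PyCaret's tuning code: the toy data, the closed-form ridge model (used here instead of lasso), and the helper names are all assumptions. Keeping the baseline candidate in the comparison mirrors the intent of choose_better = True, so the selected model is never worse than the starting one on the validation data.

```python
import random

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope of a no-intercept 1-D ridge fit: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(xs, ys, w):
    """Mean squared error of the linear model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(123)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]        # toy data with true slope 3
x_train, y_train, x_val, y_val = xs[:30], ys[:30], xs[30:], ys[30:]

baseline_alpha = 1.0
baseline_mse = mse(x_val, y_val, fit_ridge_1d(x_train, y_train, baseline_alpha))

best_alpha, best_mse = baseline_alpha, baseline_mse
for _ in range(50):                                    # 50 random candidates, like n_iter = 50
    alpha = rng.uniform(0.0, 10.0)
    score = mse(x_val, y_val, fit_ridge_1d(x_train, y_train, alpha))
    if score < best_mse:                               # keep a candidate only if it beats the best so far
        best_alpha, best_mse = alpha, score

print(best_alpha, best_mse, baseline_mse)
```

Random search samples candidates instead of enumerating a full grid, which keeps the cost fixed at n_iter model fits regardless of how many hyperparameters are searched.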

Before hyperparameter tuning

lasso

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)

After hyperparameter tuning

tuned_model

Lasso(alpha=0.19, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)

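3.6. Prediction

predict_model() generates predictions with a trained model

Called without a "data" argument, it evaluates the model on the hold-out test set (715 rows here), which was not used for training, and appends the predictions in a "Label" column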

pycaret.regression.predict_model(tuned_model)

INFO:logs:Initializing predict_model()
INFO:logs:predict_model(estimator=Lasso(alpha=0.19, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False), probability_threshold=None, encoded_labels=True, drift_report=False, raw_score=False, round=4, verbose=True, ml_usecase=MLUsecase.REGRESSION, display=None, drift_kwargs=None)
INFO:logs:Checking exceptions
INFO:logs:Preloading libraries
INFO:logs:Preparing display monitor

Prediction on any data point

rf = pycaret.regression.create_model('rf', fold = 5)

INFO:logs:create_model_container: 22
INFO:logs:master_model_container: 22
INFO:logs:display_container: 8
INFO:logs:RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
INFO:logs:create_model() succesfully completed......................................

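# Row 236 of df_reg: column 0 is the target (k_vrh), the remaining columns are the features
arbitrary_point = df_reg.iloc[236, 1:].values.reshape(1,-1)

# Caution: rf.predict() bypasses the PyCaret preprocessing pipeline. The last setup() call used
# normalize = True, so raw features may not match what rf was trained on;
# predict_model(rf, data = ...) applies the pipeline first.
pred = rf.predict(arbitrary_point)
ground_truth = df_reg.iloc[236,0]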

print("Model prediction: ", pred)
print("Ground truth: ", ground_truth)

Model prediction:  [161.26855656]
Ground truth:  157.38629328397334

4. Classification

Classify whether a material is a metal or a non-metal from its composition
Input: Descriptors obtained from the composition
Output: 0 (Non-metal) or 1 (Metal)
Follow the same procedure as for the regression task
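
The composition descriptors in df_cls include element-fraction columns (H, He, Li, ...) and stoichiometric p-norm columns (0-norm, 2-norm, ...). As a sanity check, the sketch below recomputes the norms for the first formula in the table, Ag(AuS)2, from its atomic fractions. The helper stoichiometry_norm is hypothetical and written from the usual definition of a p-norm, not taken from the featurizer's source.

```python
def stoichiometry_norm(fractions, p):
    """p-norm of the atomic fractions; p = 0 counts the elements present."""
    if p == 0:
        return sum(1 for f in fractions if f > 0)
    return sum(f ** p for f in fractions) ** (1.0 / p)

# Ag(AuS)2 has 1 Ag, 2 Au and 2 S atoms per formula unit.
counts = {"Ag": 1, "Au": 2, "S": 2}
total = sum(counts.values())
fractions = [n / total for n in counts.values()]

for p in (0, 2, 3, 5):
    print(p, round(stoichiometry_norm(fractions, p), 6))
```

The results agree with the 0-norm, 2-norm, 3-norm and 5-norm entries of row 0 in df_cls (3, 0.600000, 0.514256, 0.460906).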

import pycaret.classification

df_cls = pd.read_csv("/content/drive/MyDrive/kstp/data_files/df_cls.csv", index_col = 0)
df_cls

Use the remaining columns, dropping the formula column

df_cls = df_cls.iloc[:,1:]

cls = pycaret.classification.setup(data = df_cls,
                                   target = 'is_metal',
                                   train_size = 0.9,
                                   fold = 5,
                                   silent = True,
                                   session_id = 123)

INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=False, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[], target='is_metal',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_st...
                ('scaling', 'passthrough'), ('P_transform', 'passthrough'),
                ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                ('cluster_all', 'passthrough'),
                ('dummy', Dummify(target='is_metal')),
                ('fix_perfect', Remove_100(target='is_metal')),
                ('clean_names', Clean_Colum_Names()),
                ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                ('dfs', 'passthrough'), ('pca', 'passthrough')],
         verbose=False)
INFO:logs:setup() succesfully completed......................................

Classification metrics

Accuracy, AUC, Recall, F1, etc.

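For all of these metrics, values closer to 1 indicate better classification performance. An AUC of 0.0000, as shown for the Ridge Classifier and the linear-kernel SVM, is a placeholder that PyCaret reports for estimators without predict_proba; it is not a measured score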
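
The headline metrics all derive from the four confusion-matrix counts. The sketch below uses hypothetical counts (not taken from the experiment) for a classifier that labels every sample as metal on a balanced set. It reproduces the pattern of the Dummy Classifier row in the leaderboard: a recall of 1, but accuracy and precision near 0.5.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 50 metals and 50 non-metals, everything predicted as metal.
always_metal = classification_metrics(tp=50, fp=50, fn=0, tn=0)
print(always_metal)
```

A high value in a single metric can therefore be misleading, which is why the leaderboard reports several metrics side by side.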

models = pycaret.classification.compare_models(sort = 'Accuracy')

INFO:logs:create_model_container: 14
INFO:logs:master_model_container: 14
INFO:logs:display_container: 2
INFO:logs:LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=123, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
INFO:logs:compare_models() succesfully completed......................................

rf = pycaret.classification.create_model('rf', fold = 5)

INFO:logs:create_model_container: 15
INFO:logs:master_model_container: 15
INFO:logs:display_container: 3
INFO:logs:RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False)
INFO:logs:create_model() succesfully completed......................................

pycaret.classification.evaluate_model(rf)

INFO:logs:Initializing evaluate_model()
INFO:logs:evaluate_model(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False), fold=None, fit_kwargs=None, plot_kwargs=None, feature_name=None, groups=None, use_train_data=False)

pycaret.classification.predict_model(rf)

INFO:logs:Initializing predict_model()
INFO:logs:predict_model(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False), probability_threshold=None, encoded_labels=False, drift_report=False, raw_score=False, round=4, verbose=True, ml_usecase=MLUsecase.CLASSIFICATION, display=None, drift_kwargs=None)
INFO:logs:Checking exceptions
INFO:logs:Preloading libraries
INFO:logs:Preparing display monitor


	k_vrh	vpa	density	MagpieData mean MeltingT	MagpieData mean NUnfilled	packing fraction	MagpieData mode MeltingT	MagpieData minimum NUnfilled	MagpieData maximum GSvolume_pa	MagpieData mean GSvolume_pa	MagpieData minimum NValence	MagpieData mean NdUnfilled	MagpieData mode NUnfilled	MagpieData mean NpValence	MagpieData avg_dev NpUnfilled	MagpieData minimum MeltingT	MagpieData maximum MeltingT	MagpieData maximum NdValence	MagpieData mode GSvolume_pa	MagpieData mean MendeleevNumber	MagpieData minimum Electronegativity	MagpieData minimum MendeleevNumber	std_dev oxidation state
0	295.077545	12.957800	13.988541	2496.500000	4.000000	0.570238	1687.00	4.0	20.440000	17.265000	4.0	2.000	4.0	1.000000	2.000000	1687.00	3306.00	6.0	14.090000	67.500000	1.90	57.0	5.656854
1	74.370488	17.868860	6.519289	1401.760000	2.800000	0.788912	1211.40	0.0	54.230000	24.146000	2.0	1.200	3.0	0.800000	1.920000	1050.00	1768.00	10.0	10.245000	56.400000	0.95	8.0	0.000000
2	234.099927	15.435634	17.027465	2016.300000	3.500000	0.686917	2041.40	2.0	16.690000	15.437500	4.0	2.750	2.0	0.000000	0.000000	1941.00	2041.40	9.0	15.020000	58.000000	1.54	43.0	2.529822
3	30.178322	18.871482	3.312854	606.150000	0.500000	0.732266	453.69	0.0	22.890000	18.892917	1.0	0.000	1.0	0.000000	0.000000	453.69	923.00	10.0	16.593333	35.000000	0.98	1.0	0.000000
4	41.336301	19.547245	5.439796	633.502500	0.250000	0.705746	923.00	0.0	25.237586	21.902730	1.0	0.000	0.0	0.000000	0.000000	234.32	923.00	10.0	22.890000	52.000000	0.98	1.0	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7137	62.971555	11.111316	2.536134	470.408750	2.250000	0.587636	54.80	1.0	16.593333	12.401250	1.0	0.875	2.0	2.000000	1.000000	54.80	2183.00	3.0	9.105000	49.625000	0.98	1.0	3.043544
7138	278.173744	8.570548	5.426881	2277.285714	5.857143	0.686866	2348.00	5.0	13.010000	9.674286	3.0	3.000	5.0	0.571429	2.448980	2183.00	2348.00	3.0	7.172500	60.857143	1.63	46.0	3.346640
7139	32.487326	15.415494	1.283597	743.880588	1.705882	0.718573	453.69	1.0	20.440000	17.498431	1.0	0.000	1.0	0.470588	1.439446	453.69	1687.00	0.0	16.593333	19.117647	0.98	1.0	0.000000
7140	44.284962	16.322156	5.419970	388.491429	2.000000	0.345413	54.80	0.0	31.560000	16.214286	6.0	0.000	2.0	3.142857	0.571429	54.80	903.78	10.0	9.105000	83.857143	1.65	69.0	3.082207
7141	71.567048	11.830199	4.046538	622.086667	1.333333	0.675695	54.80	1.0	16.593333	12.256111	1.0	0.000	1.0	1.333333	0.888889	54.80	1357.77	10.0	9.105000	50.666667	0.98	1.0	1.732051

	Description	Value
0	session_id	123
1	Target	k_vrh
2	Original Data	(7142, 23)
3	Missing Values	True
4	Numeric Features	22
5	Categorical Features	0
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(6427, 22)
10	Transformed Test Set	(715, 22)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	5
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	0a0e
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	False
28	Normalize Method	None
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Remove Perfect Collinearity	True
43	Clustering	False
44	Clustering Iteration	None
45	Polynomial Features	False
46	Polynomial Degree	None
47	Trignometry Features	False
48	Polynomial Threshold	None
49	Group Features	False
50	Feature Selection	False
51	Feature Selection Method	classic
52	Features Selection Threshold	None
53	Feature Interaction	False
54	Feature Ratio	False
55	Interaction Threshold	None
56	Transform Target	False
57	Transform Target Method	box-cox

	Description	Value
0	session_id	123
1	Target	k_vrh
2	Original Data	(7142, 23)
3	Missing Values	True
4	Numeric Features	22
5	Categorical Features	0
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(6427, 22)
10	Transformed Test Set	(715, 22)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	5
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	a07b
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	True
28	Normalize Method	zscore
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Remove Perfect Collinearity	True
43	Clustering	False
44	Clustering Iteration	None
45	Polynomial Features	False
46	Polynomial Degree	None
47	Trignometry Features	False
48	Polynomial Threshold	None
49	Group Features	False
50	Feature Selection	False
51	Feature Selection Method	classic
52	Features Selection Threshold	None
53	Feature Interaction	False
54	Feature Ratio	False
55	Interaction Threshold	None
56	Transform Target	False
57	Transform Target Method	box-cox

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
et	Extra Trees Regressor	12.0479	594.9934	24.2459	0.8960	0.3211	0.2908	2.096
lightgbm	Light Gradient Boosting Machine	13.1333	599.6549	24.3223	0.8953	0.3345	0.3181	0.408
rf	Random Forest Regressor	13.4195	698.4265	26.3109	0.8778	0.3296	0.3245	5.834
gbr	Gradient Boosting Regressor	15.6848	703.4731	26.3972	0.8770	0.3865	0.4309	1.590
knn	K Neighbors Regressor	18.8921	1017.3413	31.8747	0.8216	0.4326	0.4564	0.196
dt	Decision Tree Regressor	18.9631	1250.4976	35.2976	0.7808	0.4398	0.3759	0.178
lr	Linear Regression	27.6286	1617.5449	40.1885	0.7166	0.6450	1.3883	1.148
ridge	Ridge Regression	27.6279	1617.5062	40.1880	0.7166	0.6449	1.3880	0.038
br	Bayesian Ridge	27.6245	1617.4076	40.1869	0.7166	0.6440	1.3864	0.038
huber	Huber Regressor	27.3125	1641.9605	40.4763	0.7124	0.6599	1.5571	0.170
lasso	Lasso Regression	28.2483	1673.2683	40.8707	0.7068	0.6444	1.3864	0.032
ada	AdaBoost Regressor	32.6235	1866.3366	43.1900	0.6720	0.7023	1.1851	0.622
en	Elastic Net	31.0166	1901.4729	43.5832	0.6667	0.6450	1.1215	0.034
par	Passive Aggressive Regressor	30.9325	2130.8661	46.0131	0.6269	0.7398	1.9523	0.032
omp	Orthogonal Matching Pursuit	33.7089	2237.7258	47.2829	0.6076	0.6832	1.6168	0.040
lar	Least Angle Regression	32.9794	2343.6933	46.7437	0.5925	0.7113	1.5937	0.036
llar	Lasso Least Angle Regression	60.9662	5703.7367	75.5084	-0.0005	1.0052	2.6295	0.032
dummy	Dummy Regressor	60.9662	5703.7368	75.5084	-0.0005	1.0052	2.6295	0.016

	MAE	MSE	RMSE	R2	RMSLE	MAPE
Fold
0	28.5412	1877.9492	43.3353	0.6860	0.6319	1.1045
1	28.4220	1503.8801	38.7799	0.7173	0.6346	1.0561
2	28.4987	1619.6501	40.2449	0.7234	0.6728	1.2390
3	27.8673	1575.6525	39.6945	0.7202	0.6126	0.8133
4	27.9125	1789.2094	42.2990	0.6869	0.6699	2.7191
Mean	28.2483	1673.2683	40.8707	0.7068	0.6444	1.3864
Std	0.2955	138.8760	1.6889	0.0167	0.0234	0.6804
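
The Mean and Std rows of the per-fold table above are plain summary statistics of the five fold scores. As a check, the R2 column can be reproduced with the standard library. Using the population standard deviation is an inference from the numbers, not something the source states.

```python
import statistics

fold_r2 = [0.6860, 0.7173, 0.7234, 0.7202, 0.6869]   # R2 per fold, copied from the table

mean_r2 = statistics.fmean(fold_r2)
std_r2 = statistics.pstdev(fold_r2)    # population standard deviation (divides by n)

print(round(mean_r2, 4), round(std_r2, 4))  # 0.7068 0.0167
```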

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
lightgbm	Light Gradient Boosting Machine	12.5440	451.4136	21.2465	0.9219	0.3249	0.2789	0.29
rf	Random Forest Regressor	11.9251	463.2798	21.5239	0.9199	0.3017	0.2681	5.70
et	Extra Trees Regressor	11.3981	556.9921	23.6007	0.9036	0.2978	0.2479	2.62
gbr	Gradient Boosting Regressor	15.6892	668.0345	25.8464	0.8844	0.3723	0.3922	2.55
knn	K Neighbors Regressor	17.9950	827.4488	28.7654	0.8569	0.3869	0.3661	0.03
dt	Decision Tree Regressor	18.1102	1040.7476	32.2606	0.8200	0.3962	0.3470	0.15
lr	Linear Regression	29.1737	1809.9832	42.5439	0.6869	0.7109	2.3270	0.02
ridge	Ridge Regression	29.1751	1810.0604	42.5448	0.6869	0.7108	2.3277	0.02
lar	Least Angle Regression	29.1737	1809.9846	42.5439	0.6869	0.7109	2.3270	0.02
br	Bayesian Ridge	29.1856	1810.6336	42.5515	0.6868	0.7105	2.3321	0.03
lasso	Lasso Regression	30.0671	1897.7419	43.5631	0.6717	0.7275	2.6124	0.02
en	Elastic Net	32.2494	1944.7172	44.0989	0.6636	0.7063	1.8929	0.01
huber	Huber Regressor	29.3523	2053.9846	45.3209	0.6447	0.7408	2.9267	0.22
ada	AdaBoost Regressor	36.0101	2121.9747	46.0649	0.6329	0.8378	1.4045	0.89
par	Passive Aggressive Regressor	32.5164	2264.5766	47.5876	0.6082	0.8107	3.0197	0.02
omp	Orthogonal Matching Pursuit	36.1562	2731.0672	52.2596	0.5275	0.7631	3.2885	0.01
llar	Lasso Least Angle Regression	61.6404	5780.9048	76.0323	-0.0001	1.0266	2.8614	0.02
dummy	Dummy Regressor	61.6404	5780.9048	76.0323	-0.0001	1.0266	2.8614	0.00

	MAE	MSE	RMSE	R2	RMSLE	MAPE
Fold
0	27.6558	1813.2025	42.5817	0.6968	0.6264	1.0405
1	28.1306	1457.2672	38.1742	0.7261	0.6463	1.0540
2	27.9286	1596.6030	39.9575	0.7274	0.6776	1.2729
3	27.3751	1530.2343	39.1182	0.7283	0.6077	0.7910
4	27.1054	1712.9308	41.3876	0.7003	0.6683	2.7697
Mean	27.6391	1622.0476	40.2438	0.7158	0.6452	1.3856
Std	0.3688	127.3025	1.5752	0.0141	0.0259	0.7086

	vpa	density	MagpieData mean MeltingT	MagpieData mean NUnfilled	packing fraction	MagpieData mode MeltingT	MagpieData minimum NUnfilled	MagpieData maximum GSvolume_pa	MagpieData mean GSvolume_pa	MagpieData minimum NValence	...	MagpieData minimum MeltingT	MagpieData maximum MeltingT	MagpieData maximum NdValence	MagpieData mode GSvolume_pa	MagpieData mean MendeleevNumber	MagpieData minimum Electronegativity	MagpieData minimum MendeleevNumber	std_dev oxidation state	k_vrh	Label
0	0.701711	-0.798078	-0.837051	0.211009	1.895607	-0.898599	-0.368097	4.474749	2.278716	-0.918808	...	-0.834311	-0.157167	-1.604618	-0.717333	-1.143494	-1.529214	-1.211086	0.443816	63.367096	54.765396
1	-0.617081	-0.457117	-0.511846	-0.420388	-0.753925	-0.898599	-0.368097	-0.708838	-1.022111	-0.093024	...	-0.834311	-0.006234	0.843769	-0.717333	0.627458	0.276397	0.239350	0.447951	76.100433	90.682869
2	1.468199	-0.278924	-0.616194	-1.192096	-1.116219	-0.177000	-1.006112	0.236755	0.562947	2.109065	...	0.128198	-1.454171	0.843769	-0.222338	1.089826	0.541220	1.231753	0.528459	43.414299	21.156166
3	-0.572876	-0.881211	-0.652042	0.304550	-1.339661	-0.900070	-0.368097	-0.708838	-0.900161	-0.093024	...	-0.836273	-0.006234	-1.332575	-0.717333	0.729237	0.276397	0.239350	0.421930	36.862297	93.154945
4	-0.867244	2.043521	1.319934	3.859084	1.163660	1.695573	2.183963	-0.534352	-0.944250	-0.368285	...	1.207558	0.477466	-1.604618	-0.914362	-0.643324	-0.108800	-0.638545	2.590237	210.491104	276.203522
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
710	-0.107009	-0.698368	-0.290859	1.333494	0.190230	-0.898599	0.269918	0.111275	-0.156392	-0.368285	...	-0.834311	-0.157167	-1.604618	-0.717333	-1.021360	-0.493996	-0.982069	-1.120963	114.602539	94.318146
711	-0.639954	-0.528004	-0.777578	-0.892767	-0.540587	-0.898599	-0.368097	-0.052050	-0.970308	-0.918808	...	-0.834311	-0.699374	0.843769	-0.717333	0.575696	-1.192166	-1.325594	0.487291	119.183334	71.598373
712	0.407570	0.451744	0.210555	0.912562	0.871026	1.091488	-1.006112	-0.261630	0.229826	-0.368285	...	-0.563430	-0.157167	0.843769	0.621347	-1.701824	-0.156949	-0.982069	-1.120963	78.364815	87.682449
713	0.579226	-0.080156	0.558155	0.865792	-0.553754	-0.143086	0.269918	0.236755	0.769799	-0.093024	...	0.173435	0.216006	0.843769	0.438336	0.473336	-0.229173	0.277519	-1.120963	35.157047	91.986328
714	-0.841389	-0.216850	-1.202937	0.042637	-0.608252	-0.898599	0.269918	-0.312777	-0.692642	-0.368285	...	-0.834311	-1.203634	0.843769	-0.717333	1.250346	0.444921	1.384430	0.787872	184.245941	107.263069

	MAE	MSE	RMSE	R2	RMSLE	MAPE
Fold
0	13.4856	882.6203	29.7089	0.8524	0.3264	0.3242
1	13.8110	560.0324	23.6650	0.8947	0.3412	0.3628
2	13.0609	563.2316	23.7325	0.9038	0.3299	0.2862
3	13.7995	819.6946	28.6303	0.8545	0.3059	0.2450
4	12.9405	666.5537	25.8177	0.8834	0.3445	0.4045
Mean	13.4195	698.4265	26.3109	0.8778	0.3296	0.3245
Std	0.3633	131.9695	2.4827	0.0209	0.0136	0.0559

	formula	is_metal	H	He	Li	Be	B	C	N	O	...	Fm	Md	No	Lr	0-norm	2-norm	3-norm	5-norm	7-norm	10-norm
0	Ag(AuS)2	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	3	0.600000	0.514256	0.460906	0.441882	0.428730
1	Ag(W3Br7)2	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	3	0.726873	0.683796	0.668584	0.666919	0.666681
2	Ag0.5Ge1Pb1.75S4	False	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	4	0.621647	0.569761	0.553591	0.551970	0.551738
3	Ag0.5Ge1Pb1.75Se4	False	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	4	0.621647	0.569761	0.553591	0.551970	0.551738
4	Ag2BBr	True	0.0	0	0.0	0.0	0.25	0.0	0.0	0.00	...	0	0	0	0	3	0.612372	0.538609	0.506099	0.501109	0.500098
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4916	ZrTaN3	False	0.0	0	0.0	0.0	0.00	0.0	0.6	0.00	...	0	0	0	0	3	0.663325	0.614463	0.600984	0.600078	0.600002
4917	ZrTe	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	2	0.707107	0.629961	0.574349	0.552045	0.535887
4918	ZrTi2O	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.25	...	0	0	0	0	3	0.612372	0.538609	0.506099	0.501109	0.500098
4919	ZrTiF6	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	3	0.770552	0.752308	0.750039	0.750001	0.750000
4920	ZrW2	True	0.0	0	0.0	0.0	0.00	0.0	0.0	0.00	...	0	0	0	0	2	0.745356	0.693361	0.670782	0.667408	0.666732

	Description	Value
0	session_id	123
1	Target	is_metal
2	Target Type	Binary
3	Label Encoded	False: 0, True: 1
4	Original Data	(4921, 110)
5	Missing Values	False
6	Numeric Features	85
7	Categorical Features	24
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(4428, 109)
12	Transformed Test Set	(493, 109)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	5
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	clf-default-name
21	USI	17db
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
lightgbm	Light Gradient Boosting Machine	0.9063	0.9644	0.9003	0.9118	0.9058	0.8126	0.8129	0.236
et	Extra Trees Classifier	0.9006	0.9638	0.8867	0.9127	0.8993	0.8013	0.8020	0.702
rf	Random Forest Classifier	0.8970	0.9598	0.8822	0.9096	0.8956	0.7941	0.7946	0.718
dt	Decision Tree Classifier	0.8740	0.8740	0.8637	0.8825	0.8727	0.7480	0.7487	0.090
lda	Linear Discriminant Analysis	0.8726	0.9383	0.8606	0.8825	0.8712	0.7453	0.7458	0.126
knn	K Neighbors Classifier	0.8713	0.9289	0.8538	0.8853	0.8691	0.7426	0.7432	0.638
ridge	Ridge Classifier	0.8701	0.0000	0.8601	0.8784	0.8690	0.7403	0.7408	0.026
lr	Logistic Regression	0.8650	0.9331	0.8660	0.8648	0.8652	0.7299	0.7303	0.244
gbc	Gradient Boosting Classifier	0.8629	0.9379	0.8764	0.8540	0.8649	0.7258	0.7264	0.818
ada	Ada Boost Classifier	0.8589	0.9337	0.8642	0.8555	0.8596	0.7177	0.7181	0.350
svm	SVM - Linear Kernel	0.8496	0.0000	0.8949	0.8313	0.8572	0.6991	0.7107	0.108
nb	Naive Bayes	0.7660	0.8791	0.6367	0.8597	0.7308	0.5322	0.5515	0.048
qda	Quadratic Discriminant Analysis	0.5452	0.5756	0.5143	0.7212	0.4123	0.0898	0.1255	0.092
dummy	Dummy Classifier	0.5005	0.5000	1.0000	0.5005	0.6671	0.0000	0.0000	0.036

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
Fold
0	0.8883	0.9587	0.8626	0.9097	0.8855	0.7765	0.7776
1	0.8905	0.9518	0.8939	0.8879	0.8909	0.7810	0.7811
2	0.8962	0.9639	0.8916	0.8998	0.8957	0.7923	0.7924
3	0.9062	0.9622	0.8849	0.9245	0.9043	0.8124	0.8132
4	0.9040	0.9622	0.8781	0.9262	0.9015	0.8079	0.8090
Mean	0.8970	0.9598	0.8822	0.9096	0.8956	0.7941	0.7946
Std	0.0071	0.0043	0.0113	0.0146	0.0068	0.0142	0.0144

	Parameters
alpha	1.0
copy_X	True
fit_intercept	True
max_iter	1000
normalize	False
positive	False
precompute	False
random_state	123
selection	cyclic
tol	0.0001
warm_start	False