Machine Learning Tools and Scripts

| Script Name | Script Description | Download | Related Chronicle |
| --- | --- | --- | --- |
| CNN_Loop.py | The script converts 3D coordinates of bottlebrush polymer (BBP) structures into binary voxels and uses a convolutional neural network to classify these structures into their respective shapes. The feature vectors generated for the different structures are also used to quantify the similarities between BBP shapes. Training data is available in Project GLDP000007, named "BBP shapes CNN training data". | Login to download | GLDP000007 |
| GlycoDeNovo | GlycoDeNovo: an efficient algorithm for accurate de novo glycan topology reconstruction from tandem mass spectra. More information at https://www.cs.brandeis.edu/~hong/Research/GlycoDeNovo/GlycoDeNovo.htm | Login to download | GLDP000004 |
| SHAP_for_SEML | This script preprocesses material data, converting and standardizing the relevant features, then trains a stacked regression model (SEML) built from multiple regressors to predict the unstable stacking fault energy from the composition features of the materials. The model is evaluated with the R² score on test data to measure its predictive accuracy. Finally, SHAP values are computed and saved to interpret the influence of each feature on the predictions, and a violin plot visualizes these contributions. | Login to download | N/A |
| ML Models for enantioselectivity prediction | These scripts train different ML regression models to predict the enantioselectivity of the reaction from features of the imine, nucleophile, catalyst, and solvent. The models use the following algorithms: 1) Least absolute shrinkage and selection operator (LASSO); 2) Decision Tree; 3) Random Forest; 4) Gradient Boosting; 5) Support Vector Regression. | Login to download | GLDP000096 |
| Data Preprocessing for enantioselectivity | This script preprocesses the data on catalysts, nucleophiles, solvents, and iminiums from an Excel file and shows a histogram of enantioselectivity for the reactions. | Login to download | GLDP000096 |
| GMM_plot.py | This script trains Gaussian Mixture Models with varying numbers of components. For each model, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) scores are used to determine the optimal number of components by balancing model complexity against fit quality. Additionally, parity plots are provided for the performance of the different ML models on the testing dataset. | Login to download | GLDP000096 |
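
The CNN_Loop.py entry describes converting 3D coordinates into binary voxels before classification. The following is a minimal sketch of that conversion step only, not the script's actual code: it assumes a cubic occupancy grid scaled to the structure's bounding box. The helper name `voxelize`, the grid size, and the sample coordinates are all hypothetical.

```python
def voxelize(coords, grid_size=8):
    """Map 3D points onto a grid_size^3 binary occupancy grid.

    Hypothetical helper for illustration; the real script may bin differently.
    """
    if not coords:
        raise ValueError("coords must contain at least one point")
    # Bounding box of the structure along each axis.
    mins = [min(p[a] for p in coords) for a in range(3)]
    maxs = [max(p[a] for p in coords) for a in range(3)]
    # One common scale for all axes so the shape is not distorted;
    # fall back to 1.0 if every point coincides.
    extent = max(maxs[a] - mins[a] for a in range(3)) or 1.0
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for p in coords:
        idx = []
        for a in range(3):
            frac = (p[a] - mins[a]) / extent  # position in [0, 1]
            # Clamp so a point on the upper boundary lands in the last voxel.
            idx.append(min(int(frac * grid_size), grid_size - 1))
        i, j, k = idx
        grid[i][j][k] = 1  # binary: occupied or not
    return grid


# Made-up coordinates, not taken from the GLDP000007 training data.
beads = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.2), (2.0, 1.0, 0.4), (4.0, 2.0, 0.8)]
grid = voxelize(beads, grid_size=4)
occupied = sum(v for plane in grid for row in plane for v in row)
```

A grid like this can be fed to a 3D convolutional network as a single-channel volume. A coarser `grid_size` merges nearby points into one voxel, while a finer one preserves detail at a higher memory cost.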
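
The GMM_plot.py entry mentions choosing the number of mixture components with AIC and BIC. As a hedged illustration of how the two criteria trade fit against complexity, the sketch below applies the standard definitions AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L, with a parameter count for a full-covariance Gaussian mixture. The log-likelihood values are placeholders invented for the example, and all function names are assumptions; in practice the values would come from the fitted models.

```python
import math


def gmm_param_count(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture (assumed model form)."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    weights = n_components - 1  # mixture weights sum to 1
    return means + covariances + weights


def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Placeholder numbers for illustration only, not results from GMM_plot.py.
n_samples, n_features = 500, 2
total_log_likelihood = {1: -1900.0, 2: -1650.0, 3: -1640.0, 4: -1636.0}

scores = {}
for k, ll in total_log_likelihood.items():
    p = gmm_param_count(k, n_features)
    scores[k] = (aic(ll, p), bic(ll, p, n_samples))

# Pick the component count that minimizes each criterion.
best_aic = min(scores, key=lambda k: scores[k][0])
best_bic = min(scores, key=lambda k: scores[k][1])
```

With these placeholder values, AIC favors three components and BIC favors two. The difference arises because the BIC penalty grows with ln n, so it tends to select the simpler model when extra components improve the fit only slightly.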