Data View Explanations

This article addresses common questions about Data Views.
  • Q: What does the Populate with Data button do?
  • A: The Populate with Data button uses a machine learning model to fill in missing values in the output properties of the data set.
  • Q: When I make a prediction, why does it return a distribution?
  • A: Citrination uses cutting-edge uncertainty quantification to assess model uncertainty, so each prediction is returned as a distribution (an expected value plus an uncertainty estimate) rather than a single number.  These uncertainty estimates are discussed in detail in this paper.
  • Q: What can I use Design for?
  • A: Design can be used to solve the inverse design problem of determining the inputs that match specifications for the output properties.  For example, Design can be used to suggest alloy compositions that might match specifications for mechanical properties.  
  • Q:  Can I keep my Data View private?  Can I share it?
  • A: By default, the public Citrination platform keeps your Data Views private, so other users cannot access them.  To share a Data View with other users, click the Access tab.  From there, you can either make the Data View public so that all users can see it, or share it with specific groups or teams.  To create a new group to share your Data View with, click your user name in the top right of the browser window and click the Groups item in the menu.
  • Q: On the Model Reports tab, what is the difference between the Pearson Correlation and Distance Correlation?
  • A: The Pearson correlation measures the strength of the linear relationship between two variables, while the distance correlation also detects nonlinear relationships (a distance correlation of zero implies statistical independence).  More information about the Pearson correlation and distance correlation can be found on their respective Wikipedia pages, here and here.
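    For example, a minimal sketch using NumPy and SciPy (the data and the hand-rolled distance_correlation helper are illustrative only, not platform code):

        import numpy as np
        from scipy.spatial.distance import pdist, squareform
        from scipy.stats import pearsonr

        def distance_correlation(x, y):
            # Double-centered pairwise distance matrices
            a = squareform(pdist(x.reshape(-1, 1)))
            b = squareform(pdist(y.reshape(-1, 1)))
            A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
            B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
            dcov2 = (A * B).mean()
            return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

        x = np.linspace(-1.0, 1.0, 201)
        y = x ** 2  # purely nonlinear dependence
        print(pearsonr(x, y)[0])           # ~0: Pearson misses the relationship
        print(distance_correlation(x, y))  # clearly nonzero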
  • Q: On the Model Reports tab, what is the t-SNE projection?
  • A: The t-SNE projection is a dimensionality-reduction technique that can be used to visualize high-dimensional data.  Like Principal Component Analysis, it projects data down to a smaller number of dimensions.  t-SNE is useful because it preserves neighborhoods; points that are near each other in the original high-dimensional space should still be near each other after the t-SNE projection.  This property makes t-SNE useful for visualizing clusters within the data set.  More information on t-SNE can be found here.
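    A minimal sketch using scikit-learn's TSNE on a stock data set (the iris data and parameter choices are illustrative only):

        import matplotlib.pyplot as plt
        from sklearn.datasets import load_iris
        from sklearn.manifold import TSNE

        X, y = load_iris(return_X_y=True)  # 4 input features per sample
        X2 = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
        plt.scatter(X2[:, 0], X2[:, 1], c=y)  # nearby points stay together
        plt.title("t-SNE projection")
        plt.show()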
  • Q: On the Model Reports tab, what are the Important Features?
  • A: The Important Features are the input features that the machine learning model depends on most strongly.  The feature importances can be useful in gaining some insight into how the machine learning model is making its predictions.
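    As a rough analogue (a sketch using scikit-learn's random-forest feature importances on a stock data set; this is not the platform's actual model):

        from sklearn.datasets import load_diabetes
        from sklearn.ensemble import RandomForestRegressor

        data = load_diabetes()
        model = RandomForestRegressor(random_state=0).fit(data.data, data.target)
        ranked = sorted(zip(data.feature_names, model.feature_importances_),
                        key=lambda pair: -pair[1])
        for name, importance in ranked:
            print(f"{name}: {importance:.3f}")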
  • Q: How come some of the Important Features are not input features that I selected for this model?
  • A: When a chemical formula is used as an input to the model, the Citrination platform automatically parses it and generates dozens of derived features to use as additional model inputs.
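    Citrination's featurization is internal to the platform, but as a loose open-source analogue (an assumption, not the platform's code), the matminer library derives composition-based features in a similar spirit:

        from matminer.featurizers.composition import ElementProperty
        from pymatgen.core import Composition

        featurizer = ElementProperty.from_preset("magpie")     # elemental-property statistics
        features = featurizer.featurize(Composition("Fe2O3"))  # many features from one formula
        for label, value in list(zip(featurizer.feature_labels(), features))[:5]:
            print(label, value)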
  • Q: What is the RMSE? What is the GTME?
  • A: The RMSE is the Root Mean Squared Error, a useful statistical metric for model quality; lower RMSE means the model is more accurate.  The GTME is the Guess The Mean Error: the error that a completely uninformative model would have, i.e. the error of a model that always guesses the mean value of the training data set (this equals the standard deviation of the output values).  The ratio RMSE/GTME is a useful non-dimensional model quality metric.  A value of RMSE/GTME = 0 indicates a perfect model; if RMSE/GTME = 1, the model is uninformative.
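    In code, a minimal NumPy sketch with made-up numbers:

        import numpy as np

        actual = np.array([1.0, 2.5, 3.0, 4.5, 5.0])     # measured outputs (made up)
        predicted = np.array([1.2, 2.3, 3.4, 4.1, 5.2])  # model predictions (made up)

        rmse = np.sqrt(np.mean((predicted - actual) ** 2))
        gtme = np.sqrt(np.mean((actual - actual.mean()) ** 2))  # error of always guessing the mean
        print(f"RMSE = {rmse:.3f}, GTME = {gtme:.3f}, RMSE/GTME = {rmse / gtme:.3f}")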
  • Q: What is a good value of RMSE/GTME?
  • A: This depends on what you would like to use the model for.  Generally, I would say that a value of RMSE/GTME > 0.9 indicates a model with very high uncertainty.  If 0.6 < RMSE/GTME < 0.9, the model is typically a good candidate for experimental design.  Lower values of RMSE/GTME indicate increasingly accurate models.
  • Q: What are MEI and MLI Scores?
  • A: The MEI (Maximum Expected Improvement) strategy simply selects the un-measured candidate with the highest (or lowest, for minimization) predicted target value.

    The MEI Score is normalized from 0 to 100 and corresponds to the likelihood that a candidate will meet your objective (minimization or maximization) and satisfy any constraints.

    The MLI (Maximum Likelihood of Improvement) strategy selects the un-measured candidate that is the most likely to have a higher (or lower, for minimization) target value than the best previously measured material. This strategy incorporates uncertainty estimates in its prediction.

    The MLI Score, normalized between 0 and 100, indicates the percent likelihood that a given candidate will out-perform all the materials in the training data in terms of optimizing the target and satisfying constraints.  This score takes into account both the predicted output values and their uncertainties, so it also surfaces suggested experiments that explore new regions of design space (regions with high uncertainty).  Candidates with a known value for the target output will have a Suggested Experiments score of 0.

    MEI and MLI scores are explained in more detail in this paper.
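    As a rough illustration of the two strategies (not the platform's scoring code; the numbers and the normal-distribution assumption are hypothetical):

        import numpy as np
        from scipy.stats import norm

        mu = np.array([310.0, 295.0, 305.0])  # predicted target values (hypothetical)
        sigma = np.array([15.0, 20.0, 1.0])   # uncertainty estimates (hypothetical)
        best_measured = 300.0                 # best value in the training data

        # MEI-style ranking: order by predicted value alone (maximization).
        mei_order = np.argsort(-mu)

        # MLI-style ranking: order by the probability of beating the best
        # measured value, which folds in the uncertainty estimates.
        p_improve = norm.sf((best_measured - mu) / sigma)  # P(y > best_measured)
        mli_order = np.argsort(-p_improve)

        print("MEI order:", mei_order)                # [0 2 1]
        print("P(improvement):", p_improve.round(3))  # [0.748 0.401 1.   ]
        print("MLI order:", mli_order)                # [2 0 1]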
  • Q: What is the Predicted versus Actual plot?
  • A: The Predicted versus Actual plot is generated via cross-validation.  It compares the model predictions to the actual output values in the training set.  A perfect model would have predictions that lie on the line y = x.  The Supported and Unsupported labels indicate whether a given point is an extrapolation in the input feature space.
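    A minimal sketch of how such a plot can be built with scikit-learn's cross_val_predict (stock data set and model, for illustration only):

        import matplotlib.pyplot as plt
        from sklearn.datasets import load_diabetes
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_predict

        X, y = load_diabetes(return_X_y=True)
        predicted = cross_val_predict(Ridge(), X, y, cv=5)  # out-of-fold predictions

        plt.scatter(y, predicted, s=10)
        lims = [y.min(), y.max()]
        plt.plot(lims, lims)  # the y = x line a perfect model would follow
        plt.xlabel("Actual")
        plt.ylabel("Predicted")
        plt.show()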
  • Q: What is the Distribution of Standard Residuals plot?
  • A: The standard residuals are given by (Predicted - Actual)/(Uncertainty Estimate).  This plot is useful for assessing whether the uncertainty estimates are well-calibrated for this model.  In the case of perfectly calibrated uncertainty estimates, this distribution would be expected to follow a standard normal distribution (mean 0, standard deviation 1).
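    A minimal sketch of how such a plot is assembled (synthetic numbers, for illustration only):

        import matplotlib.pyplot as plt
        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        predicted = rng.normal(size=500)                      # synthetic predictions
        actual = predicted + rng.normal(scale=1.0, size=500)  # synthetic measurements
        uncertainty = np.ones(500)                            # per-point uncertainty estimates

        z = (predicted - actual) / uncertainty  # standard residuals
        plt.hist(z, bins=30, density=True)
        xs = np.linspace(-4, 4, 200)
        plt.plot(xs, norm.pdf(xs))  # N(0, 1) reference for the well-calibrated case
        plt.xlabel("Standard residual")
        plt.show()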
