Data Analysis
Data filtering
Given the diversity of possible materials synthesis and characterization experiments, and the variations in procedure among practitioners, not all data are equal in relevance or validity. It is therefore often valuable to filter the data. Some filtering serves expressly to include only data obtained under near-identical protocols, setting the rest aside as not relevant for the purpose at hand. Other filtering may exclude or lower the priority of data obtained by different approaches, e.g. reducing the influence on the resulting models of experiments that used relatively outdated equipment or methodology. In principle, filtering might also include more subjective criteria, e.g. the established reputation or experience of the researcher.
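The two kinds of filtering described here, strict inclusion by protocol and down-weighting of older work, can be illustrated with a minimal Python sketch. The record fields (protocol, year, value), the cutoff year, and the weight of 0.25 are hypothetical choices made only for illustration; a real database would define its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    protocol: str   # hypothetical label for the synthesis protocol
    year: int       # hypothetical proxy for equipment or method vintage
    value: float    # measured performance metric


def filter_and_weight(experiments, protocol, cutoff_year, old_weight=0.25):
    """Keep experiments matching `protocol`; down-weight those before `cutoff_year`."""
    weighted = []
    for exp in experiments:
        if exp.protocol != protocol:
            continue  # hard filter: excluded as not relevant
        # soft filter: older experiments count less, but are not discarded
        weight = 1.0 if exp.year >= cutoff_year else old_weight
        weighted.append((exp, weight))
    return weighted


def weighted_mean(weighted):
    """Weighted average of the metric over the surviving experiments."""
    total = sum(w for _, w in weighted)
    if total == 0:
        raise ValueError("no experiments survived filtering")
    return sum(exp.value * w for exp, w in weighted) / total


# Made-up records, purely illustrative
experiments = [
    Experiment("sol-gel", 2020, 10.0),
    Experiment("sol-gel", 1995, 20.0),
    Experiment("sputtering", 2021, 99.0),
]
kept = filter_and_weight(experiments, "sol-gel", cutoff_year=2010)
result = weighted_mean(kept)
```

Keeping the weight alongside each record, rather than deleting down-weighted records, leaves the prioritization visible and adjustable when the models are later built.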
While such judgments may be difficult, it seems clear that they are important. Not all experiments, or for that matter models and theoretical calculations, are equal. And even if the integrity and validity of all experiments were perfect, materials science involves so many permutations of process and measurement that identifying the relevant databases for comparison can be an onerous task.
Data mining
Given a filtered database, model generation and knowledge extraction depend on how well the relationships in the database, whether obvious or hidden, can be explicitly identified. One can consider this a data mining exercise, which can follow several styles.
In the most straightforward case, materials synthesis is viewed as achieving specific material performance metrics (dependent variables) as a function of suitably chosen source materials and processing conditions (independent variables). Often the experimentalist has a clear idea about both sets of variables and anticipates that deriving the relationship between them will yield a valuable model of materials system behavior. In this case, informatics approaches such as principal component analysis (PCA), neural networks, or other techniques can be used to extract these relationships. Because the problem is typically multivariate (multiple inputs and outputs), systematic methodologies are needed to infer the relationships and represent them adequately as well-defined models.
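As one illustration of the techniques named above, the following sketch extracts the first principal component of a small data table using power iteration on the covariance matrix. It is a simplified stand-in for a full PCA routine, written with the standard library only; the two columns (a processing variable and a performance metric) and their values are invented for the example. In practice, variables are usually standardized before PCA and a tested numerical library would be preferred.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a table given as a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(rows, iterations=200):
    """Return (unit direction, fraction of total variance explained).

    Uses power iteration; the fixed starting vector is an arbitrary choice
    and could fail if it happens to be orthogonal to the dominant direction.
    """
    cov = covariance(rows)
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # zero covariance: no direction to find
        v = [c / norm for c in w]
    # Rayleigh quotient gives the variance along v (v has unit length)
    variance_along_v = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    explained = variance_along_v / total_variance if total_variance else 0.0
    return v, explained


# Invented data: column 0 = processing variable, column 1 = performance metric
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, explained = first_principal_component(rows)
```

A first component that explains nearly all of the variance suggests the variables move together along a single direction, which is the kind of relationship a subsequent model would then try to represent explicitly.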
More generally, materials databases often include parameters that are potentially relevant to understanding but are neither directly controllable input variables nor performance output metrics. Advanced informatics techniques can address these situations as well, revealing correlations and relationships between variables even without prior knowledge of which are independent (input) and which are dependent (output). A major benefit of identifying such relationships is that they may convey new fundamental materials insights and lead to experiments designed to confirm or refute the mechanisms postulated to explain them.
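A very simple form of such undirected analysis is to compute the correlation between every pair of variables without labeling any of them as input or output. The sketch below does this with the Pearson coefficient; the column names, values, and the 0.9 threshold are hypothetical. It captures only linear, pairwise relationships, so it should be read as a starting point and not as a representative of the advanced techniques mentioned above.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return sxy / math.sqrt(sxx * syy)


def strong_pairs(table, threshold=0.9):
    """List (name_a, name_b, r) for every variable pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Invented columns; none is designated as input or output
table = {
    "grain_size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hardness": [10.0, 8.0, 6.0, 4.0, 2.0],
    "humidity": [3.0, 1.0, 4.0, 1.0, 5.0],
}
found = strong_pairs(table)
```

A pair flagged in this way is only a candidate relationship. As the text notes, it is the follow-up experiments that confirm or refute any mechanism proposed to explain it.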