npj Comput. Mater.: Cluster-based material descriptors enable extrapolation in machine learning
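Thermoelectric materials convert heat into electricity. Beyond their broad application prospects in energy harvesting, thermoelectric cooling, and thermoelectric generators, they offer an environmentally friendly way to handle industrial waste heat and can even power wearable devices from body heat.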
Fig. 1 Interpolation results of XGB that was the best prediction model in the interpolation problems to predict the thermoelectric properties of the 5205 observations in the ESTM dataset.
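Existing thermoelectric materials are mainly alloys and the doped systems derived from them. There are currently two approaches to predicting thermoelectric materials theoretically: conventional quantum-mechanical calculations, typified by density functional theory (DFT), and machine learning based on data mining or data-driven discovery. Because the computational cost of DFT grows steeply with system size (roughly cubically in the number of electrons for conventional implementations), it is difficult to apply to property prediction for doped materials, which require large supercells.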
Fig. 2 The overall process of SIMD to generate the material representations for an input tabular data of the materials.
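Machine learning has already produced many reports of predicting physical properties from chemical composition, but the field lacks public datasets and also suffers from an extrapolation problem: a trained model may fit the relationship between chemical composition and thermoelectric properties well for the materials in the dataset, yet its accuracy drops sharply when it is applied to completely unknown materials. For thermoelectric materials, the main cause of the extrapolation problem is an inadequate description of the chemical composition, for example one that ignores the interaction between the host and its dopants.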
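The difference between interpolation and extrapolation testing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the records, group labels, and ZT values are invented, and `GroupMeanModel` is a deliberately naive, hypothetical stand-in for a model that has only memorized the material groups it was trained on.

```python
import random
from collections import defaultdict


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def random_split(records, test_fraction=0.25, seed=0):
    """Interpolation-style split: test rows are drawn at random."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def group_holdout_split(records, held_out_groups):
    """Extrapolation-style split: whole material groups are withheld."""
    train = [r for r in records if r["group"] not in held_out_groups]
    test = [r for r in records if r["group"] in held_out_groups]
    return train, test


class GroupMeanModel:
    """Naive stand-in model: remembers the mean ZT of each training group."""

    def fit(self, records):
        by_group = defaultdict(list)
        for r in records:
            by_group[r["group"]].append(r["zt"])
        self.group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}
        all_zt = [r["zt"] for r in records]
        self.global_mean = sum(all_zt) / len(all_zt)
        return self

    def predict(self, records):
        # Groups never seen in training fall back to the global training mean.
        return [self.group_mean.get(r["group"], self.global_mean) for r in records]


# Toy records with invented values (NOT taken from the ESTM dataset).
toy_values = {
    "A": [0.90, 0.95, 1.00, 1.05],
    "B": [1.40, 1.45, 1.50, 1.55],
    "C": [0.20, 0.25, 0.30, 0.35],
}
records = [{"group": g, "zt": zt} for g, zts in toy_values.items() for zt in zts]

# Interpolation: every test row belongs to a group that is also in training.
train_i, test_i = random_split(records)
model_i = GroupMeanModel().fit(train_i)
r2_interpolation = r2_score([r["zt"] for r in test_i], model_i.predict(test_i))

# Extrapolation: group "C" is completely absent from training.
train_e, test_e = group_holdout_split(records, held_out_groups={"C"})
model_e = GroupMeanModel().fit(train_e)
r2_extrapolation = r2_score([r["zt"] for r in test_e], model_e.predict(test_e))

print("interpolation R2:", r2_interpolation)
print("extrapolation R2:", r2_extrapolation)
```

With a random split, the held-out rows share their group with training rows, so even a memorizing model can look accurate. Withholding an entire group removes that shortcut, which is the setting in which the drop in accuracy described above appears.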
Fig. 3 The overall process of SIMD to generate the system identified features of the input chemical composition in the transfer learning environments.
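The team of Gyoung S. Na and Hyunju Chang at the Korea Research Institute of Chemical Technology (KRICT) has made innovative contributions on all three fronts: public datasets, the application of machine learning, and the extrapolation problem. They first built a public dataset (ESTM) of 5205 experimental observations, covering 880 unique thermoelectric materials and five experimentally measured thermoelectric properties, including the figure of merit (ZT).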
Fig. 4 Confusion matrices of XGBd and SXGBd in the high-throughput screening to discover high-ZT (≥1.5) thermoelectric materials from unknown material groups.
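They then compared the predictions of five machine learning algorithms and found that XGB achieved R2 values above 0.9 in predicting four of the thermoelectric properties, but that extrapolation performance was poor (R2 below 0.2). To address this, they proposed a new material descriptor: materials in the dataset that have similar host compositions but different dopants are identified and grouped into a cluster, and the relevant physical and chemical information is extracted to form a system-identified material descriptor (SIMD), which serves as the input to the machine learning model. With this descriptor, false positives in high-throughput screening fell by more than 50%, and the R2 for extrapolative prediction of ZT for thermoelectric materials not seen during training rose markedly from 0.13 to 0.71.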
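The clustering idea behind the descriptor can be sketched as follows. This is a rough illustration under stated assumptions, not the SIMD algorithm from the paper: the dopant cutoff `DOPANT_THRESHOLD`, the helper names, the choice of cluster statistics (observation count and mean ZT), and all ZT values are hypothetical. Only the Ag- and Ti-doped Bi0.5Sb1.5Te3 compositions echo the materials named in the Fig. 6 caption.

```python
from collections import defaultdict

DOPANT_THRESHOLD = 0.05  # hypothetical cutoff on atomic fraction


def atomic_fractions(composition):
    """Normalize raw stoichiometric amounts to atomic fractions."""
    total = sum(composition.values())
    return {el: amount / total for el, amount in composition.items()}


def host_key(composition, threshold=DOPANT_THRESHOLD):
    """Identify the host system: elements at or above the threshold."""
    fractions = atomic_fractions(composition)
    return tuple(sorted(el for el, f in fractions.items() if f >= threshold))


def build_clusters(training_set):
    """Group training observations by host system, ignoring dopants."""
    clusters = defaultdict(list)
    for composition, zt in training_set:
        clusters[host_key(composition)].append(zt)
    return dict(clusters)


def cluster_features(composition, clusters):
    """Cluster-level statistics: [number of observations, mean ZT]."""
    values = clusters.get(host_key(composition), [])
    if not values:
        return [0.0, 0.0]  # placeholder for a system absent from training
    return [float(len(values)), sum(values) / len(values)]


def describe(composition, clusters, elements):
    """Concatenate composition fractions with the cluster-level features."""
    fractions = atomic_fractions(composition)
    composition_part = [fractions.get(el, 0.0) for el in elements]
    return composition_part + cluster_features(composition, clusters)


# Toy training set; the ZT values are placeholders, not measurements.
training_set = [
    ({"Bi": 0.5, "Sb": 1.5, "Te": 3.0, "Ag": 0.01}, 1.0),
    ({"Bi": 0.5, "Sb": 1.5, "Te": 3.0, "Ti": 0.02}, 1.5),
    ({"Pb": 1.0, "Te": 1.0, "Na": 0.02}, 0.5),
]
clusters = build_clusters(training_set)
elements = ["Ag", "Bi", "Cu", "Na", "Pb", "Sb", "Te", "Ti"]

# A dopant never seen in training (Cu) still maps to the known Bi-Sb-Te cluster.
query = {"Bi": 0.5, "Sb": 1.5, "Te": 3.0, "Cu": 0.01}
descriptor = describe(query, clusters, elements)
print(host_key(query), descriptor[-2:])
```

The point of the sketch is that the two Bi-Sb-Te entries land in one cluster despite their different dopants, so a query with a new dopant inherits information from its host system instead of being treated as an unrelated composition. The zero-vector fallback for an unseen host is only a placeholder; how the paper handles completely new systems is not reproduced here.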
Fig. 6 Experimentally measured and predicted ZTs of Ag- and Ti-doped Bi0.5Sb1.5Te3 materials.
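The study shows that, in a prediction model produced by machine learning, the more accurately the input is described, the closer the model comes to physical reality and the more reliable its extrapolations become. Besides improving prediction accuracy for unknown materials and the efficiency of high-throughput searches of materials space, the descriptor is a good example of combining cluster analysis with machine learning.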
The paper was recently published in npj Computational Materials 8:214 (2022), and its PDF is freely available. The English title and abstract follow.
A public database of thermoelectric materials and system-identified material representation for data-driven discovery
Gyoung S. Na & Hyunju Chang
Thermoelectric materials have received much attention as energy harvesting devices and power generators. However, discovering novel high-performance thermoelectric materials is challenging due to the structural diversity and complexity of the thermoelectric materials containing alloys and dopants. For the efficient data-driven discovery of novel thermoelectric materials, we constructed a public dataset that contains experimentally synthesized thermoelectric materials and their experimental thermoelectric properties. For the collected dataset, we were able to construct prediction models that achieved R2-scores greater than 0.9 in the regression problems to predict the experimentally measured thermoelectric properties from the chemical compositions of the materials. Furthermore, we devised a material descriptor for the chemical compositions of the materials to improve the extrapolation capabilities of machine learning methods. Based on transfer learning with the proposed material descriptor, we significantly improved the R2-score from 0.13 to 0.71 in predicting experimental ZTs of the materials from completely unexplored material groups.