Artificial intelligence is enjoying a "spring". The 2018 Turing Award, AlphaGo's outstanding performance at Go, and AlphaFold2's breakthrough in protein structure prediction have all fueled scientists' enthusiasm for the field, so much so that machine learning research has broken out of its own circle and gone mainstream. Machine learning gives people a tool that can simulate, and even create, "human intelligence". From biomedicine to political science, researchers increasingly use machine learning to build models from data and make predictions. But the conclusions of many of these studies may be exaggerated, according to researchers at Princeton University in New Jersey, who have sounded the alarm about what they call a "brewing reproducibility crisis" in the field of machine learning.
What Is the Reproducibility Crisis?
It means that findings reported by one research team are difficult for another team to replicate. In recent decades, irreproducible results have surfaced in many fields. In 2016, a paper by Dr. Han Chunyu published in Nature Biotechnology set off a major controversy over reproducibility. In the same year, a survey by Nature reported that more than 70% of researchers had tried and failed to reproduce another scientist's experimental results, and more than half had failed to reproduce their own experiments. Now machine learning faces a reproducibility crisis of its own.
Causes of the Reproducibility Crisis
In 2020, the novel coronavirus swept the world, and people lacked accurate methods of testing and treatment in the face of a surge in patients. Perhaps artificial intelligence could detect the disease earlier from images of the lungs and predict which patients were most likely to become seriously ill. With that hope, hundreds of studies sprang up claiming to show that artificial intelligence could perform these tasks with high accuracy. But a team of researchers at the University of Cambridge in the UK examined more than 400 models and came to a startling conclusion: every one of them had fatal flaws. When the basic logic of the experimental design is in question, how can the results be reproducible? The root cause is that researchers and peer reviewers do not fully understand the machine learning techniques on which modern artificial intelligence is built.
What Is Data Leakage?
Data leakage is the most common problem in applied machine learning. It arises when the dataset used to train a machine learning algorithm contains information about what is to be predicted; in other words, some information from the test data has leaked into the training set. Failing to separate the data to be predicted from the training dataset can produce models that perform "very well" in testing but "terribly" in the real world. Beyond leakage, limited knowledge of machine learning algorithms, insufficient understanding of the research data, and misinterpretation of research results are all factors that may contribute to a reproducibility crisis.
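One common way this happens is preprocessing before splitting: if normalization statistics are computed on the whole dataset, the training data already "knows" something about the test rows. The following is a minimal sketch in plain Python, not a method taken from the studies discussed here. The dataset and the helper names (split_rows, fit_standardizer, standardize) are hypothetical and chosen only for illustration.

```python
import random
import statistics

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardizer(values):
    """Estimate (mean, std) from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # fall back to 1.0 for constant data
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics to any list of values."""
    return [(v - mean) / std for v in values]

# Hypothetical one-feature dataset (assumption, for illustration only).
data = [float(x) for x in range(1, 21)]

# Split first, so the test rows stay unseen during fitting.
train, test = split_rows(data)

# Leak-free: statistics come from the training split alone,
# and the same statistics are then applied to the test split.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky, shown only for contrast: statistics computed on all rows,
# so the test rows influence how the training data is scaled.
leaky_mean, leaky_std = fit_standardizer(data)
```

The same rule applies to any fitted step (imputation, feature selection, oversampling): fit it on the training split, then apply it unchanged to the test split.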
Consequences of the Reproducibility Crisis
The most immediate consequence of the reproducibility crisis is the inability to tell whether an observed phenomenon is real, fabricated, or purely coincidental. The purpose of science is to establish facts as accurately as possible; when true cannot be told from false, can the results of research still be called science? And as people argue over "true and false", a label of fraud quietly attaches to the researchers who publish controversial results, and even to their disciplines, causing a severe crisis of credibility. Consider the Alzheimer's disease research falsification incident that shocked the field of neuroscience last month. Even though the suspect images involve only Aβ*56, one form of Aβ oligomer that is not the mainstream of research in the field, the Aβ hypothesis that dominates Alzheimer's research has still been shaken, and people have even begun to question its findings indiscriminately.
Possible Solutions
People often say that once a problem is identified, it must be solved. Sayash Kapoor, a machine learning researcher at Princeton University, and colleagues describe eight main types of data leakage to watch out for, and the "data checklist" they propose can help researchers detect possible leakage as early as possible. Xiao Liu, a clinical ophthalmologist at the University of Birmingham, UK, developed reporting guidelines for research involving artificial intelligence; the guidelines help regulators judge the quality of a researcher's work. An article in Nature Computational Science pointed out that making code and data publicly available is critical to improving the reproducibility of machine learning research, including the code for training, validating, and testing models and the code for data collection, cleaning, and organizing steps. In addition, when it is unclear during modeling whether a feature is causally related to the target variable or merely correlated with it, further data exploration can help prevent leakage, using tools such as correlation matrix heat maps, feature distribution analysis, and grouped box plots.
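The correlation check mentioned above can be sketched without any plotting library: compute each feature's correlation with the target and flag values that look too good to be true. This is an illustrative sketch, not the checklist proposed by Kapoor and colleagues. The feature names, the data, and the 0.95 threshold are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def flag_suspicious_features(features, target, threshold=0.95):
    """Return {name: r} for features whose |r| with the target meets the threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged

# Hypothetical data (assumption): 1 = patient became seriously ill.
target = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "age": [34, 51, 47, 45, 62, 38, 55, 41],
    # Recorded after the outcome was known, so it mirrors the target.
    "discharge_code": [0, 0, 0, 1, 1, 1, 0, 1],
}

flagged = flag_suspicious_features(features, target)
```

A flagged feature is not proof of leakage, only a prompt to ask when and how that column was recorded relative to the outcome. Correlation alone also misses nonlinear or multi-column leaks, which is why the distribution and box-plot checks remain useful.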
Summary
From the past into the future, machine learning has been adopted across many scientific communities for its predictive power. Even though the reproducibility crisis casts a cloud over it, as researchers continue to propose responses to the crisis, machine learning is likely to remain a "hot" tool in the scientific community.
References:
1. https://www.nature.com/articles/d41586-022-02035-w
2. https://www.hpcwire.com/2019/02/19/machine-learning-reproducability-crisis-science/
3. https://blogs.nvidia.com/blog/2019/03/27/how-ai-machine-learning-are-advancing-academic-research/
4. https://www.statnews.com/2021/06/02/machine-learning-ai-methodology-research-flaws/
5. https://thegradient.pub/independently-reproducible-machine-learning/
6. https://machinelearningmastery.com/data-preparation-without-data-leakage/
7. https://blog.csdn.net/lomodays207/article/details/87607569
8. https://zhuanlan.zhihu.com/p/246482947
9. https://new.qq.com/omn/20220802/20220802A07Z3V00.html
10. https://www.nature.com/articles/s43588-021-00152-6