The Biggest Challenge of the Big Data Era (Part 1)? - 未名空间 MITBBS Historical Archive


The Biggest Challenge of the Big Data Era (Part 1)? # DataSciences - 数据科学

l*o 2014-08-24 07:08

Post 1

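Challenge 1: The data miner's traps, or "stupid tricks."
Massive data, "big data," and high-frequency data --- Starting from "Big Data," Part 2
"Stupid tricks" is a literal translation. See the following article:
http://bit.ly/StpdDtMnrTrck
The article uses even stronger language, such as "evil data miners," but the
criticism is well aimed, as explained further below. It was published in
The Journal of Investing, 2007, issue 1. You can verify the source yourself:
Leinweber, David J. "Stupid data miner tricks: overfitting the S&P 500."
The Journal of Investing 16.1 (2007): 15-22.
Today, with the "Big Data" craze surging and seemingly everyone needing to learn
machine learning and data mining, the article's arguments remain a wake-up call.
Some excerpts:
"The new data miners pore over large, diffuse sets of raw data trying to
discern patterns that would otherwise go undetected....
<> ...[A] good (and real) example of how data mining can work well, [is]
when it is applied to extracting a simple pattern from a large data set...
<> The dark side of data mining is to pick and choose from a large set of
data to try to explain a small one."
The article raises two main problems. One is fitting (or overfitting) data with
complex models. The other is the pitfall of "needle in a haystack" regression
(using a massive data set to fit a small one). Although its use of data may be
open to the charge of being too simplistic, the article gives a startling
regression example: a 99% correlation between the S&P 500 and the following
three series:
1. Butter production in Bangladesh; 2. US cheese production; 3. The total sheep population of the US and Bangladesh
The author goes on to point out:
"Evil data miners often specialized in "explaining" financial data,
especially the US stock market," as with things like the "superball effect."
Moreover, "When data mining techniques are used to scour a vast selection of
data to explain a small piece of financial market history, the results are
often ridiculous."
I have long felt the same way about problems like this, and reading the article
made everything click. Based on it and other related research, I would like to
propose the first great challenge of the "big data era": how do we avoid blind
data mining, and the traps set (dug) by evil data miners?
I offer this post as a starting point and look forward to discussing it with everyone!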
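
The "needle in a haystack" regression described in this post can be reproduced with pure noise. The following is a minimal sketch, assuming NumPy is available. All series are synthetic random numbers, not the butter, cheese, or sheep data from the article, and the helper names (`fit_ols`, `r_squared`, `mine_predictors`), the sample sizes, and the greedy selection rule are illustrative assumptions, not the article's procedure.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y regressed on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """R^2 of fixed coefficients on (X, y); it can be negative out of sample."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def mine_predictors(candidates, y, k=3):
    """Greedy search: k times, add the column that most improves in-sample R^2."""
    chosen = []
    for _ in range(k):
        def score(j):
            X = candidates[:, chosen + [j]]
            return r_squared(fit_ols(X, y), X, y)
        remaining = [j for j in range(candidates.shape[1]) if j not in chosen]
        chosen.append(max(remaining, key=score))
    return chosen

rng = np.random.default_rng(0)
n_fit, n_future, n_candidates = 12, 200, 2000   # hypothetical sizes

# The "small" target and a large pile of candidates, all independent noise.
y = rng.standard_normal(n_fit + n_future)
candidates = rng.standard_normal((n_fit + n_future, n_candidates))

# Mine the first 12 observations only, then score the same model on later data.
chosen = mine_predictors(candidates[:n_fit], y[:n_fit])
X_fit, X_future = candidates[:n_fit, chosen], candidates[n_fit:, chosen]
coef = fit_ols(X_fit, y[:n_fit])
in_sample_r2 = r_squared(coef, X_fit, y[:n_fit])
out_of_sample_r2 = r_squared(coef, X_future, y[n_fit:])

print(f"chosen columns: {chosen}")
print(f"in-sample R^2: {in_sample_r2:.2f}, out-of-sample R^2: {out_of_sample_r2:.2f}")
```

Because no candidate has any real relationship to the target, whatever fit the search finds on the 12 mined observations is an artifact of the search itself. Scoring the frozen coefficients on observations that played no part in the selection is one simple way to expose that.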

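l*o 2014-08-24 07:08

Post 2

Correction: "superball effect" in post 1 (eighth line from the bottom of the original) should be "super bowl effect." It was a spelling error.
For more on the "super bowl effect," see
http://bit.ly/SprBl_Ind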

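l*o 2014-08-24 07:08

Post 3

Would anyone interested like to read the article and summarize the author's
advice on avoiding both beginner's traps and traps set by others? It may be
useful for beginners and for people preparing for interviews. It may also help
when you or a friend are investing and need to judge a regression backtest that
someone else presents.
http://bit.ly/StpdDtMnrTrck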

g*s 2014-08-24 07:08

Post 4

Thanks for sharing!

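P*6 2014-08-24 07:08

Post 5

Anyone who has done data analysis for a while knows this problem: it is the
problem of high-dimensional variables and/or multiple tests.
There are statistical methods and domain-knowledge methods to detect and
correct for it, if one genuinely takes responsibility for the results.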
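
The post does not name specific statistical corrections. As one possible illustration of a multiple-testing adjustment, the sketch below applies the Bonferroni and Benjamini-Hochberg procedures, assuming NumPy is available. The choice of these two procedures is an assumption, and the p-values are hypothetical numbers made up for the example.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices from smallest to largest p
    thresholds = alpha * np.arange(1, m + 1) / m   # rank-dependent cutoffs
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank under its cutoff
        reject[order[:k + 1]] = True               # reject everything up to that rank
    return reject

# Hypothetical p-values from testing ten candidate predictors against one target.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = int(np.sum(np.asarray(p_values) <= 0.05))
bonferroni_hits = int(bonferroni(p_values).sum())
bh_hits = int(benjamini_hochberg(p_values).sum())
print(naive_hits, bonferroni_hits, bh_hits)
```

The uncorrected 0.05 cutoff flags more predictors than either adjusted procedure does. The gap widens as the number of candidate variables grows.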


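[Quoting l******o's post:]

: Challenge 1: The data miner's traps, or "stupid tricks."
: Massive data, "big data," and high-frequency data --- Starting from "Big Data," Part 2
: "Stupid tricks" is a literal translation. See the following article:
: http://bit.ly/StpdDtMnrTrck
: The article uses even stronger language, such as "evil data miners," but the
: criticism is well aimed, as explained further below. It was published in
: The Journal of Investing, 2007, issue 1. You can verify the source yourself:
: Leinweber, David J. "Stupid data miner tricks: overfitting the S&P 500."
: The Journal of Investing 16.1 (2007): 15-22.
: Today, with the "Big Data" craze surging and seemingly everyone needing to learn
: machine learning and data mining, the article's arguments still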