The Economist, Finance || How machine learning is revolutionising market intelligence
1
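Introduction
Thanks to May Li, who made the mind map
May Li: chasing the sun in my heart; preparing for the Peking University clinical psychology exam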
2
Listening | Close reading | Translation | Phrases
A river needs a dam
The English text is taken from the Finance and economics section of The Economist, 23 November 2019.
How machine learning is revolutionising market intelligence
The Thames seems to draw people who work on intelligence-gathering. The spooks of MI6 are housed in a funky-looking building overlooking the river. Two miles downstream, in a shared office space near Blackfriars Bridge, lives Arkera, a firm that uses machine-learning technology to sort intelligence from newspapers, websites and other public sources for emerging-market investors. Its location is happenstance. London has the right time zone, between the Americas and Asia. It is a nice place to live. The Thames happens to run through it.
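Translation: The Thames seems to attract intelligence gatherers of every kind. The agents of MI6 are based in a stylish building on its bank. Two miles downstream, in a shared office near Blackfriars Bridge, sits Arkera, a company that uses machine learning to sift information from newspapers, websites and other public sources for emerging-market investors. Its location is a matter of chance: London's time zone sits conveniently between the Americas and Asia, it is a pleasant place to live, and the Thames happens to run through it.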
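Notes:
1. spook: a ghost; informally, a spy or secret agent.
2. MI6: short for Military Intelligence, Section 6, Britain's foreign-intelligence service, formally the Secret Intelligence Service. It gathers intelligence around the world for the British government; its main tasks include countering terrorism, weapons proliferation and threats from instability overseas. By 1995 its headquarters had moved to the building at Vauxhall Cross, beside Vauxhall Bridge on the Thames.
3. funky-looking: different but cool or nice; stylish, distinctive.
4. Blackfriars Bridge: a road bridge over the Thames in London, not to be confused with Blackfriars Railway Bridge; it stands between the railway bridge and Waterloo Bridge. The first bridge on the site, Italianate in style, opened in 1769 as the third Thames crossing in London after London Bridge and Westminster Bridge.
5. Arkera: a London-based artificial-intelligence platform for investment. It provides financial institutions with AI-powered applications that link real-world events and news directly to investment products. Company website: https://www.arkera.ai/
6. happenstance: an event that might have been arranged although it was really accidental; a chance event.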
Arkera’s founders, Nav Gupta and Vinit Sahni, both have a background in “macro” hedge funds, the sort that like to bet on big moves in currencies and bond and stock prices ahead of predicted changes in the political climate. The firm’s clients might want a steer on the political risks affecting public finances in Brazil, or to gauge the social pressures that could arise as a consequence of an austerity programme in Egypt. It applies machine learning to find market intelligence and make it usable.
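Translation: Arkera's founders, Nav Gupta and Vinit Sahni, both come from "macro" hedge funds, the kind that bet on big moves in currencies, bonds and shares before an expected shift in the political climate. A client might want guidance on the political risks to Brazil's public finances, or want to gauge the social pressures that an austerity programme in Egypt could create. Arkera uses machine learning to dig out market intelligence and make it usable for such clients.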
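Notes:
1. Steer: as a verb, to be a guiding force, as with directions or advice; to drive or take the helm. In the text it is a noun meaning a piece of guidance.
2. Austerity: a policy of fiscal tightening.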
For many people, the use of such technologies in finance is the stuff of dystopian science fiction, of machines running amok. But once you look at market intelligence through the eyes of computer science, it provokes disquieting thoughts of a different kind. It gives a sense of just how creaky and haphazard the old-school, analogue business of intelligence-gathering has been.
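Translation: For most people, using such technologies in finance sounds like the plot of dystopian science fiction, the work of machines run wild. But once you look at market intelligence from a computer scientist's point of view, a different worry surfaces: just how creaky and haphazard the old-fashioned business of intelligence-gathering has been.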
Notes:
1. Dystopian: adj., relating to (the idea of) a society in which people do not work well with each other and are not happy; full of ugliness and misery. Dystopian fiction typically depicts runaway technology that raises living standards on the surface while masking a hollow inner life, so that people in a highly advanced technological society are not truly free.
2. Creaky: adj., used to describe something that is old-fashioned and no longer effective.
3. Haphazard: adj., lacking order or purpose; not planned.
Analysts have used text data to try to predict changes in asset prices for a century or more. In 1933 Alfred Cowles, an economist whose grandfather had founded the Chicago Tribune, published a pioneering paper in this vein. Cowles sorted stock market commentary by William Peter Hamilton, a long-ruling editor of the Wall Street Journal, into three buckets (bullish, bearish or doubtful) and attached an action to each (buy, sell or avoid). He concluded that investors would have done better simply to buy and hold the leading stocks in the Dow Jones index than to follow Hamilton’s steer.
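Translation: Analysts have used text data to predict changes in asset prices for a century or more. In 1933 Alfred Cowles, an economist whose grandfather founded the Chicago Tribune, published a pioneering paper on the subject. Cowles sorted the stock market commentary of William Peter Hamilton, a long-serving editor of the Wall Street Journal, into three categories (bullish, bearish or doubtful) and paired each with an action (buy, sell or stay out). His conclusion: investors would have done better buying and holding the leading stocks in the Dow Jones index than following Hamilton's guidance.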
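Notes:
Alfred Cowles: in 1933 Cowles published "Can Stock Market Forecasters Forecast?", possibly the first published work to test statistically whether experts can "beat the market". He analysed 7,500 individual stock recommendations made by 16 financial institutions between 1928 and 1932, compared the distribution of the forecasters' actual returns with that of portfolios of randomly chosen stocks, and found no statistically significant evidence that the forecasters outperformed the market.
Author: 石杉orDarren. Source: Xueqiu (雪球). Link:
https://xueqiu.com/8292391239/89609653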
The application of machine-learning models to text-as-data might seem a world away from Cowles’s approach. But in concept, it is similar. The relevant text is sought. Values are ascribed to it. A statistical model is applied. Its predictions are tested for robustness. Of course, with bags of computing power and suites of self-learning models, the enterprise is on a different scale from Cowles’s rudimentary exercise. The endless expanse of the internet means far richer source material. The range of possible values ascribed to it will be broader than “bullish, bearish or doubtful”. And self-learning algorithms can test and retest the combinations that yield the best predictions.
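Translation: Applying machine-learning models to text data may look far removed from Cowles's method, but in principle the two are alike. Relevant text is found, values are assigned to it, a statistical model is applied, and its predictions are tested for robustness. Of course, with today's abundant computing power and families of self-learning models, the undertaking is on a different scale from Cowles's rudimentary exercise. The boundless internet offers far richer source material. The values assigned to text go well beyond "bullish, bearish or doubtful". And self-learning algorithms can test and retest combinations to find those that predict best.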
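In Cowles's three-bucket form, the "ascribe values, apply a rule, test it" sequence fits in a few lines. The sketch below is only an illustration: the numeric positions (+1, -1, 0), the function names and the toy inputs are assumptions made here, not Cowles's data or method in detail.

```python
from math import prod

# Cowles's three buckets, each mapped to a position.
# Assumption of this sketch: buy = +1 (long), sell = -1 (short), avoid = 0.
POSITION = {"bullish": 1, "bearish": -1, "doubtful": 0}


def follow_steer(labels, returns):
    """Compound return from acting on each period's label."""
    if len(labels) != len(returns):
        raise ValueError("labels and returns must have the same length")
    return prod(1 + POSITION[label] * r for label, r in zip(labels, returns)) - 1


def buy_and_hold(returns):
    """Compound return from simply staying invested."""
    return prod(1 + r for r in returns) - 1


# Made-up toy inputs for illustration only.
toy_labels = ["bullish", "doubtful", "bearish", "bullish"]
toy_returns = [0.10, 0.05, 0.10, -0.10]

steer_result = follow_steer(toy_labels, toy_returns)
hold_result = buy_and_hold(toy_returns)
```

Comparing `steer_result` with `hold_result` is the same kind of test Cowles ran by hand: does acting on the commentary beat doing nothing clever at all?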
It is tempting to focus on the black-box elements of all this: the language software that “reads” the source text and the algorithms that use the data to make predictions. But this is like judging a hi-fi system by its speakers. A lot of the important work comes earlier in the process. Arkera, for instance, spends a lot of effort finding all the relevant text and “cleaning” it—stripping it of extraneous junk, such as captions and disclaimers. “A good signal is crucial,” says Mr Gupta.
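Translation: It is tempting to focus on the black-box parts of all this: the language software that "reads" the source text and the algorithms that turn the data into predictions. But that is like judging a hi-fi system by its speakers alone. Much of the important work comes earlier in the process. Arkera, for example, puts great effort into finding all the relevant text and "cleaning" it, stripping out extraneous junk such as captions and disclaimers. "A good signal is crucial," says Mr Gupta.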
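Notes:
1. Black-box: here, the parts of a system whose inner workings the user cannot see. The term also appears in software testing: black-box (functional) testing treats the program as a sealed box and, ignoring its internal structure and logic, checks at the interface that each function behaves as the requirements specification says, accepting input correctly and producing the right output.
2. Extraneous: not directly related; irrelevant.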
Eg: We shall ignore factors extraneous to the problem.
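The "cleaning" step Mr Gupta describes can be pictured as a filter that drops extraneous lines before any model sees the text. The following is a minimal sketch, not Arkera's pipeline: the patterns in `JUNK_PATTERNS` and the function name `clean` are hypothetical, and a real system would need rules tuned to each source.

```python
import re

# Hypothetical markers of extraneous lines (captions, disclaimers).
JUNK_PATTERNS = [
    re.compile(r"^(photo|image|caption)\s*:", re.IGNORECASE),
    re.compile(r"^disclaimer\b", re.IGNORECASE),
]


def clean(raw: str) -> str:
    """Drop blank and junk lines and normalise whitespace in the rest."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        if any(pattern.search(stripped) for pattern in JUNK_PATTERNS):
            continue  # skip captions and disclaimers
        kept.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(kept)


sample = (
    "Pension bill  passes lower house.\n"
    "Photo: Reuters\n"
    "\n"
    "Disclaimer: not investment advice."
)
cleaned = clean(sample)
```

Only the first line of `sample` survives, which is the point: the signal is kept and the packaging around it is thrown away.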
He gives Brazil’s pension reform as an example. The country has 513 parliamentarians. They have social-media accounts, websites and blogs. They speak to the press—Brazil has scores of regional newspapers. All are potential sources of useful data. If you cut corners at this stage you might miss something that even the best statistical model cannot fix later. There is little point in having a cool amplifier and great speakers if the stylus on your record-player is worn out.
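Translation: He gives Brazil's pension reform as an example. The country has 513 members of parliament. They have social-media accounts, personal websites and blogs. They speak through the press, and Brazil has a great many regional newspapers. All of these are potential sources of useful information. Cut corners when collecting and cleaning the data and you may miss something crucial that even the best statistical model cannot repair later. If the stylus on the record-player is worn out, the coolest amplifier and the finest speakers are pointless.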
Note:
Cut corners: to take a shortcut; to do something the quick and easy way.
Any good emerging-market analyst knows this, too. If you bumped into one shortly after Brazil’s elections last year, he was probably on his way to Brasília to sound out prospects for a crucial pension reform. Without it, Brazil’s public debt would be certain to explode, sparking capital flight. In July a pension bill finally passed Brazil’s lower house. Arkera’s models tracked the leanings of Brazil’s politicians to get an early sense of the likely outcome. It would be hard for an analyst working unaided to mimic this reach, even if he was always on the ground and spoke perfect Portuguese.
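Translation: Any good emerging-market analyst knows this too. Had you bumped into one shortly after Brazil's elections last year, he was probably on his way to Brasília to sound out the prospects for a crucial pension reform. Without the reform, Brazil's public debt was sure to balloon, triggering capital flight. In July a pension bill finally passed Brazil's lower house. Arkera's models tracked the leanings of Brazil's politicians to get an early read on the likely result. An analyst working alone would struggle to match that reach, even if he were always on the ground and spoke fluent Portuguese.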
Notes:
1. bump into
Phrasal verb. If you bump into someone you know, you meet them unexpectedly.
Example: I happened to bump into Mervyn Johns in the hallway.
2. sound out
Phrasal verb. If you sound someone out, you question them in order to find out what their opinion is about something.
Example: He is sounding out Middle Eastern governments on ways to resolve the conflict.
3. capital flight
Capital flight, also called capital outflow or capital transfer, is the rapid movement of domestic capital abroad, prompted by economic crisis, political turmoil, war or similar events, in order to avoid anticipated risk.
Further reading: "What is capital flight?"
https://wiki.mbalib.com/wiki/%E8%B5%84%E6%9C%AC%E9%80%83%E9%81%BF
Intelligence-gathering is a labour-intensive business. It is thus ripe for automation. That this is happening in finance is also natural. There is a well-defined objective (to make money). There is a well-defined end-point (buy, sell or avoid). Without such clarity of purpose, intelligence is an endless river. It is one undammed thing after another.
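Translation: Intelligence-gathering is labour-intensive work, so it is ripe for automation. That this is happening in finance is natural too: there is a clearly defined objective (to make money) and a clearly defined end-point (buy, sell or stay out). Without that clarity of purpose, intelligence is an endless river, one undammed thing after another.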
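Translation team:
Vivian, female, master's in finance, loves diving and sport
Vivifang, female, foreign-currency bond trader, fan of The Economist
Summer, female, currently working in QE, dreams of roaming the world with translation, music and good health
Ashley, female, master's in finance, loves pets, English and travel, fan of The Economist
Proofreading team:
Emily, a finance grunt at the bottom of the food chain, fan of The Economist
Jerry, male, graduate student in finance, die-hard fan of The Economist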
3
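Views | Comments | Reflections
This issue's reflections
Alan, male, master's in financial engineering, fan of The Economist
In a data-mining class in 2017, my teammates and I did a text-mining project that worked much like the text-driven trading the article describes. Put simply, we used a purpose-built dictionary to detect the sentiment of a text. The dictionary's words fell into three classes ("buy", "sell" and "uncertain"), which we used to decide which stance an article was taking.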
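A dictionary-based classifier of the kind described here can be sketched as follows. The word lists in `LEXICON` are hypothetical placeholders, not the dictionary used in the project, and the tie-breaking rule (fall back to "uncertain") is an assumption of this sketch.

```python
import re
from collections import Counter

# Hypothetical three-class dictionary; a real one would be far larger.
LEXICON = {
    "buy": {"growth", "surge", "upgrade", "beat"},
    "sell": {"loss", "default", "downgrade", "miss"},
    "uncertain": {"may", "unclear", "risk", "pending"},
}


def classify(text: str) -> str:
    """Return the class whose dictionary words occur most often in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for label, words in LEXICON.items():
        counts[label] = sum(1 for token in tokens if token in words)
    best = max(counts.values())
    if best == 0:
        return "uncertain"  # no dictionary word found
    winners = [label for label, count in counts.items() if count == best]
    # Assumption: a tie between classes is treated as "uncertain".
    return winners[0] if len(winners) == 1 else "uncertain"
```

For example, `classify("Analysts upgrade the stock after earnings beat")` counts two "buy" words and none from the other classes. The approach is crude: it ignores negation and context, which is one reason accuracy plateaus.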
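Raising accuracy to a high level, of course, takes considerable skill. Alfred Cowles's text-as-data prediction was in essence the same as my project, and the same as the text-mining strategies that hedge funds run today, though the accuracy differs greatly. Such strategies tend to plateau at some figure, say 77%, beyond which further accuracy comes only at the cost of more risk. Why more risk? Because when building a strategy you usually split the data into a training set and a backtest set.
The training set is used to fit the strategy's parameters, looking for the best values. The held-out backtest set is then used to check whether those parameters are over-optimised and whether they are stable.
Suppose the training set gives you seemingly perfect parameters and a return of 220%. That figure comes from hindsight, a god's-eye view: an optimiser has found the best values for data it has already seen. But the past and the present differ, so the gain in accuracy is an illusion. The strategy has lost generality and merely fits the price path of one period. That is why the backtest set is needed.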
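The training/backtest split can be made concrete with a toy strategy that has one parameter. Everything below is an illustrative assumption: the strategy (go long when a sentiment score exceeds a threshold), the function names and the made-up numbers.

```python
def strategy_return(data, threshold):
    """Sum of returns earned by going long whenever score > threshold."""
    return sum(ret for score, ret in data if score > threshold)


def fit_threshold(train, candidates):
    """Pick the candidate threshold with the best in-sample return."""
    return max(candidates, key=lambda t: strategy_return(train, t))


def split_chronologically(data, n_train):
    """Earlier observations train the strategy; later ones test it."""
    return data[:n_train], data[n_train:]


# Made-up (score, next-period return) pairs in time order.
toy_data = [
    (0.9, 0.03), (0.2, -0.01), (0.6, 0.02), (0.4, -0.03), (0.8, 0.01),
    (0.1, -0.01), (0.7, 0.02), (0.9, -0.02), (0.3, 0.01), (0.6, -0.01),
]

train_set, backtest_set = split_chronologically(toy_data, n_train=7)
best_threshold = fit_threshold(train_set, candidates=[0.0, 0.25, 0.5, 0.75])
in_sample = strategy_return(train_set, best_threshold)
out_of_sample = strategy_return(backtest_set, best_threshold)
```

On these toy numbers the threshold that looks best in training loses money on the held-out data, which is exactly the over-optimisation the backtest set is meant to expose. Splitting by time rather than at random keeps the test free of hindsight.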
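Or take spam detection. In Hackers & Painters, the author describes how he first tried to have a program recognise specific patterns in spam. After catching a certain share of spam, catching more meant sacrificing precision, because the program began flagging more legitimate mail as spam too. He then found that a statistical approach was the most effective way to detect spam, specifically Bayes' rule for conditional probabilities. He had switched to an entirely different method.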
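The statistical idea can be sketched with a textbook multinomial naive Bayes classifier. This is not the book's exact scheme, just a standard stand-in under stated assumptions: add-one smoothing, whitespace tokenisation, and a training set containing at least one spam and one non-spam message. The function names and toy messages are invented for illustration.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Returns word and class counts."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts):
    """P(spam | text) by Bayes' rule with add-one (Laplace) smoothing."""
    vocab_size = len(set(word_counts[True]) | set(word_counts[False]))
    total_messages = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_messages)  # prior
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (n_words + vocab_size))
        log_score[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights[True] / (weights[True] + weights[False])


# Made-up toy training messages.
toy_messages = [
    ("win free money now", True),
    ("free prize win", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
word_counts, class_counts = train(toy_messages)
```

Unlike hand-written pattern rules, the classifier weighs every word's evidence at once, so adding coverage does not require adding ever more brittle rules.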
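Trading with machine learning today is much the same: making predictions always involves trade-offs, and the question is how to balance them. Too many features and the model is prone to distortion; too few and it predicts poorly. Higher accuracy tends to come with lower stability, and higher stability with lower accuracy. These are classic trade-off problems.
One clear advantage, though, is that data keeps growing. The data a machine-learning model needs to be fed is often not in short supply; what is scarce is the skill to trade well with it.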