Quant Interview "Real Questions" Series: Part 1



The "Quantitative Investment and Machine Learning" (QIML) WeChat public account is a leading self-media channel dedicated to quantitative investing, hedge funds, Fintech, artificial intelligence, and big data. It has 300,000+ followers from mutual funds, private funds, brokerages, futures firms, banks, insurers, and universities, and has been named "Author of the Year" by the Tencent Cloud+ Community for two consecutive years.


In 2022, the QIML public account launched yet another brand-new series:



QIML has collected real interview questions from top global hedge funds and major tech companies. We hope it gives readers a fresh perspective on job hunting and learning!


Part 1


Source: AQR

▌Difficulty: Easy


Question

Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results of the regression be affected if several are indeed correlated? How would you deal with this problem?


Answer

There will be two primary problems when running a regression if several of the predictor variables are correlated. The first is that the coefficient estimates and their signs will vary dramatically, depending on which particular variables you include in the model. Certain coefficients may even have confidence intervals that include 0 (meaning it is difficult to tell whether an increase in that X value is associated with an increase or a decrease in Y), and hence the results will not be statistically significant. The second is that the resulting p-values will be misleading. For instance, an important variable might have a high p-value and be deemed statistically insignificant even though it is actually important. It is as if the effect of the correlated features were "split" between them, leading to uncertainty about which features are actually relevant to the model.


You can deal with this problem by either removing or combining the correlated predictors. To effectively remove one of the predictors, it is best to understand the cause of the correlation (i.e., did you include extraneous predictors such as X and 2X, or are there latent variables underlying one or more of the included predictors that affect both?). To combine predictors, it is possible to include interaction terms (the product of the two that are correlated). Additionally, you could (1) center the data and (2) try to obtain a larger sample size, thereby giving you narrower confidence intervals. Lastly, you can apply regularization methods (such as ridge regression).
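To make the diagnosis and the ridge fix concrete, here is a minimal numpy-only sketch (the simulated two-predictor setup and the lambda value are illustrative assumptions, not part of the original answer). It shows how OLS coefficients swing across bootstrap resamples when two predictors are nearly collinear, and how ridge regression stabilizes them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two highly correlated predictors: x2 is x1 plus a small amount of noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def ols(X, y):
    """Ordinary least squares via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge regression: adding lam * I to X'X shrinks and stabilizes the coefficients."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# OLS coefficients swing wildly from one bootstrap resample to the next.
for _ in range(3):
    idx = rng.integers(0, n, size=n)
    print("OLS on resample:", ols(X[idx], y[idx]))

# Ridge trades a little bias for much lower variance in the estimates.
print("Ridge (lam=1.0): ", ridge(X, y, 1.0))
```

In practice you would typically also inspect pairwise correlations or variance inflation factors before deciding whether to drop, combine, or regularize.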


---


Source: Point72

▌Difficulty: Easy


Question

Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?


Answer

Random forests are used because individual decision trees are usually prone to overfitting. They build multiple decision trees and average their predictions, and they can be used for either classification or regression. There are a few main ways in which they allow for stronger out-of-sample prediction than individual decision trees.


* As in other ensemble models, using a large set of trees, each built on a resample of the data (bootstrap aggregation), leads to a model that yields more consistent results. More specifically, and in contrast to a single decision tree, each tree sees different training data, which contributes to better results in terms of the bias-variance trade-off (particularly with respect to variance).


* Using only m < p features at each split helps to de-correlate the decision trees, preventing the most important features from always appearing at the first splits of every tree (which happens in standalone trees due to the nature of information gain).


* They’re fairly easy to implement and fast to run.


* They can produce very interpretable feature-importance values, thereby improving model understandability and feature selection.


The first two bullet points are the main ways random forests improve upon single decision trees.
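To make those two mechanisms concrete, the sketch below hand-rolls a tiny random forest (assuming scikit-learn's DecisionTreeClassifier; this is an illustration, not the library's actual RandomForestClassifier implementation): each tree is fit on a bootstrap resample, only a random subset of features is considered at each split via max_features, and the forest predicts by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

n_trees = 50
trees = []
for _ in range(n_trees):
    # (1) Bootstrap aggregation: each tree is trained on a different resample of the data.
    idx = rng.integers(0, len(X), size=len(X))
    # (2) Feature subsampling: max_features="sqrt" considers only m < p features per split.
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The forest prediction is the majority vote over the individual trees.
votes = np.mean([t.predict(X) for t in trees], axis=0)
forest_pred = (votes >= 0.5).astype(int)
print("In-sample accuracy of the hand-rolled forest:", (forest_pred == y).mean())
```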


---


Source: Two Sigma

▌Difficulty: Easy


Question

Say you were running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your beta coefficient?


Answer

The coefficient estimates remain unchanged. Duplicating every data point scales both sides of the OLS normal equations by the same factor of two, so the solution for beta is identical. (The estimated standard errors will shrink, however, because the sample appears twice as large, making the results look more significant than they really are.)
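A quick sketch of the algebra (standard OLS notation, assumed here rather than quoted from the original answer): stacking a duplicate copy of the design matrix and response scales both sides of the normal equations by two, which cancels.

```latex
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{\beta}_{\text{dup}}
  = \left(
      \begin{bmatrix} X \\ X \end{bmatrix}^{\top}
      \begin{bmatrix} X \\ X \end{bmatrix}
    \right)^{-1}
    \begin{bmatrix} X \\ X \end{bmatrix}^{\top}
    \begin{bmatrix} y \\ y \end{bmatrix}
  = (2X^{\top}X)^{-1}(2X^{\top}y)
  = \hat{\beta}.
```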


---


Source: Robinhood

▌Difficulty: Easy


Question

Say you are building a binary classifier for an unbalanced dataset (where one class is much rarer than the other, say 1% and 99%, respectively). How do you handle this situation? 


Answer

Unbalanced classes can be dealt with in several ways.


First, you want to check whether you can get more data or not. While in many scenarios, data may be expensive or difficult to acquire, it’s important to not overlook this approach, and at least mention it to your interviewer.


Next, make sure you’re looking at appropriate metrics. For example, accuracy is not an appropriate metric when classes are imbalanced; instead, you want to look at precision, recall, the F1 score, and the ROC curve.


Then, you can resample the training set by either oversampling the rare samples or undersampling the abundant samples; both can be accomplished via bootstrapping. These approaches are easy and quick to run, so they should be good starting points. Note, if the event is inherently rare, then oversampling may not be necessary, and you should focus more on the evaluation function. 


Additionally, you could try generating synthetic examples. There are several algorithms for doing so; the most popular is SMOTE (synthetic minority oversampling technique), which creates synthetic samples of the rare class rather than pure copies. It does this by selecting a rare-class instance and perturbing its features by a random amount proportional to the difference from a neighboring rare-class instance.
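As a rough sketch of this step (assuming the third-party imbalanced-learn package and a synthetic 99:1 dataset; both are illustrative assumptions, not part of the original answer):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # third-party package: imbalanced-learn

# A hypothetical 99:1 binary classification problem.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
print("Before:", Counter(y))

# SMOTE interpolates between a rare-class point and one of its nearest rare-class
# neighbors to create synthetic (not duplicated) minority examples.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))
```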


Another way is to resample classes by running ensemble models with different ratios of the classes, or by running an ensemble model using all samples of the rare class and a differing amount of the abundant class. Note that some models, such as logistic regression, are able to handle unbalanced classes relatively well in a standalone manner. You can also adjust the probability threshold to something besides 0.5 for classifying the unbalanced outcome.


Lastly, you can design your own cost function that penalizes misclassification of the rare class more heavily than misclassification of the abundant class. This is useful if you have to use a particular kind of model and are unable to resample. However, it can be complex to set up the penalty matrix, especially with many classes.
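Many libraries expose this idea through class weights rather than a full penalty matrix. A minimal sketch with scikit-learn's class_weight parameter, where the 99:1 weighting and the 0.2 threshold are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# The same hypothetical 99:1 dataset as in the previous sketch.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)

# Misclassifying the rare class (label 1) is penalized 99x more heavily.
clf = LogisticRegression(class_weight={0: 1, 1: 99}, max_iter=1000).fit(X, y)

# A related lever: lower the decision threshold for predicting the rare class.
proba = clf.predict_proba(X)[:, 1]
pred = (proba > 0.2).astype(int)  # 0.2 is an illustrative choice, not a rule
```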


---


Source: Facebook

▌Difficulty: Easy


Question

When performing K-means clustering, how do you choose k?


Answer

The elbow method is the most well-known method for choosing k in k-means clustering. The intuition behind this technique is that the first few clusters will explain a lot of the variation in the data, but past a certain point the amount of information added is diminishing. Looking at a graph of explained variation (on the y-axis) versus the number of clusters k, there should be a sharp change in the y-axis at some value of k. For example, an elbow plot might show a clear dropoff at approximately k = 6.


Note that the explained variation is quantified by the within-cluster sum of squared errors. To calculate this error metric, we compute, for each cluster, the sum of squared Euclidean distances from each point to its cluster centroid, and then sum across clusters. A caveat to keep in mind: the assumed drop in variation may not appear at all; the y-axis may simply keep decreasing slowly (i.e., there is no significant drop).


Another popular alternative for determining k in k-means clustering is the silhouette method, which aims to measure how similar each point is to its own cluster compared to the other clusters. Concretely, for each point it looks at:

s = (x - y) / max(x, y)

where x is the mean distance to the examples of the nearest cluster, and y is the mean distance to the other examples in the same cluster. The coefficient varies between -1 and 1 for any given point. A value of 1 implies that the point is in the "right" cluster, and vice versa for a score of -1. By plotting the average score on the y-axis versus k, we can get an idea of the optimal number of clusters based on this metric. Note that the silhouette metric is more computationally intensive to calculate for all points than the elbow method.
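In practice both diagnostics take only a few lines. Here is a sketch assuming scikit-learn's KMeans and silhouette_score on synthetic data (the blob dataset and the range of k are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data whose "true" number of clusters we pretend not to know.
X, _ = make_blobs(n_samples=1_000, centers=6, random_state=0)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances (the elbow curve's y-axis);
    # silhouette_score averages the silhouette coefficient over all points.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```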


Taking a step back, while both the elbow and silhouette methods serve their purpose, sometimes it helps to lean on your business intuition when choosing the number of clusters. For example, if you are clustering patients or customer groups, stakeholders and subject matter experts should have a hunch concerning how many groups they expect to see in the data. Additionally, you can visualize the features for the different groups and assess whether they are indeed behaving similarly. There is no perfect method for picking k, because if there were, it would be a supervised problem and not an unsupervised one.


---


Source: PWC

▌Difficulty: Easy


Question

Compare and contrast gradient boosting and random forests.


Answer

The first main difference is that, in gradient boosting, trees are built one at a time, such that successive weak learners learn from the mistakes of preceding weak learners. In random forests, the trees are built independently at the same time.


The second difference is in the output: gradient boosting combines the results of the weak learners with each successive iteration, whereas, in random forests, the trees are combined at the end (through either averaging or majority).


Because of these structural differences, gradient boosting is often more prone to overfitting than random forests, due to its focus on correcting mistakes over training iterations and the lack of independence in tree building. Additionally, gradient boosting hyper-parameters are harder to tune than those of random forests. Lastly, gradient boosting may take longer to train than a random forest, because its trees are built sequentially, whereas the trees of a random forest can be built in parallel.
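A side-by-side sketch of fitting both models (scikit-learn assumed; the hyper-parameters are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosting: trees are grown sequentially, each fitting the errors of the ensemble so far.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)

# Random forest: trees are grown independently on bootstrap samples and can run in parallel.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1)

for name, model in [("gradient boosting", gbm), ("random forest", rf)]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))
```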


---

