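Some of these are questions lz (the original poster) was asked in interviews, and some were collected from various places. Questions from the red/green books are not posted here. Because some time has passed,
I may have misremembered a few details. Please
bear with me!~
Will update from time to time as more come to mind.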
---
explain EM algorithm, use EM algorithm to find SVD of a given matrix
---
Suppose you write an online training model: estimate how many observations it
takes before the parameters start to converge
What's the convergence rate of an online training algorithm (e.g. stochastic
gradient descent)?
Convergence rate of gradient descent?
How many points are needed to converge, given the dimension of the feature space?
Derive the gradient descent formula
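One possible way to write out the derivation asked for above, as a sketch only. The symbols f (objective), \theta_t (current parameters), \eta (step size) and g_t (stochastic gradient) are my own notation, not from the original question.

```latex
% First-order Taylor expansion around the current iterate \theta_t:
f(\theta_t + d) \approx f(\theta_t) + \nabla f(\theta_t)^{\top} d .
% Minimize the linear term plus a quadratic penalty on the step (step size \eta > 0):
d^{*} = \arg\min_{d} \; \nabla f(\theta_t)^{\top} d + \frac{1}{2\eta} \lVert d \rVert^{2}
      = -\eta \, \nabla f(\theta_t) ,
% which gives the gradient descent update
\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t) .
% SGD replaces \nabla f with an unbiased estimate g_t from one observation or mini-batch:
\theta_{t+1} = \theta_t - \eta_t \, g_t , \qquad \mathbb{E}[g_t \mid \theta_t] = \nabla f(\theta_t) .
```

For the rate questions, the usual textbook answers (stated without proof, under the standard assumptions): for convex, L-smooth f, gradient descent with step 1/L has suboptimality O(1/t), and a linear rate if f is also strongly convex; SGD gets O(1/sqrt(t)) for convex f and O(1/t) for strongly convex f with a decaying step size.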
---
reservoir sampling prob. Proof
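A minimal sketch of reservoir sampling for k items, assuming the stream is any Python iterable (the name `reservoir_sample` is mine). Proof idea by induction: if after n items each one is in the reservoir with probability k/n, then item n+1 enters with probability k/(n+1), and an old item survives with probability (k/n) * (1 - (k/(n+1)) * (1/k)) = k/(n+1).

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

If the stream has fewer than k items, this sketch simply returns all of them.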
---
How to select SVM kernel?
---
use the monte carlo method to estimate pi; how can you ensure the first 6 digits are
accurate?
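A rough sketch of the dart-throwing estimator plus a sample-size estimate based on the central limit theorem. Both function names and the default `z=1.96` (about 95% confidence) are my assumptions; the CLT bound is only approximate and probabilistic, not a guarantee.

```python
import math
import random


def estimate_pi(n, seed=None):
    """Estimate pi from n uniform points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n


def samples_needed(eps, z=1.96):
    """Approximate n so that |estimate - pi| < eps at the confidence level implied by z."""
    p = math.pi / 4                        # probability of landing in the quarter circle
    sigma = 4.0 * math.sqrt(p * (1 - p))   # std dev of one sample of the estimator
    return math.ceil((z * sigma / eps) ** 2)
```

With eps = 5e-7 (half a unit in the sixth decimal place), `samples_needed` comes out on the order of 10^13, which is why the plain estimator is impractical for 6 digits.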
---
given many fair coins, how to construct an event with p = pi - 3
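One common approach, sketched under assumptions: read the coin flips as the binary digits of a uniform number U in [0, 1) and compare them with the binary expansion of p, stopping at the first digit that differs; the event is "U < p". The function name and the `flip` callback (returns 0 or 1) are hypothetical, and the float `math.pi - 3` only approximates the true value to about 53 bits.

```python
import math
import random


def event_with_probability(p, flip, max_bits=60):
    """Return True with probability p (0 <= p < 1) using only fair coin flips."""
    for _ in range(max_bits):
        # Extract the next binary digit of p (exact for binary floats).
        p *= 2
        bit_p = 1 if p >= 1 else 0
        p -= bit_p
        bit_u = flip()  # next binary digit of the uniform number U
        if bit_u != bit_p:
            # First differing digit decides whether U < p.
            return bit_u < bit_p
    return False  # reached only with probability 2**-max_bits
```

Each round stops with probability 1/2, so the expected number of flips is 2.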
---
The elevator problem: assume N people in an elevator and m floors;
find E(# of stops of this elevator) and Var(# of stops of this elevator)
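A sketch of the usual indicator-variable solution, assuming (this is not stated in the question) that each person picks one of the m floors independently and uniformly. I_i, q and r are my notation.

```latex
% I_i = 1 if the elevator stops at floor i, and X = \sum_{i=1}^{m} I_i is the number of stops.
% q = P(nobody exits at a given floor), r = P(nobody exits at either of two given floors):
q = \left(1 - \tfrac{1}{m}\right)^{N}, \qquad r = \left(1 - \tfrac{2}{m}\right)^{N} .
% Expectation, by linearity:
\mathbb{E}[X] = m \, (1 - q) .
% Variance, using \operatorname{Var}(I_i) = q(1 - q) and \operatorname{Cov}(I_i, I_j) = r - q^{2} for i \neq j:
\operatorname{Var}(X) = m \, q \, (1 - q) + m (m - 1) \, (r - q^{2}) .
```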
---
Shuffle an array in O(n) time and O(1) space
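The standard answer is the Fisher-Yates shuffle; a minimal sketch (the function name is mine):

```python
import random


def shuffle_in_place(a, rng=random):
    """Fisher-Yates shuffle: O(n) time, O(1) extra space, uniform over permutations."""
    for i in range(len(a) - 1, 0, -1):
        # Pick a position from the not-yet-fixed prefix a[0..i], inclusive.
        j = rng.randint(0, i)
        a[i], a[j] = a[j], a[i]
    return a
```

The common bug to watch for is drawing j from the whole array at every step, which no longer gives a uniform permutation.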
---
Prove that in regression R^2=cor^2(y,y^hat)
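A sketch of the proof, assuming ordinary least squares with an intercept term (without the intercept the identity does not hold in general). Here e = y - \hat{y} is the residual vector, which is my notation.

```latex
% OLS with intercept implies \sum_i e_i = 0 and \sum_i e_i \hat{y}_i = 0,
% hence \operatorname{cov}(e, \hat{y}) = 0 and the mean of \hat{y} equals \bar{y}.
\begin{aligned}
\operatorname{cov}(y, \hat{y}) &= \operatorname{cov}(\hat{y} + e, \hat{y})
  = \operatorname{var}(\hat{y}) + \operatorname{cov}(e, \hat{y})
  = \operatorname{var}(\hat{y}), \\
\operatorname{cor}^{2}(y, \hat{y})
  &= \frac{\operatorname{cov}(y, \hat{y})^{2}}{\operatorname{var}(y) \, \operatorname{var}(\hat{y})}
  = \frac{\operatorname{var}(\hat{y})}{\operatorname{var}(y)}
  = \frac{\sum_i (\hat{y}_i - \bar{y})^{2}}{\sum_i (y_i - \bar{y})^{2}}
  = R^{2} .
\end{aligned}
```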
---
Deepcopy a graph
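A minimal BFS sketch, assuming each node stores a value and a list of neighbors (the `Node` class and `clone_graph` name are hypothetical). The dictionary from original node to its copy is what makes cycles safe.

```python
from collections import deque


class Node:
    def __init__(self, val):
        self.val = val
        self.neighbors = []


def clone_graph(start):
    """Deep-copy the component reachable from start; handles cycles."""
    if start is None:
        return None
    copies = {start: Node(start.val)}  # original node -> its copy
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in node.neighbors:
            if nb not in copies:
                # First time we see this neighbor: create its copy and visit it later.
                copies[nb] = Node(nb.val)
                queue.append(nb)
            copies[node].neighbors.append(copies[nb])
    return copies[start]
```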
---
When to use linked list, when to use array?
---
Implement a hashmap so that I can iterate over it in the order that I put
the elements in
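One way to sketch this is a hash index plus a doubly linked list that records insertion order (the idea behind Java's LinkedHashMap). The class and method names are mine. The built-in `dict` is used here only as the hash index; its own ordering guarantee is not relied on.

```python
class _Entry:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class InsertionOrderedMap:
    def __init__(self):
        self._index = {}        # key -> linked-list entry
        self._head = _Entry()   # sentinel before the first entry
        self._tail = _Entry()   # sentinel after the last entry
        self._head.next = self._tail
        self._tail.prev = self._head

    def put(self, key, value):
        entry = self._index.get(key)
        if entry is not None:
            entry.value = value  # update in place, keep original position
            return
        entry = _Entry(key, value)
        last = self._tail.prev
        last.next, entry.prev = entry, last
        entry.next, self._tail.prev = self._tail, entry
        self._index[key] = entry

    def get(self, key):
        return self._index[key].value

    def remove(self, key):
        entry = self._index.pop(key)
        entry.prev.next = entry.next  # unlink in O(1)
        entry.next.prev = entry.prev

    def items(self):
        node = self._head.next
        while node is not self._tail:
            yield node.key, node.value
            node = node.next
```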
---
Implement hashmap using tree
---
deep iterator:
{{1,2,3},4,{{5,6},4}}
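A possible sketch of the deep iterator, assuming the nested structure is given as Python lists or tuples (the example above becomes `[[1, 2, 3], 4, [[5, 6], 4]]`). An explicit stack of iterators avoids recursion limits on deeply nested input.

```python
def deep_iter(nested):
    """Yield the leaf elements of arbitrarily nested lists/tuples, left to right."""
    stack = [iter(nested)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()  # this level is exhausted, go back up
            continue
        if isinstance(item, (list, tuple)):
            stack.append(iter(item))  # descend one level
        else:
            yield item
```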
---
find the lowest common ancestor given 2 tree nodes
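A sketch for the lowest common ancestor, assuming a binary tree without parent pointers, access to the root, and that both nodes are actually in the tree (none of this is specified in the question; `TreeNode` is hypothetical).

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def lowest_common_ancestor(root, p, q):
    """Return the deepest node that has both p and q in its subtree."""
    if root is None or root is p or root is q:
        return root
    left = lowest_common_ancestor(root.left, p, q)
    right = lowest_common_ancestor(root.right, p, q)
    if left is not None and right is not None:
        return root  # p and q are in different subtrees, so root is the answer
    return left if left is not None else right
```

With parent pointers, an alternative is to walk up from both nodes and find where the two paths meet.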
---
define metrics to measure the success of a newsfeed picking algorithm
at facebook
click through rate (if we maximize this rate, could it do some damage?)
---
dialog sql question, assume a table with columns:
userid
appid
type: 'imp'/'click'
timestamp
define a metric to measure the success of clicks over impressions
intensity (given a certain time interval, count the clicks), volume
click through rate (write sql to calculate this rate)
if for an appid the # of click rows > the # of imp rows, what could be the reason?
some actions will generate false click rows
how to calculate the real click through rate based on this erroneous data?
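A sketch of the click-through-rate query, run through Python's `sqlite3` so it can be tried locally. The table name `events`, the column types and the helper function are my assumptions; only the four column names come from the question.

```python
import sqlite3

CTR_SQL = """
SELECT appid,
       SUM(CASE WHEN type = 'click' THEN 1 ELSE 0 END) AS clicks,
       SUM(CASE WHEN type = 'imp'   THEN 1 ELSE 0 END) AS imps,
       1.0 * SUM(CASE WHEN type = 'click' THEN 1 ELSE 0 END)
           / NULLIF(SUM(CASE WHEN type = 'imp' THEN 1 ELSE 0 END), 0) AS ctr
FROM events
GROUP BY appid
ORDER BY appid
"""


def click_through_rates(rows):
    """rows: iterable of (userid, appid, type, timestamp) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (userid TEXT, appid TEXT, type TEXT, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(CTR_SQL).fetchall()
    conn.close()
    return result
```

`NULLIF(..., 0)` makes the rate NULL instead of failing when an appid has clicks but no impressions, which is exactly the suspicious case raised above. For the follow-up, one option (again an assumption about the data) is to count a click only when the same userid and appid have an earlier impression.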
---
write the sqrt(x) function
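A minimal sketch using Newton's method on f(g) = g^2 - x, assuming a floating-point answer is wanted (for the integer-floor variant, binary search is the usual alternative). The tolerance parameter is my choice.

```python
def sqrt(x, tol=1e-12):
    """Square root by Newton's iteration g <- (g + x / g) / 2."""
    if x < 0:
        raise ValueError("sqrt of a negative number")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # start at or above the true root
    while True:
        nxt = 0.5 * (guess + x / guess)
        if abs(nxt - guess) <= tol * nxt:  # relative change is small enough
            return nxt
        guess = nxt
```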
---
given an array of integers, find the median, faster than sort
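The usual answer is quickselect, which runs in expected O(n) time. A sketch with a random pivot and three-way partition (function names are mine); it works on a copy so the input is left untouched.

```python
import random


def quickselect(values, k, rng=random):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    items = list(values)
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[rng.randint(lo, hi)]
        # Three-way partition of items[lo..hi]: < pivot | == pivot | > pivot
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            return pivot  # k falls inside the block equal to the pivot


def median(values):
    n = len(values)
    if n == 0:
        raise ValueError("median of empty input")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2
```

The worst case is still O(n^2); median-of-medians pivoting gives a deterministic O(n) bound at the cost of larger constants.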
---
assume a graph stored as
src|friend
a|b
b|c
..
need to find the friends of friends that are not currently friends of mine
i.e. c to a
how to do this in hadoop platform?
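A sketch of how the job could be structured, simulated in plain Python rather than actual Hadoop code. It assumes friendships are symmetric and that each user's friend list fits in one reducer; the function name and the "direct"/"via" tags are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def suggest_friends(edges):
    """edges: iterable of (src, friend) pairs. Returns {(x, y): number of mutual friends}."""
    # Job 1, map + shuffle: key each edge by both endpoints to collect friend lists.
    adjacency = defaultdict(set)
    for src, friend in edges:
        adjacency[src].add(friend)
        adjacency[friend].add(src)

    # Job 1, reduce: for each user, emit existing pairs and candidate pairs.
    emitted = defaultdict(list)
    for user, friends in adjacency.items():
        for f in friends:
            emitted[tuple(sorted((user, f)))].append("direct")
        for x, y in combinations(sorted(friends), 2):
            emitted[(x, y)].append("via:" + user)  # x and y share the friend `user`

    # Job 2, reduce: keep pairs that are never directly connected.
    suggestions = {}
    for pair, tags in emitted.items():
        if "direct" not in tags:
            suggestions[pair] = len(tags)
    return suggestions
```

On the example above (a|b, b|c) this yields the single pair (a, c) with one mutual friend. In a real job, users with very large friend lists emit a quadratic number of candidate pairs, so that skew is usually the first thing to discuss.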
---
regression spline
ridge regression: is the shrinkage proportional across all
parameters, or is each parameter shrunk individually?
prove this
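A sketch of the standard SVD argument, assuming X has full column rank and centered columns (so the intercept can be ignored). The notation (\lambda, d_j, u_j, v_j) is mine.

```latex
% Ridge estimator for penalty \lambda > 0:
\hat{\beta}^{\text{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y .
% With the thin SVD X = U D V^{\top}, D = \operatorname{diag}(d_1, \dots, d_p):
\hat{\beta}^{\text{ridge}} = \sum_{j=1}^{p} v_j \, \frac{d_j}{d_j^{2} + \lambda} \, u_j^{\top} y ,
\qquad
\hat{\beta}^{\text{ols}} = \sum_{j=1}^{p} v_j \, \frac{1}{d_j} \, u_j^{\top} y .
% So the component of the OLS solution along direction v_j is multiplied by
\frac{d_j^{2}}{d_j^{2} + \lambda} ,
% which differs across directions: low-variance directions (small d_j) shrink the most.
% Special case of an orthonormal design, X^{\top} X = I (all d_j = 1):
\hat{\beta}^{\text{ridge}} = \frac{1}{1 + \lambda} \, \hat{\beta}^{\text{ols}} .
```

So the shrinkage is uniform and proportional only in the orthonormal case; in general it acts per principal direction rather than per original coefficient.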
---
if i have more data points, how will the bias and variance change?
for ridge regression and lasso, how do the bias and variance change compared to
linear regression?
---
CART: why are the splits all binary? why don't we use multi-way splits at
each split?
what is the stopping rule for splitting?
how to prune the tree?
---
assume we have groups and CM data, how to suggest groups to CM?
how to pick a good metric of distance if we use kNN?
if for each group, build a classification model to estimate the prob that
this CM is interested in, what is the potential pitfall of this?
groups with too few members?
what is the distribution of groups over # of group members?
---
when do you know your model is done?
---
assume we have 3 data sets: 1. user_id, ads_id, click_or_not, 2. user_id, user
attributes, 3. ads_id, ads attributes
how do you estimate P{click|user, ads}?
if we have users that click a lot of ads, and users that only click a small number
of ads, how do you build models that can deal with both kinds of users?
(not under-sampling or cost-sensitive modeling, i.e. TFIDF)