Question on a (big) data algorithm (repost) - 未名空间 MITBBS historical archive


Question on a (big) data algorithm (repost) # Programming - 葵花宝典

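m*h 2015-03-22 07:03

Post 1

My mom went to CITIC (中信) to have them submit her B2 visa application on her behalf. The CITIC staff required her to open a 50,000-yuan "回卡" and said they would not submit the application otherwise.
Is this a new rule? Why have I never heard of it?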

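n*3 2015-03-22 07:03

Post 2

[The following text is reposted from the JobHunting board]
From: nacst23 (cnc), Board: JobHunting
Subject: Question on a (big) data algorithm
Site: BBS 未名空间站 (Sun Mar 22 00:11:01 2015, US Eastern)
A question for everyone:
There are two groups of people, POSITIVE and NEGATIVE,
say
POSITIVE: 100K people,
NEGATIVE: 900K people.
The basic data structure is each person's ID and length of stay (number of days stayed).
ID length of stay (days)
ppl-0000001 8
ppl-0000002 10
...
The goal is to sample 100K people from the NEGATIVE group
who match the POSITIVE group one-to-one
on length of stay,
so that after matching, the 100K length-of-stay values of the two groups
are exactly the same.
Of course, if several NEGATIVE people
match one POSITIVE person, picking any one of them is fine.
I want to write this in C++ using the STL (map/hash).
Is there a good algorithm,
or a better STL data structure/algorithm for this?
Since I plan to write it as Rcpp for R, I am not considering
a parallel solution for now.
Thanks.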

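a*n 2015-03-22 07:03

Post 3

Either that is a local rule at the CITIC branch where you are, or they are fooling you. The embassy has no such requirement.

[Quoting m*h's post]

: My mom went to CITIC (中信) to have them submit her B2 visa application on her behalf. The CITIC staff required her to open a 50,000-yuan "回卡" and said they would not submit the application otherwise.
: Is this a new rule? Why have I never heard of it?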

n*3 2015-03-22 07:03

Post 4

The for loop will take a long time to finish;
I want to figure out a good algorithm/data structure
to speed it up. Thanks.

[Quoting n*****3's post]

: [The following text is reposted from the JobHunting board]
: From: nacst23 (cnc), Board: JobHunting
: Subject: Question on a (big) data algorithm
: Site: BBS 未名空间站 (Sun Mar 22 00:11:01 2015, US Eastern)
: A question for everyone:
: There are two groups of people, POSITIVE and NEGATIVE,
: say
: POSITIVE: 100K people,
: NEGATIVE: 900K people.
: The basic data structure is each person's ID and length of stay (number of days stayed).

i*i 2015-03-22 07:03

Post 5

Which city and which CITIC branch is this? And what is a 50,000-yuan "回卡" anyway -- is it a deposit?

k*g 2015-03-22 07:03

Post 6

Not a statistician, so go easy on me if I'm wrong.
First, break down the larger set by length of stay. After this step, the random sampling is performed within records of the same length of stay.
Check that for each length of stay, the larger data set provides enough data for the task (i.e. at least as many records as the smaller data set has for that length of stay). If not, you have to change your subsampling strategy.
Assign a uniform random number to each record in the larger set and sort by it.
Select the first N records, where N = the number of records in the smaller set with that length of stay.
Make sure you know how to use a random number generator.
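
The steps in this post could be sketched in C++ with STL containers roughly as follows. This is only a minimal sketch under assumptions, not code from the thread: the `Person` struct and the `match_by_stay` function are hypothetical names, records are assumed to fit in memory, and `std::shuffle` stands in for "assign random numbers and sort" (both put each bucket in a uniformly random order, but the shuffle avoids the sort). Bucketing with `std::unordered_map` makes the whole pass roughly linear in the number of records, instead of a nested loop over both groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record layout: an ID plus a length of stay in days.
struct Person {
    std::string id;
    int stay_days;
};

// For each POSITIVE record, pick a distinct NEGATIVE record with the same
// length of stay. Returns (positive index, negative index) pairs.
// Throws std::runtime_error if some length of stay has too few negatives.
std::vector<std::pair<std::size_t, std::size_t>> match_by_stay(
    const std::vector<Person>& positive,
    const std::vector<Person>& negative,
    std::mt19937& rng) {
    // Step 1: bucket the indices of each group by length of stay.
    std::unordered_map<int, std::vector<std::size_t>> pos_by_stay, neg_by_stay;
    for (std::size_t i = 0; i < positive.size(); ++i)
        pos_by_stay[positive[i].stay_days].push_back(i);
    for (std::size_t j = 0; j < negative.size(); ++j)
        neg_by_stay[negative[j].stay_days].push_back(j);

    std::vector<std::pair<std::size_t, std::size_t>> matches;
    matches.reserve(positive.size());

    for (auto& entry : pos_by_stay) {
        const std::vector<std::size_t>& wanted = entry.second;
        auto it = neg_by_stay.find(entry.first);
        // Step 2: feasibility check for this length of stay.
        if (it == neg_by_stay.end() || it->second.size() < wanted.size())
            throw std::runtime_error("not enough NEGATIVE records with stay = " +
                                     std::to_string(entry.first));
        // Step 3: put the candidates in random order, then take the first N.
        std::vector<std::size_t>& pool = it->second;
        std::shuffle(pool.begin(), pool.end(), rng);
        for (std::size_t k = 0; k < wanted.size(); ++k)
            matches.emplace_back(wanted[k], pool[k]);
    }
    // Postcondition: every POSITIVE record received exactly one match.
    assert(matches.size() == positive.size());
    return matches;
}
```

A caller would build the two vectors, seed a `std::mt19937` (a fixed seed gives a reproducible sample for a given standard library), and read the sampled NEGATIVE IDs from the second element of each pair. The order of the returned pairs follows the hash map's iteration order, so sort them by POSITIVE index if order matters.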