我想购买18G的Dropbox帐号 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>PDA - 掌中宝

我想购买18G的Dropbox帐号

我想购买18G的Dropbox帐号# PDA - 掌中宝

c*y2012-01-19 08:01

1 楼

一个很大的数组，长度可以是100000或者更大，里面的数字可以从0到2的32次方，找出
出现频率最高
的数字
我给的答案是hashtable, 小印女很不满，觉得太占内存，红黑树，好像也不满意，哪
位大牛给指点
下，她到底期待一个什么样的数据结构啊

e*r2012-01-19 08:01

2 楼

不管你是从淘宝网拍来的还是自己挣来的，PM我你的美元Offer，如果确实可用我会
Paypal你。
条件是：18G的Dropbox永久帐号，我可以修改user ID和Password.

y*e2012-01-19 08:01

3 楼

首先要问，输入是存在主内存中，还是存在外部文件？
若是输入是存在主内存中的array，是否已经排序好了？多半是没排序的。
那么这个问题可以简化成，给定一个未排序的32bit数组，找出出现频率最高的。用
hashtable 是最简单的解法; 不过面官对 hashtable 不满意，你可以分析出
hashtable 的 memory overhead 有多大，这样可以表现出你对 hashtable 内在实
现的了解程度。（hashtable是很占内存的）。
另外一个很简单的方案：对数组排序，用掉O(NlogN)的时间。那么相同的元素都放
在一起了。接下来扫描数组一遍，同时记录相同元素出现的次数，能有用O(1)的空
间找到出现频率最大的相同元素。

N*k2012-01-19 08:01

4 楼

哈哈，这样省事，不过俺选择推广俺的邀请链接赚空间：
http://db.tt/jinSUII
欢迎使用，此链接比非.edu邮箱注册用户的邀请链接多赠送250M，即500M，初始空间即
为2.5G。

c*22012-01-19 08:01

5 楼

This is to find an algorithm, not to find a data structure.
One way to try:
for(i=0;i<32;i++)
read one bit (the i-th bit) of all numbers
Count how many ones, save it to counts[i]
Then we get array counts[32]
Do this a few rounds:
Find the smallest non-zero number from counts[32]: min
for(i=0;i<32;i++)
counts[i] -= min;
until
there are only two different numbers left in the array
one number must be 0, the other non-zero number
constuct the number from array counts:
i-t

r*r2012-01-19 08:01

6 楼

....真执着啊

【在 N**k 的大作中提到】

: 哈哈，这样省事，不过俺选择推广俺的邀请链接赚空间：
: http://db.tt/jinSUII
: 欢迎使用，此链接比非.edu邮箱注册用户的邀请链接多赠送250M，即500M，初始空间即
: 为2.5G。

s*n2012-01-19 08:01

7 楼

看不懂原理。不过举个例子。假设1M数只有一个1。剩下都是零。
那么照这个算法，最frequent的数是1？

【在 c***2 的大作中提到】

: This is to find an algorithm, not to find a data structure.
: One way to try:
: for(i=0;i<32;i++)
: read one bit (the i-th bit) of all numbers
: Count how many ones, save it to counts[i]
: Then we get array counts[32]
: Do this a few rounds:
: Find the smallest non-zero number from counts[32]: min
: for(i=0;i<32;i++)
: counts[i] -= min;

c*22012-01-19 08:01

8 楼

This won't work.(good only for a few special cases).
The direction may be right...

【在 c***2 的大作中提到】

c*y2012-01-19 08:01

9 楼

嗯，很好的想法。不过，读一个bit和读进来一个字长耗费的时间都是一样的吧

f*g2012-01-19 08:01

10 楼

Could we do multi-level bucket sort:
Round 1:
Construct the buckets as int *a[32].
For each number, we place the number into the buckets by the most-right non-
zero bit. For example: 01000000,00000000,00110000,11000000, we link it into
the bucket a[30]. 00000001,00000000,00110000,11000000 => link to bucket a[
24].
Then for each bucket, we could use the same strategy to sort or quick sort(
depends on the number of integers to be sorted).

t*a2012-01-19 08:01

11 楼

agree!

non-
into

【在 f****g 的大作中提到】

: Could we do multi-level bucket sort:
: Round 1:
: Construct the buckets as int *a[32].
: For each number, we place the number into the buckets by the most-right non-
: zero bit. For example: 01000000,00000000,00110000,11000000, we link it into
: the bucket a[30]. 00000001,00000000,00110000,11000000 => link to bucket a[
: 24].
: Then for each bucket, we could use the same strategy to sort or quick sort(
: depends on the number of integers to be sorted).
:

s*g2012-01-19 08:01

12 楼

How does example work? I did not get it.

non-
into

【在 f****g 的大作中提到】

p*72012-01-19 08:01

13 楼

先用1bit bitmap，0表示没出现，1表示出现过1次或者以上
然后再根据上面的结果创建一个hashtable表格，里面只包括了出现了1次或者1次以上的数字，这样这个hashtable的大小会小很多。
最后就是一个counting的过程

f*g2012-01-19 08:01

14 楼

That's a very good solution too: use bitmap to determine the existence of
each
element, and then build a hashtable to count.

上的数字，这样这
个hashtable的大小会小很多。

【在 p********7 的大作中提到】

: 先用1bit bitmap，0表示没出现，1表示出现过1次或者以上
: 然后再根据上面的结果创建一个hashtable表格，里面只包括了出现了1次或者1次以上的数字，这样这个hashtable的大小会小很多。
: 最后就是一个counting的过程

f*g2012-01-19 08:01

15 楼

Take 4-bit integer as an example:
Bucket 1: 1000 - 1111
Bucket 2: 0100 - 0111
Bucket 3: 0010 - 0011
Bucket 4: 0001 - 0001
Got it?

【在 s*********g 的大作中提到】

: How does example work? I did not get it.
:
: non-
: into

d*i2012-01-19 08:01

16 楼

学习了。

k*g2012-01-19 08:01

17 楼

不知道这个bitmap有什么用，原来数组中的数，出现次数肯定大于等于1，跟直接扫描
数组有区别？

上的数字，这
样这个hashtable的大小会小很多。

【在 p********7 的大作中提到】

g*e2012-01-19 08:01

18 楼

这个bitmap开销很大。2^32 bits。建ht的过程又要读一遍。这个ht能直接count，还是
类似bucket sort要来好几轮？
我觉得还是类似bucket sort的思路好，每次2^8个buckets，最坏的情况全部数字读4遍
。从space和time的cost均衡一些。
但是我觉得tree不是更好，类似Huffman coding，每个node带一个counter？

【在 f****g 的大作中提到】

: That's a very good solution too: use bitmap to determine the existence of
: each
: element, and then build a hashtable to count.
:
: 上的数字，这样这
: 个hashtable的大小会小很多。

p*72012-01-19 08:01

19 楼

这么一来其实还是分了N个extra 内存

non-
into

【在 f****g 的大作中提到】