only average statistics (DataSciences - Data Science)
j*g
1
This is not my task. It is my colleague's task, and I don't like the way
he did it. I am interested in how to analyze this. We have very detailed
data, and we want to find some correlation with another data source. The
other source only has average statistics, for example, the average
price of a product in each district. We have the detailed price for each
store, but no district information. How can I find the correlation? Can I
fit the distribution using a weighted average and then compare the
distributions?
Thanks
j*g
2
Also, our data is definitely biased, while theirs is not.
c*z
3
maybe the only thing you can do is to aggregate to district level and
compare with them...
D*e
4
Hierarchical modeling
d*n
5
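Then I'm afraid it will be hard to do much.
If the two distributions have already been mixed together, separating them
is difficult. Even if you were given the distributions rather than just the
means, classifying a series of data points would still not be a simple process.

[j*******g wrote:]
: Also, our data is definitely biased, while theirs is not.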
B*6
6

Does the OP mean that you don't know which district each store belongs to?
And by bias in the data, do you mean that you have not collected data for
all stores?

[j*******g wrote:]
: This is not my task. It is my colleague's task, and I don't like the way
: he did it. I am interested in how to analyze this. We have very detailed
: data, and we want to find some correlation with another data source. The
: other source only has average statistics, for example, the average
: price of a product in each district. We have the detailed price for each
: store, but no district information. How can I find the correlation? Can I
: fit the distribution using a weighted average and then compare the
: distributions?
: Thanks

j*g
7
I think this is an interesting topic, but not for this particular scenario :)

[D***e wrote:]
: Hierarchical modeling
j*g
8
Our data is about 1/4 the size of the other data source, and the sample size
is so large that if our data were not biased, we should be able to show that
both come from the same distribution. But I cannot show that, and given the
way we collect our data, I tend to believe our data is biased.

[B*******6 wrote:]
: Does the OP mean that you don't know which district each store belongs to?
: And by bias in the data, do you mean that you have not collected data for
: all stores?
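
One rough way to look at the bias question above: compare the pooled mean of the store prices with a weighted average of the published district averages. This is only a sketch under stated assumptions. The function name `pooled_mean_z` and the numbers are made up, the district weights (for example store counts per district) are assumed to be available, and the district averages are treated as exact. If the other source weights its averages differently (say, by sales volume), the weights here would need to match that.

```python
import math
import statistics

def pooled_mean_z(store_prices, district_means, district_weights):
    """z-score of (store sample mean - weighted average of district means).

    Assumes the district averages are exact and that district_weights
    reflect how stores are spread across districts (hypothetical input).
    """
    total = sum(district_weights)
    target = sum(m * w for m, w in zip(district_means, district_weights)) / total
    mean = statistics.fmean(store_prices)
    # Standard error of the store sample mean.
    se = statistics.stdev(store_prices) / math.sqrt(len(store_prices))
    return (mean - target) / se

# Made-up numbers, for illustration only.
store_prices = [9.8, 10.4, 11.1, 10.9, 12.0, 9.5, 10.2, 11.6]
district_means = [10.0, 11.0]
district_weights = [3, 1]  # hypothetical store counts per district
z = pooled_mean_z(store_prices, district_means, district_weights)
print(f"z = {z:.2f}")
```

A large |z| would be consistent with the suspicion that the store sample is not representative, but it cannot say which districts are over- or under-sampled.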

B*6
9
Compute the likelihood of the samples under a distribution with the known
means?

[j*******g wrote:]
: Our data is about 1/4 the size of the other data source, and the sample size
: is so large that if our data were not biased, we should be able to show that
: both come from the same distribution. But I cannot show that, and given the
: way we collect our data, I tend to believe our data is biased.
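
One possible reading of the likelihood suggestion, as a sketch only: treat the store prices as draws from a mixture of normal distributions centered at the published district means, and compute the log-likelihood. The normal shape, the common standard deviation `sd`, and the mixture weights are all assumptions that are not given in the thread; `mixture_log_likelihood` is a hypothetical helper name.

```python
import math

def mixture_log_likelihood(prices, means, weights, sd):
    """Log-likelihood of prices under a normal mixture with known means.

    Assumes every component shares the same standard deviation `sd`
    (an assumption; the thread only gives the means).
    """
    total_w = sum(weights)
    norm = sd * math.sqrt(2 * math.pi)
    ll = 0.0
    for x in prices:
        density = 0.0
        for m, w in zip(means, weights):
            z = (x - m) / sd
            density += (w / total_w) * math.exp(-0.5 * z * z) / norm
        # Note: a price extremely far from every mean could underflow to 0 here.
        ll += math.log(density)
    return ll

# Made-up numbers, for illustration only.
near = mixture_log_likelihood([10.0, 11.0], [10.0, 11.0], [1, 1], sd=1.0)
far = mixture_log_likelihood([14.0, 15.0], [10.0, 11.0], [1, 1], sd=1.0)
print(near > far)
```

The likelihood rewards prices that sit close to the district means, which is the concern raised later in the thread that the result will be "very dense on the mean".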

h*7
10
Use supervised clustering.
After training, use the means as the initialization, then run one step of
the clustering loop to get the memberships.

[j*******g wrote:]
: This is not my task. It is my colleague's task, and I don't like the way
: he did it. I am interested in how to analyze this. We have very detailed
: data, and we want to find some correlation with another data source. The
: other source only has average statistics, for example, the average
: price of a product in each district. We have the detailed price for each
: store, but no district information. How can I find the correlation? Can I
: fit the distribution using a weighted average and then compare the
: distributions?
: Thanks
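
A minimal sketch of what "one step of the clustering loop" could look like, assuming a k-means style assignment step with the published district means used as fixed centers. This is one interpretation of the suggestion, not necessarily what the poster had in mind; `assign_to_nearest_mean` and the numbers are made up.

```python
def assign_to_nearest_mean(prices, district_means):
    """Return, for each price, the index of the closest district mean."""
    return [
        min(range(len(district_means)), key=lambda k: abs(p - district_means[k]))
        for p in prices
    ]

# Made-up numbers, for illustration only.
labels = assign_to_nearest_mean([9.7, 10.4, 10.6, 12.3], [10.0, 11.0])
print(labels)
```

With price as the only feature, each store simply lands in the district whose average is numerically closest.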

j*g
11
This is close, but I don't think there is enough information to do
classification, since we only look at the price.

[h*****7 wrote:]
: Use supervised clustering.
: After training, use the means as the initialization, then run one step of
: the clustering loop to get the memberships.

j*g
12
The result of this will be very dense around the mean... I don't think it
will work.

[B*******6 wrote:]
: Compute the likelihood of the samples under a distribution with the known
: means?

h*7
13
Not classification. It is called supervised clustering, which is quite
different from classification.

[j*******g wrote:]
: This is close, but I don't think there is enough information to do
: classification, since we only look at the price.

j*g
14
I did some homework... Are you suggesting using the average statistics as a
'teacher' to do clustering over our dataset? Thank you.

[h*****7 wrote:]
: Not classification. It is called supervised clustering, which is quite
: different from classification.

j*g
15
I don't think this is going to work. If you use price difference as the
distance, the clusters will again become very dense in price; they will
not correspond to the districts.

[h*****7 wrote:]
: Use supervised clustering.
: After training, use the means as the initialization, then run one step of
: the clustering loop to get the memberships.
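
The objection above can be illustrated with a small simulation, under assumptions that are not in the thread: two districts with normally distributed prices, a common standard deviation, and made-up means. It assigns each simulated store to the nearest district mean and reports how often that matches the true district.

```python
import random

def nearest_mean_accuracy(means, sd, n_per_district, seed=0):
    """Fraction of simulated stores whose nearest district mean is the true one.

    Prices are simulated as normal around each district mean (an assumption).
    """
    rng = random.Random(seed)
    correct = 0
    total = 0
    for true_k, m in enumerate(means):
        for _ in range(n_per_district):
            price = rng.gauss(m, sd)
            guess = min(range(len(means)), key=lambda k: abs(price - means[k]))
            correct += guess == true_k
            total += 1
    return correct / total

# Made-up parameters: district means close together vs. far apart.
overlapping = nearest_mean_accuracy([10.0, 11.0], sd=2.0, n_per_district=5000)
separated = nearest_mean_accuracy([10.0, 30.0], sd=2.0, n_per_district=5000)
print(f"overlapping: {overlapping:.2f}, separated: {separated:.2f}")
```

When the district means are close relative to the spread within a district, the assignment mostly recovers price bands and is only modestly better than a coin flip at recovering districts. It works well only when districts barely overlap in price.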
