[Data Science Project] Location data quality
Board: DataSciences
c*z
2
Hi all,
This is my first project at my new company, and it is about third-party
data quality. There is no gold standard for quality, but we know that
repeated locations in a dataset may imply bad quality, because such a
location may come from a centroid (e.g. a cell tower rather than a cell
phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between a vendor's data quality and the
distance of its location distribution from the known good ones. Here comes
the other moving part: what does distance mean here? Basically, each vendor
sends us requests to display ads, and each request carries a location.
Hence we can group by location and count how many times each location
appears, then group by frequency and count how many locations appear that
many times. This way each vendor yields a contingency table with two
columns: frequency and count.
For comparing such contingency tables, what would you suggest?
Or should I go back to the raw data, or to the intermediate table (location
and frequency)?
Thanks a lot!
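
The two group-by steps described above (requests → per-location frequencies → frequency/count table) might look like this in pandas; the coordinates are made-up illustration data:

```python
import pandas as pd

# Hypothetical ad-request log: one row per request, with the reported location.
requests = pd.DataFrame({
    "location": ["(40.71,-74.00)", "(40.71,-74.00)", "(34.05,-118.24)",
                 "(40.71,-74.00)", "(34.05,-118.24)", "(47.61,-122.33)"],
})

# Step 1: how many times does each location appear?
per_location = requests.groupby("location").size().rename("frequency")

# Step 2: how many locations appear that many times?
table = (per_location.value_counts()
         .rename_axis("frequency").rename("count").sort_index())
print(table)
```

The result is the two-column (frequency, count) table each vendor reduces to.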
w*p
4
You could build an index to quantify data quality: for example, set a
penalty for a location repeating twice, another penalty for repeating three
times, and so on. The penalty values can be assigned based on business
understanding, and they can be adjusted later.
That way every vendor gets a data-quality score. Then map this score
against the distance to the good vendors?
Just my humble opinion; experts, please don't laugh.
c*z
6
There is a scale problem with this approach: some vendors have 100 times
more data. But maybe we can try normalizing to percentages...
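
A minimal sketch of w*p's penalty index, normalized to percentages to address the scale problem; the penalty values here are hypothetical placeholders, not anything from the actual project:

```python
# Hypothetical penalty per repetition frequency; in practice these would
# come from business understanding and be tuned.
PENALTIES = {1: 0.0, 2: 0.1, 3: 0.3}
CAP = 1.0  # penalty applied to any frequency not listed above

def quality_score(freq_count):
    """freq_count maps frequency -> number of locations with that frequency.
    Returns a penalty-weighted score in [0, 1]; lower means worse quality."""
    total = sum(freq_count.values())
    if total == 0:
        return 1.0
    # Normalize counts to fractions so vendors of different sizes compare.
    penalty = sum(PENALTIES.get(f, CAP) * c / total
                  for f, c in freq_count.items())
    return 1.0 - penalty

good = {1: 95, 2: 5}        # locations mostly appear once
bad = {1: 10, 1000: 90}     # many massively repeated locations
print(quality_score(good), quality_score(bad))
```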
g*o
8
Seeing a contingency table, my first thoughts are the chi-square test and
Fisher's exact test.

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

c*z
10
I tried chi-square and G tests, but sometimes a bad partner is closer to a
good one than another good one is (e.g. dist(G1, G2) > dist(G1, B1)). Also,
both tests drop the zero cells, while these cells are important to us (e.g.
one bad partner has locations that repeat millions of times, which never
happens for good partners, and the G test omits this case).
Fisher's exact test is exponential and too slow for our case, since there
are thousands of rows. Maybe we can rebin the table into fewer rows, but I
would like to delay rebinning as much as possible, since it loses
information.
Thanks a lot!
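
For reference, a minimal sketch of the kind of chi-square and G statistics being compared here, computed directly on two aligned (frequency, count) tables. The add-one smoothing over the union of frequencies, so that zero cells contribute instead of being dropped, is my assumption, not part of the original analysis:

```python
import math

def align(t1, t2):
    """Align two frequency->count tables on the union of frequencies,
    with add-one smoothing so zero cells contribute instead of vanishing."""
    keys = sorted(set(t1) | set(t2))
    a = [t1.get(k, 0) + 1 for k in keys]
    b = [t2.get(k, 0) + 1 for k in keys]
    return a, b

def chi_square(t1, t2):
    """Pearson chi-square style distance, comparing t1's proportions
    against t2's proportions (treated as the expected distribution)."""
    a, b = align(t1, t2)
    na, nb = sum(a), sum(b)
    return sum((x / na - y / nb) ** 2 / (y / nb) for x, y in zip(a, b))

def g_stat(t1, t2):
    """G-style statistic (proportional to a KL divergence) between the tables."""
    a, b = align(t1, t2)
    na, nb = sum(a), sum(b)
    return 2 * sum((x / na) * math.log((x / na) / (y / nb))
                   for x, y in zip(a, b))

good = {1: 900, 2: 90, 3: 10}
bad = {1: 100, 1000: 50}   # heavy repetition: the suspected-centroid signature
print(chi_square(good, bad), g_stat(good, bad))
```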
c*z
12
In some sense this is similar to word distributions across documents, where
I am measuring the distance between documents using their count tables
(rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is wealth distribution (e.g. the Gini index).
Any suggestions are extremely welcome! Thanks a lot!
l*m
14
Do you have other data, such as user id, IP address, timestamp, carrier id,
app id...?
With additional info, it is much easier.


【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

l*n
15
The statistical tests on contingency tables mentioned in previous posts do
not help in this case, because they only tell you whether two tables are
different. Likewise the Gini index: it tells you how unequal income is
across a nation's population, but not which population has good income.
What you need is a criterion to measure the goodness of the data. I would
suggest entropy or some measure of variation.


【在 c***z 的大作中提到】
: In some sense this is similar to the word distributions in documents and I
: am measuring the distance between the documents using the count tables (
: rather, aggregated count tables with only two columns: frequency and count).
: Another analogy I can think of is the wealth distribution (e.g. Gini index).
: Any suggestions are extremely welcome! Thanks a lot!
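
A minimal sketch of the entropy criterion l*n suggests, computed over a vendor's (frequency, count) table. Treating the table as a distribution of traffic across repetition frequencies is my reading of the thread, not something the post specifies:

```python
import math

def frequency_entropy(freq_count):
    """Shannon entropy (in bits) of how a vendor's traffic is spread across
    repetition frequencies. freq_count maps frequency -> number of locations.
    Each frequency is weighted by the records it accounts for (f * count)."""
    weights = {f: f * c for f, c in freq_count.items()}
    total = sum(weights.values())
    probs = [w / total for w in weights.values() if w > 0]
    return -sum(p * math.log2(p) for p in probs)

# A vendor whose traffic is spread across many distinct locations...
spread = {1: 1000, 2: 100, 3: 10}
# ...versus one whose traffic piles onto a few massively repeated points.
piled = {1: 10, 100000: 5}
print(frequency_entropy(spread), frequency_entropy(piled))
```

Unlike the pairwise tests, this gives each vendor a standalone "goodness" number that needs no reference distribution.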

g*o
16
I'm not really sure. Since entropy was also mentioned: could mutual
information or KL divergence be used on the binned location counts,
comparing good vendors against bad ones?


【在 c***z 的大作中提到】
: I tried Chi square and G tests, but sometimes a bad partner is closer to a
: good one than another good one (e.g. dist(G1, G2) > dist(G1, B1)). Also,
: they both drop the zero cells, while these cells are important to us (e.g.
: one bad partner has locations that repeat millions of times, while this
: never happen for good partners, and in G test this case will be omitted).
: Fisher's test is exponential and too slow for our case, while there are
: thousands of rows. Maybe we can rebin the table to make it fewer rows, but I
: would like to delay rebinning as much as possible, since it loses
: information.
: Thanks a lot!
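
One drawback of g*o's KL suggestion is that KL divergence is asymmetric; a common symmetrized, bounded variant is the Jensen-Shannon divergence. A sketch on two (frequency, count) tables; the add-one smoothing over the union of frequencies is an assumption, chosen to keep the zero cells c*z cares about:

```python
import math

def _smooth(table, keys):
    raw = {k: table.get(k, 0) + 1 for k in keys}   # add-one smoothing
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

def js_divergence(t1, t2):
    """Jensen-Shannon divergence between two frequency->count tables:
    symmetric, bounded by ln(2), and defined even when cells are zero."""
    keys = set(t1) | set(t2)
    p, q = _smooth(t1, keys), _smooth(t2, keys)
    m = {k: (p[k] + q[k]) / 2 for k in keys}
    kl = lambda a, b: sum(a[k] * math.log(a[k] / b[k]) for k in keys)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

good = {1: 900, 2: 90, 3: 10}
bad = {1: 100, 1000000: 50}  # the "repeats millions of times" pattern
print(js_divergence(good, bad))
```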

w*p
17
This is what I meant in my first post: create a data-quality score using
some criteria, then analyze the relationship between this score and the
distance.
Or, in other words, you can calculate a "distance" using the location
repetition frequency. A good definition of this "distance" and an
appropriate transformation would ultimately give it a linear relation with
the physical distance.


【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

c*z
18
I had the same concern that there might not be an intrinsic relationship
between the distance/difference and quality/performance. I also proposed
that we should focus on the goodness of the data. But at this moment I am
asked to focus on the distance.
I think my boss's logic is to build the wheel first and then find a use for
it, rather than first studying whether we need the wheel.
PS: I don't have other data yet; I am not very familiar with all the data yet.
PS2: I tried the G test, which is related to KL divergence, but it didn't
work well.
PS3: I don't have physical locations yet; the tables I have are aggregated
one level higher, containing only two columns: location frequency, and how
many locations are repeated that many times. Maybe I should propose going
back to the finer-grained table with location and frequency.
PS4: Just tried cosine distance, and it is not working well either. Some bad
partners are closer to good ones than the good ones are to each other.
Thanks so much for your replies!


【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

c*z
19
The problem is that we have neither a good criterion for quality, nor one
for distance, nor an intrinsic relationship between the two...

【在 w**p 的大作中提到】
: This is what I meant at the first point.
: Create a data quality score using some criteria then analyze the
: relationship between this score and the distance.
: Or, in other words, you can calculate a "distance" using the location
: repetition frequency. A good definition of this "distance" and an
: appropriate transformation will finally make it has a linear relation with
: the physical distance.
:
: different

l*n
20
I have had many projects like this, which are more science projects than
real business projects. I usually go back to the client and ask for
clarification and objectives. It is also an opportunity to educate your
client about what can and cannot be done.
It is your show time; don't be too shy to say it does not make sense.

【在 c***z 的大作中提到】
: The problem is that we have neither a good criteria for quality nor for
: distance nor an intrinsic relationship between the two...

c*z
21
Sigh, that is still hard to pull off, especially when I have just started
the job and haven't built up much credit.
I did repeatedly ask for clarification and objectives. The boss went from
calling it open-ended at the start to settling on distance, so distance is
what I am working on. If it really doesn't pan out, I'll go back and
suggest we work on performance instead.

【在 l******n 的大作中提到】
: I have many projects like this which is more of science project other than
: real business project. I usually go back to the client and ask for
: clarification and objectives. Also it is the opportunity to educate your
: client what can be done and what can't.
: It is your show time, and don't be too shy to say it does not make sense.

m*a
22
I have read this many times and still can't figure out what the OP actually
wants to do. The problem is not well defined, which makes it hard to start.
Can you set the data aside first and look at the problem from the business
side, to see what they actually want? Once you know the business objective,
come back and figure out how to use the data.
Right now it seems unclear even at what level the problem is defined:
location or vendor.
Two ideas, after aggregating to some level:
1. Clustering: see whether the unknowns cluster together with the known
good ones.
2. Classification: see whether the scores look like the good ones. Try the
machine-learning sense of bootstrapping (not the same thing as
bootstrapping in statistics).
But I don't know how large the sample actually is.


【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

c*z
23
Exactly. I have proposed starting from the business questions.
And this is the reply from my boss:
"I am not clear what kind of answers from 'business' you are looking for.
It has always been the same: the ability to differentiate good location
quality traffic from bad location quality traffic."
Still, no idea what "good traffic" means; we just have a bunch of good/bad
traffic samples and need to generalize to a definition.
So we have neither a definition of goodness, nor a definition of the
metric, nor an idea about the intrinsic relation between the two. We are
just exploring.
I tried clustering with a few data points (each vendor is a point) and the
bad ones are mixed in with the good ones. The metrics I used are X^2, G,
RMSE, cosine, area between curves, etc.
I also tried classification, but there are too few features and data
points, and there is serious overfitting.
Can you explain a bit about the difference between bootstrapping in ML and
in statistics?
Thanks so much!

【在 m******a 的大作中提到】
: 读了很多遍,还是没弄明白楼主到底想干啥,问题没有很好定义,很难下手
: 能否先抛开这些数据, 从Business 那边来看这个问题, 看看他们到底想干什么
: 知道Business 那边的目的了,再回头看这些数据怎样用
: 现在好象是连在什么 Level - Location 还是Vendor 上来定义问题都不清楚
: 想了两个办法, aggregate to some level 之后
: 1. cluster 的办法, 看看能否和已知的好的 cluster 到一块
: 2. classification 的办法, 看看 score 是否像好的, 试试机器学习中
: bootstrapping 的办法 - 和统计中的bootstrapping 不是一个东西
: 但不知道到底有多大的 sample
:

c*z
24
Got some progress. I did a clustering analysis on 150 vendors (112 good
and 38 bad), using a strange metric (the average height of the area between
two log-log curves).
The result is almost too good to be true: in group 1, everyone is bad; in
group 2, everyone except one is good.
The interesting thing is that as I throw in more data points, things can
get worse or better...
Take a look at the picture. Any suggestions and comments are extremely
welcome!


【在 c***z 的大作中提到】
: Exactly. I have propose to start from the business questions.
: And this is the reply from boss:
: "I am not clear what kind of answers from 'business' you are looking for. It
: has always been the same: Ability to differentiate good location quality
: traffic from bad location quality traffic."
: Still, no idea about what "good traffic" means, just a bunch of good/bad
: traffic samples, need to generalize to a definition.
: So we don't have a definition for goodness, nor a definition for metric, nor
: an idea about the intrinsic relation between the two. We are just exploring
: .
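
My reading of the "average height of the area between two log-log curves" metric: put both vendors' (frequency, count) curves in log-log space, interpolate onto a common grid, and average the absolute vertical gap. A sketch under that interpretation; the grid size and piecewise-linear interpolation are assumptions:

```python
import math

def loglog_curve(freq_count):
    """Sorted (log frequency, log count) points for a vendor's table."""
    return sorted((math.log(f), math.log(c)) for f, c in freq_count.items())

def interp(curve, x):
    """Piecewise-linear interpolation, clamped at the curve's endpoints."""
    if x <= curve[0][0]:
        return curve[0][1]
    if x >= curve[-1][0]:
        return curve[-1][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return curve[-1][1]  # fallback for floating-point edge cases

def avg_height_between(t1, t2, grid_size=100):
    """Average |vertical gap| between two log-log curves on a shared grid."""
    c1, c2 = loglog_curve(t1), loglog_curve(t2)
    lo = min(c1[0][0], c2[0][0])
    hi = max(c1[-1][0], c2[-1][0])
    xs = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    return sum(abs(interp(c1, x) - interp(c2, x)) for x in xs) / grid_size

good = {1: 900, 2: 90, 3: 10}
bad = {1: 100, 10: 80, 1000: 50}
print(avg_height_between(good, bad))
```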

T*u
25
Could you explain what the "average height of the area between two log-log
curves" is?


【在 c***z 的大作中提到】
: Got some progress. I did a clustering analysis on 150 vendors (112 good ones
: and 38 bad ones), using a strange metric (average height of the area
: between two log-log curves).
: The result is almost too good to be true: in group 1, everyone is bad; in
: group 2, everyone except one is good.
: The interesting thing is that as I throw in more data points, things can get
: worse or better...
: Take a look at the picture. Any suggestions and comments are extremely
: welcome!
:

T*u
26
I'm not sure exactly what you are doing, but this kind of
occurrence-frequency data feels like it could be related to Zipf's
distribution, or perhaps the log-normal distribution.
m*t
28
I'm really not familiar with this stuff; I'm following this thread to see
how the problem gets solved in practice.
But I'm a bit curious: why hierarchical clustering? I know it is
computationally convenient, but is there anything beyond that?


【在 c***z 的大作中提到】
: Thanks a lot! Will take a look at the zipf stuff.
: Just realized that the MKFC metric is just the Cramér-von Mises stat using
: raw count instead of probability mass. Will try Cramér-von Mises instead. :
: )
: http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Arn

c*z
29
I have been asking my boss the same question, about the practical use of
this abstract metric...
The reason we can't use k-means is that these metrics are actually not real
metrics, as they don't satisfy the triangle inequality, and hence the mean
means nothing (convergence of the mean doesn't imply convergence of the
variance). The only thing I can think of is then hierarchical clustering...
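
Hierarchical clustering indeed only needs pairwise dissimilarities, not a true metric; single linkage in particular depends only on the rank order of the dissimilarities. A sketch using scipy with a precomputed dissimilarity matrix; the toy matrix below is hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise dissimilarities among 4 vendors (G1, G2, B1, B2).
# They need not satisfy the triangle inequality; linkage just reads values.
D = np.array([
    [0.0, 0.2, 0.9, 1.1],
    [0.2, 0.0, 1.0, 0.8],
    [0.9, 1.0, 0.0, 0.3],
    [1.1, 0.8, 0.3, 0.0],
])

# linkage expects a condensed (upper-triangular) distance vector.
# Single linkage uses only the ordering of the dissimilarities.
Z = linkage(squareform(D), method="single")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # G1/G2 and B1/B2 land in separate clusters
```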
m*t
30
Maybe I never understood from the start what your metrics actually are...
Besides, doesn't a hierarchical method also require computing distances...?
I'm not too familiar with that fuzzy model... can it sidestep the distance
computation problem?


【在 c***z 的大作中提到】
: I have been asking the same question to my boss, about the practical use of
: this abstract metric...
: The reason we can't use k-mean is that these metrics are actually not real
: metrics, as they don't follow triangular inequality, and hence the mean
: means nothing (convergence of mean doesn't imply convergence of variance).
: The only thing I can think of is then hierarchical clustering...

c*z
31
Strictly speaking, these distances are not metrics but ordinals, so I can
do hierarchical clustering using only the order, IIRC. :)
c*z
32
Had some more progress. I ran 100 trials on 7 metrics, each with 200
vendors, and recorded the F1 scores. Attached is a plot of the F1 scores.


【在 c***z 的大作中提到】
: Strictly speaking, these distance are not metrics but ordinals, so I can do
: hierarchical clustering using the order, iirc. :)

c*z
33
The current question is to investigate the misclassified vendors (e.g. a
vendor that is hand-labeled good - the first letter being "G" - but that
the algorithm puts in the "bad" cluster).
The plots of TP and FN are awfully close to each other; so are TN and FP.
I am totally clueless now (as always)...
Any suggestion is extremely welcome! The x-axis is recurrence (i.e.
frequency) and the y-axis is traffic (i.e. the total volume of records with
locations repeated that many times).


【在 c***z 的大作中提到】
: Had some more progress. I run 100 trials on 7 metrics, each with 200 vendors
: , and recorded the F1 scores. Attached is the plot of the F1 score.
:
: do

c*z
34
Same comparison, in percentiles of recurrence and percentages of traffic.


【在 c***z 的大作中提到】
: Current question is to investigate the misclassified vendors (e.g. a vendor
: which is hand labeled good - the first letter being "G", but the algorithm
: puts in the "bad" cluster).
: The plots of TP and FN are awfully close to each other; also are TN and FP.
: I am totally clueless now (as always)...
: Any suggestion is extremely welcome! The x-axis is recurrence (i.e.
: frequency) and the y-axis is traffic (i.e. total volume of records with
: locations repeated that many times).
:
: vendors

c*z
35
Same comparison, in log-log.

【在 c***z 的大作中提到】
: Same comparison, in percentiles of recurrence and percentages of traffic.
:
: vendor
: .

T*u
36
Well done! Without revealing any trade secrets, I'd call for more hands-on
threads like this one. Extremely useful.
c*z
37
Interim summary:
Overall this task iterates between two steps: a training step that clusters
labeled samples, and a bootstrapping step that adds unlabeled samples to
increase coverage. Currently we can consider the first iteration of the
training step complete and move on to the bootstrapping step.
1. 2000+ good and 2000+ bad partners provided;
2. I conducted hierarchical clustering with seven metrics on a set of good
and bad samples; luckily the clusters correlate highly with the hand labels
- in other words, the in-group distances are usually smaller than the
between-group distances;
3. the four top-performing metrics were identified via 100 trials on 200
samples each;
4. consistently misclassified samples were identified, but the
investigation of the cause is currently on hold - no clear clue why they
are mislabeled;
5. an attempt to run trials on 4000 samples hit an engineering difficulty -
R is inefficient at computation on this scale;
6. I am currently working on the bootstrapping step to increase label
coverage; several methods are under consideration:
6a. measure the distance from the unlabeled sample to a typical good point
and to a typical bad point, then compare the two to decide a label; finding
typical good and bad points is troublesome, though;
6b. find the nearest neighbors of the unlabeled sample and decide a label
from them; we can use all four metrics and take a vote (ensemble learning);
6c. view this in a Bayesian way, i.e. assume the unknown sample is good,
find its nearest neighbor, and label the unknown with its neighbor's label;
the mean in-group and mean between-group distances can be used to produce a
confidence;
6d. use supervised learning, with the percentile percentages as features;
6e. confidence intervals are doable but need more research;
6f. engineering to scale up is doable as well; I need to pick up Java or
Scala (for Spark).
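
Step 6b (nearest-neighbor labeling with a vote across metrics) could be sketched as follows; the stand-in metric and the data here are hypothetical, not the project's actual metrics:

```python
from collections import Counter

def label_by_vote(unknown, labeled, metrics):
    """Label `unknown` by its nearest labeled neighbor under each metric,
    then take a majority vote across metrics (an ensemble of 1-NN labelers).
    labeled: list of (table, label); metrics: list of dissimilarity functions."""
    votes = []
    for dist in metrics:
        nearest = min(labeled, key=lambda item: dist(unknown, item[0]))
        votes.append(nearest[1])
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in metric: L1 gap between the two tables' proportions.
def l1_dist(t1, t2):
    keys = set(t1) | set(t2)
    s1, s2 = sum(t1.values()), sum(t2.values())
    return sum(abs(t1.get(k, 0) / s1 - t2.get(k, 0) / s2) for k in keys)

labeled = [({1: 90, 2: 10}, "good"), ({1: 5, 1000: 95}, "bad")]
unknown = {1: 80, 2: 20}
print(label_by_vote(unknown, labeled, [l1_dist]))  # prints: good
```

With the real four metrics, each one would cast one vote, and ties could be broken by the mean in-group distance as in 6c.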
c*z
38
Any suggestions and comments are extremely welcome! Thanks a lot!
c*z
39
Had some more progress. Using better data, and after correcting for
flipped clusters (i.e. usually the bad points are in cluster 1, but
occasionally they like cluster 2 better), I reached 95% accuracy in
clustering the points.
Now for the bootstrap step: I labeled each test point with its nearest
neighbor, and got 80% accuracy using a majority vote across the metrics. I
am modifying the algorithm to allow more false positives and fewer false
negatives, as required by the business.
The real headache is that when I look at the mislabeled cases, I have no
clue why they are mislabeled - hence I cannot make improvements.
Any suggestions and comments are extremely welcome! Thanks a lot!
avatar
c*z
40
Hi all,
This is my first project in the new company, and it is about third party
data quality. There is no gold standard for quality, but we know that
repetition of location in the dataset might imply bad quality, because in
this case the location might come from a centroid (e.g. a cell tower, rather
than a cell phone).
There is also no ground truth about which datasets are good, but we know
some good ones, particularly the channels we own.
We are exploring the relationship between data quality of a vendor and the
distance of its location distribution from the known good ones. Here comes
the other moving part, what does distance mean here. Basically, each vendor
provides us requests to display ads, with the request there is location.
Hence we can group by location and see how many times each appears. We can
then group by frequency and see how many locations appear that many times.
This way each vendor gives a contingency table with two columns: frequency
and count.
In terms of comparing contingency table, what would you suggest?
Or should I go back to the raw data, or the intermediate table (location and
frequency)?
Thanks a lot!
avatar
w*p
41
可以建一个index来量化数据质量的好坏,比如location repeat两次的pentaly是多少
,repeat三次的penalty是多少。而penalty多少可以根据business的理解来给,并且这
是可以调整的。
这样每个vendor都会得到一个data quality的分数。然后再map这个分数based on
distance to the good vender?
个人浅见,高手莫笑。
avatar
c*z
42
There is a scale problem in this approach, some vendors have 100 times more
data. But maybe we can try normalizing to percentages...
avatar
g*o
43
看到contigency table就想到chi-square和fisher.exact test了

rather

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

avatar
c*z
44
I tried Chi square and G tests, but sometimes a bad partner is closer to a
good one than another good one (e.g. dist(G1, G2) > dist(G1, B1)). Also,
they both drop the zero cells, while these cells are important to us (e.g.
one bad partner has locations that repeat millions of times, while this
never happen for good partners, and in G test this case will be omitted).
Fisher's test is exponential and too slow for our case, while there are
thousands of rows. Maybe we can rebin the table to make it fewer rows, but I
would like to delay rebinning as much as possible, since it loses
information.
Thanks a lot!
avatar
c*z
45
In some sense this is similar to the word distributions in documents and I
am measuring the distance between the documents using the count tables (
rather, aggregated count tables with only two columns: frequency and count).
Another analogy I can think of is the wealth distribution (e.g. Gini index).
Any suggestions are extremely welcome! Thanks a lot!
avatar
l*m
46
do you have other data, such as user id, ip address timestamp, carrier id,
app id....
with additional info, it is much easier

rather

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

avatar
l*n
47
The statistical tests on contingency table mentioned in previous posts do
not help in this case, because they only tell you whether they are different
. As Gini index, it tells you how inequality the income across a nation's
papulation, but does not tell you which population has good income.
What you need is criteria to measure the goodness of the data.I would
suggest you use entropy or some form of variation.

).
).

【在 c***z 的大作中提到】
: In some sense this is similar to the word distributions in documents and I
: am measuring the distance between the documents using the count tables (
: rather, aggregated count tables with only two columns: frequency and count).
: Another analogy I can think of is the wealth distribution (e.g. Gini index).
: Any suggestions are extremely welcome! Thanks a lot!

avatar
g*o
48
I'm not really sure.
as also mentioned using entropy.
would Mutual Information or KL-divergence be used based on the count (bin)
data of the locations between good and bad vendors?

I

【在 c***z 的大作中提到】
: I tried Chi square and G tests, but sometimes a bad partner is closer to a
: good one than another good one (e.g. dist(G1, G2) > dist(G1, B1)). Also,
: they both drop the zero cells, while these cells are important to us (e.g.
: one bad partner has locations that repeat millions of times, while this
: never happen for good partners, and in G test this case will be omitted).
: Fisher's test is exponential and too slow for our case, while there are
: thousands of rows. Maybe we can rebin the table to make it fewer rows, but I
: would like to delay rebinning as much as possible, since it loses
: information.
: Thanks a lot!

avatar
w*p
49
This is what I meant at the first point.
Create a data quality score using some criteria then analyze the
relationship between this score and the distance.
Or, in other words, you can calculate a "distance" using the location
repetition frequency. A good definition of this "distance" and an
appropriate transformation will finally make it has a linear relation with
the physical distance.

different

【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

avatar
c*z
50
I had the same concern that there might not be some intrinsic relationship
between the distance/difference and quality/performance. I also proposed
that we should focus on the goodness of the data. But at this moment I am
asked to focus on the distance.
I think the logic of my boss is to build wheels first then find a way to use
it, rather than study if we need the wheel first.
PS: I don't have other data yet, not very familiar with all the data yet.
PS2: I tried G test which is related to KL-divergence, but it didn't work
well.
PS3: I don't have physical locations yet, the tables I have are aggregated
to one level higher, containing only two columns: location frequency and how
many locations are repeated that many times. Maybe I should propose to go
back to the finer level table with location and frequency.
PS4: Just tried cosine distance, and it is not working well either. Some bad
partners are closer to good ones than they are to each other.
Thanks so much for your replies!

different

【在 l******n 的大作中提到】
: The statistical tests on contingency table mentioned in previous posts do
: not help in this case, because they only tell you whether they are different
: . As Gini index, it tells you how inequality the income across a nation's
: papulation, but does not tell you which population has good income.
: What you need is criteria to measure the goodness of the data.I would
: suggest you use entropy or some form of variation.
:
: ).
: ).

avatar
c*z
51
The problem is that we have neither a good criteria for quality nor for
distance nor an intrinsic relationship between the two...

【在 w**p 的大作中提到】
: This is what I meant at the first point.
: Create a data quality score using some criteria then analyze the
: relationship between this score and the distance.
: Or, in other words, you can calculate a "distance" using the location
: repetition frequency. A good definition of this "distance" and an
: appropriate transformation will finally make it has a linear relation with
: the physical distance.
:
: different

avatar
l*n
52
I have many projects like this which is more of science project other than
real business project. I usually go back to the client and ask for
clarification and objectives. Also it is the opportunity to educate your
client what can be done and what can't.
It is your show time, and don't be too shy to say it does not make sense.

【在 c***z 的大作中提到】
: The problem is that we have neither a good criteria for quality nor for
: distance nor an intrinsic relationship between the two...

avatar
c*z
53
唉,还是比较难做到啊,尤其是才开始工作,还没有多少credit
我也是反复ask for clarification and objectives,领导从一开始说free end到确定
要distance,我也就弄distance。实在不行了再跟头说我们还是弄performance吧。

【在 l******n 的大作中提到】
: I have many projects like this which is more of science project other than
: real business project. I usually go back to the client and ask for
: clarification and objectives. Also it is the opportunity to educate your
: client what can be done and what can't.
: It is your show time, and don't be too shy to say it does not make sense.

avatar
m*a
54
读了很多遍,还是没弄明白楼主到底想干啥,问题没有很好定义,很难下手
能否先抛开这些数据, 从Business 那边来看这个问题, 看看他们到底想干什么
知道Business 那边的目的了,再回头看这些数据怎样用
现在好象是连在什么 Level - Location 还是Vendor 上来定义问题都不清楚
想了两个办法, aggregate to some level 之后
1. cluster 的办法, 看看能否和已知的好的 cluster 到一块
2. classification 的办法, 看看 score 是否像好的, 试试机器学习中
bootstrapping 的办法 - 和统计中的bootstrapping 不是一个东西
但不知道到底有多大的 sample

rather

【在 c***z 的大作中提到】
: Hi all,
: This is my first project in the new company, and it is about third party
: data quality. There is no gold standard for quality, but we know that
: repetition of location in the dataset might imply bad quality, because in
: this case the location might come from a centroid (e.g. a cell tower, rather
: than a cell phone).
: There is also no ground truth about which datasets are good, but we know
: some good ones, particularly the channels we own.
: We are exploring the relationship between data quality of a vendor and the
: distance of its location distribution from the known good ones. Here comes

avatar
c*z
55
Exactly. I have proposed starting from the business questions.
And this is the reply from my boss:
"I am not clear what kind of answers from 'business' you are looking for. It
has always been the same: Ability to differentiate good location quality
traffic from bad location quality traffic."
Still no idea what "good traffic" means, just a bunch of good/bad traffic
samples; I need to generalize them into a definition.
So we have no definition of goodness, no definition of the metric, and no
idea about the intrinsic relation between the two. We are just exploring.
I tried clustering with a few data points (each vendor is a point) and the
bad ones got mixed in with the good ones. The metrics I used were chi-square
(X^2), G, RMSE, cosine, area between curves, etc.
I also tried classification, but with so few features and data points there
is serious overfitting.
Can you explain a bit about the difference between bootstrapping in ML and
in statistics?
Thanks so much!
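For concreteness, a minimal sketch of how a few of these distances could be computed once each vendor's (frequency, count) table is reduced to a probability mass over frequency buckets. The vendor tables and bucket size below are made up, and the G term here is computed on the masses rather than raw counts:

```python
import numpy as np

def to_pmf(table, max_freq):
    """Turn a {frequency: count} contingency table into a probability
    mass over frequency buckets 1..max_freq (hypothetical helper)."""
    p = np.zeros(max_freq)
    for freq, count in table.items():
        if 1 <= freq <= max_freq:
            p[freq - 1] = count
    total = p.sum()
    return p / total if total > 0 else p

def distances(p, q, eps=1e-12):
    """A few of the distribution distances mentioned above."""
    p, q = p + eps, q + eps                # avoid dividing by / logging zero
    return {
        "chi2": np.sum((p - q) ** 2 / q),            # X^2-style statistic
        "g": 2 * np.sum(p * np.log(p / q)),          # G-style (KL) statistic
        "rmse": np.sqrt(np.mean((p - q) ** 2)),
        "cosine": 1 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)),
    }

vendor_a = {1: 900, 2: 80, 3: 20}     # mostly unique locations: likely real devices
vendor_b = {1: 100, 50: 30, 200: 5}   # heavy repetition: likely centroids
pa, pb = to_pmf(vendor_a, 200), to_pmf(vendor_b, 200)
print(distances(pa, pb))
```

Normalizing to masses also sidesteps the scale problem mentioned earlier (some vendors having 100 times more data).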

avatar
c*z
56
Got some progress. I did a clustering analysis on 150 vendors (112 good ones
and 38 bad ones), using a strange metric (average height of the area
between two log-log curves).
The result is almost too good to be true: in group 1, everyone is bad; in
group 2, everyone except one is good.
The interesting thing is that as I throw in more data points, things can get
worse or better...
Take a look at the picture. Any suggestions and comments are extremely
welcome!
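A possible reconstruction of that metric (not the original code): interpolate both vendors' log(frequency) vs log(count) curves onto a shared grid and average the vertical gap between them:

```python
import numpy as np

def loglog_curve(table):
    """Turn a {frequency: count} table into sorted (log f, log c) points."""
    freqs = sorted(table)
    x = np.log(np.array(freqs, dtype=float))
    y = np.log(np.array([table[f] for f in freqs], dtype=float))
    return x, y

def avg_height_between(table_a, table_b, n_grid=200):
    """Average vertical gap between two log-log curves over their shared
    frequency range: roughly the area between the curves divided by the
    width of that range."""
    xa, ya = loglog_curve(table_a)
    xb, yb = loglog_curve(table_b)
    lo, hi = max(xa.min(), xb.min()), min(xa.max(), xb.max())
    if hi <= lo:                          # curves share no frequency range
        return float("inf")
    grid = np.linspace(lo, hi, n_grid)
    gap = np.abs(np.interp(grid, xa, ya) - np.interp(grid, xb, yb))
    return gap.mean()                     # mean gap on a uniform grid
```

Two vendors whose counts differ by a constant factor at every frequency would get a constant gap of log(factor), which matches the Zipf-style intuition raised later in the thread.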

avatar
T*u
57
Can you explain what "average height of the area between two log-log curves" means?

avatar
T*u
58
I'm not sure exactly what you're doing, but this kind of occurrence-frequency data feels like it could be related to Zipf's distribution, or perhaps to the log-normal distribution.
avatar
m*t
60
I'm really not familiar with this stuff; I'm following this thread to see how the problem actually gets solved.
But I'm curious: why hierarchical clustering? I know it's computationally convenient, but is there anything else?

【在 c***z 的大作中提到】
: Thanks a lot! Will take a look at the zipf stuff.
: Just realized that the MKFC metric is just the Cramér-von Mises stat using
: raw count instead of probability mass. Will try Cramér-von Mises instead. :
: )
: http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Arn

avatar
c*z
61
I have been asking my boss the same question, about the practical use of
this abstract metric...
The reason we can't use k-means is that these "metrics" are actually not
real metrics: they don't satisfy the triangle inequality, so the mean is
meaningless (convergence of the mean doesn't imply convergence of the variance).
The only alternative I can think of is hierarchical clustering...
avatar
m*t
62
Maybe I never understood from the start what your metrics actually are...
Also, with a hierarchical method don't you still have to compute distances?
I'm not familiar with your fuzzy model... can it avoid computing distances?

avatar
c*z
63
Strictly speaking, these distances are not metrics but ordinals, so I can do
hierarchical clustering using only the ordering, IIRC. :)
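For illustration, single-linkage agglomerative clustering only ever compares distances, never averages them, so an ordinal "distance" is indeed enough. A minimal O(n^3) sketch on a precomputed distance matrix (assumed, not the production code):

```python
import numpy as np

def single_linkage_clusters(dist, n_clusters=2):
    """Merge the closest pair of clusters until n_clusters remain.
    Only comparisons of dist[i, j] are used, so any ordinal distance works."""
    n = dist.shape[0]
    labels = list(range(n))               # start: every point is its own cluster
    while len(set(labels)) > n_clusters:
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] != labels[j] and (best is None or dist[i, j] < best[0]):
                    best = (dist[i, j], labels[i], labels[j])
        _, keep, merge = best
        labels = [keep if lab == merge else lab for lab in labels]
    return labels

# toy example: points 0,1 are close to each other, as are 2,3
d = np.array([[0, 1, 9, 9],
              [1, 0, 9, 9],
              [9, 9, 0, 1],
              [9, 9, 1, 0]], dtype=float)
print(single_linkage_clusters(d))  # -> [0, 0, 2, 2]
```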
avatar
c*z
64
Had some more progress. I ran 100 trials on 7 metrics, each with 200 vendors,
and recorded the F1 scores. Attached is a plot of the F1 scores.
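For reference, the F1 score is just the harmonic mean of precision and recall; a small illustrative helper from confusion counts:

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(30, 10, 10))  # -> 0.75
```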

avatar
c*z
65
The current question is to investigate the misclassified vendors (e.g. a
vendor hand-labeled good, its label starting with "G", that the algorithm
puts in the "bad" cluster).
The plots of TP and FN are awfully close to each other, and so are TN and FP.
I am totally clueless now (as always)...
Any suggestion is extremely welcome! The x-axis is recurrence (i.e.
frequency) and the y-axis is traffic (i.e. the total volume of records with
locations repeated that many times).

avatar
c*z
66
Same comparison, in percentiles of recurrence and percentages of traffic.

avatar
c*z
67
Same comparison, in log-log.

avatar
T*u
68
Impressive! We need more practical threads like this, as long as no trade secrets are revealed. Extremely useful.
avatar
c*z
69
Interim summary:
Overall this task can be conducted iteratively between two steps: a training
step that clusters labeled samples, and a bootstrapping step that adds
unlabeled samples to increase coverage. Currently we can consider the first
iteration of the training step complete and move on to the bootstrapping step.
1. 2000+ good and 2000+ bad partners provided;
2. I conducted hierarchical clustering analysis with seven metrics on a set
of good and bad samples; luckily the clusters are highly correlated with the
hand labels - in other words, the within-group distances are usually smaller
than the between-group distances;
3. the four top-performing metrics were identified with 100 trials on 200
samples each;
4. consistently misclassified samples were identified, but investigation of
the cause is currently on hold - no clear clue yet why they are misclassified;
5. an attempt to run on 4000 samples hit engineering difficulties - R is
inefficient at computation on this scale;
6. I am currently working on the bootstrapping step to increase label
coverage; several methods are being considered:
6a. measure the distance from the unlabeled sample to a typical good point
and a typical bad point, then compare the two to decide a label; finding
typical good and bad points is troublesome though;
6b. find the nearest neighbors of the unlabeled sample and decide a label
from them; we can use all four metrics and conduct a vote (ensemble learning);
6c. view this in a Bayesian way, i.e. assume the unknown sample is good,
find its nearest neighbor, and label the unknown with its neighbor's label;
the mean within-group and mean between-group distances can be used to
produce a confidence;
6d. use supervised learning, with the percentile percentages as features;
6e. confidence intervals are doable but require more research;
6f. engineering to scale up is doable as well; I need to pick up Java or
Scala (for Spark).
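Option 6b could be sketched like this; the metrics and data below are toy placeholders (plain vectors and standard norms), not the actual vendor distributions or the four chosen metrics:

```python
import numpy as np
from collections import Counter

def vote_label(unknown, labeled, labels, metrics):
    """Under each metric, find the unlabeled sample's nearest labeled
    neighbor; then let the metrics vote on the final label."""
    votes = []
    for metric in metrics:
        d = [metric(unknown, v) for v in labeled]
        votes.append(labels[int(np.argmin(d))])   # 1-NN label under this metric
    return Counter(votes).most_common(1)[0][0]    # majority vote

metrics = [
    lambda a, b: np.linalg.norm(a - b),   # Euclidean
    lambda a, b: np.abs(a - b).sum(),     # Manhattan
    lambda a, b: np.max(np.abs(a - b)),   # Chebyshev
]
labeled = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
labels = ["good", "bad"]
print(vote_label(np.array([1.0, 0.5]), labeled, labels, metrics))  # prints "good"
```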
avatar
c*z
70
Any suggestions and comments are extremely welcome! Thanks a lot!
avatar
c*z
71
Had some more progress. Using some better data, and after correcting for
flipped clusters (i.e. usually the bad points land in cluster 1, but
occasionally they like cluster 2 better), I got 95% accuracy in clustering
the points.
Now for the bootstrap step: I labeled each test point with its nearest
neighbor's label and got 80% accuracy using a majority vote across the
metrics. I am modifying the algorithm to allow more false positives and
fewer false negatives, as the business requires.
The real headache is that when I look at the mislabeled cases I have no clue
why they are mislabeled - hence I cannot make improvements.
Any suggestions and comments are extremely welcome! Thanks a lot!
avatar
c*z
72
Finally finished the project.
Summary of findings: Overall we were able to verify the hypothesis regarding
partner quality and location recurrence; we were also able to design a fast
mechanism to classify partners by data quality based on location recurrence;
finally, we were able to identify and correct errors in the hand labeling.
The engineers will take over the implementation, even though I would like to
do it myself in Scala + Spark...
avatar
c*z
73
Attached are two plots; you can see that the false negatives and true
negatives are very similar, as are the positives. This suggests that the
algorithm is most likely right and the hand labels are wrong. The business
folks have gone off to verify.
