[Data Science Project Case] Fuzzy matching on names - MITBBS (未名空间) archive, DataSciences board


c*z 2014-04-04 07:04

#1

We have two data sets, one for product views and one for actual
purchases. We don't have all the shopping cart information and need to
infer the missing ones.
To make a training case we need to join the two sets, and the cart id
and item names are the only available keys. The problem is the items
can have many names in both sets, e.g. Dell 17" XPS and Dell XPS
Laptop 17 inch mean the same item.
I am thinking of two ways: tf-idf to identify the first three words of
item names; or clustering using edit distance.
This would be the first time I am doing a text analysis project, so I
am wondering if I need a lot of data, instead of just a smaller
sample, as well as what would be the best approach and tools. I am
familiar with R, Matlab, Pig and some Scala, and am willing to learn
other languages as well.
Thanks a lot!

C*i 2014-04-04 07:04

#2

This looks like a good fit:
http://openrefine.org/
Also, I think a language like Python should be well suited to this kind of
processing:
http://stackoverflow.com/questions/682367/good-python-modules-f
http://stackoverflow.com/questions/2923420/fuzzy-string-matchin

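h*3 2014-04-04 07:04

#3

tf-idf is definitely not reliable here. tf-idf is normally applied to a
document, an article of at least a few thousand words. You only have a few
words per name, so the tf you compute comes from a handful of samples and is
meaningless.
You could consider clustering with edit distance, but that is too slow: edit
distance has O(N^2) complexity. A simpler option is the Jaccard index, which
is the size of the intersection of the two word sets divided by the size of
their union.
Still, I think the most reliable approach is to first get a dictionary and
collect all the brand names, then get another dictionary and collect the
product category words. That gives you matching at the semantic level. If you
only look at the words, you may well put an iPhone case together with an
iPhone...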
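
A minimal Python sketch of the Jaccard index described above, assuming names are lowercased and split into alphanumeric words (the tokenization rule and the helper names `tokens` and `jaccard` are illustrative choices, not something specified in this thread):

```python
import re

def tokens(name: str) -> set[str]:
    """Lowercase a product name and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard index: size of the intersection / size of the union of the word sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # assumption: two empty names count as identical
    return len(ta & tb) / len(ta | tb)

# 3 shared words (dell, 17, xps) out of 5 distinct words -> 0.6
print(jaccard('Dell 17" XPS', "Dell XPS Laptop 17 inch"))
# 2 shared words out of 3, so an accessory scores high against the phone itself
print(jaccard("iPhone 5", "iPhone 5 case"))
```

The second call shows the caveat raised in this post: word overlap alone cannot tell a product from its accessory.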

[In reply to c***z]

: We have two data sets, one for product views and one for actual
: purchases. We don't have all the shopping cart information and need to
: infer the missing ones.
: To make a training case we need to join the two sets, and the cart id
: and item names are the only available keys. The problem is the items
: can have many names in both sets, e.g. Dell 17" XPS and Dell XPS
: Laptop 17 inch mean the same item.
: I am thinking of two ways: tf-idf to identify the first three words of
: item names; or clustering using edit distance.
: This would be the first time I am doing a text analysis project, so I

l*m 2014-04-04 07:04

#4

I would suggest using search-related frameworks or techniques. All of them
are based on indexing and are very fast.
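
To make the indexing idea concrete, here is a toy inverted index in Python. It is only a sketch of the principle behind search frameworks, not the API of any particular one; the function names and the sample names are made up for illustration.

```python
import re
from collections import defaultdict

def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def build_index(names):
    """Map each word to the positions of the names that contain it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for tok in tokens(name):
            index[tok].add(i)
    return index

def candidates(query, names, index):
    """Return names sharing at least one word with the query, most overlap first."""
    overlap = defaultdict(int)
    for tok in tokens(query):
        for i in index.get(tok, ()):
            overlap[i] += 1
    ranked = sorted(overlap, key=lambda i: (-overlap[i], i))
    return [names[i] for i in ranked]

purchases = ["Dell XPS Laptop 17 inch", "iPhone 5 case", "HP Pavilion 15"]  # made-up sample
index = build_index(purchases)
print(candidates('Dell 17" XPS', purchases, index))
```

Only names that share a word with the query are ever scored, which avoids comparing every name in one set against every name in the other.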

[In reply to c***z]

r*d 2014-04-04 07:04

#5

There is no clever trick for this..

[In reply to c***z]

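N*n 2014-04-04 07:04

#6

I also think you just have to do the analysis the hard way. Clustering with
edit distance is not very reliable.
In your example, you would need to identify every possible name variant of
Dell_17_XPS.
Perl should be very handy for this kind of text processing; I am not sure
about other tools.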
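
A small Python illustration of why character-level edit distance can be unreliable for product names: reordered or extra words inflate the distance even when the item is the same. This is the textbook Levenshtein algorithm; the example strings were lowercased and stripped of punctuation by hand.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Same item, different wording: the distance is large.
same_item = levenshtein("dell 17 xps", "dell xps laptop 17 inch")
# A one-character typo: the distance is 1.
typo = levenshtein("dell xps", "dell xpz")
print(same_item, typo)
```

Each comparison costs time proportional to the product of the two string lengths, and clustering needs many pairwise comparisons, so the approach also gets slow on large sets.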

[In reply to r*****d]

: There is no clever trick for this..

E*1 2014-04-04 07:04

#7

Maybe you can try the 're' module in Python. I once saw an example that
converts different phone number formats into one style:
(123)456-7890
123-456-7890
1234567890
+1(123)456-7890 (with country code)
123-456-7890 x321 (with extension)
and so on.
These numbers are essentially the same; the re module can help and save you
time.
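
A hedged sketch of that kind of normalization with Python's `re` module. It assumes US-style ten-digit numbers, an optional leading country code 1, and extensions written as "x321" or "ext. 321"; the function name and sample numbers are illustrative.

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce a US-style phone number to its digits, ignoring any extension."""
    # Drop a trailing extension such as "x321" or "ext. 321" (assumed formats).
    main = re.split(r"\s*(?:x|ext\.?)\s*\d+\s*$", raw, flags=re.IGNORECASE)[0]
    # Keep digits only.
    digits = re.sub(r"\D", "", main)
    # Strip a leading US country code if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

samples = ["(123)456-7890", "123-456-7890", "1234567890",
           "+1(123)456-7890", "123-456-7890 x321"]
print({normalize_phone(s) for s in samples})  # a single value: {'1234567890'}
```

The same pattern, normalize first and compare afterwards, carries over to product names (for example mapping `17"` and `17 inch` to one form).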

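c*t 2014-04-04 07:04

#8

Isn't this a regular expressions problem? Am I missing something?
Perl used to dominate the regex field. Nowadays every popular language has a
relevant package/lib; on a node you can use Java, Python or R. There are also
dedicated small tools, but I have not used them on a node.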

c*o 2014-04-04 07:04

#9

Just use SQL regex directly.

c*z 2014-04-04 07:04

#10

Thank you all!

c*z 2014-04-04 07:04

#11

Can you share more details? Thanks a lot!

[In reply to l*******m]

: I would suggest using search-related frameworks or techniques. All of them
: are based on indexing and are very fast.

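c*z 2014-04-04 07:04

#12

Can you share more details?
I used regex in R for a bit, but I don't know of a package that does this
kind of job...
Thanks a lot!

[In reply to c****t]

: Isn't this a regular expressions problem? Am I missing something?
: Perl used to dominate the regex field. Nowadays every popular language has a
: relevant package/lib; on a node you can use Java, Python or R. There are also
: dedicated small tools, but I have not used them on a node.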

c*z 2014-04-04 07:04

#13

I tried the Jaccard index and it worked well. I will take a look at cosine
distance and the other suggested methods as well. Thanks again! You guys are
very helpful!

c*z 2014-04-04 07:04

#14

Some updates:
We did a pilot with the Jaccard index. At the cost of 2 false positives, I
was able to add 15 true positives, in addition to the 5 true matches found by
exact matching (i.e. Jaccard distance is 0).
At a larger scale, I took a sample of 2000 matched records and painfully
eyeballed them.
Matches start to look good once the Jaccard index exceeds 0.35 (about 1500
records satisfy this condition, of which about 1000 are exact matches).
As a training set for purchase inference, this increases the sample size by
50%. In other possible applications, such as clustering for categories,
product IDs and brands, we may start from 0.35 as a criterion.
Any suggestions and comments are extremely welcome!
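
A rough Python sketch of threshold-based matching like the one described in this update. The 0.35 cutoff comes from the post; everything else (word-set tokenization, keeping only the best-scoring purchase per viewed item, the sample names) is an assumption made for illustration.

```python
import re

def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Jaccard index of the two word sets (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

def match(views, purchases, threshold=0.35):
    """For each viewed name, keep the best purchase name scoring above the threshold."""
    pairs = []
    for v in views:
        best = max(purchases, key=lambda p: jaccard(v, p))
        score = jaccard(v, best)
        if score > threshold:
            pairs.append((v, best, round(score, 2)))
    return pairs

views = ['Dell 17" XPS', "iPhone 5", "Kindle Fire"]  # made-up sample
purchases = ["Dell XPS Laptop 17 inch", "iPhone 5 case", "HP Pavilion 15"]
for row in match(views, purchases):
    print(row)
```

In this toy data "iPhone 5" pairs with "iPhone 5 case" at about 0.67, the kind of false positive that eyeballing a sample is meant to catch.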

l*0 2014-04-04 07:04

#15

So you use the Jaccard index to compute the similarity between two product
names? For example, Dell 17" XPS and Dell XPS Laptop 17?

[In reply to c***z]

: Some updates:
: We did a pilot with Jaccard index, and at the cost of 2 false positives, I
: was able to add 15 true positives, in addition to the 5 true matches by
: exact matching (i.e. J distance is 0).
: At a larger scale, I took a sample of 2000 matched records and painfully
: eyeballed them.
: It seems that after Jaccard index > 0.35 (about 1500 records satisfy this
: condition, among which about 1000 are exact match), things start to look
: good.
: In terms of training set for purchase inferencing, this increases the sample

c*z 2014-04-04 07:04

#16

Correct.
Last Friday we tested this method on item names from NPD, and it worked
well (we did a K-S test on NPD's data and our own data).

[In reply to l******0]

: So you use the Jaccard index to compute the similarity between two product
: names? For example, Dell 17" XPS and Dell XPS Laptop 17?

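l*0 2014-04-04 07:04

#17

How do you distinguish things like iPhone 5, iPhone 4, iPhone screen and
iPhone battery? Or perhaps the product names in your data have little
ambiguity to begin with, so whatever you do the results will not be bad. Use
a regex to extract the strings that start with a capital letter, then uniq
and sort them; that should make the patterns in these product names fairly
easy to see.
However, online text sometimes skips capitalization, e.g. iphone or obama.
But for product reviews scraped from the web, the product's official name
should be available.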
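
A quick Python sketch of that suggestion: pull out the words that start with a capital letter, then count and sort them. The sample names are made up, and the case-insensitive variant is an added assumption to cover text that skips capitals.

```python
import re
from collections import Counter

names = ["Dell XPS Laptop 17 inch", 'Dell 17" XPS',
         "iPhone 5 case", "new Kindle Fire"]  # made-up sample

# Words whose first character is an uppercase letter.
capitalized = Counter(w for n in names for w in re.findall(r"\b[A-Z]\w*", n))

# Case-insensitive variant: every word that starts with a letter, lowercased.
any_case = Counter(w for n in names for w in re.findall(r"\b[a-z]\w*", n.lower()))

# Equivalent of "uniq sort": distinct words in sorted order, with counts.
for word, count in sorted(capitalized.items()):
    print(word, count)
```

Note that "iPhone" starts with a lowercase letter, so the capital-initial pattern misses it; only the case-insensitive count picks it up.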

[In reply to c***z]

: Correct.
: Last Friday we tested this method on items names from NPD, and it worked
: well (we did a K-S test on NPD's data and our own data).

c*z 2014-04-04 07:04

#18

Yes, model/brand/category matter a lot. We append them to the item name a
number of times to increase their weight.
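
One way to read this weighting trick, sketched in Python. Repeating a word has no effect on a plain set-based Jaccard index, so this sketch assumes a multiset (bag-of-words) version in which counts matter; whether that matches the actual implementation is not stated in the thread. The `key_words` dictionary and the `repeat` factor are hypothetical.

```python
import re
from collections import Counter

def bag(name, boost_words=(), repeat=3):
    """Count the words of a name; boosted words count `repeat` times,
    which is equivalent to appending extra copies of them to the name."""
    counts = Counter(re.findall(r"[a-z0-9]+", name.lower()))
    for w in boost_words:
        if w in counts:
            counts[w] *= repeat
    return counts

def weighted_jaccard(a, b):
    """Multiset Jaccard: sum of minimum counts / sum of maximum counts."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

key_words = {"dell", "hp", "xps"}  # hypothetical brand/model dictionary
a, b = "Dell laptop 17 inch", "HP laptop 17 inch"
plain = weighted_jaccard(bag(a), bag(b))                          # 3 / 5
boosted = weighted_jaccard(bag(a, key_words), bag(b, key_words))  # 3 / 9
print(plain, boosted)
```

With the brand words boosted, two names that differ only in brand drop from 0.6 to about 0.33, below the 0.35 cutoff mentioned earlier in the thread.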