[Data Science Project Case] Topic Learning - MITBBS (未名空间) Historical Archive



j*g 2014-07-25 07:07

Post 1

We have to work with quite a few really messy datasets, mostly the result
of bad ETL and sloppy client input. Within a single column, the content can
vary wildly. For example, in the column "store information", a value could
be the store name, which is good, or it could be just the brand, the
address, a short name like "ABC", or a meaningless code or string.
I see this as an unsupervised learning problem. There are several things we
want to achieve:
1. Assess the quality of a given column: come up with a probability or
confidence level for how well the actual content matches the column's topic.
2. Classify the content into several groups based on quality.
3. Generalize the approach so that for any new topic/content that comes in,
we can get a good idea of how good its quality is and how relevant it is.
When someone like me runs into a new term, I always Google it, look at the
results, and build up an idea of what the topic might be. So first of all, I
want to know whether we can do a similar kind of information retrieval with
a 3rd-party API. Since each cell contains so little text, it is difficult to
do topic modeling the way you would on a document. If we build a dictionary,
we have to take N-grams into consideration, and I don't know how to deal
with that.
I am quite new to the data science world, so any input will be greatly
appreciated.

D*u 2014-07-25 07:07

Post 2

My two cents: N-grams are not going to help much here.
You definitely need to build dictionaries, either for the good values, for
the trash, or for both. The next step is then a term-frequency calculation
problem. Do some research on TF-IDF or BM25. Don't be daunted by the names;
the algorithms are simple ways of counting and weighting term frequencies.
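
To make the dictionary plus term-frequency suggestion concrete, here is a minimal sketch in plain Python. It treats each cell of the column as a tiny document, computes a basic TF-IDF weight (term frequency times log(N / document frequency)), and scores a cell against a "good" and a "trash" dictionary. The function names, the sample cells, and both dictionaries are made up for illustration; they are assumptions, not something taken from the thread, and a real project would likely build the dictionaries from labeled samples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(cells):
    """Return one {term: weight} dict per cell, treating each cell as a document."""
    docs = [tokenize(c) for c in cells]
    n_docs = len(docs)
    # Document frequency: the number of cells in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def dictionary_score(cell, good_terms, trash_terms):
    """Score in [-1, 1]: share of good tokens minus share of trash tokens."""
    tokens = tokenize(cell)
    if not tokens:
        return 0.0
    good = sum(t in good_terms for t in tokens)
    trash = sum(t in trash_terms for t in tokens)
    return (good - trash) / len(tokens)


# Hypothetical sample of a "store information" column.
cells = ["Walmart Supercenter", "Walmart", "123 Main St", "ABC", "x9f3k"]
# Hypothetical dictionaries; in practice these would come from curated samples.
good_terms = {"walmart", "supercenter", "store"}
trash_terms = {"abc", "x9f3k"}

weights = tf_idf(cells)
scores = {c: dictionary_score(c, good_terms, trash_terms) for c in cells}
```

In this sketch, rarer terms get higher TF-IDF weights within a cell, and the dictionary score gives a rough per-cell quality signal that could later be bucketed into groups. BM25 could be swapped in for the weighting step; it mainly adds term-frequency saturation and length normalization on top of the same counts.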