Redian News
[Data Science Project Case] Topic Learning
DataSciences - Data Science
j*g
1
We have to work with quite a few badly messed-up datasets, mostly the
result of bad ETL and sloppy client input. Within a single column, the
content can vary wildly. For example, in the column "store information", the
value could be the store name, which is good, or it could be just the
brand, the address, a short name like "ABC", or some meaningless code or
string.
This looks like an unsupervised learning problem. There are several things we
want to achieve: (1) assess the quality of a given column, i.e. come up with
a probability or confidence level for how well the actual content matches
the column's topic; (2) classify the content into several groups based on
quality; (3) generalize the approach, so that for any topic/content that
comes in, we get a good idea of how good the quality is and how relevant the
values are.
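A minimal sketch of what a first pass at the grouping could look like, assuming simple surface patterns are enough to tell addresses, short names, and code-like strings apart. The group names, the regular expressions, and the sample values are all made up for illustration; this is a hand-written heuristic, not a learned model.

```python
import re
from collections import Counter

# Hypothetical street-type keywords; a real list would be much longer.
ADDRESS_RE = re.compile(
    r"\d+\s+\w+.*\b(street|st|ave|avenue|road|rd|blvd)\b", re.IGNORECASE
)

def surface_group(value):
    """Assign a cell value to a rough group based on its surface shape."""
    text = value.strip()
    if not text:
        return "empty"
    if ADDRESS_RE.search(text):
        return "address_like"
    # Two to five capital letters and nothing else, e.g. "ABC".
    if re.fullmatch(r"[A-Z]{2,5}", text):
        return "short_name"
    # No spaces, at least one digit: looks like an internal code.
    if re.fullmatch(r"[A-Za-z0-9_\-#/]+", text) and re.search(r"\d", text):
        return "code_like"
    return "name_like"

def group_counts(column):
    """Count how many values of a column fall into each group."""
    return Counter(surface_group(v) for v in column)

# Made-up sample values for a "store information" column.
sample = ["Acme Mart Downtown", "12 Main Street", "ABC", "XK-99812", "  "]
counts = group_counts(sample)
```

The share of values per group gives a crude column-level quality signal, for example the fraction that ends up in `name_like`.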
When people like me come across a new term, we usually just google it, look
at the results, and build up an idea of what the topic might be. So first of
all, I want to know whether we can do a similar kind of information
retrieval with a 3rd-party API. Since each cell in the column contains so
little information, it is difficult to do topic modeling the way you would
on a document. And if we build a dictionary, we have to take N-grams into
consideration, which I don't know how to deal with.
I am quite new to the data science world, so any input will be greatly
appreciated.
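On the N-gram concern, one rough way to handle it is to generate every word N-gram up to a small length and look each one up in the dictionary, so that multi-word entries can still match. The sketch below assumes a plain set of lowercase phrases as the dictionary; the entries and function names are hypothetical.

```python
import re

def word_ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def dictionary_hits(text, dictionary, max_n=3):
    """List the n-grams of `text` that appear in `dictionary`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [g for g in word_ngrams(tokens, max_n) if g in dictionary]

# Hypothetical dictionary of known multi-word phrases.
phrases = {"acme mart", "main street"}
hits = dictionary_hits("Acme Mart, 12 Main Street", phrases)
```

Keeping `max_n` small limits the number of lookups per cell, which matters little here because the cells are short.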
D*u
2
My two cents: N-grams are not going to help much here.
You definitely need to build dictionaries, either for the good values, for
the trash, or for both. The next step is then a "term frequency" calculation
problem. Do some research on TF-IDF or BM25, and don't be daunted by the
names; the algorithms are simple ways of counting and weighting term
frequency.
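To make the dictionary plus term-frequency idea concrete, here is a minimal TF-IDF sketch in plain Python. It treats every cell as a tiny document, weights each token by a smoothed IDF computed over the column, and scores a cell by how much of its weight falls in a "good" dictionary versus a "trash" dictionary. The dictionaries, the sample cells, and the scoring formula are assumptions for illustration, not a standard metric.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_table(cells):
    """Smoothed inverse document frequency, one 'document' per cell."""
    n = len(cells)
    df = Counter()
    for cell in cells:
        df.update(set(tokenize(cell)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def quality_score(cell, idf, good_terms, trash_terms):
    """Score in [-1, 1]: TF-IDF weight of good tokens minus trash tokens,
    divided by the total weight of the cell. Empty cells score 0.0."""
    tf = Counter(tokenize(cell))
    total = good = trash = 0.0
    for token, count in tf.items():
        weight = count * idf.get(token, 1.0)  # unseen tokens: weight 1.0 (assumption)
        total += weight
        if token in good_terms:
            good += weight
        elif token in trash_terms:
            trash += weight
    return (good - trash) / total if total else 0.0

# Made-up column values and dictionaries.
cells = ["Acme Mart Main Street", "Acme Mart", "XK-99812", "Acme"]
idf = idf_table(cells)
good_terms = {"mart", "main", "street"}
trash_terms = {"xk"}
scores = [quality_score(c, idf, good_terms, trash_terms) for c in cells]
```

BM25 would replace the raw `count * idf` weight with a saturating, length-normalized version, but the dictionary lookup around it stays the same.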