[Data Science Project Case] Topic Learning # DataSciences - Data Science
j*g
Post #1
We have to use quite a few really messy datasets, mostly due to bad ETL and
sloppy client input. Within a single column, the content can vary widely.
For example, in the column "store information", the content could be the
store name, which is good, or it could be just the brand, the address, a
short name like "ABC", or some meaningless code or string.
This looks like an unsupervised learning problem. There are several things
we want to achieve: 1, assess the quality of a given column and come up with
a probability or confidence level for how well the actual content matches
the topic. 2, classify the content into several groups based on quality. 3,
generalize the approach so that when any new topic/content comes in, we can
get a good idea of how good its quality is and how relevant it is.
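To make goal 2 a bit more concrete, here is a minimal sketch of how short column values might be turned into surface features and put into rough groups. The function names (surface_features, rough_group), the group labels, and every threshold are assumptions made up for illustration. They are not tuned on any real data, and in practice the features would more likely feed a clustering step than hand-written rules.

```python
import re


def surface_features(value: str) -> dict:
    """Compute simple surface statistics for one column value."""
    text = value.strip()
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return {
        "length": n,
        "token_count": len(text.split()),
        "digit_ratio": digits / n if n else 0.0,
        "alpha_ratio": letters / n if n else 0.0,
        "is_all_caps": text.isupper(),
        # A leading number followed by whitespace is a weak hint of a street address.
        "starts_with_number": bool(re.match(r"\d+\s", text)),
    }


def rough_group(value: str) -> str:
    """Assign a value to a coarse group using hand-picked (assumed) thresholds."""
    f = surface_features(value)
    if f["length"] == 0:
        return "empty"
    if f["starts_with_number"] and f["token_count"] >= 3:
        return "address-like"
    if f["token_count"] == 1 and f["digit_ratio"] >= 0.3:
        return "code-like"
    if f["token_count"] == 1 and f["is_all_caps"] and f["length"] <= 4:
        return "short-name"
    return "name-like"


if __name__ == "__main__":
    for sample in ["Blue Bottle Coffee", "123 Main Street", "ABC", "X7Q9-0041"]:
        print(sample, "->", rough_group(sample))
```

The rules are deliberately crude. Their only purpose is to show that even very short strings carry some signal (length, digit ratio, casing) that a grouping step could use.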
It is interesting how people like me learn about a new term: I always
google it, look at the results, and build an idea of what the topic might
be. So first of all, I want to know whether we can do a similar kind of
information retrieval with a 3rd party API. Since each column value contains
so little information, it is difficult to do topic modeling the way one
would on a document. If we build a dictionary, we have to take N-grams into
consideration, and I don't know how to deal with that.
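On the N-gram question, one possible starting point is a dictionary of character n-grams built from values already known to be good, with a new value scored by the share of its n-grams that appear in that dictionary. This is only a sketch under assumptions: the helper names (char_ngrams, build_ngram_dictionary, overlap_score), the choice of character trigrams, and the sample values are all hypothetical, and word-level n-grams could be substituted in the same structure.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list:
    """Split a value into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_dictionary(values, n: int = 3) -> Counter:
    """Count the n-grams seen across a collection of trusted values."""
    counts = Counter()
    for value in values:
        counts.update(char_ngrams(value, n))
    return counts


def overlap_score(value: str, dictionary: Counter, n: int = 3) -> float:
    """Return the fraction of the value's n-grams that occur in the dictionary."""
    grams = char_ngrams(value, n)
    if not grams:
        return 0.0
    seen = sum(1 for g in grams if g in dictionary)
    return seen / len(grams)


if __name__ == "__main__":
    # Hypothetical examples of values that were manually judged to be good.
    trusted = ["Main Street Bakery", "Main Street Cafe"]
    dictionary = build_ngram_dictionary(trusted)
    for candidate in ["Main Street Bakery", "Main Street Deli", "zzqx"]:
        print(candidate, "->", round(overlap_score(candidate, dictionary), 2))
```

The score is a rough similarity signal rather than a calibrated probability, so it would still need to be mapped to a confidence level in some way.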
I am quite new to the data science world, so any input will be greatly
appreciated.