A Closer Look at Android RunTime (ART) in Android L # MobileDevelopment - Mobile Development
c*z
1
This is something I am working on and would like to hear if you have any
clue.
Say we have millions of product names, such as "Xbox 360", "Playstation 4",
etc.
We want to extract (tokenize) meaningful information from billions of URLs (click history), and we want to distinguish the 360 in "Xbox 360" (useful) from the 360 in session ids (garbage).
For example, given
www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
The first 09 is size (keep) and the second 09 is garbage (drop)
We want: amazon nike running shoes 09 mens buy hello there; but we want to
drop: abc 123, as well as the second 09
Due to the size of the data, manually checking the names is impossible. Does
anyone have a clue?
I am thinking about a hash table, but that means the parsing time goes from O(1) to O(N), where N is in the millions!
Thanks!
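A minimal sketch of the hash-table idea, assuming the known product names fit in an in-memory set (the tiny vocabulary and sample URL below are made up): average-case set membership is O(1), so the per-URL cost scales with the number of tokens in that URL, not with the millions of names.

import re
from urllib.parse import unquote

# Hypothetical vocabulary of known product-name tokens (millions in practice).
product_vocab = {"xbox", "360", "playstation", "4", "nike", "running", "shoes"}

def tokenize(url):
    # Decode percent-escapes (%09 -> tab), then split on non-alphanumeric characters.
    decoded = unquote(url)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", decoded) if t]

tokens = tokenize("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123")
# Each membership test is O(1) on average, regardless of vocabulary size.
kept = [t for t in tokens if t in product_vocab]
print(kept)   # ['nike', 'running', 'shoes']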
I*y
3
Not sure if I understand correctly, just a couple of random thoughts:
It feels like you could classify URLs by their pattern and then extract the things you want.
Taking your example, URLs under the amazon domain all follow the pattern domain/brand/item%size/... so once that pattern is known, you can pull out the parts you need.
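A rough sketch of this per-domain pattern idea, assuming each site's URLs follow a fixed path layout (the pattern table below is invented for illustration):

import re
from urllib.parse import unquote, urlsplit

# Hypothetical per-domain path patterns: brand / item / optional tab-separated size.
DOMAIN_PATTERNS = {
    "www.amazon.com": re.compile(r"^/(?P<brand>[^/]+)/(?P<item>[^\t/]+)(?:\t(?P<size>[^/]+))?/"),
}

def extract(url):
    parts = urlsplit("http://" + url)        # prepend a scheme so the domain parses
    pattern = DOMAIN_PATTERNS.get(parts.netloc)
    if not pattern:
        return None                          # unknown domain: no pattern yet
    m = pattern.match(unquote(parts.path))   # %09 decodes to a tab character
    return m.groupdict() if m else None

print(extract("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"))
# {'brand': 'nike', 'item': 'running-shoes', 'size': 'mens'}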
c*z
4
Sounds good, will take a look at the patterns. Thanks a lot!
l*n
5
Just use regular expression matching.

[In reply to I******y's post above]

c*z
6
Can you give more details?
I did regular expression matching in R on company names; it was a pain in the butt, and that was only 10k names...
r*y
7
One way is to reconstruct the clickstream that leads to a sale. From the item sold, you can make sense of the URLs clicked along the way.

[In reply to c***z's original post]
b*L
8
running-shoes%09mens is not size 9; %dd is a percent-encoded ASCII byte (%09 is a tab).
So these URLs should be very easy to parse with a regex.
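For example, a quick sketch along those lines (assuming we only want word tokens from the decoded path and can drop the query string):

import re
from urllib.parse import unquote

url = "www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"
path = url.split("?", 1)[0]        # drop the query string
decoded = unquote(path)            # %09 -> "\t" (tab), %20 -> " ", etc.
print(re.findall(r"[A-Za-z0-9]+", decoded))
# ['www', 'amazon', 'com', 'nike', 'running', 'shoes', 'mens', 'buy']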
c*z
9
Thanks for everyone's input. I'm really not familiar with URLs, heh. I still have a lot to learn.
b*o
10
I don't really understand what you are saying.
In the example you gave, %09 is a percent-encoded character (a tab):
import urllib
# Python 2; in Python 3 this is urllib.parse.unquote
urllib.unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
Also, why drop q=... and x=... but keep ref=...? Functionally there is no difference between them; they are all params of the GET request. Or do you have a whitelist/blacklist of params?
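If there is such a whitelist, filtering the query params might look like this sketch (Python 3's urllib.parse; the KEEP_PARAMS whitelist is hypothetical):

from urllib.parse import urlsplit, parse_qs

KEEP_PARAMS = {"ref"}   # hypothetical whitelist of params worth keeping

url = "www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"
params = parse_qs(urlsplit("http://" + url).query)   # values come back percent-decoded
kept = {k: v for k, v in params.items() if k in KEEP_PARAMS}
print(kept)   # {'ref': ['hello\tthere']}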

[In reply to c***z's original post]
c*z
11
Ah, I just realized I know too little about URL parsing. I will ask the engineers so that I can ask the question more intelligently.

[In reply to b*****o's post above]
l*0
12
"we have millions of product names, such as "Xbox 360""
--- Are the 'millions of product names' known and in your database, or unknown?
Do you want to extract company name --> product name from the URL, or something else?
My first impression is to sort all the URL lines alphabetically, which would make it much easier to identify the different URL patterns from different sites.
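As a rough sketch of how grouping might surface the templates (the normalization rules here are arbitrary, just for illustration): collapse the variable parts so that URLs generated from the same template map to the same shape and sort together.

import re
from collections import Counter

def url_shape(url):
    # Collapse long alphanumeric blobs (session ids, hashes) and digit runs
    # so that URLs generated from the same template share one shape.
    shape = re.sub(r"\b[a-z0-9]{16,}\b", "*", url.split("?", 1)[0])
    return re.sub(r"\d+", "#", shape)

urls = [
    "www.example.com/session/a1b2c3d4e5f6a7b8c9d0/view",
    "www.example.com/session/ffee00112233445566aa/view",
    "www.amazon.com/nike/running-shoes%09mens/buy",
]
print(Counter(url_shape(u) for u in urls).most_common())
# [('www.example.com/session/*/view', 2), ('www.amazon.com/nike/running-shoes%#mens/buy', 1)]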

[In reply to c***z's original post]
l*s
13
This is a sequence labeling task:
a URL is a sequence, and your task is to find the terms within it.
It's similar to the named entity recognition task.
You can read some papers about it.
Models: CRF, MEMM, HMM
Training data: manually label them
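A sketch of what the hand-labeled data and a CRF could look like, assuming the third-party sklearn-crfsuite package (the one-URL training set and the feature names are toy examples, just to show the shape of the data):

import sklearn_crfsuite   # third-party: pip install sklearn-crfsuite

def char_features(url, i):
    # Per-character features: the character itself plus a little context.
    return {
        "char": url[i],
        "is_digit": url[i].isdigit(),
        "prev": url[i - 1] if i > 0 else "<s>",
        "next": url[i + 1] if i < len(url) - 1 else "</s>",
    }

def featurize(url):
    return [char_features(url, i) for i in range(len(url))]

# Hand-labeled toy example: B/I mark characters inside a wanted term, O is everything else.
train_url = "a.com/nike/buy"
train_tags = list("OOOOOOBIIIOOOO")   # "nike" tagged B I I I

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(train_url)], [train_tags])
print(crf.predict([featurize("b.com/nike/buy")])[0])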
l*s
14
cont:
Use tags B, I, O to indicate the beginning, inside, and outside of a word.
Each character in the URL is assigned a tag: B, I, or O.
Then this becomes a classification task, just with 3 class labels: B/I/O.
Grab any classifier you want; mine is MaxEnt.
Feature engineering:
convert each character to a feature vector. The most helpful features will be: character n-grams before and after the current character, the length of the URL, whether there is a digit or letter among the neighboring characters, and of course the current character itself.
Model training and decoding:
This step is pretty simple, exactly the same as any other classification task.
Tip: use some post-processing rules to improve the results.
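A minimal sketch of the per-character MaxEnt setup described above, using scikit-learn's LogisticRegression as the MaxEnt classifier (the toy training URL, tags, and feature names are made up):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def char_features(url, i):
    # Roughly the features listed above: current character, neighboring character
    # n-grams, digit/letter flags for the neighbors, and the URL length.
    return {
        "cur": url[i],
        "prev_bigram": url[max(0, i - 2):i],
        "next_bigram": url[i + 1:i + 3],
        "prev_is_digit": i > 0 and url[i - 1].isdigit(),
        "next_is_alpha": i + 1 < len(url) and url[i + 1].isalpha(),
        "url_len": len(url),
    }

# Toy hand-labeled URL: B/I over "nike", O elsewhere.
train_url = "a.com/nike/buy"
train_tags = list("OOOOOOBIIIOOOO")

X = [char_features(train_url, i) for i in range(len(train_url))]
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)   # multinomial logistic regression == MaxEnt
clf.fit(vec.fit_transform(X), train_tags)

test_url = "b.com/nike/buy"
X_test = vec.transform([char_features(test_url, i) for i in range(len(test_url))])
print(list(clf.predict(X_test)))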