Hi there,
I am using Nutch to crawl a site that has dynamic pages, for example:
http://www.example.com/browse/category1/category2/category3?navid=1234567
I commented out the following line in regex-urlfilter.txt to allow dynamic
pages, i.e. URLs that contain a question mark:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
However, those dynamic pages have a panel, "NARROW DOWN YOUR RESULTS BY:",
with many filters. These produce hundreds of outlinks that won't bring any
extra data but will result in 100x+ page requests, for example:
http://www.example.com/browse/category1/category2/category3?navid=1234567+2
http://www.example.com/browse/category1/category2/category3?navid=1234567+3
http://www.example.com/browse/category1/category2/category3?navid=1234567+2+3
To avoid putting unnecessary load on the target website, I want to
filter out URLs that contain a "+" sign. My rules in regex-urlfilter.txt
currently look like this:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|... omit ...|js|JS)$
+^http://www.example.com/ProductsCategory
+^http://www.example.com/browse/
-[+]
Here, /ProductsCategory covers the seed URLs that I start from, and
/browse.. covers the pages that I want to collect.
I am assuming "-[+]" will reject URLs that contain a "+" sign.
However, it is not doing what I expect: in the nohup output file I can
still see the robot fetching pages whose URLs contain "+".
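One possible cause, assuming rules are applied in file order and the first
matching rule decides (as the comments in the default regex-urlfilter.txt
state), is that "+^http://www.example.com/browse/" accepts the URL before
"-[+]" is ever reached. The sketch below only imitates that behaviour with
java.util.regex; it is not Nutch's own code, and the class FirstMatchSketch
and its accepts method are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;

public class FirstMatchSketch {

    // Each rule line is '+' (accept) or '-' (reject) followed by a regex.
    // Assumption: the first rule whose regex is found anywhere in the URL
    // decides the outcome; a URL that matches no rule is rejected.
    static boolean accepts(String[] ruleLines, String url) {
        for (String line : ruleLines) {
            boolean accept = line.charAt(0) == '+';
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url =
            "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";

        // Order from the post: the accept rule comes before "-[+]".
        String[] original = { "+^http://www.example.com/browse/", "-[+]" };
        // Same rules with "-[+]" moved above the accept rule.
        String[] reordered = { "-[+]", "+^http://www.example.com/browse/" };

        System.out.println(accepts(original, url));   // prints true
        System.out.println(accepts(reordered, url));  // prints false
    }
}
```

If this assumption holds, moving "-[+]" above the two "+^http://..." lines
would be the change to try.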
Question 1: How can I modify the regular expressions in
regex-urlfilter.txt to fit my need?
I have also followed the NutchInEclipse
<http://wiki.apache.org/nutch/RunNutchInEclipse> tutorial by Tejas on the
Nutch wiki, and I now have a working environment to test the Nutch source code.
Question 2: Is there an easy way in Eclipse to test the output of a list of
URLs after they have been filtered by a certain regular expression?
I know Nutch uses java.util.regex, but I want to know how Nutch reads the
rules from the configuration file and which characters I should escape.
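As a starting point for the escaping part, here is a small standalone check
that can be run from Eclipse with plain java.util.regex, without going
through Nutch at all. It shows that "+" is literal inside a character class
("[+]") and needs a backslash outside one ("\+"). The class PlusEscapeSketch
and its helpers are hypothetical. One assumption to note: a rule file holds
the raw regex, so "\+" there corresponds to "\\+" in a Java string literal.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PlusEscapeSketch {

    // True if the regex is found anywhere in the URL (substring search,
    // not a whole-string match).
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    // True if java.util.regex accepts the expression at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base =
            "http://www.example.com/browse/category1/category2/category3?navid=1234567";
        String[] urls = { base, base + "+2", base + "+2+3" };

        for (String url : urls) {
            // "[+]" and "\\+" should agree: both look for a literal plus sign.
            System.out.println(found("[+]", url) + " " + found("\\+", url) + " " + url);
        }

        // A bare "+" is a quantifier with nothing to repeat, so it is rejected.
        System.out.println(compiles("+"));  // prints false
    }
}
```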
Thanks!
Bin