Nutch Regular Expression Testing

Hi there,

I am using Nutch to crawl a site that has dynamic pages.

http://www.example.com/browse/category1/category2/category3?navid=1234567

To allow dynamic pages, i.e. URLs that contain a question mark, I commented
out this line in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

However, those dynamic pages have a panel, "NARROW DOWN YOUR RESULTS BY:",
with many filters. These lead to hundreds of outlinks that won't bring any
extra data but will result in 100x+ page requests. For example:

http://www.example.com/browse/category1/category2/category3?navid=1234567+2
http://www.example.com/browse/category1/category2/category3?navid=1234567+3
http://www.example.com/browse/category1/category2/category3?navid=1234567+2+3

To avoid putting unnecessary load on the target website, I want to filter
out the URLs that contain a "+" sign. My regex-urlfilter.txt now looks
like this:

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|... omit ...|js|JS)$
+^http://www.example.com/ProductsCategory
+^http://www.example.com/browse/
-[+]

Here, /ProductsCategory matches the seed URLs that I start from, and
/browse/... are the pages that I want to collect.
I am assuming "-[+]" will remove the URLs that contain a "+" sign.
However, it is not doing what I expect: in the nohup output I can still
see the robot fetching pages whose URLs contain "+".

Question 1: How can I modify the regular expressions in
regex-urlfilter.txt to fit my need?
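
One thing worth checking for this question is rule order. The comments in
the stock regex-urlfilter.txt say that the first matching pattern in the
file decides whether a URL is included or ignored. Under that assumption,
the sketch below imitates the behaviour with plain java.util.regex. It is
not Nutch code: the class name and the accept() helper are hypothetical,
and the suffix rule is left out because its extension list is elided above.

```java
import java.util.regex.Pattern;

class FirstMatchSketch {
    // Hypothetical helper, not Nutch code: the first rule whose regex is
    // found anywhere in the URL decides; '+' keeps the URL, '-' drops it.
    static boolean accept(String[] rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            String regex = rule.substring(1);
            if (Pattern.compile(regex).matcher(url).find()) {
                return include;
            }
        }
        return false; // assumption: a URL that matches no rule is ignored
    }

    // Rules in the order given in the post.
    static final String[] AS_POSTED = {
        "-^(file|ftp|mailto):",
        "+^http://www.example.com/ProductsCategory",
        "+^http://www.example.com/browse/",
        "-[+]"
    };

    // Same rules, with the "+" exclusion moved ahead of the include rules.
    static final String[] REORDERED = {
        "-^(file|ftp|mailto):",
        "-[+]",
        "+^http://www.example.com/ProductsCategory",
        "+^http://www.example.com/browse/"
    };

    public static void main(String[] args) {
        String plain =
            "http://www.example.com/browse/category1/category2/category3?navid=1234567";
        String plus = plain + "+2";
        System.out.println(accept(AS_POSTED, plus));   // true: /browse/ rule matches first
        System.out.println(accept(REORDERED, plus));   // false: -[+] matches first
        System.out.println(accept(REORDERED, plain));  // true: no "+" in the URL
    }
}
```

If Nutch really does stop at the first match, "-[+]" is never reached for
/browse/ URLs in the posted order, which would explain the fetches seen in
the log.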

I have also followed the RunNutchInEclipse
<http://wiki.apache.org/nutch/RunNutchInEclipse> tutorial by Tejas on the
Nutch wiki, and I now have a working environment to test the Nutch source code.

Question 2: Is there an easy way in Eclipse to test the output of a list of
URLs after being filtered by a certain regular expression?

I know Nutch uses java.util.regex, but I want to know how Nutch reads the
rules from the configuration file, which characters I should escape, etc.
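
As a rough way to experiment outside Nutch, the sketch below parses rule
text in the format described in the stock regex-urlfilter.txt (blank lines
and "#" lines skipped, every other line a "+" or "-" followed directly by
the regex) and runs a list of URLs through it. It is a simplified
imitation, not Nutch's own reader: RuleFileSketch, parse(), filter() and
compiles() are hypothetical names. Note that the doubled backslashes exist
only because the rules sit in Java string literals; in the file itself a
literal plus would be written \+ or [+].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RuleFileSketch {
    // One parsed rule: the sign from column one plus the compiled regex.
    static final class Rule {
        final boolean include;
        final Pattern pattern;
        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    // Assumed file format: blank lines and '#' lines are skipped; every other
    // line is '+' or '-' followed directly by a java.util.regex pattern.
    static List<Rule> parse(String text) {
        List<Rule> rules = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad first character: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // Keeps the URLs whose first matching rule is a '+' rule.
    static List<String> filter(List<Rule> rules, List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            for (Rule rule : rules) {
                if (rule.pattern.matcher(url).find()) {
                    if (rule.include) {
                        kept.add(url);
                    }
                    break;
                }
            }
        }
        return kept;
    }

    // True if the text is a syntactically valid java.util.regex pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ruleText = String.join("\n",
            "# drop facet URLs, then keep the browse pages",
            "-\\+",
            "+^http://www\\.example\\.com/browse/");
        List<String> urls = Arrays.asList(
            "http://www.example.com/browse/category1/category2/category3?navid=1234567",
            "http://www.example.com/browse/category1/category2/category3?navid=1234567+2");
        System.out.println(filter(parse(ruleText), urls)); // only the URL without "+"
        System.out.println(compiles("+"));   // false: a bare + is a dangling quantifier
        System.out.println(compiles("\\+")); // true: escaped literal plus
        System.out.println(compiles("[+]")); // true: + is literal inside a character class
    }
}
```

On escaping: "[+]" is already a valid way to match a literal plus, so the
escaping in the posted rules looks fine. The unescaped dots in
www.example.com match any character, which is harmless here but looser
than intended.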

Thanks!

Bin
