CodeCanyon

Web Crawler and Scraper for Files and Links

About Web Crawler and Scraper

Web Crawler can be used to get links, emails, images and files from a webpage or site.

Web Crawler has a simple and intuitive interface.

The crawler is multithreaded and optimized for performance. It scans the webpage based on MIME types and file extensions, so it can find hidden links.
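
To illustrate the idea of extension- and MIME-type-based detection, here is a minimal Python sketch (not the product's actual code, which is a .NET application): a URL is treated as a downloadable file if its extension is on a wanted list, with a MIME-type guess as a fallback so that links without a telling extension can still be caught. The function name and extension list are illustrative assumptions.

```python
# Illustrative sketch, not the shipped crawler code: classify discovered
# URLs by file extension, falling back to a MIME-type guess.
import mimetypes
import os
from urllib.parse import urlparse

def classify_url(url, wanted_exts=frozenset({".pdf", ".jpg", ".zip"})):
    """Return True if the URL looks like a downloadable file we want."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    if ext in wanted_exts:
        return True
    # Fall back to a MIME-type guess for extensions not on the list;
    # anything that is not plain text/HTML is treated as a file.
    mime, _ = mimetypes.guess_type(path)
    return mime is not None and not mime.startswith("text/")
```

In this sketch, `report.pdf` is matched by extension, `clip.mp4` by its guessed MIME type, and an extensionless page URL like `/about` is left for the link crawler.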

Two applications are included in the package: a Windows Forms application and a newer WPF application with extended functionality. The “Deep crawl” feature allows the crawler to search all the pages linked from the selected website.

After crawling, the Web Crawler will save all links and e-mail addresses to the selected folder, along with all the crawled files.

The WPF crawler/scraper allows the user to supply a regular expression to scrape the crawled webpages with, and the new application gives the user greater control over the crawling process.

How to use the Windows Forms crawler

At the top is a box for entering the URL to crawl. Underneath the URL box is a box for choosing the folder in which to save the crawled files. The last box is for the file extensions that the crawler should look for. If the file extensions box is left empty, the program only looks for links and e-mail addresses on the page and saves them to the linkList.txt and emailList.txt files in the output directory.
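
The output step described above can be sketched in Python as follows. The file names linkList.txt and emailList.txt come from the product description; the parsing itself (an `<a href>` collector plus a simple e-mail regex) is an illustrative assumption, not the shipped code.

```python
# Minimal sketch of the "links and e-mails" output: collect hrefs and
# e-mail addresses from a page's HTML and save them to linkList.txt and
# emailList.txt in the chosen output folder.
import re
import tempfile
from html.parser import HTMLParser
from pathlib import Path

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def save_links_and_emails(html, out_dir):
    """Write linkList.txt and emailList.txt into out_dir; return the data."""
    collector = LinkCollector()
    collector.feed(html)
    emails = sorted(set(EMAIL_RE.findall(html)))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "linkList.txt").write_text("\n".join(collector.links))
    (out / "emailList.txt").write_text("\n".join(emails))
    return collector.links, emails

# Usage on a tiny page:
demo_html = '<a href="http://a.example/files">files</a> contact: bob@example.com'
with tempfile.TemporaryDirectory() as tmp:
    links, emails = save_links_and_emails(demo_html, tmp)
```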

The application is primarily meant for crawling single subpages, but it can crawl a whole website when the “Deep crawl” option is checked. This option is very resource-intensive, as it opens parallel connections to the server for better performance.

How to use the WPF crawler and scraper

The WPF application has an interface similar to the Windows Forms crawler/scraper, and the first three boxes have the same functionality. The last box is optional: it accepts a regular expression that is run against every crawled webpage, which can be used to search for phone numbers, names, locations, and anything else a regular expression can match.
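
The regular-expression scraping described above amounts to running a user-supplied pattern over every crawled page and keeping the matches per URL. A small Python illustration follows; the phone-number pattern is a deliberately simple assumption, not the one the application ships with.

```python
# Illustrative sketch: run a user-supplied regular expression over each
# crawled page and collect the matches per URL.
import re

def scrape_matches(pages, pattern):
    """Return {url: [matches]} for every page whose text matches pattern."""
    regex = re.compile(pattern)
    results = {}
    for url, text in pages.items():
        hits = regex.findall(text)
        if hits:
            results[url] = hits
    return results

# Example: find US-style phone numbers on a crawled page.
hits = scrape_matches(
    {"http://example.com/contact": "Call 555-867-5309 or 555-123-4567."},
    r"\b\d{3}-\d{3}-\d{4}\b",
)
```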

Like the Windows Forms version, the WPF crawler is multithreaded, optimized for performance, and scans pages by MIME type and file extension to find hidden links. There is also some support for AJAX calls. The new engine gives more control over what is crawled and over the depth and scope of the crawl, and the user can set the number of concurrent threads the program uses to scrape webpages.

About the rating

It seems that mostly people who did not like the product, or could not use it properly, decide to rate it. If you like the application, you can help the developer by leaving a positive rating.
