This project is an implementation of a simple sitemap generator. The sitemap is printed to stdout and consists of three parts:
- Links to all pages under the same domain
- Links to static content items such as images
- External URLs.
The application requires a starting URL as input. It then crawls all pages within that domain but does not follow links to external sites.
The implementation uses a Fork-Join pool and recursively creates new tasks to crawl the linked pages. Links in the web pages are parsed using Jsoup and then categorized as internal link, internal media, or external link.
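A minimal sketch of that approach, assuming one RecursiveAction per page; class and field names such as CrawlTask, and the example.com URL, are illustrative and do not match the actual sources:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch of the recursive crawl task, not the project's actual class.
public class CrawlTask extends RecursiveAction {

    private final String url;
    private final String domain;           // e.g. "example.com"
    private final Set<String> visited;     // shared, thread-safe set of already-seen URLs
    private final Set<String> internalPages;
    private final Set<String> internalMedia;
    private final Set<String> externalLinks;

    CrawlTask(String url, String domain, Set<String> visited, Set<String> internalPages,
              Set<String> internalMedia, Set<String> externalLinks) {
        this.url = url;
        this.domain = domain;
        this.visited = visited;
        this.internalPages = internalPages;
        this.internalMedia = internalMedia;
        this.externalLinks = externalLinks;
    }

    @Override
    protected void compute() {
        Document doc;
        try {
            doc = Jsoup.connect(url).get();           // download and parse the page
        } catch (IOException e) {
            return;                                    // real code records the failure
        }
        List<CrawlTask> subtasks = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {   // static content such as images
            internalMedia.add(img.absUrl("src"));
        }
        for (Element a : doc.select("a[href]")) {
            String link = a.absUrl("href");
            if (!link.contains(domain)) {
                externalLinks.add(link);               // external URL: record, do not follow
            } else if (visited.add(link)) {            // internal page not seen before
                internalPages.add(link);
                subtasks.add(new CrawlTask(link, domain, visited,
                        internalPages, internalMedia, externalLinks));
            }
        }
        invokeAll(subtasks);   // fork one task per newly discovered internal page
    }

    public static void main(String[] args) {
        String start = "https://example.com/";         // hypothetical starting URL
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(start);
        Set<String> pages = ConcurrentHashMap.newKeySet();
        Set<String> media = ConcurrentHashMap.newKeySet();
        Set<String> external = ConcurrentHashMap.newKeySet();
        // parallelism kept low (4 threads) to avoid excessive load on the target server
        new ForkJoinPool(4).invoke(
                new CrawlTask(start, "example.com", visited, pages, media, external));
        pages.forEach(System.out::println);
        media.forEach(System.out::println);
        external.forEach(System.out::println);
    }
}
```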
The implementation internally keeps track of how linked pages were discovered and which ones could not be downloaded due to errors. The output can therefore be extended with extra information if desired.
Tests are written using the Spock framework. Only integration tests have been implemented, as most of the logic is closely tied to actual network connections and HTML parsing. The tests use WireMock (which provides an embedded Jetty server) to simulate the target site being crawled.
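The tests themselves are Spock (Groovy) specifications; the WireMock stubbing they rely on looks roughly like this Java equivalent, where the port number and page content are made up for illustration:

```java
import com.github.tomakehurst.wiremock.WireMockServer;

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

public class WireMockExample {
    public static void main(String[] args) {
        WireMockServer server = new WireMockServer(8089);   // arbitrary port for the example
        server.start();
        // Serve a small HTML page so the crawler has something to parse
        server.stubFor(get(urlEqualTo("/"))
                .willReturn(aResponse()
                        .withHeader("Content-Type", "text/html")
                        .withBody("<html><body><a href=\"/page2\">link</a></body></html>")));
        // ... point the sitemap generator at http://localhost:8089/ and assert on its output ...
        server.stop();
    }
}
```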
Requests are retried up to 3 times when an IOException occurs.
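A minimal sketch of that retry loop, assuming Jsoup is used for the download (the class and method names are hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class RetryingDownloader {

    private static final int MAX_ATTEMPTS = 3;   // matches the retry limit described above

    // Downloads and parses a page, retrying on IOException; the real method in the
    // project may be named and structured differently.
    static Document download(String url) throws IOException {
        IOException lastError = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return Jsoup.connect(url).get();
            } catch (IOException e) {
                lastError = e;                    // remember the failure and try again
            }
        }
        throw lastError;                          // all attempts failed
    }
}
```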
- The number of concurrent threads is currently fixed at 4; it is kept low to avoid putting excessive load on the target server.
- There is no logic for adding delays between requests.
- The implementation is a stand-alone application and is not designed to run as a distributed system.
- Check the response header's Content-Type and abort the download if the content is not HTML (see the sketch after this list).
- Attempt to encode URLs that contain spaces.
- The output could be saved to a file.
- The crawler could be made to cope with large websites by: a. writing the output incrementally as crawling progresses, and b. keeping track of crawled pages in external storage.
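For the Content-Type check, one option would be a HEAD request before the actual download. A rough sketch using the JDK's HttpClient; the helper name and example URL are made up:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ContentTypeCheck {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Returns true if the server reports an HTML Content-Type for the URL,
    // so the crawler can skip downloading non-HTML resources entirely.
    static boolean isHtml(String url) throws Exception {
        HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response = CLIENT.send(head, HttpResponse.BodyHandlers.discarding());
        return response.headers()
                .firstValue("Content-Type")
                .map(type -> type.startsWith("text/html"))
                .orElse(false);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isHtml("https://example.com/"));   // hypothetical URL
    }
}
```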
For instructions on how to build the project and run the sitemap generator, please see build-instructions.txt.