This week I put together a website crawler in Python. There are a couple of variations of the script, and it can be repurposed for just about anything anyone needs to crawl or automate against our public-facing assets.
This project contains three Python scripts designed for web crawling. The first, ivan.py, is tailored for single-page crawling and is ideal for checking a specific page for broken links or URL patterns. The second, ivan-crawl-all.py, extends this functionality to crawl entire websites, identifying broken links and pattern matches across every page reachable from a given starting URL. The third, ivan-crawl-all-sanitized.py, adds sanitized data handling to the full-site crawl, filtering out irrelevant or redundant content so the crawl stays efficient and focused.
Note: keep in mind that this is a simple Python crawler that can be repurposed for whatever task you might need.
Features
Single Page Crawler (“ivan.py”):
- Crawls a specified single web page.
- Identifies and logs broken links found on the page.
- Searches for and logs specific patterns within the page URLs.
Entire Website Crawler (“ivan-crawl-all.py”):
- Crawls an entire website starting from a specified URL.
- Recursively follows all internal links to cover the entire website.
- Logs broken links and pattern matches found across the site.
- Organizes output into readable blocks for each crawled URL.
Entire Website Crawler with Data Sanitization (“ivan-crawl-all-sanitized.py”):
- Offers all the features of “ivan-crawl-all.py”.
- Includes advanced data sanitization to avoid processing non-essential content like documents, images, and external links, making the crawling process more efficient.
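The “sanitization” boils down to a couple of cheap checks applied to every URL before it is fetched or queued: skip anything whose extension is on the exclusion list, and only follow links whose host matches the starting site. A minimal sketch of that idea, assuming a hypothetical should_crawl helper (the real scripts inline these checks and keep the list in the excluded_extensions variable):

from urllib.parse import urlparse

excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx',
                       '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

def should_crawl(url, start_url):
    # Skip documents, images, and archives by extension
    if any(url.lower().endswith(ext) for ext in excluded_extensions):
        return False
    # Only follow links that stay on the starting site
    return urlparse(url).netloc == urlparse(start_url).netloc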
Requirements
- Python 3.x
- BeautifulSoup4
- Requests
- termcolor
Installation
Ensure Python 3.x is installed on your system. Then, install the required packages using pip:
pip install beautifulsoup4 requests termcolor
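If you'd rather keep these packages isolated from your system Python, a standard virtual environment works too (optional; nothing in the scripts depends on it):

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install beautifulsoup4 requests termcolor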
Usage
Single Page Crawler (“ivan.py”)
- Configure the script: Edit the ivan.py file to specify the URL you wish to crawl and, optionally, a pattern to search for within the page (see the snippet below).
- Run the script:
python ivan.py
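The values to edit sit at the bottom of ivan.py, inside the __main__ block. For example, to check a single landing page and flag links that point back at a particular host (the URLs below are placeholders, not real configuration):

if __name__ == "__main__":
    start_url = "https://www.yourdomain.com/landing"   # the single page to check
    pattern = r"^https://old\.yourdomain\.com/"        # optional regex; use None to skip pattern matching
    crawl_website(start_url, pattern)

With pattern set to None, the script simply fetches every link on the page and reports the ones that fail.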
Entire Website Crawler (“ivan-crawl-all.py”)
- Configure the script: Edit the ivan-crawl-all.py file to specify the starting URL for the website crawl and, optionally, a pattern to search for across the site.
- Run the script:
python ivan-crawl-all.py
Entire Website Crawler with Data Sanitization (“ivan-crawl-all-sanitized.py”)
- Configure the script: Similar to ivan-crawl-all.py, edit the ivan-crawl-all-sanitized.py file to specify the starting URL and optional pattern. This version automatically filters out non-essential content for a more focused crawl.
- Run the script:
python ivan-crawl-all-sanitized.py
Review the Output
Each crawler will generate a timestamped output file in the same directory, named according to the script run (crawl-YYYYMMDD-HHMMSS.txt, crawl-all-YYYYMMDD-HHMMSS.txt, or crawl-all-sanitized-YYYYMMDD-HHMMSS.txt). Open these files to review the crawl results.
- YYYYMMDD - year, month, day
- HHMMSS - hour, minutes, seconds
Log Output Example
$ python ivan.py
========================================
Crawl Start Time: 2024-03-14 15:13:31
Crawling: https://ww2.valleyair.org/landing
----------------------------------------
Pattern match found: https://www.valleyair.org/busind/comply/source_testing.htm
Pattern match found: https://www.valleyair.org/videos/video_idx.htm
Pattern match found: https://www.valleyair.org/General_Info/AGLoader.htm
Pattern match found: https://www.valleyair.org/Programs/CCAP/bps/BPS_idx.htm
Error fetching https://ww2.valleyair.org/permitting/permit-exempt-equipment-registration-peer/ : 404 Client Error: Not Found for url: https://ww2.valleyair.org/permitting/permit-exempt-equipment-registration-peer/%20
Pattern match found: https://www.valleyair.org/Symposiums/symposiums_idx.htm
Pattern match found: https://www.valleyair.org/Programs/SpecialCitySelection/SCSC_idx.htm
Error fetching https://twitter.com/ValleyAir: 400 Client Error: Bad Request for url: https://twitter.com/ValleyAir
----------------------------------------
Crawl End Time: 2024-03-14 15:13:43
Finished crawling.
========================================
Total Pages Crawled: 1
Total Broken Links Found: 2
Total Pattern Matches Found: 6
Customizing the Crawler
- Starting URL: Change the start_url variable to the URL where you want the crawl to begin.
- Pattern Matching: Modify the pattern variable with the regex pattern for the links you’re interested in (see the example below).
- Visual Breaks: Adjust the characters and length in the insert_visual_break and insert_minor_visual_break functions for custom visual separators in the log.
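For example, to start a crawl at a site's news section and flag any legacy .htm links still living under the old www host (all URLs here are placeholders, not the project's real configuration):

start_url = "https://www.yourdomain.com/news/"
# Match any link on the old host that still ends in .htm
pattern = r"^https://www\.yourdomain\.com/.*\.htm$"

The visual break helpers are just repeated characters ("=" * 40 and "-" * 40), so changing the character or the count there changes the separators in both the console output and the log file.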
The scripts
ivan.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
from termcolor import colored
import datetime

# Initialize counters
visited = set()
broken_links_count = 0
pattern_matches_count = 0

# Open an output file
output_file_name = 'crawl-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True, include_timestamp=True):
    if include_timestamp:
        time_str = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        formatted_message = f"{time_str} - {message}"
    else:
        formatted_message = message
    if to_console:
        if color:
            print(colored(formatted_message, color))
        else:
            print(formatted_message)
    if to_file:
        output_file.write(formatted_message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def fetch_page(url):
    global broken_links_count
    if not url.startswith(('http://', 'https://')):
        return None  # Skip non-HTTP/HTTPS URLs
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        log_message(f"Error fetching {url}: {e}", "red", include_timestamp=False)
        broken_links_count += 1
        return None

def validate_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        url = urljoin(base_url, href)
        if url in visited:
            continue  # Skip already visited URLs
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            log_message(f"Pattern match found: {url}", "green", include_timestamp=False)
            pattern_matches_count += 1
        else:
            fetch_page(url)  # Fetch page only if it's not a pattern match to avoid unnecessary requests

def crawl_website(start_url, pattern=None):
    insert_visual_break()
    global visited
    start_time = datetime.datetime.now()
    log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", include_timestamp=False)
    log_message(f"Crawling: {start_url}", include_timestamp=False)
    insert_minor_visual_break()
    visited.add(start_url)
    content = fetch_page(start_url)
    if content:
        validate_links(start_url, content, pattern)
    end_time = datetime.datetime.now()
    insert_minor_visual_break()
    log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", include_timestamp=False)
    log_message("Finished crawling.", include_timestamp=False)

if __name__ == "__main__":
    start_url = "https://www.yourdomain.com/"
    pattern = r"^https://www\.yourdomain\.com/"
    crawl_website(start_url, pattern)
    # Insert a visual break before summaries
    insert_visual_break()
    # Print summaries
    log_message(f"Total Pages Crawled: {len(visited)}", include_timestamp=False)
    log_message(f"Total Broken Links Found: {broken_links_count}", "red", include_timestamp=False)
    log_message(f"Total Pattern Matches Found: {pattern_matches_count}", "green", include_timestamp=False)
    # Close the output file
    output_file.close()
ivan-crawl-all.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin, urlparse
from termcolor import colored
import datetime
from collections import deque

# Initialize counters and structures
visited = set()
broken_links_count = 0
pattern_matches_count = 0
queue = deque()

# List of excluded file extensions
excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

# Open an output file
output_file_name = 'crawl-all-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True):
    if to_console:
        if color:
            print(colored(message, color))
        else:
            print(message)
    if to_file:
        output_file.write(message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def is_internal_link(link, start_url):
    return urlparse(link).netloc == urlparse(start_url).netloc

def fetch_page(url):
    global broken_links_count
    # Skip non-HTTP/HTTPS URLs, such as tel: or mailto:
    if not url.startswith(('http://', 'https://')):
        return None
    # Check if the URL ends with an excluded extension
    if any(url.lower().endswith(ext) for ext in excluded_extensions):
        return None
    try:
        response = requests.get(url, timeout=5)
        # Raise on 4xx/5xx first so broken links are counted even when the
        # error page is not served as HTML
        response.raise_for_status()
        if 'text/html' not in response.headers.get('Content-Type', ''):
            log_message(f"Skipped non-HTML content at {url}", color="yellow")
            return None
        return response.text
    except requests.RequestException as e:
        log_message(f"Error fetching {url}: {e}", color="red")
        broken_links_count += 1
        return None

def process_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    found_matches = False
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        url = urljoin(base_url, href)
        if url in visited or any(url.lower().endswith(ext) for ext in excluded_extensions):
            continue
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            if not found_matches:
                insert_minor_visual_break()
                log_message(f"Pattern matches on: {base_url}", color="magenta")
                found_matches = True
            log_message(f" Pattern match: {url}", color="green")
            pattern_matches_count += 1
        if is_internal_link(url, start_url):  # start_url is the module-level value set in __main__
            queue.append(url)

def crawl_website(start_url, pattern=None):
    global visited
    queue.append(start_url)
    visited.add(start_url)
    while queue:
        current_url = queue.popleft()
        insert_minor_visual_break()
        log_message(f"Crawling: {current_url}", "cyan")
        content = fetch_page(current_url)
        if content:
            process_links(current_url, content, pattern)

if __name__ == "__main__":
    try:
        start_time = datetime.datetime.now()
        start_url = "http://yourdomain.com/"
        pattern = r"https?://www\.yourdomain\.com/"
        insert_visual_break()
        log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", "blue")
        crawl_website(start_url, pattern)
    except Exception as e:
        log_message(f"An unexpected error occurred: {e}", "red")
    finally:
        insert_minor_visual_break()
        end_time = datetime.datetime.now()
        log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", "blue")
        log_message("Finished crawling.", "blue")
        insert_visual_break()
        log_message(f"Total Pages Crawled: {len(visited)}", "magenta")
        log_message(f"Total Broken Links Found: {broken_links_count}", "red")
        log_message(f"Total Pattern Matches Found: {pattern_matches_count}", "green")
        output_file.close()
ivan-crawl-all-sanitized.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin, urlparse
from termcolor import colored
import datetime
from collections import deque

# Initialize counters and structures
visited = set()
broken_links_count = 0
pattern_matches_count = 0
queue = deque()

# List of excluded file extensions
excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

# Open an output file
output_file_name = 'crawl-all-sanitized-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True):
    if to_console and color:
        print(colored(message, color))
    elif to_console:
        print(message)
    if to_file:
        output_file.write(message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def is_internal_link(link, start_url):
    return urlparse(link).netloc == urlparse(start_url).netloc

def fetch_page(url):
    global broken_links_count
    if any(url.lower().endswith(ext) for ext in excluded_extensions) or '#googtrans' in url:
        return None
    try:
        response = requests.get(url, timeout=5)
        # Raise on 4xx/5xx first so broken links are counted even when the
        # error page is not served as HTML
        response.raise_for_status()
        if 'text/html' not in response.headers.get('Content-Type', ''):
            return None
        return response.text
    except requests.RequestException as e:
        insert_minor_visual_break()
        log_message(f"Error fetching {url}: {e}", color="red")
        broken_links_count += 1
        return None

def process_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    found_matches = False
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        # Ensure the href is a properly formatted URL
        if not href.startswith(('http://', 'https://', '/')):
            href = '/' + href
        url = urljoin(base_url, href)
        # Skip if the URL has already been visited or matches an excluded extension
        if url in visited or any(url.lower().endswith(ext) for ext in excluded_extensions):
            continue
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            if not found_matches:
                insert_minor_visual_break()
                log_message(f"Pattern matches on: {base_url}", color="magenta")
                found_matches = True
            log_message(f" Pattern match: {url}", color="green")
            pattern_matches_count += 1
        if is_internal_link(url, start_url):  # Check against the original start_url
            queue.append(url)

def crawl_website(start_url, pattern=None):
    global visited
    queue.append(start_url)
    visited.add(start_url)
    while queue:
        current_url = queue.popleft()
        content = fetch_page(current_url)
        if content:
            process_links(current_url, content, pattern)

if __name__ == "__main__":
    try:
        start_time = datetime.datetime.now()
        start_url = "https://www.yourdomain.com/"
        pattern = r"https?://www\.yourdomain\.com/"
        insert_visual_break()
        log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", color="blue")
        crawl_website(start_url, pattern)
    except Exception as e:
        log_message(f"An unexpected error occurred: {e}", color="red")
    finally:
        insert_minor_visual_break()
        end_time = datetime.datetime.now()
        log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", color="blue")
        log_message("Finished crawling.", color="blue")
        insert_visual_break()
        log_message(f"Total Pages Crawled: {len(visited)}", color="magenta")
        log_message(f"Total Broken Links Found: {broken_links_count}", color="red")
        log_message(f"Total Pattern Matches Found: {pattern_matches_count}", color="green")
        output_file.close()