This week I put together a website crawler in Python. There are a couple of variations of the script, and it can be repurposed for just about anything anyone needs to crawl or automate against our public-facing assets.
This project contains three Python scripts designed for web crawling. The first, ivan.py, is tailored for single-page crawling and is ideal for checking a specific page for broken links or URL patterns. The second, ivan-crawl-all.py, extends this functionality to crawl entire websites, identifying broken links and pattern matches across every page reachable from a given starting URL. The third, ivan-crawl-all-sanitized.py, adds sanitized data handling to the full-site crawl, filtering out irrelevant or redundant content so the crawl stays efficient and focused.
Note: keep in mind that this is a simple Python crawler that can be repurposed for whatever task you might need.
Features
Single Page Crawler (“ivan.py”):
- Crawls a specified single web page.
- Identifies and logs broken links found on the page.
- Searches for and logs specific patterns within the page URLs.
Entire Website Crawler (“ivan-crawl-all.py”):
- Crawls an entire website starting from a specified URL.
- Recursively follows all internal links to cover the entire website.
- Logs broken links and pattern matches found across the site.
- Organizes output into readable blocks for each crawled URL.
Entire Website Crawler with Data Sanitization (“ivan-crawl-all-sanitized.py”):
- Offers all the features of “ivan-crawl-all.py”.
- Includes advanced data sanitization to avoid processing non-essential content like documents, images, and external links, making the crawling process more efficient.
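The “sanitization” boils down to a couple of cheap checks applied to every URL before it is fetched or queued: skip anything whose extension is on the exclusion list, and only follow links whose host matches the starting site. A minimal sketch of that idea, assuming a hypothetical should_crawl helper (the real scripts inline these checks and keep the list in the excluded_extensions variable):

from urllib.parse import urlparse

excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx',
                       '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

def should_crawl(url, start_url):
    # Skip documents, images, and archives by extension
    if any(url.lower().endswith(ext) for ext in excluded_extensions):
        return False
    # Only follow links that stay on the starting site
    return urlparse(url).netloc == urlparse(start_url).netloc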
Requirements
- Python 3.x
- BeautifulSoup4
- Requests
- termcolor
Installation
Ensure Python 3.x is installed on your system. Then, install the required packages using pip:
pip install beautifulsoup4 requests termcolor
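If you'd rather keep these packages isolated from your system Python, a standard virtual environment works too (optional; nothing in the scripts depends on it):

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install beautifulsoup4 requests termcolor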
Usage
Single Page Crawler (“ivan.py”)
- Configure the script: Edit the ivan.py file to specify the URL you wish to crawl and, optionally, a pattern to search for within the page (see the snippet below).
- Run the script:
python ivan.py
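The values to edit sit at the bottom of ivan.py, inside the __main__ block. For example, to check a single landing page and flag links that point back at a particular host (the URLs below are placeholders, not real configuration):

if __name__ == "__main__":
    start_url = "https://www.yourdomain.com/landing"   # the single page to check
    pattern = r"^https://old\.yourdomain\.com/"        # optional regex; use None to skip pattern matching
    crawl_website(start_url, pattern)

With pattern set to None, the script simply fetches every link on the page and reports the ones that fail.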
Entire Website Crawler (“ivan-crawl-all.py”)
- Configure the script: Edit the ivan-crawl-all.py file to specify the starting URL for the website crawl and, optionally, a pattern to search for across the site.
- Run the script:
python ivan-crawl-all.py
Entire Website Crawler with Data Sanitization (“ivan-crawl-all-sanitized.py”)
- Configure the script: Similar to ivan-crawl-all.py, edit the ivan-crawl-all-sanitized.py file to specify the starting URL and optional pattern. This version automatically filters out non-essential content for a more focused crawl.
- Run the script:
python ivan-crawl-all-sanitized.py
Review the Output
Each crawler will generate a timestamped output file in the same directory, named according to the script run (crawl-YYYYMMDD-HHMMSS.txt, crawl-all-YYYYMMDD-HHMMSS.txt, or crawl-all-sanitized-YYYYMMDD-HHMMSS.txt). Open these files to review the crawl results.
- YYYYMMDD - year, month, day
- HHMMSS - hour, minutes, seconds
Log Output Example
$ python ivan.py
========================================
Crawl Start Time: 2024-03-14 15:13:31
Crawling: https://ww2.valleyair.org/landing
----------------------------------------
Pattern match found: https://www.valleyair.org/busind/comply/source_testing.htm
Pattern match found: https://www.valleyair.org/videos/video_idx.htm
Pattern match found: https://www.valleyair.org/General_Info/AGLoader.htm
Pattern match found: https://www.valleyair.org/Programs/CCAP/bps/BPS_idx.htm
Error fetching https://ww2.valleyair.org/permitting/permit-exempt-equipment-registration-peer/ : 404 Client Error: Not Found for url: https://ww2.valleyair.org/permitting/permit-exempt-equipment-registration-peer/%20
Pattern match found: https://www.valleyair.org/Symposiums/symposiums_idx.htm
Pattern match found: https://www.valleyair.org/Programs/SpecialCitySelection/SCSC_idx.htm
Error fetching https://twitter.com/ValleyAir: 400 Client Error: Bad Request for url: https://twitter.com/ValleyAir
----------------------------------------
Crawl End Time: 2024-03-14 15:13:43
Finished crawling.
========================================
Total Pages Crawled: 1
Total Broken Links Found: 2
Total Pattern Matches Found: 6
Customizing the Crawler
- Starting URL: Change the start_url variable to the URL where you want the crawl to begin.
- Pattern Matching: Modify the pattern variable with the regex pattern for the links you’re interested in (see the example below).
- Visual Breaks: Adjust the characters and length in the insert_visual_break and insert_minor_visual_break functions for custom visual separators in the log.
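For example, to start a crawl at a site's news section and flag any legacy .htm links still living under the old www host (all URLs here are placeholders, not the project's real configuration):

start_url = "https://www.yourdomain.com/news/"
# Match any link on the old host that still ends in .htm
pattern = r"^https://www\.yourdomain\.com/.*\.htm$"

The visual break helpers are just repeated characters ("=" * 40 and "-" * 40), so changing the character or the count there changes the separators in both the console output and the log file.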
The scripts
ivan.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
from termcolor import colored
import datetime

# Initialize counters
visited = set()
broken_links_count = 0
pattern_matches_count = 0

# Open an output file
output_file_name = 'crawl-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True, include_timestamp=True):
    if include_timestamp:
        time_str = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        formatted_message = f"{time_str} - {message}"
    else:
        formatted_message = message
    if to_console:
        if color:
            print(colored(formatted_message, color))
        else:
            print(formatted_message)
    if to_file:
        output_file.write(formatted_message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def fetch_page(url):
    global broken_links_count
    if not url.startswith(('http://', 'https://')):
        return None  # Skip non-HTTP/HTTPS URLs
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        log_message(f"Error fetching {url}: {e}", "red", include_timestamp=False)
        broken_links_count += 1
        return None

def validate_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        url = urljoin(base_url, href)
        if url in visited:
            continue  # Skip already visited URLs
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            log_message(f"Pattern match found: {url}", "green", include_timestamp=False)
            pattern_matches_count += 1
        else:
            fetch_page(url)  # Fetch page only if it's not a pattern match to avoid unnecessary requests

def crawl_website(start_url, pattern=None):
    insert_visual_break()
    global visited
    start_time = datetime.datetime.now()
    log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", include_timestamp=False)
    log_message(f"Crawling: {start_url}", include_timestamp=False)
    insert_minor_visual_break()
    visited.add(start_url)
    content = fetch_page(start_url)
    if content:
        validate_links(start_url, content, pattern)
    end_time = datetime.datetime.now()
    insert_minor_visual_break()
    log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", include_timestamp=False)
    log_message("Finished crawling.", include_timestamp=False)

if __name__ == "__main__":
    start_url = "https://www.yourdomain.com/"
    pattern = r"^https://www\.yourdomain\.com/"
    crawl_website(start_url, pattern)
    # Insert a visual break before summaries
    insert_visual_break()
    # Print summaries
    log_message(f"Total Pages Crawled: {len(visited)}", include_timestamp=False)
    log_message(f"Total Broken Links Found: {broken_links_count}", "red", include_timestamp=False)
    log_message(f"Total Pattern Matches Found: {pattern_matches_count}", "green", include_timestamp=False)
    # Close the output file
    output_file.close()
ivan-crawl-all.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin, urlparse
from termcolor import colored
import datetime
from collections import deque

# Initialize counters and structures
visited = set()
broken_links_count = 0
pattern_matches_count = 0
queue = deque()

# List of excluded file extensions
excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

# Open an output file
output_file_name = 'crawl-all-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True):
    if to_console:
        if color:
            print(colored(message, color))
        else:
            print(message)
    if to_file:
        output_file.write(message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def is_internal_link(link, start_url):
    return urlparse(link).netloc == urlparse(start_url).netloc

def fetch_page(url):
    global broken_links_count
    # Skip non-HTTP/HTTPS URLs, such as tel: or mailto:
    if not url.startswith(('http://', 'https://')):
        return None
    # Check if the URL ends with an excluded extension
    if any(url.lower().endswith(ext) for ext in excluded_extensions):
        return None
    try:
        response = requests.get(url, timeout=5)
        # Raise on 4xx/5xx first so broken links are counted even when the
        # error page is not served as HTML
        response.raise_for_status()
        if 'text/html' not in response.headers.get('Content-Type', ''):
            log_message(f"Skipped non-HTML content at {url}", color="yellow")
            return None
        return response.text
    except requests.RequestException as e:
        log_message(f"Error fetching {url}: {e}", color="red")
        broken_links_count += 1
        return None

def process_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    found_matches = False
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        url = urljoin(base_url, href)
        if url in visited or any(url.lower().endswith(ext) for ext in excluded_extensions):
            continue
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            if not found_matches:
                insert_minor_visual_break()
                log_message(f"Pattern matches on: {base_url}", color="magenta")
                found_matches = True
            log_message(f" Pattern match: {url}", color="green")
            pattern_matches_count += 1
        if is_internal_link(url, start_url):  # start_url is the module-level value set in __main__
            queue.append(url)

def crawl_website(start_url, pattern=None):
    global visited
    queue.append(start_url)
    visited.add(start_url)
    while queue:
        current_url = queue.popleft()
        insert_minor_visual_break()
        log_message(f"Crawling: {current_url}", "cyan")
        content = fetch_page(current_url)
        if content:
            process_links(current_url, content, pattern)

if __name__ == "__main__":
    try:
        start_time = datetime.datetime.now()
        start_url = "http://yourdomain.com/"
        pattern = r"https?://www\.yourdomain\.com/"
        insert_visual_break()
        log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", "blue")
        crawl_website(start_url, pattern)
    except Exception as e:
        log_message(f"An unexpected error occurred: {e}", "red")
    finally:
        insert_minor_visual_break()
        end_time = datetime.datetime.now()
        log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", "blue")
        log_message("Finished crawling.", "blue")
        insert_visual_break()
        log_message(f"Total Pages Crawled: {len(visited)}", "magenta")
        log_message(f"Total Broken Links Found: {broken_links_count}", "red")
        log_message(f"Total Pattern Matches Found: {pattern_matches_count}", "green")
        output_file.close()
ivan-crawl-all-sanitized.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin, urlparse
from termcolor import colored
import datetime
from collections import deque

# Initialize counters and structures
visited = set()
broken_links_count = 0
pattern_matches_count = 0
queue = deque()

# List of excluded file extensions
excluded_extensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.jpg', '.png', '.gif', '.svg', '.kml', '.kmz', '.zip']

# Open an output file
output_file_name = 'crawl-all-sanitized-' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
output_file = open(output_file_name, 'w')

def log_message(message, color=None, to_console=True, to_file=True):
    if to_console and color:
        print(colored(message, color))
    elif to_console:
        print(message)
    if to_file:
        output_file.write(message + "\n")

def insert_visual_break(to_console=True, to_file=True):
    break_line = "=" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def insert_minor_visual_break(to_console=True, to_file=True):
    break_line = "-" * 40
    if to_console:
        print(break_line)
    if to_file:
        output_file.write(break_line + "\n")

def is_internal_link(link, start_url):
    return urlparse(link).netloc == urlparse(start_url).netloc

def fetch_page(url):
    global broken_links_count
    if any(url.lower().endswith(ext) for ext in excluded_extensions) or '#googtrans' in url:
        return None
    try:
        response = requests.get(url, timeout=5)
        # Raise on 4xx/5xx first so broken links are counted even when the
        # error page is not served as HTML
        response.raise_for_status()
        if 'text/html' not in response.headers.get('Content-Type', ''):
            return None
        return response.text
    except requests.RequestException as e:
        insert_minor_visual_break()
        log_message(f"Error fetching {url}: {e}", color="red")
        broken_links_count += 1
        return None

def process_links(base_url, html_content, pattern=None):
    global pattern_matches_count
    found_matches = False
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = link['href'].split('#')[0]  # Remove fragment identifiers
        # Skip non-HTTP/HTTPS URLs, including tel:, mailto:, etc.
        if href.startswith(('tel:', 'mailto:')):
            continue
        # Ensure the href is a properly formatted URL
        if not href.startswith(('http://', 'https://', '/')):
            href = '/' + href
        url = urljoin(base_url, href)
        # Skip if the URL has already been visited or matches an excluded extension
        if url in visited or any(url.lower().endswith(ext) for ext in excluded_extensions):
            continue
        visited.add(url)  # Mark the URL as visited
        if pattern and re.search(pattern, url):
            if not found_matches:
                insert_minor_visual_break()
                log_message(f"Pattern matches on: {base_url}", color="magenta")
                found_matches = True
            log_message(f" Pattern match: {url}", color="green")
            pattern_matches_count += 1
        if is_internal_link(url, start_url):  # Check against the original start_url
            queue.append(url)

def crawl_website(start_url, pattern=None):
    global visited
    queue.append(start_url)
    visited.add(start_url)
    while queue:
        current_url = queue.popleft()
        content = fetch_page(current_url)
        if content:
            process_links(current_url, content, pattern)

if __name__ == "__main__":
    try:
        start_time = datetime.datetime.now()
        start_url = "https://www.yourdomain.com/"
        pattern = r"https?://www\.yourdomain\.com/"
        insert_visual_break()
        log_message(f"Crawl Start Time: {start_time.strftime('%Y-%m-%d %H:%M:%S')}", color="blue")
        crawl_website(start_url, pattern)
    except Exception as e:
        log_message(f"An unexpected error occurred: {e}", color="red")
    finally:
        insert_minor_visual_break()
        end_time = datetime.datetime.now()
        log_message(f"Crawl End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}", color="blue")
        log_message("Finished crawling.", color="blue")
        insert_visual_break()
        log_message(f"Total Pages Crawled: {len(visited)}", color="magenta")
        log_message(f"Total Broken Links Found: {broken_links_count}", color="red")
        log_message(f"Total Pattern Matches Found: {pattern_matches_count}", color="green")
        output_file.close()