Sunday, October 14, 2007

Very Simple Web Scraping

I recently got interested in extracting some data (namely emails and sublinks) from web pages, so I came up with a very simple, very straightforward Python class that, when instantiated, holds all the unique outgoing links and all the unique email addresses found on the web page in question. It is not intended for "production" use in any way, as it does not respect any of the HTTP GET rules that even very simple fetchers support; this was just for practice. However, you are free to use it for your own purposes. I wonder whether, once improved, it would be a good candidate for building a graph of the web as it looks from the standpoint of a specific "seed" web page. (Of course, I would need to wrap it in a recursive algorithm to go deep, following the links discovered.)

Here is the code:

#!/usr/bin/env python

import re
import os
import sys
import urllib2

class URLRepo:
    URL = None  # holds this URL's address, as it is the parent of the child URLs it will hold

    def __init__(self, URL):
        """ A class representing the collection of all child URLs a parent URL holds.
        URL is the internet URL to process."""
        self.URL = URL
        # note this completely disregards proper HTTP workflow, e.g. proper GET headers
        raw_html = urllib2.urlopen(URL).read()
        self.html = raw_html
        emailre = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE)
        # match the URL and its corresponding title (a simple pattern, not a full HTML parser)
        linkre = re.compile(r'<a\s[^>]*?href=["\'](.*?)["\'][^>]*>(.*?)</a>',
                            re.IGNORECASE | re.DOTALL)
        links = linkre.findall(raw_html)  # list of (url, title) tuples
        emails = emailre.findall(raw_html)
        self.emails = []

        # filter out any link that may return us to the same parent URL and clean out email links
        self.links = [i for i in links if (not i[0].startswith('/') and
                                           not i[0].startswith('mailto') and
                                           not i[0].startswith('#') and
                                           not i[0].startswith(self.URL) and
                                           i[0].startswith('http'))]

        # keep only one copy of each email address
        for eml in emails:
            if eml not in self.emails:
                self.emails.append(eml)

if __name__ == '__main__':
    SEED_URL = sys.argv[1]
    A = URLRepo(SEED_URL)
    print "List of links:"
    print "--------------"
    for link in A.links:
        print link
    print "List of Email addresses"
    print "-----------------------"
    for eml in A.emails:
        print eml
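
The idea of building a graph of the web from a "seed" page could be sketched roughly as follows. This is only an illustration, written for Python 3, and it uses an iterative breadth-first walk with a depth limit rather than actual recursion, so that link cycles and deep chains do not cause trouble. The names build_link_graph, get_links and max_depth are hypothetical and not part of the class above; get_links stands for any callable that takes a URL and returns the URLs found on that page, which could be a thin wrapper around a link extractor like the one above.

```python
from collections import deque


def build_link_graph(seed_url, get_links, max_depth=2):
    """Return {page_url: [outgoing urls]} for pages within max_depth hops of the seed.

    get_links is a hypothetical callable: it takes a URL and returns
    an iterable of the URLs that page links to.
    """
    graph = {}
    queue = deque([(seed_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue  # already visited, so cycles cannot loop forever
        children = []
        for child in get_links(url):
            if child not in children:  # keep each outgoing link once
                children.append(child)
        graph[url] = children
        if depth < max_depth:
            # only follow links while we are still within the depth limit
            queue.extend((child, depth + 1) for child in children)
    return graph


if __name__ == "__main__":
    # A tiny made-up "web" so the sketch can run without any network access.
    fake_web = {
        "http://a.example/": ["http://b.example/", "http://c.example/"],
        "http://b.example/": ["http://a.example/", "http://d.example/"],
        "http://c.example/": [],
        "http://d.example/": ["http://e.example/"],
    }
    graph = build_link_graph("http://a.example/",
                             lambda u: fake_web.get(u, []),
                             max_depth=1)
    for page, links in graph.items():
        print(page, "->", links)
```

Passing the link extractor in as a function keeps the traversal separate from the fetching, so the walk can be tried on made-up data first and pointed at real pages later.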
