Wednesday, October 24, 2007

HUBackup and rdiff-backup, or how I should have done it in the first place!

Okay, so while exploring the rdiff-backup code base I have come across some ideas I don't want to forget, yet I don't feel like editing or creating a new spec just now (that will come later). As part of my plans for Hardy (again an LTS release), I want to start moving in the direction of making HUBackup the tool it was meant to be ;) so the first point I want to note:

* Use rdiff-backup as the backup/archiver tool instead of DAR. Although DAR is a truly amazing tool, it currently lacks good Python bindings, and the fact that HUBackup drives it from a ptty makes it a bit of a pain to maintain and extend with features. This also stands in the way of a better restore process. Moreover, using rdiff-backup I now think of just letting it create its "meta data" (reverse diffs) alongside the directories the user already wants to back up, which will *greatly* reduce space overhead when backing up (essentially just burning the folders to the optical media together with the special information needed for restoring permissions and other file attributes). Given that, I still need to explore how to slice up the backup data to fit on more than one optical medium (a rough sketch of both ideas follows below).
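Here is a minimal sketch of what I have in mind, assuming rdiff-backup is driven through subprocess instead of a ptty. The helper names (run_rdiff_backup, slice_into_volumes) and the DVD capacity constant are my own placeholders for illustration, not anything that exists in HUBackup or rdiff-backup:

#!/usr/bin/env python
# Rough sketch only: run_rdiff_backup, slice_into_volumes and DVD_CAPACITY
# are hypothetical names, not part of HUBackup or rdiff-backup.

import os
import subprocess

DVD_CAPACITY = 4400 * 1024 * 1024  # assume roughly 4.4 GB usable per single-layer DVD

def run_rdiff_backup(source_dir, mirror_dir):
    """Drive rdiff-backup through subprocess rather than a ptty.
    rdiff-backup keeps its reverse diffs under <mirror_dir>/rdiff-backup-data."""
    return subprocess.call(['rdiff-backup', source_dir, mirror_dir])

def slice_into_volumes(root_dir, capacity=DVD_CAPACITY):
    """Greedily group the files under root_dir into lists whose total size
    fits on one optical medium. A single file larger than the capacity would
    still need real splitting, which this sketch does not attempt."""
    volumes = [[]]
    used = 0
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if used + size > capacity and volumes[-1]:
                volumes.append([])
                used = 0
            volumes[-1].append(path)
            used += size
    return volumes

The greedy grouping is deliberately naive; it only answers roughly how many discs a backup set would need. Anything smarter, including splitting files that are bigger than a single disc, is exactly the part I still have to explore.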

Sunday, October 14, 2007

Very Simple Web Scraping

Having gotten interested lately in extracting some data (namely emails and sublinks) from web pages, I came up with a very simple, very straightforward Python class that, when instantiated, holds all the unique outgoing links of the web page in question and all the unique email addresses it contains. It is not intended for "production" use in any way, as it does not respect any of the HTTP GET rules that even very simple fetchers support; this was just for practice. However, you are free to use it for your own purposes. I wonder whether, when improved, it would be a good candidate for building a graph of the web as it looks from the standpoint of a specific "seed" web page (of course I will need to wrap it in a recursive algorithm to follow the discovered links more deeply; a sketch of such a wrapper appears after the code).

Here is the code:


#!/usr/bin/env python

import re
import sys
import urllib2


class URLRepo:

    def __init__(self, URL):
        """A class representing the collection of all son URLs a parent URL holds.
        URL is the internet URL to process."""
        self.URL = URL  # holds this URL's address, as it is the parent of the son URLs it will hold
        # note this completely disregards proper HTTP workflow, e.g. proper GET headers
        raw_html = urllib2.urlopen(URL).read()
        self.html = raw_html
        emailre = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE)
        # match the URL and its corresponding title of every anchor tag
        linkre = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                            re.IGNORECASE | re.DOTALL)
        links = linkre.findall(raw_html)
        emails = emailre.findall(raw_html)

        # filter out any link that may return us to the same parent URL
        # and clean out email and in-page anchor links
        self.links = [i for i in links if (not i[0].startswith('/') and
                                           not i[0].startswith('mailto') and
                                           not i[0].startswith('#') and
                                           not i[0].startswith(self.URL) and
                                           (i[0].startswith('http') or
                                            i[0].startswith('www')))]

        # keep only the unique email addresses
        self.emails = []
        for eml in emails:
            if eml not in self.emails:
                self.emails.append(eml)


if __name__ == '__main__':
    SEED_URL = sys.argv[1]
    A = URLRepo(SEED_URL)
    print "List of links:"
    print "--------------"
    for link in A.links:
        print link
    print "List of Email addresses"
    print "-----------------------"
    for eml in A.emails:
        print eml
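
Here is also a minimal sketch of the recursive wrapper mentioned above, reusing the URLRepo class; the crawl() function and its depth limit are just my own illustration, not part of the class:

def crawl(seed_url, depth=2, graph=None):
    """Recursively follow the outgoing links of seed_url up to 'depth' levels,
    building a simple adjacency map: page URL -> list of outgoing link URLs."""
    if graph is None:
        graph = {}
    if depth == 0 or seed_url in graph:
        return graph
    try:
        repo = URLRepo(seed_url)
    except Exception:  # unreachable page, bad HTML, etc. -- just skip it
        return graph
    child_urls = [url for url, title in repo.links]
    graph[seed_url] = child_urls
    for url in child_urls:
        crawl(url, depth - 1, graph)
    return graph

Called as crawl(SEED_URL, depth=2), it returns a dictionary mapping every visited page to the list of outgoing links found on it, which is already a crude adjacency-list view of the graph as seen from the seed page.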