After receiving quite a few emails from folks bewildered by the crashing hurestore, I think it's only fair to write a post about how to restore files from archives created with hubackup, using the underlying archiver, DAR.
Usually after a typical run of hubackup you will have two resulting files:
.hubackup-data/paepp-master-archive.1.dar
.hubackup-data/paepp-master-catalog.1.dar
(Thanks go to Peter Päppinghaus for sending me an email about this)
To restore, which in DAR's language usually means to extract, you need to do something like:
dar -x .hubackup-data/paepp-master-archive -R TARGET_DIR
If there is more than one slice, DAR knows how to switch between them properly.
That should be enough for most operations. For a more detailed explanation of what dar can do when restoring (or backing up, for that matter), DAR's manual pages are quite good.
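If you would rather script the restore than type the command by hand, a minimal sketch could look like the following. It only wraps the exact `dar -x ... -R ...` invocation shown above; the function names are made up for illustration and are not part of hubackup or dar.

```python
import subprocess


def build_dar_extract_command(archive_basename, target_dir):
    """Return the argument list for extracting a DAR archive into target_dir.

    archive_basename is the archive path without the ".1.dar" slice suffix,
    exactly as in the command line shown in the post.
    """
    return ["dar", "-x", archive_basename, "-R", target_dir]


def restore(archive_basename, target_dir):
    """Run dar; raises subprocess.CalledProcessError if dar reports failure."""
    subprocess.run(build_dar_extract_command(archive_basename, target_dir), check=True)
```

Calling `restore(".hubackup-data/paepp-master-archive", "TARGET_DIR")` would then run the same command as above, assuming dar is installed and on the PATH.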
Monday, January 07, 2008
Wednesday, October 24, 2007
HUBackup and rdiff-backup, or how I should have done it in the first place!
Okay, so while exploring the rdiff-backup code base I have come across some ideas I do not want to forget, yet I don't feel like editing or creating a new spec (that will come later). As part of my plans for Hardy (again an LTS release), I want to start moving HUBackup in the direction of being the tool it was meant to be ;) So the first point I want to note:
* Use rdiff-backup as the backup/archiver tool instead of DAR. Although DAR is a truly amazing tool, it currently lacks good Python bindings, and the fact that HUBackup drives it through a pty makes it a bit of a pain to maintain and to expand with features. This also stands in the way of a better restore process. Moreover, with rdiff-backup I now think of just letting it create its "meta data" (reverse diffs) alongside the already existing directories the user wants to back up, which will *greatly* reduce space overhead when backing up (essentially just burning the folders to the optical media together with the special information for restoring the permissions and other file attributes). Given that, I now need to explore how to slice up backup data to fit on more than one optical medium.
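As a starting point for the slicing question, here is a rough sketch of one possible approach: first-fit decreasing, a simple bin-packing heuristic. This is purely my assumption of how it could be done and not something rdiff-backup or HUBackup provides; `plan_media` and its arguments are hypothetical names, and sizes are plain byte counts.

```python
def plan_media(file_sizes, capacity):
    """Assign files to media with the first-fit decreasing heuristic.

    file_sizes maps a path to its size in bytes; capacity is the usable
    size of one medium in bytes. Returns one list of paths per medium.
    """
    media = []  # each entry is [free_bytes, [paths]]
    # Place the largest files first; ties are broken by path for stable output.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > capacity:
            raise ValueError("%s does not fit on a single medium" % path)
        for medium in media:
            if medium[0] >= size:
                medium[0] -= size
                medium[1].append(path)
                break
        else:
            # No existing medium has room, so start a new one.
            media.append([capacity - size, [path]])
    return [paths for _, paths in media]
```

The heuristic does not guarantee the minimum number of media, and it does not split a single file across media, which a real implementation would have to handle for very large files.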
Sunday, October 14, 2007
Very Simple Web Scraping
Having lately become interested in extracting some data (namely emails and sublinks) from web pages, I came up with a very simple, very straightforward Python class that, when instantiated, holds all the unique outgoing links and all the unique email addresses found on the web page in question. It is not intended for "production" use in any way, as it does not respect any of the HTTP GET rules that even very simple fetchers support; it was just for practice. However, you are free to use it for your own purposes. I wonder whether, once improved, it would be a good candidate for building a graph of the web as it looks from the standpoint of a specific "seed" web page (of course, I would need to wrap it in a recursive algorithm to go deep, following the links discovered).
Here is the code:
#!/usr/bin/env python
import re
import sys
import urllib2


class URLRepo:
    """A class representing the collection of all son URLs a parent URL holds."""

    def __init__(self, URL):
        """URL is the internet URL to process."""
        # holds this URL's address, as it is the parent of the son URLs it will hold
        self.URL = URL
        # note this completely disregards proper HTTP workflow e.g. proper GET headers
        raw_html = urllib2.urlopen(URL).read()
        self.html = raw_html
        emailre = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE)
        # match the URL and its corresponding title;
        # assumes href values are wrapped in single or double quotes
        linkre = re.compile(r'''<a\s[^>]*?href=["']([^"']*)["'][^>]*>(.*?)</a>''',
                            re.IGNORECASE | re.DOTALL)
        links = linkre.findall(raw_html)
        emails = emailre.findall(raw_html)
        # filter out any link that may return us to the same parent URL and clean out email links
        self.links = [i for i in links if (not i[0].startswith('/') and
                                           not i[0].startswith('mailto') and
                                           not i[0].startswith('#') and
                                           not i[0].startswith(self.URL) and
                                           (i[0].startswith('http') or
                                            i[0].startswith('www')))]
        # keep each email address only once, in order of first appearance
        self.emails = []
        for eml in emails:
            if eml not in self.emails:
                self.emails.append(eml)


if __name__ == '__main__':
    SEED_URL = sys.argv[1]
    A = URLRepo(SEED_URL)
    print "List of links:"
    print "--------------"
    for link in A.links:
        print link
    print "List of Email addresses"
    print "-----------------------"
    for eml in A.emails:
        print eml
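For the graph idea mentioned above, here is a minimal sketch of how the link following could be wrapped. It is an assumption of mine rather than finished code: it uses an iterative breadth-first walk with a depth limit instead of recursion, it is written for Python 3, and `build_link_graph` and `get_links` are hypothetical names. `get_links` is any function that takes a URL and returns the URLs that page links to.

```python
from collections import deque


def build_link_graph(seed_url, get_links, max_depth=2):
    """Breadth-first walk: map each visited URL to the list of URLs it links to.

    Pages deeper than max_depth hops from the seed are not fetched.
    """
    graph = {}
    queue = deque([(seed_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue  # already visited, avoids loops between pages
        links = list(get_links(url))
        graph[url] = links
        if depth < max_depth:
            for link in links:
                if link not in graph:
                    queue.append((link, depth + 1))
    return graph
```

The depth limit and the visited check are what keep the walk from running forever on pages that link back to each other. Passing the fetching function in as a parameter also makes it easy to try the walk on a fake, in-memory "web" before pointing it at real pages.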
Saturday, April 14, 2007
A Word Of Gratitude
Jonathan, Paul, Daniel and Rob: I just wanted to thank you for a wonderful time a while back. Although it seems like a distant, warm dream now that some time has passed, I fully intend to do something like this again, on the same grounds, and to explore further your wonderful, green and water-rich country ;-)
Oh dear, now how do I recreate the fun earned from that amazing canoe ride? .....
Monday, December 04, 2006
Surely not a way to do performance testing, #2
Thanks to all the kind folks for commenting on my previous post. After a couple of people reported only mild performance losses with my "experiment", one guy with another IBM/Lenovo system based on a PATA-to-SATA bridge managed to reproduce the issue. Is this hardware related? I'd like to kindly ask owners of IBM/Lenovo laptops to check whether their model has the bridge and to attempt my "experiment". If this is not Linux's fault, then surely there should be a remedy for people with warranty. Now where is find for Windows so I can run the same "experiment" there... ;)
Surely not a way to do performance testing, but still...
Back when I started using Linux-based operating systems (namely RedHat and Mandrake, then settling on Debian before finally arriving at Ubuntu), I used to show off to my Windows-using friends one of the "features" that was, at the time, the major attraction GNU/Linux had for me: rock solid, smooth multitasking that always kept the system responsive and usable.
A special case of that: I showed them how, under heavy disk IO, I could still have a responsive UI and use the desktop, while their Windows desktop, on the exact same hardware and running the same "benchmark" operation, started to lag, showed very jerky mouse movement and nearly choked to death.
Are those times over?
Recently, I revived this "experiment" using my ThinkPad T43p laptop with the following specs:
- PCI-E bus
- 1GB Ram
- 1.87GHz
- 60GB PATA disk, connected through what is known as a PATA-to-SATA bridge.
Now, what I have done (I urge you to try the same and let me know how it went for you) is make sure no applications are running after boot, open one gnome-terminal window, and there, in quick succession, do:
find /
CTRL+SHIFT+T (open a new tab on the gnome-terminal)
And repeat this until you have 4 tabs with find running inside them. Now when I attempt to open some more tabs (1-2), the UI starts to lag, disk access becomes increasingly slow, and the UI eventually becomes so unresponsive that even the mouse cursor refuses to obey the mouse movements. Even when the tabs' output is not displayed (e.g. after ALT+TAB to another window to enter text), the performance loss doesn't go away, and it even prevents me from easily entering text into this blog post. The most annoying part is that using ALT+TAB to switch to another app becomes nearly impossible when the system is under this load (e.g. you can actually see the UI redrawing as it happens). Something tells me this should not be the case...
Does anybody have an idea what causes this? How can we possibly address it? In the beginning I thought using the -nolatency kernels could help, but they seem to have no effect. Any insights, comments or feedback are welcome.
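To put rough numbers on the "experiment" instead of judging by the feel of the mouse cursor, something like the sketch below could be used. It is only an approximation under my own assumptions: it measures how late short sleeps wake up, which is a proxy for scheduling latency rather than a real measure of desktop responsiveness, and it sends the find output to /dev/null, so the terminal drawing load from the original setup is absent. All function names are hypothetical.

```python
import subprocess
import time


def start_find_load(count=4):
    """Start `count` background `find /` processes, discarding their output."""
    return [
        subprocess.Popen(["find", "/"],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
        for _ in range(count)
    ]


def measure_wakeup_delays(samples=20, interval=0.01):
    """Sleep `interval` seconds `samples` times; return how late each wake-up was."""
    delays = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        delays.append(time.monotonic() - start - interval)
    return delays


def summarize(delays):
    """Return the mean and worst-case delay, in seconds."""
    return {"mean": sum(delays) / len(delays), "max": max(delays)}
```

Running `summarize(measure_wakeup_delays())` once on an idle system and once after `start_find_load(4)` would give two sets of figures to compare; remember to terminate the find processes afterwards.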
Saturday, November 18, 2006
Some stuff I don't want to forget for hubackup so I'll put them here:
- Think about maybe integrating something into the installer, such that when a user is detected to have a CD/DVD drive and is interested in enabling hubackup desktop notifications, the installer will offer to create a special partition for storing the temporary archive and ISO files before burning them.
- Add checks while the backup is being created, such that when free space runs too low, either abort the process with a proper error, or halt the dar process until the user frees up some more space and then allow resuming it.
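A minimal sketch of the second note, assuming a POSIX system and that the PID of the running dar process is known. The function names and the threshold are hypothetical and nothing here exists in hubackup yet; the free-space figure comes from Python's standard `shutil.disk_usage`.

```python
import os
import shutil
import signal


def free_bytes(path):
    """Return the free space, in bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def is_space_low(path, min_free_bytes):
    """True when less than `min_free_bytes` is left where the archive is written."""
    return free_bytes(path) < min_free_bytes


def pause_process(pid):
    """Halt a running process (e.g. dar) without killing it."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid):
    """Let a previously paused process continue."""
    os.kill(pid, signal.SIGCONT)
```

A backup loop could poll `is_space_low()` every few seconds, call `pause_process()` when it returns True, notify the user, and call `resume_process()` once enough space has been freed.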