
cyranoVR's Journal: mini-HOWTO: Archiving your /. user journal using WGet

Update 7/23/04 1:25 PM

Use this perl script instead.

= = Useless Content Follows = =

First of all, you should know that WGet's "recursive" switch will not help you. As far as I can tell, it simply doesn't recognize the intra-site URL convention that Taco et al. have devised - namely

<A HREF="//slashdot.org/journal.pl?op=display">

Use Firefox's "view source" and you'll see what I'm referring to. Apparently, they do this to keep you within your "topic" when you click on a link - for those of you who always browse under yro.slashdot.org. Blah, whatever. So anyway, maybe I'm doing something wrong, but WGet no likey.

Easy workaround: take advantage of WGet's -i filename switch, which instructs WGet to obtain the list of URLs to download from filename.

Start by obtaining a list of all your journal URLs from the "All Journal Entries" page:

http://slashdot.org/journal.pl?op=list&uid=your_uid_here

and put them in a simple text file. How you harvest the URLs is up to you. I used a Visual Basic subroutine Sub GrabLinks(strURL) that I wrote last year for work. It uses the WebBrowser.InternetExplorer and MSHTML.HTMLDocument ActiveX objects. God, how I have grown to hate Visual Basic.

Next week I am going to figure out how to do the same in perl using LWP::Simple, HTML::Parser, and some clever regular expressions. I bet it could be done in something like 5-7 lines.
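
In the meantime, here's a rough sketch of what that harvesting script might look like. Fair warning: the uid placeholder and the href regex are guesses on my part, so check "view source" on your own list page and adjust to match:

#!/usr/bin/perl
# harvest_links.pl - sketch: pull journal entry URLs out of the
# "All Journal Entries" page and dump them into journals.txt for wget -i.
use strict;
use warnings;
use LWP::Simple;

my $uid  = 'your_uid_here';
my $list = get("http://slashdot.org/journal.pl?op=list&uid=$uid")
    or die "couldn't fetch the journal list page\n";

open my $out, '>', 'journals.txt' or die "journals.txt: $!\n";
# look for hrefs pointing at individual entries (guessed pattern - adjust!)
while ($list =~ m{href="(//slashdot\.org/journal\.pl\?op=display[^"]*)"}gi) {
    print $out "http:$1\n";    # the links are protocol-relative, so prepend http:
}
close $out;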

Actually, I bet you could set up a script that checks your journal index for new entries, calls WGet to download the new journals, and creates an updated local index.html page for your archive.
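
A quick sketch of that idea, assuming journals.txt from above is the running list of what's already archived (the href pattern is the same guess as before, new_journals.txt is just a name I made up, and building the index.html is left as an exercise):

#!/usr/bin/perl
# update_journal.pl - sketch: fetch the list page, figure out which entries
# aren't in journals.txt yet, and hand just those to wget.
use strict;
use warnings;
use LWP::Simple;

my %have;
if (open my $in, '<', 'journals.txt') {
    chomp(my @urls = <$in>);
    @have{@urls} = ();
    close $in;
}

my $uid  = 'your_uid_here';
my $list = get("http://slashdot.org/journal.pl?op=list&uid=$uid")
    or die "couldn't fetch the journal list page\n";

my @new;
while ($list =~ m{href="(//slashdot\.org/journal\.pl\?op=display[^"]*)"}gi) {
    my $url = "http:$1";
    push @new, $url unless exists $have{$url};
}
exit 0 unless @new;    # nothing new, nothing to do

open my $out, '>', 'new_journals.txt' or die "new_journals.txt: $!\n";
print $out "$_\n" for @new;
close $out;

system('wget', '-p', '--convert-links', '-Pjournal', '--html-extension',
       '-i', 'new_journals.txt', '-nH', '-nd', '-w', '2');

# remember the new URLs so we don't fetch them again next time
open my $log, '>>', 'journals.txt' or die "journals.txt: $!\n";
print $log "$_\n" for @new;
close $log;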

Anyways...

Now set up .wgetrc, the WGet configuration file, which is located in your home directory. If it doesn't exist, create it. Mine looks like this (minus the perl-style comments):

robots=off # uncouth but required!
span_hosts=off # just to be safe
cookies=on
load_cookies=~/.mozilla/default/wierd-ass-profile-name/cookies.txt

We're almost ready to go. Navigate to the folder where you want your journals to reside and execute the following WGet command:

wget -p --convert-links -Pjournal \
      --html-extension -i journals.txt \
      -nH -nd -w 2

And watch the magic happen! After about 15 minutes I had all 521 (!) of my journal entries on my computer...and I didn't even get banned!

(Man...521 entries over 2 years...that's an entry every 1.4 days! I need a hobby...)

If you really want to know what the switches mean, see this reference page. Notes: put the full path to journals.txt if it's not in your current working directory. -Pjournal has WGet create a sub-directory "journal," in which it places the downloaded files.

There are still some problems - I tried to include my cookies file so it would see me as logged in, but slashdot is apparently smarter than that and it didn't work. This means that my comments viewing preferences didn't go through. I got "logged-out" pages.

However, if comments are your thing, you can pretty easily get all of those by using Slashdot's standard query-string interface. Fortunately, all journal discussions can be accessed on their own page - i.e.

comments.pl?sid=115456&threshold=3&mode=flat&commentsort=4&op=Change

so just get a list of all the discussions you want to archive, figure out the options you want in the query string, and WGet away!
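
For instance, a little sketch like this would turn a list of sids (one per line in a hypothetical sids.txt) into a URL list that wget -i can chew on - the query-string options are just the ones from the example above:

#!/usr/bin/perl
# comment_urls.pl - sketch: build comments.pl URLs from a list of sids.
use strict;
use warnings;

open my $in,  '<', 'sids.txt'     or die "sids.txt: $!\n";
open my $out, '>', 'comments.txt' or die "comments.txt: $!\n";
while (my $sid = <$in>) {
    chomp $sid;
    next unless $sid =~ /^\d+$/;    # skip blanks and junk
    print $out "http://slashdot.org/comments.pl?sid=$sid"
             . "&threshold=3&mode=flat&commentsort=4&op=Change\n";
}
close $in;
close $out;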

Also, the journal.pl?op=list page doesn't create relative links...however, parsing the file and making your own handy index page without the distracting slashdot stuff shouldn't be so hard.
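
Something along these lines might do it. Caveat: the href/title regex and the local file names are assumptions on my part - wget's -nd --html-extension naming probably won't match "journal_$id.html", so check what actually landed in your journal/ directory and adjust:

#!/usr/bin/perl
# make_index.pl - sketch: scrape ids and titles out of a saved copy of the
# journal.pl?op=list page and spit out a bare-bones local index.html.
use strict;
use warnings;

open my $in, '<', 'journal_list.html' or die "journal_list.html: $!\n";
my $html = do { local $/; <$in> };    # slurp the whole page
close $in;

open my $out, '>', 'journal/index.html' or die "journal/index.html: $!\n";
print $out "<html><body><h1>Journal Archive</h1>\n<ul>\n";
while ($html =~ m{journal\.pl\?op=display[^"]*?id=(\d+)[^>]*>([^<]+)<}gi) {
    my ($id, $title) = ($1, $2);
    print $out qq{<li><a href="journal_$id.html">$title</a></li>\n};
}
print $out "</ul>\n</body></html>\n";
close $out;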

What would really be cool: writing a perl script that parses the actual content out of the journal downloads. I'm pretty sure there are standard delimiters for the start and end of each journal entry...
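
Here's a sketch of how that extraction might go. The begin/end markers below are pure guesses - open one of the downloaded files, find whatever actually brackets the entry body, and drop that into the regex:

#!/usr/bin/perl
# extract_entry.pl - sketch: pull just the entry text out of a saved journal page.
use strict;
use warnings;

my $file = shift or die "usage: $0 journal_file.html\n";
open my $in, '<', $file or die "$file: $!\n";
my $html = do { local $/; <$in> };    # slurp
close $in;

# hypothetical delimiters - replace with the real ones from the page source
if ($html =~ m{<!--\s*begin journal\s*-->(.*?)<!--\s*end journal\s*-->}si) {
    print $1, "\n";
} else {
    warn "delimiters not found in $file - adjust the regex\n";
}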

And there's more...if you can't get WGet's cookies feature to trick slashdot into thinking you're a logged-in user, then you don't have your date formatting preferences...and for some reason Taco et al. default to a MMM DD format (???), which means NO YEAR. Fortunately, the journals themselves are conveniently numbered, so you could write a perl script that processes each of the files, adding the year to the journal "Posted on" date. Of course, that's going to be one ugly-ass regex...
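
Here's roughly what I have in mind. Note that the "Posted on" wording, the MMM DD format, and the id-to-year cutoffs are all assumptions - fill in the cutoffs from entries whose year you actually know:

#!/usr/bin/perl
# add_year.pl - sketch: tack the year back onto the "Posted on" dates,
# using the journal id to figure out which year an entry belongs to.
use strict;
use warnings;

# first journal id of each year - these numbers are made up, replace them
my @year_cutoffs = ( [ 0, 2002 ], [ 40000, 2003 ], [ 70000, 2004 ] );

sub year_for_id {
    my ($id) = @_;
    my $year = $year_cutoffs[0][1];
    for my $c (@year_cutoffs) {
        $year = $c->[1] if $id >= $c->[0];
    }
    return $year;
}

for my $file (glob 'journal/*.html') {
    # the id should be somewhere in the file name wget produced - verify this
    my ($id) = $file =~ /(\d{4,})/ or next;
    my $year = year_for_id($id);

    open my $in, '<', $file or die "$file: $!\n";
    my $html = do { local $/; <$in> };
    close $in;

    # turn "Posted on Jul 23" into "Posted on Jul 23, 2004" (guessed format);
    # the lookaheads keep it from mangling dates that already have a year
    $html =~ s{(Posted\s+on\s+\w{3}\s+\d{1,2})(?!\d)(?!,\s*\d{4})}{$1, $year}gi;

    open my $out, '>', $file or die "$file: $!\n";
    print $out $html;
    close $out;
}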

I can really see now why so many people like perl..."it makes the hard jobs possible."

MORE FUN READING
All about SQL injection exploits.
CERT: Protecting web forms from cross-site scripting and code injection.

and

USA TODAY: Ambush TV - behind the scenes at "fake" news shows like Crossballs and The Daily Show with Jon Stewart
