
mini-HOWTO: Archiving your /. user journal using WGet

cyranoVR (518628) writes | more than 10 years ago

GNU is Not Unix

Update 7/23/04 1:25 PM

Use this perl script instead.

= = Useless Content Follows = =

First of all, you should know that WGet's "recursive" switch will not help you. As far as I can see, it simply doesn't recognize the intra-site URL convention that Taco et al. have devised - namely

<A HREF="//slashdot.org/journal.pl?op=display">


Use Firefox's "view source" and you'll see what I'm referring to. Apparently they do this to keep you within your "topic" when you click on a link (for those of you who always browse under yro.slashdot.org). Blah, whatever. So anyway, maybe I'm doing something wrong, but WGet no likey.

Easy workaround: take advantage of WGet's -i filename switch, which instructs WGet to obtain the list of URLs to download from filename.

Start by obtaining a list of all your journal URLs from the "All Journal Entries" page:

http://slashdot.org/journal.pl?op=list&uid=your_uid_here

and put them in a simple text file. How you harvest the URLs is up to you. I used a Visual Basic subroutine, Sub GrabLinks(strURL), that I wrote last year for work; it uses the WebBrowser.InternetExplorer and MSHTML.HTMLDocument ActiveX objects. God, how I have grown to hate Visual Basic.

Next week I am going to figure out how to do the same in perl using LWP::Simple, HTML::Parser and some clever regular expressions. I bet it could be done in something like 5-7 lines.
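In the meantime, here's roughly what that harvesting step looks like - sketched in Python rather than the planned perl, and assuming the scheme-relative href convention shown above (the function name is mine):

```python
import re

# Pull every journal-display link out of the saved "All Journal Entries" page
# source. Slashdot's intra-site links are scheme-relative ("//slashdot.org/...")
# per the view-source observation above, so prepend "http:" to give WGet an
# absolute URL it can actually fetch.
def harvest_journal_urls(html):
    links = re.findall(r'href="(//slashdot\.org/journal\.pl\?op=display[^"]*)"',
                       html, re.IGNORECASE)
    return ["http:" + link for link in links]
```

Save the op=list page to disk, run its contents through this, and write the result one URL per line into journals.txt for the -i switch below.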

Actually, I bet you could set up a script that checks your journal index for new entries, calls WGet to download the new journals, and creates an updated local index.html page for your archive.
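A rough Python sketch of that idea. The file-naming guess here (that wget -nd --html-extension saves each entry under its query string plus ".html") is an assumption worth checking against what actually lands in your archive directory:

```python
import os
import subprocess

def local_name(url):
    # Guess at the filename wget produces with -nd --html-extension:
    # everything after the last "/" plus ".html". An assumption -- verify
    # against a real download.
    return url.rsplit("/", 1)[-1] + ".html"

def write_index(urls, archive_dir):
    # Bare-bones local index.html pointing at the downloaded copies.
    with open(os.path.join(archive_dir, "index.html"), "w") as f:
        f.write("<html><body><ul>\n")
        for url in urls:
            f.write('<li><a href="%s">%s</a></li>\n' % (local_name(url), url))
        f.write("</ul></body></html>\n")

def archive_new(urls, archive_dir="journal"):
    # Fetch only the entries we don't already have, then rebuild the index.
    os.makedirs(archive_dir, exist_ok=True)
    have = set(os.listdir(archive_dir))
    new = [u for u in urls if local_name(u) not in have]
    if new:
        subprocess.run(["wget", "-p", "--convert-links", "-P", archive_dir,
                        "--html-extension", "-nH", "-nd", "-w", "2"] + new,
                       check=True)
    write_index(urls, archive_dir)
```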

Anyways...

Now set up .wgetrc, the WGet configuration file, which is located in your home directory. If it doesn't exist, create it. Mine looks like this (minus the perl-style comments):

robots=off # uncouth but required!
span_hosts=off # just to be safe
cookies=on
load_cookies=~/.mozilla/default/wierd-ass-profile-name/cookies.txt

We're almost ready to go. Navigate to the folder where you want your journals to reside and execute the following WGet command:

wget -p --convert-links -Pjournal \
     --html-extension -i journals.txt \
     -nH -nd -w 2

And watch the magic happen! After about 15 minutes I had all 521 (!) of my journal entries on my computer...and I didn't even get banned!

(Man...521 entries over 2 years...that's an entry every 1.4 days! I need a hobby...)

If you really want to know what the switches mean, see this reference page. Notes: put the full path to journals.txt if it's not in your current working directory. -Pjournal has WGet create a sub-directory "journal," in which it places the downloaded files.

There are still some problems - I tried to include my cookies file so it would see me as logged in, but slashdot is apparently smarter than that and it didn't work. This means my comment-viewing preferences didn't go through - I got "logged-out" pages.

However, if comments are your thing, you can pretty easily get all of those using Slashdot's standard querystring interface. Fortunately, every journal discussion can be accessed on its own page - i.e.

comments.pl?sid=115456&threshold=3&mode=flat&commentsort=4&op=Change

so just get a list of all the discussions you want to archive, figure out the options you want in the query string, and WGet away!
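Building that URL list is trivial to script - here's a Python sketch using the exact querystring format from the example above (the default option values are just the ones shown, not anything official; substitute your own preferences):

```python
# Build a wget-ready URL list for whichever discussions you want to archive.
# threshold/mode/commentsort defaults are simply the values from the example
# querystring above.
def comment_urls(sids, threshold=3, mode="flat", commentsort=4):
    base = "http://slashdot.org/comments.pl"
    return ["%s?sid=%s&threshold=%d&mode=%s&commentsort=%d&op=Change"
            % (base, sid, threshold, mode, commentsort)
            for sid in sids]
```

Write the result to a file, one URL per line, and hand it to wget -i as before.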

Also, the journal.pl?op=list page doesn't create relative links...however, parsing the file and making your own handy index page without the distracting slashdot stuff shouldn't be so hard.

What would really be cool: writing a perl script that parses the actual content out of the journal downloads. I'm pretty sure there are standard delimiters for the start and end of each journal entry...
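A sketch of that extraction, in Python. The START/END delimiters below are pure placeholders, NOT Slashdot's real markup - open one of the downloaded pages and substitute whatever actually brackets the entry body:

```python
import re

# Placeholder delimiters -- made up for illustration. Check a real downloaded
# journal page and replace these with the actual markup.
START = '<div class="journal-entry">'
END = '</div>'

def extract_entry(page):
    # Grab everything between the first START/END pair, non-greedily,
    # spanning newlines (re.DOTALL).
    m = re.search(re.escape(START) + r"(.*?)" + re.escape(END), page, re.DOTALL)
    return m.group(1).strip() if m else None
```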

And there's more...if you can't get WGet's cookies feature to trick slashdot into thinking you're a logged-in user, then you don't have your date formatting preferences...and for some reason Taco et al. default to a MMM DD format (???) - that means NO YEAR. Fortunately, the journals themselves are conveniently numbered, so you could write a perl script that processes each file, adding the year to the journal's "Posted on" date. Of course, that's going to be one ugly-ass regex...
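Under the assumed "Posted on MMM DD" wording, the regex turns out less ugly than feared - a Python sketch (the exact date markup is an assumption; tune the pattern against a real downloaded file):

```python
import re

def add_year(text, year):
    # Matches "Posted on MMM DD" (the year-less format described above --
    # an assumption about the exact wording) and appends ", YYYY".
    return re.sub(r"(Posted on\s+[A-Z][a-z]{2}\s+\d{1,2})\b",
                  r"\1, %d" % year, text)
```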

I can really see now why so many people like perl..."it makes the hard jobs possible."

MORE FUN READING
All about SQL injection exploits.
CERT: Protecting web forms from cross-site scripting and code injection.

and

USA TODAY: Ambush TV - behind the scenes at "fake" news shows like Crossballs and The Daily Show with Jon Stewart


8 comments


Very nice (1)

Safety Cap (253500) | more than 10 years ago | (#9776486)

Unfortunately if you don't subscribe, you can't get that list of your earlier JEs without a whole lotta paging...

Re:Very nice (1)

cyranoVR (518628) | more than 10 years ago | (#9776526)

Not true. You can see a list of ALL of ANY user's Journal entries without a subscription. Just log out and try it [slashdot.org] .

OMG!!!111 LOL!!!111 WTF!!!11 (1)

Safety Cap (253500) | more than 10 years ago | (#9776585)

You are the Shizat, man. I humbly beg forgiveness :)

d00d u r teh 1337 (1)

the_mad_poster (640772) | more than 10 years ago | (#9778831)

Heh *ahem* sorry... I slipped into a bout of stupid there for a minute :p

Just chop out all the extraneous bullshit from my script (hint: using HTML::TokeParser is a lot easier than using HTML::Parser) and you could probably shrink the thing down from 500 lines to almost nothing (feeping creaturism strikes again...). Getting the journals is easy, turning them back into something useful isn't hard, but it's not quite as easy. :\

Re:d00d u r teh 1337 (1)

cyranoVR (518628) | more than 10 years ago | (#9779014)

Thanks for the tip on HTML:TokeParser...but what do you mean by "my script?" What script?

Re:d00d u r teh 1337 (1)

the_mad_poster (640772) | more than 10 years ago | (#9779701)

Clicky link [slashdot.org] .

Re:d00d u r teh 1337 (1)

cyranoVR (518628) | more than 10 years ago | (#9780521)

Well, well...great minds think alike, I guess. I will take a look this weekend.

Re:d00d u r teh 1337 (1)

cyranoVR (518628) | more than 10 years ago | (#9781239)

Well, I decided what the hell and tried it:

perl getslash.pl -NcyranoVR -aC:\journal

Well look at that, it's working...

Hooray. Once again I've wasted my time.