
Journal Archiving re-re-re-revisited

stoolpigeon (454276) writes | about a year ago

User Journal 3

I have written many times about how much I'd like to have a tool to archive my journal entries here. I've also written about the very nice tool that LeoP made. (wow - I wrote about it four years ago) But that sucker is now broken.

It probably says a lot about me that I'm interested in saving these. It's just me rambling on, and most of it is only relevant for a limited time. But there's a part of me that doesn't like the idea that if slashdot is shut down tomorrow, all my journal entries go with it.

So anyway - I took a couple stabs at getting things working but usually got stuck and just got busy with other stuff. But now I have a script that I think will work.

I am not a proper programmer like Leo. There is no error handling. You need to put your username and password into the script. And I ran into a really odd case that I haven't handled yet that made it break. I had a journal entry listed, but when I followed the link to the entry I got a message about it not being there or that I couldn't see it. Strange. I hit the "edit" link from the list. That brought it up. I saved it from there and now it seems to be working properly. I've started the script again to see if it gets past it. (it did)

I'll post the script here and link to it so you can grab it if you are interested. I don't know what happens if you don't have a subscription. All this is built around how the site works for me right now.

I think it is also worth pointing out that it is rather slow: it takes anywhere from 1.5 to 3 seconds per journal entry, and it took 52 minutes to download my 1,436 entries (don't look at me like that). It doesn't look like it is doing anything when I run it in Konsole - that's because it prints the html of each page, and they all end the same way. But it should either print DONE when it finishes, or stop when it hits an error. I know, pretty amazing work.
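For what it's worth, those numbers line up. A quick back-of-the-envelope check:

```python
# 52 minutes for 1,436 entries works out to roughly 2.2 seconds
# per entry, comfortably inside the 1.5-3 second range quoted above.
total_seconds = 52 * 60
entries = 1436
per_entry = total_seconds / float(entries)
print(round(per_entry, 2))  # ~2.17 seconds each
```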

Each entry is written to a text file. In the file is the url to the entry, the timestamp from when it was written, the title and the body. The body is in html format with the div that is wrapped around it in place.
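If you ever want to load one of those files back in, the layout above (url line, timestamp line, title line, then the html body) makes it easy to split by hand. A minimal sketch - the `read_entry` helper is mine, not part of the script:

```python
def read_entry(path):
    # First three lines are the url, timestamp, and title;
    # everything after that is the html body (div wrapper included).
    with open(path) as f:
        url = f.readline().rstrip('\n')
        date = f.readline().rstrip('\n')
        title = f.readline().rstrip('\n')
        body = f.read()
    return url, date, title, body
```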

Another file, linklist, is also created; it will have every journal url in it (without the http: part - each starts with //).
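Since linklist stores the urls without the scheme, turning them back into something clickable is just a matter of sticking http: on the front. A sketch, assuming one url per line (the `full_urls` helper is made up for illustration):

```python
def full_urls(path="linklist"):
    # Each saved line starts with // (no http:), so prepend the scheme.
    with open(path) as f:
        return ["http:" + line.strip() for line in f if line.strip()]
```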

I have not done anything to save comments yet - though I think I'd like to get that in there too. Quite a few of my entries are worthless without them. But that will have to wait for another time.

You can see it below or just download it. jarch.py

It should also be obvious - but just in case - you will need Python 2, twill, and BeautifulSoup (bs4) installed to use it.


from twill.commands import *
from bs4 import BeautifulSoup

# Fill these in before running.
nick = 'your nickname here'
password = 'your password here'
uid = 'your user id here'

# Log in via the front-page login form (form number 3).
go("http://www.slashdot.org")
fv("3", "unickname", nick)
fv("3", "upasswd", password)
submit()

# Pull up the journal list and grab every link on that page.
go("http://slashdot.org/journal.pl?op=list&uid=" + uid)
all_links = showlinks()
lf = open("linklist", 'w')

# Entry urls are protocol-relative: //slashdot.org/~nick/journal/<id>.
# Anything longer than the bare prefix is an actual entry.
jprefix = "//slashdot.org/~" + nick + "/journal"

for item in all_links:
        if item.url.startswith(jprefix) and len(item.url) > len(jprefix) + 1:
                print >> lf, item.url
                jurl = "http:" + item.url
                go(jurl)
                soup = BeautifulSoup(show())

                # The firehose id ties the title and body elements together.
                journalid = soup.find('span', {'class': 'sd-key-firehose-id'})
                bodytag = "text-" + journalid.string
                titletag = "title-" + journalid.string
                entrydate = soup.time.string[3:]
                entrytitle = soup.find('span', {'id': titletag}).text
                entrybody = soup.find('div', {'id': bodytag}).prettify()
                journalfile = "jfile" + journalid.string

                # One file per entry: url, timestamp, title, then html body.
                f = open(journalfile, 'w')
                f.write(jurl.encode('utf8') + '\n')
                f.write(entrydate.encode('utf8') + '\n')
                f.write(entrytitle.encode('utf8') + '\n')
                f.write(entrybody.encode('utf8'))
                f.close()

print("DONE")
lf.close()

3 comments

Slash::Journal (1)

pudge (3605) | about a year ago | (#44582551)

I had an easy archiver that I used before, using the Slash SOAP API and a module I wrote called Slash::Client::Journal (which should be on the CPAN). Dunno if the SOAP API still works though.

moo (1)

Chacham (981) | about a year ago | (#44582803)

I ought to look at doing this. I wouldn't mind a local copy of my own dribble.

Re:moo (0)

Anonymous Coward | about a year ago | (#44598537)

"drivel". You mean "drivel".
