
Journal Archiving re-re-re-revisited

stoolpigeon (454276) writes | about a year ago

User Journal 3

I have written many times about how much I'd like to have a tool to archive my journal entries here. I've also written about the very nice tool that LeoP made. (wow - I wrote about it four years ago) But that sucker is now broken.

It probably says a lot about me that I'm interested in saving these. It's just me rambling on, and most of it is only relevant for a limited time. But there's a part of me that doesn't like the idea that if slashdot is shut down tomorrow, all my journal entries go with it.

So anyway - I took a couple stabs at getting things working but usually got stuck and just got busy with other stuff. But now I have a script that I think will work.

I am not a proper programmer like Leo. There is no error handling. You need to put your username and password into the script. And I ran into a really odd case that I haven't handled yet that made it break. I had a journal entry listed, but when I followed the link to the entry I got a message about it not being there or that I couldn't see it. Strange. I hit the "edit" link from the list. That brought it up. I saved it from there and now it seems to be working properly. I've started the script again to see if it gets past it. (it did)

I'll post the script here and link to it so you can grab it if you are interested. I don't know what happens if you don't have a subscription. All this is built around how the site works for me right now.

I think it is also worth pointing out that it is rather slow: it takes anywhere from 1.5 to 3 seconds per journal entry, and it took 52 minutes to download my 1,436 entries (don't look at me like that). It doesn't look like it is doing anything when I run it in Konsole - that's because it prints the html of each page, and they all end the same way. But it should either print DONE when it finishes, or stop when it hits an error. I know, pretty amazing work.
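For what it's worth, those numbers line up. A quick back-of-the-envelope check:

```python
# 52 minutes for 1,436 entries works out to roughly 2.2 seconds
# per entry, comfortably inside the 1.5-3 second range quoted above.
total_seconds = 52 * 60
entries = 1436
per_entry = total_seconds / float(entries)
print(round(per_entry, 2))  # ~2.17 seconds each
```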

Each entry is written to a text file. In the file is the url to the entry, the timestamp from when it was written, the title and the body. The body is in html format with the div that is wrapped around it in place.
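If you ever want to load one of those files back in, the layout above (url line, timestamp line, title line, then the html body) makes it easy to split by hand. A minimal sketch - the `read_entry` helper is mine, not part of the script:

```python
def read_entry(path):
    # First three lines are the url, timestamp, and title;
    # everything after that is the html body (div wrapper included).
    with open(path) as f:
        url = f.readline().rstrip('\n')
        date = f.readline().rstrip('\n')
        title = f.readline().rstrip('\n')
        body = f.read()
    return url, date, title, body
```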

Another file, linklist, is also created; it will have every journal url in it (without the http: part - each starts with //).
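Since linklist stores the urls without the scheme, turning them back into something clickable is just a matter of sticking http: on the front. A sketch, assuming one url per line (the `full_urls` helper is made up for illustration):

```python
def full_urls(path="linklist"):
    # Each saved line starts with // (no http:), so prepend the scheme.
    with open(path) as f:
        return ["http:" + line.strip() for line in f if line.strip()]
```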

I have not done anything to save comments yet - though I think I'd like to get that in there too. Quite a few of my entries are worthless without them. But that will have to wait for another time.

You can see it below or just download it. jarch.py

It should also be obvious - but just in case - you will need Python 2, twill, and BeautifulSoup (bs4) installed to use it.


from twill.commands import *
from bs4 import BeautifulSoup

# Fill these in before running.
nick = 'your nickname here'
password = 'your password here'
uid = 'your user id here'

# Log in via the front-page login form (form number 3).
go("http://www.slashdot.org")
fv("3", "unickname", nick)
fv("3", "upasswd", password)
submit()

# Pull up the journal list and grab every link on that page.
go("http://slashdot.org/journal.pl?op=list&uid=" + uid)
all_links = showlinks()
lf = open("linklist", 'w')

# Entry urls are protocol-relative: //slashdot.org/~nick/journal/<id>.
# Anything longer than the bare prefix is an actual entry.
jprefix = "//slashdot.org/~" + nick + "/journal"

for item in all_links:
        if item.url.startswith(jprefix) and len(item.url) > len(jprefix) + 1:
                print >> lf, item.url
                jurl = "http:" + item.url
                go(jurl)
                soup = BeautifulSoup(show())

                # The firehose id ties the title and body elements together.
                journalid = soup.find('span', {'class': 'sd-key-firehose-id'})
                bodytag = "text-" + journalid.string
                titletag = "title-" + journalid.string
                entrydate = soup.time.string[3:]
                entrytitle = soup.find('span', {'id': titletag}).text
                entrybody = soup.find('div', {'id': bodytag}).prettify()
                journalfile = "jfile" + journalid.string

                # One file per entry: url, timestamp, title, then html body.
                f = open(journalfile, 'w')
                f.write(jurl.encode('utf8') + '\n')
                f.write(entrydate.encode('utf8') + '\n')
                f.write(entrytitle.encode('utf8') + '\n')
                f.write(entrybody.encode('utf8'))
                f.close()

print("DONE")
lf.close()

3 comments

Slash::Journal (1)

pudge (3605) | about a year ago | (#44582551)

I had an easy archiver that I used before, using the Slash SOAP API and a module I wrote called Slash::Client::Journal (which should be on the CPAN). Dunno if the SOAP API still works though.

moo (1)

Chacham (981) | about a year ago | (#44582803)

I ought to look at doing this. I wouldn't mind a local copy of my own dribble.

Re:moo (0)

Anonymous Coward | about a year ago | (#44598537)

"drivel". You mean "drivel".
