Building a Fast Wikipedia Offline Reader

kdawson posted about 7 years ago | from the you-could-look-it-up dept.

Programming 208

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
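For readers wondering what "title words based searching" with results "sorted by probability" might look like, here is a minimal sketch in plain Python. It is not ttsiod's code and it does not use Xapian. The function names, the sample titles, and the scoring rule (the fraction of query words found in a title) are all assumptions made for illustration, and Xapian's own ranking is more sophisticated than this.

```python
from collections import defaultdict


def build_title_index(titles):
    """Map each lowercase title word to the set of title ids containing it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in title.lower().split():
            index[word].add(doc_id)
    return index


def search(index, titles, query):
    """Return (title, score) pairs, best match first.

    The score is a toy stand-in for a real relevance weight: the
    fraction of distinct query words that appear in the title.
    """
    words = set(query.lower().split())
    if not words:
        return []
    hits = defaultdict(int)
    for word in words:
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Sort by descending hit count, then alphabetically for stable output.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], titles[kv[0]]))
    return [(titles[doc_id], count / len(words)) for doc_id, count in ranked]


# Hypothetical sample data, standing in for the real article titles.
titles = ["History of coffee", "Coffee", "History of tea", "Prime number"]
index = build_title_index(titles)
results = search(index, titles, "coffee history")
for title, score in results:
    print(f"{score:.2f}  {title}")
```

The user then picks one of the ranked titles, and only that article needs to be pulled out of the compressed dump.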

208 comments

Wow! (2, Funny)

ferrocene (203243) | about 7 years ago | (#20220565)

After doing all that, I think you may have missed your flight! :)

Re:Wow! (-1, Troll)

Anonymous Coward | about 7 years ago | (#20220591)

Yeah right. You and the KKK my ass.

Why? (-1, Flamebait)

Anonymous Coward | about 7 years ago | (#20220645)

This is completely pointless. Everyone has an Internet connection now. This idiot is wasting his time. And LaTeX sucks for this. It is hard to use and nonstandard to PDF. And as soon as someone edits Wikipedia, his stupid version is obsolete! If I had an account, I would tag this "why." Everyone should tag this "why" because this is stupid. Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!

Re:Why? (3, Insightful)

bn557 (183935) | about 7 years ago | (#20220685)

This may seem like a stupid, trivial, and pointless project, but the programmer may have gained something from it that he could use later in something you don't feel that way about. If the programmer enjoyed doing it, that might have led to a more productive coding session later in the day too.

Re:Why? (5, Funny)

rabblerabble (884373) | about 7 years ago | (#20220707)

I'll bite...Unfortunately, I don't have a basement, so therefore there are times that I am required to venture into the outer realm that happens to be heated by the big ball of gas known as Sol, as opposed to a pump ;P Seriously though, this is exactly what I have been looking for. What better way to show up your friends when they cry "You're wrong, google it!" knowing that there is no connection possible within twenty miles. Next time i'm drunk at the beach and someone wants to pretend to know the history of coffee harvesting, it's on.

Just settle it the old way (4, Funny)

EmbeddedJanitor (597831) | about 7 years ago | (#20220725)

Kick sand in their face!

Re:Just settle it the old way (4, Funny)

rabblerabble (884373) | about 7 years ago | (#20220793)

The goggles would work then. Your logic is flawed.

Take that, Mr Obviously A. Troll! (5, Funny)

ampathee (682788) | about 7 years ago | (#20220719)

Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!
Hah! I'm going to start work on (let's see..) a random lolcat generator now, just to piss you off.

Re:Take that, Mr Obviously A. Troll! (0)

Anonymous Coward | about 7 years ago | (#20220777)

Those fuckers at lolcats just rip off the work of anonymous and plaster a watermark a la eBaum's on the images they didn't even create. They've been labeled an enemy by anonymous and all those who proudly fly the banner of the one true Longcat.

Re:Take that, Mr Obviously A. Troll! (0)

Anonymous Coward | about 7 years ago | (#20220821)

tacgnol is clearly superior.

Re:Take that, Mr Obviously A. Troll! (0)

Anonymous Coward | about 7 years ago | (#20220903)

What the hell are Lolcats? It's Caturday, fucker. Get it right.

Re:Take that, Mr Obviously A. Troll! (2, Funny)

SoapDish (971052) | about 7 years ago | (#20220923)

Make sure to write it in LOL-CODE! (http://lolcode.com/ [lolcode.com] )

Re:Take that, Mr Obviously A. Troll! (4, Funny)

MarkRose (820682) | about 7 years ago | (#20221371)

You mean something like lolcatgenerator.com [lolcatgenerator.com] ? Looks like someone already tackled that important project! lol

Re:Why? (1)

Short Circuit (52384) | about 7 years ago | (#20220739)

I'm on 14.4Kbps dial-up, you insensitive clod.

And that's no joke. Noisy phone lines suck. It could be worse; I could have been on an OLPC machine in Africa.

Re:Why? (1)

rabblerabble (884373) | about 7 years ago | (#20221003)

At least you'd have a line of young bucks waiting to crank the OLPC for a few minutes... -->interpret that however you'd like, it works.

Re:Why? (1)

RuBLed (995686) | about 7 years ago | (#20220813)

Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!

Ironically, You're already reading slashdot. You had just wasted your time.















You're reading it again, wasting more time ehh??....

But the point is, if programming an offline wikipedia makes you happy and you don't need the money then you would understand....

Re:Why? (5, Insightful)

thePsychologist (1062886) | about 7 years ago | (#20221433)

Realize that some of the greatest things done by humankind were from doing "pointless projects" as you call them. Prime numbers for instance were studied by mathematicians just for fun, and now look, they're used for cryptography. Try doing your banking without them.

Complex numbers originated from something "useless" like trying to solve the quartic polynomial in radicals...try building a bridge without them. In fact all of science is built upon people going in random tangents doing things they enjoy, discovering seemingly "useless facts" but most of it becomes useful *and* gives us an idea of the universe in which we live.

Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.

Re:Wow! (0)

Anonymous Coward | about 7 years ago | (#20222329)

Duh... I guess this dipshit (and everyone else here) does not comprehend the difference between a full text table index on the article bodies versus an actual "index" of topic titles, and the implication on search functionality.

I'm guessing this is the first time he has encountered the length of time it takes to create indices on text fields in a large db. You see, to get fast searches on huge amounts of text one must do a little work in advance.

Also I'd bet that the default offline Wikipedia scripts define the table with indices, then import the data, forcing the index to be updated with every inserted row (which will take a painfully long time). I bet that a simple modification of the scripts to define the tables without indices, import the data, then add the indices would speed it up considerably. (Though index creation on large text databases will always take time no matter how you cut it.)
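The "load first, index afterwards" idea can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in, not MySQL and not the actual MediaWiki import scripts. The table name, column names, and index name are hypothetical, and real timings will depend entirely on the database engine and its configuration.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-load rows into a table with no secondary index, then add the index."""
    conn = sqlite3.connect(":memory:")
    # 1. Create the table WITHOUT the secondary index on title.
    conn.execute(
        "CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    # 2. Insert everything inside a single transaction.
    with conn:
        conn.executemany(
            "INSERT INTO page (title, body) VALUES (?, ?)", rows
        )
    # 3. Build the index once, after all the data is in place.
    conn.execute("CREATE INDEX idx_page_title ON page (title)")
    return conn


# Hypothetical sample data standing in for the real dump.
rows = [(f"Article {i}", "text") for i in range(1000)]
conn = load_then_index(rows)
count = conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]
indexes = [r[1] for r in conn.execute("PRAGMA index_list('page')")]
print(count, indexes)
```

The point is only the ordering of the three steps: the index is built in one pass over the finished table instead of being maintained row by row during the import.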

A searchable zipped file of article titles? Whoopee fuckin doo.

RTFM assholes:
http://dev.mysql.com/doc/refman/4.1/en/indexes.html [mysql.com]

Re:Wow! (1)

ivlianvs (911238) | about 7 years ago | (#20222345)

Thanks! And now, what about a PDA version???

Re:Wow! (1)

CoolVibe (11466) | about 7 years ago | (#20222417)

Be my guest... You do need that 2.9 GB file somewhere on your PDA first. That just *might* be an issue.

Wow (1)

jrwr00 (1035020) | about 7 years ago | (#20220601)

Great Job, that is the power of open source for ya.

Now we need to work on porting that to other OSes and we will be set.

Re:Wow (1)

Dusty101 (765661) | about 7 years ago | (#20222393)

Indeed. Any chance of a port of this for the Nokia Internet Tablets? This'd work nicely on one of those with a big SD card...

Ho-Hum ... (5, Funny)

jabberwock (10206) | about 7 years ago | (#20220715)

What, no auto update? No User Agreement? No disabled features that are enabled by a mammoth key? No product registration?


Let us know when you're ready for prime time ... ;-)

Re:Ho-Hum ... (4, Insightful)

OzRoy (602691) | about 7 years ago | (#20221501)

Auto-update would be interesting. How do you keep the data up to date without downloading the entire 2.9G again? Is there some sort of diff file you can download?

Re:Ho-Hum ... (1, Interesting)

Anonymous Coward | about 7 years ago | (#20222187)

I think Wikipedia does in fact release diffs, but I wouldn't swear it.

Re:Ho-Hum ... (1)

MikkoApo (854304) | about 7 years ago | (#20222377)

This is a nice example of combining different tools to produce a working solution. An auto-update feature would be required though, because the process seems slightly broken and will lose parts of the data.

He splits the bz2 file into 900kb blocks. The original XML's tags might get broken when a start tag ends up in a different block than the end tag. When run, the bash script will ignore all the broken tags.

Fixing that is probably pretty straightforward, but requires a bit more careful XML handling. Anyways, a nice effort and made me want to try the same in a different language :)
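One way the "more careful XML handling" could look is to carry the unfinished tail of each block over to the next one. The sketch below is an assumption-laden illustration, not the article's script: it assumes the decompressed blocks are read in order as text, it only looks for <title> elements, and the function name and sample chunks are made up.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
OPEN_TAG = "<title>"


def titles_from_chunks(chunks):
    """Yield every <title> text, even when a tag straddles two chunks."""
    carry = ""
    for chunk in chunks:
        text = carry + chunk
        last_end = 0
        for match in TITLE_RE.finditer(text):
            yield match.group(1)
            last_end = match.end()
        tail = text[last_end:]
        start = tail.rfind(OPEN_TAG)
        if start != -1:
            # An opened but unfinished title: keep it for the next chunk.
            carry = tail[start:]
        else:
            # Keep just enough characters to complete a split "<title>" tag.
            carry = tail[-(len(OPEN_TAG) - 1):]


# Hypothetical chunks, cut in the middle of a tag and in the middle of a title.
chunks = [
    "<page><title>Coffee</title><text>x</text></page><page><ti",
    "tle>Tea</title></page><page><title>Prime num",
    "ber</title></page>",
]
found = list(titles_from_chunks(chunks))
print(found)
```

Note the trade-off: this needs the blocks to be processed sequentially while indexing, whereas the appeal of the original approach is that each block can be decompressed on its own at lookup time.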

2X (0)

Anonymous Coward | about 7 years ago | (#20220721)

"Up to now, installing a local copy of Wikipedia is not for the faint of heart: it requires a LAMP or WAMP installation"

I'm sorry it requires WHAT in order to display HTML pages?

"Right now (August, 2007) the file is a 2.9GB download, always available from here. I"

  So an unmassaged file would fit onto a standard DVD. And a fully indexed file wouldn't take much more. Plus your equations could be images.

Re:2X (1)

computerman413 (1122419) | about 7 years ago | (#20220909)

Wikipedia is a little more than a bunch of HTML pages. From what little I understand, there's a massive database which composes Wikipedia, and there's a script which puts everything together into the Wikipedia we know and love.

Re:2X (5, Informative)

Brian Gordon (987471) | about 7 years ago | (#20221161)

Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB [wikimedia.org] more. And yes, you do need a LAMP/WAMP and working mediawiki, but it wouldn't take 'days' it would take a few hours max. Also is this guy aware that wikipedia is available on DVD [wikipedia.org] already?

Re:2X (5, Informative)

TubeSteak (669689) | about 7 years ago | (#20221457)

Also is this guy aware that wikipedia is available on DVD already?
Are you aware that the link you pointed to (1) is not the same thing as the link (2) the author pointed to?
(1) http://schools-wikipedia.org/ [schools-wikipedia.org]
(2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]

1 is 4625 articles hand picked for school age children, hence the website name
2 is a straight dump of wikipedia

Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!

Next up: (0)

Anonymous Coward | about 7 years ago | (#20220723)

Building Cory Doctorow a better haircut.

Uh.... (1)

VonSkippy (892467) | about 7 years ago | (#20220769)

Why?

Re:Uh.... (5, Interesting)

dhwebb (526291) | about 7 years ago | (#20220837)

Programming something new to some people is like playing a video game. I love programming useless things just for the challenge. People who don't understand that have never had a true love for programming.

Re:Uh.... (2, Insightful)

Tablizer (95088) | about 7 years ago | (#20220947)

I love programming useless things just for the challenge.

Have you ever worked on a project called "Clippey", by chance?
     

Re:Uh.... (4, Funny)

Gazzonyx (982402) | about 7 years ago | (#20221377)

I love programming useless things just for the challenge.

Have you ever worked on a project called "Clippey", by chance?
No, he said he has a love for programming; not a seething hatred for users. Besides, everyone knows programmers only hate admins. ;) On behalf of the programmers, I'd like to say that this isn't true we love our admins. Who else makes sure that our connections*&#^$: Connection Reset By Peer

Re:Uh.... (1)

LittleBigLui (304739) | about 7 years ago | (#20221999)

No, he said he has a love for programming; not a seething hatred for users.


As if that was possible.

Re:Uh.... (3, Informative)

stephanruby (542433) | about 7 years ago | (#20221265)

Programming something new to some people is like playing a video game.
Speaking of which, http://www.pyweek.org/ [pyweek.org] is coming up this first week of September. It's time to dust off that python book (or borrow one from someone) and do whatever you have to do to get some days off that week.

I know the feeling (5, Insightful)

aepervius (535155) | about 7 years ago | (#20221543)

They say to you that their hobby is painting/music/walking/repairing old cars/gardening/making scale models etc... And they seem to think that their hobbies are perfectly acceptable. But as soon as you say you like to program stuff, they don't understand how this would be a hobby. They mostly fail to recognize that every one of us has something in common: the joy of the act of creation. The fact that our hobby entails creating something immaterial and full of "logic" does not matter. It is still a joy.

Re:I know the feeling (1)

cpu88 (1142407) | about 7 years ago | (#20222079)

"I like painting" "Umm.. crafting" "Music" "Guitar" "Hiking" "PROGRAMMING" "WHAT?" They mostly even don't know what programming is. And after u explained what this hobby is about, they replied with an odd emotion. "Come on, don't stay in your room the whole day"

Re:Uh.... (1)

Jugalator (259273) | about 7 years ago | (#20221779)

First, I don't think this tool was useless -- it was a quick way of achieving his goal of off-line Wikipedia browsing. Second, when I program I prefer to make something useful to me, and I don't think it has to do with a lack of passion for programming. It's just that I'd rather see my time come to good use, even if I enjoy the process by itself.

Re:Uh.... (1)

hobbesmaster (592205) | about 7 years ago | (#20220841)

So you can settle trivial arguments with your friends when away from an internet connection, duh!

(Or to always have something to read on your laptop while traveling - this is what I would use it for)

Re:Uh.... (2, Funny)

Gazzonyx (982402) | about 7 years ago | (#20221401)

So you can settle trivial arguments with your friends when away from an internet connection, duh!

(Or to always have something to read on your laptop while traveling - this is what I would use it for)
I bet you're quite the ladies man, huh?
Sorry, I couldn't resist!

local resource, better interface (1)

Zork the Almighty (599344) | about 7 years ago | (#20221125)

Because it might be useful to have something stored locally. I travel a lot with my laptop and I would like this. I would also appreciate the convenience of not having to fire up a web browser for wikipedia. You can search articles from the command line. You could also potentially write a better search feature, ie: bolt on some code to combine and summarize multiple related articles. The approach the guy used (a bunch of small bz2 files) is interesting and potentially useful. I'd say this was one of the better articles to hit Slashdot lately, and I'm glad I read it.

I hope (4, Funny)

Nikron (888774) | about 7 years ago | (#20220817)

That you don't dump the wiki at a bad time.

George W Bush

Is a dick head!!!!11

Re:I hope (5, Funny)

Anonymous Coward | about 7 years ago | (#20220979)

You mean before someone makes it inaccurate again?

Oh, nevermind, I see the problem:

George W Bush

Is a dick head!!!!11

should be

George W Bush

Is a dick head!!!!!!

Man, those out to mess with the content are getting more and more subtle...

But... (2, Funny)

Anonymous Coward | about 7 years ago | (#20220863)

What's the point of it if there are no vandals or flame wars to make it interesting?

Hitchhiker's guide here we come! (5, Funny)

Brietech (668850) | about 7 years ago | (#20220879)

Combine this and one of the new E-ink ebook readers, make it pretty rugged, slap a solar panel on the back and man. . . you have something really close to a genuine hitchhiker's guide to the galaxy. Ah, I love where technology is heading =)

Re:Hitchhiker's guide here we come! (4, Funny)

Sneftel (15416) | about 7 years ago | (#20220949)

As long as hitchhikers primarily need to know how to evolve a Pikachu into a Raichu, and how Benjamin Disraeli has been referenced in pop culture.

Re:Hitchhiker's guide here we come! (1, Interesting)

Anonymous Coward | about 7 years ago | (#20221033)

how to evolve a Pikachu into a Raichu

Well, if that isn't something a hitchhiker needs to know, it at least sounds like something they would need to know!

Does it also give the probability of the Pikachu turning into a bulldozer?

Re:Hitchhiker's guide here we come! (5, Funny)

RandomWhiteMan (685768) | about 7 years ago | (#20221051)

You laugh now, but just wait until you're stranded in the middle of Blackheath England, needing a ride from a conservative British History Scholar who has his son with him playing Pokemon Gold. Won't be so smug then, will you. I bet you won't even have your towel on you when this all goes down.

Give it a thunderstone, and Family Guy... (1)

Cyno01 (573917) | about 7 years ago | (#20221063)

It's really sad that I know both of those.

Re:Hitchhiker's guide here we come! (0)

Anonymous Coward | about 7 years ago | (#20221505)

You don't even know who I am.

-Benjamin Disraeli

Re:Hitchhiker's guide here we come! (4, Insightful)

cowens (30752) | about 7 years ago | (#20221949)

Ah, but that is what the original HHGTTG was as well. Tons of info on alcohol and Eccentricea Gallumbits (the triple breasted whore of Eroticon Six), but the entry for Earth was: Harmless. Later it was expanded: Mostly Harmless.

Re:Hitchhiker's guide here we come! (5, Funny)

nstlgc (945418) | about 7 years ago | (#20222091)

Just so we're clear, you can make Pikachu evolve into Raichu by using the Thunder Stone (which makes sense, since they're Electric Pokémon). However, due to the emotional value Pikachu has to trainers, most of them choose not to evolve him. Some Pokémon games even plain don't allow this. I hope this was helpful.

Re:Hitchhiker's guide here we come! (1)

dch24 (904899) | about 7 years ago | (#20220983)

Don't forget to put this on the cover, in large reassuring letters:

DON'T PANIC

Re:Hitchhiker's guide here we come! (1)

kars (100858) | about 7 years ago | (#20221139)

Yeah, it brings a whole new meaning to the term information -highway-...

Re:Hitchhiker's guide here we come! (5, Funny)

Gromius (677157) | about 7 years ago | (#20221905)

Yes, it's a perfect fit. Particularly as Wikipedia has now supplanted the Encyclopedia Britannica in many places as the standard repository of all knowledge and wisdom. Although it has many omissions, contains much that is apocryphal, or at least wildly inaccurate, it scores over the older more pedestrian work in two important ways.

1. It is slightly cheaper
2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.

Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate :)

Only 2 days huh (2, Funny)

Anonymous Coward | about 7 years ago | (#20220883)

I was able to build this in two days, most of which were spent searching for the appropriate tools. Simply unbelievable... toying around with these tools and writing less than 200 lines of code, and... presto!
Give that man a job at Google.

Re:Only 2 days huh (1, Funny)

Anonymous Coward | about 7 years ago | (#20222031)

Don't you mean ChaCha [slashdot.org] ?

Days? Please clarify (1)

Tablizer (95088) | about 7 years ago | (#20220917)

compared to loading the 'dump' into MySQL -- which, if you want to enable keyword searching, takes days."

Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySql table data files rather than reload from delimited files etc. (One needs to make sure their version of MySql is compatible with the table file format.)
       

Faster than a speeding slug... (1)

kcbrown (7426) | about 7 years ago | (#20220953)

"Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL -- which, if you want to enable keyword searching, takes days."

But....but....I thought MySQL was fast!

:-)

Re:Faster than a speeding slug... (1)

larry bagina (561269) | about 7 years ago | (#20221425)

remember when slashdot's comment parent id index overflowed (24 bits ought to be enough for anybody!) and slashdot was broken for 36 hours or so while it reindexed?

As an aside, PostgreSQL would be slower to do the initial data load, but the table is accessible all the while. It's even accessible while indexing is taking place.

Just hope you don't get an effed image. (-1, Troll)

Spazntwich (208070) | about 7 years ago | (#20220997)

Given the sheer amount of petty editing wars and defacing that constantly plague Wikipedia, you would likely be better off just reading an Encyclopedia when you want some knowledge and an internet connection isn't available.

Seriously, I know wikipedia is the darling of open source, but the more I learn about it, the more I realize it's pure garbage.

Why? Educate yourself.

http://www.guardian.co.uk/technology/2005/oct/24/comment.newmedia [guardian.co.uk]
http://www.theregister.co.uk/2005/10/24/wikipedia_letters/ [theregister.co.uk]
http://homepage.univie.ac.at/horst.prillinger/blog/archives/2004/06/000623.html [univie.ac.at]
http://www.kapitalism.net/thoughts/wikipedia.htm [kapitalism.net]

And there's more, but you get the idea. Collusion to ruin people's lives when they run afoul of admins, corrupt editors doing and getting favors from the head honcho himself, pet pages that end up with incorrect information, speculation, or specious reasoning, and a general air of arrogance and groupthink reinforcing an internal idea that they can do no wrong.

Why bother, seriously?

Re:Just hope you don't get an effed image. (2, Insightful)

Anonymous Coward | about 7 years ago | (#20221035)

And that doesn't happen offline? Only naive people like you need to be worried about reading Wikipedia.

There are bastards of every academic, social, and financial background.

Re:Just hope you don't get an effed image. (4, Insightful)

Tacvek (948259) | about 7 years ago | (#20221083)

My very serious question to you is how much better do you think things are at a "real" encyclopedia. They have many of the same problems, but they are just not public. "Real" encyclopedias can be just an inaccurate as the Wikipedia on many articles. For a quick first reference, Wikipedia is an ideal tool. Just be sure to take things with a grain of salt if you are not checking the sources for further information. Guess what though, the same applies to "real" encyclopedias too. One difference is that with "real" encyclopedias, you always lack revision information, and you often lack information about the sources used by the editors. (Some encyclopedias are better than others in that respect.)

Re:Just hope you don't get an effed image. (3, Funny)

gad_zuki! (70830) | about 7 years ago | (#20221429)

Yes, the paper encyclopedias are missing all the anime trivia. Christ, it's embarrassing to see "references in pop culture" sections which just spell out every geeky guy stereotype. I don't know why those people don't get banned. Everything in existence has an anime reference. That is unsettling.

Re:Just hope you don't get an effed image. (1)

Jugalator (259273) | about 7 years ago | (#20221793)

Unfortunately for Wikipedia, the quality or lack of it in competing encyclopedias does not resolve the problems in Wikipedia. I hope Wikipedia can work on these issues because I am seeing some of it too. I'm also seeing article rot being quite common, in that old articles deteriorate, and not really from a lack of good will either. Someone discusses the problem in a blog here: http://nonbovine-ruminations.blogspot.com/2007/02/where-are-stable-versions.html [blogspot.com]

Re:Just hope you don't get an effed image. (1)

Bombula (670389) | about 7 years ago | (#20221241)

It might defend on the topic/field in question. The articles you reference seem to be focused on tech stuff. I use wikipedia primarily for socioeconomic reference material, and find it in general to be pretty solid. There are places where the depth is limited, but it's definitely my first-reach resource as long as I have an internet connection - mainly because many of the specific things I'm after might not be in a general encyclopedia like Britannica - intertemporal equilibrium, hedonic regression, Edgeworth's limit theorem, the Bertrand paradox etc, etc.

Re:Just hope you don't get an effed image. (1)

Bombula (670389) | about 7 years ago | (#20221267)

Yikes, defend = depend

Re:Just hope you don't get an effed image. (1)

poopdeville (841677) | about 7 years ago | (#20221317)

Worthington's Law is the only economics anybody needs to know.

Re:Just hope you don't get an effed image. (1)

GPL Apostate (1138631) | about 7 years ago | (#20221731)

I use Wikipedia, as I use the Internet, for geeky computer stuff and electronics tech, tools, and info.

I can't imagine ever taking it that seriously that I would use it for mainstream 'non-nerd' stuff.

Re:Just hope you don't get an effed image. (1)

Short Circuit (52384) | about 7 years ago | (#20221263)

And there's more, but you get the idea. Collusion to ruin people's lives when they run afoul of admins, corrupt editors doing and getting favors from the head honcho himself, pet pages that end up with incorrect information, speculation, or specious reasoning, and a general air of arrogance and groupthink reinforcing an internal idea that they can do no wrong.
You missed a few, such as product placement pages and ancient "This page doesn't conform to {{?}} standards" tags. That, and obscure fields get limited attention.

Why bother, seriously?
Because the breadth of material covered in Wikipedia is unparalleled, as is the timeliness of information in many fields of interest. And it's a hell of a lot more compact than a 100 lb encyclopedia set, and cheaper to boot.

Re:Just hope you don't get an effed image. (0)

Anonymous Coward | about 7 years ago | (#20221311)

"Collusion to ruin people's lives when they run afoul of admins, corrupt editors doing and getting favors from the head honcho himself, pet pages that end up with incorrect information, speculation, or specious reasoning, and a general air of arrogance and groupthink reinforcing an internal idea that they can do no wrong."

Plus, I hear that's all tripled in the last six months!

Re:Just hope you don't get an effed image. (2, Funny)

ZzzzSleep (606571) | about 7 years ago | (#20222049)

Blatantly stolen from David Morgan-Mar [livejournal.com] .

In many of the more relaxed corners of the Outer Eastern Rim of the Internet, Wikipedia has already supplanted the great Encyclopaedia Britannica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects.

First, it is slightly cheaper; and secondly it has the words "anyone can edit" inscribed in large friendly letters on its cover.

Good part of the page: the explanation (4, Insightful)

phliar (87116) | about 7 years ago | (#20221185)

For a change it's not just a link to a .tar.gz somewhere, but an actual article where he goes through what he did, and (more important) why he did things that way. Good reading even if you don't want an off-line Wikipedia.

Re:Good part of the page: the explanation (1)

cpu88 (1142407) | about 7 years ago | (#20221997)

yeah, you are right. Checking programme information of the mini-tv in front of him on the flight.

It doesn't take days (4, Informative)

BReflection (736785) | about 7 years ago | (#20221467)

It only takes days if you use the php import script to import the sql dump, which was not designed for importing the entire dump.

Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then takes a few hours to import into MySQL. Please note that you need a properly configured MySQL server, with at least 8GB of RAM, in order to efficiently run a local copy of Wikipedia.

http://meta.wikimedia.org/wiki/Xml2sql [wikimedia.org]

Re:It doesn't take days (1)

BReflection (736785) | about 7 years ago | (#20221483)

By the way, he could have saved himself a lot of time if he would have just purchased a WikiStick http://www.wikistick.com/ [wikistick.com]

Re:It doesn't take days (1)

jschrod (172610) | about 7 years ago | (#20222467)

OK, I bite.

Your URL leads to a domain parking page. Google search for Wikistick didn't bring results on the first page either. AFAIK, full Wikipedia (text and images) is too large for a USB stick.

What did you want to tell us?

Linda Mack! (1, Funny)

Anonymous Coward | about 7 years ago | (#20221531)

I would be concerned that Slimvirgin and the other intelligence agent(s) might not be able to revert and ban the edits I would be making offline. Maybe Jimbo can give them authority to come rough me up at home and beat my lcd with a hammer.

http://yro.slashdot.org/article.pl?sid=07/07/27/1943254 [slashdot.org]

Mass inserts into mysql... (3, Informative)

Splab (574204) | about 7 years ago | (#20221557)

is very very slow when you do it on a normal installation, the reason is MySQL comes with a "be nice to people who don't know what they are doing" setup. Go into the my.cnf and find the buffer settings, crank them up and restart the server. It can really do a lot (especially if you are running InnoDB which you of course are since MyISAM isn't a proper database).

MyISAM/InnoDB (0)

Anonymous Coward | about 7 years ago | (#20222181)

(especially if you are running InnoDB which you of course are since MyISAM isn't a proper database)



MyISAM is very limited compared to other databases but at least it's a lot faster for some specific (useful) loads. InnoDB is not faster than other databases but is still rather limited. Ergo: use MySQL with MyISAM if your problem is a good fit to its capabilities, use another database (PostgreSQL, Firebird, MSSQL, ...) otherwise.

Xapian (1)

paltemalte (767772) | about 7 years ago | (#20221591)

For those who didn't know, Xapian [xapian.org], the search engine he used for this, is really awesome. It's very fast, stable, actively developed and packs some pretty impressive features. It's written in C++, but has bindings for Perl, Python, PHP, Java, Tcl, C#, and Ruby. If you need an embedded search function on a site, you should check it out.

I've used it for over 2 years on various sites and am really pleased with it.

What?? (5, Funny)

icydog (923695) | about 7 years ago | (#20221617)

TFA is:

1. Not a thinly-veiled attempt to advertise a crappy product
2. Not bashing Microsoft
3. Not about somebody who is trolling open-source (i.e. SCO)
4. Not about Bush taking away all our rights and ending freedom
5. Not about voting fraud and the end of democracy/America/the world
6. Not decrying Vista DRM and its ties to the MAFIAA
7. Posted on Slashdot

Furthermore, TFA is interesting and informative.

Am I in heaven?

Re:What?? (1)

mosiadh (1045736) | about 7 years ago | (#20221951)

Am I in heaven?

No, that's upstairs. Invitation only.

C&D Tomorrow? (1)

fishbowl (7759) | about 7 years ago | (#20221839)

Can't help but assume there will be a cease and desist order in the /. headlines tomorrow.

The Point? (1)

photomonkey (987563) | about 7 years ago | (#20221865)

I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?

The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.

If it's an academic project, that's really cool, but I don't see a practical point to it.

Re:The Point? (4, Insightful)

Mr. Roadkill (731328) | about 7 years ago | (#20222067)

I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
Ummm... I think the whole point is, as you've pointed out, that not everyone has a permanent connection to the net everywhere they go. Or maybe they don't have access to everything they'd like even if they *do* have net access everywhere, or want to pay extravagant data rates while out and about.

Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.

Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.

Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_fishermans_wife_hokusai.jpg [wikipedia.org] they've been forced to add Wikipedia to the school's blocklist. Which is a pity, because it's a great first-approximation source for material or research directions, but there you go. Mary can make a local copy through her home broadband connection, and can access it locally on her laptop wherever she goes - even at school, or church. Bill, Jillian and Mungo (the pastor's son) find out about this, and now all four of them take it in turns to make the copy each month, sharing the bandwidth costs. Their friends Harry and Sally, who don't have broadband but are great friends of the other four, also get copies... and there are plans to distribute the copies further, as a kind of teenage grass-roots knowledge-sharing and social-justice effort.

Still can't see the point?

Re:The Point? (1)

Riktov (632) | about 7 years ago | (#20222147)

>>
The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
>>

Sure, what's the point of reading an old version of the history of the Battle of Hastings, or the technical specifications of the P-51 Mustang, or the characteristics of a dominant seventh chord? After six months, it's completely obsolete and worthless, right?

WP:1.0 wants you (1)

Titoxd (1116095) | about 7 years ago | (#20221927)

Dude, WP:1.0 [wikipedia.org] wants YOU.

Why didn't he post his howto on wikipedia? (1)

nullchar (446050) | about 7 years ago | (#20222037)

Wikipedia seems the best place for the author's "how to download and use offline".

What about moulin? (1)

maubp (303462) | about 7 years ago | (#20222115)

How is this different to moulin which is a fully interactive, offline version of the entire Wikipedia (without pictures) on a CD-ROM:

http://moulinwiki.org/l/en/ [moulinwiki.org]

Re:What about moulin? (1)

Dillon2112 (197474) | about 7 years ago | (#20222193)

Well, for starters, it's in English.

Re:What about moulin? (1)

maubp (303462) | about 7 years ago | (#20222229)

Now I've read both articles:

This guy's work required about 3GB for the compressed Wikipedia data dump (split up into compressed chunks using bzip2recover), plus Python, Perl, a search library (Xapian) and a web framework (Django). He seems to be working in English only, and doesn't seem to say why he built it or who it might be useful to.
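
To make the chunked setup concrete, here is a minimal Python sketch of reading such pieces back. It is not the article's code: the function name iter_chunk_text, the directory argument and the "rec*.bz2" filename pattern are assumptions chosen for illustration, and it only relies on each piece being a valid standalone bzip2 file.

```python
import bz2
from pathlib import Path


def iter_chunk_text(chunk_dir, pattern="rec*.bz2"):
    """Yield (filename, decompressed text) for each chunk, in filename order.

    chunk_dir and pattern are hypothetical; adjust them to wherever the
    split pieces of the dump actually live.
    """
    for path in sorted(Path(chunk_dir).glob(pattern)):
        raw = bz2.decompress(path.read_bytes())
        # A chunk can begin or end in the middle of a multi-byte UTF-8
        # character, so undecodable bytes are replaced instead of raising.
        yield path.name, raw.decode("utf-8", errors="replace")
```

Only the chunk that holds the wanted article has to be decompressed at read time, which is why the disk footprint stays close to the size of the original .bz2 file plus the index.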

Moulin has a concrete aim in mind: they are starting with the much smaller French version of Wikipedia, and have built a CD-ROM sized offline viewer for release in West Africa. They've also been working on other languages, including right-to-left support for Farsi and Arabic. It sounds like they plan to have the English language version of Wikipedia as an offline DVD, but the technical details seem a little thin on the ground on their webpage (but there is source code).
http://moulinwiki.org/l/en/ [moulinwiki.org]

Re:What about moulin? (1)

LordSnooty (853791) | about 7 years ago | (#20222473)

Is Moulin anything to do with Kiwix [kiwix.org] ? Cos these guys were also building an offline WP viewer, though only featuring 2000 important articles. Dev seems to have stopped now, a pity as it was a nice package with an excellent page viewer. Ideal for slapping on a laptop and providing something to read when you're away from net.

...or the HTML export feature? (1)

georgewilliamherbert (211790) | about 7 years ago | (#20222167)

There's a one-button (for admins) export-the-whole-wiki-as-html feature in modern MediaWiki software installs...

But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.

can we get a PSP version of it? (3, Interesting)

mu22le (766735) | about 7 years ago | (#20222305)

A PSP is very portable (fits in your sweater/backpack), hackable, and has up to 8 GB of storage. I have been dreaming for a year about porting Wikipedia to it. Unfortunately I'm not familiar with the kind of programming needed and I could never find the time...

There's a bug in TFA: Missing articles. (4, Insightful)

dannycim (442761) | about 7 years ago | (#20222493)

There's a serious problem with the article's way of treating the data that I didn't see addressed.

The Wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural block size), which he then parses for the "title" and "text" XML tags.

The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. E.g.:

END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti

START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...

So searching for "[title]" in both blocks separately like TFA does will fail for one article.

(I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
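
One possible way around the boundary problem is to carry the unfinished tail of each block over into the next one before searching. The Python sketch below illustrates that idea only; it is not the code from TFA. The name iter_titles is made up for the example, the tags are written with real angle brackets as in the dump, and the blocks are assumed to be already decompressed and decoded to text.

```python
import re

OPEN_TAG = "<title>"
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def iter_titles(chunks):
    """Yield every title, including ones split across chunk boundaries."""
    carry = ""  # unfinished tail of the previous chunk
    for chunk in chunks:
        buf = carry + chunk
        end = 0
        for match in TITLE_RE.finditer(buf):
            yield match.group(1)
            end = match.end()
        rest = buf[end:]
        start = rest.rfind(OPEN_TAG)
        if start != -1:
            # An opening tag without its closing tag yet: keep it all.
            carry = rest[start:]
            continue
        # Otherwise keep only a partial opening tag at the very end,
        # such as "<ti", so it can be completed by the next chunk.
        carry = ""
        for k in range(len(OPEN_TAG) - 1, 0, -1):
            if rest.endswith(OPEN_TAG[:k]):
                carry = rest[-k:]
                break
```

Fed the two blocks from the example above (with angle brackets), the title that straddles the boundary is found once the second block arrives. The same carry-over would be needed for the "text" tag, and a real version would also have to cope with a multi-byte character cut in half at a block edge.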