
Developing a Niche Online-Content Indexing System?

timothy posted more than 4 years ago | from the we-had-to-index-individual-clay-tablets dept.

Databases 134

tebee writes "One of my hobbies has benefited for 20 years or so from the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index has recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it, as the original data is in an unknown format. So we are talking about putting together a team to build an open-source replacement for this – probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. Tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do — maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article; we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google Books agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"


134 comments


Sphinx or Lucene (3, Informative)

Anonymous Coward | more than 4 years ago | (#32938980)

Or did I misunderstand the question?

Re:Sphinx or Lucene (0)

Anonymous Coward | more than 4 years ago | (#32939148)

Yes, you misunderstood it. In fact, it's pretty clear that you didn't even bother to read anything he wrote. He clearly doesn't want to index a bunch of text for searching purposes. He basically just wants to build a directory web site, much like Yahoo! or the Open Directory Project, but targeted to his subject's niche.

Re:Sphinx or Lucene (4, Informative)

tebee (1280900) | more than 4 years ago | (#32939218)

Yes, you did misunderstand.

We do not have the full text of the articles online; all we have is the title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of an article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries, often spanning five or six decades of monthly magazines.

All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.

It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead-tree) publisher in this field has also discouraged development in this direction.

Re:Sphinx or Lucene (3, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939366)

Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.

If this isn't what you have in mind, please elaborate.
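For what it's worth, a minimal PHP sketch of that workflow, assuming a hypothetical layout of one small text file per article (line 1 = title, line 2 = author, line 3 = keywords); the directory path and field order are made up:

<?php
// Walk a directory of tiny per-article text files and collect rows
// ready to hand to whatever indexer or database you settle on.
$rows = array();
foreach (glob('/data/articles/*.txt') as $file) {
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $rows[] = array(
        'title'    => isset($lines[0]) ? $lines[0] : '',
        'author'   => isset($lines[1]) ? $lines[1] : '',
        'keywords' => isset($lines[2]) ? $lines[2] : '',
    );
}
print count($rows) . " records collected\n";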

Re:Sphinx or Lucene (3, Interesting)

Trepidity (597) | more than 4 years ago | (#32939594)

If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").

A bibliography indexer would probably be a better choice. Two good free ones are Refbase [refbase.net] and Aigaion [aigaion.nl]. Both are targeted mainly at databases of scientific literature, so they might need some tweaking for this purpose, though.

Re:Sphinx or Lucene (2, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939628)

Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice: some journals have arcane numbering systems, so those tools should be able to cope with a magazine collection.

Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.

Re:Sphinx or Lucene (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941224)

It's really not a text indexing problem, unless you are going to throw out the RDBMS and use a flat text file.

If you use a relational database, then it is at most a three-table problem: articles, sources, and articles-to-sources. If you can join those, you have the core of a classic content management system.

From what I gather, they haven't even gotten that far. It is just a master index of articles that are available (which point to nothing in particular), so it is a 1-table problem.

For 1-table problems I generally use Excel.

Re:Sphinx or Lucene (1)

Jherico (39763) | more than 4 years ago | (#32941146)

Solr [apache.org] in front of Lucene is a perfectly reasonable way to index highly structured information and allows structured queries.

Re:Sphinx or Lucene (1)

hsmyers (142611) | more than 4 years ago | (#32939398)

Not being up to speed on current open source that might prevent premature wheel re-invention, my answer would be 'No'. That said, I don't see any particular trouble with the project itself. If I understand correctly, you have bare-bones bibliographic information that you want to create an online index of. The notion of PHP and MySQL seems sound, although I suspect that Perl would work as well if not better, depending on the knowledge of your volunteer talent. I expose my bias here when I point out that text analysis is a particular strength of Perl. I'm currently involved with a project that does an enormous amount of semantic analysis, which might be used to create keywords on the fly, for instance. Now that I think of it, there is no particular reason the work couldn't be multi-lingual for that matter, leveraging the programmer base at your disposal. Continuing to think about it, I'd love to help--- reply to me at gmail.com if you are interested...

Re:Sphinx or Lucene (1)

hsmyers (142611) | more than 4 years ago | (#32939522)

Just noticed a thread on Hacker News about http://www.gotapi.com/html which might be of interest...

Re:Sphinx or Lucene (1)

symes (835608) | more than 4 years ago | (#32939434)

They have no plans to replace it as the original data is in an unknown format.

Well there aren't that many obvious candidates... any of these [d2ca.org] look familiar?

Re:Sphinx or Lucene (3, Informative)

OrangeCatholic (1495411) | more than 4 years ago | (#32939464)

So let me get this straight: this is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delimited or whatever).

Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.

These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.

For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.

The edit, delete, and retrieve scripts are similarly simple. All you need is a Linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
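A rough sketch of what an mr_add.php along those lines might look like with PDO, simplified to one self-posting script and an auto-increment id instead of the separate row_id fetch and mr_add2.php step; database credentials and column names are placeholders:

<?php
// mr_add.php - show a form, then insert the submitted article row.
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // Prepared statement: the values never touch the SQL string directly.
    $stmt = $pdo->prepare(
        'INSERT INTO mr_article_list (title, author, keywords) VALUES (?, ?, ?)'
    );
    $ok = $stmt->execute(array($_POST['title'], $_POST['author'], $_POST['keywords']));
    echo $ok ? 'Article added.' : 'Insert failed.';
} else {
    echo '<form method="post">
        Title: <input name="title"><br>
        Author: <input name="author"><br>
        Keywords: <input name="keywords"><br>
        <input type="submit" value="Add article">
    </form>';
}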

Now this is where it gets interesting:

>many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines

Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the really useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).

Then you need a third table, mr_article_sources, which is real simple: it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).

Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:

01 19 05
02 19 07
03 19 09

When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).

Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
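To make that concrete, a hedged sketch of the three tables and the cross-index join in PHP/MySQL, reusing the hypothetical names above (exact column types are guesses):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// Schema sketch: articles, sources, and the long narrow cross-index table.
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_article_list (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255), author VARCHAR(255), keywords VARCHAR(255))');
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_sources (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255), address VARCHAR(255), phone VARCHAR(64))');
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_article_sources (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    article_row_id INT, source_row_id INT)');

// "Who has article #19?" - one query joining all three tables.
$stmt = $pdo->prepare('SELECT s.name, s.phone
    FROM mr_article_list a
    JOIN mr_article_sources x ON x.article_row_id = a.row_id
    JOIN mr_sources s ON s.row_id = x.source_row_id
    WHERE a.row_id = ?');
$stmt->execute(array(19));
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['name'] . ' - ' . $row['phone'] . "\n";
}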

No, no, NO! (3, Insightful)

RichiH (749257) | more than 4 years ago | (#32941368)

Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even first normal form.

Read http://en.wikipedia.org/wiki/Database_normalization [wikipedia.org]

You want to define a table "article_tags" or something with id, article_id, name, comment. Make the combination of article_id and name unique.

* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL

Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many articles are tagged electric? SELECT count(1) FROM article_tags WHERE name = "electric";
Give me a list of all articles relating to foo or bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar";
etc pp.
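One possible DDL for that tag table, assuming the articles live in a hypothetical articles table; types and sizes are guesses:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// One row per (article, tag) pair instead of a comma-separated VARCHAR.
$pdo->exec('CREATE TABLE IF NOT EXISTS article_tags (
    id INT AUTO_INCREMENT PRIMARY KEY,
    article_id INT NOT NULL,
    name VARCHAR(100) NOT NULL,
    comment VARCHAR(255) NULL,
    UNIQUE KEY uniq_article_tag (article_id, name),
    FOREIGN KEY (article_id) REFERENCES articles(id)
) ENGINE=InnoDB');

// "How many articles are tagged electric?"
$count = $pdo->query(
    "SELECT COUNT(*) FROM article_tags WHERE name = 'electric'"
)->fetchColumn();
echo "$count articles tagged electric\n";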

If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:

http://en.wikipedia.org/wiki/Transitive_closure [wikipedia.org]

Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and pretty much everything else.

One last word of advice: no matter what anyone tells you, either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits, and SQL injection becomes impossible, for free! Repeat: even if you ignore all the other tips above, you _MUST_ heed this.

http://en.wikipedia.org/wiki/SQL_injection [wikipedia.org]
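For illustration, a minimal PDO sketch of the prepared-statement advice; the table and column names are placeholders:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// Don't: user input concatenated straight into the SQL string is injectable.
// $rows = $pdo->query("SELECT * FROM articles WHERE author = '" . $_GET['author'] . "'");

// Do: a prepared statement; the value is sent separately from the SQL text.
$stmt = $pdo->prepare('SELECT title, author FROM articles WHERE author = :author');
$stmt->execute(array(':author' => $_GET['author']));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);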

Richard

PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.

Re:No, no, NO! (1)

RichiH (749257) | more than 4 years ago | (#32941372)

The second statement should have asked for articles tagged with both foo and bar. A plain WHERE name = "foo" AND name = "bar" can never match a single row, so a grouped query is needed:

SELECT article_id FROM article_tags WHERE name IN ("foo", "bar") GROUP BY article_id HAVING COUNT(DISTINCT name) = 2;

Re:Sphinx or Lucene (1)

tolan-b (230077) | more than 4 years ago | (#32939544)

I think you should still have a look at Sphinx and Lucene. You can put whatever data you want into them, in whatever schema you want (at least with Lucene; I believe with Sphinx too). You can then easily create a UI as a front end and let the indexing engine do the hard work of slicing and dicing by your criteria. I believe the Zend Framework library has a Lucene API.

Also if you do manage to go fulltext later then it'll mean less work.

Re:Sphinx or Lucene (1)

banjo D (212277) | more than 4 years ago | (#32940168)

I don't know about Sphinx but I agree that Lucene could be a good solution, for the reasons tolan-b lists. I work on a digital library cataloging project that indexes its metadata with Lucene. We use PHP to generate the user-facing website, which queries our Lucene index via a Solr server. We do have a highly structured metadata schema and we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966" (which somebody in another comment suggested is not easy to do with Lucene, but in our experience was very easy). And adding a Solr server on top makes it easy to include features like faceted search.
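As a rough illustration of that kind of query from PHP against Solr's standard select handler (the core URL, field names and date format are assumptions about a hypothetical schema):

<?php
// "All articles with 'foo' in the title, published between 1950 and 1966."
$params = array(
    'q'    => 'title:foo AND pub_date:[1950-01-01T00:00:00Z TO 1966-12-31T23:59:59Z]',
    'wt'   => 'json',
    'rows' => 50,
);
$url = 'http://localhost:8983/solr/select?' . http_build_query($params);
$response = json_decode(file_get_contents($url), true);
foreach ($response['response']['docs'] as $doc) {
    echo $doc['title'] . ' (' . $doc['magazine'] . ")\n";
}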

Re:Sphinx or Lucene (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941262)

>we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966"

SELECT * FROM banjo_articles WHERE title LIKE "%foo%" AND date BETWEEN "1950-01-01" AND "1966-12-31"

You're bragging that your "system" has a single line of code?

I've seen selects ten or twenty lines long, with multiple joins, and joins and selects within joins. Granted it's not fast, but it works, and it takes all of an hour (or less) to write such a query.

Re:Sphinx or Lucene (2, Interesting)

rs79 (71822) | more than 4 years ago | (#32940110)

I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.

Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.

Re:Sphinx or Lucene (1)

sporkboy (22212) | more than 4 years ago | (#32940700)

The standard /. IANAL applies here, but I'm pretty sure that if you have legal access to the copyrighted text (ie you or someone you know owns a copy of the magazine) then it is ok to create a derivative work for the purposes of searching that work. This is the loophole that Google (name your favorite search engine here) uses, and they go so far as to offer cached versions of some sites.

Lucene, or a more friendly wrapper around it like SOLR, has the option of creating a search index based on an original text from which the original content cannot be extracted (indexed=true, stored=false on a field), so that would seem to cover the case of finding an article without violating the rights of the author or the publisher.

As for not having the text online, I'd suggest scraping the archive sites in the process of building your search index; it's pretty hard to search something that isn't digitized.

Best of luck, as this sounds like a worthwhile project. I do think that the volume of data you're discussing would fit easily in a SOLR instance that would consume very modest amounts of server resources to operate.

It would help (3, Insightful)

Xamusk (702162) | more than 4 years ago | (#32938982)

if you said what hobby and index that is. Doing so would surely catch more interest from the Slashdot crowd.

Re:It would help (3, Funny)

beakerMeep (716990) | more than 4 years ago | (#32939146)

Maybe it's the type of magazines that people used to read "for the articles?"

Re:It would help (2, Funny)

bsDaemon (87307) | more than 4 years ago | (#32939230)

I'm pretty sure porn indexing isn't niche... or a hobby. It's the true reason Google exists.

Re:It would help (1)

BitterOak (537666) | more than 4 years ago | (#32940492)

if you said what hobby and index that is. Doing so would surely catch more interest from the Slashdot crowd.

Maybe it's the type of magazines that people used to read "for the articles?"

And that's precisely the type of magazine that would catch the interest of the Slashdot crowd.

Re:It would help (4, Informative)

tebee (1280900) | more than 4 years ago | (#32939242)

OK, the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe [mrmag.com] but it was removed, without warning, last week, so there is not a lot to see.

Re:It would help (0)

Anonymous Coward | more than 4 years ago | (#32939304)

Who the fuck uses the suffix ".exe" for a Web page? These guys have bats in the belfry in general, it seems.

Re:It would help (1)

ZERO1ZERO (948669) | more than 4 years ago | (#32939342)

It's not a web page it's a DOS program, hence the ask slashdot....

Re:It would help (1)

tebee (1280900) | more than 4 years ago | (#32939444)

It's a DOS program that runs on the server, rather like a CGI script. Its output is a web page.

It is a bit of a throwback to the dawn of the web, when people thought up innovative ways to do things.

Re:It would help (1)

Bungie (192858) | more than 4 years ago | (#32941090)

CGI works by having the server execute the program (passing the request data to it via environment variables and STDIN, or the command line) and then retrieving the page's complete HTML code from STDOUT. You can use any file that can be executed and can use STDIN/STDOUT in this manner, located in a specified directory (like cgi-bin). On Windows this would be any exe, com, pif, bat or cmd file, and the extension must be there for the operating system to determine that it is an executable. On Linux you can use any file that has +x permissions, compiled binaries or scripts with a shebang at the beginning, so it can have any extension you want (or none) for a CGI.

People used to write a lot of CGI applications in Perl because of its text processing capabilities, but there were many CGIs that were compiled programs (written in languages like C). At one point Microsoft was really pushing the idea of easily writing CGI applications in Visual Basic [microsoft.com] and hosting them with IIS.

CGI fell out of popularity in favor of embedded scripting like PHP and ASP, which have much less overhead (they don't have to create a new process to service every user request and wait for its output) and are much less complex for people to use (they don't require special directories or permissions).
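To make the mechanism concrete, a tiny sketch of a CGI program, written here in PHP and run via the php CLI binary; any language that can read environment variables and write to STDOUT works the same way:

#!/usr/bin/php
<?php
// A CGI program receives the request through environment variables (and STDIN
// for POST bodies) and must write the full response, headers first, to STDOUT.
$query = getenv('QUERY_STRING');
parse_str($query ? $query : '', $params);
$term = isset($params['q']) ? htmlspecialchars($params['q']) : '(none)';

echo "Content-Type: text/html\r\n\r\n";
echo "<html><body>You searched for: $term</body></html>";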

Wayback (3, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939474)

You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe [archive.org] , then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.

Re:Wayback (1)

Cylix (55374) | more than 4 years ago | (#32939620)

Definitely an easy re-write.

Just going to be painful to re-enter all that data if they can't use the original binary blob.

A long time ago I had a programming assignment dealing with binary blobs: basically, unknown data structures within a binary. Provided they used no encryption, it should be relatively painless to extract the data. It was trivial then, and now I'm way better.

Re:Wayback (1)

Hognoxious (631665) | more than 4 years ago | (#32940692)

How will that help? As far as I understand, the pages are created on the fly, so without the "engine" behind them you won't get anything.

Re:It would help (0)

Anonymous Coward | more than 4 years ago | (#32939530)

Wrong crowd. He's hoping for people interested in online-content indexing systems, not a thread full of duffers commenting on the niche.

pubmed (0)

Anonymous Coward | more than 4 years ago | (#32939002)

ask the pubmed people at NIH: http://www.ncbi.nlm.nih.gov/pubmed

Developing a Niche Online-Content Indexing System? (3, Insightful)

omar.sahal (687649) | more than 4 years ago | (#32939010)

I don't know if this would be helpful, but the people at Wikipedia must know a fair amount about running crowd-sourced sites. Even if you can't talk with the higher-ups, there would be contributors who would know about best practices. Also, when you deal with people, they will be a lot more helpful if they benefit from helping you.

Re:Developing a Niche Online-Content Indexing Syst (1)

Tablizer (95088) | more than 4 years ago | (#32939818)

Wikipedia's search stinks, in my opinion. It's gotten better of late, but it's still not the gold standard by any stretch.

Just migrate it to VMware or KVM (3, Informative)

RobiOne (226066) | more than 4 years ago | (#32939012)

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939100)

without more info from OP, this is the end of the discussion.

Re:Just migrate it to VMware or KVM (1)

pspahn (1175617) | more than 4 years ago | (#32939108)

This could work and allows you enough time to not come up with something lame.

Re:Just migrate it to VMware or KVM (2)

OzPeter (195038) | more than 4 years ago | (#32939170)

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

That assumes that the original data is available to the OP. It may be that it is not.

Re:Just migrate it to VMware or KVM (1)

Threni (635302) | more than 4 years ago | (#32939340)

> That assumes that the original data is available to the OP. It may be that it is not.

If only the article in some way made this clear.

"we are in negotiations to try and get the original data."

Oh, it does.

Re:Just migrate it to VMware or KVM (1)

OzPeter (195038) | more than 4 years ago | (#32939516)

"we are in negotiations to try and get the original data."

In other words the OP does not have the data. And from the OP's reply below it may be that they never get it.

for my bunghole (0)

Anonymous Coward | more than 4 years ago | (#32940686)

Well until he does get it, any consideration of how to process it is somewhat moot.

Which makes me wonder why he bothered asking, the fucking twerp.

Re:for my bunghole (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941324)

>Well until he does get it, any consideration of how to process it is somewhat moot.

Not quite. He was clear enough to construct a data model. This customer knows what he wants. Problem is, it will take his own efforts to fill in the gaps (in terms of getting access).

"Hi, I want you to install a refrigerator in my apartment. It needs to fit in a hole 30 inches wide by 30 inches deep."

"Will you take a refrigerator 28 inches wide by 26 inches deep?"

"Sure but....lemme talk to my landlord first."

If Zeus descended from the sky and said, "I'll do whatever it takes to get this index online..."

Would Zeus succeed, or would the customer say to him, "I'm not ready?"

P.S. I'm NOT for hire on this job. I am not.even.a.programmer.anymore.

I will, however, take queries as far as I check my email (which is unreliable) and as far as I check this page (until tomorrow at the least).

You asked, you got your answers. 88 comments, perhaps 10 of them were useful. Anyone who says to "use X" is dumb. By the time you figure out how to use it, you could have written your own.

This is 1-3 tables, which for a real-world analogy is like 1-3 sheets of paper. Customer says what? Landlord? Landlord rules 1-3 sheets of paper. Good luck with that access.

Re:Just migrate it to VMware or KVM (1)

tebee (1280900) | more than 4 years ago | (#32939390)

As of now it is not available.

We are putting pressure on the current owners to make it available, as they have suffered a certain amount of bad publicity over this, but so far to no avail. They did purchase the program for real money 10 years ago, but the fact that they are unable to run it should indicate to them it has little or no value now.

My thoughts have been along the lines of running it on some old PC hanging off an ADSL line with dynamic DNS, but virtualization may be a better idea. Does anyone offer virtual private servers that run DOS?

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939460)

It seems to me that you don't need a VPS that runs DOS. Just signup for a Linux-based VPS and run DOSbox on top of it. The performance hit will be minimal.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939492)

I believe Network Solutions offers a virtual server with domain/DNS that runs Linux, but I am sure you could get one with DOS.

Re:Just migrate it to VMware or KVM (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32939596)

Well the program and the data are two different things. At least to me they are.

All you need to do is run the program once, get a dump of the entire article list, and import it into your new MySQL table.

And running the program requires, what, DOS? Come on. Forget the web, that's out of the picture now with regards to the old, expired system. You just need ONE copy of the data and you can re-build the web interface yourself with php.

It sounds to me like the data is proprietary and they are being stingy with it. But what other use they have for it, I don't know. You could have all the private libraries index their own collections, and collate the results, but something tells me that would require an extensive level of participation.
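Assuming the old program can be coaxed into producing one big delimited dump, loading it into a new MySQL table is a few lines of PHP; the file name, table and columns below are hypothetical:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare(
    'INSERT INTO mr_article_list (magazine, issue, title, author, keywords)
     VALUES (?, ?, ?, ?, ?)'
);

$fh = fopen('dump.csv', 'r');        // one-time export from the old system
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute($row);            // assumes the CSV column order matches the INSERT
}
fclose($fh);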

Re:Just migrate it to VMware or KVM (2, Interesting)

b4dc0d3r (1268512) | more than 4 years ago | (#32940224)

If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.

Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.

Give me just the DOS program at least, and I'll get you the format.

Re:Just migrate it to VMware or KVM (1)

commanderfoxtrot (115784) | more than 4 years ago | (#32941536)

Mod parent up!

Nothing necessarily wrong with the DOS program anyway - if it works, why break it?

You should be able to run it pretty easily with either a virtual machine or an emulator; you can then look at extracting the data from it and migrating it to a flashier site. Sticking with the DOS program sounds like the simpler solution for now.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32941356)

The problem isn't how to handle the data from the legacy solution; he wants to know what would be a good modern solution for indexing and searching the article information with their specific constraints. The question wasn't related to using the legacy platform at all; he might not even have access to the old system. Even if he did, he wants to migrate to something newer, but first he needs to know what will do the job (thus the Ask Slashdot)...

put the data online if you can (1, Informative)

Anonymous Coward | more than 4 years ago | (#32939040)

There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.

Yeah, having some kind of online query interface can be useful, and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download; no SQL or PHP needed. The SQL and PHP can be done later.

Re:put the data online if you can (1)

martin-boundary (547041) | more than 4 years ago | (#32939408)

Very true. In fact, making the data available for download also solves the problem of bandwidth bills. After the initial bunch of people have downloaded their own copy, they can serve it from other websites, thus sharing the load.

Data hoarders (1)

tepples (727027) | more than 4 years ago | (#32939758)

It also introduces the problem of people who download the whole data set just to collect it, with no intention of accessing the vast majority of it or serving it up to someone else.

hoarding == massive replication (2, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939802)

Short term, it's true that can eat some bandwidth, but long term that's the solution to the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.

Re:Data hoarders (1)

quickOnTheUptake (1450889) | more than 4 years ago | (#32940472)

Just stick it on bittorrent, if there is a big demand.
Realistically, though, I doubt the database is very large (moreover, I doubt there are all that many people who would want this data). I mean, if you are indexing 50 magazines, over 100 years, with an average of 10 articles in each one, that's 50k articles. Let's say each article has 200B of data, thats, what? ~2 meg uncompressed?

The binary file shouldn't be hard to read (2, Informative)

bartonski (1858506) | more than 4 years ago | (#32939102)

I would run the Unix commands 'file' (you might get lucky and get a file type it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack'/'unpack' functions found in Perl or Ruby, to extract the data, which you can then load into a database.
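If the file does turn out to be fixed-length records, PHP's unpack() (a cousin of Perl's pack) can pull the fields apart; the record length and field widths below are pure guesses, just to show the shape of the job:

<?php
// Hypothetical layout: 60-byte title, 30-byte author, 4-byte little-endian year.
$recordLength = 94;
$fh = fopen('index.dat', 'rb');
while (($record = fread($fh, $recordLength)) !== false
        && strlen($record) === $recordLength) {
    $fields = unpack('a60title/a30author/Vyear', $record);
    printf("%s | %s | %d\n",
        rtrim($fields['title']), rtrim($fields['author']), $fields['year']);
}
fclose($fh);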

Try Ruby on Rails (4, Funny)

olyar (591892) | more than 4 years ago | (#32939110)

I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.

The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.

(Please Note: This post is a joke, and not an attempt to start a flame war).

Re:Try Ruby on Rails (4, Funny)

greg1104 (461138) | more than 4 years ago | (#32940418)

It's data for a model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.

Silly question (0)

Anonymous Coward | more than 4 years ago | (#32939120)

This may seem like a silly question, but if the data is in an unknown format and it's handled by an existing DOS program, why not just keep using that old DOS program? It still works, probably has low resource requirements, and is dealing with (I gather) mostly fixed data. Maybe bring in some people to try to reverse engineer the data format? But, really, just move the DOS program and data to a different server and save yourself months of effort.

Re:Silly question (1)

Cylix (55374) | more than 4 years ago | (#32939646)

Bad idea.

It's a bad idea for the same reason they don't want to host a DOS executable anymore.

Even if for some strange reason the text could not be retrieved from the binary blob (which is not likely), the application still works today.

A single command-line wildcard search would re-dump the text, which could be parsed and stored in a simple database.

Re:Silly question (0)

Anonymous Coward | more than 4 years ago | (#32940644)

Who says it supports wildcards?

That is a data convertion project (1)

mrmeval (662166) | more than 4 years ago | (#32939172)

You could write a custom program that scrapes the data from a website you set up to allow that program to run standalone, or you could figure out what the data format is and write a program to convert it.

If you want to recreate the data from scratch, then you'd need to set up a website your group would access to enter data. That would be crowd-sourcing, but you'd probably want something specific to your needs, using easily maintainable code.

As others have stated, you could use virtualization. Inside the virtual machine you may even be able to run a LAMP stack and run the DOS program with DOSBox running as an unprivileged user. http://www.dosbox.com/ [dosbox.com] http://www.virtualbox.org/ [virtualbox.org] http://www.vmware.com/ [vmware.com] .

I would only consider the virtual solution a stop gap until you could get the database translated to something maintainable or recreate the data.

Security anyone? (0)

Anonymous Coward | more than 4 years ago | (#32939224)

Running 20-year-old code is a security disaster - now you want to replace it with what... PHP? Few people would call that an improvement.

Screen Scrape the Site (1)

mbone (558574) | more than 4 years ago | (#32939258)

See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.

(It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)
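A sketch of the scraping loop in PHP, with a polite delay; the query parameter name is a pure guess, and (as noted elsewhere in the thread) the real site bans rapid sequential requests, so this would only be workable with the owner's blessing:

<?php
// Walk the alphabet, fetch each result page, and stash the raw HTML for later parsing.
foreach (range('a', 'z') as $letter) {
    $url = 'http://index.mrmag.com/tm.exe?search=' . urlencode($letter); // hypothetical parameter
    $html = @file_get_contents($url);
    if ($html !== false) {
        file_put_contents("scrape-$letter.html", $html);
    }
    sleep(10);  // be gentle: the original host banned aggressive scrapers
}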

Re:Screen Scrape the Site (1)

tebee (1280900) | more than 4 years ago | (#32939588)

If you could scrape the site, I would have done it years ago. Unfortunately the programmer built anti-scraping technology into the program to "protect his data". If you issue too many sequential requests, it locks your IP out - permanently! I discovered this about 8 years ago when I was doing some manual scraping and it did it to me.

If you look at the site ( http://index.mrmag.com/ [mrmag.com] ) on the Wayback Machine you can see the strange error you get - it locked that out too!

Re:Screen Scrape the Site (1)

PerformanceDude (1798324) | more than 4 years ago | (#32939648)

Tebee,

My company has some pretty sophisticated data transformation tools that we use in forensics. You can connect with me via the /. friends system if you manage to get hold of the source data. We may be able to return it to you in something simple like CSV and then from there things should be easy.

Not promising a result but happy to at least take a look

um... google. (0)

Anonymous Coward | more than 4 years ago | (#32939274)

1 put each entry in text form
2 let google see it for a minute or two.
3 there is no step 3.

Ask Pubmed guys (2, Interesting)

mapkinase (958129) | more than 4 years ago | (#32939280)

Ask the guys behind PubMed

http://www.ncbi.nlm.nih.gov/pubmed [nih.gov]

The database of scientific articles in the field of medicine and biology.

NCBI has the most generous software code licensing possible: the code is absolutely free, with absolutely no restrictions on distributing, changing, selling, or even closing it. All because we, taxpayers, paid for it already.

I am surprised none of them have reacted yet; I am sure they read /.

Re:Ask Pubmed guys (1)

GumphMaster (772693) | more than 4 years ago | (#32939616)

Or perhaps the NASA Astrophysical Data Service http://adswww.harvard.edu/ [harvard.edu]

Discogs as model (0)

Anonymous Coward | more than 4 years ago | (#32939302)

Discogs is a reasonably good example of a community effort.

Sadly, it and others, like Foobar, are still controlled by selfish people.

And a thousand Mac Fanbois ... (2, Funny)

rueger (210566) | more than 4 years ago | (#32939330)

... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"

Oh, the number of times that I've heard that refrain... shudder ...

Re:And a thousand Mac Fanbois ... (1)

h4rr4r (612664) | more than 4 years ago | (#32939418)

Eww, the people responsible for that thing need to be led into the street and shot.

Until quite recently you could not even talk SQL to it.

Re:And a thousand Mac Fanbois ... (1)

arcsimm (1084173) | more than 4 years ago | (#32939738)

You know, I spent a semester of my life working for a department at my university that kept all of its operating information in FileMaker Pro databases. Of course, there were two of them, most of the data in one was replicated in the other, and if you actually wanted to *do* anything with the data in either, like have it show up on the departmental calendar or mailing list, you had to manually copy and paste it into still other databases. For most of that semester, my job was basically to function as a $10.00/hr database interface. Had I stayed on there any longer, my superiors would have probably showed up to work one day and discovered that all of their Filemaker DBs had mysteriously migrated into Postgres during the night...

Re:And a thousand Mac Fanbois ... (1)

Bungie (192858) | more than 4 years ago | (#32941516)

I have also spent a long time dealing with FileMaker, and it can be a huge PITA. Be thankful you didn't have to maintain a FileMaker Pro Server or web server for many people!

It is very easy for non-tech-savvy people to build a bunch of databases and start using them, which is cool. The problem is that the databases have a very simple design and most people don't even know how to set up a relationship between two fields. They just drag and drop fields onto a form and let FileMaker figure out how to store and share the data.

Those databases then tend to evolve, and as they get more complex they are harder to manage using the simple interface that FileMaker Pro tries to provide. One person's quick inventory-tracking database suddenly becomes a massive asset database used by the whole company years down the road, and you're left struggling to keep it running. FileMaker Pro and Lotus Domino are the worst for this kind of thing.

IIRC there are a few ways to extract the data from FileMaker Pro databases. There is an ODBC driver that comes with the FileMaker Pro client (at least it did back in the 3.x and 4.x days). That would be the easiest way to extract the data for other applications to use. FileMaker Pro 4.0 also used to come with a web server plugin that would use CDML [fmdeveloper.com] to generate dynamic web pages from the database (of course Claris HomePage was the best tool to build CDML apps at the time).

Drupal, hands down. (2, Interesting)

Beltanin (1760568) | more than 4 years ago | (#32939420)

Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr [drupal.org] ) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module [drupal.org] , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about [drupal.org] , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore [drupal.org] , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health [drupal.org]

Built in to mySQL (1)

Salamanders (323277) | more than 4 years ago | (#32939452)

MySQL 5's Fulltext index [mysql.com] with the "natural language search" option might do everything you need with almost no overhead. That, plus PHP's PDO [php.net] to connect to the database, and I think you might be done. How much data are we talking, anyhow? 10,000 magazine articles or less?
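A sketch of that combination: a FULLTEXT index over the searchable columns, queried through PDO (table, column names and the sample search terms are placeholders; natural-language mode is MySQL's default for MATCH ... AGAINST):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// One-time setup: FULLTEXT index over the searchable columns (MyISAM tables in MySQL 5.x).
// $pdo->exec('ALTER TABLE articles ADD FULLTEXT ft_search (title, keywords)');

$stmt = $pdo->prepare(
    'SELECT title, author FROM articles
     WHERE MATCH (title, keywords) AGAINST (:q)'
);
$stmt->execute(array(':q' => 'brass locomotive kitbash'));
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['title'] . ' by ' . $row['author'] . "\n";
}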

Those bastards... (0)

Anonymous Coward | more than 4 years ago | (#32939484)

I am guessing the index in question is this one:

http://index.mrmag.com/

They just devalued a bookcase full of magazines in my basement...

File format, not the implementation details (2, Insightful)

frisket (149522) | more than 4 years ago | (#32939580)

It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.

IMO it shouldnt be hard to re-parse the data (1)

Wookie_CD (639534) | more than 4 years ago | (#32939700)

If you're talking, like most of the commenters above, about retrieving the data from the server through tm.exe, then this does become an exercise in scraping. wget has builtin recursive-fetching capabilities and if you can access a complete index that would be a logical starting point. With my background, if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql. I'd read the data file(s) looking for textual content in a linked structure, and the rest is just research and a bit of perl work (or php etc, if you prefer). Once you figure out which table structure would contain the data, and you come up with a conversion which will put the data into an importable format, the job's almost done and you just need to bring in or write a CMS to access it. I have source code which would go towards some individual bits of a project like this, contact me if you like. Good luck...

B& (1)

tepples (727027) | more than 4 years ago | (#32939778)

wget has builtin recursive-fetching capabilities

Which will get the IP address of the machine running the scraper permanently banned. See the post above [slashdot.org] .

if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql

It's likely that the raw data is encrypted. Based on the comments so far, I see no reliable indication of what country tebee operates from, or whether that country has a DMCA-alike.

Re:B& (1)

Wookie_CD (639534) | more than 4 years ago | (#32940028)

This is all a bit academic until the content owner either agrees to reopen web access to a conversion team, or releases the source data for analysis.

Re:B& (1)

b4dc0d3r (1268512) | more than 4 years ago | (#32940824)

It's not academic if we can show the poster some sort of very simple wiki-like CMS into which people with 6 decades of back issues might volunteer to enter/edit information. If everyone were organized, 100 people could enter the data in a weekend. Allowing time to edit and refine keywords, without copying the actual content, would add some time. And the backend database could end up more valuable than the original.

Scraping the data isn't possible, getting the data looks unlikely. So you recreate it. Have people claim an issue, and enter the data. People with few issues will claim the ones they have so people with more comprehensive coverage can focus on what no one else has. Bonus is, if no one else is interested, no one bothers to enter what they know, so the project self-immolates.

hyperestraier (1)

sugarmotor (621907) | more than 4 years ago | (#32939754)

Take a look at http://hyperestraier.sourceforge.net/ [sourceforge.net] ... there might be something newer by the same author, Mikio Hirabayashi

Extracting the text from whatever files you have would be a separate step.

Alfresco + Drupal (0)

Anonymous Coward | more than 4 years ago | (#32939870)

Alfresco + Drupal

Fancy that (1, Funny)

Anonymous Coward | more than 4 years ago | (#32939884)

> One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this

Huh, I didn't realize that porn had a governing body.

It's a library catalog. (1)

oneiros27 (46144) | more than 4 years ago | (#32939948)

Don't ask generic nerds -- ask library nerds : code4lib [code4lib.org] . They have a pretty active mailing list.

Also, there's oss4lib [oss4lib.org] which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).

Re:It's a library catalog. (1)

dangitman (862676) | more than 4 years ago | (#32940220)

Denholm: It's settled. I've got a good feeling about you Jen and they need a new manager.
Jen: Fantastic! So, the people I'll be working with, what are they like?
Denholm: Standard nerds!

[Note: Not to be confused with standards nerds]

Using a Howitzer to Hunt Squirrels (2, Insightful)

salesgeek (263995) | more than 4 years ago | (#32940096)

Lots of people here are recommending tools that are built for very large-scale projects. Given that you have a DOS-based system that likely used a pretty common library for storing the data (something like c-tree, Btrieve, a dBASE library, or simply saving binary data using whatever language the app was written in), any RDBMS like MySQL or even SQLite would probably do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap, and would be able to handle more of a load than the DOS app could.

Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

This is the ModelRR mag. database (1)

codeaholic (1596129) | more than 4 years ago | (#32940148)

The description suggests that this is the Model RR magazine DB. Checking Kalmbach (the company that hosted it) shows that, indeed, it is offline. (http://index.mrmag.com/) The DB was a very simple (by today's standards) index of articles.

As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and 80's DBs tended to have simple file formats. As many suggested, a C++, Java, Python or other script can pull the data out and dump it to XML, MySQL, CSV files, etc. From there, it is easy to rehost it wherever needed.

My suggestion is to simply replicate the old (very dated, but simple) UI: both for searching and for data entry. That can be done very easily in PHP & MySQL. These tools are readily available on any web host making the task fairly simple (for someone familiar with these tools.) It also means that the site's webmaster should know what needs to be done to secure the app.

Getting a straight replacement up validates the whole process and restores the existing functionality. Only at that point should you consider extending the system, perhaps using many of the good ideas noted above. Obvious extensions are to license the full text of articles to provide a full-text index (rather than just the hand-entered keywords in the current system). Perhaps provide links to publishers who sell them online. Lots of ways to go.

Good luck. As a user of the DB, I'd love to see it back online & better than ever!

Re:This is the ModelRR mag. database (1)

codeaholic (1596129) | more than 4 years ago | (#32940488)

If this is done as a volunteer effort, I'd be happy to help, esp. with extracting the old data. Contact me using my Slashdot user name + att dot net. (I hope THAT fools the spambots!)

How about a database? (0)

Anonymous Coward | more than 4 years ago | (#32940302)

You don't need niche software for this. You just need a simple database. It sounds like that's really all the existing solution is. Your data schema is simple enough that it would probably fit nicely in a couple of tables (Authors, Issues, and Articles come to mind). I think you're making this harder than it needs to be.

Why PHP? (1)

WhiteHorse-The Origi (1147665) | more than 4 years ago | (#32940398)

Why use PHP? I would think Python would be better because you can cross-compile the code to run on any machine using Jython (in case they stop hosting for you). Personally, I would do a full scrape of the data and put it in BibTeX .bib files or XML, and then make your search page pass parameters to the Python program. That's what NASA and Google Scholar use (they may use Perl instead). I'm not sure about the database...

I work for a Library (0)

Anonymous Coward | more than 4 years ago | (#32940664)

It may be overkill for what you want to do, but you should look at Evergreen, the open-source Integrated Library System (think card-catalog) used by the public libraries in the state of Georgia: http://www.open-ils.org/dokuwiki/doku.php?id=faqs:evergreen_faq_2 . It can certainly do what you want done, and a whole lot more. You can just ignore the parts about circulation (or strip them out). You may run into problems with library-specific jargon and standard practices that you don't necessarily need, but surely there's a librarian or two out there in the model railroad world.

A project very similar to Evergreen is Koha: http://koha.org/about

You may also want to look at LibraryThing: http://www.librarything.com/tour/. It's focused on books, but it may be possible to make it work with articles as well.

Postgresql has good text indices (0)

Anonymous Coward | more than 4 years ago | (#32940736)

PostgreSQL makes it pretty easy to set up full-text indexes if you're trying to make this a database-ish application. It's really flexible for stuff like this.

SWISH is a good, non-database index as well.

Of course, someone else already said lucene.

PHP might be "ok" for the web interface (especially if nothing else is available) but I wouldn't even think of using it to populate the index initially.

Dspace (1)

ericlondaits (32714) | more than 4 years ago | (#32940922)

Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.

Hypercard 2.0? (1)

AHuxley (892839) | more than 4 years ago | (#32940942)

Something like an open source hypercard stack?
Anyone can understand a card system, enter unique data per card and save.
Humans are good at that.
Bring them all together and you have a huge digital stack to be sorted, searched or as the backend to a nice simple topic interface.
Computers are great at that now.
That would help your crowd-sourcing; if it's open source, there are no MS closed-format issues later on.

DOS Data (1)

nospam007 (722110) | more than 4 years ago | (#32941344)

If it's 20-year-old DOS, chances are that it's either Paradox or dBASE or some xBASE format, which could be easily opened with Access or even Winword.
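If it really is dBASE/xBASE, PHP's old dbase extension (where available) can read the file directly; a sketch assuming a single hypothetical articles.dbf:

<?php
// Requires ext/dbase; field names come from the .dbf header itself.
$db = dbase_open('articles.dbf', 0);           // 0 = read-only
$n  = dbase_numrecords($db);
for ($i = 1; $i <= $n; $i++) {                 // dbase records are 1-indexed
    $record = dbase_get_record_with_names($db, $i);
    unset($record['deleted']);                 // the extension adds a 'deleted' flag
    print implode(' | ', $record) . "\n";
}
dbase_close($db);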

BibTeX (1)

GeniusDex (803759) | more than 4 years ago | (#32941400)

It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.

let me do what I do best: nitpicking. (0)

Anonymous Coward | more than 4 years ago | (#32941508)

"probably using php and mysql"

That's nice. Why are you asking then?

It seems to me that the data is a static index that is entirely read-only except for occasional updating (how many times a year?). Does the system need to stay online during updating, then? Meaning that you don't really need an RDBMS, or the poor imitation MySQL gives you (yes yes it's shiny and Oracle owns it, now shup). Some SQL interface might seem useful, but how many different queries are you going to write? How large are you expecting your userbase to scale?

Just to give an example of a different approach: You could probably write a couple small shell scripts to generate lists of things sorted a couple different ways into static html pages from some master file. That's duplicating the data alright, but how much is it really? Plus it'll compress really well and serving up gzipped content saves a lot on bandwidth too. Serving static html is almost always going to be faster and easier to maintain and so on than executing a scripting language that accesses a database backend for each pageview. Mind, you don't need to stick to static pages for everything, but with a little scripting you'd have something to show for your efforts within minutes, and you can expand in your copious free time later.

This was brought to you by someone who wrote a static-pages "cms", complete with custom markup and an index generator, in a couple hundred lines of awk. Want to do it in Perl? Show me the one-liner. Point is, the most well-trodden path isn't always the best for any given problem. In the case of assuming PHP+MySQL, it's the obvious choice for people who don't know any better. And ignoring the nature of the problem and the properties of the data is a bit of a pity.
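For comparison, the same static-generation idea sketched in PHP (only to match the rest of the thread's examples; any language works), assuming a hypothetical tab-separated master file with magazine, year, title, author, keywords columns:

<?php
// build.php - regenerate static HTML listing pages from a master TSV file.
$byMagazine = array();
foreach (file('master.tsv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    list($magazine, $year, $title, $author, $keywords) = explode("\t", $line);
    $byMagazine[$magazine][] = '<li>' . htmlspecialchars("$year: $title - $author") . '</li>';
}
foreach ($byMagazine as $magazine => $items) {
    $page = '<html><body><h1>' . htmlspecialchars($magazine) . '</h1><ul>'
          . implode("\n", $items) . '</ul></body></html>';
    file_put_contents('out/' . preg_replace('/\W+/', '_', $magazine) . '.html', $page);
}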

Sure About DOS? (1)

Bungie (192858) | more than 4 years ago | (#32941570)

Are you sure it's a true DOS application and not a Win32 console app? I know it is entirely possible for someone to write a CGI in DOS, but it seems really weird to me that they would use DOS, since it didn't have anything that would serve CGI, and coding a hand-rolled database format would be a lot of extra work.

If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access. If you look at the raw data it might look crazy and unusable because JET databases use XOR to obfuscate the contents of the database file (and prevent you from extracting the strings inside).
