
Developing a Niche Online-Content Indexing System?

timothy posted more than 4 years ago | from the we-had-to-index-individual-clay-tablets dept.

Databases 134

tebee writes "One of my hobbies has benefited for 20 years or so from the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index has recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it, as the original data is in an unknown format. So we are talking about putting together a team to build an open-source replacement for this – probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. Tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do — maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article; we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google Books agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"


134 comments


Sphinx or Lucene (3, Informative)

Anonymous Coward | more than 4 years ago | (#32938980)

Or did I misunderstand the question?

Re:Sphinx or Lucene (0)

Anonymous Coward | more than 4 years ago | (#32939148)

Yes, you misunderstood it. In fact, it's pretty clear that you didn't even bother to read anything he wrote. He clearly doesn't want to index a bunch of text for searching purposes. He basically just wants to build a directory web site, much like Yahoo! or the Open Directory Project, but targeted to his subject's niche.

Re:Sphinx or Lucene (4, Informative)

tebee (1280900) | more than 4 years ago | (#32939218)

Yes, you did misunderstand.

We do not have the full text of the articles online; all we have is the title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of an article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries, often spanning five or six decades of monthly magazines.

All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.

It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead-tree) publisher in this field has also discouraged development in this direction.

Re:Sphinx or Lucene (3, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939366)

Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.

If this isn't what you have in mind, please elaborate.
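For what it's worth, a minimal PHP sketch of that workflow, assuming a hypothetical layout of one small text file per article (line 1 = title, line 2 = author, line 3 = keywords); the directory path and field order are made up:

<?php
// Walk a directory of tiny per-article text files and collect rows
// ready to hand to whatever indexer or database you settle on.
$rows = array();
foreach (glob('/data/articles/*.txt') as $file) {
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $rows[] = array(
        'title'    => isset($lines[0]) ? $lines[0] : '',
        'author'   => isset($lines[1]) ? $lines[1] : '',
        'keywords' => isset($lines[2]) ? $lines[2] : '',
    );
}
print count($rows) . " records collected\n";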

Re:Sphinx or Lucene (3, Interesting)

Trepidity (597) | more than 4 years ago | (#32939594)

If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").

A bibliography indexer would probably be a better choice. Two good free ones are Refbase [refbase.net] and Aigaion [aigaion.nl]. Both are targeted mainly at databases of scientific literature, so they might need some tweaking for this purpose, though.

Re:Sphinx or Lucene (2, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939628)

Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice: some journals have arcane numbering systems, so those tools should be able to cope with a magazine collection.

Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.

Re:Sphinx or Lucene (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941224)

It's really not a text indexing problem, unless you are going to throw out the RDBMS and use a flat text file.

If you use a relational database, then it is at most a three-table problem: articles, sources, and articles-to-sources. If you can join those, you have the core of a classic content management system.

From what I gather, they haven't even gotten that far. It is just a master index of articles that are available (which point to nothing in particular), so it is a 1-table problem.

For 1-table problems I generally use Excel.

Re:Sphinx or Lucene (1)

Jherico (39763) | more than 4 years ago | (#32941146)

Solr [apache.org] in front of Lucene is a perfectly reasonable way to index highly structured information and allows structured queries.

Re:Sphinx or Lucene (1)

hsmyers (142611) | more than 4 years ago | (#32939398)

Not being up to speed on current open source that might prevent premature wheel re-invention, my answer would be 'No'. That said, I don't see any particular trouble with the project itself. If I understand correctly, you have bare-bones bibliographic information that you want to create an online index of. The notion of PHP and MySQL seems sound, although I suspect that Perl would work as well if not better, depending on the knowledge of your volunteer talent. I expose my bias here when I point out that text analysis is a particular strength of Perl. I'm currently involved with a project that does an enormous amount of semantic analysis, which might be used to create keywords on the fly, for instance. Now that I think of it, there is no particular reason the work couldn't be multi-lingual for that matter, leveraging the programmer base at your disposal. Continuing to think about it, I'd love to help--- reply to me at gmail.com if you are interested...

Re:Sphinx or Lucene (1)

hsmyers (142611) | more than 4 years ago | (#32939522)

Just noticed a thread on Hacker News about http://www.gotapi.com/html which might be of interest...

Re:Sphinx or Lucene (1)

symes (835608) | more than 4 years ago | (#32939434)

They have no plans to replace it as the original data is in an unknown format.

Well there aren't that many obvious candidates... any of these [d2ca.org] look familiar?

Re:Sphinx or Lucene (3, Informative)

OrangeCatholic (1495411) | more than 4 years ago | (#32939464)

So let me get this straight: this is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delimited or whatever).

Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.

These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.

For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.

The edit, delete, and retrieve scripts are similarly simple. All you need is a Linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
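A rough sketch of what an mr_add.php along those lines might look like with PDO, simplified to one self-posting script and an auto-increment id instead of the separate row_id fetch and mr_add2.php step; database credentials and column names are placeholders:

<?php
// mr_add.php - show a form, then insert the submitted article row.
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // Prepared statement: the values never touch the SQL string directly.
    $stmt = $pdo->prepare(
        'INSERT INTO mr_article_list (title, author, keywords) VALUES (?, ?, ?)'
    );
    $ok = $stmt->execute(array($_POST['title'], $_POST['author'], $_POST['keywords']));
    echo $ok ? 'Article added.' : 'Insert failed.';
} else {
    echo '<form method="post">
        Title: <input name="title"><br>
        Author: <input name="author"><br>
        Keywords: <input name="keywords"><br>
        <input type="submit" value="Add article">
    </form>';
}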

Now this is where it gets interesting:

>many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines

Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the really useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).

Then you need a third table, mr_article_sources, which is real simple: it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).

Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:

01 19 05
02 19 07
03 19 09

When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).

Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
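To make that concrete, a hedged sketch of the three tables and the cross-index join in PHP/MySQL, reusing the hypothetical names above (exact column types are guesses):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// Schema sketch: articles, sources, and the long narrow cross-index table.
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_article_list (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255), author VARCHAR(255), keywords VARCHAR(255))');
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_sources (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255), address VARCHAR(255), phone VARCHAR(64))');
$pdo->exec('CREATE TABLE IF NOT EXISTS mr_article_sources (
    row_id INT AUTO_INCREMENT PRIMARY KEY,
    article_row_id INT, source_row_id INT)');

// "Who has article #19?" - one query joining all three tables.
$stmt = $pdo->prepare('SELECT s.name, s.phone
    FROM mr_article_list a
    JOIN mr_article_sources x ON x.article_row_id = a.row_id
    JOIN mr_sources s ON s.row_id = x.source_row_id
    WHERE a.row_id = ?');
$stmt->execute(array(19));
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['name'] . ' - ' . $row['phone'] . "\n";
}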

No, no, NO! (3, Insightful)

RichiH (749257) | more than 4 years ago | (#32941368)

Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even first normal form.

Read http://en.wikipedia.org/wiki/Database_normalization [wikipedia.org]

You want to define a table "article_tags" or something with id, article_id, name, comment. Make the combination of article_id and name unique.

* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL

Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many articles are tagged electric? SELECT count(1) FROM article_tags WHERE name = "electric";
Give me a list of all articles relating to foo or bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar";
etc pp.
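One possible DDL for that tag table, assuming the articles live in a hypothetical articles table; types and sizes are guesses:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// One row per (article, tag) pair instead of a comma-separated VARCHAR.
$pdo->exec('CREATE TABLE IF NOT EXISTS article_tags (
    id INT AUTO_INCREMENT PRIMARY KEY,
    article_id INT NOT NULL,
    name VARCHAR(100) NOT NULL,
    comment VARCHAR(255) NULL,
    UNIQUE KEY uniq_article_tag (article_id, name),
    FOREIGN KEY (article_id) REFERENCES articles(id)
) ENGINE=InnoDB');

// "How many articles are tagged electric?"
$count = $pdo->query(
    "SELECT COUNT(*) FROM article_tags WHERE name = 'electric'"
)->fetchColumn();
echo "$count articles tagged electric\n";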

If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:

http://en.wikipedia.org/wiki/Transitive_closure [wikipedia.org]

Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and pretty much everything else.

One last word of advice: no matter what anyone tells you, either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits, and SQL injection becomes impossible, for free! Repeat: even if you ignore all the other tips above, you _MUST_ heed this.

http://en.wikipedia.org/wiki/SQL_injection [wikipedia.org]
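For illustration, a minimal PDO sketch of the prepared-statement advice; the table and column names are placeholders:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// Don't: user input concatenated straight into the SQL string is injectable.
// $rows = $pdo->query("SELECT * FROM articles WHERE author = '" . $_GET['author'] . "'");

// Do: a prepared statement; the value is sent separately from the SQL text.
$stmt = $pdo->prepare('SELECT title, author FROM articles WHERE author = :author');
$stmt->execute(array(':author' => $_GET['author']));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);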

Richard

PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.

Re:No, no, NO! (1)

RichiH (749257) | more than 4 years ago | (#32941372)

The second statement should have asked for articles tagged with both foo and bar. A plain WHERE name = "foo" AND name = "bar" can never match a single row, so a grouped query is needed:

SELECT article_id FROM article_tags WHERE name IN ("foo", "bar") GROUP BY article_id HAVING COUNT(DISTINCT name) = 2;

Re:Sphinx or Lucene (1)

tolan-b (230077) | more than 4 years ago | (#32939544)

I think you should still have a look at Sphinx and Lucene. You can put whatever data you want into them, in whatever schema you want (at least with Lucene; I believe with Sphinx too). You can then easily create a UI as a front end and let the indexing engine do the hard work of slicing and dicing by your criteria. I believe the Zend Framework library has a Lucene API.

Also if you do manage to go fulltext later then it'll mean less work.

Re:Sphinx or Lucene (1)

banjo D (212277) | more than 4 years ago | (#32940168)

I don't know about Sphinx but I agree that Lucene could be a good solution, for the reasons tolan-b lists. I work on a digital library cataloging project that indexes its metadata with Lucene. We use PHP to generate the user-facing website, which queries our Lucene index via a Solr server. We do have a highly structured metadata schema and we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966" (which somebody in another comment suggested is not easy to do with Lucene, but in our experience was very easy). And adding a Solr server on top makes it easy to include features like faceted search.
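As a rough illustration of that kind of query from PHP against Solr's standard select handler (the core URL, field names and date format are assumptions about a hypothetical schema):

<?php
// "All articles with 'foo' in the title, published between 1950 and 1966."
$params = array(
    'q'    => 'title:foo AND pub_date:[1950-01-01T00:00:00Z TO 1966-12-31T23:59:59Z]',
    'wt'   => 'json',
    'rows' => 50,
);
$url = 'http://localhost:8983/solr/select?' . http_build_query($params);
$response = json_decode(file_get_contents($url), true);
foreach ($response['response']['docs'] as $doc) {
    echo $doc['title'] . ' (' . $doc['magazine'] . ")\n";
}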

Re:Sphinx or Lucene (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941262)

>we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966"

SELECT * FROM banjo_articles WHERE title LIKE "%foo%" AND date BETWEEN "1950-01-01" AND "1966-12-31"

You're bragging that your "system" has a single line of code?

I've seen selects ten or twenty lines long, with multiple joins, and joins and selects within joins. Granted it's not fast, but it works, and it takes all of an hour (or less) to write such a query.

Re:Sphinx or Lucene (2, Interesting)

rs79 (71822) | more than 4 years ago | (#32940110)

I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.

Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.

Re:Sphinx or Lucene (1)

sporkboy (22212) | more than 4 years ago | (#32940700)

The standard /. IANAL applies here, but I'm pretty sure that if you have legal access to the copyrighted text (ie you or someone you know owns a copy of the magazine) then it is ok to create a derivative work for the purposes of searching that work. This is the loophole that Google (name your favorite search engine here) uses, and they go so far as to offer cached versions of some sites.

Lucene, or a more friendly wrapper around it like SOLR, has the option of creating a search index based on an original text from which the original content cannot be extracted (indexed=true, stored=false on a field), so that would seem to cover the case of finding an article without violating the rights of the author or the publisher.

As for not having the text online, I'd suggest scraping the archive sites in the process of building your search index; it's pretty hard to search something that isn't digitized.

Best of luck, as this sounds like a worthwhile project. I do think that the volume of data you're discussing would fit easily in a SOLR instance that would consume very modest amounts of server resources to operate.

It would help (3, Insightful)

Xamusk (702162) | more than 4 years ago | (#32938982)

if you said what hobby and index that is. Doing so would surely catch more interest from the Slashdot crowd.

Re:It would help (3, Funny)

beakerMeep (716990) | more than 4 years ago | (#32939146)

Maybe it's the type of magazines that people used to read "for the articles?"

Re:It would help (2, Funny)

bsDaemon (87307) | more than 4 years ago | (#32939230)

I'm pretty sure porn indexing isn't niche... or a hobby. It's the true reason Google exists.

Re:It would help (1)

BitterOak (537666) | more than 4 years ago | (#32940492)

if you said what hobby and index that is. Doing so would surely catch more interest from the Slashdot crowd.

Maybe it's the type of magazines that people used to read "for the articles?"

And that's precisely the type of magazine that would catch the interest of the Slashdot crowd.

Re:It would help (4, Informative)

tebee (1280900) | more than 4 years ago | (#32939242)

OK, the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe [mrmag.com] but it was removed, without warning, last week, so there is not a lot to see.

Re:It would help (0)

Anonymous Coward | more than 4 years ago | (#32939304)

Who the fuck uses the suffix ".exe" for a Web page? These guys have bats in the belfry in general, it seems.

Re:It would help (1)

ZERO1ZERO (948669) | more than 4 years ago | (#32939342)

It's not a web page it's a DOS program, hence the ask slashdot....

Re:It would help (1)

tebee (1280900) | more than 4 years ago | (#32939444)

It's a DOS program that runs on the server, rather like a CGI script. Its output is a web page.

It is a bit of a throwback to the dawn of the web, when people thought up innovative ways to do things.

Re:It would help (1)

Bungie (192858) | more than 4 years ago | (#32941090)

CGI works by having the server execute the program (passing the request data to it via environment variables and STDIN, or the command line) and then retrieving the page's complete HTML code from STDOUT. You can use any file that can be executed and can use STDIN/STDOUT in this manner, located in a specified directory (like cgi-bin). On Windows this would be any exe, com, pif, bat or cmd file, and the extension must be there for the operating system to determine that it is an executable. On Linux you can use any file that has +x permissions, compiled binaries or scripts with a shebang at the beginning, so it can have any extension you want (or none) for a CGI.

People used to write a lot of CGI applications in Perl because of its text processing capabilities, but there were many CGIs that were compiled programs (written in languages like C). At one point Microsoft was really pushing the idea of easily writing CGI applications in Visual Basic [microsoft.com] and hosting them with IIS.

CGI fell out of popularity in favor of embedded scripting like PHP and ASP, which have much less overhead (they don't have to create a new process to service every user request and wait for its output) and are much less complex for people to use (they don't require special directories or permissions).
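To make the mechanism concrete, a tiny sketch of a CGI program, written here in PHP and run via the php CLI binary; any language that can read environment variables and write to STDOUT works the same way:

#!/usr/bin/php
<?php
// A CGI program receives the request through environment variables (and STDIN
// for POST bodies) and must write the full response, headers first, to STDOUT.
$query = getenv('QUERY_STRING');
parse_str($query ? $query : '', $params);
$term = isset($params['q']) ? htmlspecialchars($params['q']) : '(none)';

echo "Content-Type: text/html\r\n\r\n";
echo "<html><body>You searched for: $term</body></html>";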

Wayback (3, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939474)

You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe [archive.org] , then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.

Re:Wayback (1)

Cylix (55374) | more than 4 years ago | (#32939620)

Definitely an easy re-write.

Just going to be painful to re-enter all that data if they can't use the original binary blob.

A long time ago I had a programming assignment dealing with binary blobs: basically, unknown data structures within a binary. Provided they used no encryption, it should be relatively painless to extract the data. It was trivial then, and now I'm way better.

Re:Wayback (1)

Hognoxious (631665) | more than 4 years ago | (#32940692)

How will that help? As far as I understand, the pages are created on the fly, so without the "engine" behind them you won't get anything.

Re:It would help (0)

Anonymous Coward | more than 4 years ago | (#32939530)

Wrong crowd. He's hoping for people interested in online-content indexing systems, not a thread full of duffers commenting on the niche.

pubmed (0)

Anonymous Coward | more than 4 years ago | (#32939002)

ask the pubmed people at NIH: http://www.ncbi.nlm.nih.gov/pubmed

Developing a Niche Online-Content Indexing System? (3, Insightful)

omar.sahal (687649) | more than 4 years ago | (#32939010)

I don't know if this would be helpful, but the people at Wikipedia must know a fair amount about running crowd-sourced sites. Even if you can't talk with the higher-ups, there would be contributors who would know about best practices. Also, when you deal with people, they will be a lot more helpful if they benefit from helping you.

Re:Developing a Niche Online-Content Indexing Syst (1)

Tablizer (95088) | more than 4 years ago | (#32939818)

Wikipedia's search stinks, in my opinion. It's gotten better of late, but it's still not the gold standard by any stretch.

Just migrate it to VMware or KVM (3, Informative)

RobiOne (226066) | more than 4 years ago | (#32939012)

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939100)

without more info from OP, this is the end of the discussion.

Re:Just migrate it to VMware or KVM (1)

pspahn (1175617) | more than 4 years ago | (#32939108)

This could work and allows you enough time to not come up with something lame.

Re:Just migrate it to VMware or KVM (2)

OzPeter (195038) | more than 4 years ago | (#32939170)

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

That assumes that the original data is available to the OP. It may be that it is not.

Re:Just migrate it to VMware or KVM (1)

Threni (635302) | more than 4 years ago | (#32939340)

> That assumes that the original data is available to the OP. It may be that it is not.

If only the article in some way made this clear.

"we are in negotiations to try and get the original data."

Oh, it does.

Re:Just migrate it to VMware or KVM (1)

OzPeter (195038) | more than 4 years ago | (#32939516)

"we are in negotiations to try and get the original data."

In other words the OP does not have the data. And from the OP's reply below it may be that they never get it.

for my bunghole (0)

Anonymous Coward | more than 4 years ago | (#32940686)

Well until he does get it, any consideration of how to process it is somewhat moot.

Which makes me wonder why he bothered asking, the fucking twerp.

Re:for my bunghole (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32941324)

>Well until he does get it, any consideration of how to process it is somewhat moot.

Not quite. He was clear enough to construct a data model. This customer knows what he wants. Problem is, it will take his own efforts to fill in the gaps (in terms of getting access).

"Hi, I want you to install a refrigerator in my apartment. It needs to fit in a hole 30 inches wide by 30 inches deep."

"Will you take a refrigerator 28 inches wide by 26 inches deep?"

"Sure but....lemme talk to my landlord first."

If Zeus descended from the sky and said, "I'll do whatever it takes to get this index online..."

Would Zeus succeed, or would the customer say to him, "I'm not ready?"

P.S. I'm NOT for hire on this job. I am not.even.a.programmer.anymore.

I will, however, take queries as far as I check my email (which is unreliable) and as far as I check this page (until tomorrow at the least).

You asked, you got your answers. 88 comments, perhaps 10 of them were useful. Anyone who says to "use X" is dumb. By the time you figure out how to use it, you could have written your own.

This is 1-3 tables, which for a real-world analogy is like 1-3 sheets of paper. Customer says what? Landlord? Landlord rules 1-3 sheets of paper. Good luck with that access.

Re:Just migrate it to VMware or KVM (1)

tebee (1280900) | more than 4 years ago | (#32939390)

As of now it is not available.

We are putting pressure on the current owners to make it available, as they have suffered a certain amount of bad publicity over this, but so far to no avail. They did purchase the program for real money 10 years ago, but the fact that they are unable to run it should indicate to them it has little or no value now.

My thoughts have been along the lines of running it on some old PC hanging off an ADSL line with dynamic DNS, but virtualization may be a better idea. Does anyone offer virtual private servers that run DOS?

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939460)

It seems to me that you don't need a VPS that runs DOS. Just signup for a Linux-based VPS and run DOSbox on top of it. The performance hit will be minimal.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32939492)

I believe Network Solutions offers a virtual server with domain/DNS that runs Linux, but I am sure you could get one with DOS.

Re:Just migrate it to VMware or KVM (1)

OrangeCatholic (1495411) | more than 4 years ago | (#32939596)

Well the program and the data are two different things. At least to me they are.

All you need to do is run the program once, get a dump of the entire article list, and import it into your new MySQL table.

And running the program requires, what, DOS? Come on. Forget the web, that's out of the picture now with regards to the old, expired system. You just need ONE copy of the data and you can re-build the web interface yourself with php.

It sounds to me like the data is proprietary and they are being stingy with it. But what other use they have for it, I don't know. You could have all the private libraries index their own collections, and collate the results, but something tells me that would require an extensive level of participation.
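Assuming the old program can be coaxed into producing one big delimited dump, loading it into a new MySQL table is a few lines of PHP; the file name, table and columns below are hypothetical:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare(
    'INSERT INTO mr_article_list (magazine, issue, title, author, keywords)
     VALUES (?, ?, ?, ?, ?)'
);

$fh = fopen('dump.csv', 'r');        // one-time export from the old system
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute($row);            // assumes the CSV column order matches the INSERT
}
fclose($fh);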

Re:Just migrate it to VMware or KVM (2, Interesting)

b4dc0d3r (1268512) | more than 4 years ago | (#32940224)

If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.

Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.

Give me just the DOS program at least, and I'll get you the format.

Re:Just migrate it to VMware or KVM (1)

commanderfoxtrot (115784) | more than 4 years ago | (#32941536)

Mod parent up!

Nothing necessarily wrong with the DOS program anyway - if it works, why break it?

You should be able to run it pretty easily with either a virtual machine or an emulator; you can then look at extracting the data from it and migrating it to a flashier site. Sticking with the DOS program sounds like the simpler solution for now.

Re:Just migrate it to VMware or KVM (0)

Anonymous Coward | more than 4 years ago | (#32941356)

The problem isn't how to handle the data from the legacy solution; he wants to know what would be a good modern solution for indexing and searching the article information with their specific constraints. The question wasn't related to using the legacy platform at all; he might not even have access to the old system. Even if he did, he wants to migrate to something newer, but first he needs to know what will do the job (thus the Ask Slashdot)...

put the data online if you can (1, Informative)

Anonymous Coward | more than 4 years ago | (#32939040)

There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.

Yeah, having some kind of online query interface can be useful, and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download; no SQL or PHP needed. The SQL and PHP can be done later.

Re:put the data online if you can (1)

martin-boundary (547041) | more than 4 years ago | (#32939408)

Very true. In fact, making the data available for download also solves the problem of bandwidth bills. After the initial bunch of people have downloaded their own copy, they can serve it from other websites, thus sharing the load.

Data hoarders (1)

tepples (727027) | more than 4 years ago | (#32939758)

It also introduces the problem of people who download the whole data set just to collect it, with no intention of accessing the vast majority of it or serving it up to someone else.

hoarding == massive replication (2, Interesting)

martin-boundary (547041) | more than 4 years ago | (#32939802)

Short term, it's true that can eat some bandwidth, but long term that's the solution to the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.

Re:Data hoarders (1)

quickOnTheUptake (1450889) | more than 4 years ago | (#32940472)

Just stick it on bittorrent, if there is a big demand.
Realistically, though, I doubt the database is very large (moreover, I doubt there are all that many people who would want this data). I mean, if you are indexing 50 magazines, over 100 years, with an average of 10 articles in each one, that's 50k articles. Let's say each article has 200B of data, thats, what? ~2 meg uncompressed?

The binary file shouldn't be hard to read (2, Informative)

bartonski (1858506) | more than 4 years ago | (#32939102)

I would run the Unix commands 'file' (you might get lucky and get a file type it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack'/'unpack' functions found in Perl or Ruby, to extract the data, which you can then load into a database.
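If the file does turn out to be fixed-length records, PHP's unpack() (a cousin of Perl's pack) can pull the fields apart; the record length and field widths below are pure guesses, just to show the shape of the job:

<?php
// Hypothetical layout: 60-byte title, 30-byte author, 4-byte little-endian year.
$recordLength = 94;
$fh = fopen('index.dat', 'rb');
while (($record = fread($fh, $recordLength)) !== false
        && strlen($record) === $recordLength) {
    $fields = unpack('a60title/a30author/Vyear', $record);
    printf("%s | %s | %d\n",
        rtrim($fields['title']), rtrim($fields['author']), $fields['year']);
}
fclose($fh);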

Try Ruby on Rails (4, Funny)

olyar (591892) | more than 4 years ago | (#32939110)

I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.

The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.

(Please Note: This post is a joke, and not an attempt to start a flame war).

Re:Try Ruby on Rails (4, Funny)

greg1104 (461138) | more than 4 years ago | (#32940418)

It's data for a model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.

Silly question (0)

Anonymous Coward | more than 4 years ago | (#32939120)

This may seem like a silly question, but if the data is in an unknown format and it's handled by an existing DOS program, why not just keep using that old DOS program? It still works, probably has low resource requirements, and is dealing with (I gather) mostly fixed data. Maybe bring in some people to try to reverse engineer the data format? But, really, just move the DOS program and data to a different server and save yourself months of effort.

Re:Silly question (1)

Cylix (55374) | more than 4 years ago | (#32939646)

Bad idea.

It's a bad idea for the same reason they don't want to host a DOS executable anymore.

Even if for some strange reason the text could not be retrieved from the binary blob (which is not likely), the application still works today.

A single command-line wildcard search would re-dump the text, which could be parsed and stored in a simple database.

Re:Silly question (0)

Anonymous Coward | more than 4 years ago | (#32940644)

Who says it supports wildcards?

That is a data convertion project (1)

mrmeval (662166) | more than 4 years ago | (#32939172)

You could write a custom program that scrapes the data from a website you set up to allow that program to run standalone, or you could figure out what the data format is and write a program to convert it.

If you want to recreate the data from scratch, then you'd need to set up a website your group would access to enter data. That would be crowd-sourcing, but you'd probably want something specific to your needs, using easily maintainable code.

As others have stated, you could use virtualization. Inside the virtual machine you may even be able to run a LAMP stack and run the DOS program with DOSBox running as an unprivileged user. http://www.dosbox.com/ [dosbox.com] http://www.virtualbox.org/ [virtualbox.org] http://www.vmware.com/ [vmware.com] .

I would only consider the virtual solution a stop gap until you could get the database translated to something maintainable or recreate the data.

Security anyone? (0)

Anonymous Coward | more than 4 years ago | (#32939224)

Running 20-year-old code is a security disaster - now you want to replace it with what... PHP? Few people would call that an improvement.

Screen Scrape the Site (1)

mbone (558574) | more than 4 years ago | (#32939258)

See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.

(It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)
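A sketch of the scraping loop in PHP, with a polite delay; the query parameter name is a pure guess, and (as noted elsewhere in the thread) the real site bans rapid sequential requests, so this would only be workable with the owner's blessing:

<?php
// Walk the alphabet, fetch each result page, and stash the raw HTML for later parsing.
foreach (range('a', 'z') as $letter) {
    $url = 'http://index.mrmag.com/tm.exe?search=' . urlencode($letter); // hypothetical parameter
    $html = @file_get_contents($url);
    if ($html !== false) {
        file_put_contents("scrape-$letter.html", $html);
    }
    sleep(10);  // be gentle: the original host banned aggressive scrapers
}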

Re:Screen Scrape the Site (1)

tebee (1280900) | more than 4 years ago | (#32939588)

If you could scrape the site, I would have done it years ago. Unfortunately the programmer built anti-scraping technology into the program to "protect his data". If you issue too many sequential requests, it locks your IP out - permanently! I discovered this about 8 years ago when I was doing some manual scraping and it did it to me.

If you look at the site ( http://index.mrmag.com/ [mrmag.com] ) on the Wayback Machine you can see the strange error you get - it locked that out too!

Re:Screen Scrape the Site (1)

PerformanceDude (1798324) | more than 4 years ago | (#32939648)

Tebee,

My company has some pretty sophisticated data transformation tools that we use in forensics. You can connect with me via the /. friends system if you manage to get hold of the source data. We may be able to return it to you in something simple like CSV and then from there things should be easy.

Not promising a result but happy to at least take a look

um... google. (0)

Anonymous Coward | more than 4 years ago | (#32939274)

1 put each entry in text form
2 let google see it for a minute or two.
3 there is no step 3.

Ask Pubmed guys (2, Interesting)

mapkinase (958129) | more than 4 years ago | (#32939280)

Ask the guys behind PubMed

http://www.ncbi.nlm.nih.gov/pubmed [nih.gov]

The database of scientific articles in the field of medicine and biology.

NCBI has the most generous software code licensing possible: the code is absolutely free, with absolutely no restrictions on distributing, changing, selling, or even closing it. All because we, taxpayers, paid for it already.

I am surprised none of them have reacted yet; I am sure they read /.

Re:Ask Pubmed guys (1)

GumphMaster (772693) | more than 4 years ago | (#32939616)

Or perhaps the NASA Astrophysical Data Service http://adswww.harvard.edu/ [harvard.edu]

Discogs as model (0)

Anonymous Coward | more than 4 years ago | (#32939302)

Discogs is a reasonably good example of a community effort.

Sadly, it and others, like Foobar, are still controlled by selfish people.

And a thousand Mac Fanbois ... (2, Funny)

rueger (210566) | more than 4 years ago | (#32939330)

... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"

Oh, the number of times that I've heard that refrain... shudder ...

Re:And a thousand Mac Fanbois ... (1)

h4rr4r (612664) | more than 4 years ago | (#32939418)

Eww, the people responsible for that thing need to be led into the street and shot.

Until quite recently you could not even talk SQL to it.

Re:And a thousand Mac Fanbois ... (1)

arcsimm (1084173) | more than 4 years ago | (#32939738)

You know, I spent a semester of my life working for a department at my university that kept all of its operating information in FileMaker Pro databases. Of course, there were two of them, most of the data in one was replicated in the other, and if you actually wanted to *do* anything with the data in either, like have it show up on the departmental calendar or mailing list, you had to manually copy and paste it into still other databases. For most of that semester, my job was basically to function as a $10.00/hr database interface. Had I stayed on there any longer, my superiors would have probably showed up to work one day and discovered that all of their Filemaker DBs had mysteriously migrated into Postgres during the night...

Re:And a thousand Mac Fanbois ... (1)

Bungie (192858) | more than 4 years ago | (#32941516)

I have also spent a long time dealing with FileMaker, and it can be a huge PITA. Be thankful you didn't have to maintain a FileMaker Pro Server or web server for many people!

It is very easy for non-tech-savvy people to build a bunch of databases and start using them, which is cool. The problem is that the databases have a very simple design and most people don't even know how to set up a relationship between two fields. They just drag and drop fields onto a form and let FileMaker figure out how to store and share the data.

Those databases then tend to evolve, and as they get more complex they are harder to manage using the simple interface that FileMaker Pro tries to provide. One person's quick inventory-tracking database suddenly becomes a massive asset database used by the whole company years down the road, and you're left struggling to keep it running. FileMaker Pro and Lotus Domino are the worst for this kind of thing.

IIRC there are a few ways to extract the data from FileMaker Pro databases. There is an ODBC driver that comes with the FileMaker Pro client (at least it did back in the 3.x and 4.x days). That would be the easiest way to extract the data for other applications to use. FileMaker Pro 4.0 also used to come with a web server plugin that would use CDML [fmdeveloper.com] to generate dynamic web pages from the database (of course Claris HomePage was the best tool to build CDML apps at the time).

Drupal, hands down. (2, Interesting)

Beltanin (1760568) | more than 4 years ago | (#32939420)

Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr [drupal.org] ) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module [drupal.org] , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about [drupal.org] , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore [drupal.org] , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health [drupal.org]

Built in to mySQL (1)

Salamanders (323277) | more than 4 years ago | (#32939452)

MySQL 5's Fulltext index [mysql.com] with the "natural language search" option might do everything you need with almost no overhead. That, plus PHP's PDO [php.net] to connect to the database, and I think you might be done. How much data are we talking, anyhow? 10,000 magazine articles or less?
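A sketch of that combination: a FULLTEXT index over the searchable columns, queried through PDO (table, column names and the sample search terms are placeholders; natural-language mode is MySQL's default for MATCH ... AGAINST):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mr_index', 'mr_user', 'secret');

// One-time setup: FULLTEXT index over the searchable columns (MyISAM tables in MySQL 5.x).
// $pdo->exec('ALTER TABLE articles ADD FULLTEXT ft_search (title, keywords)');

$stmt = $pdo->prepare(
    'SELECT title, author FROM articles
     WHERE MATCH (title, keywords) AGAINST (:q)'
);
$stmt->execute(array(':q' => 'brass locomotive kitbash'));
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['title'] . ' by ' . $row['author'] . "\n";
}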

Those bastards... (0)

Anonymous Coward | more than 4 years ago | (#32939484)

I am guessing the index in question is this one:

http://index.mrmag.com/

They just devalued a bookcase full of magazines in my basement...

File format, not the implementation details (2, Insightful)

frisket (149522) | more than 4 years ago | (#32939580)

It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.

IMO it shouldnt be hard to re-parse the data (1)

Wookie_CD (639534) | more than 4 years ago | (#32939700)

If you're talking, like most of the commenters above, about retrieving the data from the server through tm.exe, then this does become an exercise in scraping. wget has builtin recursive-fetching capabilities and if you can access a complete index that would be a logical starting point. With my background, if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql. I'd read the data file(s) looking for textual content in a linked structure, and the rest is just research and a bit of perl work (or php etc, if you prefer). Once you figure out which table structure would contain the data, and you come up with a conversion which will put the data into an importable format, the job's almost done and you just need to bring in or write a CMS to access it. I have source code which would go towards some individual bits of a project like this, contact me if you like. Good luck...

B& (1)

tepples (727027) | more than 4 years ago | (#32939778)

wget has builtin recursive-fetching capabilities

Which will get the IP address of the machine running the scraper permanently banned. See the post above [slashdot.org] .

if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql

It's likely that the raw data is encrypted. Based on the comments so far, I see no reliable indication of what country tebee operates from, or whether that country has a DMCA-alike.

Re:B& (1)

Wookie_CD (639534) | more than 4 years ago | (#32940028)

This is all a bit academic until the content owner either agrees to reopen web access to a conversion team, or releases the source data for analysis.

Re:B& (1)

b4dc0d3r (1268512) | more than 4 years ago | (#32940824)

It's not academic if we can show the poster some sort of very simple wiki-like CMS into which people with 6 decades of back issues might volunteer to enter/edit information. If everyone were organized, 100 people could enter the data in a weekend. Allowing time to edit and refine keywords, without copying the actual content, would add some time. And the backend database could end up more valuable than the original.

Scraping the data isn't possible, getting the data looks unlikely. So you recreate it. Have people claim an issue, and enter the data. People with few issues will claim the ones they have so people with more comprehensive coverage can focus on what no one else has. Bonus is, if no one else is interested, no one bothers to enter what they know, so the project self-immolates.

hyperestraier (1)

sugarmotor (621907) | more than 4 years ago | (#32939754)

Take a look at http://hyperestraier.sourceforge.net/ [sourceforge.net] ... there might be something newer by the same author, Mikio Hirabayashi

Extracting the text from whatever files you have would be a separate step.

Alfresco + Drupal (0)

Anonymous Coward | more than 4 years ago | (#32939870)

Alfresco + Drupal

Fancy that (1, Funny)

Anonymous Coward | more than 4 years ago | (#32939884)

> One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this

Huh, I didn't realize that porn had a governing body.

It's a library catalog. (1)

oneiros27 (46144) | more than 4 years ago | (#32939948)

Don't ask generic nerds -- ask library nerds : code4lib [code4lib.org] . They have a pretty active mailing list.

Also, there's oss4lib [oss4lib.org] which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).

Re:It's a library catalog. (1)

dangitman (862676) | more than 4 years ago | (#32940220)

Denholm: It's settled. I've got a good feeling about you Jen and they need a new manager.
Jen: Fantastic! So, the people I'll be working with, what are they like?
Denholm: Standard nerds!

[Note: Not to be confused with standards nerds]

Using a Howitzer to Hunt Squirrels (2, Insightful)

salesgeek (263995) | more than 4 years ago | (#32940096)

Lots of people here are recommending tools that are built for very large-scale projects. Given that you have a DOS-based system that likely used a pretty common library for storing the data (something like c-tree, Btrieve, a dBASE library, or simply saving binary data using whatever language the app was written in), any RDBMS like MySQL or even SQLite would probably do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap, and would be able to handle more of a load than the DOS app could.

Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

This is the ModelRR mag. database (1)

codeaholic (1596129) | more than 4 years ago | (#32940148)

The description suggests that this is the Model RR magazine DB. Checking Kalmbach (the company that hosted it) shows that, indeed, it is offline. (http://index.mrmag.com/) The DB was a very simple (by today's standards) index of articles.

As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and 80's DBs tended to have simple file formats. As many suggested, a C++, Java, Python or other script can pull the data out and dump it to XML, MySQL, CSV files, etc. From there, it is easy to rehost it wherever needed.

My suggestion is to simply replicate the old (very dated, but simple) UI: both for searching and for data entry. That can be done very easily in PHP & MySQL. These tools are readily available on any web host making the task fairly simple (for someone familiar with these tools.) It also means that the site's webmaster should know what needs to be done to secure the app.

Getting a straight replacement up validates the whole process and restores the existing functionality. Only at that point should you consider extending the system, perhaps using many of the good ideas noted above. Obvious extensions are to license the full text of articles to provide a full-text index (rather than just the hand-entered keywords in the current system). Perhaps provide links to publishers who sell them online. Lots of ways to go.

Good luck. As a user of the DB, I'd love to see it back online & better than ever!

Re:This is the ModelRR mag. database (1)

codeaholic (1596129) | more than 4 years ago | (#32940488)

If this is done as a volunteer effort, I'd be happy to help, esp. with extracting the old data. Contact me using my Slashdot user name + att dot net. (I hope THAT fools the spambots!)

How about a database? (0)

Anonymous Coward | more than 4 years ago | (#32940302)

You don't need niche software for this. You just need a simple database. It sounds like that's really all the existing solution is. Your data schema is simple enough that it would probably fit nicely in a couple of tables (Authors, Issues, and Articles come to mind). I think you're making this harder than it needs to be.

Why PHP? (1)

WhiteHorse-The Origi (1147665) | more than 4 years ago | (#32940398)

Why use PHP? I would think Python would be better because you can cross-compile the code to run on any machine using Jython (in case they stop hosting for you). Personally, I would do a full scrape of the data and put it in BibTeX .bib files or XML, and then make your search page pass parameters to the Python program. That's what NASA and Google Scholar use (they may use Perl instead). I'm not sure about the database...

I work for a Library (0)

Anonymous Coward | more than 4 years ago | (#32940664)

It may be overkill for what you want to do, but you should look at Evergreen, the open-source Integrated Library System (think card-catalog) used by the public libraries in the state of Georgia: http://www.open-ils.org/dokuwiki/doku.php?id=faqs:evergreen_faq_2 . It can certainly do what you want done, and a whole lot more. You can just ignore the parts about circulation (or strip them out). You may run into problems with library-specific jargon and standard practices that you don't necessarily need, but surely there's a librarian or two out there in the model railroad world.

A project very similar to Evergreen is Koha: http://koha.org/about

You may also want to look at LibraryThing: http://www.librarything.com/tour/. It's focused on books, but it may be possible to make it work with articles as well.

Postgresql has good text indices (0)

Anonymous Coward | more than 4 years ago | (#32940736)

PostgreSQL makes it pretty easy to set up full-text indexes if you're trying to make this a database-ish application. It's really flexible for stuff like this.

SWISH is a good, non-database index as well.

Of course, someone else already said lucene.

PHP might be "ok" for the web interface (especially if nothing else is available) but I wouldn't even think of using it to populate the index initially.

Dspace (1)

ericlondaits (32714) | more than 4 years ago | (#32940922)

Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.

Hypercard 2.0? (1)

AHuxley (892839) | more than 4 years ago | (#32940942)

Something like an open source hypercard stack?
Anyone can understand a card system, enter unique data per card and save.
Humans are good at that.
Bring them all together and you have a huge digital stack to be sorted, searched or as the backend to a nice simple topic interface.
Computers are great at that now.
That would help your crowd-sourcing; if it's open source, there are no MS closed-format issues later on.

DOS Data (1)

nospam007 (722110) | more than 4 years ago | (#32941344)

If it's 20-year-old DOS, chances are that it's either Paradox or dBASE or some xBASE format, which could be easily opened with Access or even Winword.
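If it really is dBASE/xBASE, PHP's old dbase extension (where available) can read the file directly; a sketch assuming a single hypothetical articles.dbf:

<?php
// Requires ext/dbase; field names come from the .dbf header itself.
$db = dbase_open('articles.dbf', 0);           // 0 = read-only
$n  = dbase_numrecords($db);
for ($i = 1; $i <= $n; $i++) {                 // dbase records are 1-indexed
    $record = dbase_get_record_with_names($db, $i);
    unset($record['deleted']);                 // the extension adds a 'deleted' flag
    print implode(' | ', $record) . "\n";
}
dbase_close($db);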

BibTeX (1)

GeniusDex (803759) | more than 4 years ago | (#32941400)

It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.

let me do what I do best: nitpicking. (0)

Anonymous Coward | more than 4 years ago | (#32941508)

"probably using php and mysql"

That's nice. Why are you asking then?

It seems to me that the data is a static index that is entirely read-only except for occasional updating (how many times a year?). Does the system need to stay online during updating, then? Meaning that you don't really need an RDBMS, or the poor imitation MySQL gives you (yes yes it's shiny and Oracle owns it, now shup). Some SQL interface might seem useful, but how many different queries are you going to write? How large are you expecting your userbase to scale?

Just to give an example of a different approach: You could probably write a couple small shell scripts to generate lists of things sorted a couple different ways into static html pages from some master file. That's duplicating the data alright, but how much is it really? Plus it'll compress really well and serving up gzipped content saves a lot on bandwidth too. Serving static html is almost always going to be faster and easier to maintain and so on than executing a scripting language that accesses a database backend for each pageview. Mind, you don't need to stick to static pages for everything, but with a little scripting you'd have something to show for your efforts within minutes, and you can expand in your copious free time later.

This was brought to you by someone who wrote a static-pages "cms", complete with custom markup and an index generator, in a couple hundred lines of awk. Want to do it in Perl? Show me the one-liner. Point is, the most well-trodden path isn't always the best for any given problem. In the case of assuming PHP+MySQL, it's the obvious choice for people who don't know any better. And ignoring the nature of the problem and the properties of the data is a bit of a pity.
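For comparison, the same static-generation idea sketched in PHP (only to match the rest of the thread's examples; any language works), assuming a hypothetical tab-separated master file with magazine, year, title, author, keywords columns:

<?php
// build.php - regenerate static HTML listing pages from a master TSV file.
$byMagazine = array();
foreach (file('master.tsv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    list($magazine, $year, $title, $author, $keywords) = explode("\t", $line);
    $byMagazine[$magazine][] = '<li>' . htmlspecialchars("$year: $title - $author") . '</li>';
}
foreach ($byMagazine as $magazine => $items) {
    $page = '<html><body><h1>' . htmlspecialchars($magazine) . '</h1><ul>'
          . implode("\n", $items) . '</ul></body></html>';
    file_put_contents('out/' . preg_replace('/\W+/', '_', $magazine) . '.html', $page);
}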

Sure About DOS? (1)

Bungie (192858) | more than 4 years ago | (#32941570)

Are you sure it's a true DOS application and not a Win32 console app? I know it is entirely possible for someone to write a CGI in DOS, but it seems really weird to me that they would use DOS, since it didn't have anything that would serve CGI, and coding a hand-rolled database format would be a lot of extra work.

If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access. If you look at the raw data it might look crazy and unusable because JET databases use XOR to obfuscate the contents of the database file (and prevent you from extracting the strings inside).
