Sanely Moving from Word to the Web?
FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
Scrapping (Score:5, Interesting)
1. Place all "to-process" documents in a specific folder in a webserver
2. Write a script to read those documents
3. Use regular expressions (and similar functions) to strip or replace specific tags and wording, much as in web scraping.
Admittedly, it was tedious at first to identify every possible template; however, I'm amazed at how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.
Once the changes are done, I preview them in a browser, and if everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
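For anyone who wants a starting point, the three steps above might look roughly like this in Python. This is only a sketch and not the poster's actual script: the rule list, the folder layout, and the names `clean` and `process_folder` are all made up for illustration, and real documents will need rules tuned to their own "blueprint".

```python
import re
from pathlib import Path

# Hypothetical cleanup rules: (pattern, replacement) pairs, applied in order.
RULES = [
    # Drop presentational attributes wherever they appear.
    (re.compile(r'\s+(?:style|class|lang)="[^"]*"', re.I), ""),
    # Drop <font>, <span> and Word's <o:p> tags but keep the text inside them.
    (re.compile(r"</?(?:font|span|o:p)\b[^>]*>", re.I), ""),
    # Drop paragraphs that contain nothing but whitespace or &nbsp;.
    (re.compile(r"<p>\s*(?:&nbsp;)?\s*</p>", re.I), ""),
]


def clean(html: str) -> str:
    """Apply every cleanup rule to one document."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


def process_folder(src: str, dst: str) -> None:
    """Clean every .html file in src and write the result into dst."""
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        (out_dir / path.name).write_text(clean(text), encoding="utf-8")
```

Calling `process_folder("to-process", "cleaned")` would leave the originals alone, so the cleaned copies can still be previewed in a browser and tweaked by hand.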
Sounds like you should release on sourceforge (Score:5, Interesting)
Re:Sounds like you should release on sourceforge (Score:3, Insightful)
Plus the templates are probably in-house templates and thus would be useless outside of the company.
Re:Sounds like you should release on sourceforge (Score:3, Funny)
But we're not elite and I'm now going to learn how to do macros in emacs
Actually, an NDA probably doesn't matter. (Score:3, Interesting)
However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.
At a minimum a lot of small companies
Re:Actually, an NDA probably doesn't matter. (Score:5, Funny)
Re:Actually, an NDA probably doesn't matter. (Score:3, Insightful)
Did he say the website was public?
Trying this again (Score:3, Interesting)
s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
s:</FONT>::Ig
s:<FONT[ -=\"A-Z0-9]*>::Ig
s:BORDER=[0-9]*::Ig
s:ALIGN=B
Re:Scrapping (Score:3, Insightful)
Of course, you really can't properly parse HTML using regular expressions alone. You can get it right 90% of the time relatively quickly, and a day or so of work will get you 5% more, but you could spend weeks trying to get that last 5% -- and never quite get it.
It's really better to use things that other people have made for parsing html. For example, if you use perl (and you shou
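To illustrate the parser-instead-of-regex point in a language other than Perl, here is a rough sketch using Python's standard `html.parser` module. The whitelist in `KEEP`, the `Scrubber` class, and the `scrub` helper are assumptions chosen for the example, not anything the poster described; adjust the sets to whatever markup your site actually allows.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags to keep, and the few attributes allowed on them.
KEEP = {"p", "h1", "h2", "h3", "ul", "ol", "li", "em", "strong",
        "a", "table", "tr", "td", "th", "br"}
KEEP_ATTRS = {"a": {"href"}}
VOID = {"br"}                      # tags that never get a closing tag
DROP_WITH_CONTENT = {"style", "script"}


class Scrubber(HTMLParser):
    """Re-emit only whitelisted tags; text is kept, everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0             # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
            return
        if tag not in KEEP:
            return
        allowed = KEEP_ATTRS.get(tag, set())
        kept = ""
        for name, value in attrs:
            if name in allowed:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in KEEP and tag not in VOID:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def scrub(html: str) -> str:
    parser = Scrubber()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the parser tokenizes the markup, unquoted attributes, Word's conditional comments, and namespaced tags such as `<o:p>` are handled without a separate pattern for each.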
Re:Scrapping (Score:3, Informative)
Re:OpenOffice 2.0-beta "save-as" and "export" grea (Score:3, Interesting)
Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.
OpenOffice API:
http://api.openoffice.org/ [openoffice.org]
Code snippet shows simplicity of converting OpenOffice Writer SXW document
Handy alternative to Notepad (Score:2)
What I basically do is paste the document into EditPlus, then I use a function called "Replace" to get rid of the big stuff and edit out the rest of the tags manually. It may not be the best solution, but it's visually easier than just using notepad.
Re:Handy alternative to Notepad (Score:2)
Free, open source, has macros, cross-platform (Linux, Windows), and recognizes syntax from multiple languages.
PDF? (Score:2, Insightful)
Just a thought
Tedious cleanup? (Score:2, Funny)
PDF? (Score:3, Insightful)
Re:PDF? (Score:3, Funny)
Re:PDF? (Score:3, Interesting)
However, if someone is getting the idea for another open source project to solve this dilemma then I'd suggest somethin
Re:PDF? (Score:3, Informative)
Re:PDF? (Score:3, Insightful)
Use antiword (Score:2, Informative)
Re:Use antiword (Score:2)
Oh wait! You're serious [demon.nl].
Re:Use antiword (Score:3, Informative)
Dreamweaver (Score:5, Informative)
Re:Dreamweaver (Score:5, Informative)
Re:Dreamweaver (Score:5, Informative)
If you paste text into the Code view, DW removes the formatting completely and just uses the raw text.
Re:Dreamweaver (Score:4, Funny)
Re:Dreamweaver (Score:3, Informative)
Worked pretty well, once I'd got the search/replace stuff sussed out.
Mind you, on a big word file you can think it's crashed when actually it's just doing lots of thinking...
Re:Dreamweaver (Score:3, Insightful)
Indeed we'd love to move to advanced CSS for page formatting but that's a big step right now - there are no professional WYSIWYG editors that have the sheer range and quality of features we need - Page templates, ability for clients to update the site later
Antiword (Score:3, Informative)
Textism (Score:5, Informative)
http://textism.com/wordcleaner/ [textism.com]
I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.
Re:Textism (Score:4, Informative)
Re:Textism (Score:3, Informative)
One suggestion (Score:4, Funny)
You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.
Of course, you could also outsource to India but that's unethical to both the monkeys and the American economy.
Re:One suggestion (Score:3, Funny)
One Word... (Score:5, Funny)
Re:One Word... (Score:5, Funny)
Usually they are from academic users
It sounds like this might be a university environment. The correct answer should be grad students.
Re:One Word... (Score:2)
Or better yet, exchange students. The mythical tri-lingual graduate exchange students are, of course, ideal. They can work on the HTML and translate the content while they're at it. All your footnote are belong to us, etc.
PDF? (Score:2)
OpenOffice.Org, but not HTML export (Score:2)
Unless you're dealing with content that has to be accessed or updated frequently, then PDF is the way to go.
Re:OpenOffice.Org, but not HTML export (Score:2)
Change the Model? (Score:2)
Re: (Score:3, Interesting)
HTML Export (Score:3, Informative)
http://www.microsoft.com/downloads/details.aspx?F
I believe this functionality is built into later versions of Word.
Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.
Re:HTML Export (Score:5, Informative)
Re:HTML Export (Score:2)
Dreamweaver and the like may do a better job, but if you don't want to buy a tool like that for the occasional content paste, then this technique eliminates at least 80% of the work.
Dreamweaver (Score:5, Informative)
Good luck!
HTML Tidy (Score:5, Informative)
Tidy Flags (Score:5, Informative)
Re:Tidy Flags (Score:2, Informative)
Why? XHTML isn't any better than HTML 4.01 for almost anybody, and it's less compatible.
wvWare (Score:2)
It can programmatically convert most Word documents into HTML documents, and does about as good a job as one could expect. And it makes better HTML than Word does itself.
Re:wvWare (Score:2)
javester said that it's not being maintained anymore, but I do see some recent updates at the Sourceforge page [sourceforge.net]. If it's not being actively developed anymore, it would seem to be because it works pretty well now.
In any event, I've found it to work very well for my use.
no link for you, Slashdot hordes! (Score:5, Informative)
Step 1: Let's look at his user page [slashdot.org]
Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/ [homedns.org]
Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..
Step 2: Let's look at his author [wfu.edu] page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome
A-ha! There is a link to his employer! It's Economic History Services [eh.net]. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.
Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.
Re:no link for you, Slashdot hordes! (Score:5, Funny)
There is NO WAY the slashdot effect can be avoided. Resistance is futile...
Re:no link for you, Slashdot hordes! (Score:4, Informative)
Re:no link for you, Slashdot hordes! (Score:3, Interesting)
Re:no link for you, Slashdot hordes! (Score:5, Funny)
You must be new here.
Common... what? (Score:3, Funny)
What is this "common sense" of which you speak? Where may I download it from?
HTML Tidy (Score:3, Informative)
HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.
HTML Tidy:
http://tidy.sourceforge.net/ [sourceforge.net]
HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
http://www.chami.com/html-kit/ [chami.com]
Countless other editors integrate with HTML Tidy as well. Have fun and good luck!
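If you want to script the command-line tool rather than use an editor, something like the following Python wrapper could work. Treat it as a sketch: it assumes a `tidy` binary is on the PATH, the option names (`word-2000`, `bare`, `clean`) are Tidy configuration options as I remember them, so check `tidy -help-config` on your build, and `tidy_command` and `tidy_word_html` are names invented for this example.

```python
import subprocess


def tidy_command(path: str) -> list:
    """Build a Tidy command line aimed at Word-exported HTML."""
    return [
        "tidy",
        "-quiet",
        "--word-2000", "yes",      # the Word cleanup mode
        "--bare", "yes",           # strip Microsoft-specific markup
        "--clean", "yes",          # replace presentational tags with CSS
        "--show-warnings", "no",
        path,
    ]


def tidy_word_html(path: str) -> str:
    """Run Tidy on one file and return the cleaned markup from stdout."""
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    # Tidy exits with 1 when it only has warnings; 2 means real errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

Capturing stdout leaves the original export untouched, which makes it easy to diff the before and after.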
Get it in PDF first. (Score:3, Interesting)
What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.
Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.
Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.
Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.
Homesite (Score:2)
HTML parsing (Score:2)
Some future version of Tomcat [apache.org] should have built-in content parsing in its filters so that filter writers could write simple filters to transform content in a meaningful way. But I haven't see
use kword (Score:2)
Everything else I tried sucked, including OO.o's export.
Resign from your executive position (Score:3, Interesting)
Re:Resign from your executive position (Score:5, Insightful)
I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.
I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...
Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)
Re:Resign from your executive position (Score:3, Insightful)
Why do you use mutt and fetchmail? Why? Why? Why? Just about everywhere I have worked it has been easier (and often there is no choice) to just use what they use rather than trying to be clever or different. It is good to gain wide experience and it is good to have the flexibility to use the tools at hand.
Amen (Score:5, Funny)
Re:Resign from your executive position (Score:5, Funny)
I was given 61 screenshots (blithely dubbed "program requirements"), each its own Word document. Each containing only a (weirdly scaled) picture, of course.
61 Word documents.
Re:Resign from your executive position (Score:3, Funny)
Re:Resign from your executive position (Score:3, Insightful)
And as another poster has suggested, perhaps the quality department is to blame, as a memo or whatever isn't a real memo or whatever unless it has been created with the official approved template...
Use AbiWord on the command line (Score:2)
AbiWord --to=file.html file.doc
http://www.abisource.com/ [abisource.com]
is HTML really necessary (Score:2)
As solutions go, it is pretty crude. However, it works quickly and easily, and produces nice-looking output.
Yes! (Score:3, Insightful)
For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (was designed to be) that can adjust to varying monitor and window sizes.
antiword (Score:2)
http://www.winfield.demon.nl/ [demon.nl]
Use a text editor (Score:2)
avoiding the hand-edit (Score:2)
I'm seeing a lot of 'use Dreamweaver' responses that are well-meaning and probably will solve this guy's dilemma. But what about those of us running CMS systems with text area inputs in forms? Our content people copy-and-paste directly from Word and these crazy MsWord entities get crudely transposed into ASCII question marks.
Anyone got a good regsub routine for correctly substituting these entities for their approximate ASCII equivalents? I'm just looking for pattern matching here... Don't need a bunch of
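One possible answer to the substitution question, sketched in Python rather than as a regsub routine: a translation table that maps the usual Word punctuation to plain ASCII. The table contents and the names `WORD_TO_ASCII`, `asciify`, and `decode_form_bytes` are my own picks for the example. The question marks usually appear because Windows-1252 bytes are being read as some other encoding, so the second helper assumes the form data arrives as raw bytes and falls back to cp1252 when it isn't valid UTF-8.

```python
# Hypothetical mapping of common Word punctuation to approximate ASCII.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
    "\u00a0": " ",    # non-breaking space
}
_TABLE = str.maketrans(WORD_TO_ASCII)


def asciify(text: str) -> str:
    """Replace Word's typographic characters with ASCII stand-ins."""
    return text.translate(_TABLE)


def decode_form_bytes(raw: bytes) -> str:
    """Decode pasted form data, assuming cp1252 when it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")
```

Running `asciify(decode_form_bytes(raw))` on the submitted text area before it is stored should keep the question marks out of the database.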
Webworks Pro (Score:2)
You may want to look at the WebWorks Pro application for sanely exporting Word files as HTML/XML. I've used it in the past (a handful of years ago) and it was pretty reasonable. It is worth investigating in any case.
fckeditor (Score:3, Informative)
The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.
HTML Tidy program (Score:5, Informative)
One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/ [w3.org]. It seems to clean up code (particularly from Word) quite a bit.
PDF - GhostScipt (Score:2)
Grrr... (Score:2)
I could really use a speling cheker.
WordML - FO - XHTML/PDF (Score:5, Informative)
Using a modern version of Word, output in WordML (XML format). Use an XSL stylesheet [antennahouse.com] to convert the WordML to FO (formatting objects).
From there, do anything you want, like XHTML or PDF.
Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!
Recreating formatting? (Score:2, Insightful)
The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document
Net-It is your magical tool (Score:4, Informative)
Oh, you mean non-commercial magical tools?
Does anyone have any...magical tools...? (Score:2)
Abiword (Score:2)
Pagify (Score:3, Informative)
Try this.... (Score:5, Informative)
Demoroniser is, in the words of the author's own man page:
A Perl script which corrects incompatible HTML generated by Microsoft applications. [fourmilab.ch]
You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.
Dreamweaver MX 2004 (Score:3, Informative)
2) Select All -> Copy
3) Open Dreamweaver
4) File -> New Html Doc
5) Paste
6) Commands -> Clean up Word Html
7) Commands -> Apply Source Formatting (if you take the time to set the programs preferences to what you like)
8) Done
9) Drink beer
10) Sleep
wvHtml (Score:3, Informative)
From the sourceforge page:
I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.
Re:hi (Score:2, Funny)
Re:hi (Score:2, Funny)
Re:hi (Score:2, Funny)
I'm glad we have these little discussions. It makes my day so much more interesting.
Let's do lunch.
Re:You need an intern (Score:2)
1. Hire intern to transcribe work
2. Outsource html transcription job to india
4. Profit!!
Re:More specifically: Word into MS CMS (Score:2)
Re:More specifically: Word into MS CMS (Score:3, Informative)
You can also add an event handler for the updating event that does some regex tidying. Replacing the regex "]*>" will go a long way (better double-check that). You should be able to come up with a similar one for all the smarttag nonsense that gets inserted, too.
Still, Word formatting remains
Re:DEMORONISER (Score:3, Informative)
The Unmoroniser is an updated version that handles Unicode properly and will do things like convert proprietary Windows-only curly quotes to the appropriate HTML4 entities instead of dropping them back to less accurate, typographically offensive straight quotes. Same with ligatures and other characters that the Demoroniser would munge instead of convert.
http://rheme.net/unmoroniser/ [rheme.net]
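For a feel of what "convert to HTML4 entities instead of straight quotes" means in practice, here is a small Python sketch of the idea. It is not the Unmoroniser's code; `to_entities` is a made-up helper that leans on the standard library's HTML 4 entity table and falls back to numeric references for characters (ligatures, for instance) that have no named entity.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                                # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append("&{};".format(codepoint2name[cp])) # named HTML 4 entity
        else:
            out.append("&#{};".format(cp))                # numeric fallback
    return "".join(out)
```

The output is pure ASCII, so it survives whatever encoding the web server ends up declaring.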
Re:Export it as XML and XSLT it to HTML (Score:3, Informative)
1) get a copy of Word 2003
2) "save as" an exemplar as XML
3) write an XSLT to render it as HTML with stylesheets etc. as appropriate to your website
4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
5) publish
I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format but most of the other comments talk about starting
HTML Tidy cleans Word HTML. (Score:3, Informative)
HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.
It must be terrible to work at Microsoft and always do mediocre work.
Re:Convert to RTF first (Score:3, Interesting)