
Sanely Moving from Word to the Web?

Cliff posted more than 8 years ago | from the tag-soup dept.

Communications 547

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"


547 comments


Scrapping (5, Interesting)

fembots (753724) | more than 8 years ago | (#13282117)

Interestingly, I have a similar job on a website (no link for you either, Slashdot hordes!). Here's what I do (I'm sure there are smarter ways):

1. Place all "to-process" documents in a specific folder on a web server
2. Write a script to read those documents
3. Use regexes (and similar functions) to strip out and/or replace specific tags/wordings (similar to web scraping techniques).

Admittedly it was a tedious job at first to identify every possible template, but I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.

Once the changes are done, I preview them in a browser, and if everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
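
For anyone wanting to try the approach above, here is a minimal sketch in Python rather than whatever language the poster actually used. The rule list is an assumption for illustration only (the poster's real "blueprints" were never published), and the function name scrub_with_regex is made up. The attribute rule is deliberately crude and could also hit text that merely looks like an attribute.

```python
import re

# Hypothetical rules for one document "blueprint"; real documents need their own list.
RULES = [
    # HTML comments, including Word's conditional comments
    (re.compile(r"<!--.*?-->", re.S), ""),
    # Office namespace tags such as <o:p> or <w:WordDocument>
    (re.compile(r"</?(?:o|w|v|st1):[^>]*>", re.I), ""),
    # Purely presentational wrappers
    (re.compile(r"</?(?:span|font)\b[^>]*>", re.I), ""),
    # class/style/lang attributes left on the remaining tags
    (re.compile(r'''\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)''', re.I), ""),
    # Runs of spaces and tabs
    (re.compile(r"[ \t]+"), " "),
]


def scrub_with_regex(html):
    """Apply each rule in order and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html.strip()


sample = ('<p class=MsoNormal style="margin:0"><span lang=EN-US>'
          'Hello <o:p></o:p></span><b>world</b></p>')
print(scrub_with_regex(sample))
```

The order of the rules matters: wrappers are removed whole before the attribute rule runs, so that rule only has to deal with tags that are being kept.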

Sounds like you should release on sourceforge (5, Interesting)

arete (170676) | more than 8 years ago | (#13282136)

So if there are only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?

Re:Scrapping (2, Insightful)

dougmc (70836) | more than 8 years ago | (#13282241)

Use regexes (and similar functions) to strip out and/or replace specific tags/wordings (similar to web scraping techniques).
Of course, you really can't properly parse html just using regular expressions. You can get it right 90% of the time relatively quickly, and a day or so work will get you 5% more, but you could spend weeks trying to get that last 5% -- and never quite get it.

It's really better to use things that other people have made for parsing html. For example, if you use Perl (and you should -- it's the ideal tool for this), HTML::Parser works pretty well, though there's a significant learning curve in using it.
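
To illustrate the parser-based route, here is a rough sketch using Python's standard html.parser module instead of Perl's HTML::Parser (a swapped-in tool, not the one recommended above). The whitelist in ALLOWED and KEEP_ATTRS and the name scrub_with_parser are assumptions picked for the example, and the sketch expects to be fed body fragments rather than whole pages.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist; adjust to whatever the site's stylesheet expects.
ALLOWED = {"p", "b", "i", "em", "strong", "ul", "ol", "li",
           "h1", "h2", "h3", "table", "tr", "td", "th", "a"}
KEEP_ATTRS = {"a": {"href"}}
SKIP_CONTENT = {"style", "script"}  # drop these tags together with their text


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text
        wanted = KEEP_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in wanted and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def scrub_with_parser(html):
    """Re-emit only whitelisted tags and attributes; comments are dropped."""
    parser = Scrubber()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<p class=MsoNormal style="margin:0"><span lang=EN-US>Fish &amp; chips '
          '<o:p></o:p></span><a href="http://example.com/" style="color:red">link</a></p>')
print(scrub_with_parser(sample))
```

Because the output is rebuilt from parser events, odd quoting or attribute order in the input cannot slip through the way it can with a pattern match.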

Duh (-1, Troll)

Anonymous Coward | more than 8 years ago | (#13282118)

Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

Yeah, hire some sand monkeys from overseas to do it. That's what all the IT companies are doing. Duh.

Re:Duh (0, Offtopic)

sTalking_Goat (670565) | more than 8 years ago | (#13282193)

how does a racist Troll get moderated Interesting?

Re:Duh (1)

rlandrum (714497) | more than 8 years ago | (#13282236)

Perhaps if the workforce in the US didn't use phrases like "sand monkeys", IT companies wouldn't be so inclined to look for good workers overseas.

Re:Duh (0)

Anonymous Coward | more than 8 years ago | (#13282240)

"Yeah, hire some sand monkeys from overseas to do it. That's what all the IT companies are doing. Duh."

Sand monkey? Did you write that with a sheet over your head?

Handy alternative to Notepad (1)

bigwavejas (678602) | more than 8 years ago | (#13282119)

I use a program called EditPlus (http://www.editplus.com/ [editplus.com]). It has syntax highlighting for the most common extensions (HTML, CSS, PHP, ASP, Perl, C/C++, Java, JavaScript and VBScript).

What I basically do is paste the document into EditPlus, then I use a function called "Replace" to get rid of the big stuff and edit out the rest of the tags manually. It may not be the best solution, but it's visually easier than just using notepad.

Re:Handy alternative to Notepad (1, Informative)

Anonymous Coward | more than 8 years ago | (#13282183)

Why use EditPlus when you can use Crimson Editor, which is free, open source, and has all the functionality of EditPlus and then some? (http://www.crimsoneditor.com/ [crimsoneditor.com]) The built-in macro functionality is really sweet too!

hi (-1, Offtopic)

Anonymous Coward | more than 8 years ago | (#13282122)

Hello everyone!

Re:hi (2, Funny)

Anonymous Coward | more than 8 years ago | (#13282141)

hello. how are you?

Re:hi (2, Funny)

Anonymous Coward | more than 8 years ago | (#13282205)

I'm fine thank you

Re:hi (2, Funny)

Anonymous Coward | more than 8 years ago | (#13282251)

I'm fine too.

I'm glad we have these little discussions. It makes my day so much more interesting.

Let's do lunch.

PDF? (2, Insightful)

Anonymous Coward | more than 8 years ago | (#13282124)

How about Word -> PDF -> HTML?

Just a thought ... and probably a dumb one.

Tedious cleanup? (1, Funny)

Timesprout (579035) | more than 8 years ago | (#13282125)

Sounds like a job for Mrs FooAtWFU

Two words: (0)

Anonymous Coward | more than 8 years ago | (#13282127)

Chinese Children.

Coincidentally, the captcha for this post is "chinked". Fucking hillarious.

PDF? (2, Insightful)

night_flyer (453866) | more than 8 years ago | (#13282128)

you can either hot link to the .doc themselves in a new window or convert to .pdf and do the same thing

Use antiword (2, Informative)

shura57 (727404) | more than 8 years ago | (#13282130)

It takes a Word file and spits out plain text. It can also do some more tricks.

Re:Use antiword (1)

uberdave (526529) | more than 8 years ago | (#13282239)

I tried it before, but it merely spit out keystrokes and mouse gestures.

Oh wait! You're serious [demon.nl] .

Dreamweaver (5, Informative)

necro2607 (771790) | more than 8 years ago | (#13282131)

I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.

Re:Dreamweaver (5, Informative)

fean (212516) | more than 8 years ago | (#13282172)

In Dreamweaver, there's a command "Clean up MS Word HTML". It's made to clean up Word's crappy html, and does a pretty nice job of it.

Re:Dreamweaver (1)

kortex (590172) | more than 8 years ago | (#13282207)

Ditto on this idea - Dreamweaver also includes a nifty "Clean Up Word HTML" tool that really kicks ass. It will nicely and completely sterilize all the redundancy and nested bs --- as well as Word-specific tags to leave you with nice, clean HTML afterwards.
Apply style sheets to format and voila! - yer done :)

Re:Dreamweaver (1)

zerofret (880158) | more than 8 years ago | (#13282267)

I also find Dreamweaver to be pretty good at converting Word DOCs into HTML. It doesn't do a fully automated conversion as I've got some mandated from above 'corporate image' weirdness I have to adjust for, but it gets me about 80% of the way there.

Re:Dreamweaver (0)

Anonymous Coward | more than 8 years ago | (#13282270)

"Dreamweaver is pretty good for formatting and working with stylesheets."

Are you serious!!?? I almost spilled my coffee all over my keyboard when I read that line. Tables are also perfect for Standards Based XHTML layout!

Dreamweaver absolutely butchers Stylesheets...

I don't mean any disrespect to the poster, but unless you switch away from Dreamweaver and Tables, you are toast in the near future for Web Design.

Antiword (3, Informative)

alanp (179536) | more than 8 years ago | (#13282135)

Try antiword, it's got a real decent HTML option.

Textism (4, Informative)

NoInfo (247461) | more than 8 years ago | (#13282137)

Here's a tool I saw linked off of O'Reilly Radar once:

http://textism.com/wordcleaner/ [textism.com]

I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.

Re:Textism (3, Informative)

e**(i pi)-1 (462311) | more than 8 years ago | (#13282222)

A standalone Perl script I use daily is demoronizer [fourmilab.ch] .

One suggestion (3, Funny)

Da Fokka (94074) | more than 8 years ago | (#13282138)

You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.

Of course, you could also outsource to India but that's unethical to both the monkeys and the American economy.

Re:One suggestion (3, Funny)

Anonymous Coward | more than 8 years ago | (#13282232)

It's hard to find qualified monkeys - most of them already have jobs editing /. and cnn.com...

One Word... (5, Funny)

ScentCone (795499) | more than 8 years ago | (#13282140)

..."Intern"

Re:One Word... (5, Funny)

Cerdic (904049) | more than 8 years ago | (#13282198)

No, no, no...

Usually they are from academic users

It sounds like this might be a university environment. The correct answer should be grad students .

Dreamweaver (1)

GroovinWithMrBloe (832127) | more than 8 years ago | (#13282142)

Dreamweaver comes with a function explicitly for dealing with Word goodness (Clean Word HTML IIRC). Also, perhaps try HTML Tidy?

MOD PARENT (0)

Anonymous Coward | more than 8 years ago | (#13282195)


Yes, Dreamweaver has a handy "clean up word HTML" function. You can even grab a trial version (but it's worth the money imho).

Tidy (0)

Anonymous Coward | more than 8 years ago | (#13282144)

tidy, from w3c. Dreamweaver will clean up some Word HTML.

Convert to RTF first (0)

Anonymous Coward | more than 8 years ago | (#13282145)

I don't have a link, or proper info, but I recall seeing something here a few weeks back in which someone suggested saving the Word doc as RTF, then they had a util to convert RTF to HTML - apparently it was really useful.

PDF? (1)

nizo (81281) | more than 8 years ago | (#13282147)

Perhaps offer every document as a pdf (there are plenty of conversion tools out there, such as ps2pdf, which you can use after printing the document to a postscript file), as well as offer it in whatever format was sent to you?

You could... (0)

Anonymous Coward | more than 8 years ago | (#13282149)

ask them to save it as an RTF file... Reading an RTF is much easier whilst supporting almost all of the important text formatting features.

OpenOffice.Org, but not HTML export (1)

KillerBob (217953) | more than 8 years ago | (#13282150)

OpenOffice.Org supports the ability to export a document as PDF. As you probably know, PDF viewers are available for all mainstream OSes, including Linux, from Adobe themselves.

Unless you're dealing with content that has to be accessed or updated frequently, then PDF is the way to go.

Re:OpenOffice.Org, but not HTML export (1)

emaneman (905766) | more than 8 years ago | (#13282155)

I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?

Change the Model? (1)

Stanistani (808333) | more than 8 years ago | (#13282154)

Could you provide forms on your website for your academic users to submit the information directly?

Re:Change the Model? (2, Interesting)

danheskett (178529) | more than 8 years ago | (#13282248)

That's the best way. Really. You want the data to be in a structured format. Semantically structured if possible, but at least structured.

Define a bunch of templates. Use a templating system like smarty or whatever to make it happen. Give your users a simple form - HTML, Windows, Java, whatever, that selects a template and reads in a list of fields from the template. Dynamically generate the form fields to be filled based on the template. Store the data.

To generate a page start from the master record - be it in a database, an xml file, or whatever. Load the template and fill the data from the relational store. If you do it right you can even substitute different rendering layers and get an X/HTML version, a Word version, and a PDF version without any real substantial work.

This also helps (1) create consistent documents, (2) create documents for more than one target format, (3) create searchable content with rich meta-data and (4) move to a more robust system later without tons of extra work. I've done it before, and if you spend a week engineering the solution properly it'll last years.
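
A bare-bones sketch of that template-plus-record idea, using Python's string.Template rather than Smarty. Everything named here (the book_review template, its fields, form_fields, render) is hypothetical and only meant to show the shape of the design, not a finished system.

```python
from html import escape
from string import Template

# Hypothetical registry: template name -> (required fields, HTML rendering).
TEMPLATES = {
    "book_review": (
        ("title", "reviewer", "body"),
        Template('<h1>$title</h1>\n'
                 '<p class="byline">Reviewed by $reviewer</p>\n'
                 '<div class="body">$body</div>'),
    ),
}


def form_fields(template_name):
    """Fields a submission form would have to ask the author for."""
    return TEMPLATES[template_name][0]


def render(template_name, record):
    """Fill the chosen template from a stored record, escaping every value."""
    fields, template = TEMPLATES[template_name]
    missing = [f for f in fields if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    safe = {f: escape(str(record[f])) for f in fields}
    return template.substitute(safe)


record = {"title": "Wages & Prices", "reviewer": "A. Reader", "body": "Short text."}
print(form_fields("book_review"))
print(render("book_review", record))
```

A second rendering layer (plain text, PDF input, and so on) would be another Template keyed by the same field names, which is what keeps the stored record independent of the output format.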

Macromedia Dreamweaver (1)

jonthegm (525546) | more than 8 years ago | (#13282157)

It's a dream come true for eliminating Word formatting in an html file, or for just copy/pasting from one file to the other.

It doesn't seem to transfer colors over, but that may be user-error.

You could always just write some php/perl scripts for scrubbing, too! RegExps are your friends!

PDF is your friend (1)

jeffs72 (711141) | more than 8 years ago | (#13282158)

I've found it easiest to just PDF bothersome word files myself. Call me lazy, but it works.

HTML Export (2, Informative)

electroniceric (468976) | more than 8 years ago | (#13282160)

If you're using Office 2000, you can find the HTML filter here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN [microsoft.com]

I believe this functionality is built into later versions of Word.

Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.

Re:HTML Export (4, Informative)

Marxist Hacker 42 (638312) | more than 8 years ago | (#13282244)

Whew- I hoped I didn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool- and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word Documents over the web- and never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version controlled updates and changing the font every bloody character- and builds a real web page.

Re:HTML Export (1)

MemeRot (80975) | more than 8 years ago | (#13282249)

I second this suggestion. It does a great job of producing comprehensible html out of the normal Word mess. Now why this isn't the default option I'm at a loss to explain.

Dreamweaver and the like may do a better job, but if you don't want to buy a tool like that for the occasional content paste, then this technique eliminates at least 80% of the work.

T-I-D-Y works for me!!! (1)

jaltoids (9737) | more than 8 years ago | (#13282161)

And it works pretty well... TIDY!!! [w3.org]

Frankly I didn't think much of this tool till I had to convert a LOT of pages where there was going to be a ton of cleanup by hand. In some cases it was easier to go back and get Word to spit out ugly html and then let tidy fix that (if you can believe it). Best of all it is FREE and easy to use!!!

Dreamweaver (4, Informative)

SlashChick (544252) | more than 8 years ago | (#13282162)

Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.

Good luck!

Have you tried Microsoft Frontpage? (1)

Cerdic (904049) | more than 8 years ago | (#13282163)

If not, give it a try. In the past, anything I've taken from .doc to html (in particular, resumes), seemed to convert nicely if I did a cut-and-paste from Word straight into an html project in Frontpage.

Considering that they have a common core and are both part of the Office suite, they seem like they should be the most directly compatible with each other.

Textism's Word HTML Cleaner rocks (1)

sco (116197) | more than 8 years ago | (#13282164)

http://textism.com/wordcleaner/ [textism.com]

"A tool that strips proprietary Microsoft tags and other cruft from Word HTML documents, leaving basic formatting intact. File sizes are greatly reduced, and the returned HTML is easier to read, revise and employ."

5 for a 24-hour pass, 20 for a 1-year individual subscription.

No, I don't work there.

Re:Textism's Word HTML Cleaner rocks (1)

sco (116197) | more than 8 years ago | (#13282189)

Ahem. Those prices should read EUR 5 and EUR 20.

HTML Tidy (5, Informative)

N8F8 (4562) | more than 8 years ago | (#13282166)

Save the Word document as filtered HTML and pipe the HTML through HTML Tidy [sourceforge.net] . Nice clean HTML.

Tidy Flags (5, Informative)

N8F8 (4562) | more than 8 years ago | (#13282206)

Almost forgot. The Tidy Docs [sourceforge.net] will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
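
For batch use, those options could be wrapped in a small script. This is a sketch only: it assumes a tidy binary is on the PATH and that the installed version accepts these configuration options in the "--option yes" form, so check the spellings against your own copy. The helper names tidy_command and run_tidy are made up for the example.

```python
import subprocess


def tidy_command(path):
    """Build the argument list for one file, using the options named above."""
    return [
        "tidy",
        "--bare", "yes",          # strip Microsoft-specific markup
        "--word-2000", "yes",     # extra cleanup for Word 2000 "Save as HTML"
        "--output-xhtml", "yes",  # emit XHTML
        "--indent", "yes",        # indent the output
        path,
    ]


def run_tidy(path):
    """Run tidy on a file and return the cleaned markup from stdout."""
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    # tidy exits with 1 when it only printed warnings and 2 when it hit errors
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


print(" ".join(tidy_command("exported.html")))
```

Only the command line is printed here; calling run_tidy("exported.html") would actually invoke the tool on a file of that (hypothetical) name.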

Tidy (0, Redundant)

s4f (523726) | more than 8 years ago | (#13282168)

http://tidy.sourceforge.net/ [sourceforge.net] As I recall, HTML Tidy allows you to remove all of Word's "enhancements".

Tidy (1)

valis (947) | more than 8 years ago | (#13282169)

http://tidy.sourceforge.net/ [sourceforge.net]

Check out the -bare and -clean options to remove Microsoft cruft.

wvWare (1)

dougmc (70836) | more than 8 years ago | (#13282171)

wvWare [wvware.sou] is what you want.

It can programmatically convert most Word documents into html documents, and does about as good a job as one could expect. And it makes better html than Word does itself.

Re:wvWare (1)

javester (260116) | more than 8 years ago | (#13282277)

We used wvWare in an open-source CMS for a Fortune 10 to let PR English Majors upload Word files.

The site takes about 15 million hits/month.

It works but there are a lot of caveats. Also, wvWare is no longer an active project so I wouldn't recommend it.

We ended up giving the PR guys a Word macro to sanitize Word a bit before giving it to wvWare for processing.

Also, the HTML it produces is very verbose and very heavy. It removes the MSisms but doesn't necessarily take out the redundant formatting info.

your website (0)

Anonymous Coward | more than 8 years ago | (#13282173)

no link for you, Slashdot hordes! (5, Informative)

SeanTobin (138474) | more than 8 years ago | (#13282174)

Hmmm... sounds like a challenge to me. Let's see what we can dig up.

Step 1: Let's look at his user page [slashdot.org]

Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/ [homedns.org]

Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..

Step 2: Let's look at his author [wfu.edu] page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome :)

A-ha! There is a link to his employer! It's Economic History Services [eh.net] . And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.

Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.

Re:no link for you, Slashdot hordes! (5, Funny)

Dunbal (464142) | more than 8 years ago | (#13282237)

Which only goes to show:

      There is NO WAY the slashdot effect can be avoided. Resistance is futile...

Word has this feature... (1)

fervent_raptus (664099) | more than 8 years ago | (#13282176)

Open in Word, Select All, then hit Control + Space.

Re:Word has this feature... (1)

Rude Turnip (49495) | more than 8 years ago | (#13282253)

I just tried that out on a 12 page letter in Word...what does it do exactly? I saw the bold formatting disappear from the cover page, but couldn't see any other changes.

Print to PDF... (0)

Anonymous Coward | more than 8 years ago | (#13282181)

I would suggest installing a PDF printer driver, printing to it to generate a PDF and then going from there, such as using any number of PDF to HTML applications, available from google.com

Outsource (0)

Anonymous Coward | more than 8 years ago | (#13282182)

Outsource the boring and tedious work to India.

HTML Tidy (1)

LittleVito (625033) | more than 8 years ago | (#13282190)

I once had to convert a large number of pages generated by Word into something that was at least close to validating and I used Tidy HTML [sourceforge.net] . It took a little bit of poking around with all the arguments to get it to do what I wanted, but once I had I just ran it on all the Word exports and it popped out clean code. It even had a special flag (though I don't remember it off the top of my head) to specifically deal with Word exports.

HTML Tidy (2, Informative)

John_Booty (149925) | more than 8 years ago | (#13282191)

HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.

HTML Tidy:
http://tidy.sourceforge.net/ [sourceforge.net]
HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
http://www.chami.com/html-kit/ [chami.com]

Countless other editors integrate with HTML Tidy as well. Have fun and good luck!

Donkey Punch Them (0)

Anonymous Coward | more than 8 years ago | (#13282192)

That's the only way go fix the problem.

Three applications in Windoze to suggest... (0)

Anonymous Coward | more than 8 years ago | (#13282194)

For batch conversions, there is nothing better that I know of than TextPipe. I also like askSam [the import feature lets you grab content from many different filetypes]. If there are not many files to do on a given day, or you just want a low resource-intensive approach, try PureText (I use 2.0). It is extremely easy to use. Good luck!

You need an intern (1)

supercolony (840210) | more than 8 years ago | (#13282200)

Assuming that it is not in your power to change the material coming to you, then you must change how you process it.

Quite frankly, the most cost-effective way to deal with this problem is to hire an intern, temp or clerk. Train this person to format very plain HTML, to your liking (or XML, or XHTML or whatever you prefer). Then use your application to apply the style you like to the HTML the temp made.

If you want to involve more programming, you could whip up a parser to validate the intern's work. But the reality of the situation here is that unless you are working on a truly overwhelming volume of documents, it will be much cheaper to use human labor than to invest the programming time to automate the process.

-jr

Re:You need an intern (1)

0xABADC0DA (867955) | more than 8 years ago | (#13282260)

So basically:

1. Hire intern to transcribe work .doc into html
2. Outsource html transcription job to India
3. ...
4. Profit!!

Macro in VBA (1)

Yonatanz (798506) | more than 8 years ago | (#13282201)

If the articles are mostly text, then you can write yourself a simple VBA macro in Word, that iterates over the object model of the Word document, and creates the simplified HTML code.

For example, you can turn every underlined, 18-pt text into <H1> headers, etc.

This way you can keep the consistency quite easily, while still staying flexible.

You can even create HTML that is compatible with the IDs and CLASSes of your site's existing CSS.

This, however, requires that you know VB, and spend some time getting to know the Word object model, which is not too difficult.
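
The mapping logic itself is simple. Here is a sketch of it in Python rather than VBA, working on a made-up Paragraph record instead of the real Word object model. Only the "underlined, 18-pt becomes H1" rule comes from the post above; the H2 rule and the thresholds are assumptions added to show how more rules would slot in.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Paragraph:
    """Hypothetical stand-in for what a macro would read from each Word paragraph."""
    text: str
    size_pt: float = 12.0
    underline: bool = False
    bold: bool = False


def tag_for(p):
    """Pick an HTML tag from the paragraph's visual formatting."""
    if p.underline and p.size_pt >= 18:
        return "h1"   # rule from the post: underlined 18-pt text is a main heading
    if p.bold and p.size_pt >= 14:
        return "h2"   # assumed extra rule, for illustration only
    return "p"


def to_html(paragraphs):
    """Emit one plain tag per non-empty paragraph, with the text escaped."""
    lines = []
    for p in paragraphs:
        text = p.text.strip()
        if not text:
            continue  # Word documents are full of empty spacer paragraphs
        tag = tag_for(p)
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


doc = [Paragraph("Title", 18, underline=True), Paragraph(" "), Paragraph("Body & more")]
print(to_html(doc))
```

Since all formatting decisions live in tag_for, adding class names that match the site's existing CSS only means changing what that one function returns.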

Get it in PDF first. (2, Interesting)

frostman (302143) | more than 8 years ago | (#13282210)

I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.

What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.

Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.

Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.

Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.

Convert to PDF (1, Informative)

Anonymous Coward | more than 8 years ago | (#13282212)

Adobe Acrobat PDF conversion preserves look

Many free or cheap printing filters / converters available

Homesite (1)

Hank Chinaski (257573) | more than 8 years ago | (#13282213)

Homesite has a function to import and clean Word Documents.

HTML parsing (1)

ChiralSoftware (743411) | more than 8 years ago | (#13282214)

If you really want to do it right, use an HTML parser to extract the content, and then re-render it. That's exactly what our mobile search [mwtj.com] engine does to convert web pages to mobile pages. It's non-trivial stuff. The advantage of doing it is that you do end up with clean, uniform HTML (or WML or XHTML in our case).

Some future version of Tomcat [apache.org] should have built-in content parsing in its filters so that filter writers could write simple filters to transform content in a meaningful way. But I haven't seen that as a proposal anywhere.

similar problem in Quark (0)

Anonymous Coward | more than 8 years ago | (#13282216)

I have a similar problem in QuarkXpress. My current solution is to export the doc as HTML and then search and replace in BBEdit in order to clean things up. A regex would do the job except for the fact that Quark generates arbitrarily named stylesheets that require manual changes. I am considering writing a script that would parse Xpress-tagged output and convert it to HTML.

I'd suggest something similar for Word...export as rtf (?) and parse it into valid HTML. However, Word's HTML is *much* worse than Quark's.

wvWare (1)

funkmeister (783995) | more than 8 years ago | (#13282218)

Try wvWare (http://wvware.sourceforge.net/ [sourceforge.net] ). It works amazingly well for Word, Excel and PowerPoint. I have used it in Zope applications and have had very good results.

From the site:

This is the home of the wv library. The original name of the project, mswordview, was uncomfortably close to Microsoft's own product named wordview, so the library was renamed.

wv is a library which allows access to Microsoft Word files. It can load and parse Word 2000, 97, 95 and 6 file formats. (These are the file formats known internally as Word 9, 8, 7 and 6.) There is some support for reading earlier formats as well: Word 2 docs are converted to plaintext.

wv compiles and works under most operating systems. Although most development is carried out with Linux, wv should work on BSD, Solaris, OS/2, AIX, OSF1, and even (with varying levels of success) AmigaOS VMS. The GnuWin32 project maintains a port for Windows, and it is required to compile and work on all of AbiWord's supported platforms, of which there are a lot.

wv allows other programs access to Word documents for the purpose of converting them to other formats. It is currently being used by AbiWord as its Word importer, and concepts and bits of code are being used by the KDE folks over at KWord in their word importer.

Try HTML Transit by Stellent (1)

RaSchi.de (599502) | more than 8 years ago | (#13282220)

I've had a similar task once and we used HTML Transit, a software by Stellent (http://www.stellent.com/ [stellent.com] ) and distributed by Avantstar (http://www.avantstar.com/ [avantstar.com] ). You can define templates for all kinds of word styles and fine-tweak the HTML output quite neatly. And, another advantage, I had excellent support when some questions arose.

Beautiful Soup (1, Informative)

Anonymous Coward | more than 8 years ago | (#13282221)

If you like Python, there is an app out there called Beautiful Soup [crummy.com] which can suck in ugly, malformed markup and give you a parse tree you can play with before dumping it back out to html.

P.S. There is a Ruby Port [crummy.com] as well.

use kword (1)

0xABADC0DA (867955) | more than 8 years ago | (#13282224)

I had to do this recently, but to print a 20-page document as 12 pages by removing page breaks. I found that kword using the html export filter and setting it to "HTML 4.01 + Light (strict xhtml)" was the best mode. This doesn't use the style sheets and just converts to basic html... no fancy positioning or fonts, just some headers and basic styles. This was using kword 1.4.1 / kde 3.4.2 btw.

Everything else I tried sucked, including OO.o's export.

Resign from your executive position (2, Interesting)

Fastball (91927) | more than 8 years ago | (#13282227)

What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"

Re:Resign from your executive position (4, Insightful)

dougmc (70836) | more than 8 years ago | (#13282281)

Everybody else is on board with plain text
I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.

I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.

I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...

Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)

More specifically: Word into MS CMS (1)

snowwrestler (896305) | more than 8 years ago | (#13282229)

We run Microsoft CMS for my company's Web site, which annoyingly accepts pastes direct from Word, complete with all the extraneous code. (As opposed to a normal text box, which strips formatting when accepting pasted text.)

Since we style the text with CSS, we have to train everyone who works on the site to first paste anything from Word into Notepad to strip out Word code crap, then paste that into the CMS browser client, then re-apply formatting with the tools in the client toolbar. What a pain! I'd love to know if anyone has figured out a way to allow people to paste Word content directly into MS CMS without having to go through all those extra steps.

Use AbiWord on the command line (1)

dominator (61418) | more than 8 years ago | (#13282238)

I'd recommend using the CVS version of AbiWord. It'll preserve almost all of your visual and semantic meaning using XHTML and CSS. This includes fairly complex things like endnotes, footnotes, tables, floating text boxes, etc.

AbiWord --to=file.html file.doc

http://www.abisource.com/ [abisource.com]

is HTML really necessary (1)

lakeland (218447) | more than 8 years ago | (#13282242)

Most web users now seem to tolerate PDF files, and exporting from word to PDF is much more reliable than exporting from word to HTML.

As solutions go, it is pretty crude. However, it works quickly and easily, and produces nice looking output.

Re:is HTML really necessary (1)

ryanov (193048) | more than 8 years ago | (#13282262)

PDF is often irritating to use for certain applications, and for me is a turn off (load time for Adobe). When putting up my resume I had the same issue, as I wished to provide it in HTML as a third option (HTML, Word, PDF).

Dreamweaver (1)

brickballs (839527) | more than 8 years ago | (#13282246)

Doesn't dreamweaver have an 'unfuckup' button that fixes word-html?

Sure. (0)

Anonymous Coward | more than 8 years ago | (#13282247)

Avoid all HTML export tools.

Edit -> Copy
Switch to gvim
Edit -> Paste

Seriously. People need to stop using Word (or FrontPage, for that matter) to design pages.

AppleScript! (1)

jimijon (608416) | more than 8 years ago | (#13282255)

Sounds like a perfect job for AppleScript. You can create a scriptable folder, drop your documents in it and let it copy and paste all the paragraphs and add some html tags, etc. Very flexible.

Use a text editor (1)

chia_monkey (593501) | more than 8 years ago | (#13282258)

That's what I did. Copy all the text in Word, paste it in a text editor (which kills all the formatting assuming you're not using RTF), copy that and paste it in your HTML editor (usually the same editor to code your HTML) or you can paste into Dreamweaver or similar and go that route. Quick and easy.
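
If you go the plain-text route, putting the basic paragraph markup back is easy to script. A minimal sketch, assuming paragraphs in the pasted text are separated by blank lines (the function name text_to_html is invented for this example):

```python
import re
from html import escape


def text_to_html(text):
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse hard line breaks and runs of whitespace into single spaces.
        body = " ".join(block.split())
        if body:
            paragraphs.append("<p>%s</p>" % escape(body, quote=False))
    return "\n".join(paragraphs)
```

Headings, lists and emphasis still have to be re-applied by hand, but every paragraph comes out with identical, stylesheet-friendly markup.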

avoiding the hand-edit (1)

SethJohnson (112166) | more than 8 years ago | (#13282259)



I'm seeing a lot of 'use Dreamweaver' responses that are well-meaning and probably will solve this guy's dilemma. But what about those of us running CMS systems with text area inputs in forms? Our content people copy-and-paste directly from Word and these crazy MsWord entities get crudely transposed into ASCII question marks.

Anyone got a good regsub routine for correctly substituting these entities for their approximate ASCII equivalents? I'm just looking for pattern matching here... Don't need a bunch of code.

Appreciatively,

Seth
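
One possible shape for such a routine, sketched in Python rather than as a bare regsub. The mapping below covers only the usual suspects (curly quotes, dashes, ellipsis, bullet, non-breaking space), and the guess that the form data arrives as Windows-1252 bytes is an assumption that may not hold for every CMS. The names WORD_CHARS, asciify and asciify_form_bytes are invented for this example.

```python
import re

# Common Word "smart" punctuation and rough ASCII stand-ins.
WORD_CHARS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
    "\u00a0": " ",    # non-breaking space
}
# One character class that matches any key in the table above.
WORD_CHARS_RE = re.compile("[%s]" % "".join(WORD_CHARS))


def asciify(text):
    """Replace Word's typographic punctuation with plain ASCII approximations."""
    return WORD_CHARS_RE.sub(lambda m: WORD_CHARS[m.group(0)], text)


def asciify_form_bytes(raw):
    """Decode form input assumed to be Windows-1252 bytes, then asciify it."""
    return asciify(raw.decode("cp1252", errors="replace"))
```

The question marks usually mean the bytes were decoded with the wrong charset somewhere along the way, so the substitution has to run on correctly decoded text; once a character has already been turned into a literal "?", no pattern can bring it back.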

Webworks Pro (1)

99BottlesOfBeerInMyF (813746) | more than 8 years ago | (#13282261)

You may want to look at the WebWorks Pro application for sanely exporting Word files as HTML/XML. I've used it in the past (a handful of years ago) and it was pretty reasonable. It is worth investigating in any case.

A solution (1)

Dr_Ish (639005) | more than 8 years ago | (#13282271)

I have run into a similar set of problems, as I run an on-line philosophy journal (see http://ejap.louisiana.edu [louisiana.edu] ). The solution I found was to convert the documents down into RTF format as an intermediate step. There are a number of shareware RTF-to-HTML converters available. Unfortunately, I cannot find the name of the program I usually use at the moment, or a link for it, but googling for "RTF to HTML" shareware produces quite a few likely candidates. This system works just fine for me. What I like best about the program I have is that it puts the HTML codes in in French! If you look at the source code for the most recent edition of my journal, you can see the system in action.

fckeditor (2, Informative)

mixmasterjake (745969) | more than 8 years ago | (#13282274)

fckeditor [fckeditor.com] is an in-browser WYSIWYG editor. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.

The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.

HTML Tidy program (4, Informative)

Todd Knarr (15451) | more than 8 years ago | (#13282278)

One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/ [w3.org] . It seems to clean up code (particularly from Word) quite a bit.
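
If you want to run Tidy over a whole folder from a script, something along these lines might do. It is only a sketch: it assumes a tidy executable is on your PATH and that your build supports the word-2000 and bare configuration options (check tidy -help-config to be sure). The function names tidy_command and run_tidy are made up here.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build an HTML Tidy command line aimed at Word-exported HTML."""
    return [
        "tidy",
        "-q",                  # quiet: suppress nonessential output
        "--word-2000", "yes",  # ask Tidy to strip Word 2000's surplus markup
        "--bare", "yes",       # strip Microsoft-specific HTML, plain spaces for nbsp
        "-o", dst,             # write the cleaned document here
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy and return its exit status (0 clean, 1 warnings, 2 errors)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    return subprocess.run(tidy_command(src, dst)).returncode
```

Word output nearly always triggers warnings, so an exit status of 1 is normal and still produces an output file.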

PDF - Ghostscript (1)

Embedded Geek (532893) | more than 8 years ago | (#13282279)

Some have suggested using PDFs. To do this, I use Ghostscript and GhostWord. Here [oreilly.com] is a good description from O'Reilly's Word Hacks on how to install it in Word.

DEMORONISER (0)

Anonymous Coward | more than 8 years ago | (#13282280)

(Perl script) demoroniser - correct moronic and gratuitously incompatible HTML generated by Microsoft applications

http://www.fourmilab.ch/webtools/demoroniser/ [fourmilab.ch]

My past employer can (0)

Anonymous Coward | more than 8 years ago | (#13282288)

607-272-4817, ask for Jim. Cyrus Company is a web development firm in upstate NY. I worked there for the last 3 years - I'm in the UK now - and we had a client who needed just this type of thing. Jim set it up (he can program in all kinds of languages I don't understand) and you can copy and paste from Word to an HTML form, keeping the format. There might be a browser requirement, but that's about it. I was amazed when I first saw it myself. If you have any questions, email me at paper@paperskies.com - sorry for the ad-sounding post, but it's the truth and I can't really think of any other way to put it! Regardless, good luck with the search.