CopyAsMarkup

By Ron Kok (contact: ronkok {at} HilbertLLC.com)

Introduction

CopyAsMarkup™ is a conversion utility that gives writers the ability to compose their work in Microsoft Word and contribute the result to a web page or an eBook.


[Image: FeatureSketch]

CopyAsMarkup is an add-in that works within Microsoft Word. To use CopyAsMarkup (CaM), one selects text in a Word document and invokes one of these commands:

[Image: Ribbon]

CopyAsMarkup will then convert many features of the selection into the target markup language and place the result onto the Windows clipboard, ready to paste into an HTML or Markdown file.

Yes, Microsoft Word already has its own Paste Special…as HTML command. But that HTML is cluttered with a lot of formatting information. Microsoft has a good reason for all that clutter. Their HTML is often destined to be pasted into an email. And since there is no standard CSS for email, Microsoft loads up their HTML with a tag soup of ad hoc formatting information. Inelegant, but it works. The output from CopyAsMarkup is sparse and written for an HTML page that has its own CSS. Humans can read the HTML from CopyAsMarkup.

Optional: One can first set some CopyAsMarkup settings, reached by clicking the Settings button on the Add-Ins | Copy-as-… section of the Word menu ribbon.

Feature Summary

CopyAsMarkup will translate the following Word features into HTML or Markdown.

  • Headings
  • Tables
  • Lists
  • Images
  • Text boxes
  • Math zones ↦ TeX
  • Block quotes
  • Code blocks
  • Footnotes
  • Hyperlinks
  • Bookmarks ↦ id="…"    
  • Manual line breaks
  • Horizontal rules
  • Paragraph styles
  • Table styles
  • List styles
  • Drop caps
  • Character styles
  • subscripts
  • superscripts
  • bold ↦ <strong>
  • italic ↦ <em>
  • strikethrough
  • all caps
  • small caps
  • underline

It’s also worthwhile to know what CopyAsMarkup does not do.

CopyAsMarkup is not a page layout tool. It’s a writing tool. A functional HTML page will need <head>, <body>, and <div> structures, and CSS. You’ll have to get them elsewhere.

If you want 100% round-trip copy-and-paste compatibility with Microsoft Word, then CopyAsMarkup isn’t for you. CopyAsMarkup does not support all Word features.

CopyAsMarkup does not generate CSS.

It will not create form tags or buttons.

It does not work on Apple computers or on Windows RT tablets.

In short, CopyAsMarkup is a tool for people who write content for an HTML page. It is not a full-stack tool for building the infrastructure of a web page.

Example

This section illustrates some capabilities of CopyAsMarkup. I wrote this section in a Word document, which you can view here. Then CaM translated it. You can also view the HTML and the Markdown that CaM produces from this example.

Copy-as-Markup translates many character formats, such as bold, italic, bold-italic, subscripts, superscripts, ALL CAPS, small caps, strike-through, and underline. You can create other character formats, such as drop-cap, with Word styles and corresponding CSS.

CopyAsMarkup does lists:

  1. It does numbered lists, too.
  2. Item 2

In the table below, notice that CaM has aligned the numbers on their decimals.

Caption: Numeric Table
Heading 1    Heading 2    Heading 3
Item 1           10.01    40,000.01
Item 2            5.20       200.00
Item 3          100.20     4,500.00

Caption: Merged Cells
[Table with merged cells. Cell text: "CaM does tables with horizontally merged cells…", "…and vertically merged cells.", "…and cells merged in both directions."]

CopyAsMarkup does paragraph styles. The one below I call Fenced.

if x % 2 == 0:
    print('x is even')
else:
    print('x is odd')

CaM translates the Word style Quote into an HTML <blockquote>, like this:

Now is the time for all good men to come to the aid of their party.

CaM does math notation:

\[\left(\frac{x_n +5}2 +\sqrt{b^2}\right)\times\begin{pmatrix}1 & 0 \\0 & 1\end{pmatrix}\times\hat{\theta}\times\int_{x=0}^\infty{e^x\,\text d x}\]

[Figure: sunset. Caption: Sunset]

It can recognize a Word text box, and will translate it into an HTML <div>. Or, if a text box consists entirely of an image and a caption, CaM will write it as a <figure>, complete with a <figcaption>.

This example demonstrates the way in which CaM is dependent upon CSS by others. A CSS file defines the styles used for items like caption and inset-right.

Let’s wrap up this example with a footnote[1]. CaM will place the text of the footnote at the bottom of the selection.


  1. Here is the text of the footnote.

Installation

First, keep in mind that CopyAsMarkup (CaM) works inside Microsoft Word. If you don’t have Word, then don’t bother getting CaM. Also, CaM is a Windows Add-In. It does not work on any Apple computers.

CopyAsMarkup is released under the MIT license. So you can use CaM for free.

Download

To install CopyAsMarkup, download the CopyAsMarkup.dotm file and place it into the folder C:\Users\YourUserIdHere\AppData\Roaming\Microsoft\Word\Startup.

You will need administrator privileges in order to place a file into that folder. If more than one person logs into a computer and uses CaM, each will need their own installation.

CopyAsMarkup.dotm was most recently revised on 17 January 2017.

Tips

The following sections contain some tips for authors who write in Word and aim to contribute to web pages or eBooks. It’s all written in more detail below, but the short version is:

  1. Don’t bother with tabs.
  2. Use Word styles instead of inline formats.
  3. Use Word math zones to write math.
  4. Understand that HTML uses image links, not embedded images.

Tabs

HTML has no tab stops (except within <pre> tags). CopyAsMarkup will translate any tabs it encounters into spaces. So there is no point in using tab characters in your Word document. I use tables instead.

Styles

CopyAsMarkup maps Word styles onto HTML classes. It maps paragraph styles onto block classes and it maps character styles onto span classes. That is…

If you assign a paragraph style to a paragraph of your document, then CopyAsMarkup will write the paragraph tags like this:

    <p class="style-name">…</p>

If two or more consecutive paragraphs have the same paragraph style, CaM will write:

    <div class="style-name">
        <p>…</p>
        <p>…</p>
    </div>

If you assign a character style to some text, CaM will write:

    <span class="style-name">…</span>

In order to take advantage of this, you may need to change some habits. For instance, it will do no good to click on the paragraph justify button that exists on Word’s Home ribbon. CopyAsMarkup ignores paragraph justification. Instead, assign a paragraph style, and make sure the web page’s CSS has a class with the same name as the Word style. For example:

I wrote this paragraph in Word with the style name Indented. CaM wrote it as a paragraph with the attribute class="indented". The CSS for this page contains a line that says:
.indented { margin-left: 2.5em; }
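The mapping rules described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not CaM's actual code. The function name paragraphs_to_html, the (class name, text) input format, and the treatment of unstyled paragraphs as a plain <p> are all assumptions.

```python
from html import escape
from itertools import groupby


def paragraphs_to_html(paragraphs):
    """Turn (class_name, text) pairs into HTML, following the rules above.

    class_name is None for a paragraph with no special style (an assumption;
    the source does not say how CaM represents unstyled paragraphs).
    """
    lines = []
    # groupby collects runs of consecutive paragraphs that share a style.
    for class_name, run in groupby(paragraphs, key=lambda pair: pair[0]):
        texts = [escape(text) for _, text in run]
        if class_name is None:
            lines.extend(f"<p>{t}</p>" for t in texts)
        elif len(texts) == 1:
            # A single styled paragraph carries the class itself.
            lines.append(f'<p class="{class_name}">{texts[0]}</p>')
        else:
            # Two or more consecutive paragraphs share one wrapping <div>.
            lines.append(f'<div class="{class_name}">')
            lines.extend(f"    <p>{t}</p>" for t in texts)
            lines.append("</div>")
    return "\n".join(lines)


print(paragraphs_to_html([("indented", "A"), ("fenced", "B"), ("fenced", "C")]))
```

The grouping step is the only part with any subtlety: a style that appears in two separate places in the document produces two separate wrappers, because only consecutive paragraphs are merged.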

Formatting a big Word document will work more smoothly if one uses Word styles instead of individual paragraph formatting. In HTML, it is even more crucial to use classes. Without classes, a document ends up as a tag soup that is hard to edit. So it is important, when writing a Word document that is destined to become HTML, to use certain techniques and habits.

  1. Use a template Word document with style names that correspond to your CSS class names.
  2. Use styles, not paragraph formats, for headings, quote blocks, and other distinctive formats.
  3. Use styles, not character formats, for character fonts, sizes, and colors.
    Exception: CaM does recognize character formats for: subscripts, superscripts, bold, italic, strike-through, and underline. Those buttons you can use.
  4. Did you know that Word has table styles? List styles? They exist, and CaM will use them to assign classes to tables and lists. I personally find Word’s list styles to be difficult to use. So I just type a class name into the text editor after I’ve pasted in the HTML.
  5. CopyAsMarkup has a hard-coded class name for certain Word features. Your CSS will need the following class names if you intend to use these features:
     Small caps           class="small-caps"
     All caps             class="all-caps"
     Drop cap             <p class="drop-cap-paragraph"><span class="drop-cap">T</span>…</p>
     Text box, wrapped    class="inset-left", or class="inset-right"
  6. CopyAsMarkup has a hard-coded Word style name for certain HTML tags. Your Word document will need the following styles if you want CaM to apply the corresponding tags:
     Code           <Code> (if applied to a word), or <Pre><Code> (if applied to a paragraph)
     Aside          <aside>
     PullQuote      <aside class="pull-quote">
     Term           <dt> (inside a <dl>)
     Description    <dd> (inside a <dl>)
     Toggle         See footnote 2 here.
  7. After you’ve run CaM, you’ll paste the result into an HTML file. You may be surprised at how often the selection contains format and style artifacts that were invisible in Word. The problem is especially bad if you have copied something from a web page and pasted it into Word.

    The way to avoid this is to hit Word’s Clear Formatting button after you have pasted some text into Word. It is not enough to hit the Normal button on the Style bar. That only changes the paragraph style. To erase all the character styles, you have to hit Clear Formatting. That button can be found when you click the down arrow on the Style bar. Another copy of that button is on the Copy-as-… bar.


CaM will convert Word style names from camelCase to hyphen-delimited-lower-case HTML class names.
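As a rough illustration of that conversion, here is a minimal Python sketch. It is not CaM's own code, and the function name style_to_class is hypothetical. It covers only the camelCase case described above; how CaM treats style names that contain spaces or punctuation is not stated here, so the sketch leaves those alone.

```python
import re


def style_to_class(style_name):
    """Convert a camelCase Word style name to a hyphen-delimited lower-case class name."""
    # Put a hyphen between a lower-case letter (or digit) and the capital after it.
    hyphenated = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", style_name)
    return hyphenated.lower()


for name in ("PullQuote", "Indented", "dropCapParagraph"):
    print(name, "->", style_to_class(name))
```

With this rule the style PullQuote becomes pull-quote and Indented becomes indented, which matches the class names shown elsewhere on this page.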

Images

Your Word document may contain embedded images. In HTML files, this is not commonly done. Instead, an HTML file almost always contains an <img src="…"> tag that links to an external image file.

If CopyAsMarkup encounters an embedded image, it will export the image file and write an <img> tag that links your HTML to the image. You can choose to avoid embedded images altogether, and instead insert image links into your Word documents. Word’s standard Insert | Picture command will insert the full absolute path name of that link. If you prefer a relative path, then use command Add-Ins | Copy-as… | Images | Insert relative link to image…. You can select the image’s destination folder at run time, or you can pre-select the destination folder in CopyAsMarkup settings (the settings button).

This brings up the choice of relative vs. absolute file paths. Each has its own advantage. Relative paths are good if you anticipate that an HTML file and its linked images will someday move together to a different location. Absolute paths are better if the image files will stay in one place while the HTML gets copied about.

CaM will write a relative path into the <img> tag if your Word document and the linked image file are on the same drive. So your local image folder location, relative to your Word document, should mirror the directory structure of your web site. I’ve experimented a bit with a setting that defines an absolute alias for a local drive link, but my experiments were all awkward to use. If you want an absolute path, you can do a global search-and-replace in your HTML file that searches for the string <img src="../images/, and replaces it with the absolute path.
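The same-drive rule can be sketched in Python with the standard ntpath module, which applies Windows path rules on any operating system. This is an illustration under assumptions, not CaM's code: the function name img_src and the example paths are hypothetical, and the fallback for files on different drives (returning the absolute path unchanged apart from the slashes) is a guess, since the text above does not say what CaM writes in that case.

```python
import ntpath  # Windows path semantics, usable on any platform


def img_src(doc_path, image_path):
    """Return a value for <img src="…"> given a Word document and a linked image."""
    doc_drive = ntpath.splitdrive(doc_path)[0].lower()
    img_drive = ntpath.splitdrive(image_path)[0].lower()
    if doc_drive == img_drive:
        # Same drive: express the image location relative to the document's folder.
        relative = ntpath.relpath(image_path, start=ntpath.dirname(doc_path))
        return relative.replace("\\", "/")
    # Different drives: no relative path exists, so keep the absolute one (assumption).
    return image_path.replace("\\", "/")


print(img_src(r"C:\site\docs\page.docx", r"C:\site\images\sunset.png"))
```

A document in C:\site\docs that links to an image in C:\site\images yields ../images/sunset.png, which is why the local folder layout should mirror the layout of the web site.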

A note about vector graphics: They are clear, scalable, and compact. In HTML, the standard vector image format is SVG. So it’s unfortunate that Microsoft Word does not render SVG. If you paste an SVG image into Word, Word will convert the image into a format that it can render, probably PNG or EMF. If you want the original SVG image in your HTML page, you’ll need to be careful about which file you link to.

Zip Files

Let’s say that you have avoided embedding any images into your Word file, and instead your Word file contains links to external image files. How then does one email the package of files?

Simple. Put the files into a folder in Windows Explorer. Then right-click on the folder and choose the option to: Send to | compressed (zipped) folder. The result is a zip file, which is easy to attach to an email.

Math in Microsoft Word

Microsoft Word now has an excellent equation editor to help you generate math notation. When you invoke the command Insert | Symbols | Equation or when you type the keyboard combination Alt+=, Word will create a math zone in your document.

The power of the math zone is best appreciated when you see it in action, as in the videos located here and here. Did you just watch a video? Good, now practice with an equation of your own. First, type Alt+=. Then, inside the math zone, type:

a_b + 20^3 + y\bar + ((1+5)/2) + \theta + \sqrt 15

As you type, the math objects will build up and the result will be:

\(a_b +{20}^3+\bar y +\left(\frac{1+5}2\right) +\theta +\sqrt{15}\)

To my mind, that is a pretty easy way to generate math notation.

More math tips:

When you are in a math zone, use Word’s Equation Tools ribbon for more sophisticated math notations, such as \(\small \int_0^\infty\) or \(\tiny \begin{pmatrix}1 & 0 \\0 & 1\end{pmatrix}\).

If the Equation Tools ribbon doesn’t have the tool you seek, try a right-click on the math zone. Word will show a pop up menu with more editing commands.

Since an underline _ indicates the beginning of a subscript, as in a_b ↦ \(a_b\), you need another way to make a literal underline appear in an equation. Use \_.

Spacing: CaM translates Word’s non-breaking space character (Ctrl-Shift-Space) into the TeX spacing function \,

CaM can write your math with a toggled note that gives your reader something easy to copy and paste. There is an explanation here.

If you prefer to write your math on a touch screen, the Microsoft Math Input Panel allows you to do that. You can launch it by typing math input panel into the Start button search box. Then write your math. When you hit the Insert button the panel will try to send the equation directly into the application that currently has focus.

If that application is Microsoft Word, it will work well. Many other applications are not equipped to deal with the MathML format of the data.

Math on the Web

The output from CopyAsMarkup will render properly in most math-oriented websites. I’ve written more background here, but if you are an author, you probably need take no further action.

If you are a web site administrator and want to equip your website for math, I’ve written some information for you as well.

If your web site or eBook has only a small amount of math notation and you don’t want to add a third party library, you could use SVG images instead.

SVG is an image format that can display math in better quality than bit-mapped images. But, like other image formats, it’s tricky to get the image to display at the correct size. If the size is set directly inside the <img> tag, then the size can only be written in units of px. We want the size to be defined in em’s, so that the equation will resize when the reader changes the font size.

The trick is to set the SVG image width to 100%, then nest the <img> inside a <span> or <div> and set the container’s width in em’s. CopyAsMarkup will do that for you if you take the following steps.

  1. Write your math in a Word math zone.
  2. Use Copy-as-TeX to get a TeX version of the equation.
  3. Use LaTeX Previewer to get an SVG image of the TeX. Load the SVG into your web site's directory.
  4. Use CaM’s Wrap a math SVG command. It will place the proper markup onto the Windows clipboard.
  5. In your HTML file, paste the image markup in place of the TeX.

To illustrate, the equation below is an SVG image.

[Image: Faraday’s Law]

The markup that CaM wrote for that equation looks like this:

<div style="margin: 1em auto 1em auto; width: 12.3em;">
    <img src="images/faraday.svg" alt="Faraday’s Law" style="width:100%;">
</div>
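For readers who want to produce the same wrapper by script, here is a minimal Python sketch that builds markup of the shape shown above. It is not the code behind CaM’s Wrap a math SVG command. The function name wrap_math_svg is hypothetical, and the width in em’s is passed in by the caller because the text above does not say how CaM works that number out.

```python
from html import escape


def wrap_math_svg(src, alt, width_em):
    """Wrap an SVG <img> in a <div> whose width is set in em's."""
    return (
        # The container sets the size in em's, so it scales with the font size.
        f'<div style="margin: 1em auto 1em auto; width: {width_em}em;">\n'
        # The image simply fills whatever width the container has.
        f'    <img src="{escape(src)}" alt="{escape(alt)}" style="width:100%;">\n'
        f"</div>"
    )


print(wrap_math_svg("images/faraday.svg", "Faraday’s Law", 12.3))
```

Keeping the size on the container rather than on the <img> tag is the whole trick: the image itself never carries a px dimension.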

Settings

You can control some settings for the CopyAsMarkup output. They are reached via the Settings button at: Add-Ins | Copy-as-…

Use HTML5 tags. If this preference is selected, CaM will:

  1. Write <br> and <hr> instead of <br /> and <hr />.
  2. Use the <figure> and <figcaption> tags for images with a caption.
  3. Use the <aside> tag for paragraph styles Aside, PullQuote, and Sidenote.
  4. Use the <cite> tag for paragraph styles Cite, Title, Author, and Citation.

Add Id to each heading

Adds an id attribute to each heading tag. The text of the id will be the heading text, converted to hyphen-delimited-lower-case.
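A minimal Python sketch of that conversion follows. It is an approximation, not CaM's code: the function name heading_id is hypothetical, and the handling of punctuation and non-ASCII letters (here, any run of characters other than a to z and 0 to 9 collapses into one hyphen) is an assumption that the text above does not confirm.

```python
import re


def heading_id(heading_text):
    """Convert heading text to a hyphen-delimited lower-case id."""
    # Lower-case first, then replace every run of other characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", heading_text.lower())
    # Trim hyphens left over from leading or trailing punctuation.
    return slug.strip("-")


for heading in ("Feature Summary", "Math in Microsoft Word"):
    print(f'<h2 id="{heading_id(heading)}">{heading}</h2>')
```

An id built this way gives each heading a stable link target, such as #feature-summary.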

Set links to open in new tab

Adds attribute target="_blank" to each link. That rather cryptic attribute tells the browser to open the link in a new tab.

Use eBook tags

  1. When this setting is selected, CaM wraps each heading with a <div id="…">…</div> and writes the id into the <div>, not into the heading tag. This works around a Kindle quirk.
  2. Footnotes are written with an epub:type="footnote" attribute.

Export embedded images to folder:

Controls whether to choose the image’s destination folder at run time or to pre-define a destination folder.

Footnotes

See the notes here.

Toggle Notes

A toggle note is a parenthetical note that a reader can turn on and off with a click. The default marker for this click is the character “”. You can substitute your own character in the settings. The marker I use is [<span class='reduced'>▼</span>] and my CSS defines a class: .reduced { font-size:60%; }

Tufte CSS uses the checkbox hack to implement a toggle note. That technique has a good part (no JavaScript) and a bad part (more complex HTML markup). I personally use a different markup for toggle notes, but if you prefer the Tufte approach, you can choose so in the settings.

Math Toggle

If this preference is chosen, math will include not only TeX for the browser to render, but also another span that becomes visible when the user does a left-click on the math. That note will show the math in TeX and in Microsoft's UnicodeMath format. It gives the user something easy to copy and paste. You can try it on the equation below.

\[x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}\]

LF: x= (−b±√(b^2−4ac))/2a
TeX: \[x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}\]

For this to work, your page must have the CSS and JavaScript for a toggle note.

Feature Details

Go here for a detailed explanation of each CopyAsMarkup feature.