Wordpress to Hugo

1/Oct 2016 By admin

I’ve been planning on converting my blog to some static site generator for a while now, but the (perceived) amount of work involved seemed scary, so I kept coming up with reasons not to do it. I really like the idea of static sites. Using gamedev jargon, this basically means we “precompute” as much as possible. Instead of retrieving posts from the database and building pages on the fly (like WP), we do it all offline and then push static HTML files online. In other words, we (creators) take a small hit once, but the end user experience should be much better (just download a few KBs worth of data, no fancy scripts running, no DB operations).

Wordpress isn’t ideal, but it’s easy to use and it works ™, so didn’t have enough motivation to convert 200+ posts. That’s until some of my WP posts got mysteriously corrupted. Most of the quotation marks (“) and dashes (-) were replaced with some weird characters (outside the 0-127 ASCII range). That was enough to push me over the edge and start looking for alternatives. Have to admit I didn’t really research it very extensively, I’ve heard good things about Jekyll, but right about that time I’ve also noticed people mentioning Hugo a lot. Had a quick look at the documentation, liked what I’ve seen and never looked back.

As mentioned, over the years I’ve published over 200 posts, so converting them all to markdown required a fairly complex process.

Exporting Wordpress posts/comments

There are a few ways you can do it. At first, I tried wordpress-to-hugo-exporter. It’s almost completely automatic (WP plugin) and worked OK, but the output was a little bit noisy (e.g. posts included my WP social bookmark plugins data etc). Fortunately, I’ve found another tool - exitwp. Technically, it’s meant for Jekyll, but in the end we get a list of markdown files, which is good enough. It’s a little bit more cumbersome to use: you have to export your WP content to XML first, then run a Python script, but the output is very clean and nicely organized.

Cleaning up

As I mentioned, I still had some problems with my WP posts I had to address (weird encoding, for example). Decided to write a Python script that’ll deal with all the issues. Perl would probably be more suitable, but I don’t use it often enough and I’m much more comfortable with Python. The script deals with the following problems:

Broken encoding. For unknown reasons, some of the quotation marks and dashes have been replaced with sequences of [0xC3, 0xA2, 0x3F, 0x3F] and [0xC3, 0x82, 0xC2, 0xA0]. I made it a little bit smarter than plain search & replace. AFAICT, there was no way to distinguish between “ and - anymore, so I tested whether the invalid sequence was surrounded by spaces. If so, replace with a dash; otherwise, with a quotation mark.
Fixing internal links. Easy, my WP links were formatted like: http://msinilo.pl/blog/?p=1400, I had to convert it to http://msinilo.pl/blog2/yyy-mm-dd-title (blog2 isn’t important, the p=XXX to slug is). I actually had to change my slug format later, when handling redirections, but I’ll get to it. Was just a matter of building a map of slugs[old_url] = new_url
Fixing internal images. This one was a mess, to be honest, and it was my fault because I was quite sloppy when adding images to my old posts. I had to handle 3 different cases: sometimes images were in blog/img, sometimes in wp-content/uploads. Some of them had captions, some didn’t. Had to add monstrosities like:
```
urlRe = "\[\!_\-\.\w\,\/\(\:\'\]\?\="
newLine = re.sub(\
"\[caption [\w\=_\"\s\'\(\)\.\,]+\]\[*\!\[([_\-\.\w\,\s\/\(\)\:\']*)\]\(([" + urlRe + "]+)\)\]*" \
"\(*([" + urlRe + "]*)\s*[_\-\.\w\,\s\/\(\)\:\']*\[\/caption\]", \
"{{< img src=\"\g<2>\" title=\"\g<1>\" link=\"\g<3>\">}}", newLine)
```
…but eventually got it to work more or less correctly. Added custom img shortcode to my layout, so that it handled captions properly.
Syntax highlighting. Didn’t even have to use my Python script here, I simply needed to replace [code lang="cpp"] with Hugo’s highlight shortcode for cpp. Can be done with the -pi Perl switch. Couldn’t really get it to handle multiple files (all files with a given extension) under Windows correctly, so had to resort to a simple batch file:
```
for %%i in (*.markdown) do perl -pi.old -e "s/\[code lang=\"cpp\"\]/{{< highlight cpp \"linenos=inline\" >}}/" %%i
```

(same for other languages)
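For illustration, here is a minimal Python sketch of two of the cleanup steps above: the encoding repair (dash if the broken sequence sits between spaces, quotation mark otherwise) and the code tag swap done for every language in one pass. This is not the original script. The names `repair_encoding` and `convert_code_tags` are made up, and the sketch assumes the WP closing tag is `[/code]` and that every language gets the same `linenos=inline` option.

```python
import re

# Byte sequences reported above as having replaced quotes and dashes.
BROKEN_SEQUENCES = (b"\xc3\xa2\x3f\x3f", b"\xc3\x82\xc2\xa0")
BROKEN_RE = re.compile(b"|".join(re.escape(seq) for seq in BROKEN_SEQUENCES))

# Opening WP code tag with a language attribute.
CODE_OPEN_RE = re.compile(r'\[code lang="(\w+)"\]')
# Closing tag form is an assumption, it is not shown in the post.
CODE_CLOSE = "[/code]"


def repair_encoding(data: bytes) -> bytes:
    """Replace each broken sequence with '-' or '"' based on its neighbours."""

    def pick(match):
        before = data[max(match.start() - 1, 0):match.start()]
        after = data[match.end():match.end() + 1]
        # A space on both sides suggests a dash, anything else a quote.
        if before == b" " and after == b" ":
            return b"-"
        return b'"'

    return BROKEN_RE.sub(pick, data)


def convert_code_tags(text: str) -> str:
    """Turn [code lang="x"]...[/code] into Hugo highlight shortcodes."""
    text = CODE_OPEN_RE.sub(r'{{< highlight \1 "linenos=inline" >}}', text)
    return text.replace(CODE_CLOSE, "{{< /highlight >}}")
```

The encoding step works on raw bytes because the broken sequences are not guaranteed to decode cleanly, so a post would be read in binary mode, repaired, and only then decoded.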

Redirecting

I wanted to make sure that all the old post URLs would be redirected to the new blog. This can be easily done using Apache’s mod_rewrite. I had to redirect URLs like /blog/?p=1391 to /blog2/dd2016-video. As you can see, can’t really be done using any automatic rule, I’d have to provide a translation dictionary. It’s actually possible using RewriteMap, but it can’t be used in .htaccess file, map has to be declared in server context. Was too lazy to check whether I had this option with my provider, so decided to change slugs from old posts. ?p=1391 becomes p1391 now and we can handle it with the following RewriteRule in .htaccess:

RewriteCond %{QUERY_STRING} ^p=(\d+)
RewriteRule ^blog /blog2/post/p%1? [R=302,L]

This meant I had to fix my internal links again (as slugs have changed), but it was just a matter of applying yet another translation. Added redirections for categories as well (just a few of them, so could be done by hand), e.g. blog/cat=7 -> blog2/categories/memtracer/
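With the pNNNN slug scheme, the internal link translation no longer needs a lookup table and can be sketched as a single substitution. The snippet below is only an illustration under that assumption: the function name `rewrite_internal_links` is hypothetical, and the target path mirrors the /blog2/post/p%1 form used in the RewriteRule.

```python
import re

# Old-style WP permalink, e.g. http://msinilo.pl/blog/?p=1400
OLD_LINK_RE = re.compile(r"http://msinilo\.pl/blog/\?p=(\d+)")


def rewrite_internal_links(text: str) -> str:
    """Point old ?p=NNNN links at the new pNNNN slugs."""
    return OLD_LINK_RE.sub(r"http://msinilo.pl/blog2/post/p\1", text)
```

Running it over each markdown file keeps in-post links from bouncing through the 302 redirect.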

Theme

Using redlounge. Didn’t really have to modify it much except for the aforementioned img shortcode. I did have some problems with pagination, but it might have been caused by me using a ‘weird’ path (blog2). In the end, I just hacked it to build URLs by hand:

{{ if $paginator.HasPrev }}
<a href="{{ .Site.BaseURL }}/page/{{ add $paginator.PageNumber -1 }}" class="prev">Prev</a>
{{ end }}

Comments

Hooked up Disqus for new comments, wrote yet another Python script to extract old comments from the XML file and append them to the posts they belonged to.
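A rough sketch of what such a comment extraction could look like, assuming a standard WordPress export (WXR) file. This is not the original script: the namespace URI below is the 1.2 export version and may differ in older exports, the function names are hypothetical, and the markdown layout of the appended comments is invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: WXR 1.2 namespace; check the xmlns:wp attribute of your export.
WP_NS = {"wp": "http://wordpress.org/export/1.2/"}


def comments_by_post(xml_text: str) -> dict:
    """Group comments from a WXR export by the post id they belong to."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for item in root.iter("item"):
        post_id = item.findtext("wp:post_id", namespaces=WP_NS)
        for comment in item.findall("wp:comment", WP_NS):
            grouped[post_id].append({
                "author": comment.findtext("wp:comment_author", default="", namespaces=WP_NS),
                "date": comment.findtext("wp:comment_date", default="", namespaces=WP_NS),
                "content": comment.findtext("wp:comment_content", default="", namespaces=WP_NS),
            })
    return dict(grouped)


def as_markdown(comments: list) -> str:
    """Render comments as a block to append to a post (layout is made up)."""
    lines = ["", "## Old comments", ""]
    for c in comments:
        lines.append(f"**{c['author']}** ({c['date']}):")
        lines.append("")
        lines.append(c["content"])
        lines.append("")
    return "\n".join(lines)
```

Since the post id matches the number in the old ?p=NNNN URLs, it can double as the key for finding the markdown file to append to.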

Other

Some other minor issues I ran into were:

tables - for some reason they were not converted correctly. Luckily, I don’t use them much, so just fixed them manually
fractions - Blackfriday, the markdown processor that Hugo uses, automatically converts numbers like 5/8 to fractions (⁵⁄₈) by default. Can be disabled in your config.toml (fractions=false) [caveat: doesn’t apply to ½, ¾ & ¼]

Hugo

I’m pretty happy with Hugo, it’s easy to use and well documented. Local server + live preview + hot reload is great, means I don’t even need any fancy Markdown editor, I just edit my posts in any text editor and keep an eye on the preview in my browser.

.m.m.s.

Maciej Sinilo