Importing an external Feed into Jekyll


A simple script approach to import posts from an external feed into a Jekyll blog using HEY World’s atom feed as an example. Contains some special workarounds to fix the feed’s contents, but should work with every well-formed RSS feed out there with a little customisation.

As seen in my test post, I tried HEY World, a new blogging service by Basecamp that works by sending an email that’s then published to the web. While it’s not the most powerful blogging tool, it’s definitely one that’s easy to use.

To keep everything in one place and enjoy basic functionality like syntax highlighting even for the posts made in HEY World, I decided to import the feed they provide into my homepage and fix a few things on the way.
I am aware that this might lead to some SEO issues in case Google considers this to be double content, but I’ll cross that bridge when I get there.

Edit: Of course I noticed a few problems just after writing this post, e.g. that tags were no longer extracted in all cases. I made my script more fail-safe and updated the parts below. Here’s the post on HEY World to see the script in action.

Here’s the latest version of the script, I’ll go into some of the problems I ran into that are HEY World specific though. Parsing the feed was done using feedjira which is an amazingly easy to use library for all kinds of XML feeds, so I’ll skip this part.

Converting the Post Content

Trix, the WYSIWYG editor used by Rails’ Action Text and consequently by HEY, produces HTML which is used as entry content in the atom feed. It looks like they just insert the default trix layout which still contains e.g. a surrounding div.trix-content and a lot of formatting with a few… interesting choices.

To make this as easy for me as possible, I decided to convert the HTML I get from HEY into markdown and get rid of all HTML that’s not inside pre tags. There’s a great gem called reverse_markdown out there that I already used in the past and in a perfect world, all I’d have to do was a one-liner on the content:

ReverseMarkdown.convert(content, github_flavored: true, unknown_tags: :bypass)

However, I had to work around a few of Trix’s behaviours as well as get some additional information out of the post beforehand:

1. Trix and its Lack of Paragraphs

Other editors usually produce a <p> tag for a double newline, Trix goes the literal way and inserts two <br> tags. Not wrong, but a bit unexpected and a bit difficult when it comes to markdown conversion.

Let’s take a look at the reverse_markdown conversion rule for <br> tags:

class Br < Base
  def convert(node, state = {})
    " \n"
  end
end

As you can see, it inserts 2 spaces and a newline which is “stay in the same paragraph, but start a new line” in markdown. So, what happens with Trix’s <br><br> pattern?

"hello<br><br>world".to_markdown
=> "hello \n \nworld"

While kramdown, the default markdown parser in Jekyll, ultimately converts this into paragraphs as it ignores the spaces at the end of the lines, Jekyll seems to have problems generating a post teaser out of this. By default, Jekyll takes the first paragraphs of a post for the index page tiles (depending on their length), but was unable to do so with the double <br> tags:

image.png

It looks like everything up to the code block was interpreted as a single paragraph (which is correct as explained above). To work around this with as little work as possible, I decided to introduce an own pseudo HTML tag as replacement for the double <br>s that I could translate to markdown myself:

class DoubleBr < Base
  def convert(*)
    "\n\n"
  end
end

"hello<doublebr />world".to_markdown
#=> "hello\n\nworld"

2. Headings

While the HEY editor provides a “Heading” style, it seems to only be able to produce H1 tags which is probably not the semantically best way to go on a website. 

I therefore decided to add an own markdown converter again which simply translates H1 into ##. Works well and since I know that the input will never contain any other headings, this shouldn’t cause problems in the future.

3. Producing proper Code Blocks

While Trix has a basic code block functionality, it doesn’t support syntax highlighting, but introduces a few formatting tags like <em> into the code.

What it doesn’t do is inserting a proper <code> tag inside its <pre>s which would be the semantically correct way to display a block of code. I understand that they intended to simply provide a “we won’t touch whitespace here” functionality into their editor, but this behaviour made it harder to further process it.

First, I had to get rid of the extra formatting HTML that’s inserted into the code block. Luckily, the actual code contains twice encoded HTML entities while the formatting tags were only encoded once for the feed:

# A comment
class << self
&lt;em&gt;
  A comment
&lt;/em&gt;

class &amp;lt;&amp;lt; self

This way, it’s possible to decode it once and let Nokogiri remove all actual HTML tags while keeping the content intact. The easiest way to achieve this was to use #content on the <pre> itself which returns the pure text content of said node.

Additionally, I needed to alter two things in the HTML I got from HEY World: 

  1. Add a code tag so the markdown converter understands that this is in fact code
  2. Extract the language to be highlighted

To achieve 2), I added e.g. # lang: ruby to the bottom of code blocks in the HEY World editor and extract these lines in my script.

The resulting part of the script was pretty easy in the end:

xml.css("pre").each do |pre|
  code = Nokogiri::XML::Node.new("code", xml)

  # Search for a "lang: something" line at the end of the pre.
  # If there is one, use it for syntax highlighting
  if (match = pre.content.lines.last.match(/lang: (\w+)/))
    code["lang"] = match[1]
    code.content = pre.content.lines[0...-1].join
  end

  pre.content = nil
  code.parent = pre
end

I had to extend reverse_markdown again with an own converter for <code> though as by default it doesn’t take the lang attribute into account:

class TrixPre < Pre
  def language(node)
    super.presence || node.at_css("code")["lang"]
  end
end

4. Assigning Tags to Posts

Since HEY World itself doesn’t support tags, I used the same approach as for the code block highlighting language and search for a tag line (“#tag1 #tag2”) at the end of the post. I originally did this before turning the post into markdown, but the HTML I got wasn’t consistent enough, so I now treat the last markdown line as possible tags.

def tags
  @tags ||= begin
    tag_line = markdown_content.lines.last

    if tag_line.match?(/^((#\w+) ?)+$/)
      tag_line.scan(/#(\w+)/).flatten
    else
      []
    end
  end
end

If a tag line exists in the markdown content, it is removed before writing it to disk.

Front Matter

The front matter for the Jekyll post (the YAML part at the beginning that contains the post’s title and other information) can be easily generated now that all relevant information has been extracted from the RSS entry:

<entry>
  <id>tag:world.hey.com,2005:World::Post/11280</id>
  <published>2021-05-14T19:48:10Z</published>
  <updated>2021-05-23T15:15:38Z</updated>
  <title>Testing HEY world</title>
...
</entry>
def front_matter
  {
    date: item.published.localtime.to_s,
    last_modified_at: item.updated.localtime.to_s,
    layout: "post",
    tags: tags,
    title: item.title
  }.transform_keys(&:to_s)
end

The important part here is to make sure that all Hash keys are strings, otherwise, a YAML dump would generate Symbols which Jekyll doesn’t expect.

File Generation

To keep this as simple as possible, I decided to reload all posts from the external feed every time I run the script, deleting every post that might have already been there. In case of stex.codes, this is done when building its docker container for deployment in CI.

I tried using an own collection (= an own folder) in Jekyll for these posts, but those behave slightly different from normal posts and require a bit more configuration to be included in the main feed. Therefore, I decided to simply add a distinct pattern to their file names while still following Jekyll’s naming conventions:

image.png

This allows me to remove all possibly existing posts before loading the feed again:

FileUtils.rm_rf(POSTS_DIR.join("*--hey--*"))

Re-Building the Website on Feed Changes

Since changes to HEY World should be synced over to my blog, I wrote a small script / docker image to watch for changes and trigger a docker hub build. I didn’t find a better solution than polling yet, unfortunately.

Conclusion

Well, it works. I’ll have to adjust a few styles on the website to e.g. prohibit smaller images being displayed in full width, but creating a blog post through my email client is definitely a lot easier than having to code it. And since I’m still able to create more complex posts directly in Jekyll, I’ll go with this for all simpler topics for now.