Problem

I want to format exported highlights from iBooks to look like Kindle clippings. I’ll write about why in a different post.

Details

The My Clippings.txt file where a Kindle saves highlights is formatted like this:

==========
The Years of Lyndon Johnson: Master of the Senate (Robert A. Caro)
- Highlight on Page 1465 | Loc. 22456-58  | Added on Wednesday, December 12, 2012, 09:38 AM

Talking with Proxmire, Richard Russell told him that his “position reminded him of a bull who had charged a locomotive train…. That was the bravest bull I ever saw, but I can’t say a lot for his judgment.”
==========
The Years of Lyndon Johnson: Master of the Senate (Robert A. Caro)
- Highlight on Page 1476 | Loc. 22626-27  | Added on Wednesday, December 12, 2012, 09:55 AM

Watching Lyndon Johnson hurry through the corridors, coat-tails flapping, journalists coined jokes about his intensity. “Light a match behind Lyndon and he’d orbit,” was 
==========
The Years of Lyndon Johnson: The Passage of Power (Robert A. Caro)
- Highlight on Page 14 | Loc. 212-13  | Added on Wednesday, December 12, 2012, 05:20 PM

But although the cliché says that power always corrupts, what is seldom said, but what is equally true, is that power always reveals.

If you get your iBooks highlights into a text file [1], they are formatted like this:

October 18, 2015 No title
Halifax, unconvinced, continued to hint at the desirability of negotiations. Beck’s principal private secretary told a British diplomat that these hints “tend to create an element of doubt as to the fixity of [Britain’s] purpose.” They evoked memories of Munich, he added, which was “not a good precedent.”43
October 18, 2015 No title
Here their error was not only spectacular, it was historic.
October 18, 2015 No title
the Nazis’ favorite diplomat, Sir Nevile Henderson, His Majesty’s ambassador to the Third Reich. Henderson defended the anti-Semitic pogroms of Danzig Nazis, opposed any link between London and Warsaw, thought the Poles should “talk a little less” about their courage and think “a little more” about the “realities” of their position on the eastern fringe of Grossdeutschland

Solution

The key is to insert this text/template:

==========
[book title] ([book author])

before each line that states the date.

You can insert that template or any other text you want before lines that fit a regular expression (regex) with the following perl command.

perl -pe "s/(^[A-Z][a-z]*\s{1}\d{1,2},\s{1}\d{4}.*)/==========\nThe Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)\n\1\n/" Manchester--Churchill-Alone.txt > converted-Manchester--Churchill-Alone.txt

Let me break down the command:

perl Uses the perl command/language on the command line.
-pe -p flag tells perl to act on each line and print the result, -e flag “allows you to provide the program as an argument rather than in a file.”
"..." You will give perl the command in quotation marks.
s/.../.../ s/// tells perl you will do a substitution (although as I will explain later, we’re not really substituting in this case). What follows the first / is the string or regex to match, and what follows the second / is the replacement.
(^[A-Z][a-z]*\s{1}\d{1,2},\s{1}\d{4}.*) I’ll break down this regex below.
==========\nThe Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)\n\1\n/ The replacement text (which in this case includes the date we matched, explanation below).
Manchester--Churchill-Alone.txt Name of file that perl will process.
> converted-Manchester--Churchill-Alone.txt Output file.

The (^[A-Z][a-z]*\s{1}\d{1,2},\s{1}\d{4}.*) is a regular expression that matches the dates in the iBooks highlights (e.g., February 28, 2016).

^ Anchor the match at the beginning of the line.
[A-Z][a-z]* One uppercase letter followed by any number of lowercase letters (the month name, e.g. October).
\s{1} Exactly one whitespace character.
\d{1,2}, One or two digits followed by a comma.
\s{1} Exactly one whitespace character.
\d{4}.* Four digits (for the year) followed by any number of characters.
(...) The whole expression is wrapped in parentheses to allow us to refer to it later and include it in the replacement text. We include it because we don’t want to replace it, we want to insert text before it.
==========\nThe Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)\n\1\n/ Ten equal signs, followed by a newline \n, followed by book title and author name, followed by a newline, followed by the date line we matched (which we refer to with \1, meaning the first capture group, i.e. whatever the parenthesized expression matched), followed by a newline.

This isn’t anything amazing to people with a lot of experience with regex, but I want to document this anyway for myself and for anyone else who could use a nice representative example of what regex can do.

The command converts this:

October 18, 2015 No title
Halifax, unconvinced, continued to hint at the desirability of negotiations. Beck’s principal private secretary told a British diplomat that these hints “tend to create an element of doubt as to the fixity of [Britain’s] purpose.” They evoked memories of Munich, he added, which was “not a good precedent.”43
October 18, 2015 No title
Here their error was not only spectacular, it was historic.
October 18, 2015 No title
the Nazis’ favorite diplomat, Sir Nevile Henderson, His Majesty’s ambassador to the Third Reich. Henderson defended the anti-Semitic pogroms of Danzig Nazis, opposed any link between London and Warsaw, thought the Poles should “talk a little less” about their courage and think “a little more” about the “realities” of their position on the eastern fringe of Grossdeutschland

To this:

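==========
The Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)
October 18, 2015 No title

Halifax, unconvinced, continued to hint at the desirability of negotiations. Beck’s principal private secretary told a British diplomat that these hints “tend to create an element of doubt as to the fixity of [Britain’s] purpose.” They evoked memories of Munich, he added, which was “not a good precedent.”43
==========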
The Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)
October 18, 2015 No title

Here their error was not only spectacular, it was historic.
==========
The Last Lion: Winston Spencer Churchill, Alone 1932-1940 (William Manchester)
October 18, 2015 No title

the Nazis’ favorite diplomat, Sir Nevile Henderson, His Majesty’s ambassador to the Third Reich. Henderson defended the anti-Semitic pogroms of Danzig Nazis, opposed any link between London and Warsaw, thought the Poles should “talk a little less” about their courage and think “a little more” about the “realities” of their position on the eastern fringe of Grossdeutschland

Nice.

Appendix A

I want to take a moment to tell you about Patterns.

[Image: Patterns icon]

Patterns is one of my favourite OS X utility apps. I go weeks or months without using it, and one day I’ll need to write a nontrivial regex, and Patterns makes it so much fun [2].

You can paste all or a sample of the text that you want to match within, and then experiment with the regular expression and watch as the matches update live as you type. It’s beautiful. I’m always a bit sad whenever I’ve found my expression and have no more reason to keep the app open.

[Image: Patterns screenshot]

Bonus: Patterns has a great regex reference sheet that you can consult as you work on your expression.

[Image: Patterns cheatsheet]

Nice × 2.

Appendix B

Solving this problem took me a long time because I spent ages trying to implement solutions using sed. For whatever reason, I just couldn’t get it to work. The impression I get is that there are many different versions of sed that differ in their capabilities. Even installing GNU sed on OS X using homebrew didn’t work for me.

I am including some of the links that recommended sed solutions below, but my experience with sed’s documentation and with the solutions suggested in online discussions was poor.


See also:

  1. Go into the iOS app ➝ open the book ➝ tap on the list icon in the top left ➝ tap on the Notes tab ➝ tap on the Share icon in the top right ➝ tap on Edit Notes (why is that under the Share button?) ➝ tap on Select All in the bottom left ➝ tap on the Share text button to the right of where Select All was ➝ Email it to yourself ➝ Select All in the email + Copy ➝ Paste into a text editor. It could be easier to do, but I’m just glad it can be done at all.

  2. That’s right. Regex. FUN!