'Take no one's word for it' in 2016

“has it been a year already!?”

I hear James Stacey say in my ears as I stare at this draft, wondering how to start.1

Unlike with most previous years, I did not have this experience with 2016. This has been a good year for me; I did a lot, and made plenty of progress personally and professionally. My subjective experience is not that it went by too fast, but that the passage of time feels just right.

The knock on new year’s resolutions is that they encourage you to wait until a seemingly arbitrary moment in time before you make a big change or do something to make your life better. Another knock is that this encourages you to attempt large changes instead of piecemeal changes, which increases the amount of discipline required for success, and therefore increases the chances of failure. Larger changes would happen less frequently, and that makes error-correction harder.

I think there is truth in there, but as with a lot of things people criticize today, the criticism loses a lot of nuance or selectivity and becomes absolute. You shouldn’t wait until new year’s to make your life better, but setting checkpoints for retrospectives and projections at regular intervals is useful. New year’s is arbitrary, but no more arbitrary than any other time or date if you don’t have better reasons for them. Just make sure you’re not using it as an excuse to procrastinate.

Personally, I think an annual cycle is too infrequent for most stock-taking and revision checkpoints. You can start a cycle on January 1st, but make it triannual or quarterly. Or choose your own date if January 1st is too problematic for you.

Last year I said I wanted to write more, and that’s the closest I’ve come to making a “resolution”. I like writing for what it helps me learn and get better at, including writing itself, and quantity should only increase when it’s a means, not an end.

I’m happy with how 2016 turned out for Take no one’s word for it, and am tickled pink to share the visualizations for the year.

[Plots: posts by month and year, total posts by year, words by month and year, and total words by year]

Future

I’m moving to a new country and starting in a new research scientist role in 2017, and one way or another I think that will affect my writing here. What I hope will happen is that I’ll be able to write more about science and data as I learn more things faster in my new position.

I’m excited.

  1. I don’t usually listen to podcasts when I write, but I wanted to get myself into a certain mindset.

tmux workspace scripts

tmux describes itself as a “terminal multiplexer”. Pleasantly, it goes on to explain what that means:

It lets you switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal. And do a lot more.

The way I would describe it is that tmux runs terminal sessions independently of the terminal window you’re viewing those sessions in. This means that you can do some work in a tmux session, close the terminal window, or “detach” from the tmux window, and later reattach to the tmux session and find your work, tmux windows, and tmux panes exactly as you left them.

Windows and panes1 are the other tmux features I really appreciate, in addition to being able to detach and close terminal windows without killing the work or processes running in the session. Each pane within each window is a separate shell session.
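As a quick illustration of that day-to-day flow (the session and window names here are placeholders, not anything from my setup):

tmux new -s scratch       # create and attach to a session named "scratch"
# work for a while, then press the prefix (Ctrl-b by default) followed by d, or run:
tmux detach
tmux ls                   # list running sessions
tmux attach -t scratch    # reattach; windows and panes are exactly as you left them
tmux new-window -n notes  # add a window named "notes" to the current session
tmux split-window -h      # split the active window into two side-by-side panes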

I’ve been using tmux for a few years (I think), but until recently, my use had reached a plateau: I would manually start a tmux session, then create windows and manually split them into panes as I needed for my work. When done, I would, inefficiently, enter a series of exit commands to close the panes one by one, until closing the last one killed the tmux session.

I was setting up a complicated workspace for simplestatistics when I thought to look into the possibility of writing a script that I could run to set up all the windows and panes I need. Unsurprisingly, it is possible, and great.

This is the finished simplestatistics tmux workspace script in its current form. You can find an up-to-date version of it here:

#!/usr/local/bin/fish

# detach from a tmux session if in one
tmux detach > /dev/null ^ /dev/null

# don't set up the workspace if there's already a simplestatistics session running
if tmux list-sessions -F "#{session_name}" | grep -q "simplestatistics";
	echo "simplestatistics session already running"
else
# okay no simplestatistics session is running

cd ~/projects/simplestatistics
tmux new -d -s simplestatistics

# window 0 - main
tmux rename-window main

# set up window 1 - documentation
# - index.rst
# - README.md
# - __init__.py
# fourth empty pane
tmux new-window -n documentation

tmux split-window -h -p 45
tmux select-pane -t 0
tmux split-window -v
tmux select-pane -t 0
tmux send-keys "cd ~/projects/simplestatistics/simplestatistics/" C-m
tmux send-keys "vim __init__.py" C-m

tmux select-pane -t 1
tmux send-keys "cd ~/projects/simplestatistics/" C-m
tmux send-keys "vim README.md" C-m

tmux select-pane -t 2
tmux send-keys "cd ~/projects/simplestatistics/simplestatistics/" C-m
tmux send-keys "vim index.rst" C-m
tmux split-window -v

# set up window 2 - changelogs
tmux new-window -n changelogs
tmux send-keys "cd ~/projects/simplestatistics/" C-m
tmux send-keys "vim changelog.txt" C-m

tmux split-window -h
tmux send-keys "cd ~/projects/simplestatistics/" C-m
tmux send-keys "vim HISTORY.rst" C-m

# back to window 0 - main
# 2 vertical panes: both will be used to edit main statistics functions
tmux select-window -t 0
tmux send-keys "cd ~/projects/simplestatistics/simplestatistics/statistics" C-m
tmux send-keys "ls" C-m
tmux split-window -h
tmux send-keys "cd ~/projects/simplestatistics/simplestatistics/statistics" C-m

tmux select-pane -t 0
tmux split-window -v
tmux send-keys "cd ~/projects/simplestatistics" C-m
tmux send-keys "bpython" C-m
tmux select-pane -t 0

tmux attach-session -t simplestatistics
end

If you attempt to start a session within a session, tmux warns you that sessions should be nested with care. Nesting sessions is not something I want to do anyway, but I do want the ability to start session Y and attach to it while in session X. So lines 3 ➝ 4 attempt to detach from a tmux session, sending normal and error output to /dev/null. If I’m attached, it detaches me before creating the session, and if I’m not, it fails silently.

Lines 6 ➝ 9 check whether there’s already a running session named simplestatistics, and stop execution with the message "simplestatistics session already running" if they find one.

Lines 12 ➝ 65 do the work of creating the workspace, which is made up of three windows.

window 1 - documentation

The second window (tmux windows are zero-indexed) contains the panes I use to edit and generate documentation for simplestatistics. The right pane is created with 45% of the window width.

Clockwise from top left:

  • __init__.py To add the new function I’m working on.
  • index.rst The main documentation page for Sphinx.
  • README.md
  • A shell for generating documentation.

window 2 - changelogs

Opens two versions of the changelogs in vim:

  • changelog.txt A Markdown-based changelog for all reasonable persons and machines.
  • HISTORY.rst A reStructuredText version for PyPI.

window 0 - main editing

The layout is a bit unusual. The top left and entire right are listings of the directory that contains the function files. I use the big right pane to work on the new function, and the left one for general shell work and references.

The bottom left pane runs bpython for interactive testing.

Closing notes

If you work in the terminal and don’t use tmux, consider using it. It’s so nice to have several workspaces that never die until you kill them. If you do use tmux and often end up with complicated workspaces, consider scripting them!

  1. The terminology here is confusing: windows are actually tabs, their names appear at the bottom of the window, and they contain panes arranged in different layouts. It would make more sense to rename windows ➝ tabs, and rename panes ➝ windows.

Sanitizing dirty Medium links on Pinboard with R

I’ve been on a Pinboard API roll lately. In hindsight it’s not surprising since I use Pinboard so much. Today’s post is another one in which I use R and the Pinboard API to fix a wrong in the world.

Problem

Have you ever noticed those Medium post links? Here’s an example:

https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r

See that #.ncvbvfg3r tacked on the end? I noticed it a while ago, and I’m not the only one. That appendage tracks referrals, and I can imagine it allows Medium to build quite the social graph. I don’t like it for two reasons:

  1. Hey buddy? Don’t track me.
  2. It makes it difficult to know if you’ve already bookmarked a post because it’s likely that if you come across the post again, its url is not the same as the one you already saved. When you try to save it to your Pinboard account, it won’t warn you that you already saved it in the past.

You can find a discussion about this on the Pinboard Google Group.

Maciej Cegłowski, creator of Pinboard, was reassuringly himself about the issue:

I think the best thing in this situation is for Medium to die.

Should that happen I will shed few tears. I don’t want Medium to die, but they need to get better. In the meantime, they exist and I have to fix things on my end.

(½) Solution

I wrote a script that downloads all my Pinboard links, and removes that hash appendage before saving them back to my Pinboard account.

This is half a solution because it only solves reason 1, the tracking. Each time I visit or share a sanitized link, a new appendage will be generated, breaking its connection to how I came across the link in the first place.

It doesn’t solve reason 2 – if I had already saved a link to my Pinboard account, and then come across it again and try to save it, having forgotten that I already did so in the past, Pinboard won’t match the urls since the one it has is sanitized. Unless Maciej decides to implement a Medium-specific feature to strip those tracking tokens, there’s not much I can do about that.

First, let’s load some libraries and get our Pinboard links.

library(httr)
library(dplyr)    # for select() and filter()
library(purrr)    # for map_chr()
library(magrittr)
library(jsonlite)
library(stringr)

# My API token is saved in an environment file
pinsecret <- Sys.getenv('pin_token')

# GET all my links in JSON
pins_all <- GET('https://api.pinboard.in/v1/posts/all',
                query = list(auth_token = pinsecret,
                             format = 'json'))

pins <- pins_all %>% content() %>% fromJSON()

I load my API token from my .Renviron file, use the GET() function from the httr package to send the GET request for all my links in JSON format, and then convert the returned data into a data frame by using content() from httr and piping the output to the fromJSON() function from the jsonlite package.

Let’s examine the pins dataframe:

pins %>% 
    select(href, time) %>% 
    head() %>%  
    knitr::kable()

Which gives us:

href time
https://twitter.com/Samueltadros/status/800208013709688832 2016-11-20T14:23:11Z
http://gizmodo.com/authorities-just-shut-down-what-cd-the-best-music-torr-1789113647 2016-11-19T15:21:06Z
http://www.theverge.com/2016/11/17/13669832/what-cd-music-torrent-website-shut-down 2016-11-19T15:18:33Z
http://www.rollingstone.com/music/news/torrent-site-whatcd-shuts-down-destroys-user-data-w451239 2016-11-19T15:16:16Z
https://twitter.com/whatcd/status/799751019294965760 2016-11-18T23:56:23Z
https://twitter.com/sheriferson/status/799761561149722624/photo/1 2016-11-18T23:49:49Z

Let me break down that last command:

  • Start with pins dataframe.
  • Pipe that into select(), selecting the “href” and “time” columns.
  • Pipe the output into head(), which selects the top (latest, in this case) 6 rows.
  • Pipe the output into the kable() function from the knitr package, which converts the dataframe into a Markdown table.

That last part is very handy.

Now that we have all our links, let’s select the Medium ones.

medium <- pins %>%
    filter(str_detect(href, 'medium.com'))

Again, let’s break it down:

  • Store into medium the output of…
  • Piping pins into the filter() function from the dplyr package, which uses str_detect() from the stringr package to search for “medium.com” in the “href” column.

Checking the medium dataframe shows…

href time
https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2 2016-10-25T18:50:36Z
https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r 2016-10-11T06:15:48Z
https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.by7z0gq33 2016-10-02T01:07:24Z
https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db#.nmev5f200 2016-09-19T23:35:16Z
https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41 2016-09-07T16:30:57Z

Now, this looks like it worked, but I’m paranoid. It’s possible that the filtering caught links that have domains that end with “medium.com” but are not Medium links.

I want to be more careful, so I’ll use a function that I used before to extract the hostname from links.

get_hostname <- function(href) {
  tryCatch({
    parsed_url <- parse_url(href)
    if (!parsed_url$hostname %>% is.null()) {
      hostname <- parsed_url$hostname %>% 
        gsub('^www.', '', ., perl = T)
      return(hostname)  
    } else {
      return('unresolved')
    }
    
  }, error = function(e) {
    return('unresolved')
  })
}

pins$hostname <- map_chr(pins$href, .f = get_hostname)

medium <- pins %>%
    filter(hostname == 'medium.com')

This is a dataframe of Medium links that I am more confident about.1

Now! Let’s remove that gunk.

medium$cleanhref <- sub("#\\..{9}$", "", medium$href)

That’s all. A quick regex substitution to remove the trailing hash garbage.
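As a quick sanity check, here’s the same substitution applied to the example link from the beginning of the post:

sub("#\\..{9}$", "", "https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r")
# [1] "https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43"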

Old links Clean links
https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2 https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2
https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43
https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.by7z0gq33 https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471
https://medium.com/@joshuatauberer/civic-techs-act-iii-is-beginning-4df5d1720468 https://medium.com/@joshuatauberer/civic-techs-act-iii-is-beginning-4df5d1720468
https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db#.nmev5f200 https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db
https://medium.com/@ESAJustinA/ant-to-advance-data-equality-in-america-join-us-were-hiring-developers-and-data-scientists-147f1bfedcb5#.mh8dpuqz9 https://medium.com/@ESAJustinA/ant-to-advance-data-equality-in-america-join-us-were-hiring-developers-and-data-scientists-147f1bfedcb5

Now we need to put this data back into the I N T E R N E T.

As far as I can tell reading the Pinboard API2, there’s no way to update a bookmark in-place with a new url. The best way to do this is to delete the old bookmarks and add the new ones with the tags, shared, to-read status, and date-time information of the old ones.

This is the dangerous part. I want to be as careful as possible. I want to store the HTTP responses for each deletion and addition, and just so I don’t anger the rate-limiting gods, I will inject a 5 second delay between requests. Five seconds is probably overkill, but this isn’t production code; it’s a personal thing and I don’t mind waiting.

medium$addition_response <- vector(length = nrow(medium))
medium$deletion_response <- vector(length = nrow(medium))

for (ii in 1:nrow(medium)) {
    deletion <- GET('https://api.pinboard.in/v1/posts/delete',
                    query = list(auth_token = pinsecret,
                                 url = medium$href[ii]))
    
    medium$deletion_response[ii] <- deletion$status_code
    
    addition <- GET('https://api.pinboard.in/v1/posts/add',
                    query = list(auth_token = pinsecret,
                                 url = medium$cleanhref[ii],
                                 description = medium$description[ii],
                                 extended = medium$extended[ii],
                                 tags = medium$tags[ii],
                                 dt = medium$time[ii],
                                 shared = medium$shared[ii],
                                 toread = medium$toread[ii]))
    
    medium$addition_response[ii] <- addition$status_code
    
    Sys.sleep(5)
}

A quick inspection of the deletion and addition response codes reveals nothing but sweet, sweet 200s. A quick inspection of the Medium links on my Pinboard account reveals clean, shiny, spring-scented urls.

The full code is available as a gist here.

  1. The dataframe created using the hostname extraction function has the same number of rows as the one created with a simple grep of “medium.com”, which means it probably wouldn’t have been a problem to stick with the earlier solution. The second solution is still a lot better.

  2. … which is a link that must hold the record for the number of times I’ve linked to it from this site.

Solving my read later problem

Attention conservation notice: A post on writing a small technical hack to improve what ideally I could do without needing a hack.

I do most of my learning by reading articles, guides, and blog posts online, and I manage this using Pinboard.1 All the links I’ve read or want to read in the future live there.

Problem

My read later list was growing a lot faster than I could go through it.

I rarely felt like going to my account to choose an article to read. When I did, I faced choice paralysis. I would scan the links and not feel like starting any of them. The problem was static friction.

One way I tried to solve this was to use Pinboard’s “random article” bookmarklet, which opens a randomly chosen unread link from your account. This worked to an extent, but I would sometimes land on an article that needed more time or attention than I had, and I would click the bookmarklet again. Once you start making exceptions and spinning again, it becomes easy to do what is effectively scanning many articles before actually reading one.

I realized what I wanted was somewhere in between: I wanted to see a few randomly chosen options.

Solution

My solution is punread, which is built on top of BitBar.

BitBar (by Mat Ryer - @matryer) lets you put the output from any script/program in your Mac OS X Menu Bar.

Go to the link to see some screenshots and examples. The idea is that you write a script that produces an output and tell BitBar how often you want it run. There’s a lot of syntax available for you to control the output, how it looks, what happens when you click it, etc.
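For a sense of how little it takes, here is a minimal sketch of a plugin (the file name and contents are hypothetical; saving it as something like hello.1h.sh in the plugins folder tells BitBar to run it every hour):

#!/bin/bash
echo "42"                                       # what shows up in the menu bar
echo "---"                                      # everything below this goes in the dropdown
echo "Open example | href=https://example.com"  # a clickable dropdown item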

punread shows the number of unread bookmarks in my menu bar, and when I click on the number, I see 30 randomly chosen links. I can click on one, read it in the browser, and then mark it as read using another one of Pinboard’s bookmarklets.

punread is two files. The first is punread.30m.sh, which is the shell script BitBar wants to have:

#!/bin/bash
# <bitbar.title>punread</bitbar.title>
# <bitbar.version>v1.0</bitbar.version>
# <bitbar.author>Sherif Soliman</bitbar.author>
# <bitbar.author.github>sheriferson</bitbar.author.github>
# <bitbar.desc>Show pinboard unread count</bitbar.desc>
# <bitbar.dependencies>python</bitbar.dependencies>
# <bitbar.abouturl>https://github.com/sheriferson/punread</bitbar.abouturl>

links=$(/usr/local/bin/python3 /Users/sherif/projects/punread/punread.py)
echo "$links"

echo "---"
echo "📌 Random article | href=https://pinboard.in/random/?type=unread"

It doesn’t do much. It runs the second file, punread.py, and shows its output. It also tacks on a final menu item that will show me a random unread article in case I didn’t like any of the 30 already listed. I don’t think I’ve ever used that option.

The second file is punread.py, which does most of the work. It talks to the Pinboard API, saves some state, and returns the 30 links for BitBar to display.

import json
import os.path
import pickle
import random
import re
import requests
import sys
import time

# get the path to punread.py
pathToMe = os.path.realpath(__file__)
pathToMe = os.path.split(pathToMe)[0]

last_updated_path = os.path.join(pathToMe, 'lastupdated.timestamp')
unread_count_path = os.path.join(pathToMe, 'unread.count')
links_path = os.path.join(pathToMe, 'links')
api_token_path = os.path.join(pathToMe, 'api_token')
last_run_path = os.path.join(pathToMe, 'lastrun.timestamp')

backup_file = '/Users/sherif/persanalytics/data/unread_pinboard_counts.csv'

def print_random_unread_links(count, unread, n = 30):
    count = str(count) + ' | font=SourceSansPro-Regular color=cadetblue\n---\n'
    sys.stdout.buffer.write(count.encode('utf-8'))
    # sample n unread links, or all of them if there are fewer than n
    random_unread_indexes = random.sample(range(len(unread)), min(n, len(unread)))
    for ii in random_unread_indexes:
        description = unread[ii]['description']
        # swap the pipe for FULLWIDTH VERTICAL LINE so BitBar doesn't treat it as syntax
        description = description.replace("|", "｜")
        link_entry = '📍 ' + description + " | href=" + unread[ii]['href'] + " font=SourceSansPro-Regular color=cadetblue\n"
        sys.stdout.buffer.write(link_entry.encode('utf-8'))

def log_counts(total_count, unread_count):
   """
   A function to write the time, total bookmark count, and unread bookmark count
   to a csv file.
   """
   now = int(time.time()) 
   row = str(now) + ',' + str(total_count) + ',' + str(unread_count) + '\n'

   with open(backup_file, 'a') as bfile:
       bfile.write(row)

# check if there's a lastrun.timestamp, and if it's there
# check if the script ran less than 5 mins ago
# if yes, quit
if os.path.isfile(last_run_path):
    last_run = pickle.load(open(last_run_path, 'rb'))
    if time.time() - last_run < 300:
        unread_count = pickle.load(open(unread_count_path, 'rb'))
        links = pickle.load(open(links_path, 'rb'))
        unread = [link for link in links if (link['toread'] == 'yes')]
        print_random_unread_links(unread_count, unread)
        exit()
    else:
        pickle.dump(time.time(), open(last_run_path, 'wb'))
else:
    pickle.dump(time.time(), open(last_run_path, 'wb'))

with open(api_token_path, 'rb') as f:
    pintoken = f.read().strip()

par = {'auth_token': pintoken, 'format': 'json'}

if os.path.isfile(last_updated_path) and os.path.isfile(unread_count_path):
    last_updated = pickle.load(open(last_updated_path, 'rb'))
    unread_count = pickle.load(open(unread_count_path, 'rb'))
    links = pickle.load(open(links_path, 'rb'))
else:
    last_updated = ''
    unread_count = 0

last_updated_api_request = requests.get('https://api.pinboard.in/v1/posts/update',
        params = par)

last_updated_api = last_updated_api_request.json()['update_time']

if last_updated != last_updated_api:
    r = requests.get('https://api.pinboard.in/v1/posts/all',
            params = par)

    links = json.loads(r.text)

    unread = [link for link in links if (link['toread'] == 'yes')]
    total_count = len(links)
    unread_count = len(unread)

    pickle.dump(last_updated_api, open(last_updated_path, 'wb'))
    pickle.dump(unread_count, open(unread_count_path, 'wb'))
    pickle.dump(links, open(links_path, 'wb'))

    log_counts(total_count, unread_count)
    print_random_unread_links(unread_count, unread)
else:
    unread = [link for link in links if (link['toread'] == 'yes')]
    print_random_unread_links(unread_count, unread)

There are too many lines of code for me to walk through this step by step, but I’ll paint a general picture.

Some notes and things I had to keep in mind while writing the script:

  • The Pinboard API has rate limits. I can’t hit the posts/all method more than once every five minutes.
  • Pinboard recommends you use the API token to authenticate, rather than regular HTTP auth. I keep my API token in a file that I added to .gitignore so I don’t accidentally publish it somewhere.
  • I wanted to keep track of the total number of bookmarks and unread bookmarks over time (see below).
  • I wanted to minimize the number of times I used the posts/all method. The Pinboard API makes this easy: the posts/update returns the timestamp of the last update to any of your bookmarks. My script saves the last value returned by this method, and if the next time it runs it gets the same value, it never tries to use posts/all.
  • The thing I struggled with, by far, was string output. If you see some ‘squirrely’ things like sys.stdout.buffer.write(count.encode('utf-8')) and wonder why I don’t just print(), it’s because I ran into a lot of trouble with Python3’s string encoding and BitBar’s understanding or lack thereof of what I was giving it. It took me a long time to arrive at this solution.
  • You might also notice description = description.replace("|", "｜"). The pipe character is the one character I had to avoid in my output, as it has special meaning to Unix and BitBar. The code is replacing the classic pipe character “|” with what is officially called “FULLWIDTH VERTICAL LINE”.2 It maintains the appearance of pipes in article titles without tripping BitBar up.

Results

This was a fun project, and I think it achieved what I wanted from it. I’ve put a serious dent into the number of unread links since I started using punread.

I’m not a big fan of seeing a lot of metrics. I disable most red iOS notification bubbles. But the reason I do that is exactly why I think punread works for me: I haven’t trained myself to see and ignore a lot of numbers. I see punread’s unread count in the menu bar, and I stick to a plan of not letting it climb a lot over time.

This wouldn’t be a Take no one’s word for it post if it didn’t have a plot or two.

[Plots: unread bookmark count over time, plus a zeroed-out y-axis version for the fundamentalists]

The rapid buildup of unread links led me to raise my threshold of what’s good or relevant enough for me to read, and I’ve been deleting any articles that failed to reach that threshold. We can see that in this plot which marks deletions with red points and corresponding labels.

I couldn’t get the text labels to work without it being a mess, so here’s a version without the labels.

[Plots: the same data without labels, plus the useless zeroed y-axis version]

I know the plots are not beautiful.

Each red point is a measurement that was lower than the one before it, with a total of 110 deleted articles. This way of measurement can miss some deletions if between time t and time t+1 I deleted an article and added a new article; in that instance the measurement would not register a change. I’m aware of at least one case of that happening. It doesn’t make a big difference, but it’s good to be aware of when your measurement has faults or blind spots.

I’m sure that in addition to punread helping me, I was also motivated by the idea of using software that I wrote for myself, and by wanting to see that number and line plot go down. Regardless of how the variables interact to produce the final result, I declare it a success.

  1. “a bookmarking website for introverted people in a hurry”

  2. Unicode: U+FF5C, UTF-8: EF BD 9C

Recently (read)

This is a special reading-only issue of Recently.

Reading

Articles

  • My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

    This introduction to random forests was the right mix of explanatory and practical for me at the time I found it. I had used random forests before, but only very simply and naïvely.

  • Rbitrary Standards

    An alternative R FAQ about some of the history and idiosyncrasies of the R language. Go for the knowledge, stay for the great humor and wit.

    See also: The R Inferno by Patrick Burns.

  • 5 Psychological Studies that Require a Second Look

    I come from a cognitive science background, which made the topic relevant to my interests. I think there is a slow-growing but positive trend towards scientists (and in the case of my background, psychological researchers) talking about the unpleasant sides of how the sausage is made, and this is an example of that trend.1 The 5 psychological studies referred to in the title are studies the author himself worked on.

  • More thoughts on Music in iOS 10 (beta)

    Unfortunately, unlike iTunes on the Mac and Windows, Music on iOS still only sorts albums by name, giving us no option to sort them by date instead. This is one of those head slap moments that makes you wonder if anyone at Apple has ever been a serious music collector. As far as I’m concerned, Apple cannot claim that it loves music ever again until it gives us the option to sort albums by date. No self-respecting music geek sorts albums by name. I don’t care if you hide the option in the Settings app, just give me the option for date, you wankers.

    Also worth noting: Apple removed the ability to rate songs in the iOS Music app.

  • Vesper, Adieu

    I bought Vesper but didn’t use it much. Looking back at it now, it looks and behaves great, and I maybe would have started using it more with a second look.

  • Gawker & the Left’s Selective Outrage

    Lest this post be too uncontroversial for you. I really liked this article.

Books

Inside the Third Reich by Albert Speer.

This is a big book and it’s been on my to-read list for a while. It’s an easier read than you would think.

I’ve read and continue to read a lot about history and WWII. I’m not special; walk into any second-hand book store, and I guarantee that one of the biggest sections you’ll find is the WWII section. WWII is strange, tragic, and difficult to comprehend. I think Speer’s memoirs might be the closest we can get to a look into the sociology and psychology of the top members of the German government at the time.

If you are at all interested in the history of World War II, I really recommend this book. It’s a classic for a good reason.

  1. Not a very recent example. The article is dated Feb 18, 2014. I only read it in Sept 2016 because I’m doing a sweep of my unread pinboard bookmarks. More on that in a later post.

Submitting a Python package with GitHub and PyPI

I maintain and publish a math and statistics Python package called simplestatistics. The library is a port of its JavaScript ancestor simple-statistics. While building it up, I wanted it to be available on the Python Package Index, PyPI1, and I had to learn how to do that. The process turned out to be more complicated than I expected it to be, so here are the steps it takes to publish your package, or its updates, to PyPI.

Note: The order of steps matters.

Note 2: Some of these steps are specific to hosting your package on GitHub.

Do you have a setup.py?

You need to create a setup script to publish your package using distutils. This is the one for simplestatistics.

from distutils.core import setup

setup(
    name = 'simplestatistics',
    packages = ['simplestatistics', 'simplestatistics.statistics'],
    version = '0.2.5',
    description = 'Simple statistical functions implemented in readable Python.',
    author = 'Sherif Soliman',
    author_email = 'sherif@ssoliman.com',
    copyright = 'Copyright (c) 2016 Sherif Soliman',
    url = 'https://github.com/sheriferson/simplestatistics',
    download_url = 'https://github.com/sheriferson/simplestatistics/tarball/0.2.5',
    keywords = ['statistics', 'math'],
    classifiers = [
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 3',
        'Topic :: Scientific/Engineering :: Mathematics',
        'Intended Audience :: Developers',
        'Intended Audience :: Education',
        'Intended Audience :: End Users/Desktop',
        'Intended Audience :: Science/Research',
        'Operating System :: MacOS',
        'Operating System :: Unix',
        'Topic :: Education',
        'Topic :: Utilities'
        ]
)

Update changelog.txt

If you maintain a changelog file, make sure you update it with the release version and date.

If you don’t maintain a changelog file, you should.
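The entry itself can be short; a hypothetical template (not the project’s actual changelog format) looks like this:

0.2.5 (YYYY-MM-DD)
- Short description of what was added, changed, or fixed in this release.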

I maintain a separate HISTORY.rst file, so I make sure I update that too.

Update version number in documentation

Good documentation is important to me. It helps you understand your code and project better, and definitely helps anyone else trying to use it. simplestatistics documentation is hosted and generated automatically by Read the Docs using the very useful Sphinx package.2

When I’m publishing a new release, I make sure I update the version number in Sphinx’s conf.py.
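Concretely, that means bumping the two standard Sphinx version values (the numbers here are illustrative):

# in Sphinx's conf.py
version = '0.2'    # the short X.Y version
release = '0.2.5'  # the full version string shown in the built documentation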

Update version number in setup.py

I mentioned setup.py in the first step. Make sure to update the version number.

from distutils.core import setup

setup(
    ...
    version = '0.2.5',
    ...

Convert README.md to README.rst

PyPI doesn’t like Markdown. Actually, it’s not that it doesn’t like it, it just doesn’t care about it one way or another. PyPI likes reStructuredText (RST from this onwards). If you want PyPI to render the README on the package homepage like you can see on the simplestatistics PyPI page, it has to be in RST.

I’ve had a lot of trouble with RST. In my experience, it’s very fragile. It takes one extra space in a table to break rendering for the whole file.

pandoc is a great tool you could use to convert your README.md to README.rst, but PyPI may not like the default output of pandoc’s conversion. In my use case, the conversion of the Markdown tables to RST tables was the part that often angered PyPI rendering.

After some troubleshooting, I found that this command prevents the tables from wrapping around to new lines and causing README rendering on PyPI to fail.

pandoc --columns=100 --output=README.rst --to rst README.md

Here’s a bonus tip: this online reStructuredText editor, made available by Andrey Rublev, has been a huge help in debugging and fixing RST errors.

Add tarball download url to setup.py

In a later step, we will add a git tag and push it to GitHub. This will create a new release on the GitHub page, and this release will include .zip and .tar.gz files of the release (see an example here).

These compressed files are the ones that PyPI will pull from when you push your release or update. PyPI gets that download url from setup.py.

The result of this circular dependency is that you need to anticipate the url of the GitHub release before you push the release commit to GitHub.

You can set the new download url in setup.py based on your new version number:

setup(
    ...
    url = 'https://github.com/sheriferson/simplestatistics',
    download_url = 'https://github.com/sheriferson/simplestatistics/tarball/0.2.5',
    ...

Yes, you do set this url and commit it before it actually exists.

If there are new files that should be included, edit MANIFEST.in

The MANIFEST.in file is how you tell distutils to include files in the release file that it wouldn’t include otherwise. This is my MANIFEST.in file:

include LICENSE.txt
include README.rst
include HISTORY.rst

Commit all those changes. Have a clean repo.

Commit everything we’ve done so far. Use a consistent commit message for those changes. I usually use something like: “Prep for 0.2.5 release.”

Add/create a git tag

git tag 1.2.3 -m "Adds 1.2.3 tag for PyPI"

Once you have the project in the state you want for creating the release, you add a git tag with the version number of the release. This will be reflected in the “releases” page of your GitHub repository.

Push git tag to remote

git push --tags origin master

Push those tags to GitHub.

Confirm that GitHub has generated the release file

Browse to your releases page (example) and make sure the new version has a release entry with its corresponding files.

Release testing

python setup.py register -r pypitest

You are, or aspire to be, a good programmer who wants to be as cautious as possible, and so you’d like to test releasing the update on PyPI before actually doing it.

PyPI provides a test system that you can use to test the registration, upload, and installation of your package. Let’s make use of that gift.

python setup.py register -r pypitest will register the package on the pypitest server.

If you get a message indicating the all-clear, continue.

python setup.py sdist upload -r pypitest

… will upload the distribution of your package to pypitest.

If something in the package is broken, you cannot make changes and reupload the package with the same version number. This means that if you want to make a change or fix something, you will have to change the version number in setup.py (and accordingly everywhere else) and start all over again.3

pip install -i https://testpypi.python.org/pypi simplestatistics

… is the final step in the process of testing the release. This will try to install the new version of the package from the test server. If all goes well, you should be able to import the package normally in the REPL or in a Python script.

Once you’ve tested importing the package, it’s a good idea to pip uninstall your package so you can test the installation from the live PyPI servers.
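For this package, that’s simply:

pip uninstall simplestatistics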

Release

python setup.py register -r pypi

You are finally at the actual release stage. This will register the new version with PyPI.

python setup.py sdist upload -r pypi

… uploads the distribution to PyPI.

pip install simplestatistics

… tests the installation from PyPI. This is why we pip uninstalled the version from pypitest.

Add changelog notes to GitHub release page/tag

This is not necessary, but it’s good practice and shows care for maintaining documentation of your open source project. Edit the new GitHub release and add notes about what changed in nicely formatted Markdown.

Congratulations!

You made it. It feels good to have a package on PyPI. It helps you use your own package in the future, and it’s a good contribution to make it available to everyone else. Go celebrate.

See also

  1. Which it is! pip install simplestatistics

  2. The thing I like most about Sphinx is that you write the code, and in the case of simplestatistics the tests, in the docstrings of each .py file. The thing I like the least is that you have to use reStructured text, which is a bouquet of sadness.

  3. At the time of writing. Things might change and become more flexible in the future. I hope so.

Recently

I’m not a big fan of the months when the post right before the month’s Recently is the previous month’s Recently.

Code

I released simplestatistics 0.2.0 with added functions for:

  • linear_regression_line() takes a (m, b) tuple, where m is the slope and b is the y intercept, and returns a function that calculates y values for given x values.
  • root_mean_square()
  • interquartile_range()
  • sum_nth_power_deviations() to calculate sum of the deviations raised to Nth power.

Since that release, I added a harmonic_mean() function, and the ability to train a naive Bayesian classifier using the bayesian_classifier() class. The classifier feels like a significant milestone.

I think I’m on track to release version 1.0 in a month.

Reading

Articles

Books

  • Saddam Hussein: The Politics of Revenge by Saïd Aburish (2001).

    My Goodreads review:

    This is a well-written book on the historical context around the rise of Saddam Hussein and the progression of his rule.

    As I do with any book about the Middle East, I take what’s written in it with a grain of salt (and especially the last 10%). But it gave me quite a bit to think about, and the author, Saïd Aburish, is persuasive that his criticism of Western actions is not a criticism of Liberal values or morals, and that it comes neither from anti-Western ideology nor from the belief in Western conspiracies that fills many Arab minds.

    I knew little about Saddam before reading this book, and I can recommend it as a good place to start.

Recently
[Photo: Book Bazaar, Ottawa]

Code

I released simplestatistics 0.1.5, with added functions for:

  • factorial()
  • choose()
  • binomial() to calculate binomial distribution probabilities.
  • normal() to calculate normal distribution probabilities.
  • kurtosis() to calculate kurtosis/”tailedness” of a probability distribution of a variable.
  • skew() to calculate Pearson’s moment coefficient of skewness.
  • linear_regression() to calculate slope (m) and y intercept (b) of line of best fit.

And reorganized and improved documentation for all functions.

On the currently-not-open-source front, I rejuvenated an old iOS Swift project that I wrote to be able to see and edit my task list on my phone. I use t to manage my todos, and named the iOS app accordingly.

The project was a good excuse for me to learn Swift, iOS programming, and the Dropbox iOS SDK as well as get a taste of the challenges of writing software that edits files and syncs them over the network while avoiding sync conflicts and out-of-date UI (brief report: the challenges are considerable).

I’m still figuring out background fetch, but overall it’s been fun and a success.

Reading

Books

Listening

“Tinker Tailor Soldier Spy” by Alberto Iglesias (2011)

“LateNightTales: Bonobo” by Bonobo (2013)
