This post is Part 2 of a 2-part series about using R to download and examine bookmarks using the httr package and the Pinboard API.

Part 1: Talking to the Pinboard API with R

In the previous post I described how to use the Pinboard API to download your bookmarks with the help of the httr R package. In this post I’ll talk about some of the munging, aggregation, and exploration I did with the data.

Descriptives

There are 4944 links in the dataset.
The earliest link is dated 2011-12-09 23:47:21 EST.
The last link is dated 2016-06-23 16:14:31 EDT

Top 10 hosts in my bookmarks, in descending order:

hostname count comments
twitter.com 434 You can tell Pinboard to archive all tweets that you favorite.
economist.com 391  
leancrew.com 168 I’m a big fan of Dr. Drang’s blog
thedailywtf.com 76 I used to read TheDailyWTF a lot. I stopped a long time ago. Should probably delete those bookmarks.
marco.org 72  
github.com 70  
stackoverflow.com 68  
inessential.com 66 I am also a big fan of Brent Simmons’ inessential.
objc.io 60  
youtube.com 60 I pay for Pinboard’s archiving account, so I bookmark YT videos I think might disappear in the future.

Code

Let’s jump in! I’ll start by loading some libraries, followed by the data.

require(ggplot2)
require(httr) # for getting headers and parsing server responses
require(jsonlite) # convert json to a data frame
require(magrittr) # %>% 
require(purrr) # to use map*() functions
require(scales) # to use date_breaks() in scale_x_datetime()

I need to read the .RData file that the code from the last post saved for me, and convert the json to an R dataframe.

# load the pins response object from file
# this was graciously saved there for us by get_pins.R
setwd('~/projects/pins/')
load('pins.RData')
pins <- content(pins_all) %>% fromJSON()

Great, that was easy enough.

Next, I need to do some cleanup. This involves converting two variables (shared and toread) to factors, which I do using %>% as.factor().

I also need to clean up the timestamps Pinboard returns, which are in UTC/GMT timezone and come in this unusual format: 2016-06-23T23:45Z.

I will format and convert them to POSIXct type, and put them in my US/Eastern timezone.

Cleanup

pins$shared <- pins$shared %>% as.factor()
pins$toread <- pins$toread %>% as.factor()

# pinboard exports timestamps in UTC/GMT, and in this format: '2016-06-23T23:45Z'
# I want them in POSIX, and in US/Eastern

# convert to POSIX with timezone information as GMT
pins$time <- as.POSIXct(pins$time, 
                        format = '%Y-%m-%dT%H:%M:%SZ', 
                        tz = 'GMT')
# convert to US/Eastern timezone
pins$time <- format(pins$time, tz = 'US/Eastern') %>% 
  as.POSIXct()

This concludes the cleanup of the data that came out of Pinboard.

Now I need to create some of my own.

Extracting hostnames

I’m interested in seeing how many links I have for each host. I’m also interested in finding out which hosts have the most broken links.

To figure that out, first I need to extract the hostnames out of the urls. I wrote a function get_hostname() which accepts a url and attempts to use the httr::parse_url() function to extract the hostname.

If it successfully extracts a hostname, the function will remove “www.” from the beginning of the string if it exists. I do this because some hostnames, like http://leancrew.com, used to be “www.leancrew.com”. I have bookmarked links with both urls, and without this fix, the aggregates and counts will be split between the two hostnames, and therefore, wrong.

I also do this for cleanliness.

# oooo                     .                                 
# `888                   .o8                                 
#  888 .oo.    .oooo.o .o888oo ooo. .oo.   ooo. .oo.  .oo.   
#  888P"Y88b  d88(  "8   888   `888P"Y88b  `888P"Y88bP"Y88b  
#  888   888  `"Y88b.    888    888   888   888   888   888  
#  888   888  o.  )88b   888 .  888   888   888   888   888  
# o888o o888o 8""888P'   "888" o888o o888o o888o o888o o888o 

## hostname
get_hostname <- function(href) {
  tryCatch({
    parsed_url <- parse_url(href)
    if (!parsed_url$hostname %>% is.null()) {
      hostname <- parsed_url$hostname %>% 
        gsub('^www.', '', ., perl = T)
      return(hostname)  
    } else {
      return('unresolved')
    }
    
  }, error = function(e) {
    return('unresolved')
  })
}

With the function written, now I need to apply it to all the rows in the dataframe.

I could have used one of the lapply family of functions. But I thought I would try the map* family from the purrr package instead, and I have no complaints.

pins$hostname <- map_chr(pins$href, .f = get_hostname)

Let’s take a look at the top 10 bookmarked sites in my account.

I’ll make a table of the counts of different hosts, convert it into a dataframe, give it column names, and order the dataframe in descending order of hostname frequency.

knitr::kable() will convert the R table into a more readable and blog-friendly Markdown table.

## aggregate hostnames
hostname.freq <- pins$hostname %>% 
  table() %>% 
  as.data.frame()
colnames(hostname.freq) <- c('hostname', 'count')
hostname.freq <- hostname.freq[order(-hostname.freq$count),]
rownames(hostname.freq) <- c(1:nrow(hostname.freq))
# knitr::kable makes pretty markdown tables from R output
knitr::kable(hostname.freq[,c('hostname', 'count')] %>% 
               head(10), row.names = T, align = c('c'))
  hostname count
1 twitter.com 434
2 economist.com 391
3 leancrew.com 168
4 thedailywtf.com 76
5 marco.org 72
6 github.com 70
7 stackoverflow.com 68
8 inessential.com 66
9 objc.io 60
10 youtube.com 60

Sweet!

What about number of bookmarks added by month?

pins$month <- strftime(pins$time, format = '%Y-%m')
links.month <- pins$month %>% 
  table() %>% 
  as.data.frame()
colnames(links.month) <- c('month', 'count')
links.month$month <- links.month$month %>% 
  paste0('-01') %>% 
  as.POSIXct()

ggplot(links.month, aes(x = month, y = count)) +
  geom_point() +
  geom_line(group = 1) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
        ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                 labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Bookmarks') +
  ggtitle('Links added per month')

The scale is ruined by the extreme number in the first month. I imported a bit over 400 bookmarks into Pinboard when I started my account, and that’s what the outlier is about.

Let’s exclude it.

ggplot(links.month[-1, ], aes(x = month, y = count)) +
  geom_point() +
  geom_line(group = 1) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
        ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                 labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Bookmarks') +
  ggtitle('Links added per month')

That’s better.

URL responses

It’s now time for the big show. I want to check the server response for each of the 4944 urls to see which ones are dead or timing out.

To do this, I wrote another function, check_url_response() which, like get_hostname() accepts a url.

The function tries to get the url headers using the httr::HEAD() function, and then tries to return the $status_code from the response object.

It might fail, like when the request times out. If that happens, the function returns 0.

For reference, here is a list of all HTTP status codes.

#  .ooooo.  oooo d8b oooo d8b  .ooooo.  oooo d8b  .oooo.o 
# d88' `88b `888""8P `888""8P d88' `88b `888""8P d88(  "8 
# 888ooo888  888      888     888   888  888     `"Y88b.  
# 888    .o  888      888     888   888  888     o.  )88b 
# `Y8bod8P' d888b    d888b    `Y8bod8P' d888b    8""888P' 

check_url_response <- function(href) {
  cat('Checking', href, '...\n')
  tryCatch({
    check_head <- HEAD(href)
    return(check_head$status_code %>% as.integer())
  }, error = function(e) {
    return(0)
  })
}

This is an expensive operation time-wise. I recorded the time it took to check all links in my dataset using system.time():

    user   system  elapsed 
 641.169   13.808 3893.732 

3893 seconds, or about an hour of elapsed time on my Mac Mini.1

To avoid accidentally triggering this expensive process when I run the script, I check to see if I’ve already saved a file titled responses.RData with these response codes. If the file is there, the script will load it and move on, and will only start checking urls if it can’t find the file.2

Then I add the responses as a new variable to the pins dataframe.

if (file.exists('responses.RData')) {
    load('responses.RData')
} else {
  response_time <- system.time(responses <- map(pins$href, .f = check_url_response))
  save(response_time, file = 'responses.RData')
}

pins$response <- unlist(responses)

The httr package has another response-related function: http_status(), which translates the response code to a human readable description.

Let’s check it out.

http_status(502)
## $category
## [1] "Server error"
## 
## $reason
## [1] "Bad Gateway"
## 
## $message
## [1] "Server error: (502) Bad Gateway"

The function returns an object with $category, $reason, and $message.

response.aggs <- pins$response %>% 
  table() %>% 
  as.data.frame(stringsAsFactors = F)
colnames(response.aggs) <- c('response', 'count')
response.aggs <- response.aggs[order(response.aggs$count, decreasing = T),]

parse_response <- function(response_code) {
  if (response_code == '0' | response_code == '522') { return('Timed out')}
  if (response_code == '520') {return('Unknown error: (520)')}

  status <- response_code %>% 
    as.integer() %>% 
    http_status()
  return(status$message)
}
response.aggs$description <- map(response.aggs$response, .f = parse_response) %>% 
  unlist()
rownames(response.aggs) <- c(1:nrow(response.aggs))
response.aggs %>%  knitr::kable(row.names = T)
  response count description
1 200 4644 Success: (200) OK
2 404 164 Client error: (404) Not Found
3 0 34 Timed out
4 405 28 Client error: (405) Method Not Allowed
5 401 13 Client error: (401) Unauthorized
6 403 13 Client error: (403) Forbidden
7 429 12 Client error: (429) Too Many Requests (RFC 6585)
8 409 6 Client error: (409) Conflict
9 400 5 Client error: (400) Bad Request
10 500 4 Server error: (500) Internal Server Error
11 406 2 Client error: (406) Not Acceptable
12 503 2 Server error: (503) Service Unavailable
13 520 2 Unknown error: (520)
14 204 1 Success: (204) No Content
15 415 1 Client error: (415) Unsupported Media Type
16 502 1 Server error: (502) Bad Gateway
17 522 1 Timed out

I expected more links to be dead (404 + the ones that timed out).

I think some urls are getting redirected without me being able to detect it – note the absence of any 3xx (redirection) status codes. Also, note that I got 12 “Too Many Requests” errors, which probably would have been 200 (OK) if I wasn’t making too many requests in such short time.

Errors over time

Let’s look into the breakdown of errors over time.

# consider anything other than 200 an error
# an imperfect heuristic, but a satisfactory approximation
pins$error <- ifelse(pins$response == 200, 0, 1)
pin.errors.month <- aggregate(error ~ month, data = pins, 
                              FUN = sum)

pin.errors.month$month <- pin.errors.month$month %>% 
  paste0('-01') %>% 
  as.POSIXct()

ggplot(pin.errors.month, aes(x = month, y = error)) +
  geom_point(color = 'red') +
  geom_line(group = 1, color = 'red') +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
        ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                 labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Errors') +
  ggtitle('Errors over time')

That’s not a great plot for the same reason the first links per month plot wasn’t a good plot. There’s an extreme value in the first month that ruins the scale.

Let’s exclude that first month.

ggplot(pin.errors.month[-1, ], aes(x = month, y = error)) +
  geom_point(color = 'red') +
  geom_line(group = 1, color = 'red') +
  geom_smooth(se = F) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
        ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                 labels = date_format('%Y-%m')
  ) +
  xlab('') +
  ylab('Errors') +
  ggtitle('Errors over time')

That’s a surprise to me. I expected to find the highest number of errors in the oldest links, instead, the highest number of errors comes from links that I added in December of 2014.

As with most things data, this plot might imply something that’s not true: that the number of response errors is at all related to time.

If we check the errors against the link count data, we’ll find that they are highly correlated, which is not a surprise.

cor.test(links.month$count, pin.errors.month$error)
## 
##  Pearson's product-moment correlation
## 
## data:  links.month$count and pin.errors.month$error
## t = 13.94, df = 53, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8120584 0.9324242
## sample estimates:
##       cor 
## 0.8863938

A correlation of .88 is large. Further investigation would help, but it might be that once we partial out the number of links added per month, there is no significant relationship between time and number of broken links.

Let’s look at errors aggregated by host.

Errors by host

hostname.errors <- aggregate(error ~ hostname, data = pins, 
                             FUN = sum)
hostname.errors <- hostname.errors[order(hostname.errors$error),] %>% 
  tail(20)

hostname.errors$hostname <- hostname.errors$hostname %>% 
  factor(levels = hostname.errors[, 'hostname'])

ggplot(hostname.errors, aes(x = hostname, y = error)) +
  geom_bar(stat = 'identity', fill = '#8FC3Eb') +
  coord_flip() +
  scale_y_discrete(limits = seq(0, 30, by = 2)) +
  theme_bw() +
  theme(panel.border = element_blank()) +
  xlab('Hosts') +
  ylab('Errors') +
  ggtitle('Errors by host')

We can’t conclude a lot from this plot. It should be the beginning of an examination of the data, not the end of it.

For example, https://reddit.com comes fourth with 12 “errors”. But if we dig deeper we find:

pins[, c('hostname', 'response')] %>% 
  subset(hostname == 'reddit.com' & response != 200) %>% 
  knitr::kable(row.names = F)
hostname response
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429
reddit.com 429

All the errors are Client errors: (429) Too Many Requests (RFC 6585). They’re not because reddit screwed up, they are my fault for hitting their servers more often than I should.

Full code

require(ggplot2)
require(httr) # for getting headers and parsing server responses
require(jsonlite) # convert json to a data frame
require(magrittr) # %>% 
require(purrr) # to use map*() functions
require(scales) # to use date_breaks() in scale_x_datetime()

# load the pins response object from file
# this was graciously saved there for us by get_pins.R
setwd('~/projects/pins/')
load('pins.RData')
pins <- content(pins_all) %>% fromJSON()

pins$shared <- pins$shared %>% as.factor()
pins$toread <- pins$toread %>% as.factor()

# pinboard exports timestamps in UTC/GMT, and in this format: '2016-06-23T23:45Z'
# I want them in POSIX, and in US/Eastern

# convert to POSIX with timezone information as GMT
pins$time <- as.POSIXct(pins$time, 
                        format = '%Y-%m-%dT%H:%M:%SZ', 
                        tz = 'GMT')
# convert to US/Eastern timezone
pins$time <- format(pins$time, tz = 'US/Eastern') %>% 
  as.POSIXct()

# oooo                     .                                 
# `888                   .o8                                 
#  888 .oo.    .oooo.o .o888oo ooo. .oo.   ooo. .oo.  .oo.   
#  888P"Y88b  d88(  "8   888   `888P"Y88b  `888P"Y88bP"Y88b  
#  888   888  `"Y88b.    888    888   888   888   888   888  
#  888   888  o.  )88b   888 .  888   888   888   888   888  
# o888o o888o 8""888P'   "888" o888o o888o o888o o888o o888o 

## hostname
get_hostname <- function(href) {
  tryCatch({
    parsed_url <- parse_url(href)
    if (!parsed_url$hostname %>% is.null()) {
      hostname <- parsed_url$hostname %>% 
        gsub('^www.', '', ., perl = T)
      return(hostname)  
    } else {
      return('unresolved')
    }
    
  }, error = function(e) {
    return('unresolved')
  })
}

## aggregate hostnames
hostname.freq <- pins$hostname %>% 
  table() %>% 
  as.data.frame()
colnames(hostname.freq) <- c('hostname', 'count')
hostname.freq <- hostname.freq[order(-hostname.freq$count),]
rownames(hostname.freq) <- c(1:nrow(hostname.freq))
# knitr::kable makes pretty markdown tables from R output
knitr::kable(hostname.freq[,c('hostname', 'count')] %>% 
               head(10), row.names = T, align = c('c'))

pins$month <- strftime(pins$time, format = '%Y-%m')
links.month <- pins$month %>% 
  table() %>% 
  as.data.frame()
colnames(links.month) <- c('month', 'count')
links.month$month <- links.month$month %>% 
  paste0('-01') %>% 
  as.POSIXct()

ggplot(links.month, aes(x = month, y = count)) +
  geom_point() +
  geom_line(group = 1) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                   labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Bookmarks') +
  ggtitle('Links added per month')

ggplot(links.month[-1, ], aes(x = month, y = count)) +
  geom_point() +
  geom_line(group = 1) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                   labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Bookmarks') +
  ggtitle('Links added per month')

#  .ooooo.  oooo d8b oooo d8b  .ooooo.  oooo d8b  .oooo.o 
# d88' `88b `888""8P `888""8P d88' `88b `888""8P d88(  "8 
# 888ooo888  888      888     888   888  888     `"Y88b.  
# 888    .o  888      888     888   888  888     o.  )88b 
# `Y8bod8P' d888b    d888b    `Y8bod8P' d888b    8""888P' 

check_url_response <- function(href) {
  cat('Checking', href, '...\n')
  tryCatch({
    check_head <- HEAD(href)
    return(check_head$status_code %>% as.integer())
  }, error = function(e) {
    return(0)
  })
}

if (file.exists('responses.RData')) {
load('responses.RData')
} else {
response_time <- system.time(responses <- map(pins$href, .f = check_url_response))
save(response_time, file = 'responses.RData')
}

pins$response <- unlist(responses)

http_status(502)

response.aggs <- pins$response %>% 
  table() %>% 
  as.data.frame(stringsAsFactors = F)
colnames(response.aggs) <- c('response', 'count')
response.aggs <- response.aggs[order(response.aggs$count, decreasing = T),]

parse_response <- function(response_code) {
  if (response_code == '0' | response_code == '522') { return('Timed out')}
  if (response_code == '520') {return('Unknown error: (520)')}
  
  status <- response_code %>% 
    as.integer() %>% 
    http_status()
  return(status$message)
}
response.aggs$description <- map(response.aggs$response, .f = parse_response) %>% 
  unlist()
rownames(response.aggs) <- c(1:nrow(response.aggs))
response.aggs %>%  knitr::kable(row.names = T)

# consider anything other than 200 an error
# an imperfect heuristic, but a satisfactory approximation
pins$error <- ifelse(pins$response == 200, 0, 1)
pin.errors.month <- aggregate(error ~ month, data = pins, 
                              FUN = sum)

pin.errors.month$month <- pin.errors.month$month %>% 
  paste0('-01') %>% 
  as.POSIXct()

ggplot(pin.errors.month, aes(x = month, y = error)) +
  geom_point(color = 'red') +
  geom_line(group = 1, color = 'red') +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                   labels = date_format('%Y-%m')) +
  xlab('') +
  ylab('Errors') +
  ggtitle('Errors over time')

ggplot(pin.errors.month[-1, ], aes(x = month, y = error)) +
  geom_point(color = 'red') +
  geom_line(group = 1, color = 'red') +
  geom_smooth(se = F) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_x_datetime(breaks = date_breaks('2 months'), 
                   labels = date_format('%Y-%m')
  ) +
  xlab('') +
  ylab('Errors') +
  ggtitle('Errors over time')

cor.test(links.month$count, pin.errors.month$error)

hostname.errors <- aggregate(error ~ hostname, data = pins, 
                             FUN = sum)
hostname.errors <- hostname.errors[order(hostname.errors$error),] %>% 
  tail(20)

hostname.errors$hostname <- hostname.errors$hostname %>% 
  factor(levels = hostname.errors[, 'hostname'])

ggplot(hostname.errors, aes(x = hostname, y = error)) +
  geom_bar(stat = 'identity', fill = '#8FC3Eb') +
  coord_flip() +
  scale_y_discrete(limits = seq(0, 30, by = 2)) +
  theme_bw() +
  theme(panel.border = element_blank()) +
  xlab('Hosts') +
  ylab('Errors') +
  ggtitle('Errors by host')

pins[, c('hostname', 'response')] %>% 
subset(hostname == 'reddit.com' & response != 200) %>% 
knitr::kable(row.names = F)

Closing notes

The purpose of this post was learning and exploration rather than definitive investigation. I got some interesting aggregates and plots, but I can’t conclude anything with certainty without digging deeper. With the exception of one correlation test, all I did was data cleaning, aggregation, and plotting. To make any claims about this data, I need to use statistical methods instead of intuition.

It was fun to look around this dataset and learn about the httr and purrr packages in the process. It was also fun to learn how to write this entire post in R Markdown and make it ready for publishing in Jekyll with minimal touch-ups. That’s a story for another post though.

See also / references

  1. Not that the processor or machine should matter much. User and system times are a fraction (~17%) of the total elapsed time. The majority of waiting is probably on the network. 

  2. There is a problem with this solution: if I download a new list of bookmarks from the Pinboard API, the responses.RData file will be outdated. It will also be outdated in a month or two as more links change or die. When either of those happens, I will manually delete it, causing the script to recheck all links the next time I run it. Avoiding accidentally calling 4944 urls is important enough that I’m willing to skip automating that part of the process for now.