My lab investigates how blood pressure can be treated more effectively. Much of that work involves the painstaking development of new concepts and research methods to advance the state of the art. For example, our work on urinary extracellular vesicles’ mRNA as an ex vivo assay of the ligand-activated transcription factor activity of mineralocorticoid receptors is challenging, fun, and rewarding. With a lot of work from Andrea Berrido and Pradeep Gunasekaran in my lab, we have been moving the ball forward on several key projects on that front.
Awaiting regulatory approvals for those projects has given me an opportunity to turn to public data to further develop my data analysis skills and, I hope, to publish some important insights. This post documents what I’m learning, and still need to learn, about analyzing data in the popular JSON format; it’s a real-time look at what I’m thinking about today. The post is inspired by the openness and sharing I find in the computer science community, in particular from my brother, Will Byrd (@webyrd) (see http://webyrd.net/). It’s also directly inspired by Eric Vallabh Minikel’s (@cureffi) blogging.
After Vivek Bhalla (@Stanford_HTN_Center), Glenn Chertow (@gchertow), and I wrote in the New England Journal of Medicine about a series of recalls of specific angiotensin receptor blocker products for nitrosamine impurities, and then watched new recalls of ranitidine products for the same class of impurities, I became more interested in how FDA regulation works in practice. This makes sense, I suppose, because I completed a clinical pharmacology research fellowship before my cardiology fellowship, and because I treat patients.
I’ve been exploring the OpenFDA API, designed by Taha Kass-Hout (@kasshout) before his move to Amazon as Chief Medical Officer.
First, the OpenFDA website carries some disclaimers, which you should read before deciding whether the data are suitable for whatever purpose you have in mind.
The OpenFDA API is built on Elasticsearch, an open-source search and analytics engine. Cleverly, it’s possible to query the databases through the API by writing straightforward URLs that embed your questions. And the code powering OpenFDA’s API is available on FDA’s GitHub. All in all, a very cool initiative.
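For example, here is a sketch of one of those URL queries, run against the drug enforcement (recall) endpoint and read straight into R with jsonlite. The endpoint path, field name, and search syntax follow my reading of the openFDA documentation, so treat this as illustrative rather than definitive:

library(jsonlite)

# The query is just a URL: an endpoint plus search parameters
url <- "https://api.fda.gov/drug/enforcement.json?search=product_description:valsartan&limit=5"
recalls <- fromJSON(url)

# The matching records arrive under the top-level "results" element
recalls$results$product_description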
There are interactive visualization demos on the OpenFDA site that show off some of what can be done with the data.
It’s also possible to query OpenFDA using an R package, ‘openfda’. Unfortunately, I have found very few examples of how to construct a query with it.
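For what it’s worth, here is the one pattern I’ve been able to piece together, adapted from the package’s README; the endpoint and field names are illustrative, so treat this as a sketch:

library(openfda)  # devtools::install_github("ropenhealth/openfda")

# Build a query with pipes, then execute it with fda_exec()
fda_query("/drug/event.json") %>%
  fda_filter("patient.drug.openfda.generic_name", "valsartan") %>%
  fda_count("patient.reaction.reactionmeddrapt.exact") %>%
  fda_exec()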
One can also download the OpenFDA datasets (called “endpoints”) in JSON format.
JSON is a hierarchical data format, which can be confusing for those accustomed to tidy data. Tidy data are defined as data in which:
“Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.”
In contrast to tidy data, JSON’s hierarchical format allows things like a version, date, or disclaimer to appear just once, at the highest level of the hierarchy, and apply to everything in the dataset without being repeated in each row. That’s clearly an efficiency in many settings, so I see the appeal. Also, one of the developers of OpenFDA has explained that the data are highly relational, meaning a record might be related to multiple other records; to ‘flatten’ this into two dimensions is inefficient. Finally, individual entries in a JSON document are human-readable, thanks to the straightforward structure of the string comprising a JSON entry.
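To make that concrete, here is a stripped-down sketch of the shape of an openFDA download. The top-level “meta” and “results” names match what I see later when importing the data; the fields inside each record are illustrative, not exact:

{
  "meta": {
    "disclaimer": "…",
    "last_updated": "2019-10-10"
  },
  "results": [
    { "product_description": "…", "openfda": { "generic_name": ["…"] } },
    { "product_description": "…", "openfda": { "generic_name": ["…"] } }
  ]
}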
Those are advantages of JSON. A big disadvantage is that it’s difficult to look at the entire dataset in tabular format. That makes sense, since JSON documents often contain tables-within-tables. Before you say, “That’s a feature, not a bug; its design is not that of a flat table, but a hierarchy,” I’ll stipulate that point. That’s fine, but I’m interested in analyzing and visualizing the data, and I find that a flat, tidy data file is helpful for that purpose.
I am more familiar with R than with Python, although both offer packages for reading data in JSON format, and both can be used to flatten or unnest JSON data. But how is it done?
I would characterize getting the JSON file into R or Python as an easy task. There are packages for R and Python that are designed to read JSON data.
The next step in my analytical plan involves putting the data into a tidy structure. My rationale: when exploring an unfamiliar dataset, the ability to see the data and how they are represented, in an intuitive, visual way, is essential.
My experience & my code so far:
The data that interest me can be downloaded from openFDA’s website in JSON format and read into R and then made into a dataframe easily enough.
At this point, let’s load some packages that could be useful for our data wrangling:
library(jsonlite)  # flexible, robust, high-performance tools for working with JSON in R
library(dplyr)     # a fast, consistent tool for working with data-frame-like objects
library(tidyverse) # attaches tidyr, purrr, and friends; note that purrr also exports flatten()
Now let’s load the JSON file:
path <- "<path to JSON file>"
Df <- as.data.frame(fromJSON(path))
Within the dataframe is a series of variables that are named ‘meta.variablename’ and ‘results.variablename’. I can see from importing the data into Python that ‘meta’ and ‘results’ are top-level divisions of the hierarchy (Python code below). Back to the R representation of the dataframe… We have 10,417 observations of 32 variables. Among the variables is results.openfda, which is itself a dataframe containing observations of additional variables. We don’t yet have a flat, tidy dataframe.
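A quick way to confirm that nesting from the R console (the arguments to str just keep the printout short):

str(Df, max.level = 1, list.len = 5)   # peek at the top level only
class(Df$results.openfda)              # the nested column is itself a "data.frame"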
@chris_wright suggested a very nice online tutorial
I found this really helpful: https://t.co/O5JbWwd0Go
— chris_wright (@chris_wright) October 11, 2019
Of course, it uses the tidyverse, which makes R so much better for me…
Following the tutorial, I converted the top-level dataframe into a tibble, a special type of dataframe. This pipeline also wraps the nested openfda dataframe in a list, making it a list-column. The operation completes quickly and successfully:
tibbleDf <- Df %>%
  discard(is_empty) %>%             # drop any empty elements
  map_if(is.data.frame, list) %>%   # wrap nested data frames so they become list-columns
  as_tibble()
The next step in the tutorial I followed is to use ‘unnest’ to unnest the list. I feared this might be too simple, given that the structure of a JSON file can in principle be very complicated, with lots of nesting. But I would be able to inspect the data in RStudio and come to some conclusions about the success of the procedure.
unnested_Df <- unnest(tibbleDf)
Unfortunately, running unnest gives me this error:
'Error: can not allocate vector of size 827.9 Mb'
— J. Brian Byrd, MD, MS 🐦 (@thebyrdlab) October 12, 2019
Is this #RStats' way of saying 'I love you, but you're gonna need aws for this one…"?
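One thing I may try next, if and when I revisit this: unnesting less at a time. A sketch, untested on the full dataset, assuming tidyr >= 1.0 (for the cols argument) and that results.openfda is the nested column of interest:

# Unnest just one list-column, keeping only the columns needed,
# to shrink the intermediate object that blew past memory above
partial_Df <- tibbleDf %>%
  select(results.openfda) %>%
  unnest(cols = c(results.openfda))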
I’ve also tried a flatten function. Note that as I wrote this post, I realized that two of the loaded packages export conflicting flatten() functions: jsonlite::flatten(), which flattens nested data-frame columns, and purrr::flatten() (attached via the tidyverse), which removes one level of hierarchy from a list. Because the tidyverse was loaded after jsonlite, purrr’s version masks jsonlite’s:
flatDf <- flatten(Df)  # unqualified, so this resolves to purrr::flatten()
I end up with a list of length 322948, which on preliminary inspection appears to comprise a list of integers with a value of 1. In any event, it is not what I’m seeking, or close to it; that behavior is consistent with purrr’s flatten() having been the version called.
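Given the masking, qualifying the call with the namespace operator should run jsonlite’s version instead, which flattens nested data-frame columns (like results.openfda) into ordinary top-level columns. I haven’t yet verified this on the full dataset:

# Explicitly request jsonlite's flatten(), sidestepping purrr's mask
flatDf <- jsonlite::flatten(Df)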
I received some advice via Twitter from @statesdj:
Or you need to move your data to an RDBMS, like Postgres, do the heavy lift data slicing and dicing there, and pull the result set you really need into R w Rpostgres
— David States (@statesdj) October 12, 2019
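As I understand the suggestion, the workflow would look roughly like this sketch, which I have not run; the connection details and table name are hypothetical, and the data frame would need to be fully flattened first:

library(DBI)

# Stage the records in Postgres, do the heavy filtering there,
# and pull back only the result set actually needed in R
con <- dbConnect(RPostgres::Postgres(), dbname = "openfda")
dbWriteTable(con, "recalls", flatDf, overwrite = TRUE)
slice <- dbGetQuery(con, "SELECT * FROM recalls LIMIT 10")
dbDisconnect(con)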
That’s a suggestion I’m going to table for now since I have no familiarity with RDBMS / Postgres. But it sounds interesting.
I also learned via @troutgirl that @simonw would use Python for this type of task.
I see it’s very easy to import the data into Python. Since I’m used to RStudio, I’m using Spyder as an IDE; it seems the closest to RStudio, though not quite as slick.
import json
import pandas as pd
from pandas.io.json import json_normalize  # package for flattening JSON into a pandas df

# load the JSON object
with open(r'<path to JSON file>') as f:
    d = json.load(f)
In Python, I can read in the data, with variable name assignment very similar to what the R JSON reader was doing:
combinedData = json_normalize(d)
Also, I am able to read the two top-level hierarchies smoothly, so that they remain separate in the final data structure:
meta = json_normalize(d['meta'])
meta.head(3)
results = json_normalize(d['results'])
results.head(3)
However, none of these structures achieves a tidy and accurate representation of the data one level deeper in the hierarchy.
When I instead flatten the data using a function by that name (below), I get a dict of size 308375. It appears to contain all of the data I want, but squashed into a single flat structure. This is starting to feel like whack-a-mole, and that’s why I’m posting: to share what I’ve learned so far, and to learn more.
def flatten_json(y):
    # recursively flatten nested dicts and lists into one dict
    # with underscore-joined keys
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x  # drop the trailing underscore

    flatten(y)
    return out
flat = flatten_json(d)
json_normalize(flat)
Other approaches to opening JSON files include Excel (not an approach that interests me) and special software intended for viewing JSON files. I checked out both out of curiosity, and neither seemed helpful.
If you have ideas about how I can accelerate my understanding of JSON files and flattening these sometimes complicated files, please don’t hesitate to tweet at me: @thebyrdlab