Wrangling JSON Data

Once again, I return to Pleiades, and my quest for excellent source data. When I first downloaded and explored their JSON data, it was confusing to me. Even with the flattening applied, the result was one, massive list, divided into 2 main groups. Those lists had more lists, several layers deep in some sets. But even then, it was easy to see how the data lined up. 

I clicked on the list to bring up the R Studios view tab. This allowed me to collapse and expand the lists to get a sense of what was in there. If you want to keep it all in-console, use glimpse(), then use the $ to select a list from the level below. You can use $ for as many levels of data there are in the list.

R Studio – View(pleiades_full_json)– view of the file with the @graph list expanded.
glimpse(pleiades_full_json$@graph) – view of the same list. <list> types shown, then individual structures are shown with dimensions.

What I really wanted was the “id”, or the unique record ID that identifies every place in Pleiades. That should allow me to pair those ID’s up to any of these data sets and maintain the fidelity of the information. I have always wanted those references, so I went for that, first.

  • Pull out both the id list and the references list from the full list, then convert them into their own objects called plei_ids and plei_refs, respectively.
  • Pull the 2 objects together in one tibble at the same time. The id will pair up to the respective records in the references table. Because one id may refer to multiple references, a nested value is created in the new tibble for the references column. unnest(references) will expand these into their own rows.
  • Note: places with many references will therefore generate many new rows. If there are more lists, they will show up as they have previously, nested inside the columns. It looks like that is not the case here, since my URL field has been converted to <chr>:
The data set with the above steps applied.

Did a little spot checking.

  • select() only the id column and use distinct() just to make sure I am getting 100% unique id’s. I then sample 3 of them at random using slice_sample(n = 3).
  • filter() to only those id’s chosen and see what URL’s are tied to them.
  • Check each of those id’s on https://pleiades.stoa.org/ and make sure the same references are being identified under the References section of the place’s page. I did this a few times to make sure.

I thought this looked good! At some point, I want to use these references to do some more exploration. There are a ton of other database systems linked in here (like Trismegistos!) that would be really fun to poke at.

Next, I wanted to check out the Locations data. At first glance, the data looked bonkers – tons of nested tables. However, it’s really not that bad once you understand what’s in the set. Many of these nested tables were available in the top-level of the list, which would allow you to link the id directly to the value, rather than trying to unnest the columns in the larger set – exactly as I did with the references set above. I was interested in the representative coordinates (latitude, longitude), which are stored in pleiades_full_json$@graph$reprPoint.

  • Store an object called plei_rep_locs with the data from pleiades_full_json$`@graph`$reprPoint.
  • The next steps flow together to make the final location set:
    • Create a tibble called pleiades_locations_id_match that combines plei_ids and location_data (as we did before).
    • The location_data comes over as a double-valued list. If you unnest(), you will create a new record for each coordinate (latitude will be on one line, longitude under it). That is not helpful. Instead, we can use unnest_wider() to coerce those values into 2 columns.
      • names_sep = "_" tells unnest to use the column name as the base name for the new columns, and apply an underscore ( _ ) to separate the name and the column position of the set. Since I only have 2 coordinates, this will create location_data_1 and location_data_2.
    • Finally, rename the columns so you know for sure which coordinate is which.
  • Create an empty list(). Here mine is called pleiades_from_json_cleaned_tidied.
  • Add your sets to the list. The name inside the brackets [[ ... ]] will be the name inside your list. Name it something meaningful.
  • Save as an .rds. Here, I have a predefined value for my objects directory, but this can be any place you want to save the file. Again, name the .rds something meaningful.

Leave a Reply

Your email address will not be published. Required fields are marked *