Category: fuzzyjoin

  • Finding Location Information with fuzzyjoin

    Now that I had some good location data, I wanted to use my coordinates to join the Pleiades data and see what …stuff is in the immediate area of my locations. To do this, I used a package called fuzzyjoin, which has a function called geo_inner_join (and geo_right_join, geo_left_join, etc.)* that lets you join two sets of coordinates based on distance, in either kilometers or miles. I decided that 2km (1.24 miles) would be a good place to start. The join took several seconds and produced a long data set with each identified Pleiades ID on its own line.

    # Read in object.
    locations_master  <- readRDS(paste0(objects_directory,"locations_reversed_joined.rds"))
    
    
    #attestations based on lat/long - this will take several seconds.
    locations_join <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 2,
        distance_col = "distance_from_original"
        )
    • Filter out the NA values in lat (which should remove any incomplete coordinates).
    • Rename my existing lat/long columns so that I can easily identify which frame they belong to after I perform my join.
    • Perform the geo_inner_join.
      • This will look at the list pleiades_from_json$pleiades_locations and filter out any incomplete coordinates.
      • Join the 2 sets: my orig_long to Pleiades’ long, and my orig_lat to Pleiades’ lat.
      • unit = c("km") – use kilometers (km) for the units.
      • max_dist = 2 – include all matches within 2km of my coordinates.
      • Put the distance value in a column called “distance_from_original“.
    • Save it all in an object called locations_join.
    • * – the inner_join variant means the script will keep only the records that satisfy the join condition.
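    Under the hood, a distance-based join like this boils down to computing the great-circle distance between each pair of points and keeping the pairs within max_dist. A minimal base-R sketch of that check, using the Haversine formula (the coordinates here are made up for illustration):

    ```r
    # Great-circle distance in km between two lat/long points (Haversine).
    haversine_km <- function(lat1, long1, lat2, long2, r = 6371) {
      to_rad <- pi / 180
      dlat  <- (lat2 - lat1) * to_rad
      dlong <- (long2 - long1) * to_rad
      a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
      2 * r * asin(sqrt(a))
    }

    # Two nearby hypothetical points: a pair survives an inner join
    # only if its distance is within max_dist.
    d <- haversine_km(34.75, 40.73, 34.76, 40.74)
    within_2km <- d <= 2
    ```

    The distance_col argument simply carries this computed distance (d above) along into the joined result.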

    I forgot that I didn’t have names paired with those coordinates I pulled from the JSON file. Good thing I have that object at the ready. I decided to use the title value to give me the basic name of the record associated with each Pleiades ID. The pleiades_full_json$`@graph`$names list has every single variation on a name in Pleiades, which is a little beyond what I needed (for the moment!). I made a set of IDs-to-titles and saved it to my pleiades_from_json_cleaned_tidied object to keep it handy. Then I joined it to locations_join.

    # As before:
    plei_ids <- pleiades_full_json$`@graph`$id
    
    # Grab titles.
    plei_titles <- pleiades_full_json$`@graph`$title
    
    # Pull IDs and titles together.
    pleiades_titles <- tibble(
      id = plei_ids,
      plei_title = plei_titles
    ) 
    
    # Save these to the big list real quick.
    pleiades_from_json_cleaned_tidied[["pleiades_titles"]] <- pleiades_titles
    
    # Save the list.
    saveRDS(pleiades_from_json_cleaned_tidied, paste0(objects_directory, "pleiades_from_json_cleaned_tidied.rds"))
    
    # Join the titles to the joined set so we can see the names of the matched records.
    locations_full_join_w_titles <- locations_join %>% 
      left_join(
        pleiades_titles,
        by = join_by(id == id))
    • Extract titles list and store in plei_titles.
    • Pull both sets together into one tibble called pleiades_titles.
    • Save this new set to the existing pleiades_from_json_cleaned_tidied list.
    • Join the pleiades_titles to locations_join using the id column in both data sets.
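    The left_join here keeps every row of locations_join and fills in NA where an id has no title. A toy base-R sketch of those semantics (the ids and titles below are made up), using merge() with all.x = TRUE, the base equivalent of a left join:

    ```r
    # Hypothetical joined matches and an id-to-title lookup table.
    joined <- data.frame(id = c("id_001", "id_002", "id_999"),
                         distance_from_original = c(0.4, 1.8, 0.9))
    titles <- data.frame(id = c("id_001", "id_002"),
                         plei_title = c("Site A", "Site B"))

    # all.x = TRUE keeps unmatched rows from the left table;
    # id_999 survives with plei_title = NA.
    with_titles <- merge(joined, titles, by = "id", all.x = TRUE)
    ```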

    Cool. Next I wanted to see which locations had the most attestations based on these coordinates.

    locations_full_join_w_titles %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))
    • Aggregate the data with summarize() and count each distinct id in each group, then sort with arrange(desc()) so the highest counts appear at the top.
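    The group-and-count-distinct step can be sketched in base R with tapply() (the cities and ids below are invented):

    ```r
    # Hypothetical matches: one row per (city, Pleiades id) pair,
    # with a duplicate id to show that only distinct ids are counted.
    matches <- data.frame(
      ancientcity = c("Dura", "Dura", "Dura", "Dura", "Palmyra", "Palmyra"),
      id          = c("a1", "a2", "a3", "a1", "b1", "b2")
    )

    # Count distinct ids per city, then sort descending: the base-R
    # analogue of group_by() + summarize(n_distinct(id)) + arrange(desc()).
    n_plei_records <- tapply(matches$id, matches$ancientcity,
                             function(ids) length(unique(ids)))
    n_plei_records <- sort(n_plei_records, decreasing = TRUE)
    ```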

    There were some famous, well-attested sites that came up, but where was Rome? Or Constantinople? I decided to widen my geo_inner_join to 5km (3.11 miles), then aggregate.

    locations_5km <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 5,
        distance_col = "distance_from_original"
      )
    
    locations_5km %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))

    How about 10km? (Note: this code is all run together and goes right to the summary.)

    locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 10,
        distance_col = "distance_from_original"
      ) %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))

    More obscure cities with tons of records. The wider join also took a little longer to perform, and the wider the search, the more location overlap I was likely to have. I decided that 5km might be the highest tolerance to apply in my search. And, at some point, it might be interesting to check out which IDs my records shared in common.

    Portion of the main Pleiades record for Dura with all associated sites plotted.
    All these references!

    Then, I wanted to check on the locations that did not produce any matches.

    # Who did not have matches?
    no_matches <- locations_master %>% 
      left_join(locations_5km %>% 
        group_by(
          locationID) %>% 
          summarize(n_plei_records = n_distinct(id),
                    .groups = "drop"),
        by = join_by(locationID == locationID)) %>% 
      filter(is.na(n_plei_records))
    • Take my original locations_master and left_join() it to the 5km data set. Since we just want a quick check of the number of records, inside that left_join(), group the records by locationID and then get the total number of distinct Pleiades IDs per locationID.
    • Join by the locationID in each data set.
    • filter(is.na(n_plei_records)) to focus on the records that did not have any IDs found for their coordinates.
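    This "left join, then keep the NA rows" pattern is effectively an anti-join: rows of the master set with no counterpart in the matched set. A toy base-R sketch with made-up locationIDs:

    ```r
    # Hypothetical master list and per-location match counts.
    master  <- data.frame(locationID = c(1, 2, 3, 4))
    matched <- data.frame(locationID = c(1, 3), n_plei_records = c(5, 2))

    # Left join with merge(all.x = TRUE): unmatched rows get NA,
    # and filtering on that NA leaves only the locations with no matches.
    checked    <- merge(master, matched, by = "locationID", all.x = TRUE)
    no_matches <- checked[is.na(checked$n_plei_records), ]
    ```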

    Finally, save the locations_5km and the locations_master objects for the next round of matches.

    saveRDS(locations_5km, paste0(objects_directory,"locations_5km.rds"))
    saveRDS(locations_master, paste0(objects_directory,"locations_master.rds"))
    saveRDS(no_matches, paste0(objects_directory,"no_matches.rds"))
    

    Robinson D (2025). fuzzyjoin: Join Tables Together on Inexact Matching. R package version 0.1.6, https://github.com/dgrtwo/fuzzyjoin.