Post

DP04 Stacking CSV Data Files with Different Structure

Description

This post depicts situations when data is collected in separate fragments, such as different time periods, places, or other subsets, and needs to be consolidated into a unified table for comprehensive analysis. This process is also known as “stacking” tables. Combining these individual data parts is essential to perform in-depth data processing and obtain insights that encompass the entirety of the original files.

The stacking process requires the same field names across all files. In cases where data includes varying names, the current script aids in automating the comparison process with a defined function that shows the fields that differs in names or is present in only one of the data sources. This results in significant time savings and error prevention whenever wide tables, meaning with many fields, are involved.

R Script - Stacking CSV files with different structure

Example: Two DataTables with different fields. Check before concatenate

04 Two DataTables with different fields. Check before merge Two DataTables with different fields. Check before merge

Function: Comparing two data tables with different named fields.

1
2
3
4
5
6
fx_comparison_two_DFs_Fields <- function(df_x, df_y){
  LeftW <- data.frame(field=names(df_x),match=match(names(df_x), names(df_y)), "L in R")
  RightW <- data.frame(field=names(df_y),match=match(names(df_y), names(df_x)),"R in L")
  compardf <- LeftW %>% full_join(RightW, by= 'field') %>% filter(is.na(match.x) | is.na(match.y) )
  print(compardf)
}

Comparing two data tables with different named fields.

Result of the comparison by using the script

04 Comparison of Fields with Differences Result: Comparison of Differences

Fixing the discrepancies

By using dplyr::rename, it is possible to rename the field names to match each other, allowing the fields to concatenate correctly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14

tibble_1mo <- tibble_1mo %>% dplyr::rename(Medicaid.ID = Current.MID,
                      Name = Patient.Name,
                      DOB = Patient.DOB,
                      Age = Patient.Age,
                      Gender = Patient.Gender,
                      Phone = Patient.Phone,
                      Address = Patient.Address,
                      City = Patient.City,
                      Zip = Patient.Zip
                      )

fx_comparison_two_DFs_Fields(tibble_1mo, tibble_2mo)

__

End of Post

This post is licensed under CC BY 4.0 by the author.