Post

DP17 Splitting and Summarizing String Values

Description

In the example, there is a “Reasons for Service” field that comprises the checked reasons for service each interaction had. Some are single reasons but most cases have grouped two or more reasons separated by commas. To measure the monthly impact of each reason individually, it is necessary to apply to split and summarize this data.

In order to split these values we can tally the total number of reasons for all the patients in a period, by using the base::strsplit() function. It splits the field into a list of lists. These lists are then converted into a sparse matrix to summarize them.

R Script - Split string of values

Initial Data for the Field: Reasons for Service

For a data frame with 9355 records, let’s check the most frequent combination of “Reasons for Service”, with the most frequent values at the top and listing the top 20. For that purpose, the next code is applied:

1
2
head(dfINT %>% group_by(Reason.s..for.Service) %>% tally %>% arrange(desc(n)),20)

Table: Combination of Reasons. Top 20 most frequent.

Reason.s..for.Servicen
Assessment,Education,Engagement,Other2505
Post Discharge Call / Visit1155
Care Coordination1048
Engagement505
Assessment,Education,Engagement320
Assessment310
Assessment,Engagement304
Assessment,COVID-19 Education,Education,Engagement215
Other205
Assessment,Engagement,Post Discharge Call / Visit197
Medication Management171
Monitoring150
Care Coordination,Education148
Assessment,Other,Post Discharge Call / Visit110
Education,Engagement101
Education,Engagement,Monitoring93
Assessment,COVID-19 Education,Education,Engagement,Other77
Assessment,COVID-19 Education,Other,Post Discharge Call…75
Engagement,Post Discharge Call / Visit70
Education51

Combination of Reasons in the R Console

17 format Date Initial Data: Reasons for Service merged with commas

Add an ID Index column

dfRecapNoUTR$ID <- seq.int(nrow(dfRecapNoUTR))

Reasons for Service for the first five records

IDReasons
1Assessment,Education,Engagement,Other
2Assessment,Other,Post Discharge Call / Visit
3Post Discharge Call / Visit
4Assessment,Education,Engagement,Other
5Assessment,Education,Engagement,Other

Getting List of Lists

The next step involves splitting the string values into multiple columns using the strsplit function, using comma as separator.

1
dat <- with(dfRecapNoUTR, strsplit(Reason.s..for.Service, ','))  # Result: List of Lists

List of Lists for first five records

17 format Date Partial Results: List within List for first 5 records

Unlisting the values and relating them with the corresponding ID they belong to

The index ID is repeated for each reason in the record as many times as the number of reasons within the record:

1
2
3
df2 <- data.frame(ID = rep(dfRecapNoUTR$ID, 
                           times = lengths(dat)),
                  Reason.s..for.Service = unlist(dat))

Resulting Unlisted Reasons by ID

IDReason.s..for.Service
1Assessment
1Education
1Engagement
1Other
2Assessment
2Other
2Post Discharge Call / Visit
3Post Discharge Call / Visit
4Assessment
4Education
4Engagement
4Other
5Assessment
5Education
5Engagement
5Other

Create a table ID ~ Reasons

1
2
3
4
# Sparse Matrix. Convert table to data frame
df3 <- as.data.frame(cbind(ID = dfRecapNoUTR$ID,
                   table(df2$ID, df2$Reason.s..for.Service)))  
head(df3)

Resulting Sparse Matrix

17 Sparse Matrix Result: Sparse Matrix

Joining the sparse matrix to the main data and Recap per Year ~ Month

1
2
3
4
5
6
7
dfRecapReasons <- dfRecapNoUTR %>% select("YearMonth", "ID") %>% 
  inner_join(df3, by = 'ID')
dfRecapReasons <- data.table(dfRecapReasons %>% 
                               group_by(YearMonth) %>% 
                               select(-c("ID")) %>% 
                               summarise_all(list(sum)))

17 Summarized Values Result: Summarized values

__

End of Post

This post is licensed under CC BY 4.0 by the author.