How can we create log files to track data merges?
To make sure that a merge does not introduce errors (e.g., a data set inflated by non-unique identifiers), we can maintain a merge tracker that records the row count before and after each join, like this:
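library(plyr)  # provides join()
if (!exists("checkmerge")) checkmerge <- data.frame()  # start an empty tracker on first use
merge <- "dat1+dat2=dat1"
count <- nrow(dat1)  # row count before the merge
check_t1 <- data.frame(merge, count)
dat1 <- join(dat1, dat2, by = "id1", type = "left")
count <- nrow(dat1)  # row count after the merge
check_t2 <- data.frame(merge, count)
checkmerge <- rbind(checkmerge, check_t1, check_t2)  # append both counts to the tracker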
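The tracker steps above can be wrapped in a small helper so that every merge appends its before and after counts automatically, and the tracker can then be written to disk as an actual log file. This is only a minimal sketch: the helper name `log_merge`, the `step` column, and the toy data are assumptions for illustration, and it uses base R's `merge()` rather than plyr's `join()` so that it runs without extra packages.

```r
# Hypothetical helper: left-join y onto x and append before/after row counts to a tracker.
log_merge <- function(x, y, by, label, tracker = data.frame()) {
  before <- nrow(x)
  merged <- merge(x, y, by = by, all.x = TRUE)  # base-R left join
  entry <- data.frame(
    merge = label,
    step = c("before", "after"),
    count = c(before, nrow(merged)),
    stringsAsFactors = FALSE
  )
  list(data = merged, tracker = rbind(tracker, entry))
}

# Toy data (illustrative only): id1 = 2 appears twice in dat2, so the join adds a row.
dat1 <- data.frame(id1 = 1:3, x = c("a", "b", "c"))
dat2 <- data.frame(id1 = c(1L, 2L, 2L), y = c(10, 20, 21))

res <- log_merge(dat1, dat2, by = "id1", label = "dat1+dat2=dat1")
checkmerge <- res$tracker
print(checkmerge)

# Persist the tracker as a log file (a temporary file here).
log_path <- tempfile(fileext = ".csv")
write.csv(checkmerge, log_path, row.names = FALSE)
```

Comparing the "before" and "after" rows for a given label shows at a glance whether a merge changed the size of the data.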
We can also use a function like this, which contains a stopifnot condition. It throws an error if the join changes the number of rows in our data.frame, for example when non-unique identifiers inflate it.
myfun <- function(df1, df2, id, jtype, msg) {
  print(msg)  # label identifying the merge in the console output
  M <- plyr::join(df1, df2, by = id, type = jtype)
  # abort if the join changed the number of rows
  stopifnot(nrow(df1) == nrow(M))
  return(M)
}
library(plyr)
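# "cyl" is not a unique identifier, so this join inflates the data and stopifnot() stops with an error
myfun(mtcars, mtcars, "cyl", "left", "mtcars, mtcars")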
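The call above fails because `cyl` does not uniquely identify rows in the right-hand table. A complementary check is to test the key for uniqueness before joining at all. The following is a rough sketch using base R's `anyDuplicated()`; the helper name `is_unique_key` and the `cars` data frame are made up for illustration.

```r
# Hypothetical helper: TRUE if the key column(s) uniquely identify the rows of df.
is_unique_key <- function(df, id) {
  anyDuplicated(df[, id, drop = FALSE]) == 0
}

# cyl repeats across many cars, so it is not a safe join key.
cyl_ok <- is_unique_key(mtcars, "cyl")

# The car model names are unique, so joining on them cannot inflate the data.
cars <- data.frame(model = rownames(mtcars), mtcars)
model_ok <- is_unique_key(cars, "model")

print(c(cyl = cyl_ok, model = model_ok))
```

Running such a check on the right-hand table before a left join catches the problem earlier than the row-count comparison does.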