How can we create log files to track data merges?

Asked by ColinPayne in Data Science on Nov 9, 2019
Answered by Nitin Solanki

To make sure a merge does not introduce errors (e.g., an inflated data set caused by non-unique identifiers), we can keep a merge tracker that records the row count before and after each join, like this:

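library(plyr)  # provides join()

# label describing the merge; dat1 and dat2 are assumed to exist already
merge <- "dat1+dat2=dat1"

# row count before the join
count <- nrow(dat1)
check_t1 <- data.frame(merge, count)

dat1 <- join(dat1, dat2, by = "id1", type = "left")

# row count after the join; for a left join it should be unchanged
count <- nrow(dat1)
check_t2 <- data.frame(merge, count)

# start the tracker if it does not exist yet, then append both rows
if (!exists("checkmerge")) checkmerge <- data.frame()
checkmerge <- rbind(checkmerge, check_t1, check_t2)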
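
Since the question asks about log files, the tracker could also be written to disk. The following is only a minimal sketch, not the answerer's method: the helper name log_merge, the log file location, and the toy data frames are hypothetical, and base R's merge() stands in for plyr's join() so the block runs without extra packages.

```r
# Hypothetical helper: append one row per merge step to a CSV log file.
log_merge <- function(label, stage, n, logfile) {
    entry <- data.frame(time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                        merge = label, stage = stage, count = n)
    # write the header only when the file is first created
    new_file <- !file.exists(logfile)
    write.table(entry, file = logfile, sep = ",", row.names = FALSE,
                col.names = new_file, append = !new_file)
    invisible(entry)
}

# Toy data, assumed for illustration only
dat1 <- data.frame(id1 = 1:3, x = c("a", "b", "c"))
dat2 <- data.frame(id1 = 1:3, y = c(10, 20, 30))

logfile <- file.path(tempdir(), "merge_log.csv")  # hypothetical location
if (file.exists(logfile)) file.remove(logfile)    # start fresh for the demo

log_merge("dat1+dat2=dat1", "before", nrow(dat1), logfile)
dat1 <- merge(dat1, dat2, by = "id1", all.x = TRUE)  # base R left join
log_merge("dat1+dat2=dat1", "after", nrow(dat1), logfile)

# read the log back to inspect the recorded row counts
merge_log <- read.csv(logfile)
print(merge_log)
```

If the "before" and "after" counts for a left join differ, the identifier in the right-hand table was probably not unique.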

Alternatively, we can wrap the join in a function that contains a stopifnot() condition. It throws an error if the join inflates our data.frame:

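myfun <- function(df1, df2, id, jtype, msg) {
    # fail early if plyr is not installed instead of only warning
    stopifnot(requireNamespace("plyr", quietly = TRUE))

    # report which merge is running
    print(msg)

    M <- plyr::join(df1, df2, by = id, type = jtype)

    # a left join should keep exactly the rows of df1
    stopifnot(nrow(df1) == nrow(M))

    return(M)
}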

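library(plyr)

# "cyl" is not unique in mtcars, so this self-join adds rows
# and the stopifnot() check stops with an error
myfun(mtcars, mtcars, "cyl", "left", "mtcars, mtcars")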
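
The stopifnot() check only fires after the join has run. As a complementary sketch, one could test up front whether the key is unique in the right-hand table, which is the condition under which a left join cannot add rows. The helper name key_is_unique and the lookup table cyl_lookup are hypothetical, and base R's merge() is used as a stand-in for plyr's join().

```r
# Hypothetical helper: TRUE if the key columns identify rows of df uniquely
key_is_unique <- function(df, id) {
    anyDuplicated(df[, id, drop = FALSE]) == 0
}

# many cars share a cylinder count, so cyl is not a unique key in mtcars
print(key_is_unique(mtcars, "cyl"))

# a lookup table with one row per cyl value does have a unique key
cyl_lookup <- data.frame(cyl = c(4, 6, 8),
                         size = c("small", "medium", "large"))
print(key_is_unique(cyl_lookup, "cyl"))

# left join against the unique-key table; the row count should not change
M <- merge(mtcars, cyl_lookup, by = "cyl", all.x = TRUE)
print(nrow(M) == nrow(mtcars))
```

Running the uniqueness test before the join makes it possible to log the reason for a failed merge instead of only its symptom.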


