--- title: "assertable File Assertion Intro" author: "Grant Nguyen" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{assertable file assertion intro} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results='asis'} library(assertable) ``` The assertable package includes two functions to check and import multiple files into one dataset ## Data We will use the CO2 dataset, which has 64 rows and 5 columns of data from an experiment related to the cold tolerances of plants. First, we take in the CO2 dataset and save the whole dataset three times into three separate csv files as data/file_#.csv, with a unique id_var. ```{r, results='asis', eval=FALSE} for(i in 1:3) { data <- CO2 data$id_var <- i write.csv(data,file=paste0("../data/file_",i,".csv"),row.names=F) } ``` ## Checking file existence check_files checks to see how many of your files currently exist, and stop script execution if not all files exist. We can use the system.file command to locate them within the assertable namespace. ```{r, results='markup'} files <- paste0("file_",c(1:3),".csv") filenames <- system.file("extdata", files, package = "assertable") filenames ``` ```{r, results='markup', error=TRUE} check_files(filenames) ``` Here, let's add another file to filenames. ```{r, results='markup', error=TRUE} filenames <- c(filenames,"new_file.csv") check_files(filenames) ``` By setting continual = T, you can keep checking for the files every few seconds (specified by sleep_time) for a designated number of minutes (specified by sleep_end). This is particularly useful when monitoring the progress of distributed compute jobs, or pausing execution of a step until all previous steps have successfully produced otuput files. ```{r, results='markup', error=TRUE} filenames <- c(filenames,"new_file.csv") check_files(filenames, continual=T, sleep_time = 1, sleep_end = .10) ``` check_files only prints out missing files if 75% of the requested files exist. You can change this using the display_pct argument. This is useful to see what specific files/processes may have errored out, but without filling up your logs while they are computing. ```{r, results='markup', error=TRUE} filenames <- c(filenames,"new_file_1.csv","new_file_2.csv") check_files(filenames, display_pct=50) ``` ## Importing files All files are imported using a wrapper of rbindlist and lapply -- so this assumes that your data is similarly-formulated, tabular in nature, and able to be appended together using rbindlist. It accepts a function FUN, which will be used to import your data -- you must set the library for this function before using it. You can specify use.names and fill arguments to pass onto rbindlist. In addition, if multicore=T, import_files will use mclapply instead of lapply -- you can specify mc.preschedule and mc.cores as options to mclapply. Finally, you can pass on FUN-specific arguments via named arguments to import_files ```{r, results='markup'} library(data.table) files <- paste0("file_",c(1:3),".csv") filenames <- system.file("extdata", files, package = "assertable") data <- import_files(filenames, FUN=fread) data ``` Here, we can use read.csv and pass on the stringsAsFactors argument to read.csv. ```{r, results='markup'} data <- import_files(filenames, FUN=read.csv, stringsAsFactors=F) data ``` import_files first scans to make sure that all requested files exist prior to bringing them in. This can save a lot of time if you have numerous large files and currently only stop execution if your read.csv or other data import function breaks (potentially after importing many other files beforehand). ```{r, results='markup', error=TRUE} filenames <- c(filenames,paste0("new_file_",c(1:10),".csv")) import_files(filenames) ```