The assertable package includes two functions, check_files and import_files, for checking and importing multiple files into one dataset.
We will use the CO2 dataset, which has 84 rows and 5 columns of data from an experiment on the cold tolerance of plants. First, we take the CO2 dataset and save the whole dataset three times into three separate CSV files as data/file_#.csv, each with a unique id_var.
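The setup described above can be sketched in base R. This is a minimal sketch rather than the package's own build script: the temporary directory `out_dir` and the vector `written` are assumptions made so the example does not write into your working directory.

```r
# Write three copies of the built-in CO2 dataset, each tagged with its own id_var.
# out_dir is a hypothetical location; the text above uses data/file_#.csv.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (i in 1:3) {
  dat <- as.data.frame(datasets::CO2)  # drop the extra grouped-data classes
  dat$id_var <- i                      # unique identifier for this copy
  write.csv(dat, file.path(out_dir, paste0("file_", i, ".csv")), row.names = FALSE)
}

# Paths of the files just written
written <- file.path(out_dir, paste0("file_", 1:3, ".csv"))
```

Each file then holds the 84 CO2 rows plus the id_var column, so appending all three gives 252 rows.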
check_files checks how many of your files currently exist, and stops script execution if any are missing. We can use the system.file command to locate the example files within the installed assertable package.
files <- paste0("file_",c(1:3),".csv")
filenames <- system.file("extdata", files, package = "assertable")
filenames
## [1] "/tmp/Rtmp7zyftE/Rinst6794c83ddcc/assertable/extdata/file_1.csv"
## [2] "/tmp/Rtmp7zyftE/Rinst6794c83ddcc/assertable/extdata/file_2.csv"
## [3] "/tmp/Rtmp7zyftE/Rinst6794c83ddcc/assertable/extdata/file_3.csv"
check_files(filenames)
## [1] "All results are present"
Here, let’s add a file that does not exist to filenames and check again.
filenames <- c(filenames,"new_file.csv")
check_files(filenames)
## [1] "Have 3 files: expecting 4 at 2024-11-05 03:10:55.363304"
## [1] "Still Missing: new_file.csv"
## Error in check_files(filenames): Files not complete; stopping execution -- set continual=T for continual file checks
By setting continual = T, you can keep checking for the files every few seconds (specified by sleep_time, in seconds) for a designated number of minutes (specified by sleep_end). This is particularly useful for monitoring the progress of distributed compute jobs, or for pausing a step until all previous steps have successfully produced their output files.
filenames <- c(filenames,"new_file.csv")
check_files(filenames, continual=T, sleep_time = 1, sleep_end = .10)
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:10:55.376472"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:10:56.377792"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:10:57.379222"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:10:58.380664"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:10:59.382106"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:11:00.3836"
## [1] "Have 3 files: expecting 5 at 2024-11-05 03:11:01.385052"
## Error in check_files(filenames, continual = T, sleep_time = 1, sleep_end = 0.1): Files not complete; stopping execution after 0.1 minutes
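The polling behaviour shown above can be illustrated with a small base R loop. This is a sketch under assumptions, not assertable's implementation: `wait_for_files` is a hypothetical helper, and it only mirrors the documented meaning of sleep_time (seconds between checks) and sleep_end (minutes before giving up).

```r
# Hypothetical helper illustrating the polling idea; not assertable's code.
# sleep_time is in seconds and sleep_end is in minutes, as in the text above.
wait_for_files <- function(filenames, sleep_time = 1, sleep_end = 0.1) {
  deadline <- Sys.time() + sleep_end * 60
  repeat {
    present <- file.exists(filenames)
    message("Have ", sum(present), " files: expecting ", length(filenames))
    if (all(present)) return(invisible(TRUE))
    if (Sys.time() >= deadline) {
      stop("Files not complete after ", sleep_end, " minutes: ",
           paste(filenames[!present], collapse = " "))
    }
    Sys.sleep(sleep_time)  # wait before the next check
  }
}

# Usage: a file that already exists is found on the first check.
existing <- tempfile(fileext = ".csv")
file.create(existing)
wait_for_files(existing, sleep_time = 0.1, sleep_end = 0.01)
```

With sleep_time = 1 and sleep_end = 0.1, a loop like this runs for roughly six seconds, which is consistent with the seven checks printed above.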
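By default, check_files lists the missing files only when at least 75% of the requested files exist. You can change this threshold with the display_pct argument. This lets you see which specific files or processes may have errored out, without filling up your logs while most of them are still computing. Below, only three of the seven requested files exist, so the missing files are not listed even with display_pct = 50.
check_files(filenames, display_pct = 50)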
## [1] "Have 3 files: expecting 7 at 2024-11-05 03:11:02.401963"
## Error in check_files(filenames, display_pct = 50): Files not complete; stopping execution -- set continual=T for continual file checks
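The threshold arithmetic behind the outputs in this section can be worked through in a few lines. This is a sketch under an assumption: `lists_missing` is a hypothetical helper, and the "at least display_pct percent present" comparison is inferred from the outputs shown here rather than taken from the package source.

```r
# Hypothetical helper: would the missing files be listed?
# Assumes the rule is "percent of files present >= display_pct".
lists_missing <- function(n_found, n_expected, display_pct = 75) {
  100 * n_found / n_expected >= display_pct
}

lists_missing(3, 4)                    # 75% present, default threshold
lists_missing(3, 5)                    # 60% present, default threshold
lists_missing(3, 7, display_pct = 50)  # about 43% present, threshold lowered to 50
```

Under this assumption only the first case lists the missing files, which matches the "Still Missing" line appearing with 3 of 4 files but not with 3 of 5 or 3 of 7.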
import_files imports all files using a wrapper around rbindlist and lapply, so it assumes that your data are similarly formatted, tabular, and able to be appended together with rbindlist. It accepts a function FUN, which is used to read each file; you must load the package that provides this function before calling import_files.
You can specify the use.names and fill arguments to pass on to rbindlist. In addition, if multicore=T, import_files uses mclapply instead of lapply, and you can specify mc.preschedule and mc.cores as options to mclapply. Finally, you can pass FUN-specific arguments as named arguments to import_files.
library(data.table)
files <- paste0("file_",c(1:3),".csv")
filenames <- system.file("extdata", files, package = "assertable")
data <- import_files(filenames, FUN=fread)
## [1] "All results are present"
## Plant Type Treatment conc uptake id_var
## <char> <char> <char> <int> <num> <int>
## 1: Qn1 Quebec nonchilled 95 16.0 1
## 2: Qn1 Quebec nonchilled 175 30.4 1
## 3: Qn1 Quebec nonchilled 250 34.8 1
## 4: Qn1 Quebec nonchilled 350 37.2 1
## 5: Qn1 Quebec nonchilled 500 35.3 1
## ---
## 248: Mc3 Mississippi chilled 250 17.9 3
## 249: Mc3 Mississippi chilled 350 17.9 3
## 250: Mc3 Mississippi chilled 500 17.9 3
## 251: Mc3 Mississippi chilled 675 18.9 3
## 252: Mc3 Mississippi chilled 1000 19.9 3
Here, we use read.csv instead and pass the stringsAsFactors argument through to it.
data <- import_files(filenames, FUN=read.csv, stringsAsFactors=FALSE)
## [1] "All results are present"
## Plant Type Treatment conc uptake id_var
## <char> <char> <char> <int> <num> <int>
## 1: Qn1 Quebec nonchilled 95 16.0 1
## 2: Qn1 Quebec nonchilled 175 30.4 1
## 3: Qn1 Quebec nonchilled 250 34.8 1
## 4: Qn1 Quebec nonchilled 350 37.2 1
## 5: Qn1 Quebec nonchilled 500 35.3 1
## ---
## 248: Mc3 Mississippi chilled 250 17.9 3
## 249: Mc3 Mississippi chilled 350 17.9 3
## 250: Mc3 Mississippi chilled 500 17.9 3
## 251: Mc3 Mississippi chilled 675 18.9 3
## 252: Mc3 Mississippi chilled 1000 19.9 3
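The read-and-append pattern behind both calls above can be sketched in base R. This is a stand-in under stated assumptions, not the package's code: `read_and_bind` is a hypothetical helper, and base `rbind` replaces `rbindlist` so the sketch runs without data.table. The `piece_#.csv` files are made up for the example.

```r
# Hypothetical stand-in for the read-and-append pattern.
# Base rbind is swapped in for data.table::rbindlist.
read_and_bind <- function(filenames, FUN = read.csv, ...) {
  pieces <- lapply(filenames, FUN, ...)  # extra named arguments go to FUN
  do.call(rbind, pieces)                 # append the pieces row-wise
}

# Two small files with the same columns, written to a temporary directory.
paths <- file.path(tempdir(), paste0("piece_", 1:2, ".csv"))
for (i in 1:2) {
  write.csv(data.frame(x = c("a", "b"), id_var = i), paths[i], row.names = FALSE)
}

# stringsAsFactors is forwarded to read.csv through `...`
combined <- read_and_bind(paths, stringsAsFactors = FALSE)
```

Forwarding `...` to the reader is what lets one wrapper serve different import functions without knowing their arguments in advance.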
import_files first checks that all requested files exist before reading any of them. This can save a lot of time if you have numerous large files and your script would otherwise stop only when read.csv (or another import function) fails on a missing file, potentially after many other files have already been imported.
filenames <- c(filenames, paste0("new_file_",c(1:10),".csv"))
import_files(filenames)
## [1] "Have 3 files: expecting 13 at 2024-11-05 03:11:02.452702"
## Error in import_files(filenames): These files do not exist: new_file_1.csv new_file_2.csv new_file_3.csv new_file_4.csv new_file_5.csv new_file_6.csv new_file_7.csv new_file_8.csv new_file_9.csv new_file_10.csv
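The benefit of checking before reading can be shown with a reader that counts how often it is called. This is a sketch, not assertable's implementation: `import_checked` and `counting_reader` are hypothetical helpers, and base `rbind` again stands in for `rbindlist`.

```r
# Hypothetical helper showing the check-before-read ordering; not assertable's code.
import_checked <- function(filenames, FUN = read.csv, ...) {
  missing_files <- filenames[!file.exists(filenames)]
  if (length(missing_files) > 0) {
    stop("These files do not exist: ", paste(missing_files, collapse = " "))
  }
  do.call(rbind, lapply(filenames, FUN, ...))
}

# A reader that counts its calls, to show that nothing is read when a file is missing.
n_reads <- 0
counting_reader <- function(path, ...) {
  n_reads <<- n_reads + 1
  read.csv(path, ...)
}

good <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:2), good, row.names = FALSE)
bad <- file.path(tempdir(), "no_such_file.csv")  # never created

# Capture the error message instead of halting the script.
result <- tryCatch(import_checked(c(good, bad), FUN = counting_reader),
                   error = function(e) conditionMessage(e))
```

Because the existence check runs first, `n_reads` is still 0 after the failed call: the file that does exist is never read.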