Title: | Verbose Assertions for Tabular Data (Data.frames and Data.tables) |
---|---|
Description: | Simple, flexible, assertions on data.frame or data.table objects with verbose output for vetting. While other assertion packages apply towards more general use-cases, assertable is tailored towards tabular data. It includes functions to check variable names and values, whether the dataset contains all combinations of a given set of unique identifiers, and whether it is a certain length. In addition, assertable includes utility functions to check the existence of target files and to efficiently import multiple tabular data files into one data.table. |
Authors: | Grant Nguyen [aut, cre], Max Czapanskiy [ctb] |
Maintainer: | Grant Nguyen <[email protected]> |
License: | GPL-3 |
Version: | 0.2.8 |
Built: | 2025-03-05 03:16:13 UTC |
Source: | https://github.com/gnguy/assertable |
Given a data.frame or data.table object, assert that all columns in the colnames argument exist as columns.
assert_colnames(data, colnames, only_colnames = TRUE, quiet = FALSE)
data |
A data.frame or data.table |
colnames |
Character vector with column names corresponding to columns in data |
only_colnames |
Assert that the only columns in the data object should be those in colnames. Default = T. |
quiet |
Do you want to suppress the printed message when a test is passed? Default = F. |
Throws error if test is violated.
assert_colnames(CO2, c("Plant","Type","Treatment","conc","uptake"))
assert_colnames(CO2, c("Plant","Type"), only_colnames=FALSE)
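The check described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `check_colnames` is a hypothetical helper, and its error messages are invented for the example.

```r
# Hypothetical helper (not part of assertable): base-R sketch of a column-name check.
check_colnames <- function(data, colnames, only_colnames = TRUE) {
  missing_cols <- setdiff(colnames, names(data))  # expected but absent
  extra_cols   <- setdiff(names(data), colnames)  # present but not expected
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (only_colnames && length(extra_cols) > 0) {
    stop("Unexpected columns: ", paste(extra_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Both calls pass silently
check_colnames(CO2, c("Plant", "Type", "Treatment", "conc", "uptake"))
check_colnames(CO2, c("Plant", "Type"), only_colnames = FALSE)

# With only_colnames = TRUE, the remaining columns are reported as unexpected
result <- tryCatch(check_colnames(CO2, c("Plant", "Type")),
                   error = function(e) conditionMessage(e))
```

Separating the "missing" and "extra" sets mirrors the role of only_colnames: the first set is always checked, the second only when the column list is meant to be exhaustive.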
Given a data.frame or data.table object, assert that each column named in the coltypes argument has the same type as the corresponding element of coltypes.
assert_coltypes(data, coltypes, quiet = FALSE)
data |
A data.frame or data.table |
coltypes |
List with names corresponding to columns in data. The types of the columns in data will be tested against types of the elements in coltypes. |
quiet |
Do you want to suppress the printed message when a test is passed? Default = F. |
Throws error if test is violated.
# Should pass
assert_coltypes(CO2, list(Plant = integer(), conc = double()))
# Should fail
## Not run:
assert_coltypes(CO2, list(Plant = character(), conc = character()))
## End(Not run)
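A base-R sketch of a type comparison of this kind follows. It assumes the comparison is made with typeof(), which is consistent with the example above (Plant is a factor, stored as integer); the package may compare types differently. `check_coltypes` is a hypothetical helper.

```r
# Hypothetical helper (not part of assertable): compare typeof() of each named
# column with typeof() of the matching template element.
check_coltypes <- function(data, coltypes) {
  actual   <- vapply(names(coltypes), function(nm) typeof(data[[nm]]), character(1))
  expected <- vapply(coltypes, typeof, character(1))
  mismatch <- names(coltypes)[actual != expected]
  if (length(mismatch) > 0) {
    stop("Type mismatch in: ", paste(mismatch, collapse = ", "))
  }
  invisible(TRUE)
}

# Passes: a factor is stored as integer, conc is double
check_coltypes(CO2, list(Plant = integer(), conc = double()))

# Fails: neither column is stored as character
type_error <- tryCatch(
  check_coltypes(CO2, list(Plant = character(), conc = character())),
  error = function(e) conditionMessage(e)
)
```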
Given a data.frame or data.table object and a named list of id_vars, assert that all possible combinations of id_vars exist in the dataset,
that no combinations of id_vars exist in the dataset but not in id_vars,
and that there are no duplicate values within the dataset within unique combinations of id_vars.
If ids_only = T and assert_dups = T, returns all combinations of id_vars along with the n_duplicates: the count of duplicates within each combination.
If ids_only = F, returns all duplicate observations from the original dataset along with n_duplicates
and duplicate_id: a unique ID for each duplicate value within each combination of id_vars.
assert_ids(data, id_vars, assert_combos = TRUE, assert_dups = TRUE, ids_only = TRUE, warn_only = FALSE, quiet = FALSE)
data |
A data.frame or data.table |
id_vars |
A named list of vectors, where the name of each vector must correspond to a column in data |
assert_combos |
Assert that the data object must contain all combinations of id_vars. Default = T. |
assert_dups |
Assert that the data object must not contain duplicate values within any combinations of id_vars. Default = T. |
ids_only |
By default, with assert_dups = T, the function returns the unique combinations of id_vars that have duplicate observations. If ids_only = F, it returns every duplicate observation in the original dataset. |
warn_only |
Do you want to warn, rather than error? Will return all offending rows from the first violation of the assertion. Default=F. |
quiet |
Do you want to suppress the printed message when a test is passed? Default = F. |
Note: if assert_combos = T and is violated, then assert_ids will stop execution and return results for assert_combos before evaluating the assert_dups segment of the code. If you want to make sure both options are evaluated even in case of a violation in assert_combos, call assert_ids twice (once with assert_dups = F, then assert_combos = F) with warn_only = T, and then conditionally stop your code if either call returns results.
Throws error if test is violated. Will print the offending rows. If warn_only=T, will return all offending rows and only warn.
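The two-call pattern from the note above might look like the following sketch. The argument names come from the usage shown earlier; `has_rows` is a hypothetical helper, and the assumption that a passing call returns nothing with rows is not confirmed by this page. The calls are guarded so the sketch does nothing where assertable is not installed.

```r
# Hypothetical helper: TRUE when a call returned a table with at least one row
has_rows <- function(x) is.data.frame(x) && nrow(x) > 0

ids <- list(Plant = as.character(unique(CO2$Plant)), conc = unique(CO2$conc))
any_violation <- FALSE

# Guarded: only runs when the assertable package is available
if (requireNamespace("assertable", quietly = TRUE)) {
  # First call checks combinations only, second call checks duplicates only
  missing_combos <- assertable::assert_ids(CO2, ids, assert_dups = FALSE,
                                           warn_only = TRUE)
  duplicate_rows <- assertable::assert_ids(CO2, ids, assert_combos = FALSE,
                                           warn_only = TRUE)
  any_violation <- has_rows(missing_combos) || has_rows(duplicate_rows)
}

# Stop only after both checks have been evaluated
if (any_violation) stop("assert_ids reported missing combinations or duplicates")
```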
plants <- as.character(unique(CO2$Plant))
concs <- unique(CO2$conc)
ids <- list(Plant=plants,conc=concs)
assert_ids(CO2, ids)
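To make the three conditions concrete (all combinations present, no unexpected combinations, no duplicates), here is a base-R sketch. It is an illustration only: `check_ids` and its return structure are hypothetical, and the package's own implementation and output format differ.

```r
# Hypothetical helper (not part of assertable): base-R sketch of the three ID checks.
check_ids <- function(data, id_vars) {
  # Every combination implied by id_vars
  expected <- expand.grid(id_vars, stringsAsFactors = FALSE)
  observed <- as.data.frame(data)[names(id_vars)]
  # One key string per row, so combinations can be compared as sets
  key <- function(df) do.call(paste, c(lapply(df, as.character), sep = "|"))
  list(
    missing    = expected[!key(expected) %in% key(observed), , drop = FALSE],
    extra      = observed[!key(observed) %in% key(expected), , drop = FALSE],
    duplicated = observed[duplicated(key(observed)), , drop = FALSE]
  )
}

ids <- list(Plant = as.character(unique(CO2$Plant)), conc = unique(CO2$conc))

full    <- check_ids(CO2, ids)        # complete data: all three tables are empty
partial <- check_ids(CO2[-1, ], ids)  # first row dropped: one combination is missing
```

Here `partial$missing` holds the single Plant/conc combination that was removed with the first row of CO2.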
Given a data.frame or data.table object and a target number of rows, check that the dataset has that many rows.
assert_nrows(data, target_nrows, quiet = FALSE)
data |
A data.frame or data.table |
target_nrows |
Numeric – number of expected rows |
quiet |
Do you want to suppress the printed message when a test is passed? Default = F. |
Throws error if test is violated
assert_nrows(CO2,84)
Given a data.frame or data.table object, make assertions about the values of columns within the object. Assert that a column contains no missing or infinite values, or that its values are greater than, less than, equal to, or contained in test_val, which can be a single value, a vector with nrow(data) values, or a vector of any length (for the in option).
assert_values(data, colnames, test = "not_na", test_val = NA, display_rows = TRUE, na.rm = FALSE, warn_only = FALSE, quiet = FALSE)
data |
A data.frame or data.table |
colnames |
Character vector with column names corresponding to columns in data |
test |
The type of evaluation to assert in your data, for example "not_na" (the default), "gt", "lt", "lte", "equal", or "in". |
test_val |
A single value, a vector with length = nrow(data), or a vector of any length (if using the in option for test). Must match the type of the columns named in colnames. |
display_rows |
Do you want to show the actual rows that violate the assertion? Default=T |
na.rm |
Do you want to remove NA and NaN values from assertions? Default=F |
warn_only |
Do you want to warn, rather than error? Will return all offending rows from the first violation of the assertion. Default=F |
quiet |
Do you want to suppress the printed messages when a test is passed? Default = F. |
Throws error if test is violated. If warn_only=T, will return all offending rows from the first violation of the assertion.
assert_values(CO2, colnames="uptake", test="gt", 0) # Are all values greater than 0?
assert_values(CO2, colnames="conc", test="lte", 1000) # Are all values less than/equal to 1000?
## Not run:
assert_values(CO2, colnames="uptake", test="lt", 40) # Are all values less than 40?
# Fails: not all values < 40.
## End(Not run)
assert_values(CO2, colnames="Treatment", test="in", test_val = c("nonchilled","chilled"))
CO2_mult <- CO2
CO2_mult$new_uptake <- CO2_mult$uptake * 2
assert_values(CO2, colnames="uptake", test="equal", CO2_mult$new_uptake/2)
## Not run:
assert_values(CO2, colnames="uptake", test="gt", CO2_mult$new_uptake/2, display_rows=F)
# Fails: uptake !> new_uptake/2
## End(Not run)
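One way to picture how a test name maps to a comparison is a lookup table of vectorised functions. The sketch below covers only the test names that appear on this page; `value_tests` and `violations` are hypothetical names, and the na.rm, display_rows, and warn_only behaviour of assert_values is not modelled.

```r
# Hypothetical lookup table: test name -> vectorised comparison
value_tests <- list(
  not_na = function(x, val) !is.na(x),
  gt     = function(x, val) x > val,
  lt     = function(x, val) x < val,
  lte    = function(x, val) x <= val,
  equal  = function(x, val) x == val,
  "in"   = function(x, val) x %in% val
)

# Hypothetical helper: row indices that violate the chosen test.
# val may be a single value or a vector with one element per row.
violations <- function(data, colname, test, test_val = NA) {
  passed <- value_tests[[test]](data[[colname]], test_val)
  which(!passed)
}

v_gt    <- violations(CO2, "uptake", "gt", 0)       # no rows violate uptake > 0
v_lt    <- violations(CO2, "uptake", "lt", 40)      # some rows have uptake >= 40
v_in    <- violations(CO2, "Treatment", "in", c("nonchilled", "chilled"))
v_equal <- violations(CO2, "uptake", "equal", CO2$uptake * 2 / 2)
```

Returning row indices rather than a single TRUE/FALSE makes it straightforward to display the offending rows, which is the role display_rows plays in the documented function.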
Given a character vector of filenames, check how many of them currently exist. Optionally, keep checking for a specified amount of time at a given frequency.
check_files(filenames, folder = "", warn_only = FALSE, continual = FALSE, sleep_time = 30, sleep_end = (60 * 3), display_pct = 75)
filenames |
A character vector of filenames (specify full paths if you are checking files that are not in present working directory) |
folder |
An optional character containing the folder name that contains the files you want to check (if used, do not include folderpath in the filenames characters). If not specified, will search in present working directory. |
warn_only |
Boolean (T/F), whether to end with a warning message as opposed to an error message if files are still missing at the end of the checks. |
continual |
Boolean (T/F), whether to only run once or to continually keep checking for files for sleep_end minutes. Default = F. |
sleep_time |
numeric (seconds); if continual = T, specify the number of seconds to wait in-between file checks. Default = 30 seconds. |
sleep_end |
numeric (minutes); if continual = T, specify number of minutes to check at sleep_time intervals before terminating. Default = 180 minutes. |
display_pct |
numeric (0-100); at what percentage of files found do you want to print the full list of still-missing files? Default = 75 percent of files. |
Prints the number of files that match. If warn_only = T, returns a character vector of missing files
## Not run:
for(i in 1:3) {
  data <- CO2
  data$id_var <- i
  write.csv(data,file=paste0("file_",i,".csv"),row.names=FALSE)
}
filenames <- paste0("file_",c(1:3),".csv")
check_files(filenames)
## End(Not run)
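The polling behaviour described by continual, sleep_time, and sleep_end can be sketched in base R as follows. `missing_files` is a hypothetical helper written for this illustration; it omits the progress messages and display_pct handling of check_files, and the temporary directory is only there to keep the example self-contained.

```r
# Hypothetical helper (not part of assertable): return the files that are still
# missing, optionally re-checking until a deadline.
missing_files <- function(filenames, folder = "", continual = FALSE,
                          sleep_time = 30, sleep_end = 60 * 3) {
  paths <- if (nzchar(folder)) file.path(folder, filenames) else filenames
  deadline <- Sys.time() + sleep_end * 60  # sleep_end is given in minutes
  repeat {
    missing <- filenames[!file.exists(paths)]
    if (length(missing) == 0 || !continual || Sys.time() >= deadline) break
    Sys.sleep(sleep_time)                  # sleep_time is given in seconds
  }
  missing
}

# Create two of three expected files in a temporary folder
demo_dir <- tempfile("check_files_demo")
dir.create(demo_dir)
filenames <- paste0("file_", 1:3, ".csv")
file.create(file.path(demo_dir, filenames[1:2]))

# Single check (continual = FALSE): reports the one file that was never created
still_missing <- missing_files(filenames, folder = demo_dir)
```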
Given a character vector of filenames, import each file with FUN and combine the results into one data.table. The existence of the files is checked before import.
import_files(filenames, folder = "", FUN = fread, warn_only = FALSE, multicore = FALSE, use.names = TRUE, fill = TRUE, mc.preschedule = FALSE, mc.cores = getOption("mc.cores", 2L), ...)
filenames |
A character vector of filenames (specify full paths if you are importing files that are not in the present working directory) |
folder |
An optional character string naming the folder that contains the files you want to import (if used, do not include the folder path in filenames). If not specified, will look in the present working directory. |
FUN |
function: The function that you want to use to import your data, e.g. read.csv, fread, read_dta, etc. |
warn_only |
Boolean (T/F), whether to send a warning message as opposed to an error message if files are missing prior to import. Will only import the files that do exist. |
multicore |
boolean, use lapply or mclapply (multicore = T) to loop over files in filenames for import. Default=F. |
use.names |
boolean, pass to the use.names option for rbindlist |
fill |
boolean, pass to the fill option for rbindlist |
mc.preschedule |
boolean, pass to the mc.preschedule option for mclapply if multicore = T. Default = F. |
mc.cores |
pass to the mc.cores option for mclapply if multicore = T. Default = mclapply default. |
... |
named arguments of FUN to pass to FUN |
One data.table that contains all files in filenames, combined together using rbindlist. Throws an error if any file in filenames does not exist, unless warn_only = T, in which case only the files that exist are imported.
## Not run:
for(i in 1:3) {
  data <- CO2
  data$id_var <- i
  write.csv(data,file=paste0("file_",i,".csv"),row.names=FALSE)
}
filenames <- paste0("file_",c(1:3),".csv")
import_files(filenames, FUN=fread)
import_files(filenames, FUN=read.csv, stringsAsFactors=FALSE)
import_files(filenames, FUN=fread, multicore=T, mc.cores=1) # Only if you have a multi-core system
## End(Not run)
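The import-and-append step can be approximated in base R as shown below. This sketch swaps in read.csv and do.call(rbind, ...) for fread and rbindlist so that it runs without data.table, and it returns a data.frame rather than a data.table. `import_sketch` is a hypothetical helper, and the multicore options are left out.

```r
# Hypothetical helper (not part of assertable): read each file with FUN and
# stack the results row-wise.
import_sketch <- function(filenames, folder = "", FUN = read.csv, ...) {
  paths <- if (nzchar(folder)) file.path(folder, filenames) else filenames
  missing <- filenames[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing files: ", paste(missing, collapse = ", "))
  }
  tables <- lapply(paths, FUN, ...)  # one table per file; ... is passed to FUN
  do.call(rbind, tables)             # base-R stand-in for rbindlist
}

# Write three copies of CO2, each tagged with an id, into a temporary folder
demo_dir <- tempfile("import_files_demo")
dir.create(demo_dir)
filenames <- paste0("file_", 1:3, ".csv")
for (i in 1:3) {
  data <- as.data.frame(CO2)
  data$id_var <- i
  write.csv(data, file.path(demo_dir, filenames[i]), row.names = FALSE)
}

combined <- import_sketch(filenames, folder = demo_dir, stringsAsFactors = FALSE)
```

Unlike rbindlist with fill = TRUE, rbind requires every file to have the same columns, so this stand-in only works for files with identical layouts.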