R and Pharmaceutical Data Analysis: Top Packages for Clinical Trial Data and Predictive Modeling

R Programming and Pharmaceutical Data Analysis (Packages for Clinical Trial Data)


In the clinical trial reporting industry, it is often wrongly assumed that SAS is the ideal choice because regulatory agencies “require” it. In fact, regulatory agencies generally do not mandate specific software for clinical trial reporting. They focus primarily on the accuracy, integrity, and compliance of the reported data.

Recently, other options such as the open-source R language have gained attention across the life sciences, despite resistance rooted in this misconception. R has a long history in academic statistical research and is already used in the pharmaceutical industry, but its adoption in regulatory submissions has so far been limited.

In recent years, developers from the pharmaceutical industry have embraced R and collaboratively developed open-source libraries for clinical trial data analysis and reporting. This post gives an overview of a few of these R packages, including their key features, how they can be used, and resources for learning more about them.

Open Source and Proprietary Works

Please note, this is not legal advice! Explore Posit’s responses to related questions on:

One of the common misconceptions about using open-source packages in software development is that it requires companies to publicly disclose their own code and proprietary workflows. This misconception often arises from the fact that many popular open-source packages, such as R and its various libraries, are licensed under the GNU General Public License (GPL) or other similar licenses that require the distribution of source code. 

However, it is important to note that using open-source packages does not necessarily require companies to disclose their own code or proprietary workflows. The GPL and other similar licenses only require the distribution of source code if the software that uses the open-source packages is also distributed. 

For example, if a company is developing an internal application that uses R and various open-source libraries, and that application is only used within the company, then there is no requirement to distribute the source code for that application. The company can keep its code and workflows confidential, even if it uses open-source packages.

On the other hand, if a company develops a commercial application that incorporates open-source packages and then distributes that application to customers, then it may be required to distribute the source code for the application, including any modifications made to the open-source packages. However, this requirement can often be satisfied by providing access to the source code through a written offer, rather than including it with the distributed software.

There is also an alternative interpretation. R code that uses external libraries is not compiled together with them; the libraries are loaded into memory only when the app is executed. Under this view, the library your code refers to need not be a specific library from CRAN: it could be any library on your local drive, under a different license, that adheres to the same interface. Because your code does not rely on any specific GPL library, the app is a separate component that uses other packages without being linked with them into one product, which would eliminate the requirement to open-source your code.

Understanding Clinical Data Standards

CDISC (Clinical Data Interchange Standards Consortium) is a global, non-profit organization that develops and promotes data standards to support the acquisition, exchange, submission, and archive of clinical research data and metadata. 

CDISC Foundational Standards provide a comprehensive set of data standards that improve the quality, efficiency, and cost-effectiveness of clinical research. These standards cover various aspects of clinical research, from study design to data collection, management, analysis, and reporting.

The standards include the

These standards help ensure that clinical trial data is organized and analyzed consistently and accurately, enhancing the efficiency and quality of clinical research. The QRS supplements provide a standardized way to assess clinical concepts or task-based observations.

ADaM is required by both the FDA (US) and PMDA (Japan) for new drug applications, while the FDA requires SEND for nonclinical studies. CDISC standards improve transparency and traceability, making it easier for regulators and others to conduct data reviews. 

Insights Engineering

Pharmaceutical Engineering open source R projects

{teal}: Interactive Exploratory Data Analysis with Shiny Web Applications

{teal} is a framework for interactive exploratory data analysis that uses Shiny web applications. {teal} applications require specifying data, including CDISC data, independent datasets, related datasets, and MultiAssayExperiment objects. 

The framework also provides modules for performing analysis, such as outlier exploration and data visualization. {teal} modules are built within the framework and can be found in packages like {teal.modules.general}, {teal.modules.clinical} and {teal.modules.hermes}. 

The framework’s functionality is derived from packages like {teal.data}, {teal.widgets}, {teal.slice}, {teal.code}, {teal.transform}, {teal.logger} and {teal.reporter}. There is also a package called {teal.osprey} that collects community-contributed teal modules. Users can refer to these packages for more information on how to use different parts of the {teal} framework.

{hermes}

{hermes} is a tool that helps with the preprocessing, analysis, and reporting of RNA-seq data. It can import RNA-seq count data and annotate gene information from a central database like BioMart. It also adds quality control flags to genes and samples, filters the data set, and normalizes counts. 

{hermes} works with data structures from Bioconductor packages, which makes it interoperable with them. It can also quickly generate descriptive plots, perform principal components analysis, and produce a QC report based on a template. Additionally, it can perform differential expression analysis.
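To make the preprocessing steps above more concrete, here is a minimal base R sketch of count normalization, a low-expression flag, and a principal components analysis. It deliberately does not use the {hermes} API. The toy count matrix, the counts-per-million normalization, and the `min_cpm` threshold are all assumptions chosen for illustration, not {hermes} defaults.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns). Values are made up.
counts <- matrix(
  c(100, 200, 300,
      0,   1,   0,
    500, 400, 600,
     50,  80,  20),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Counts per million (CPM): divide each column by its library size.
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Hypothetical QC rule: flag genes whose CPM is below the threshold in
# every sample. The threshold is arbitrary and only suits this toy data.
min_cpm <- 5000
low_expression <- rowSums(cpm >= min_cpm) == 0

# Drop flagged genes, log-transform, and run a PCA with samples as rows.
filtered <- cpm[!low_expression, , drop = FALSE]
log_cpm <- log2(filtered + 1)
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)

print(low_expression)
print(summary(pca))
```

In a real workflow the counts would live in a Bioconductor container rather than a bare matrix, and the QC thresholds would be chosen from the data.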

{tern}

The {tern} package offers analysis functions for generating the tables and graphs commonly used in clinical trial reporting. These include data visualizations such as forest plots, line plots, and Kaplan-Meier plots, as well as statistical model fits like logistic regression and Cox regression. 

Additionally, {tern} allows for the creation of summary tables containing information about unique patients, exposure across patients, and changes from baseline for parameters. Furthermore, {tern} outputs can be added to {teal} applications for interactive exploration of data through modules available in the {teal.modules.clinical} package.
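As an illustration of what a change-from-baseline summary contains, the following base R sketch derives the change per subject and summarizes it by treatment arm. It does not use {tern} functions. The dataset is made up, and the column names (`USUBJID`, `ARM`, `AVISIT`, `AVAL`, `BASE`, `CHG`) are assumptions loosely following ADaM naming.

```r
# Toy long-format analysis data, one row per subject and visit. Values are made up.
adlb <- data.frame(
  USUBJID = rep(c("01", "02", "03", "04"), each = 2),
  ARM     = rep(c("Placebo", "Placebo", "Drug", "Drug"), each = 2),
  AVISIT  = rep(c("BASELINE", "WEEK 4"), times = 4),
  AVAL    = c(10, 11, 12, 12, 11, 8, 13, 9)
)

# Attach each subject's baseline value, then derive change from baseline.
base <- adlb[adlb$AVISIT == "BASELINE", c("USUBJID", "AVAL")]
names(base)[2] <- "BASE"
adlb <- merge(adlb, base, by = "USUBJID")
adlb$CHG <- adlb$AVAL - adlb$BASE

# Summarize the post-baseline visit by arm: unique patients and mean change.
post <- adlb[adlb$AVISIT == "WEEK 4", ]
n_subj   <- tapply(post$USUBJID, post$ARM, function(x) length(unique(x)))
chg_mean <- tapply(post$CHG, post$ARM, mean)

chg_summary <- data.frame(
  ARM      = names(chg_mean),
  n        = as.vector(n_subj),
  mean_chg = as.vector(chg_mean)
)
print(chg_summary)
```

A table framework adds formatting, nesting by visit and parameter, and column layouts on top of this kind of calculation.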

Reference Based Multiple Imputation {rbmi}

The {rbmi} package is designed for imputing missing data in clinical trials with continuous multivariate normal longitudinal outcomes. It supports imputation under a missing at random (MAR) assumption, reference-based imputation methods, and delta adjustments for sensitivity analyses such as tipping point analyses. 

The package offers both Bayesian and approximate Bayesian multiple imputation, which are combined with Rubin’s rules for inference, as well as frequentist conditional mean imputation with jackknife or bootstrap resampling.
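Rubin’s rules themselves are simple to state: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/M). The base R sketch below implements this pooling step only. It is not the {rbmi} interface; the function name `pool_rubin` and the example numbers are hypothetical.

```r
# Pool estimates from M imputed datasets using Rubin's rules.
# `estimates` and `std_errors` hold one value per imputed dataset.
pool_rubin <- function(estimates, std_errors, conf_level = 0.95) {
  stopifnot(length(estimates) == length(std_errors), length(estimates) > 1)
  m <- length(estimates)

  pooled  <- mean(estimates)        # pooled point estimate
  within  <- mean(std_errors^2)     # average within-imputation variance
  between <- var(estimates)         # between-imputation variance
  total   <- within + (1 + 1 / m) * between

  # Rubin's degrees of freedom for the t reference distribution.
  df <- (m - 1) * (1 + within / ((1 + 1 / m) * between))^2
  half_width <- qt(1 - (1 - conf_level) / 2, df) * sqrt(total)

  list(
    estimate = pooled,
    se       = sqrt(total),
    df       = df,
    lower    = pooled - half_width,
    upper    = pooled + half_width
  )
}

# Made-up treatment effect estimates from five imputed datasets.
est <- c(-2.1, -1.8, -2.4, -2.0, -1.7)
se  <- c(0.90, 1.00, 0.95, 0.90, 1.05)
res <- pool_rubin(est, se)
print(res)
```

The between-imputation term is what carries the extra uncertainty due to the missing data, so the pooled standard error is always at least as large as the root of the average within-imputation variance.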

Pharmaverse

Pharmaverse is a network of pharmaceutical industry professionals working collaboratively to create a curated and opinionated subset of open-source software packages and codebases based on the R language. 

The objective is to deliver a complete clinical data pipeline from data collection to regulatory submission that is more efficient and sustainable through shared development and maintenance efforts, with a focus on reducing duplication of efforts and gaining increased harmonization across the industry. The initiative aims to attract the next generation of software developers and data scientists to the industry and provide increased transparency. 

The scope of Pharmaverse is the journey from Case Report Form (CRF) through to submission for clinical trial analysis reporting via R packages, with three categories of R packages recommended: 

  • External to pharma (transcends specifically pharma needs).
  • Pharma-specific independent of pharmaverse (created for use in pharma, but not necessarily following the pharmaverse charter and recommendations).
  • Pharma-specific under pharmaverse (created for use in pharma, following the pharmaverse charter and recommendations). 

The aim is not to agree on cross-industry implementations of CDISC standards, but rather to act as a starting point for code reuse that is standard agnostic and future-proof. The design and architecture of the packages allow companies to adapt them to internal, proprietary workflows. For example, the {admiral} package has the extensions {admiralvaccine} and {admiralophtha}, but there is also an {admiralroche} that is internal to Roche and not shared (with a proprietary license).

Pharmaverse packages will be easily locatable and accessible via a single site, with clear use cases for clinical trial reporting and sharing of differences and unique merits to help companies or users choose which packages or tools to adopt.

Pharmaverse End-to-End Clinical Reporting Packages

Let us explore some of the End-to-End Clinical Reporting Packages from Pharmaverse. The following compilation comprises ‘some’ of the open-source R packages that are relevant to end-to-end clinical reporting in the pharmaceutical industry. The Pharmaverse council aims to organize and curate these packages into a well-defined stack in due course.

| Package Name | Category | Status | Description | URL |
| --- | --- | --- | --- | --- |
| sdtmchecks | SDTM | Released | Contains functions to identify common data issues in SDTM data. These checks are intended to be generalizable, actionable, and meaningful for analysis. | https://pharmaverse.github.io/sdtmchecks/index.html |
| datacutr | SDTM | Upcoming | Processes tabulation data in compliance with the SDTM standard. It assumes that supplemental qualifiers have already been merged with their parent domain before the cut is applied. Users can choose the type of cut to apply to each domain: no cut, patient cut, date cut, or a special DM cut. | https://pharmaverse.github.io/datacutr/main/ |
| roak | SDTM | Upcoming | Enables SDTM mapping algorithms via R functions. | Not yet available |
| admiral | ADaM | Released | ADaM in R Asset Library: an open-source, modular toolbox that empowers pharmaceutical programmers to generate ADaM datasets using R. | https://pharmaverse.github.io/admiral/cran-release/ |
| admiralonco | ADaM | Released | Complementary toolbox to admiral for generating ADaM datasets for oncology. | https://pharmaverse.github.io/admiralonco/main/ |
| admiralophtha | ADaM | Released | Complementary toolbox to admiral for generating ADaM datasets for ophthalmology. | https://pharmaverse.github.io/admiralophtha |
| admiralvaccine | ADaM | Upcoming | Complementary toolbox to admiral for generating ADaM datasets for vaccine domains. | https://pharmaverse.github.io/admiralvaccine/main/ |
| rtables | TLGs – Tables | Released | A framework for declaring complex multi-level tabulations and then applying them to data. | https://insightsengineering.github.io/rtables/ |
| pharmaRTF | TLGs – Tables | Released | An enhanced RTF wrapper written in R for use with existing R table packages such as huxtable or gt. | https://atorus-research.github.io/pharmaRTF/ |
| Tplyr | TLGs – Tables | Released | A package to simplify the data manipulation necessary to create clinical reports. | https://cran.r-project.org/web/packages/Tplyr/index.html |
| tfrmt | TLGs – Tables | Released | A language for defining display-related metadata to automate the transformation from an Analysis Results Dataset (ARD) to a table. | https://gsk-biostatistics.github.io/tfrmt/ |
| tidytlg | TLGs – Tables | Released | Generates tables, listings, and graphs (TLG) using the Tidyverse. | https://pharmaverse.github.io/tidytlg |
| visR | TLGs – Plots | Released | Enables fit-for-purpose, reusable clinical and medical research-focused visualizations and tables with sensible defaults, based on sound graphical principles. | https://openpharma.github.io/visR/ |
| ggplot2 | TLGs – Plots | Released | An implementation of the Grammar of Graphics in R, and the most popular package for static plots in R. | https://ggplot2.tidyverse.org/ |
| tidyCDISC | TLGs – Interactive | Released | A Shiny app to easily create custom tables and figures from ADaM-ish data sets. | https://biogen-inc.github.io/tidyCDISC/ |
| tern | TLGs – Frameworks | Released | Layers analytics, from descriptive summaries to more complex statistics, on top of the foundational table layouts, analytic and content controls. | https://insightsengineering.github.io/tern/ |
| teal | TLGs – Frameworks | Released | A framework that leverages the R Shiny package to scale the development of Shiny apps. | https://insightsengineering.github.io/teal/ |
| chevron | TLGs – Tables | Upcoming | Holds TLG template standards to feed into other tabulation packages. | Not yet available |
| xportr | eSub | Released | Serves the dual purpose of creating SAS transport files and conducting validation checks on pharmaceutical-specific datasets. | https://atorus-research.github.io/xportr/ |
| pkglite | eSub | Released | Enables the representation and exchange of R package source code as text files that can be easily read and understood. | https://merck.github.io/pkglite/ |
| metacore | Metadata | Released | Offers an interface for ingesting diverse metadata sources and storing the metadata in a uniform object. | https://atorus-research.github.io/metacore/ |
| metatools | Metadata | Released | Leverages metadata to facilitate the creation, improvement, and validation of datasets using metacore objects. This includes the ability to add supplemental qualifiers to, or remove them from, the parent SDTM domain. | https://pharmaverse.github.io/metatools/ |
| logrx | Utility | Released | Produces an execution log that helps ensure the reproducibility and traceability of an R script. | https://pharmaverse.github.io/logrx/ |
| synthetic.cdisc.data | Utility | Released | Generates synthetic CDISC data. | https://github.com/Roche/synthetic.cdisc.data |
| envsetup | Utility | Released | Supports the setup of an R environment. | https://pharmaverse.github.io/envsetup/main/index.html |
| covtracer | Package Validation | Released | Offers automated tracing of code coverage by means of a network of test executions, linking tests to code and code to documentation. Its primary use case is to provide a traceability matrix for CI/CD-based validation. | https://github.com/Genentech/covtracer |
| thevalidatoR | Package Validation | Released | A GitHub action available on the GitHub Marketplace, designed to deliver consistent build reports for validation purposes. It serves as an alternative to valtools and is based on the standard R framework for validation. | https://github.com/marketplace/actions/r-package-validation-report |
| valtools | Package Validation | Released | A validation framework for R packages that provides the ability to incorporate supplementary validation documentation as needed. | https://phuse-org.github.io/valtools/ |
| diffdf | Quality Checking | Released | Facilitates a comprehensive comparison of two data.frames, providing detailed insights into any differences between them. | https://gowerc.github.io/diffdf/ |
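To give a feel for the kind of processing these packages automate, here is a rough base R sketch of the patient cut and date cut mentioned in the datacutr entry. It does not use the datacutr API and is only one plausible reading of those terms. The helper names `patient_cut` and `date_cut`, the `dcut` dataset, the `DCUTDT` and `AESTDT` columns, and the choice to keep records with missing dates are all assumptions made for illustration.

```r
# Hypothetical cut-off dataset: one row per patient included in the data cut.
dcut <- data.frame(
  USUBJID = c("01", "02"),
  DCUTDT  = as.Date(c("2023-06-30", "2023-06-30"))
)

# Toy adverse event records, with dates already converted to Date. Values are made up.
ae <- data.frame(
  USUBJID = c("01", "01", "02", "03"),
  AETERM  = c("HEADACHE", "NAUSEA", "RASH", "FATIGUE"),
  AESTDT  = as.Date(c("2023-05-01", "2023-07-15", "2023-06-30", "2023-04-10"))
)

# Patient cut: keep only records for patients present in the cut dataset.
patient_cut <- function(domain, dcut) {
  domain[domain$USUBJID %in% dcut$USUBJID, , drop = FALSE]
}

# Date cut: patient cut, plus removal of records dated after the cut-off.
# Assumption: records with a missing date are kept rather than dropped.
date_cut <- function(domain, dcut, date_var) {
  merged <- merge(domain, dcut, by = "USUBJID")  # inner join acts as the patient cut
  keep <- is.na(merged[[date_var]]) | merged[[date_var]] <= merged$DCUTDT
  kept <- merged[keep, names(domain), drop = FALSE]
  rownames(kept) <- NULL
  kept
}

ae_patient_cut <- patient_cut(ae, dcut)
ae_date_cut    <- date_cut(ae, dcut, "AESTDT")
print(ae_patient_cut)
print(ae_date_cut)
```

Real SDTM dates are ISO 8601 character values that may be partial, so a production implementation needs imputation rules before any comparison like this one.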

Conclusion: R Packages for Clinical Trial Data

The life sciences industry is expanding rapidly, and open-source initiatives have given innovation a significant boost. With the growth of package development in programming languages like R, conducting clinical trials and monitoring and analyzing their data has become more streamlined and precise. 

Collaborative projects in the life sciences domain, spearheaded by pharmaceutical companies and experts, have propelled advancements in research, enabling more rapid and precise discoveries. These open-source projects, which leverage cutting-edge technologies and techniques, are expected to pave the way for accelerated progress in the life sciences industry, leading to better healthcare outcomes for people worldwide.