Molecular data gathered on human immunodeficiency virus (HIV) is useful for understanding the dynamics of epidemic spread. Such understanding allows us to better target interventions and treatment to high-risk groups of individuals. Methods of epidemic intervention include antiretroviral therapy (ART) and awareness programs [1]. Adherence to ART can lead to viral suppression in people living with HIV (PLWH) and significantly reduce their risk of onward transmission, making ART distribution a potentially effective approach to combating the spread of HIV. However, a major issue for public health officials is how to allocate the limited available resources.
In many parts of the world, it has become standard practice to record various patient metadata when testing and treating PLWH, including viral genomic sequences (often of the pol and gag regions). This information is often used to identify groups of individuals at high risk of future transmission, which can help public health officials better allocate limited resources [2]. The prioritization of PLWH can be framed as a computational task: given a list of individuals along with metadata and viral sequences, order the individuals in descending order of inferred risk of future transmission.
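As a minimal illustration of this task, the sketch below simply sorts individuals by inferred risk; the risk scores here are a hypothetical stand-in for whatever quantity a given method infers from the sequences and metadata:

```python
# Minimal sketch of the prioritization task: given per-individual risk
# scores inferred by some method (hypothetical here), return the
# individuals sorted from highest to lowest inferred risk.
def prioritize(individuals, risk_score):
    """individuals: list of IDs; risk_score: dict mapping ID -> float."""
    return sorted(individuals, key=lambda u: risk_score[u], reverse=True)
```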
Molecular epidemiology provides a natural framework for prioritizing individuals from viral sequence data. Currently, the standard approach is to use HIV-TRACE [3] to infer transmission clusters from pairwise distances between sequences, monitor the growth of these clusters over time, and prioritize individuals in descending order of cluster growth. In contrast, ProACT [4] prioritizes individuals using properties of a phylogeny inferred from the viral sequences.
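To make the cluster-based approach concrete, the sketch below links any two sequences whose pairwise distance falls below a threshold and reports the resulting connected components as clusters (HIV-TRACE computes pairwise TN93 distances and links sequences below a user-chosen threshold); the distance function is left abstract here:

```python
# Sketch of threshold-based transmission clustering in the spirit of
# HIV-TRACE: link sequence pairs whose distance is below a threshold,
# then report connected components (via union-find) as clusters.
from itertools import combinations

def transmission_clusters(ids, dist, threshold=0.015):
    """ids: sequence IDs; dist: symmetric pairwise distance function."""
    parent = {u: u for u in ids}           # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u
    for u, v in combinations(ids, 2):
        if dist(u, v) < threshold:
            parent[find(u)] = find(v)      # merge the two components
    clusters = {}
    for u in ids:
        clusters.setdefault(find(u), []).append(u)
    return list(clusters.values())
```

Prioritization then follows from comparing the clusters inferred at consecutive time points and ranking individuals by how quickly their clusters grew.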
The following questions naturally arise: how well does a given prioritization method perform, and which method is superior in a specific context? With real-world data, the ground truth of who transmitted to whom is typically unavailable or error-prone. Further, even with a known transmission history, it is unclear how to quantify effectiveness: do we count the number of transmissions from a single individual, the total number of transmissions in the transmission chain seeded by a single individual, or perhaps properties of the underlying contact network (e.g., the number of an individual's social contacts)? Thus, it is unclear how to quantitatively assess the performance of different prioritization methods.
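To illustrate the ambiguity, the sketch below computes the three candidate ground-truth quantities just mentioned from a known transmission history; the data representations and all names are illustrative, not part of any of the tools discussed here:

```python
# Three candidate ground-truth measures of an individual's importance,
# computed from a known transmission history. The history maps each
# infected individual to the individual who infected them (None for
# epidemic seeds); the contact network is an adjacency dict.
def direct_transmissions(history, u):
    # number of individuals infected directly by u
    return sum(1 for src in history.values() if src == u)

def chain_size(history, u):
    # total number of transmissions in the chain seeded by u
    children = {}
    for v, src in history.items():
        children.setdefault(src, []).append(v)
    count, stack = 0, [u]
    while stack:
        w = stack.pop()
        count += len(children.get(w, []))
        stack.extend(children.get(w, []))
    return count

def num_contacts(contact_network, u):
    # number of social contacts of u in the underlying contact network
    return len(contact_network[u])
```

A prioritization that looks strong under one of these quantities may look weak under another, which is precisely why a well-defined suite of metrics is needed.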
To address this open problem, we introduce SEPIA (Simulation-based Evaluation of PrIoritization Algorithms), a novel simulation-based framework for measuring the effectiveness of prioritization algorithms. Previously, Moshiri et al. (2021) [4] compared ProACT and HIV-TRACE with respect to effectiveness, but the comparison was limited to a simulated dataset modeling the San Diego HIV epidemic between 2005 and 2014. Like this prior work, SEPIA utilizes simulated epidemic data, such as those generated by FAVITES [5] or PANGEA.HIV.sim [6], to define a ground truth against which prioritization methods can be directly compared. However, SEPIA expands upon this prior work by generalizing the task of comparing prioritization effectiveness and by further exploring the mathematical meaning of "effectiveness," defining six metrics of effectiveness, each inspired by properties of epidemics that are inherently of interest to public health officials planning interventions. Specifically, the user runs a prioritization method on a simulated dataset; then, given the prioritization and the simulated dataset, SEPIA measures the effectiveness of the prioritization using the metrics defined below.
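As a purely illustrative example of this workflow (the function below is a hypothetical placeholder, not SEPIA's actual interface), one simple way to score an inferred ordering against simulated ground truth is a rank correlation between the ordering and a per-individual ground-truth quantity:

```python
# Hypothetical scoring of a prioritization against simulated ground
# truth: correlate the inferred ordering with a ground-truth quantity
# (e.g., direct transmission counts from the simulation). This is an
# illustration only; SEPIA's actual metrics are defined below.
from scipy.stats import kendalltau

def rank_agreement(ordering, truth):
    """ordering: IDs from most to least prioritized;
    truth: dict mapping ID -> ground-truth quantity (higher = riskier)."""
    inferred = [len(ordering) - i for i, u in enumerate(ordering)]
    actual = [truth[u] for u in ordering]
    tau, _ = kendalltau(inferred, actual)  # +1 = perfect agreement
    return tau
```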