Assistant Professor
Department of Statistics
Texas A&M University
I joined the Department of Statistics at Texas A&M University (TAMU) in Fall 2021. I am passionate about creating novel computationally-efficient statistical methods for the analysis of time series and longitudinal data in areas such as sleep research, neuroscience, and psychiatry. I am involved in numerous transdisciplinary projects in these areas and aim to produce high-quality publications in top statistics and scientific journals.
I am always looking for new and exciting opportunities for collaboration. Please contact me if you are interested in working together on a project.
Improvements in data collection and storage capabilities have led to a growing interest in the study of temporal relationships and dynamics for large, complex data. I am involved in the development of novel statistical methods for the analysis of time series and longitudinal data generated from modern experimental and observational studies in areas such as sleep research, neuroscience, and psychiatry. Many applications in these fields are not served holistically by existing theory and methods; I am passionate about developing practical, computationally efficient analytical tools that address and utilize the full nature of the complexity underlying the data generating process.
My methodological research with strong practical applications has led to a number of successful collaborations and publications in high-impact statistics and scientific journals, such as JASA, Biometrics, and IEEE Transactions on Big Data.
Background
Loop diuretics are commonly used for patients with heart failure (HF) but it remains unknown if one loop diuretic is clinically superior.
Hypothesis
Biomarkers and proteomics provide insight into how different loop diuretics may differentially affect outcomes.
Methods
Blood and urine were collected from outpatients with HF who were taking torsemide or furosemide for >30 days. Differences were assessed in cardiac, renal, and inflammatory biomarkers and soluble protein panels using the Olink Cardiovascular III and inflammation panels.
Results
Of 78 subjects, 55 (71%) were treated with furosemide and 23 (29%) with torsemide, and 25 provided a urine sample (15 treated with furosemide, 10 with torsemide). Patients taking torsemide were older (68 vs 64 years) and had a lower mean eGFR (46 vs 54 mL/min/1.73 m²), and a higher proportion were women (39% vs 24%) and Black (43% vs 27%). In plasma, levels of hs-cTnT, NT-proBNP, and hsCRP were not significantly different between groups. In urine, there were significant differences in urinary albumin, β-2M, and NGAL, with higher levels in the torsemide-treated patients. Of 184 proteins tested in Olink panels, in plasma, 156 (85%) were higher in patients taking torsemide but none were significantly different after correcting for false discovery.
Conclusions
We show differences in urinary biomarkers but few differences in plasma biomarkers among HF patients on different loop diuretics. Olink technology can detect differences in plasma protein levels from multiple biologic domains. These findings raise the importance of defining differences in mechanisms of action of each diuretic in an appropriately powered study.
Background
Pediatric SARS-CoV-2 data remain limited and seropositivity rates in children were reported as <1% early in the pandemic. Seroepidemiologic evaluation of SARS-CoV-2 in children in a major metropolitan region of the US was performed.
Methods
Children and adolescents ≤19 years were enrolled in a cross-sectional, observational study of SARS-CoV-2 seroprevalence from July-October 2020 in Northern Virginia, US. Demographic, health, and COVID-19 exposure information was collected, and blood analyzed for SARS-CoV-2 spike protein total antibody. Risk factors associated with SARS-CoV-2 seropositivity were analyzed. Orthogonal antibody testing was performed, and samples were evaluated for responses to different antigens.
Results
In 1038 children, the anti-SARS-CoV-2 total antibody positivity rate was 8.5%. After multivariate logistic regression, significant risk factors included Hispanic ethnicity, public or absent insurance, a history of COVID-19 symptoms, exposure to a person with COVID-19, a household member positive for SARS-CoV-2, and multi-family or apartment dwelling without a private entrance. 66% of seropositive children had no symptoms of COVID-19. In secondary analysis, orthogonal antibody testing with assays for 1) a receptor binding domain specific antigen and 2) a nucleocapsid specific antigen had concordance rates of 80.5% and 79.3%, respectively.
Conclusion
A much higher burden of SARS-CoV-2 infection, as determined by seropositivity, was found in children than previously reported; this was also higher compared to adults in the same region at a similar time. Contrary to prior reports, we determined children shoulder a significant burden of COVID-19 infection. The role of children’s disease transmission must be considered in COVID-19 mitigation strategies including vaccination.
Background
Because of their direct patient contact, healthcare workers (HCW) face an unprecedented risk of exposure to COVID-19. The aim of this study was to examine incidence of COVID-19 disease among asymptomatic HCW and community participants in Northern Virginia during 6 months of follow-up.
Methods
This is a prospective cohort study that enrolled healthy HCW and residents who never had a symptomatic COVID-19 infection prior to enrolment from the community in Northern Virginia from April to November 2020. All participants were invited to enrol in the study, and they were followed at 2- and 6-month intervals. Participants were evaluated by commercial chemiluminescence SARS-CoV-2 serology assays as part of a regional health system and public health surveillance program to monitor the spread of COVID-19 disease.
Findings
Of a total of 1,819 asymptomatic HCW enrolled, 1,473 (96%) had data at the two-month interval, and 1,323 (73%) participants had data at the six-month interval. At baseline, 21 (1.15%) were found to have prior COVID-19 exposure. At the two-month interval, the COVID-19 rate was 2.8%, and at the six-month follow-up the overall incidence rate increased to 4.8%, but was as high as 7.9% among those in the youngest age group (20–29 years). Seroconversion rates in HCW were comparable to the seropositive rates in the Northern Virginia community. The overall incidence of COVID-19 in the community was 4.5%, but the estimate was higher among participants of Hispanic ethnicity (incidence rate = 15.3%), potentially reflecting different socio-economic factors among the community participants and the HCW group. Using cross-sectional logistic regression and spatio-temporal mixed effects models, significant factors that influence the transmission rate among HCW include age, race/ethnicity, resident ZIP-code, and household exposure, but not direct patient contact.
Interpretation
In Northern Virginia, the seropositive rate of COVID-19 disease among HCW was comparable to that in the community.
This article introduces a flexible nonparametric approach for analyzing the association between covariates and power spectra of multivariate time series observed across multiple subjects, which we refer to as multivariate conditional adaptive Bayesian power spectrum analysis (MultiCABS). The proposed procedure adaptively collects time series with similar covariate values into an unknown number of groups and nonparametrically estimates group-specific power spectra through penalized splines. A fully Bayesian framework is developed in which the number of groups and the covariate partition defining the groups are random and fit using Markov chain Monte Carlo techniques. MultiCABS offers accurate estimation and inference on power spectra of multivariate time series with both smooth and abrupt dynamics across covariates by averaging over the distribution of covariate partitions. Performance of the proposed method compared with existing methods is evaluated in simulation studies. The proposed methodology is used to analyze the association between fear of falling and power spectra of center-of-pressure trajectories of postural control while standing in people with Parkinson's disease.
Purpose
Gait modifications designed to change a single kinematic parameter have reduced first peak internal knee abduction moment (PKAM). Prior research suggests unintended temporospatial and kinematic changes occur naturally while performing these modifications. We aimed to investigate i) the concomitant kinematic and temporospatial changes and ii) the relationship between gait parameters during three gait modifications (toe-in, medial knee thrust, and trunk lean gait).
Methods
Using visual real-time biofeedback, we collected 10 trials for each modification using individualized target gait parameters based on participants’ baseline mean and standard deviation. Repeated measures ANOVA was performed to determine significant differences between conditions. Mixed effects linear regression models were then used to estimate the linear relationships among variables during each gait modification. All modifications reduced KAM by at least 5%.
Results
Modifications resulted in numerous secondary changes between conditions such as increased knee abduction during toe-in gait and increased knee flexion with medial knee thrust. Within gait modifications, relationships between kinematic parameters were similar for toe-in gait and medial knee thrust (i.e. increased toe-in and decreased knee abduction), while increased trunk lean showed no relationship with any other kinematic parameters during trunk lean trials.
Conclusion
Two main mechanisms were found as a result of this investigation; the first being a pattern of toeing-in, knee abduction, flexion, and internal hip rotation, while trunk lean modification presented as a separate gait pattern with limited secondary changes. Future studies should consider providing feedback on multiple linked parameters, as it may feel more natural and optimize KAM reductions.
The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived, but they are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this article is to provide a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative sum based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum. Supplementary materials for this article are available online.
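As a small illustration of the "collapsed measures of power within local bands" mentioned in this abstract, the sketch below averages a toy time-varying spectrum within fixed bands. It does not implement the scanning procedure from the article. The function name `band_power`, the toy numbers, and the choice of the conventional low-frequency (0.04 to 0.15 Hz) and high-frequency (0.15 to 0.40 Hz) heart rate variability bands are assumptions made for illustration only.

```python
def band_power(spectrum, freqs, bands):
    """Average power within each band, separately for every time point.

    spectrum[t][j] is the power at time index t and frequency freqs[j];
    bands maps a label to a half-open interval [low, high) in Hz.
    """
    collapsed = {}
    for label, (low, high) in bands.items():
        cols = [j for j, f in enumerate(freqs) if low <= f < high]
        collapsed[label] = [sum(row[j] for j in cols) / len(cols) for row in spectrum]
    return collapsed


# Toy frequency grid (Hz): 0.05, 0.10, ..., 0.35
freqs = [j / 100 for j in range(5, 40, 5)]

# Toy time-varying spectrum with two time points (hypothetical values)
spectrum = [
    [4, 4, 1, 1, 1, 1, 1],  # time 0: power concentrated at low frequencies
    [2, 2, 3, 3, 3, 3, 3],  # time 1: power shifted toward high frequencies
]

# Conventional HRV bands, used here only as an example of a fixed partition
bands = {"LF": (0.04, 0.15), "HF": (0.15, 0.40)}

collapsed = band_power(spectrum, freqs, bands)
print(collapsed)  # one band-power time series per band
```

A data-driven method would replace the fixed `bands` dictionary with boundaries estimated from the signal itself.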
The twenty-four hour sleep-wake pattern known as the rest-activity rhythm (RAR) is associated with many aspects of health and well-being. Researchers have utilized a number of interpretable, person-specific RAR measures that can be estimated from actigraphy. Actigraphs are wearable devices that dynamically record acceleration and provide indirect measures of physical activity over time. One class of useful RAR measures are those that quantify variability around a mean circadian pattern. However, current parametric and non-parametric RAR measures used by applied researchers can only quantify variability from a limited or undefined number of rhythmic sources. The primary goal of this article is to consider a new measure of RAR variability: the log-power spectrum of stochastic error around a circadian mean. This functional measure quantifies the relative contributions of variability about a circadian mean from all possible frequencies, including weekly, daily, and high-frequency sources of variation. It can be estimated through a two-stage procedure that smooths the log-periodogram of residuals after estimating a circadian mean. The development of this measure was motivated by a study of depression in older adults and revealed that slow, rhythmic variations in activity from a circadian pattern are correlated with depression symptoms.
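The two-stage idea described in this abstract can be sketched roughly as follows, on simulated hourly activity. The abstract does not specify the estimators here, so the time-of-day average used for the circadian mean, the moving-average smoother, and all function names and simulation settings are assumptions for illustration rather than the article's method.

```python
import cmath
import math
import random


def circadian_mean(activity, period):
    """Stage 1: average across days at each time of day (one simple mean estimate)."""
    n_days = len(activity) // period
    return [sum(activity[d * period + t] for d in range(n_days)) / n_days
            for t in range(period)]


def log_periodogram(resid, n_days):
    """Log raw periodogram of the residuals at Fourier frequencies k/n.

    Daily harmonics (k a multiple of n_days) are skipped: subtracting the
    time-of-day average removes all power there, so the log is undefined.
    """
    n = len(resid)
    freqs, logp = [], []
    for k in range(1, n // 2 + 1):
        if k % n_days == 0:
            continue
        # Naive DFT; adequate for a short illustrative series
        dft = sum(r * cmath.exp(-2j * math.pi * k * t / n) for t, r in enumerate(resid))
        freqs.append(k / n)  # cycles per sample
        logp.append(math.log(abs(dft) ** 2 / n))
    return freqs, logp


def moving_average(values, half_width):
    """Stage 2 smoother: running mean with a window truncated at the edges."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half_width), min(len(values), i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


# Simulated data: 14 days of hourly activity with a 24-hour rhythm plus noise
random.seed(0)
period, n_days = 24, 14
activity = [10 + 5 * math.cos(2 * math.pi * t / period) + random.gauss(0, 1)
            for t in range(period * n_days)]

mean_curve = circadian_mean(activity, period)
resid = [a - mean_curve[t % period] for t, a in enumerate(activity)]
freqs, logp = log_periodogram(resid, n_days)
smoothed = moving_average(logp, half_width=5)  # estimated log-power spectrum
```

With white-noise residuals, as simulated here, `smoothed` should be roughly flat across frequency; structured variability around the circadian mean would appear as peaks or slopes.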
Alcohol use accelerates during late adolescence, predicting the development of alcohol use disorders (AUDs) and other negative outcomes. Identifying modifiable risk factors for alcohol use during this time could lead to novel prevention approaches. Burgeoning evidence suggests that sleep and circadian factors are cross-sectionally and longitudinally linked to alcohol use and problems, but more proximal relationships have been understudied. Circadian misalignment, in particular, is hypothesized to increase the risk for AUDs, but almost no published studies have included a biological measure of misalignment. In the present study, we aimed to extend existing research by assessing the relationship between adolescent circadian misalignment and alcohol use on a proximal timeframe (over two weeks) and by including three complementary measures of circadian alignment. We studied 36 healthy late adolescent (18-22 years old, 22 females) alcohol drinkers (reporting ≥1 standard drink per week over the past 30 days) over 14 days. Throughout the study, participants reported prior day's alcohol use and prior night's sleep each morning via smartphone and a secure, browser-based interface. Circadian phase was assessed via the dim light melatonin onset (DLMO) in the laboratory on two occasions (Thursday and Sunday nights) in counterbalanced order. The three measures of circadian alignment included DLMO-midsleep interval, "classic" social jet lag (weekday-weekend difference in midsleep), and "objective" social jet lag (weekday-weekend difference in DLMO). Multivariate imputation by chained equations was used to impute missing data, and Poisson regression models were used to assess associations between circadian alignment variables and weekend alcohol use. Covariates included sex, age, Thursday alcohol use, and Thursday sleep characteristics.
As predicted, greater misalignment was associated with greater weekend alcohol use for two of the three alignment measures (shorter DLMO-midsleep intervals and larger weekday-weekend differences in midsleep), while larger weekday-weekend differences in DLMO were associated with less alcohol use. Notably, in contrast to expectations, weekday-weekend differences in DLMO were nearly equally distributed between individuals advancing over the weekend and those delaying over the weekend. This unexpected finding plausibly reflects the fact that college students are not subject to the same systematically earlier weekday schedules observed in high school students and working adults. These preliminary findings support the need for larger, more definitive studies investigating the proximal relationships between circadian alignment and alcohol use among late adolescents.
Dramatic increases in the size and complexity of modern datasets have made traditional "centralized" statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g. discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for "small" data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous "local" inferences from partitioned data using meta-analysis techniques to arrive at the "global" inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.
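The partition, local inference, and combine pattern in components (i) and (iii) can be sketched in a few lines. This is not MetaLP itself: the nonparametric transformation in component (ii) is omitted, the local inference is just a sample mean, and the combination step uses a standard fixed-effect (inverse-variance) meta-analysis. All function names and the simulated data are hypothetical.

```python
import random
import statistics


def partition(data, k):
    """(i) Split the data into k roughly equal, non-overlapping chunks."""
    return [data[i::k] for i in range(k)]


def local_inference(chunk):
    """Local estimate of the mean and the variance of that estimate."""
    n = len(chunk)
    return statistics.fmean(chunk), statistics.variance(chunk) / n


def combine(local_results):
    """(iii) Fixed-effect meta-analysis: inverse-variance weighted average."""
    weights = [1.0 / var for _, var in local_results]
    total = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, local_results)) / total
    return estimate, 1.0 / total  # global estimate and its variance


# Simulated "big" dataset; each chunk could be processed on a separate worker
random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(100_000)]

chunks = partition(data, k=20)
global_est, global_var = combine([local_inference(c) for c in chunks])
print(global_est, global_var ** 0.5)  # close to the full-data mean and its standard error
```

Only the small per-chunk summaries need to be brought together, never the raw data, which is what makes this pattern attractive for distributed computation.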
The historical and geographical spread from older to more modern languages has long been studied by examining textual changes and in terms of changes in phonetic transcriptions. However, it is more difficult to analyse language change from an acoustic point of view, although this is usually the dominant mode of transmission. We propose a novel analysis approach for acoustic phonetic data, where the aim will be to model the acoustic properties of spoken words statistically. We explore phonetic variation and change by using a time–frequency representation, namely the log‐spectrograms of speech recordings. We identify time and frequency covariance functions as a feature of the language; in contrast, mean spectrograms depend mostly on the particular word that has been uttered. We build models for the mean and covariances (taking into account the restrictions placed on the statistical analysis of such objects) and use these to define a phonetic transformation that models how an individual speaker would sound in a different language, allowing the exploration of phonetic differences between languages. Finally, we map back these transformations to the domain of sound recordings, enabling us to listen to the output of the statistical analysis. The approach proposed is demonstrated by using recordings of the words corresponding to the numbers from 1 to 10 as pronounced by speakers from five different Romance languages.
Many studies of biomedical time series signals aim to measure the association between frequency‐domain properties of time series and clinical and behavioral covariates. However, the time‐varying dynamics of these associations are largely ignored due to a lack of methods that can assess the changing nature of the relationship through time. This article introduces a method for the simultaneous and automatic analysis of the association between the time‐varying power spectrum and covariates, which we refer to as conditional adaptive Bayesian spectrum analysis (CABS). The procedure adaptively partitions the grid of time and covariate values into an unknown number of approximately stationary blocks and nonparametrically estimates local spectra within blocks through penalized splines. CABS is formulated in a fully Bayesian framework, in which the number and locations of partition points are random, and fit using reversible jump Markov chain Monte Carlo techniques. Estimation and inference averaged over the distribution of partitions allows for the accurate analysis of spectra with both smooth and abrupt changes. The proposed methodology is used to analyze the association between the time‐varying spectrum of heart rate variability and self‐reported sleep quality in a study of older adults serving as the primary caregiver for their ill spouse.
Player tracking data provides a platform for the creation of new basketball statistics that can dramatically improve the ability to evaluate and compare player performance. However, the increasing size of this new data source presents challenges in how to efficiently analyze the data and interpret findings. A scalable analytical framework is needed that can effectively reduce the dimensionality of the data while retaining the ability to compare player performance.
In this paper, Principal Component Analysis (PCA) is used to identify four components accounting for 68% of the variation in player tracking data from the 2013-2014 regular season. The most influential statistics on these new dimensions are used to construct intuitive, practical interpretations. In this high variance, low dimensional space, comparisons across any or all of the principal components are possible to evaluate characteristics that make players and teams similar or unique. A simple measure of similarity between player or team statistical profiles based on the four principal components is also constructed. The Statistical Diversity Index (SDI) allows for quick and intuitive comparisons using the entirety of the player tracking data. As new statistics emerge, this framework is scalable as it can incorporate existing and new data sources by reconstructing principal component dimensions and SDI for improved comparisons.
Using principal component scores and SDI, several use cases are presented for improved personnel management. Team principal component scores are used to quickly profile and evaluate team performance, more specifically how New York’s lack of ball movement negatively impacted success despite high average scoring efficiency as a team. SDI is used to identify players across the NBA with the most similar statistical performances to specific players. All-Star Tony Parker and shooting specialist Anthony Morrow are used as two examples and presented with in-depth comparisons to similar players using principal component scores and player tracking statistics. This approach can be used in salary negotiations, free agency acquisitions and trades, role player replacement, and more.
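The PCA-and-similarity workflow described in this abstract can be sketched on synthetic data. The sketch standardizes the statistics, extracts leading principal components by power iteration, and compares profiles by Euclidean distance between component scores. The formula for the Statistical Diversity Index is not given here, so that distance is only a stand-in and not the paper's SDI. The data, the choice of two components, and all function names are assumptions.

```python
import math
import random


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(z):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def top_components(cov, k, iters=500, seed=0):
    """Leading k (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    rng = random.Random(seed)
    a = [row[:] for row in cov]
    p = len(a)
    pairs = []
    for _component in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _step in range(iters):
            w = matvec(a, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(a, v)))
        pairs.append((lam, v))
        # Remove the component just found so the next one can emerge
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs


def distance(u, v):
    """Euclidean distance between two score vectors (stand-in similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Synthetic "players": six statistics driven by two latent traits plus noise
random.seed(1)
players = []
for _ in range(60):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    players.append([a + random.gauss(0, 0.3) for _ in range(3)] +
                   [b + random.gauss(0, 0.8) for _ in range(3)])

z = standardize(players)
cov = covariance(z)
pairs = top_components(cov, k=2)
explained = sum(lam for lam, _ in pairs) / sum(cov[i][i] for i in range(len(cov)))
scores = [[sum(x * y for x, y in zip(row, vec)) for _, vec in pairs] for row in z]

# Player whose profile is closest to player 0 in the reduced space
nearest = min(range(1, len(scores)), key=lambda i: distance(scores[0], scores[i]))
print(round(explained, 2), nearest)
```

New statistics would be handled by adding columns to `players` and recomputing the components and scores, which mirrors the scalability argument in the abstract.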
In the classroom, I challenge students to think critically about real-world analytical problems, then design and apply statistical data modeling and computational solutions to address them.