Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores
We develop a variable selection method for interactions in regression models on large data in the context of genetics. The method is intended for investigating the influence of single-nucleotide polymorphisms (SNPs) and their interactions on health outcomes, which is a $p \gg n$ problem. We introduce cross leverage scores (CLSs) to detect interactions of variables while maintaining interpretability. Using this method, it is not necessary to consider every possible interaction between variables individually, which would be very time-consuming even for moderate numbers of variables. Instead, we calculate the CLS for each variable and obtain a measure of importance for this variable. Calculating the scores remains time-consuming for large data sets. The key idea for scaling to large data is to divide the data into smaller random batches or consecutive windows of variables. This avoids complex and time-consuming computations on high-dimensional matrices by performing the computations only for small subsets of the data, which is less costly. We compare these methods to provable approximations of CLS based on sketching, which aims at summarizing data succinctly. In a simulation study, we show that the CLSs are directly linked to the importance of a variable in the sense of an interaction effect. We further show that the approximation approaches are appropriate for performing the calculations efficiently on arbitrarily large data while preserving the interaction detection effect of the CLS. This underlines their scalability to genome-wide data. In addition, we evaluate the methods on real data from the HapMap project.
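The abstract does not state the formula for the scores or how the data matrix is arranged, so the following is only a minimal sketch under stated assumptions. It uses the standard definition from randomized linear algebra: for a matrix A with orthonormal column-space basis U, the leverage scores are the diagonal entries of U Uᵀ and the cross leverage scores are its off-diagonal entries. The function names `projection_scores` and `index_batches`, the toy Gaussian data, and the max-absolute-value summary per item are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def projection_scores(A):
    """Return U @ U.T for an orthonormal basis U of the column space of A.

    Diagonal entries are leverage scores of the rows of A; off-diagonal
    entries are cross leverage scores between pairs of rows.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Drop directions with numerically zero singular values.
    tol = s.max() * max(A.shape) * np.finfo(float).eps
    U = U[:, s > tol]
    return U @ U.T


def index_batches(m, size, rng=None):
    """Split indices 0..m-1 into consecutive windows, or into random
    batches when a NumPy random generator is supplied."""
    order = np.arange(m) if rng is None else rng.permutation(m)
    return [order[i:i + size] for i in range(0, m, size)]


rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))  # toy data: 200 items (rows), 5 columns

# Full computation: one 200 x 200 matrix of scores.
H = projection_scores(A)
cross = H - np.diag(np.diag(H))          # keep only the cross terms
importance = np.abs(cross).max(axis=1)   # hypothetical per-item summary

# Batched computation: each batch only needs a 50 x 50 matrix.
batches = index_batches(A.shape[0], 50, rng)
batch_scores = [(b, projection_scores(A[b])) for b in batches]
```

Scores computed within a batch only relate items of that batch to one another, which is the price of never forming the full matrix. Note also that the scores are only informative when a batch contains more rows than the matrix has columns; otherwise U Uᵀ is generically the identity and all cross terms vanish.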
- Published in: Biometrical Journal
- Type: Article
- Authors: Teschke, Sven; Ickstadt, Katja; Munteanu, Alexander
- Year: 2024
Citation information
Teschke, Sven; Ickstadt, Katja; Munteanu, Alexander: Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores. Biometrical Journal 66(8), e70014, December 2024. https://onlinelibrary.wiley.com/doi/10.1002/bimj.70014
@Article{Teschke.etal.2024a,
author={Teschke, Sven and Ickstadt, Katja and Munteanu, Alexander},
title={Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores},
journal={Biometrical Journal},
volume={66},
number={8},
pages={e70014},
month={December},
url={https://onlinelibrary.wiley.com/doi/10.1002/bimj.70014},
year={2024},
abstract={We develop a variable selection method for interactions in regression models on large data in the context of genetics. The method is intended for investigating the influence of single-nucleotide polymorphisms ({SNPs}) and their interactions on health outcomes, which is a $p \gg n$ problem. We introduce cross leverage scores ({CLSs}) to detect interactions of variables while maintaining...}}