Hi-LASSO in Python¶
This library provides Hi-LASSO (High-Dimensional LASSO).
What is Hi-LASSO?¶
Hi-LASSO (High-Dimensional LASSO) theoretically improves on the LASSO model, providing better prediction and feature-selection performance on extremely high-dimensional data. Hi-LASSO alleviates the bias introduced by bootstrapping, refines importance scores, improves performance by taking advantage of the global oracle property, provides a statistical strategy for determining the number of bootstrap iterations, and allows tests of significance for feature selection with an appropriate distribution. Hi-LASSO uses Python's multiprocessing pool to parallelize the bootstrapping and reduce the model's running time.
Installation guide¶
Dependencies¶
Hi-LASSO supports Python 3.6+. Additionally, you will need numpy, scipy, tqdm, and glmnet. However, these packages are installed automatically when installing this codebase.
Installing Hi-LASSO¶
Hi-LASSO
is available through PyPI and can easily be installed with a
pip install:
pip install hi_lasso
The PyPI version is updated regularly; however, for the latest changes you should clone the repository from GitHub and install it directly:
git clone https://github.com/datax-lab/Hi-LASSO.git
cd Hi-LASSO
python setup.py install
Installation error¶
If installation of the glmnet package fails, you can try the following solutions.
error: extension '_glmnet' has Fortran sources but no Fortran compiler found
You should install anaconda3 and then install the fortran-compiler package via conda.
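If a conda environment is available, one way to get a Fortran compiler is the fortran-compiler metapackage from conda-forge (a suggested fix, not the only route):

```shell
# Install a Fortran compiler into the active conda environment,
# then retry the glmnet / hi_lasso installation.
conda install -c conda-forge fortran-compiler
pip install hi_lasso
```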
error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
You need to install Microsoft Visual C++ 14.0.
API Reference¶
Hi-LASSO model¶
class hi_lasso.hi_lasso.HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, n_jobs=1)[source]¶
Bases: object
Hi-LASSO (High-Dimensional LASSO) improves the LASSO solutions for extremely high-dimensional data.
The main contributions of Hi-LASSO are as follows:
- Rectifying systematic bias introduced by bootstrapping.
- Refining the computation for importance scores.
- Providing a statistical strategy to determine the number of bootstrapping.
- Taking advantage of global oracle property.
- Allowing tests of significance for feature selection with appropriate distribution.
Parameters: - q1 ('auto' or int, optional [default='auto']) – The number of predictors to randomly select in Procedure 1. If set to 'auto', the number of samples is used as q1.
- q2 ('auto' or int, optional [default='auto']) – The number of predictors to randomly select in Procedure 2. If set to 'auto', the number of samples is used as q2.
- L (int [default=30]) – The expected number of times, at minimum, that each predictor is selected across the bootstrap samples.
- alpha (float [default=0.05]) – Significance level used in the significance test for feature selection.
- logistic (bool [default=False]) – Whether to apply a logistic regression model. For classification problems, Hi-LASSO can use logistic regression.
- random_state (int or None, optional [default=None]) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.default_rng.
- n_jobs (None or int, optional [default=1]) – The number of jobs to run in parallel. If n_jobs is None or 0, the number of CPU cores returned by multiprocessing.cpu_count() is used, parallelizing across all available cores.
Variables: - n (int) – number of samples.
- p (int) – number of predictors.
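The role of L can be made concrete. If each bootstrap draws q of the p predictors, a predictor appears in a given bootstrap with probability q/p, so B bootstrap samples select it about B·q/p times; requiring this to be at least L gives B = ceil(L·p/q). A minimal sketch of that relation (the exact bootstrap-count rule inside the package may differ, and the n=50 figure is an assumption about the simulation data used below):

```python
import math

def n_bootstraps(L, p, q):
    """Bootstrap count needed so each predictor is expected
    to be selected at least L times (B * q / p >= L)."""
    return math.ceil(L * p / q)

# Assuming p=100 predictors and q='auto' -> q=n=50 samples,
# L=30 yields the 60 iterations shown in the progress bars
# of the Getting Started section.
print(n_bootstraps(30, 100, 50))  # → 60
```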
Examples
>>> from hi_lasso.hi_lasso import HiLasso
>>> model = HiLasso(q1='auto', q2='auto', L=30, logistic=False, random_state=None, n_jobs=1)
>>> model.fit(X, y, sample_weight=None)
>>> model.coef_
>>> model.intercept_
>>> model.p_values_
fit(X, y, sample_weight=None)[source]¶
Fit the model with Procedure 1 and Procedure 2.
Procedure 1: Compute importance scores for predictors.
Procedure 2: Compute coefficients and select variables.
Parameters: - X (array-like of shape (n_samples, n_predictors)) – Predictor variables.
- y (array-like of shape (n_samples,)) – Response variable.
- sample_weight (array-like of shape (n_samples,), default=None) – Optional weight vector for observations. If None, samples are equally weighted.
Variables: - coef (array) – Coefficients of Hi-LASSO.
- p_values (array) – P-values of each coefficient.
- intercept (float) – Intercept of Hi-LASSO.
Returns: self
Return type: object
Utilities for Hi-LASSO¶
Getting Started¶
Data load¶
[1]:
import pandas as pd
X = pd.read_csv('https://raw.githubusercontent.com/datax-lab/Hi-LASSO/master/simulation_data/X.csv')
y = pd.read_csv('https://raw.githubusercontent.com/datax-lab/Hi-LASSO/master/simulation_data/y.csv')
[2]:
X.head()
[2]:
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V91 | V92 | V93 | V94 | V95 | V96 | V97 | V98 | V99 | V100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.348426 | 0.517518 | -0.152922 | -0.529037 | -0.761765 | -0.521656 | -0.283716 | 0.490174 | 0.260491 | -0.273269 | ... | -1.594468 | -2.296667 | 0.688847 | -0.884829 | 0.147586 | 0.259268 | -1.594399 | -1.482207 | 1.597720 | 1.102619 |
1 | 0.585253 | 0.468142 | 0.270562 | -0.697588 | -0.073801 | -0.487004 | -0.282624 | 0.076316 | 0.220236 | -0.703602 | ... | -1.391808 | -0.824046 | 0.727239 | -0.046264 | -1.482356 | -1.080878 | 0.451564 | -0.785846 | 2.037206 | 0.526190 |
2 | -0.301555 | -0.901268 | -0.486487 | -0.221278 | -0.374391 | 0.080040 | -2.082875 | -1.964580 | -2.079053 | -1.966530 | ... | 0.455793 | 2.248457 | -1.362962 | 1.233083 | -0.685365 | 0.315421 | -0.695921 | 0.099969 | 1.589188 | 0.328288 |
3 | -0.534475 | -0.339979 | -0.189170 | 1.250699 | 1.147584 | 0.574485 | 1.078780 | 0.024339 | 1.112345 | 1.119577 | ... | 0.105867 | 0.106110 | 1.911083 | 0.063714 | 0.054433 | -0.293569 | -2.066818 | -0.376513 | 0.734963 | -0.206481 |
4 | 0.097464 | 0.575279 | 0.353189 | -0.816164 | -0.769933 | -0.224066 | 0.368319 | 0.407449 | 1.145179 | 0.726215 | ... | -0.132294 | 0.492144 | 0.991359 | -1.561440 | 0.063365 | -0.789509 | 1.067706 | 0.440007 | 1.675415 | 0.922870 |
5 rows × 100 columns
[3]:
y.head()
[3]:
| | V1 |
|---|---|
0 | 1.233550 |
1 | 6.753310 |
2 | -1.866632 |
3 | 1.954930 |
4 | 1.194514 |
General Usage¶
[4]:
from hi_lasso.hi_lasso import HiLasso
hilasso = HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, parallel=False, n_jobs=None)
hilasso.fit(X, y, sample_weight=None)
Procedure 1
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [01:02<00:00, 1.03s/it]
Procedure 2
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:03<00:00, 18.40it/s]
[4]:
<hi_lasso.hi_lasso.HiLasso at 0x1f41e360f48>
[5]:
hilasso.coef_
[5]:
array([ 0.46210394, 1.24552088, 0. , 0.90706553, 0. ,
0. , 2.23579843, 0.30876645, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0.31372601, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0.59941971, 0. , 0. , 0. , 1.04376768,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.41696402, 0. ,
0. , -0.32200276, -0.36038398, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , -0.44779802, 0. , 0. , 0. ,
0. , 0.37710895, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ])
[6]:
hilasso.p_values_
[6]:
array([5.08869776e-03, 2.11419331e-18, 9.99720318e-01, 1.24542123e-20,
9.22045686e-01, 1.00000000e+00, 7.40877404e-29, 3.92122793e-02,
9.98979737e-01, 1.00000000e+00, 1.00000000e+00, 9.99720318e-01,
9.99998519e-01, 1.00000000e+00, 9.99720318e-01, 1.41450530e-04,
6.93387034e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
9.99999995e-01, 1.62235113e-05, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.71842539e-19, 9.98979737e-01, 1.00000000e+00,
9.99999995e-01, 9.99999872e-01, 1.00000000e+00, 9.99720318e-01,
9.99998519e-01, 1.00000000e+00, 9.99999872e-01, 1.00000000e+00,
9.99999995e-01, 1.00000000e+00, 9.99999995e-01, 6.93387034e-01,
9.99988764e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
9.99999995e-01, 6.86046534e-02, 1.00000000e+00, 9.99999995e-01,
1.00000000e+00, 9.99998519e-01, 2.57446165e-01, 1.00000000e+00,
9.99988764e-01, 9.99999995e-01, 9.99999872e-01, 9.99999995e-01,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 5.08869776e-03,
9.58646715e-01, 9.99936916e-01, 4.99632744e-06, 1.62235113e-05,
8.66299610e-01, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
9.99999872e-01, 1.00000000e+00, 9.99999872e-01, 2.57446165e-01,
5.20130241e-09, 9.99936916e-01, 1.00000000e+00, 9.99999872e-01,
6.93387034e-01, 2.34773777e-08, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 6.86046534e-02, 9.99999995e-01,
9.99720318e-01, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
1.00000000e+00, 1.00000000e+00, 1.13056169e-01, 1.75622359e-01,
9.22045686e-01, 6.93387034e-01, 9.99936916e-01, 1.00000000e+00])
[7]:
hilasso.intercept_
[7]:
0.6223126908082419
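The fitted attributes above can be combined to pick out the significant predictors. A simple sketch (select_features is a hypothetical helper, not part of the hi_lasso API) that keeps the indices whose p-value falls below the significance level and whose coefficient is nonzero:

```python
import numpy as np

def select_features(coef, p_values, alpha=0.05):
    """Indices of predictors that are nonzero and significant."""
    coef = np.asarray(coef)
    p_values = np.asarray(p_values)
    return np.where((p_values < alpha) & (coef != 0))[0]

# Toy values mirroring the first few entries of coef_ / p_values_ above.
coef = np.array([0.46210394, 1.24552088, 0.0, 0.90706553])
p_values = np.array([5.08869776e-03, 2.11419331e-18,
                     9.99720318e-01, 1.24542123e-20])
print(select_features(coef, p_values))  # → [0 1 3]
```

On the fitted model, the same call would be select_features(hilasso.coef_, hilasso.p_values_, alpha=0.05).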
Parallel Processing Usage¶
If parallel is set to True, parallel processing is used for the bootstrapping. You can specify n_jobs (the number of cores) to use for parallel processing.
[8]:
from hi_lasso.hi_lasso import HiLasso
hilasso_parallel = HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, parallel=True, n_jobs=None)
hilasso_parallel.fit(X, y, sample_weight=None)
Procedure 1
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:10<00:00, 5.73it/s]
Procedure 2
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:01<00:00, 31.52it/s]
[8]:
<hi_lasso.hi_lasso.HiLasso at 0x1f43d82b8c8>
[9]:
hilasso_parallel.coef_
[9]:
array([ 0. , 1.51419872, 0. , 0.87322631, 0. ,
0. , 2.23200103, 0.4323608 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0.2099277 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0.27517399, 0. , 0. , 0. , 0.9882923 ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.19367698, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.47607576, -0.27887515, 0. , 0. , 0. ,
0.42069865, 0.40714104, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ])
[11]:
hilasso_parallel.p_values_
[11]:
array([9.80448765e-01, 1.57407956e-27, 9.99958412e-01, 9.94516205e-25,
9.99143705e-01, 1.00000000e+00, 1.91411057e-32, 2.61639765e-09,
6.77919624e-01, 9.99993965e-01, 1.00000000e+00, 9.99143705e-01,
9.99787731e-01, 9.99999973e-01, 9.99787731e-01, 8.31122195e-03,
9.97154801e-01, 1.00000000e+00, 9.99993965e-01, 9.99999973e-01,
9.99999973e-01, 9.99787731e-01, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.66201145e-03, 1.00000000e+00, 1.00000000e+00,
9.99999424e-01, 9.94516205e-25, 9.57991693e-01, 1.00000000e+00,
9.99999973e-01, 9.99999973e-01, 1.00000000e+00, 9.99143705e-01,
1.00000000e+00, 9.99999973e-01, 9.99958412e-01, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 3.24181317e-02,
1.00000000e+00, 1.00000000e+00, 9.99999973e-01, 9.99999973e-01,
1.00000000e+00, 9.57991693e-01, 9.99993965e-01, 9.99999424e-01,
1.00000000e+00, 9.99999973e-01, 4.45406012e-01, 1.00000000e+00,
9.99999973e-01, 1.00000000e+00, 9.99787731e-01, 9.99999424e-01,
1.00000000e+00, 1.00000000e+00, 9.97154801e-01, 8.60318467e-01,
9.91985777e-01, 8.60318467e-01, 1.57267909e-01, 2.35833514e-01,
7.79033117e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 9.99999424e-01, 9.96648055e-06,
3.14407728e-05, 1.00000000e+00, 1.00000000e+00, 9.99993965e-01,
2.61639765e-09, 9.31257597e-05, 9.99999973e-01, 9.99999424e-01,
9.99999424e-01, 9.99993965e-01, 9.99787731e-01, 1.00000000e+00,
9.99999424e-01, 1.00000000e+00, 1.00000000e+00, 9.99999424e-01,
9.99999424e-01, 1.00000000e+00, 1.57267909e-01, 1.57267909e-01,
9.99787731e-01, 9.91985777e-01, 1.00000000e+00, 1.00000000e+00])
[13]:
hilasso_parallel.intercept_
[13]:
0.6503488998795035
Credit¶
Hi-LASSO was primarily developed by Dr. Youngsoon Kim, with significant contributions and suggestions by Dr. Joongyang Park, Dr. Mingon Kang, and many others. The Python package was developed by Jongkwon Jo. Initial supervision for the project was provided by Dr. Mingon Kang.
Development of Hi-LASSO is carried out in the DataX lab at University of Nevada, Las Vegas (UNLV).
If you use Hi-LASSO in your research, generally it is appropriate to cite the following paper: Y. Kim, J. Hao, T. Mallavarapu, J. Park and M. Kang, “Hi-LASSO: High-Dimensional LASSO,” in IEEE Access, vol. 7, pp. 44562-44573, 2019, doi: 10.1109/ACCESS.2019.2909071.
Reference¶
Friedman, J., Hastie, T., and Tibshirani, R. "glmnet: Lasso and elastic-net regularized generalized linear models." R package version 1.4 (2009): 1-24.
Harris, C.R., Millman, K.J., van der Walt, S.J., et al. "Array programming with NumPy." Nature 585, 357–362 (2020). doi: 10.1038/s41586-020-2649-2.
Virtanen, P., Gommers, R., Oliphant, T.E., et al. "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python." Nature Methods 17(3), 261-272 (2020).