Hi-LASSO in Python

This library provides Hi-LASSO (High-Dimensional LASSO).

What is Hi-LASSO?

Hi-LASSO (High-Dimensional LASSO) theoretically improves on the LASSO model, providing better performance in both prediction and feature selection on extremely high-dimensional data. Hi-LASSO alleviates the bias introduced by bootstrapping, refines importance scores, improves performance by taking advantage of the global oracle property, provides a statistical strategy to determine the number of bootstrap samples, and allows tests of significance for feature selection with an appropriate distribution. Hi-LASSO uses Python's multiprocessing pool to parallelize the bootstrapping and reduce the model's running time.

Installation guide

Dependencies

Hi-LASSO supports Python 3.6+. Additionally, you will need numpy, scipy, tqdm, and glmnet; these packages should be installed automatically when installing this codebase.

Installing Hi-LASSO

Hi-LASSO is available through PyPI and can easily be installed with a pip install:

pip install hi_lasso

The PyPI version is updated regularly; however, for the latest updates, you should clone the repository from GitHub and install it directly:

git clone https://github.com/datax-lab/Hi-LASSO.git
cd Hi-LASSO
python setup.py install

Installation error

If installation of the glmnet package fails, you can try the following solutions.

error: extension '_glmnet' has Fortran sources but no Fortran compiler found

You should install Anaconda3 and then install a Fortran compiler through conda, for example:

conda install -c conda-forge fortran-compiler

error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/

You need to install Microsoft Visual C++ 14.0.

API Reference

Hi-LASSO model

class hi_lasso.hi_lasso.HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, n_jobs=1)[source]

Bases: object

Hi-LASSO (High-Dimensional LASSO) improves the LASSO solutions for extremely high-dimensional data.

The main contributions of Hi-LASSO are as follows:

  • Rectifying systematic bias introduced by bootstrapping.
  • Refining the computation for importance scores.
  • Providing a statistical strategy to determine the number of bootstrapping.
  • Taking advantage of global oracle property.
  • Allowing tests of significance for feature selection with appropriate distribution.
Parameters:
  • q1 ('auto' or int, optional [default='auto']) – The number of predictors to randomly select in Procedure 1. If 'auto', the number of samples is used as q1.
  • q2 ('auto' or int, optional [default='auto']) – The number of predictors to randomly select in Procedure 2. If 'auto', the number of samples is used as q2.
  • L (int [default=30]) – The expected number of times each predictor is selected across the bootstrap samples; this determines the number of bootstrap iterations.
  • alpha (float [default=0.05]) – Significance level used in the significance test for feature selection.
  • logistic (Boolean [default=False]) – Whether to apply the logistic regression model; for classification problems, Hi-LASSO can use logistic regression.
  • random_state (int or None, optional [default=None]) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.default_rng.
  • n_jobs (None or int, optional [default=1]) – The number of jobs to run in parallel. If n_jobs is None or 0, the number of CPU cores returned by multiprocessing.cpu_count() is used, parallelizing across all available cores.
Variables:
  • n (int) – number of samples.
  • p (int) – number of predictors.
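
The L parameter and the pool sizes q1/q2 together determine how many bootstrap samples are drawn. A minimal sketch, under the assumption (per the Hi-LASSO paper's design) that the bootstrap count B is chosen so that each of the p predictors is selected about L times in expectation when q predictors are drawn per bootstrap:

```python
import math

# Hedged sketch: each bootstrap draws q of the p predictors, so across B
# bootstraps a predictor is selected about B * q / p times in expectation.
# Setting that expectation equal to L gives B = ceil(L * p / q).
def n_bootstraps(L, p, q):
    return math.ceil(L * p / q)

# e.g. p = 100 predictors and q = 50 samples (q1 = 'auto' with n = 50):
print(n_bootstraps(30, 100, 50))  # -> 60
```

This matches the 60 iterations shown in the progress bars of the Getting Started runs on the 100-predictor simulation data.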

Examples

>>> from hi_lasso.hi_lasso import HiLasso
>>> model = HiLasso(q1='auto', q2='auto', L=30, logistic=False, random_state=None, n_jobs=1)
>>> model.fit(X, y, sample_weight=None)
>>> model.coef_
>>> model.intercept_
>>> model.p_values_
fit(X, y, sample_weight=None)[source]

Fit the model with Procedure 1 and Procedure 2.

Procedure 1: Compute importance scores for predictors.

Procedure 2: Compute coefficients and select variables.

Parameters:
  • X (array-like of shape (n_samples, n_predictors)) – predictor variables
  • y (array-like of shape (n_samples,)) – response variables
  • sample_weight (array-like of shape (n_samples,), default=None) – Optional weight vector for observations. If None, then samples are equally weighted.
Variables:
  • coef (array) – Coefficients of Hi-LASSO.
  • p_values (array) – P-values for each coefficient.
  • intercept (float) – Intercept of Hi-LASSO.
Returns:

self

Return type:

object

Utilities for Hi-LASSO

hi_lasso.util.standardization(X, y)[source]

The response is mean-corrected and the predictors are standardized.

Parameters:
  • X (array-like of shape (n_samples, n_predictors)) – predictor
  • y (array-like of shape (n_samples,)) – response
Returns:

scaled_X, scaled_y, sd_X

Return type:

np.ndarray
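
As a rough illustration, a hypothetical re-implementation of this behavior (a sketch, not the library's actual source) looks like:

```python
import numpy as np

# Hypothetical sketch: mean-correct the response and standardize the
# predictors, also returning the predictor standard deviations so that
# coefficients can later be rescaled to the original units.
def standardization(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    sd_X = X.std(axis=0)                     # per-predictor standard deviation
    scaled_X = (X - X.mean(axis=0)) / sd_X   # zero-mean, unit-variance columns
    scaled_y = y - y.mean()                  # mean-corrected response
    return scaled_X, scaled_y, sd_X

X = np.array([[1.0, 2.0], [3.0, 8.0], [5.0, 5.0]])
y = np.array([1.0, 2.0, 6.0])
scaled_X, scaled_y, sd_X = standardization(X, y)
```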

Getting Started

Data load

[1]:
import pandas as pd
X = pd.read_csv('https://raw.githubusercontent.com/datax-lab/Hi-LASSO/master/simulation_data/X.csv')
y = pd.read_csv('https://raw.githubusercontent.com/datax-lab/Hi-LASSO/master/simulation_data/y.csv')
[2]:
X.head()
[2]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V91 V92 V93 V94 V95 V96 V97 V98 V99 V100
0 0.348426 0.517518 -0.152922 -0.529037 -0.761765 -0.521656 -0.283716 0.490174 0.260491 -0.273269 ... -1.594468 -2.296667 0.688847 -0.884829 0.147586 0.259268 -1.594399 -1.482207 1.597720 1.102619
1 0.585253 0.468142 0.270562 -0.697588 -0.073801 -0.487004 -0.282624 0.076316 0.220236 -0.703602 ... -1.391808 -0.824046 0.727239 -0.046264 -1.482356 -1.080878 0.451564 -0.785846 2.037206 0.526190
2 -0.301555 -0.901268 -0.486487 -0.221278 -0.374391 0.080040 -2.082875 -1.964580 -2.079053 -1.966530 ... 0.455793 2.248457 -1.362962 1.233083 -0.685365 0.315421 -0.695921 0.099969 1.589188 0.328288
3 -0.534475 -0.339979 -0.189170 1.250699 1.147584 0.574485 1.078780 0.024339 1.112345 1.119577 ... 0.105867 0.106110 1.911083 0.063714 0.054433 -0.293569 -2.066818 -0.376513 0.734963 -0.206481
4 0.097464 0.575279 0.353189 -0.816164 -0.769933 -0.224066 0.368319 0.407449 1.145179 0.726215 ... -0.132294 0.492144 0.991359 -1.561440 0.063365 -0.789509 1.067706 0.440007 1.675415 0.922870

5 rows × 100 columns

[3]:
y.head()
[3]:
V1
0 1.233550
1 6.753310
2 -1.866632
3 1.954930
4 1.194514

General Usage

[4]:
from hi_lasso.hi_lasso import HiLasso
hilasso = HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, parallel=False, n_jobs=None)
hilasso.fit(X, y, sample_weight=None)
  0%|                                                                                         | 0/60 [00:00<?, ?it/s]
Procedure 1
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [01:02<00:00,  1.03s/it]
  3%|██▋                                                                              | 2/60 [00:00<00:03, 16.99it/s]
Procedure 2
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:03<00:00, 18.40it/s]
[4]:
<hi_lasso.hi_lasso.HiLasso at 0x1f41e360f48>
[5]:
hilasso.coef_
[5]:
array([ 0.46210394,  1.24552088,  0.        ,  0.90706553,  0.        ,
        0.        ,  2.23579843,  0.30876645,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.31372601,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.59941971,  0.        ,  0.        ,  0.        ,  1.04376768,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.41696402,  0.        ,
        0.        , -0.32200276, -0.36038398,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        , -0.44779802,  0.        ,  0.        ,  0.        ,
        0.        ,  0.37710895,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])
[6]:
hilasso.p_values_
[6]:
array([5.08869776e-03, 2.11419331e-18, 9.99720318e-01, 1.24542123e-20,
       9.22045686e-01, 1.00000000e+00, 7.40877404e-29, 3.92122793e-02,
       9.98979737e-01, 1.00000000e+00, 1.00000000e+00, 9.99720318e-01,
       9.99998519e-01, 1.00000000e+00, 9.99720318e-01, 1.41450530e-04,
       6.93387034e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
       9.99999995e-01, 1.62235113e-05, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.71842539e-19, 9.98979737e-01, 1.00000000e+00,
       9.99999995e-01, 9.99999872e-01, 1.00000000e+00, 9.99720318e-01,
       9.99998519e-01, 1.00000000e+00, 9.99999872e-01, 1.00000000e+00,
       9.99999995e-01, 1.00000000e+00, 9.99999995e-01, 6.93387034e-01,
       9.99988764e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       9.99999995e-01, 6.86046534e-02, 1.00000000e+00, 9.99999995e-01,
       1.00000000e+00, 9.99998519e-01, 2.57446165e-01, 1.00000000e+00,
       9.99988764e-01, 9.99999995e-01, 9.99999872e-01, 9.99999995e-01,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 5.08869776e-03,
       9.58646715e-01, 9.99936916e-01, 4.99632744e-06, 1.62235113e-05,
       8.66299610e-01, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
       9.99999872e-01, 1.00000000e+00, 9.99999872e-01, 2.57446165e-01,
       5.20130241e-09, 9.99936916e-01, 1.00000000e+00, 9.99999872e-01,
       6.93387034e-01, 2.34773777e-08, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 6.86046534e-02, 9.99999995e-01,
       9.99720318e-01, 1.00000000e+00, 1.00000000e+00, 9.99999995e-01,
       1.00000000e+00, 1.00000000e+00, 1.13056169e-01, 1.75622359e-01,
       9.22045686e-01, 6.93387034e-01, 9.99936916e-01, 1.00000000e+00])
[7]:
hilasso.intercept_
[7]:
0.6223126908082419
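
The fitted attributes can be combined to list the features Hi-LASSO selected. A small sketch using numpy; the coef and p_values arrays here are illustrative stand-ins for hilasso.coef_ and hilasso.p_values_:

```python
import numpy as np

# Illustrative stand-ins for hilasso.coef_ and hilasso.p_values_:
coef = np.array([0.462, 1.246, 0.0, 0.907, 0.0])
p_values = np.array([5.1e-3, 2.1e-18, 9.99e-1, 1.2e-20, 9.2e-1])

# Indices of features with a nonzero coefficient that also pass the
# significance test at alpha = 0.05:
selected = np.where((coef != 0) & (p_values < 0.05))[0]
print(selected)  # -> [0 1 3]
```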

Parallel Processing Usage

If parallel is set to True, bootstrapping is performed with parallel processing.

You can specify n_jobs (the number of cores) to use for parallel processing.
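
With n_jobs=None, the core count comes from the standard library, i.e.:

```python
import multiprocessing

# Number of CPU cores used when n_jobs is None (automatic parallelization
# across all available cores):
n_cores = multiprocessing.cpu_count()
print(n_cores)
```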

[8]:
from hi_lasso.hi_lasso import HiLasso
hilasso_parallel = HiLasso(q1='auto', q2='auto', L=30, alpha=0.05, logistic=False, random_state=None, parallel=True, n_jobs=None)
hilasso_parallel.fit(X, y, sample_weight=None)
  0%|                                                                                         | 0/60 [00:00<?, ?it/s]
Procedure 1
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:10<00:00,  5.73it/s]
  0%|                                                                                         | 0/60 [00:00<?, ?it/s]
Procedure 2
100%|████████████████████████████████████████████████████████████████████████████████| 60/60 [00:01<00:00, 31.52it/s]
[8]:
<hi_lasso.hi_lasso.HiLasso at 0x1f43d82b8c8>
[9]:
hilasso_parallel.coef_
[9]:
array([ 0.        ,  1.51419872,  0.        ,  0.87322631,  0.        ,
        0.        ,  2.23200103,  0.4323608 ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.2099277 ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.27517399,  0.        ,  0.        ,  0.        ,  0.9882923 ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.19367698,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.47607576, -0.27887515,  0.        ,  0.        ,  0.        ,
        0.42069865,  0.40714104,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])
[11]:
hilasso_parallel.p_values_
[11]:
array([9.80448765e-01, 1.57407956e-27, 9.99958412e-01, 9.94516205e-25,
       9.99143705e-01, 1.00000000e+00, 1.91411057e-32, 2.61639765e-09,
       6.77919624e-01, 9.99993965e-01, 1.00000000e+00, 9.99143705e-01,
       9.99787731e-01, 9.99999973e-01, 9.99787731e-01, 8.31122195e-03,
       9.97154801e-01, 1.00000000e+00, 9.99993965e-01, 9.99999973e-01,
       9.99999973e-01, 9.99787731e-01, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.66201145e-03, 1.00000000e+00, 1.00000000e+00,
       9.99999424e-01, 9.94516205e-25, 9.57991693e-01, 1.00000000e+00,
       9.99999973e-01, 9.99999973e-01, 1.00000000e+00, 9.99143705e-01,
       1.00000000e+00, 9.99999973e-01, 9.99958412e-01, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 3.24181317e-02,
       1.00000000e+00, 1.00000000e+00, 9.99999973e-01, 9.99999973e-01,
       1.00000000e+00, 9.57991693e-01, 9.99993965e-01, 9.99999424e-01,
       1.00000000e+00, 9.99999973e-01, 4.45406012e-01, 1.00000000e+00,
       9.99999973e-01, 1.00000000e+00, 9.99787731e-01, 9.99999424e-01,
       1.00000000e+00, 1.00000000e+00, 9.97154801e-01, 8.60318467e-01,
       9.91985777e-01, 8.60318467e-01, 1.57267909e-01, 2.35833514e-01,
       7.79033117e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 9.99999424e-01, 9.96648055e-06,
       3.14407728e-05, 1.00000000e+00, 1.00000000e+00, 9.99993965e-01,
       2.61639765e-09, 9.31257597e-05, 9.99999973e-01, 9.99999424e-01,
       9.99999424e-01, 9.99993965e-01, 9.99787731e-01, 1.00000000e+00,
       9.99999424e-01, 1.00000000e+00, 1.00000000e+00, 9.99999424e-01,
       9.99999424e-01, 1.00000000e+00, 1.57267909e-01, 1.57267909e-01,
       9.99787731e-01, 9.91985777e-01, 1.00000000e+00, 1.00000000e+00])
[13]:
hilasso_parallel.intercept_
[13]:
0.6503488998795035

Credit

Hi-LASSO was primarily developed by Dr. Youngsoon Kim, with significant contributions and suggestions by Dr. Joongyang Park, Dr. Mingon Kang, and many others. The Python package was developed by Jongkwon Jo. Initial supervision for the project was provided by Dr. Mingon Kang.

Development of Hi-LASSO is carried out in the DataX lab at University of Nevada, Las Vegas (UNLV).

If you use Hi-LASSO in your research, generally it is appropriate to cite the following paper: Y. Kim, J. Hao, T. Mallavarapu, J. Park and M. Kang, “Hi-LASSO: High-Dimensional LASSO,” in IEEE Access, vol. 7, pp. 44562-44573, 2019, doi: 10.1109/ACCESS.2019.2909071.

Reference

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. “glmnet: Lasso and elastic-net regularized generalized linear models.” R package version 1.4 (2009): 1-24.

Harris, C.R., Millman, K.J., van der Walt, S.J., et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.

Virtanen, P., Gommers, R., Oliphant, T.E., et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17(3), 261–272 (2020).