API Reference

Hi-LASSO_spark model

class hi_lasso_spark.Hi_LASSO_spark.HiLASSO_Spark(X, y, X_test='auto', y_test='auto', alpha=0.05, q1='auto', q2='auto', L=30, cv=5, node='auto', logistic=False)[source]

Bases: object

Hi-LASSO_Spark (High-Dimensional LASSO Spark) improves the LASSO solution for extremely high-dimensional data using PySpark.

PySpark is the Python API for Apache Spark. Apache Spark is a distributed framework for big-data analysis: at its core it is a computation engine that works with very large data sets by processing them in parallel and in batches.
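For context, a local SparkContext can be created as shown below before running the model; the application name and local master setting are purely illustrative, and the package may create or reuse its own context:

from pyspark import SparkConf, SparkContext

# Illustrative local setup; HiLASSO_Spark may manage its own SparkContext.
conf = SparkConf().setAppName("hi-lasso-spark").setMaster("local[4]")
sc = SparkContext.getOrCreate(conf)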

The main contributions of Hi-LASSO are as follows:

  • Rectifying systematic bias introduced by bootstrapping.
  • Refining the computation for importance scores.
  • Providing a statistical strategy to determine the number of bootstrap samples.
  • Taking advantage of the global oracle property.
  • Allowing tests of significance for feature selection with appropriate distribution.
Parameters:
  • X (array-like of shape (n_sample, n_feature)) – predictor variables
  • y (array-like of shape (n_sample,)) – response variables
  • q1 ('auto' or int, optional [default = 'auto']) – The number of predictors to randomly select in Procedure 1. When set to 'auto', q1 is set to the number of samples.
  • q2 ('auto' or int, optional [default = 'auto']) – The number of predictors to randomly select in Procedure 2. When set to 'auto', q2 is set to the number of samples.
  • L (int [default=30]) – The expected minimum number of times each predictor is selected across the bootstrap samples.
  • alpha (float [default=0.05]) – Significance level used for the statistical test in feature selection.
  • node ('auto' or int, optional [default = 'auto']) – The number of nodes that run the application code in the cluster. When not specified, 8 nodes are used by default.
Variables:
  • n (int) – number of samples.
  • p (int) – number of features.

Examples

>>> from hi_lasso_spark.Hi_LASSO_spark import HiLASSO_Spark
>>> model = HiLASSO_Spark(X, y)
>>> model.fit()
>>> model.coef_
>>> model.p_values_
>>> model.selected_var_
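The constructor parameters above can also be passed explicitly. The call below simply restates the defaults from the signature and is meant as an illustration, not a recommended configuration:

>>> model = HiLASSO_Spark(X, y, alpha=0.05, q1='auto', q2='auto',
...                       L=30, cv=5, node='auto', logistic=False)
>>> model.fit()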
Calculate_p_value(coef_result)[source]

Compute the p-value of each predictor for the statistical test of variable selection.
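The exact form of this test is not spelled out here. The snippet below is a hypothetical sketch that assumes a one-sided binomial test on how often each predictor receives a nonzero coefficient across the bootstrap samples; the actual implementation may differ.

import numpy as np
from scipy.stats import binom

def p_values_sketch(coef_result):
    # coef_result: assumed (n_bootstraps, n_features) array of bootstrap coefficients.
    selected = coef_result != 0                 # nonzero coefficient = selected
    b = selected.shape[0]                       # number of bootstrap samples
    counts = selected.sum(axis=0)               # selection count per predictor
    pi_null = counts.sum() / selected.size      # global null selection probability
    # One-sided tail probability P[X >= count] under Binomial(b, pi_null)
    return binom.sf(counts - 1, b, pi_null)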

Estimate_coefficient_Adaptive(value)[source]

Estimation of coefficients for each bootstrap sample using Adaptive_LASSO

Returns: coef_result
Return type: coefficients for Adaptive_LASSO
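As a rough illustration of the Adaptive_LASSO idea (not the package's actual routine), per-predictor weights can be folded into an ordinary LASSO by rescaling the columns. The weights argument here is an assumption, standing in for the Procedure 1 importance scores:

import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso_sketch(X, y, weights, alpha=1.0):
    # weights: assumed per-predictor adaptive weights (e.g. importance scores).
    X_weighted = X * weights                        # rescale columns by the weights
    model = Lasso(alpha=alpha, fit_intercept=False)
    model.fit(X_weighted, y)
    return model.coef_ * weights                    # map coefficients back to the original scale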
Estimate_coefficient_Adaptive_logistic(value)[source]

Estimation of coefficients for each bootstrap sample using Adaptive_LASSO for logistic regression.

Returns: coef_result
Return type: coefficients for Adaptive_LASSO
Estimate_coefficient_Elastic(value)[source]

Estimation of coefficients for each bootstrap sample using Elastic_net

Returns: coef_result
Return type: coefficients for Elastic_Net
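For intuition, a single Procedure 1 step can be sketched as fitting an Elastic Net on a bootstrap sample restricted to q1 randomly chosen predictors. The hyperparameter values and helper name below are illustrative assumptions, not the package's internals:

import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net_bootstrap_sketch(X, y, q1, alpha=1.0, l1_ratio=0.5, seed=None):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows = rng.integers(0, n, size=n)               # bootstrap the samples
    cols = rng.choice(p, size=q1, replace=False)    # randomly select q1 predictors
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X[rows][:, cols], y[rows])
    coef = np.zeros(p)
    coef[cols] = model.coef_                        # embed back into the full coefficient vector
    return coef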
fit()[source]

Fit the model with Procedure 1 and Procedure 2.

Procedure 1: Compute importance scores for predictors.

Procedure 2: Compute coefficients and select variables.

Parallel processing in Spark

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

Variables:
  • sc.parallelize() (method) – The SparkContext's parallelize() method creates a parallelized collection (RDD). This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
  • map() (method) – map() is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD; the developer defines the custom logic, and the same logic is applied to every element of the RDD. A minimal sketch of this pattern appears at the end of this entry.
  • Procedure_1_coef (array) – Estimated coefficients by Elastic_net.
  • Procedure_2_coef (array) – Estimated coefficients by Adaptive_LASSO.
  • coef (array) – Estimated coefficients by Hi-LASSO.
  • p_values (array) – P-values of each coefficient.
  • selected_var (array) – Selected variables by significance test.
Returns: self
Return type: object
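A minimal sketch of the parallelize/map pattern used here, with a placeholder computation standing in for the per-bootstrap coefficient estimation that fit() actually performs:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
bootstrap_ids = list(range(100))                # one element per bootstrap sample (illustrative)
rdd = sc.parallelize(bootstrap_ids, 10)         # cut into 10 partitions, as in sc.parallelize(data, 10)
results = rdd.map(lambda b: b * 2).collect()    # map() applies the same logic to every element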

standardization(X, y)[source]

The response is mean-corrected and the predictors are standardized.

Parameters:
  • X (array-like of shape (n_sample, n_feature)) – predictor
  • y (array-like of shape (n_sample,)) – response
Returns: scaled_X, scaled_y, std
Return type: np.ndarray
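A plain-NumPy sketch of what standardization(X, y) describes; the exact scaling convention (for example, population versus sample standard deviation) is an assumption:

import numpy as np

def standardization_sketch(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    std = X.std(axis=0)                      # per-predictor standard deviation
    scaled_X = (X - X.mean(axis=0)) / std    # standardized predictors
    scaled_y = y - y.mean()                  # mean-corrected response
    return scaled_X, scaled_y, std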