iddn.parameter_tuning_iddn

The module for tuning hyperparameters in iDDN

We use parallel computing and several tricks to make parameter tuning feasible for larger data sets.

A better approach is to manually choose a set of lambda1 values and select the one that leads to a reasonable network. Using prior knowledge makes it more likely to obtain a usable network.

Functions

cv_2d(dat1, dat2, dep_mat[, n_cv, ratio_val, ...])

Cross validation by grid search lambda1 and lambda2

calculate_regression(data, topo_est[, cores, n_max])

Linear regression based on estimated network topology

_regression_one_node(data, topo_now, i, n_max[, clip_thr])

Perform linear regression for one node

Module Contents

iddn.parameter_tuning_iddn.cv_2d(dat1, dat2, dep_mat, n_cv=5, ratio_val=0.2, lambda1_lst=np.arange(0.05, 1.05, 0.05), lambda2_lst=np.arange(0.025, 0.525, 0.025), cores=8, n_max=100, iddn_method='resi')

Cross validation by grid search lambda1 and lambda2

To estimate the validation error, we first estimate the regression coefficients of each node on the training set, based on the estimated network topology. Then, for each node in the validation set, we use its neighbors to explain the signal in that node. The portion of unexplained signal across all nodes is defined as the validation error.
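The validation error described above can be sketched with plain NumPy. This is a minimal illustration, not the code used inside cv_2d: the function name validation_error_sketch is hypothetical, the data are assumed to be standardized (no intercept is fitted), and "portion of unexplained signal" is assumed to mean residual sum of squares divided by total sum of squares.

```python
import numpy as np

def validation_error_sketch(train, val, topo):
    """Fraction of validation signal not explained by each node's neighbors.

    Hypothetical helper: assumes standardized data and a (P, P) boolean topology.
    """
    n_feature = train.shape[1]
    resid_ss = 0.0
    total_ss = 0.0
    for i in range(n_feature):
        # Neighbors of node i according to the estimated topology
        nbr = np.flatnonzero(topo[i])
        nbr = nbr[nbr != i]
        if nbr.size == 0:
            # No neighbors: nothing is explained for this node
            pred = np.zeros(val.shape[0])
        else:
            # Fit coefficients on the training set only
            coef, *_ = np.linalg.lstsq(train[:, nbr], train[:, i], rcond=None)
            # Predict the validation signal from the neighbors
            pred = val[:, nbr] @ coef
        resid_ss += np.sum((val[:, i] - pred) ** 2)
        total_ss += np.sum(val[:, i] ** 2)
    return resid_ss / total_ss

# Toy data: node 1 follows node 0, node 2 is independent
rng = np.random.default_rng(0)
x0 = rng.standard_normal(200)
x1 = x0 + 0.1 * rng.standard_normal(200)
x2 = rng.standard_normal(200)
data = np.column_stack([x0, x1, x2])
train, val = data[:150], data[150:]

topo_edge = np.zeros((3, 3), dtype=bool)
topo_edge[0, 1] = topo_edge[1, 0] = True
topo_empty = np.zeros((3, 3), dtype=bool)

err_edge = validation_error_sketch(train, val, topo_edge)
err_empty = validation_error_sketch(train, val, topo_empty)
```

With an empty topology nothing is explained, so the error is 1. Adding the true edge between nodes 0 and 1 lowers it.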

Although this function supports 2D grid search of hyperparameters, we can also do 1D search. We simply need to provide a single value for lambda1_lst or lambda2_lst.

Let K be the number of CV repeats, L1 the number of lambda1 values, L2 the number of lambda2 values. N is the sample size, P is the feature number.
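The K repeats can be pictured as repeated random splits governed by n_cv and ratio_val. The sketch below is only an illustration under assumptions: the helper split_train_val is hypothetical, and the source does not state how iDDN draws the splits or whether both conditions share the same indices.

```python
import numpy as np

def split_train_val(n_sample, ratio_val, rng):
    """Randomly split sample indices into training and validation parts.

    Hypothetical helper, not part of iddn.
    """
    idx = rng.permutation(n_sample)
    n_val = int(round(n_sample * ratio_val))
    return idx[n_val:], idx[:n_val]

rng = np.random.default_rng(0)
n_cv = 5
splits = []
for _ in range(n_cv):
    # A fresh random split is drawn for each repeat
    train_idx, val_idx = split_train_val(100, 0.2, rng)
    splits.append((train_idx, val_idx))
```

Because each repeat draws a new split rather than partitioning the data into folds, the number of repeats is not tied to the sample size.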

Parameters:
  • dat1 ((N,P) array_like) – Data for condition 1

  • dat2 ((N,P) array_like) – Data for condition 2

  • dep_mat ((P, P) array_like) – Constraints (dependency) matrix of iDDN.

  • n_cv (int) – Number of repeats. Can be as large as you like, as we re-sample each time.

  • ratio_val (float) – Fraction of the data used for validation. The remainder is used for training.

  • lambda1_lst (array_like) – Values of lambda1 for searching

  • lambda2_lst (array_like) – Values of lambda2 for searching

  • cores (int) – Number of cores used in parallel computing. Should not exceed the number of cores in the computer.

  • n_max (int) – The maximum number of edges allowed for each node during parameter search. Because the regression step is time-consuming, limiting the number of edges speeds up the search. A sparse network is usually preferred anyway, so the limit has little influence on accuracy. Note that this limit applies only during parameter tuning; iDDN itself has no such limit.

  • iddn_method (str) – iDDN optimization method, either 'resi' or 'corr'.

Returns:

val_err – The validation error for each lambda1 and lambda2 combination.

Return type:

(K, L1, L2) array_like
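One common way to use a (K, L1, L2) error array is to average over the K repeats and pick the grid point with the lowest mean error. The sketch below uses a synthetic val_err as a stand-in (it is not output from cv_2d), and the selection rule is an assumption rather than something this module prescribes.

```python
import numpy as np

lambda1_lst = np.array([0.1, 0.2, 0.3])
lambda2_lst = np.array([0.05, 0.1])
n_cv = 5

# Synthetic stand-in for the (K, L1, L2) output: a bowl with its minimum
# at lambda1 = 0.2, lambda2 = 0.1, plus a little noise per repeat
rng = np.random.default_rng(0)
bowl = (lambda1_lst[:, None] - 0.2) ** 2 + (lambda2_lst[None, :] - 0.1) ** 2 + 0.5
val_err = bowl[None, :, :] + 1e-4 * rng.standard_normal((n_cv, 3, 2))

# Average over repeats, then locate the smallest mean error on the grid
mean_err = val_err.mean(axis=0)
i1, i2 = np.unravel_index(np.argmin(mean_err), mean_err.shape)
best_lambda1 = lambda1_lst[i1]
best_lambda2 = lambda2_lst[i2]
```

For a 1D search, one of the two lists has a single value, so the corresponding axis of val_err has length 1 and the same code applies unchanged.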

iddn.parameter_tuning_iddn.calculate_regression(data, topo_est, cores=8, n_max=100)

Linear regression based on estimated network topology

For each variable, use all its neighbors as predictors and find the regression coefficients. This is calculated for each condition. Let P be the number of features.

Parameters:
  • data ((N,P) array_like) – All data

  • topo_est ((P,P) array_like) – Estimated adjacency matrix

  • cores (int) – Number of cores used for parallel computing

  • n_max – The maximum number of edges allowed for each node during parameter search. We can use the same value for all nodes, or assign a different value to each node. See also the cv_2d function.

Returns:

g_asso – Regression coefficients

Return type:

(P,P) array_like

Examples

This is an example of the regression operation used in this function:

>>> x = np.array([[-1,-1,1,1.0], [1,1,-1,-1]]).T
>>> y = np.array([1,1,-1,-1.0])
>>> out = np.linalg.lstsq(x, y, rcond=None)
>>> out[0]

iddn.parameter_tuning_iddn._regression_one_node(data, topo_now, i, n_max, clip_thr=1.0)

Perform linear regression for one node

The predictors are its neighbors specified by topo_now.

Parameters:
  • data ((N,P) array_like) – Input data

  • topo_now – Binary mask indicating whether a node is a predictor.

  • i (int) – Index of current node (response variable)

  • n_max – The maximum number of edges allowed for each node during parameter search.

  • clip_thr (float or int) – Threshold for clipping overly large regression coefficients, which slightly reduces overfitting

Returns:

  • out ((P,) array_like) – Estimated regression coefficients

  • pred_idx (array_like) – Indices of the predictors, for debugging
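A rough sketch of a single-node regression with an edge cap and coefficient clipping follows. It is not the library's implementation: regression_one_node_sketch is a hypothetical name, the rule for choosing which predictors survive the n_max cap (here, simply the first n_max) is an assumption, and clipping is assumed to be symmetric around zero.

```python
import numpy as np

def regression_one_node_sketch(data, topo_now, i, n_max, clip_thr=1.0):
    """Regress node i on its neighbors, capped at n_max predictors.

    Hypothetical illustration; the real selection rule may differ.
    """
    # Candidate predictors are the nodes flagged in the binary mask
    pred_idx = np.flatnonzero(topo_now)
    # Exclude the response itself and keep at most n_max predictors (assumed rule)
    pred_idx = pred_idx[pred_idx != i][:n_max]
    out = np.zeros(data.shape[1])
    if pred_idx.size > 0:
        coef, *_ = np.linalg.lstsq(data[:, pred_idx], data[:, i], rcond=None)
        # Clip coefficients to the range [-clip_thr, clip_thr]
        out[pred_idx] = np.clip(coef, -clip_thr, clip_thr)
    return out, pred_idx

# Toy data: node 1 is exactly three times node 0
rng = np.random.default_rng(0)
x0 = rng.standard_normal(50)
x2 = rng.standard_normal(50)
data = np.column_stack([x0, 3.0 * x0, x2])

topo_now = np.array([1, 0, 1])  # nodes 0 and 2 are candidate predictors
out, pred_idx = regression_one_node_sketch(data, topo_now, i=1, n_max=1)
```

Here the cap keeps only node 0 as a predictor, and its fitted coefficient of about 3 is clipped to the default threshold of 1.0.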