Sparse Nonnegative Interaction Models

Non-negative least square regression (NLS) is a constrained least squares problem where the coefficients are restricted to be non-negative. It is useful for modeling non-negative responses such as time measurements, count data, histograms and so on. Existing NLS solvers are designed for cases where the predictor variables and response variables have linear relationships, and do not consider interactions among predictor variables. In this paper, we solve NLS in the complete space of power sets of variables. Such an extension is particularly useful in biology, for modeling genetic associations. Our new algorithms solve NLS problems exactly while decreasing computational burden by using an active set method. The algorithm proceeds in an iterative fashion, such that an optimal interaction term is searched by a branch-and-bound subroutine, and added to the solution set one another. The resulting large search space is efficiently restricted by novel pruning conditions and two kinds of sparsity promoting regularization; $l_{1}$ norm and non-negativity constraints. In computational experiments using HIV-1 datasets, 99% of the search space was safely pruned without losing the optimal variables. In mutagenicity datasets, the proposed method could identify long and accurate patterns compared to the original NLS. Codes are available from https://github.com/afiveithree/inlars.


I. INTRODUCTION
Non-negative least squares regression (NLS) is a constrained least squares problem in which the coefficients are restricted to be non-negative. It is useful for modeling non-negative data such as time measurements, count data, or prices. It was introduced in [14], and many algorithms have been developed since then. NLS is not only well known as a subroutine for solving non-negative matrix factorization [11], but also has many other applications. For example, in computational chemistry it is used for estimating concentrations from spectral data, in which the percentage of a composition cannot take negative values [2]. Another example, in computational biology, is mass spectrometry analysis, in which an observed spectrum is recovered by fitting template isotope patterns [23]. A review by [3] covers other applications such as text mining and speech recognition, as well as algorithmic solutions.
However, existing methods for NLS assume linear relationships between the predictor variables and the response, and cannot automatically consider higher-order interactions. As we show later in this paper, there are promising applications in biology and chemistry in which NLS with variable interactions is useful. For example, in HIV research, the drug resistance of a virus is modeled by a weighted sum of known mutations, where the weights are required to be non-negative, and the accumulation of mutations often triggers a severe increase in drug resistance. The reason that NLS with variable interactions has not been considered before lies in the necessity of enumerating all the variable interactions a priori, an NP-hard problem. Frequent pattern discovery, or itemset mining methods such as Apriori [1], FP-growth [8], or Eclat [31], are typically used for enumerating patterns, and enumeration of all the itemsets is equivalent to enumeration of all the combinations of variables in the given dataset. However, using all the resulting patterns as features for a machine learning algorithm requires some engineering. For example, [22] employed LASSO regression [27] as the machine learning algorithm and LCM [23] as the enumeration algorithm to consider all the interactions among mutations for predicting the drug resistance of anti-HIV agents. Reference [18] extended this work with GAP safe screening, and demonstrated that further efficiency can be brought to the pattern search. SHIMR [4] extended LASSO to classification by employing a double hinge loss, and demonstrated its ability to reject uncertain samples to improve test set accuracy. In all of the above works, ℓ1 constraints played an important role in suppressing the number of variables used to build prediction models. This paper is positioned in the line of these works; however, it employs, for the first time, non-negativity constraints for inducing sparsity, whose support-recovery properties have been studied recently [17], [23]. Similarly to the previous works, we employ a modified itemset mining algorithm as a subroutine to find the variable interactions necessary for explaining the response, so the solution obtained is identical to the one obtained by a two-step approach that first enumerates all the variable interactions and then runs an NLS solver on them. In the feature selection literature [7], the above two-step approach is known as a filter method, whereas our approach is known as an embedded method. An obvious disadvantage of a filter approach is the necessity of enumerating all the interacting variables that satisfy a certain predefined threshold. In the data mining literature, this threshold is known as the minimum support, and pattern enumeration with a low minimum support is often intractable due to the exponential growth in the number of resulting patterns. Nevertheless, using a high minimum support is harmful for building an accurate classifier or regressor, since infrequent patterns may be crucially important for a given problem. Thus, in our solution, we employ an iterative approach similar to boosting, and collect a handful of salient variable interactions one at a time. Compared to filter methods, this embedded approach has an advantage in terms of storage, since it does not need to keep an intractably large number of variables in memory.
For the sake of search efficiency, we define a tree-based search space and develop pruning conditions such that unnecessary parts of the search space can be pruned. Our pruning conditions make use of the response label attached to each sample, so they constitute a supervised approach; it is therefore much more efficient than the unsupervised filter approach based solely on minimum support. Another important difference of this work from [22] is the efficiency in searching for the solution set, which is obtained by the additional non-negativity constraints. We demonstrate this fact using simulated data in Section IV.
This paper is based on the ICDM'18 paper [25], which included a solution to the ℓ1 regularized NLS problem considering variable interactions. However, recent studies have revealed that non-negativity constraints alone work as a weak regularization that helps promote sparsity, so we pursue this direction in this extended version. In this paper, we introduce two kinds of non-negative least squares models considering interactions: 1) NIM (Nonnegative Interaction Model), which makes use of only non-negativity constraints and whose solution is found by a cutting plane method [16]; if the size of the interaction is limited to 1, the solution is identical to that of NLS. 2) NIL1 (Nonnegative Interaction model with ℓ1 regularization), which makes use of both ℓ1 and non-negativity constraints; for efficiently searching for the best regularization parameter, least angle regression [6] is employed. Note that NIL1 itself is already described in [25], but we present additional experimental results.
In order to understand the differences among OLS, NIM, and NIL1, we illustrate their solutions on a two-dimensional toy problem (Figure 1). Depending on the location of the OLS solution, four different scenarios are possible, displayed from left to right. They are categorized into one of the following three cases.
1) The OLS solution occurs in the first orthant. The NIM solution coincides with it, but the NIL1 solution may shrink further toward the origin depending on the size of the triangle. Sparsity is not induced in this case. (The leftmost plot in Figure 1.)
2) The OLS solution occurs in the second or fourth orthant. Sparsity is induced, since the NIM solution typically occurs on the x or y axis where the ellipsoid hits it. The NIM and NIL1 solutions do not coincide, and the NIL1 solution may shrink further toward zero depending on the size of the triangle, i.e., the strength of the regularization. (The second plot from the left and the rightmost plot in Figure 1, respectively.)
3) The OLS solution occurs in the third orthant. The NIM and NIL1 solutions coincide, and the maximum sparsity is obtained. (The third plot from the left in Figure 1.)
The solutions of NIM and NIL1 depend not only on the position of the OLS solution, but also on the shape of the error contour around it. In practical situations, in which we have not just two but many more variables, cases 1 and 3 rarely occur and case 2 dominates, since it is unlikely that all the OLS coefficients take the same sign.
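For concreteness, the following toy sketch (our own code, not from the paper; the response is allowed to be negative purely to illustrate the geometry) reproduces case 2 numerically: when the OLS solution has a negative coordinate, the non-negativity constraint alone already zeroes that coordinate out.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.5, size=(100, 2)).astype(float)
# Response whose OLS fit falls in the fourth orthant (positive on x1, negative on x2).
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.standard_normal(100)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_nls, _ = nnls(X, y)

print("OLS:", beta_ols)   # roughly [ 1.0, -0.5]
print("NLS:", beta_nls)   # the second coefficient is exactly zero
```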
In practice, the benefit of NIM resides in its simplicity: since it has no regularization parameter to tune, it is easy for practitioners to use [23]. On the other hand, the benefit of NIL1 is its ability to carefully tune the regularization parameter. We demonstrate in the experiments that NIM finds smaller solution sets, but NIL1 performs better owing to its ability to control the size of the solution sets.
This paper is organized as follows. In Section II, we introduce the notation and settings used throughout the paper. Section III describes the algorithms for the proposed methods. Section IV describes experimental settings and results on simulated datasets; we demonstrate the efficiency obtained by the pruning conditions and highlight the difference made by ℓ1 regularization. In Section V, we demonstrate the usefulness of the proposed methods on HIV-1 datasets and a mutagenicity dataset. Section VI discusses the interpretability of the proposed models, and Section VII concludes the paper.

II. PRELIMINARIES
We work on training data with n samples, each consisting of up to D items and a corresponding label, represented formally as {(z_1, y_1), (z_2, y_2), ..., (z_n, y_n)}, where y_i ∈ ℝ₊ and z_i ∈ {0, 1}^D. We also use the compactly represented design matrix Z = (z_1, z_2, ..., z_n)^⊤, as well as the full design matrix X = (x_1, x_2, ..., x_n)^⊤, in which the presence or absence of an itemset is represented as
$$X_{it} = I(t \subseteq z_i),$$
where I(·) is an indicator function that returns 1 if an itemset (a set of items) t is included in sample z_i and 0 otherwise, and 𝒯 is the set of all itemsets appearing in at least one of the given samples. Let p = |𝒯| = 2^D − 1 be the size of the combinatorial feature space; then X ∈ {0, 1}^{n×p} is a binary matrix that contains all the interaction terms. Due to its size, the preparation and storage of X is not only inefficient but often intractable in practice. So, in our proposed algorithm, we work with Z, and only the necessary part of X is dynamically accessed and stored in memory.

We consider the linear model in which the response is predicted as Xβ, where β are the regression coefficients. Below we impose non-negativity constraints on the coefficient vector β and attempt to predict the response vector y in a least squares sense. The resulting non-negative least squares (NLS) problem is
$$\min_{\beta} \; \|y - X\beta\|_2^2 \quad \text{subject to} \quad \beta \succeq 0, \qquad (2)$$
where β ⪰ 0 stands for the element-wise non-negativity constraint on the vector β. Introducing an ℓ2 penalty results in non-negative ridge regression, and introducing an ℓ1 penalty results in ℓ1 regularized non-negative least squares (ℓ1 NLS), or the non-negative LASSO:
$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \quad \text{subject to} \quad \beta \succeq 0, \qquad (3)$$
where λ is a regularization parameter and β ∈ ℝ^p is a sparse coefficient vector whose entries are mostly zeros due to the ℓ1 penalty [9]. The λ that performs best depends on the data, and one has to use an adaptive method such as cross-validation with grid search [9]. In our case, however, we would have to run the expensive enumeration algorithm once for every candidate λ. This is not only inefficient, but may become intractable for a large number of λs. Thus we take advantage of a regularization path tracking algorithm and attempt to reduce the computational burden of the enumeration algorithm as much as possible. This is realized by jumping from the current λ to the next λ and reusing the already found features when identifying the next feature [6].
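To make the notation concrete, the sketch below (our own code, feasible only for very small D) materializes the full design matrix X over all 2^D − 1 itemsets and solves problem (2) with SciPy's NNLS solver; this is exactly the naïve construction that the proposed algorithm avoids by accessing X lazily through Z.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import nnls

def full_design_matrix(Z):
    """Z: (n, D) binary item matrix -> X: (n, 2^D - 1) itemset matrix with X[i, t] = I(t in z_i)."""
    n, D = Z.shape
    itemsets = [c for r in range(1, D + 1) for c in combinations(range(D), r)]
    X = np.zeros((n, len(itemsets)))
    for j, t in enumerate(itemsets):
        X[:, j] = Z[:, list(t)].all(axis=1)   # 1 iff every item of t is present in the sample
    return X, itemsets

rng = np.random.default_rng(1)
Z = rng.binomial(1, 0.6, size=(30, 6))        # n = 30 samples, D = 6 items -> p = 63 columns
X, itemsets = full_design_matrix(Z)
y = 2.0 * X[:, 3] + 0.1 * rng.random(30)      # a small non-negative toy response

beta, _ = nnls(X, y)                          # problem (2): min ||y - Xb||^2  s.t.  b >= 0
print({itemsets[j]: round(b, 2) for j, b in enumerate(beta) if b > 1e-6})
```

Even at D = 20 the matrix would already have over a million columns, which motivates the dynamic search of Section III.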

B. PATH TRACKING ALGORITHM
Reference [21] has shown that ℓ1 regularized regression with a quadratic loss function has piecewise linear solution paths, which can be computed with the same time complexity as ordinary least squares. Let us decouple the loss and the penalty in equation (3), and define L(y, Xβ) = ½‖y − Xβ‖₂² and J(β) = ‖β‖₁. Our goal is to trace the coefficient profile β̂(λ) along a decreasing sequence of λs. This problem can be solved efficiently only when β̂(λ) is piecewise linear [21], namely when, for a finite set of ordered values λ₀ > λ₁ > ··· > λ_K,
$$\hat{\beta}(\lambda) = \hat{\beta}(\lambda_k) + (\lambda_k - \lambda)\,\gamma_k \quad \text{for } \lambda_{k+1} \le \lambda \le \lambda_k,$$
where γ_k is the direction of the path at λ_k, known as an equiangular vector. Proposition 1 in [21] shows, by a Taylor expansion of the loss around the current solution, that this direction can be written in terms of the gradient and Hessian of the loss. As we will see in this section, the nonnegative LASSO (3) satisfies the sufficient conditions for the coefficient solution path to be piecewise linear, so it is possible to keep track of it.

More formally, since β ⪰ 0 implies ‖β‖₁ = Σ_j β_j, the Lagrange primal function of the nonnegative LASSO (3) can be written as
$$L(\beta) + \lambda \sum_{j} \beta_j - \sum_{j} \delta_j \beta_j,$$
where δ_j ≥ 0 is a Lagrange multiplier, and the dependence of L on X and y is omitted for simplicity. From the Karush-Kuhn-Tucker (KKT) conditions, we have
$$(\nabla L(\beta))_j + \lambda - \delta_j = 0, \qquad \delta_j \beta_j = 0, \qquad \delta_j \ge 0, \qquad \beta_j \ge 0.$$
Since both β_j and δ_j take either zero or positive values, we can derive the following conditions:
$$\beta_j > 0 \;\Rightarrow\; (\nabla L(\beta))_j = -\lambda, \qquad \beta_j = 0 \;\Rightarrow\; (\nabla L(\beta))_j + \lambda \ge 0.$$
If λ = 0, this corresponds to the unregularized case, and a solution can be obtained by a standard active set method [14]. If λ > 0, active variables and inactive variables obey different conditions. Let the set of active variables be A = {j : β_j(λ) ≠ 0}. Restricted to A, the first condition above yields the closed-form solution
$$\beta_A(\lambda) = (X_A^{\top} X_A)^{-1}\bigl(X_A^{\top} y - \lambda\,\mathbf{1}_A\bigr),$$
which is linear in λ, so the direction of the regularization path at any λ (> 0) is −(X_A^⊤X_A)^{-1}1_A, where X_A and 1_A denote the design matrix and the vector of ones restricted to the active set A, respectively. Let γ be the unregularized least squares solution restricted to A, i.e., γ_A = (X_A^⊤X_A)^{-1}X_A^⊤y and γ_j = 0 for j ∉ A. Our update rule is represented as an affine combination of β and γ:
$$\beta^{*} = (1 - \tau)\,\beta + \tau\,\gamma,$$
where 0 < τ < 1 is a step length. Note that setting τ = 0 corresponds to staying at the current active set, so (∇L(β))_A = −λ1_A holds, whereas setting τ = 1 corresponds to taking a full step towards the least squares solution, so (∇L(γ))_A = 0 holds. Since β* is a linear function of τ within its feasible region, we can track it. Along the path, one of the following two events can occur.
• Deleting a variable: an active variable becomes inactive, i.e., one of the β*_j reaches zero. The step length until this event is obtained by solving (1 − τ)β_j + τγ_j = 0, which gives
$$\tau_1 = \min_{j \in A:\; \gamma_j < 0} \; \frac{\beta_j}{\beta_j - \gamma_j}.$$
• Adding a variable: an inactive variable becomes active. Since ∇L(β*) is an affine combination of ∇L(β) and ∇L(γ), it satisfies
$$\nabla L(\beta^{*}) = (1 - \tau)\,\nabla L(\beta) + \tau\,\nabla L(\gamma).$$
Plugging (∇L(β))_A = −λ1_A and (∇L(γ))_A = 0 into the above equation shows that the effective regularization parameter shrinks to (1 − τ)λ along the move; an inactive variable t therefore becomes active when (∇L(β*))_t = −(1 − τ)λ, which gives the step length
$$\tau_2 = \min_{t \notin A} \; \frac{(\nabla L(\beta))_t + \lambda}{(\nabla L(\beta))_t + \lambda - (\nabla L(\gamma))_t},$$
where the minimum is taken over candidates for which the ratio lies in (0, 1). The step lengths of both events are computed, and the smaller one is chosen. After an event occurs, the direction of the solution path needs to be recalculated.
(Algorithm 1, excerpt: if τ = τ₁, the corresponding variable is removed from the active set A; if τ = τ₂, it is added to A.)
An interesting extreme case is when λ is at its largest. In that case, β = 0 is the solution to equation (3), and from the KKT conditions such λ is found as
$$\lambda_{\max} = \max_{t \in \mathcal{T}} \; \sum_{i=1}^{n} X_{it}\, y_i,$$
and the variable attaining this maximum is the first to enter the active set. So we start by searching for this largest λ, then gradually decrease it along the regularization path. The whole algorithm is shown in Algorithm 1.
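To make the bookkeeping of Algorithm 1 concrete, the sketch below (our own code, not the authors' implementation) traces the path on an explicitly materialized design matrix, i.e., without the tree search of Section III; it assumes the loss L(β) = ½‖y − Xβ‖₂² defined above, so the gradient is X^⊤(Xβ − y).

```python
import numpy as np

def nn_lasso_path(X, y, max_events=50):
    """Trace the piecewise-linear non-negative LASSO path (Algorithm 1) on an
    explicit design matrix.  grad(b) = X^T (X b - y) for L(b) = 0.5*||y - Xb||^2."""
    n, p = X.shape
    corr = X.T @ y
    lam = float(corr.max())                  # lambda_max = max_t sum_i X_it y_i
    A = [int(np.argmax(corr))]               # the maximizer enters the active set first
    beta = np.zeros(p)
    path = [(lam, beta.copy())]

    for _ in range(max_events):
        gamma = np.zeros(p)
        gamma[A], *_ = np.linalg.lstsq(X[:, A], y, rcond=None)  # full-step target (tau = 1)
        g_beta = X.T @ (X @ beta - y)
        g_gamma = X.T @ (X @ gamma - y)

        tau, event = 1.0, None
        for j in A:                          # deletion event: an active coefficient hits zero
            if gamma[j] < 0:
                t = beta[j] / (beta[j] - gamma[j])
                if t < tau:
                    tau, event = t, ('del', j)
        for j in range(p):                   # addition event: gradient hits -(1 - tau)*lambda
            if j in A:
                continue
            a, b = g_beta[j] + lam, g_gamma[j]
            if a > 0 and b < 0:
                t = a / (a - b)
                if t < tau:
                    tau, event = t, ('add', j)

        beta = (1 - tau) * beta + tau * gamma
        lam = (1 - tau) * lam
        path.append((lam, beta.copy()))
        if event is None or lam < 1e-12:     # reached the unregularized solution
            break
        if event[0] == 'del':
            A.remove(event[1]); beta[event[1]] = 0.0
        else:
            A.append(event[1])
    return path
```

Each entry of the returned path is a pair (λ, β) recorded at one event; in the proposed method, the two inner loops over columns are replaced by Search Problems 1 and 2 over the itemset tree.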

III. METHODS
In applying Algorithm 1 to the full design matrix X, an obvious but inefficient way is to extract X from the training dataset Z first, and then run the path-tracking algorithm on it. However, as the number of items D or the number of samples n increases, this quickly becomes intractable. So we extract the necessary part of X dynamically from Z. The intuition behind this idea is that, due to the sparsity induced by the ℓ1 norm, the number of active variables should be much smaller than the number of inactive variables, which enables us to keep only the active variables. Therefore, we focus on lines 2 and 6 of Algorithm 1, where a variable switches from the inactive set to the active set. Below, we call these Search Problems 1 and 2, and propose efficient branch-and-bound algorithms for both of them.
As a canonical search space, we employ the one defined by the closed itemset mining (CIM) algorithm LCM [29]. Given transactions consisting of itemsets, CIM enumerates frequently appearing itemsets in a database while ignoring their subsets with the same frequency. This notion defines closed itemsets rather than frequent itemsets, and helps reduce the size of the output as well as improve interpretability. An exemplar canonical search space of CIM is illustrated in Figure 2. (Algorithm 2, whose pseudocode accompanies this section, starts a projection from every single-item node, returns immediately from a node whenever pruning condition (15) holds, updates the current best step length τ, and recurses into the child nodes.) In a canonical search space spanned by a CIM algorithm, one can reach the node corresponding to any combination of items (itemset) without visiting the same node twice. However, the number of nodes grows exponentially with the number of items, so tree pruning is of crucial importance. Below, we propose efficient bounding conditions using the target label information attached to each sample.

A. SEARCH PROBLEM 1
We show the pseudocode for solving Search Problem 1 in Algorithm 2, which searches for the minimum step length τ. Suppose that we have reached a node t in the search tree, and let the minimum step length found so far be τ*. If we can guarantee that there exists no smaller step length in the downstream nodes of t, then we can safely prune the rest of the subtree without losing the optimal pattern. We introduce two such pruning conditions for Search Problem 1 below.
Theorem 1: Let (∇L(β))_t = Σ_{i=1}^{n} X_{it} C_i, where C_i = Σ_{t'∈A} X_{it'} β_{t'} − y_i is the residual of sample i under the current coefficients. If the condition
$$\sum_{i:\,C_i \ge 0} X_{it}\, C_i + \lambda < 0 \qquad (15)$$
is satisfied, then there exists no solution in the downstream of the node t, so we can safely prune the rest of the subtree without losing the optimal pattern.
Proof: Let t′ be a child node of t, so that t ⊆ t′ and hence X_{it′} ≤ X_{it} for all i. Then
$$(\nabla L(\beta))_{t'} = \sum_{i=1}^{n} X_{it'}\, C_i \;\le\; \sum_{i:\,C_i \ge 0} X_{it'}\, C_i \;\le\; \sum_{i:\,C_i \ge 0} X_{it}\, C_i.$$
The second inequality holds because X_{it′} ≤ X_{it} for all i. On the other hand, from the KKT conditions, (∇L(β))_{t′} + λ ≥ 0 must hold for any candidate variable t′. Therefore, if Σ_{i: C_i ≥ 0} X_{it} C_i + λ < 0 is guaranteed at the current node t, there is no need to search the downstream nodes, since none of them can satisfy the KKT condition.
The computational complexity of evaluating the above pruning condition is O(nk), where k is the number of active patterns, which is generally very small thanks to the sparsity induced by the ℓ1 norm. Similarly to the first pruning condition, we propose a second pruning condition.
Theorem 2: Let C′_i = Σ_{t'∈A} X_{it'} γ_{t'} − y_i be the residual of sample i at the full-step solution γ, so that (∇L(γ))_t = Σ_{i=1}^{n} X_{it} C′_i. If the condition
$$\sum_{i:\,C'_i \le 0} X_{it}\, C'_i = 0$$
is satisfied, then there exists no solution in the downstream of the node t, so we can safely prune the rest of the subtree without losing the optimal pattern.
Proof: Let t′ be a child node of t, so that t ⊆ t′ and X_{it′} ≤ X_{it} for all i. Then
$$(\nabla L(\gamma))_{t'} = \sum_{i=1}^{n} X_{it'}\, C'_i \;\ge\; \sum_{i:\,C'_i \le 0} X_{it'}\, C'_i \;\ge\; \sum_{i:\,C'_i \le 0} X_{it}\, C'_i = 0.$$
The inequalities hold because X_{it′} ≤ X_{it} for all i and the summands with C′_i ≤ 0 are non-positive. Hence (∇L(γ))_{t′} ≥ 0, which violates the non-positivity of the gradient required for a variable to become active (Section II-B), so no downstream node can yield a valid step length.
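As an illustration, the Theorem 1 check at a node t can be evaluated directly from the current residuals (a minimal sketch with our own variable names); Theorem 2 can be checked analogously using the residuals at the full-step solution γ.

```python
import numpy as np

def prune_by_theorem1(support_t, C, lam):
    """support_t : boolean mask of samples covered by itemset t (X_{it} = 1)
    C           : current residuals, C_i = (X beta)_i - y_i
    lam         : current regularization parameter lambda
    Returns True if condition (15) holds, i.e. no descendant t' of t can satisfy
    the KKT entry condition (grad L(beta))_{t'} + lambda >= 0, so the subtree is pruned."""
    upper_bound = C[support_t & (C >= 0)].sum()   # bounds the gradient at any descendant
    return upper_bound + lam < 0
```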

B. SEARCH PROBLEM 2
Search Problem 2 is called at line 2 of Algorithm 1. It is a maximization problem, but its procedure is similar to Algorithm 2, so we omit the pseudocode. What we search for is a node that maximizes the following criterion:
$$t^{*} = \arg\max_{t \in \mathcal{T}} \; \sum_{i=1}^{n} X_{it}\, y_i. \qquad (16)$$
Suppose that we have reached a node t; let the corresponding objective value be (∇L(β))_t = Σ_{i=1}^{n} X_{it} y_i, and let the maximum value found so far be (∇L(β))*_t. If we can guarantee that there exists no larger objective value in the downstream nodes of t, then we can safely prune the rest of the subtree without losing the optimal pattern. Theorem 3 [12]: If the condition
$$\sum_{i:\,y_i \ge 0} X_{it}\, y_i \;\le\; (\nabla L(\beta))_t^{*}$$
is satisfied, then there exists no greater (∇L(β))_{t′} in the downstream of the node t, so we can safely prune the rest of the subtree without losing the optimal pattern. For the proof, please refer to [12].
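The following sketch (our own simplified code) carries out the depth-first search for Search Problem 2 with the Theorem 3 bound; it enumerates itemsets over a plain prefix tree, which stands in for the closed-itemset tree of LCM used in the actual implementation.

```python
import numpy as np

def best_itemset(Z, y, max_size=None):
    """Find the itemset t maximizing sum_i X_{it} y_i, pruning with the Theorem 3 bound."""
    n, D = Z.shape
    best = {'value': -np.inf, 'itemset': None}

    def project(items, support):
        value = y[support].sum()                 # objective at node t
        bound = y[support & (y >= 0)].sum()      # upper bound for all descendants of t
        if bound <= best['value']:               # Theorem 3: prune the subtree
            return
        if items and value > best['value']:
            best['value'], best['itemset'] = value, tuple(items)
        if max_size is not None and len(items) >= max_size:
            return
        last = items[-1] if items else -1
        for j in range(last + 1, D):             # children: append a larger item index
            child_support = support & (Z[:, j] == 1)
            if child_support.any():
                project(items + [j], child_support)

    project([], np.ones(n, dtype=bool))
    return best
```

With the non-negative responses used in this paper the bound coincides with the node value, but the same routine can be reused with a signed vector w in Section III-C, where the restriction to non-negative entries matters.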

C. NON-NEGATIVE INTERACTION MODEL WITHOUT ℓ1 CONSTRAINTS
For solving the NLS problem with variable interactions, we revisit equation (2). For efficiently solving it, we introduce a cutting plane approach below. The problem can be rewritten as the following optimization problem:
$$\min_{\beta \succeq 0,\; \xi}\; \sum_{i=1}^{n} \xi_i^{2} \quad \text{subject to} \quad \xi_i = y_i - x_i^{\top}\beta, \quad i = 1, \ldots, n,$$
where ξ are slack variables. The above program has p coefficient variables and n constraints. Since p can be extremely large due to the potential number of interactions, directly solving the primal problem is hard, so we derive the dual problem. From the KKT conditions we obtain stationarity conditions that express ξ and β in terms of the Lagrange multipliers, where δ is the slack variable (multiplier) associated with the non-negativity constraints. Substituting these back into the Lagrangian, the dual problem is obtained (equations (25)-(28)), where w₊ and w₋ are Lagrange multipliers. After solving the dual problem, the primal solution vector β can be recovered from the Lagrange multipliers of the dual problem. The dual problem has a large number of constraints, but we employ a cutting plane method [16] to handle them. In principle, it finds an initial solution by ignoring all the constraints, and repeatedly adds constraints one by one to the dual problem until a stopping criterion is satisfied. The constraint to be added is called the most violated constraint, and can be found by solving the following search problem:
$$t^{*} = \arg\max_{t \in \mathcal{T}}\; \sum_{i=1}^{n} X_{it}\, w_i. \qquad (29)$$
In our case X is a binary matrix, and this problem involves integer programming. However, it has the same form as equation (16), so we can receive the benefit of Theorem 3, an efficient tree pruning condition in the space of the power set of variables. To summarize, we iteratively solve the quadratic problem (equations (25)-(28)) with a limited number of constraints, and the solution vector w is used in subproblem (29) to search for the next constraint. The resulting algorithm is shown in Algorithm 3.
Since the primal problem is convex, so is its dual, and we obtain the global solution by repeating the loop until convergence. Moreover, the relative distance to the optimal solution can be measured using the duality gap.
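The sketch below (our own code, under the stated assumptions) mirrors Algorithm 3 in a column-generation form: instead of solving the dual quadratic program (25)-(28) explicitly, it re-solves the restricted primal with SciPy's NNLS and uses the residual y − X_Wβ as a stand-in for the dual vector w when searching for the most violated constraint (29); best_itemset refers to the Search Problem 2 sketch above.

```python
import numpy as np
from scipy.optimize import nnls

def itemset_column(Z, itemset):
    """X_{.t}: indicator column of itemset t, built on demand from the raw item matrix Z."""
    return Z[:, list(itemset)].all(axis=1).astype(float)

def nim_cutting_plane(Z, y, max_iter=50, tol=1e-6):
    working, beta = [], np.zeros(0)              # working set of itemsets and their coefficients
    for _ in range(max_iter):
        if working:
            X_W = np.column_stack([itemset_column(Z, t) for t in working])
            r = y - X_W @ beta                   # residual plays the role of the dual vector w
        else:
            r = y.astype(float)
        found = best_itemset(Z, r)               # most violated constraint, cf. (29)
        if found['itemset'] is None or found['value'] <= tol:
            break                                # all constraints (approximately) satisfied
        working.append(found['itemset'])
        X_W = np.column_stack([itemset_column(Z, t) for t in working])
        beta, _ = nnls(X_W, y)                   # re-solve the restricted problem
    return dict(zip(working, beta))
```

The stopping test corresponds to the NNLS optimality condition x_t^⊤(y − Xβ) ≤ 0 for every itemset t, so the loop terminates, up to the tolerance, at the solution of equation (2) over the full combinatorial feature space.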

IV. SIMULATION STUDY
In this section, we demonstrate the effectiveness of the proposed pruning conditions on simulated datasets. The simulated datasets were generated as follows. First, we drew each entry of a design matrix X ∈ {0, 1}^{n×p} from a Bernoulli distribution (q = 0.6), for varying sizes. Then we randomly selected 5 columns of the design matrix, and further selected 5 features out of their 2^5 combinations. The true coefficients w were drawn from the uniform distribution U[0,1], and the response vector was built as y = Xw. All computation times were measured on a 64-bit machine with an Intel(R) Xeon(R) E5-2697 processor (2.70 GHz).
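The data-generating process just described can be sketched as follows (our own code; the random seed and the restriction to nonempty combinations are our assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)
n, D = 200, 50
Z = rng.binomial(1, 0.6, size=(n, D))               # each item drawn from Bernoulli(q = 0.6)

cols = rng.choice(D, size=5, replace=False)          # 5 randomly chosen columns
subsets = [tuple(cols[list(idx)])                    # their nonempty combinations (2^5 - 1)
           for r in range(1, 6) for idx in combinations(range(5), r)]
true_sets = [subsets[k] for k in rng.choice(len(subsets), size=5, replace=False)]

w = rng.uniform(0.0, 1.0, size=5)                    # true coefficients from U[0, 1]
X_true = np.column_stack([Z[:, list(t)].all(axis=1) for t in true_sets]).astype(float)
y = X_true @ w                                       # response y = Xw (noiseless)
```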
In Figure 3, we compare the computation time with and without the proposed pruning conditions. For this experiment, we fixed the number of items to 50 and varied the sample size n ∈ {50, 100, 200, 500, 1000}. Note that setting the number of items D to 50 corresponds to setting the maximum size of the search space to 2^50; therefore, the size of the search tree and the time for traversing the entire space grow exponentially. Indeed, in Figure 3, the naïve approach did not finish within 24 hours when n was set to more than 200. In contrast, with our pruning conditions, the proposed search method kept the increase in computation time low up to n = 1000.

A similar observation is obtained when we fixed the number of samples to 50 and chose the number of items D from {20, 50, 100}, corresponding to p = 2^20, 2^50, 2^100. In Figure 4, we can see that the size of the search space of the naïve approach increases rapidly, and so does the time to traverse it. On the other hand, with the proposed pruning conditions, the size of the search space is restricted to 0.01% of the original one, and the corresponding search becomes more than a thousand times faster when the number of items is set to 100.

For investigating the scalability of the proposed approach, we ran NIL1 on simulated datasets with larger D and n. The results appear in Table 1. It turns out that NIL1 scales to n = 1,000,000 or D = 1,000,100 when either one of them is kept small, but does not scale when both n and D are larger than 1000. If we consider the case where n is small and D is large, it is likely that we have not observed all the important features, so this case is hard to solve in reality. On the other hand, when n is large and D is small, it is likely that we have observed all the necessary features. NIL1 could successfully handle n = 1,000,000 samples when the number of items is 100.

In order to further understand the nature of the proposed method, we compare it with itemset LAR/LASSO. For this purpose, we implemented itemset LAR/LASSO using exactly the same procedure as gLARS [28], except that the pattern mining subroutine is replaced from graph mining to itemset mining. First, we generated a dataset of 200 samples and 100 items, and compared the number of iterations until reaching the peak of the learning curve in terms of the validation set accuracy. NIL1 took 23 seconds until reaching its best (10th iteration), whereas itemset LAR/LASSO took 100,000 seconds to reach its best (95th iteration). This suggests that the size of the search space is smaller for the proposed method than for itemset LAR/LASSO. In order to verify this tendency, we measured the number of traversed nodes in the search tree until the full regularization path is obtained, and the corresponding computation time, while varying the maximum pattern size. Figure 5 compares the number of nodes visited by the proposed method and by itemset LAR/LASSO. In the figure, we can observe that the size of the search space and the resulting time consumption were more than a thousand times smaller for the proposed method than for itemset LAR/LASSO. This efficiency is attributable to the non-negativity constraints.

V. REAL-WORLD DATA EXPERIMENTS
In this section, we demonstrate the performance of the proposed methods in two applications. The first is the HIV-1 drug resistance prediction problem, and the second is the mutagenicity prediction problem.

A. HIV-1 DRUG RESISTANCE PREDICTION
In HIV-1, treatment of patients relies on prescribing drugs that decrease the number of viral copies in the blood. However, some viral strains survive by acquiring drug-resistance mutations in their genomes. These drug-resistant strains then dominate the population, while non-resistant strains are eliminated by the drugs. Clinicians and statisticians therefore wish to decompose experimentally measured drug resistance into a weighted sum of resistance contributions from each mutation, which is traditionally modeled by linear regression [20]. Considering the biological selection mechanism, however, these weights should never take negative values: a mutation with a negative weight would imply increased susceptibility to the drug, and such mutations do not survive the biological selection. In previous works [18], [22], negative regression coefficients were observed, but they could be artifacts for this reason. The NLS approach to this dataset is therefore considered to be more appropriate.
The dataset we use is a collection of results of in vitro susceptibility tests against drugs available on the market, in which the level of resistance is recorded as the fold change relative to the wild type [20]. This record comprises our target response vector y, and the genotypes are recorded as differences from the wild type sequence. For example, if an isolate has two mutations, at the first and the sixth positions of the reference sequence, that turn the original amino acids into Alanine and Cysteine, respectively, then the genotype is recorded in a set representation as {1A, 6C}. Figure 6 shows the transition of the coefficients β as the iteration proceeds. We can observe that seven different variables entered the active set at some point, but one variable was removed at the fifth event. We can also observe that all the regression coefficients are non-negative. Figure 7 shows the learning curves of NIL1 and NIM on three HIV datasets. Errors in terms of the residual sum of squares, RSS = Σ_{i=1}^{n} (y_i − x_i^⊤β)², are displayed for both the training set and the validation set. We can observe that the error curves for the training set keep decreasing as a function of the number of steps of NIL1. On the other hand, the error curve for the validation set stops decreasing and starts increasing at some point. The error curves of NIM on the training set and the validation set are displayed as horizontal flat lines, since NIM has no parameter to control regularization. The error curve of the validation set by NIL1 almost always lies lower than that of NIM, suggesting its generally better performance on the HIV-1 datasets. Figure 8 displays the distributions of the patterns found by NIL1 and NIM. It is observed that NIL1 resulted in much larger solution sets than NIM. The same tendency was observed in five other datasets as well.
In Table 2, we demonstrate the efficiency obtained by tree pruning. We compare the computation time of the proposed method with that of the naïve counterpart without pruning. In addition to the pruning conditions introduced in Section III, we employ the maximum pattern size (the number of items in an itemset) as an additional pruning condition for illustration purposes. We can observe in Table 2 that the proposed method successfully pruned more than 90% of the search space spanned by the naïve method. From Table 2, we can also observe that roughly 90% of the computation time is saved thanks to the tree pruning.
Lastly, we demonstrate the validity of the non-negativity constraints on an HIV-1 dataset. As discussed in Section II, the difference between NIL1 and itemset LAR/LASSO lies only in whether non-negativity constraints are enforced. To see this, we ran both methods on the HIV-1 AZT dataset and compared the features obtained up to the 10th iteration. Table 3 shows the features obtained by NIL1 and by itemset LAR/LASSO, as well as their coefficients and their rankings in terms of absolute values. Both methods identify the combinatorial feature {170L, 173E} as one of the most influential features, with large positive weights. However, itemset LAR/LASSO also assigns a large negative coefficient to {139R}. Mutations with large negative weights do not survive biological selection, so this feature is considered an artifact.

B. MUTAGENICITY PREDICTION
In computational chemistry, small molecules are often represented as descriptors and stored in databases together with bioactivities or chemical reactivities. These records can be used to build a statistical model for predicting the activity or reactivity of unseen compounds. We use the CPDB mutagenicity dataset [10], which provides mutagenicity classifications (341 mutagens and 343 nonmutagens) as determined by the Salmonella/microsome assay (Ames test). We ran the regression methods by setting the target labels to either mutagen (1.0) or nonmutagen (0.0). In this experiment we aim at detecting a set of descriptors that trigger mutagenicity as well as building a classifier, so we can assume that the regression coefficients take positive values. As descriptors, we employed the 166 MACCS keys [5] available through Open Babel [19]. We randomly split the dataset into 80% training and 20% testing, and measured the performance on the test set. We repeated this procedure 10 times and report the averaged statistics in Table 4. We compare NIL1 with NIM. We also ran LARS and NLS as baselines, which do not consider variable interactions. In NIL1, 20% of the training set was reserved as the validation set and used to determine the regularization parameter. In each experiment, we counted the number of features and the size of each feature, i.e., the number of items in it. Rule coverage stands for the ratio of samples represented by at least one feature. Comparing the proposed methods with the baselines, an interesting observation concerns the rule coverage. LARS and NLS showed rule coverages as high as 0.994 and 0.744, respectively. On the other hand, NIM and NIL1 showed lower coverages of 0.353 and 0.473, respectively. As we discuss in more detail below, this is due to the nature of the proposed methods, which search for complex variable interactions that cover fewer samples but have high classification accuracy.
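The descriptor step described above can be reproduced along the following lines. This sketch uses RDKit's MACCS implementation rather than the Open Babel pipeline used in the paper, and the SMILES strings, labels, and split are placeholders:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.model_selection import train_test_split

smiles = ["[O-][N+](=O)c1ccccc1", "CCO", "c1ccc2ccccc2c1"]   # hypothetical compounds
labels = np.array([1.0, 0.0, 0.0])                           # mutagen = 1.0, nonmutagen = 0.0

def maccs_bits(smi):
    mol = Chem.MolFromSmiles(smi)
    fp = MACCSkeys.GenMACCSKeys(mol)          # 167-bit vector; bit 0 is unused
    return np.array(list(fp))[1:]             # keep the 166 informative keys

Z = np.vstack([maccs_bits(s) for s in smiles])
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, labels, test_size=0.2, random_state=0)
```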
Among the non-negative models, NIL1 performed best in terms of the test set accuracy, which would be due to the consideration of variable interactions, unlike LARS and NLS. If we focus on the feature size, NIL1 discovered much larger features than NIM. The slightly inferior performance of NIM would be due to its lack of ability to avoid overfitting. As we explain in the next paragraph, large features are important since they can lead to improved interpretability. Tables 5 and 6 show representative features selected by NLS and NIL1, respectively. MACCS ID 133 found by NLS is a ring structure with a nitrogen attached to it, and has a large coverage (36.7%) with a moderate accuracy (72.5%). The same structure is captured by NIL1, but with additional substructures (Indices 1 and 4 in Table 6). By adding the additional substructures, the accuracy of this feature is boosted to 77.8% or 78.3%, at the cost of a smaller coverage. The same observation can be made for MACCS ID 16 found by NLS. This three-membered heterocyclic structure is detected by NIL1 as well (Index 6 in Table 6), but with an additional carbon substructure; its accuracy is boosted from 80% to 88.2%, at the cost of a smaller coverage. In general, an easy problem can be solved with features of high accuracy and high coverage, but a harder problem requires many rules with low coverage. In this experiment, NIL1 successfully discovered such rules, unlike the baselines. Another interesting observation can be made for MACCS ID 70 in Table 5. The corresponding structure found by NIL1 is Index 2 in Table 6. Although the coverage and accuracy do not differ between the two features, the one found by NIL1 suggests that the nitrogen occurs in the form of NO2 or NOOH. This property stems from closed itemset mining, which returns the longest itemset among the itemsets with the same frequency. Such additional information is useful for practitioners to understand the classification mechanism and to design new molecules.

VI. DISCUSSION
LASSO is known to be an interpretable method due to its ability to induce sparseness in the regression coefficients. In this paper, we incorporated two additional factors that can further improve interpretability. The first is itemset-based features: machine learning algorithms that employ itemset-based features are known to be interpretable [13], [30]. The second is non-negativity of the regression coefficients: non-negative matrix factorization is often used as an analytical and interpretable tool due to its ability to automatically extract sparse factors [15]. Its interpretability is deemed to originate from the additive combination of positive factors, a property inherited by non-negative least squares in our case.
On the other hand, our model has limitations in its interpretability. For example, if we consider an itemset {A, B}, its occurrence is highly correlated with the occurrence of either item A or item B. Pattern-based features often overlap with each other, so their interpretation is not straightforward. These are common problems when patterns are employed as features [26], and addressing them is left as future work.

VII. CONCLUSION
In this paper, we have proposed to incorporate interaction terms into the non-negative least squares problem. For suppressing the resulting intractable number of variables, we took advantage of ℓ1 regularization and non-negativity constraints. For efficiently finding interaction terms, novel bounding conditions were introduced into the tree-based search space. In terms of prediction accuracy, enforcing not only non-negativity constraints but also ℓ1 constraints turned out to be useful, which can be attributed to the ability of ℓ1 constraints to avoid overfitting. In the application to HIV-1 drug resistance prediction, we discussed the necessity of introducing both non-negativity constraints and variable interactions from a biological viewpoint. In the application to the mutagenicity prediction problem, we demonstrated that the proposed method is relatively easy to interpret and is useful for predictive as well as explanatory purposes.