A Large-Scale Linear Regression Sentiment Model

David Sun
[email protected]

ABSTRACT
This report details the findings from building a large-scale linear regression sentiment model for an Amazon book review corpus. We studied and applied a number of regression and NLP techniques, including unigram and bigram features, stop-word removal, and ridge and lasso regression.
1. DATA CORPUS
The corpus contains 975194 non-distinct book reviews collected from Amazon by Mark Dredze and others at Johns Hopkins. Alongside the textual reviews, we have numerically scored sentiment data on a scale of 1 to 5. The goal is to learn a linear-regression-based sentiment model from the textual reviews that can be applied to future reviews to obtain an estimate of the numerical sentiment score.
2. DATA PROCESSING
The original data was represented in XML format. A simple tokenizer had been applied to obtain a flattened representation consisting of a reverse-indexed dictionary matrix, from which review text and sentiment scores can be extracted. We obtained document boundaries by segmenting at the token positions corresponding to the review tag pair. The textual body of a review is extracted by matching the token positions of the review-text tag pair. Review titles often contain sentiment-charged summaries of the review, and hence are extracted and concatenated with the review body. Finally, numerical ratings are obtained by matching the token positions of the rating tag pair.
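As a concrete illustration of this extraction step, the sketch below segments a flattened token stream at boundary-token positions. The boundary names used here (review, title, review_text, rating) are assumptions for illustration and may not match the corpus's actual markup or the tokenizer's exact output.

```python
# Sketch: segmenting a flattened token stream into (text, rating) reviews.
# The boundary tokens below are illustrative assumptions, not the corpus's real tags.

def extract_reviews(tokens):
    """Return (text, rating) pairs from a flat list of tokens."""
    reviews = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "<review>":
            j = tokens.index("</review>", i + 1)        # closing review boundary
            body = tokens[i + 1 : j]

            def field(name):
                # tokens between the opening and closing boundary for one field
                a = body.index("<%s>" % name)
                b = body.index("</%s>" % name)
                return body[a + 1 : b]

            # title is concatenated with the review body, as described above
            text = field("title") + field("review_text")
            rating = float(field("rating")[0])           # numeric 1-5 score
            reviews.append((" ".join(text), rating))
            i = j + 1
        else:
            i += 1
    return reviews
```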
2.1 Duplicate Removal
It turns out that many of the reviews in the corpus were duplicates. To remove them, we computed a textual hash of the first 20 words of each review (or of all its words if the review contained fewer than 20) and discarded reviews whose hash matched that of a prior review. At the end of this process we obtained 494761 distinct reviews, roughly half the size of the original corpus.
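A minimal sketch of this deduplication pass follows. The report does not specify the hash function; an MD5 digest over the lowercased first 20 words is assumed here purely for illustration.

```python
import hashlib

def dedupe(reviews):
    """Keep only the first occurrence of each review, keyed by a hash
    of its first 20 words (or all words if the review is shorter)."""
    seen = set()
    unique = []
    for text, rating in reviews:
        words = text.split()[:20]
        key = hashlib.md5(" ".join(words).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((text, rating))
    return unique
```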
2.2 Featurization
We experimented with both bag-of-words and bag-of-bigrams models. Each review is featurized into a multinomial count of words and bigrams. These counts are then normalized via tf-idf scores, computed as

S_{t,d} = tf_{t,d} \times \log_2 (N / df_t)
where tf_{t,d} is the term frequency of the unigram or bigram token t in document d, N is the total number of documents in the corpus, and df_t is the document frequency of t. For each of the unigram and bigram models we obtain an n-by-m matrix A, where n = N and m is the total number of unique terms. Prior to featurizing we removed a set of stop-words provided by the data collectors. A re-indexing dictionary was created to map the remaining meaningful terms to column indices in the matrix A. This dictionary ensures that stop-words and miscellaneous XML tags that appear in the dictionary but are not considered in the featurization do not generate entire columns of zeros, which could drive A to singularity.
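The sketch below builds such a tf-idf matrix in sparse form. It uses scikit-learn's CountVectorizer for tokenization and counting as a convenience; this is an assumption, not the tokenizer actually used for the corpus, but the weighting follows the formula above.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

def tfidf_matrix(texts, stop_words=None, ngram_range=(1, 1)):
    """Build a sparse matrix S with S[d, t] = tf_{t,d} * log2(N / df_t).
    ngram_range=(1, 1) gives the unigram model, (2, 2) the bigram model."""
    vec = CountVectorizer(stop_words=stop_words, ngram_range=ngram_range)
    A = vec.fit_transform(texts)                    # n_docs x n_terms raw counts (CSR)
    N = A.shape[0]
    df = np.asarray((A > 0).sum(axis=0)).ravel()    # document frequency of each term
    idf = np.log2(N / df)                           # log_2(N / df_t)
    S = A @ sparse.diags(idf)                       # scale column t by its idf
    return sparse.csr_matrix(S), vec.vocabulary_
```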
2.3 Matrix Representation
Rather than storing the matrices in dense form, we use a sparse format. Figure 1 visualizes the sparsity pattern of the unigram model for the first 1K documents and 1K terms, and Figure 2 shows the sparsity pattern of the full unigram model, further demonstrating that a dense representation would be wasteful. In fact, our sparse matrix has only about 40 million nonzeros, while a dense representation would require almost 2e12 entries, about 5.6 thousand times more storage. The space saved is even greater for the bigram model due to its increased sparsity (see Figure 3); the dense bigram matrix would be roughly 15000 times more costly to store.
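A rough sketch of this storage comparison, assuming 8-byte floating-point values and 32-bit CSR index arrays (the exact byte counts depend on the index dtype actually used):

```python
def storage_report(S):
    """Compare CSR storage of a scipy sparse matrix against an
    equivalent dense float64 array."""
    n, m = S.shape
    nnz = S.nnz
    # CSR stores one 8-byte value and one 4-byte column index per nonzero,
    # plus n + 1 row pointers (4 bytes each, assuming 32-bit indices)
    csr_bytes = nnz * (8 + 4) + (n + 1) * 4
    dense_bytes = n * m * 8
    print("shape %d x %d, nnz = %d" % (n, m, nnz))
    print("CSR   : %.2f GB" % (csr_bytes / 1e9))
    print("dense : %.2f GB (%.0fx larger)" % (dense_bytes / 1e9, dense_bytes / csr_bytes))
```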
3. REGRESSION MODELS
We experimented with two regression techniques: ridge and the lasso. Both extend the basic linear regression model with different regularization formulations on the predictors to be estimated. Ridge regression shrinks the l2 norm of the predictors by minimizing the penalized residual sum of squares:

\hat{x} = \arg\min_x \|Ax - y\|_2^2 + \lambda \|x\|_2^2
The exact solution to this minimization problem is the well-known equation

x = (A^T A + \lambda I)^{-1} A^T y

where I is the identity matrix and λ is a regularization constant. A numerical argument for using ridge regression over ordinary linear regression comes from observing that the linear regression solver computes

x = (A^T A)^{-1} A^T y

However, A is often ill-conditioned or singular, and inverting such a matrix would be numerically unstable. Ridge regression adds λ along the diagonal of A^T A with a suitably chosen λ, which effectively adds λ to the eigenvalues of A^T A, so that previously vanishing eigenvalues are no longer zero.

Figure 1: Sparsity pattern of unigram model for first 1K features and 1K reviews
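A minimal sketch of the exact ridge solution, solving the regularized normal equations with a sparse solver rather than forming the inverse explicitly (function and variable names are illustrative):

```python
from scipy import sparse
from scipy.sparse.linalg import spsolve

def ridge_exact(A, y, lam):
    """Solve (A^T A + lam * I) x = A^T y directly.
    Feasible only for a modest number of features, since A^T A is m x m
    and generally far denser than A itself."""
    m = A.shape[1]
    G = (A.T @ A) + lam * sparse.identity(m, format="csr")
    return spsolve(G.tocsr(), A.T @ y)
```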
The lasso is a shrinkage method like the ridge, except that it uses l1 regularization on the predictors:

\hat{x} = \arg\min_x \|Ax - y\|_2^2 + \lambda \|x\|_1
No closed-form solution to lasso regression exists, and various iterative update-based algorithms have been proposed, such as least angle regression (LAR).
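For illustration, the sketch below fits an l1-regularized model with scikit-learn's Lasso, which uses coordinate descent rather than LAR or the interior-point method adopted later in this report; note that its alpha parameter is scaled differently from the λ above.

```python
from sklearn.linear_model import Lasso

def lasso_fit(S, y, alpha=0.1):
    """Fit an l1-regularized linear model by coordinate descent.
    sklearn minimizes (1 / (2n)) * ||Sx - y||^2 + alpha * ||x||_1,
    so alpha is not directly comparable to the lambda in the text."""
    model = Lasso(alpha=alpha, max_iter=5000)
    model.fit(S, y)          # accepts a sparse feature matrix
    return model.coef_
```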
4. RESULTS
This section reports the results obtained so far. Our primary validation technique is cross-validation: 90% of the data is used to build the model and the remaining 10% is held out. We use cross-validation for both model tuning and validation, and henceforth assume this as the default procedure from which results are obtained in subsequent discussions, unless stated otherwise.
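A sketch of this hold-out split, assuming the feature matrix S and rating vector y produced by the featurization step:

```python
from sklearn.model_selection import train_test_split

def holdout_split(S, y, test_frac=0.1, seed=0):
    """90/10 split: fit on 90% of the reviews, validate on the held-out 10%."""
    return train_test_split(S, y, test_size=test_frac, random_state=seed)

# Example usage:
# S_train, S_test, y_train, y_test = holdout_split(S, y)
```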
Figure 2: Sparsity pattern of full unigram model

4.1 Ridge Regression
We began by considering the exact solution for ridge regression via matrix inversion. There are two major drawbacks to this approach:
1. Although the model matrix A is sparse, inverting A^T A is slow and produces a dense matrix that is costly to store. This limits the size of the problem the model can handle; we were only able to solve a problem size of up to n = 30000 using the computing resources we had access to.
2. Due to the limitation on how many features we can bring into the model, performance may suffer if A is very sparse.
Figure 3: Sparsity pattern of bigram model for first 10K features and 10K reviews
For the direct solution, the parameter to be tuned is the regularization constant λ. We examined the effect of a range of λ values, from 0.001 to 100; the solution turned out to be largely insensitive to the regularization constant. The resulting RMSE values are shown in Table 1. The bigram model results were much worse than those of the unigram model, which we believe is due partly to the sparsity of A and partly to the naive selection of features (the first 30000 of them). It should be noted, however, that computing standardized Z-scores for feature selection would require the diagonal elements of (A^T A)^{-1}, which is itself computationally infeasible to calculate for large A.
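The λ sweep behind Table 1 could be reproduced along these lines, reusing the hypothetical ridge_exact helper and the 90/10 split sketched earlier:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def sweep_lambda(S_train, y_train, S_test, y_test,
                 lambdas=(0.001, 0.01, 1.0, 10.0, 100.0)):
    """Held-out RMSE for each regularization constant, using the
    ridge_exact helper from the earlier sketch."""
    for lam in lambdas:
        x = ridge_exact(S_train, y_train, lam)
        print("lambda = %-6g  RMSE = %.3f" % (lam, rmse(y_test, S_test @ x)))
```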
Figure 4: ROC plot for unigram model, blue line is ridge regression, green line is lasso regression
Figure 5: Lift plot for unigram model, blue line is ridge regression, green line is lasso regression
Table 1: RMSE results for the exact solution to ridge regression, n = 300K

           λ = 0.001   λ = 0.01   λ = 1    λ = 10   λ = 100
  Unigram  1.235       1.223      1.197    1.249    1.251
  Bigram   5.813       6.132      6.029    5.945    5.796
We next considered iterative solutions for ridge regression, specifically a conjugate-gradient-type method known as LSQR [2]. The method is based on the Golub-Kahan bidiagonalization process and has been found to have good numerical properties, especially when A is ill-conditioned. Again, cross-validation showed the value of λ to be insignificant, so we set a constant λ = 1. The ROC and 1% LIFT plots for the unigram model are shown in Figures 4 and 5 respectively.
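A sketch of this iterative solution using SciPy's LSQR implementation as a stand-in; LSQR minimizes ||Ax - y||^2 + damp^2 * ||x||^2, so the damping parameter corresponds to the square root of λ.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def ridge_lsqr(A, y, lam=1.0, iters=1000):
    """Iterative ridge solution via LSQR; damp = sqrt(lambda) gives the
    same objective as ridge regression. iters caps the iteration count."""
    result = lsqr(A, y, damp=np.sqrt(lam), iter_lim=iters)
    return result[0]        # the solution vector x
```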
Figure 6: ROC plot for bigram model ridge regression
As shown in the ROC and 1% LIFT plots for the bigram model (Figures 6 and 7), the results were again significantly worse. The reason is that the time complexity of the conjugate-gradient approach to the minimization is proportional to the number of features in the model, and the bigram model has roughly 8 million features. This was infeasible to optimize fully given our time constraints, so we set a maximum iteration count of 1000.
4.2 Lasso
We next experimented with the lasso, i.e. l1-regularized regression. Following [1], an interior-point method that minimizes the objective was adopted. This approach has been shown to be more stable numerically for large sparse A. The parameter to be tuned is the relative tolerance τ, i.e. the duality gap divided by the dual objective value, which is an upper bound on the relative suboptimality. Cross-validation showed the solution to be rather insensitive to τ. The value of λ is set to λmax × 10^-4, where λmax is the maximum eigenvalue of A^T A, as suggested by the authors of the method. The ROC and Lift plots for the unigram model are shown in Figures 4 and 5, and those for the bigram model in Figures 8 and 9. We summarize the accuracy and AUC results in Table 2.
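To illustrate this choice of λ, the sketch below follows the report's description literally and takes λmax to be the largest eigenvalue of A^T A, i.e. the square of the largest singular value of A.

```python
from scipy.sparse.linalg import svds

def lambda_from_spectrum(A, factor=1e-4):
    """Return factor * lambda_max, where lambda_max is the largest
    eigenvalue of A^T A (the square of A's largest singular value)."""
    sigma_max = svds(A, k=1, return_singular_vectors=False)[0]
    return factor * sigma_max ** 2
```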
Figure 7: Lift plot for bigram model ridge regression
Table 2: Accuracy and AUC for the unigram and bigram models, ridge and lasso regression

                    Unigram   Bigram
  Ridge  Accuracy   87.44%    81.98%
         AUC        0.9163    0.7422
  Lasso  Accuracy   86.40%    82.64%
         AUC        0.9217    0.7496
Figure 8: ROC plot for bigram model lasso regression
Figure 9: Lift plot for bigram model lasso regression
4.3 FLOP Counting
The flops command has been deprecated in Matlab since version 6. We found very late on that the Lightspeed Matlab toolbox can do flop counting with reasonable effort: the idea is to insert counting macros in the code after each matrix operation. Due to time constraints, however, we were not able to explore this option further and leave it as future work.
5. FUTURE WORK
We plan to run a couple more experiments on top of the current framework. The first order of business is to deal with the large feature space. Both ridge and lasso regression performed suboptimally for the bigram model due to the large and sparse A. We plan to experiment with domain-specific feature selection techniques, e.g. based on tf-idf or other term-ranking methods, to significantly reduce the problem size. We are also interested in logistic regression and kernel regression, and in comparing their results with both ridge and lasso.
6. REFERENCES
[1] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
[2] C. Paige and M. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43–71, 1982.