true positive: November 2013

A few days back I finished Kaggle's (www.kaggle.com) StumbleUpon Evergreen Classification Challenge. StumbleUpon is a user-curated web content discovery engine that recommends relevant, high-quality pages and media to its users, based on their interests.

The challenge: Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon?
(http://www.kaggle.com/c/stumbleupon)

My overall rank in this competition was 6th. I was one of only two participants who held a top 10 position after the private leaderboard was revealed (http://www.kaggle.com/users/5309/abhishek).

Let's talk about the approach now.

My best public score was 0.89447, which ranked 6th when the private data was revealed. I had 40+ submissions that would have made the top 10 on the private leaderboard (the best being 3rd).

I tried to keep my model as simple as possible, and there were only 3 classification models in my ensemble: two Logistic Regression models and one k-NN classifier. I used Python + scikit-learn (sklearn) throughout the competition.

I divided the data into two parts:

#1 Boilerplate: I used the preprocessing.py by Triseklion to preprocess the boilerplate text. In TfidfVectorizer, I used NLTK for stemming and tokenization. So it was basically the same as the beat_bench.py that I had posted, except for the pre-processing and the NLTK tokenizer.

#2 Raw Data: I used my own data cleaner for cleaning and tokenization, along with NLTK's HTML cleaner. preprocessing.py by Triseklion was not used here, since I had written my own pre-processing. I used the same TfidfVectorizer setup as for the Boilerplate data.
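As a rough illustration of plugging a stemming tokenizer into TfidfVectorizer, here is a minimal sketch. It is not the competition code: the post does not say which NLTK stemmer or tokenizer was used, so the Porter stemmer is an assumption, and a plain regular expression stands in for the NLTK tokenizer so that no NLTK data download is needed. The helper name `stem_tokenize` and the toy documents are made up for the example.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: Porter stemming plus a letters-only regex tokenizer;
# the post does not name the exact NLTK components it used.
_stemmer = PorterStemmer()
_word_re = re.compile(r"[a-z]+")


def stem_tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and stem each one."""
    return [_stemmer.stem(token) for token in _word_re.findall(text.lower())]


# Toy documents standing in for the boilerplate text.
docs = [
    "Running shoes for runners who run daily",
    "Breaking news: markets are falling today",
    "Classic bread recipes for everyday baking",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=stem_tokenize, token_pattern=None)
X = vectorizer.fit_transform(docs)

print(X.shape)                        # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)) # stemmed terms that became features
```

Because stemming happens inside the tokenizer, word forms such as "running" and "run" end up sharing a single TF-IDF feature.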

The next step was SVD. The TF-IDF matrices obtained from both data sets were passed through scikit-learn's TruncatedSVD. Both SVDs used 120 components.
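The TF-IDF to SVD step could look roughly like the sketch below. The function name `tfidf_svd`, the default vectorizer settings, and the toy texts are assumptions for illustration. The default of 120 components matches the post, but the toy run uses 3 because such a tiny corpus has far fewer than 120 features.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_svd(texts, n_components=120):
    """Vectorize texts with TF-IDF, then reduce them with TruncatedSVD."""
    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(texts)  # sparse matrix, one row per document
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(X_tfidf)     # dense array, one row per document


# Toy corpus (hypothetical); the real inputs were the boilerplate and raw page text.
toy_texts = [
    "classic bread recipe with simple ingredients",
    "breaking news about the stock market today",
    "how to bake sourdough bread at home",
    "election results and political news updates",
    "easy dinner recipe ideas for busy families",
    "sports scores and match highlights from today",
]

# 3 components only because the toy corpus is tiny; the post used 120.
X_svd = tfidf_svd(toy_texts, n_components=3)
print(X_svd.shape)
```

TruncatedSVD works directly on the sparse TF-IDF matrix, so the data never has to be densified before reduction.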

SVD1 ---> Logistic Regression

SVD1 ---> k-NN Classifier

SVD2 ---> Logistic Regression

The final ensemble was a simple mean of these three models.
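The three-model mean can be sketched as follows. Synthetic features stand in for the two SVD outputs, and the hyperparameters (`n_neighbors=15`, `max_iter=1000`) and the train/test split are assumptions, since the post does not give the actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the two SVD feature sets (same rows, different columns).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
svd1, svd2 = X[:, :10], X[:, 10:]

# Split by row index so both feature sets share the same train/test rows.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.25, random_state=0, stratify=y
)

# SVD1 -> Logistic Regression, SVD1 -> k-NN, SVD2 -> Logistic Regression.
models = [
    (LogisticRegression(max_iter=1000), svd1),
    (KNeighborsClassifier(n_neighbors=15), svd1),  # n_neighbors is an assumption
    (LogisticRegression(max_iter=1000), svd2),
]

probs = []
for model, feats in models:
    model.fit(feats[idx_train], y[idx_train])
    # Keep the predicted probability of the positive class.
    probs.append(model.predict_proba(feats[idx_test])[:, 1])

# Final prediction: simple unweighted mean of the three probability vectors.
ensemble = np.mean(probs, axis=0)
print(ensemble[:5])
```

Averaging probabilities rather than hard labels keeps the ranking information from each model, which matters when submissions are scored on predicted probabilities.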

Things that did not work for me (or gave a lower score):

#1 Rapid Automatic Keyword Extraction (RAKE) on both Boilerplate and Raw Data.

#2 SVM (I expected it to work, but it didn't).

#3 Naive Bayes worked to a certain extent, but the results were not satisfactory.

#4 Word embeddings derived with a neural network approach on a Wikipedia corpus.

I hope you liked my approach. I will soon be posting some code snippets (on request).

Friday, November 8, 2013

StumbleUpon Evergreen Classification Challenge