Thursday, November 28, 2013

Serially number all files

A handy one-liner to serially number all the CSV files in a folder (1.csv, 2.csv, and so on):

a=1; for f in *.csv; do mv -- "$f" "$((a++)).csv"; done
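
One caveat with renaming in place: if the folder already contains numeric names (say a 10.csv that sorts before 2.csv), a plain mv can overwrite a file that has not been renamed yet. A minimal Python sketch of a two-phase rename that avoids this is shown here. The function name `number_files` and the temporary `.renumber-N` names are my own choices for illustration, and the sketch assumes no existing file already uses such a temporary name.

```python
import tempfile
from pathlib import Path


def number_files(folder, suffix=".csv"):
    """Rename every file ending in `suffix` to 1<suffix>, 2<suffix>, ... in sorted order."""
    folder = Path(folder)
    files = sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == suffix)

    # Phase 1: move everything to temporary names so that no final name is occupied.
    staged = []
    for i, path in enumerate(files, start=1):
        tmp = path.with_name(f".renumber-{i}{suffix}.tmp")
        path.rename(tmp)
        staged.append(tmp)

    # Phase 2: move the temporary files to their final serial names.
    renamed = []
    for i, tmp in enumerate(staged, start=1):
        final = folder / f"{i}{suffix}"
        tmp.rename(final)
        renamed.append(final)
    return renamed


# Small demonstration in a throwaway directory (the file names are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("10.csv", "2.csv", "a.csv"):
        (Path(tmp_dir) / name).write_text(name)  # content records the old name
    renamed_names = [p.name for p in number_files(tmp_dir)]
    contents = sorted(p.read_text() for p in Path(tmp_dir).iterdir())
```

Each file's content survives the rename, including the old 2.csv, which a single-pass rename would have overwritten.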

Friday, November 8, 2013

StumbleUpon Evergreen Classification Challenge

A few days ago I finished the StumbleUpon Evergreen Classification Challenge on Kaggle (www.kaggle.com). StumbleUpon is a user-curated web content discovery engine that recommends relevant, high-quality pages and media to its users, based on their interests.

The challenge: "Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon?"

(http://www.kaggle.com/c/stumbleupon)

My overall rank in this competition was 6th. I was one of only two people who held on to a top-10 position after the private leaderboard was revealed (http://www.kaggle.com/users/5309/abhishek).


Let's talk about the approach now.


My best public leaderboard score was 0.89447, and that submission ranked 6th once the private leaderboard was revealed. More than 40 of my submissions would have placed in the top 10 on the private leaderboard, the best of them in 3rd place.

I tried to keep my model as simple as possible, so my ensemble contained only three classification models: two logistic regression models and one k-NN classifier. I used Python and scikit-learn (sklearn) throughout the competition.
I divided the data into two parts:
#1 Boilerplate: I used the preprocessing.py by Triskelion to preprocess the boilerplate text. In TfidfVectorizer, I used NLTK for stemming and tokenization. So it was basically the same as the beat_bench.py that I had posted, apart from the pre-processing and the NLTK tokenizer.
#2 Raw Data: I used NLTK's HTML cleaner together with my own data cleaner for cleaning and tokenization. The preprocessing.py by Triskelion was not used here, since I had my own pre-processing in place. I used the same TfidfVectorizer settings as for the Boilerplate data.
The next step was SVD. The TF-IDF matrices obtained from the two parts of the data were each passed through scikit-learn's TruncatedSVD, giving SVD1 (Boilerplate) and SVD2 (Raw Data). Both SVDs used 120 components.
SVD1 ---> Logistic Regression
SVD1 ---> k-NN Classifier
SVD2 ---> Logistic Regression
The final ensemble was a simple mean of these three models.
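
The pipeline described above can be sketched roughly as follows. This is only an illustration under several assumptions, not my competition code: it uses TfidfVectorizer's default tokenizer instead of the NLTK stemming tokenizer, default hyperparameters for the classifiers, an arbitrary `n_neighbors`, and a tiny made-up dataset (so the demo uses 2 SVD components instead of 120). The helper names `svd_features` and `ensemble_predict` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def svd_features(train_texts, test_texts, n_components):
    """Fit TF-IDF followed by TruncatedSVD on the training texts, then transform both sets."""
    reducer = make_pipeline(
        TfidfVectorizer(),  # assumption: default tokenizer, no NLTK stemming
        TruncatedSVD(n_components=n_components, random_state=0),
    )
    return reducer.fit_transform(train_texts), reducer.transform(test_texts)


def ensemble_predict(boiler_train, raw_train, y_train, boiler_test, raw_test,
                     n_components=120, n_neighbors=5):
    """Average the positive-class probabilities of the three models."""
    svd1_train, svd1_test = svd_features(boiler_train, boiler_test, n_components)
    svd2_train, svd2_test = svd_features(raw_train, raw_test, n_components)
    fitted = [
        (LogisticRegression().fit(svd1_train, y_train), svd1_test),
        (KNeighborsClassifier(n_neighbors=n_neighbors).fit(svd1_train, y_train), svd1_test),
        (LogisticRegression().fit(svd2_train, y_train), svd2_test),
    ]
    probs = [model.predict_proba(x_test)[:, 1] for model, x_test in fitted]
    return np.mean(probs, axis=0)  # simple mean of the three models


# Toy data, invented for illustration: 1 = evergreen, 0 = ephemeral.
boiler_train = [
    "classic roast chicken recipe with herbs",
    "how to bake sourdough bread at home",
    "simple guide to growing tomatoes at home",
    "election results announced last night",
    "celebrity spotted at film premiere today",
    "stock market falls after earnings report today",
]
raw_train = [
    "recipe roast chicken herbs oven dinner",
    "bake bread sourdough starter flour home",
    "garden guide tomatoes soil water home",
    "news election results vote count night",
    "news celebrity premiere film red carpet",
    "news stock market earnings report shares",
]
y_train = [1, 1, 1, 0, 0, 0]
boiler_test = ["easy bread recipe at home", "market report after election night"]
raw_test = ["recipe bread flour oven home", "news market election report"]

preds = ensemble_predict(boiler_train, raw_train, y_train, boiler_test, raw_test,
                         n_components=2, n_neighbors=3)
```

Here `preds` holds one averaged probability per test document. Computing SVD1 once and feeding it to both the logistic regression and the k-NN classifier mirrors the diagram above.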

Things that did not work for me (or gave a lower score):
#1 Rapid Automatic Keyword Extraction (RAKE) on both Boilerplate and Raw Data.
#2 SVM (I expected it to work, but it didn't).
#3 Naive Bayes worked to a certain extent, but the results were not satisfactory.
#4 Word embeddings derived with a neural network approach from a Wikipedia corpus.

I hope you liked my approach. I will soon be posting some code snippets (on request).