Thursday, December 5, 2013

Partly Sunny With a Chance of #Hashtags

Algorithm for the team (no_name):

The training data consisted of tweets and their locations. The variables to be predicted were S, W and K, which are explained below:

s = sentiment
w = when
k = kind
s1,"I can't tell"
s3,"Neutral / author is just sharing information"
s5,"Tweet not related to weather condition"  
w1,"current (same day) weather"
w2,"future (forecast)"
w3,"I can't tell"
w4,"past weather"
k7,"I can't tell"

Competition Details:

For classification we treated S, W and K separately and created different models for each of them. The dataset was also preprocessed separately for the 3 variables.

Functions implemented:
  • Sanitization Function - Each tweet was sanitized prior to vectorization. The sanitization step converted all tweets to lower-case and replaced “cloudy” with “cloud”, “rainy” with “rain” and so on.
  • Sentiment Dictionary - A list of words for different sentiments constituted the sentiment dictionary.
  • Sentiment Scoring - Each tweet was given a score if it contained any words found in the sentiment dictionary.
  • Tense Detection - A tense detector was implemented based on regular expressions. It provided a score for “past”, “present”, “future” and “not known” for every tweet in the dataset.
  • Frequent language detection - This function removed tweets whose language was not among the 10 most frequent languages in the dataset.
  • Tokenization - A custom tokenization function for tweets was implemented using NLTK.
  • Stopwords - Stopwords like 'RT','@','#','link','google','facebook','yahoo','rt' , etc. were removed from the dataset.
  • Replace Two or More - Repeated characters in a word were collapsed, e.g. “hottttt” was replaced with “hot”.
  • Spelling Correction - Spelling correction was implemented based on Levenshtein Distance.
  • Weather Vocabulary - A weather vocabulary was built by crawling a few weather sites and was used to score each tweet as related to weather or not.
  • Category OneHot - The categorical variables like state and location were one-hot encoded using this function.
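
A few of the cleaning steps above can be sketched roughly as follows. This is only an illustration and not our actual code: the names WEATHER_FORMS, STOPWORDS, squeeze_repeats and sanitize are made up for this example, the word mapping contains only the two pairs mentioned above, and the threshold of three or more repeated characters is an assumption (chosen so that ordinary double letters such as in “cool” survive).

```python
import re

# Hypothetical mapping of weather adjectives to base forms; only the two
# pairs named in the post are included here.
WEATHER_FORMS = {"cloudy": "cloud", "rainy": "rain"}

# Subset of the stopwords listed in the post, lower-cased.
STOPWORDS = {"rt", "@", "#", "link", "google", "facebook", "yahoo"}

# A run of the same character repeated three or more times (assumed threshold).
REPEATS = re.compile(r"(.)\1{2,}")


def squeeze_repeats(word):
    """Collapse a run of 3+ identical characters to a single character."""
    return REPEATS.sub(r"\1", word)


def sanitize(tweet):
    """Lower-case a tweet, squeeze repeats, normalize weather words, drop stopwords."""
    tokens = []
    for raw in tweet.lower().split():
        # Strip leading/trailing mention, hashtag and punctuation characters.
        word = squeeze_repeats(raw.strip("@#.,!?:;"))
        word = WEATHER_FORMS.get(word, word)
        if word and word not in STOPWORDS:
            tokens.append(word)
    return " ".join(tokens)


if __name__ == "__main__":
    print(sanitize("RT @bob It is so HOTTTTT and Cloudy today!"))
```

Splitting on whitespace keeps the sketch short; the real pipeline used a custom NLTK-based tokenizer, which is not reproduced here.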

Types of Data Used:
  • All tweets
  • Count Vectorization
  • TFIDF Vectorization
  • Word ngrams (1,2)
  • Char ngrams (1,6)
  • LDA on the data
  • Predicted values of S, W and K using Linear Regression and Ridge Regression
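
To make the n-gram ranges above concrete, here is a small pure-Python sketch of word n-grams (1,2) and character n-grams (1,6). It is only meant to show what the ranges mean; the function names are hypothetical, and in practice a library vectorizer would build the count or TFIDF matrices from features like these.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text, n_min=1, n_max=6):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


if __name__ == "__main__":
    tweet = "rain all day"
    # Each feature type is counted separately, giving two views of the tweet.
    print(Counter(word_ngrams(tweet)))
    print(Counter(char_ngrams(tweet)).most_common(5))
```

Character n-grams are robust to misspellings and hashtags glued together, which is one reason they are often tried alongside word n-grams on tweets.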

Classifiers Used:
  • Ridge Regression
  • Logistic Regression
  • SGD

  • Each type of data was used to train each of the classifiers, and an ensemble was created from the resulting predictions.
  • We used approximately 10 different model-data combinations for creating the final ensemble.
  • The predictions for S and W were normalized between 0 and 1 in the end.
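
The ensembling and normalization steps can be illustrated with a minimal sketch. It assumes a plain unweighted average over models and interprets “normalized between 0 and 1” as clipping each value into that range; both are assumptions made for this example, and the function names are hypothetical.

```python
def average_predictions(model_predictions):
    """Average per-label predictions from several models for one tweet.

    model_predictions is a list with one list of label scores per model.
    """
    n_models = len(model_predictions)
    n_labels = len(model_predictions[0])
    return [
        sum(pred[j] for pred in model_predictions) / n_models
        for j in range(n_labels)
    ]


def clip_unit(values):
    """Clip each value into the interval [0, 1] (assumed normalization)."""
    return [min(1.0, max(0.0, v)) for v in values]


if __name__ == "__main__":
    # Scores for two labels from two hypothetical models.
    preds = [[-0.2, 1.2], [0.1, 1.0]]
    print(clip_unit(average_predictions(preds)))
```

Regression models such as Ridge can output values outside the valid range, so some form of clipping or rescaling is needed before submission.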

Our model gave a score of 0.1469 on the leaderboard.

In the end we averaged our predictions with Jack's and finished in 4th position.

After this competition I ended up on the first page of the Kaggle rankings.
