## Tuesday, December 10, 2013

### Packing Santa's Sleigh (Python Code for MATLAB Benchmark)

``````
#
# This is a translation of the MATLAB benchmark code
# given by Kaggle in the "Packing Santa's Sleigh" competition
#
# This file gives the same score as the MATLAB benchmark
#
import numpy as np
import pandas as pd


def getData(path='presents.csv'):
    # Assumption: the competition's presents file (present ID followed by
    # the three dimensions) is saved as presents.csv in the working directory
    data = pd.read_csv(path)
    print("converting data to numpy array")
    data = np.asarray(data)
    return data


def pack(data):
    # data = data[:100, :]
    presents = data[:, 1:]       # dimensions of each present
    numPresents = data.shape[0]  # the number of presents
    print("total presents : %d" % numPresents)
    # width and length are 1000 units. Height is not fixed for the packing
    width = 1000
    length = 1000
    # Initial coordinates
    xs = 1
    ys = 1
    zs = -1
    numInRow = 0     # Store the number of presents in current row
    numInShelf = 0   # Store the number of presents in current shelf
    presentCoordinates = np.zeros((numPresents, 25))
    tempPresentLenRow = []       # lengths of the presents in the current row
    tempPresentHeightShelf = []  # heights of the presents in the current shelf
    for i in range(numPresents):
        # check if there is room in the row, else increase the row
        if xs + presents[i, 0] > width + 1:
            ys = ys + np.max(tempPresentLenRow)
            xs = 1
            numInRow = 0
            tempPresentLenRow = []
        # check if there is room in shelf, else increase the height
        if ys + presents[i, 1] > length + 1:
            zs = zs - np.max(tempPresentHeightShelf)
            xs = 1
            ys = 1
            numInShelf = 0
            tempPresentHeightShelf = []
        # column 0 is the present ID, followed by 8 vertices as (x, y, z) triples
        presentCoordinates[i, 0] = data[i, 0]
        presentCoordinates[i, [1, 7, 13, 19]] = xs
        presentCoordinates[i, [4, 10, 16, 22]] = xs + presents[i, 0] - 1
        presentCoordinates[i, [2, 5, 14, 17]] = ys
        presentCoordinates[i, [8, 11, 20, 23]] = ys + presents[i, 1] - 1
        presentCoordinates[i, [3, 6, 9, 12]] = zs
        presentCoordinates[i, [15, 18, 21, 24]] = zs - presents[i, 2] + 1
        # update the location info
        xs = xs + presents[i, 0]
        numInRow = numInRow + 1
        numInShelf = numInShelf + 1
        tempPresentLenRow.append(presents[i, 1])
        tempPresentHeightShelf.append(presents[i, 2])
        if i % 1000 == 0:
            print(i)
    # z started at -1 and went downward; shift so that all z values are >= 1
    zCoords = presentCoordinates[:, 3::3]
    minZ = np.min(zCoords.ravel())
    presentCoordinates[:, 3::3] = zCoords - minZ + 1
    return presentCoordinates


def saveCSV(predictions):
    columns = "PresentID,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4,x5,y5,z5,x6,y6,z6,x7,y7,z7,x8,y8,z8".split(',')
    submission = pd.DataFrame(predictions, columns=columns, dtype=int)
    submission.to_csv('submission.csv', index=False)


if __name__ == '__main__':
    data = getData()
    predictions = pack(data)
    saveCSV(predictions)
``````
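Each row of the benchmark output is a present ID followed by eight (x, y, z) vertices. Below is a minimal sketch of a check one might run before submitting. It assumes the 1000 x 1000 footprint used above and that every z value should be at least 1 after the final shift. The name `check_packing` is hypothetical, and the sketch does not test for overlapping presents.

```python
def check_packing(rows, width=1000, length=1000):
    """Return (present_id, reason) pairs for rows that leave the sleigh.

    Each row is expected to hold 25 values: the present ID followed by
    eight vertices stored as x, y, z triples.
    """
    problems = []
    for row in rows:
        present_id = int(row[0])
        coords = [int(v) for v in row[1:25]]
        x_vals, y_vals, z_vals = coords[0::3], coords[1::3], coords[2::3]
        if min(x_vals) < 1 or max(x_vals) > width:
            problems.append((present_id, "x out of bounds"))
        if min(y_vals) < 1 or max(y_vals) > length:
            problems.append((present_id, "y out of bounds"))
        if min(z_vals) < 1:
            problems.append((present_id, "z below 1"))
    return problems
```

An empty list from `check_packing(predictions)` means every vertex lies inside the footprint.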

## Thursday, December 5, 2013

### Partly Sunny With a Chance of #Hashtags

Algorithm for the team (no_name):

The training data consisted of tweets and their locations. The variables to be predicted were S, W and K, which are explained as follows:

s = sentiment
w = when
k = kind
============================================================
s1,"I can't tell"
s2,"Negative"
s3,"Neutral / author is just sharing information"
s4,"Positive"
s5,"Tweet not related to weather condition"
w1,"current (same day) weather"
w2,"future (forecast)"
w3,"I can't tell"
w4,"past weather"
k1,"clouds"
k2,"cold"
k3,"dry"
k4,"hot"
k5,"humid"
k6,"hurricane"
k7,"I can't tell"
k8,"ice"
k9,"other"
k10,"rain"
k11,"snow"
k12,"storms"
k13,"sun"
k14,"tornado"
k15,"wind"

For classification we treated S, W and K separately and created different models for each of them. The dataset was also preprocessed separately for the 3 variables.

Functions implemented:
• Sanitization Function - Each tweet was sanitized prior to vectorization. Sanitization converted all tweets to lower-case and replaced “cloudy” with “cloud”, “rainy” with “rain” and so on.
• Sentiment Dictionary - A list of words for different sentiments constituted the sentiment dictionary.
• Sentiment Scoring - We assigned a score to each tweet that contained any words found in the sentiment dictionary.
• Tense Detection - A tense detector based on regular expressions assigned scores for “past”, “present”, “future” and “not known” to every tweet in the dataset.
• Frequent Language Detection - This function removed tweets whose language was not one of the frequent languages (10 frequent languages were used).
• Tokenization - A custom tokenization function for tweets was implemented using NLTK.
• Replace Two or More - Repeated characters in a word were collapsed, e.g. “hottttt” was replaced with “hot”.
• Spelling Correction - Spelling correction was implemented based on Levenshtein Distance.
• Weather Vocabulary - A weather vocabulary was built by crawling a few weather sites and was used to score tweets as related to weather or not.
• Category OneHot - Categorical variables such as state and location were one-hot encoded using this function.
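A few of these helpers can be sketched with the standard library alone. This is only an illustration under stated assumptions: the word map beyond “cloudy” and “rainy”, the tense cue words, the three-or-more repeat threshold and the `max_distance` cut-off are all hypothetical, since the team's actual rules are not listed here.

```python
import re

# Hypothetical map; only "cloudy" -> "cloud" and "rainy" -> "rain" come from the write-up.
WORD_MAP = {"cloudy": "cloud", "rainy": "rain"}

# Hypothetical cue words; the real regular expressions are not given.
TENSE_PATTERNS = {
    "past": re.compile(r"\b(was|were|had|yesterday|last night)\b"),
    "present": re.compile(r"\b(is|are|am|now|today)\b"),
    "future": re.compile(r"\b(will|gonna|going to|tomorrow|tonight)\b"),
}


def collapse_repeats(word):
    """Shrink runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def sanitize(tweet):
    """Lower-case a tweet, collapse stretched words and map weather variants."""
    words = re.findall(r"[a-z0-9#@']+", tweet.lower())
    words = [collapse_repeats(w) for w in words]
    return " ".join(WORD_MAP.get(w, w) for w in words)


def tense_scores(tweet):
    """Count regex hits per tense; 'not known' is 1 only when nothing matched."""
    scores = {name: len(p.findall(tweet)) for name, p in TENSE_PATTERNS.items()}
    scores["not known"] = 0 if any(scores.values()) else 1
    return scores


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def correct(word, vocabulary, max_distance=1):
    """Return the closest vocabulary word within max_distance, else the word itself."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word
```

For example, `sanitize("It is soooo CLOUDY and hottttt today")` gives `"it is so cloud and hot today"`. Collapsing only runs of three or more keeps ordinary double letters such as the “oo” in “cool” intact.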

Types of Data Used:
• All tweets
• Count Vectorization
• TFIDF Vectorization
• Word ngrams (1,2)
• Char ngrams (1,6)
• LDA on the data
• Predicted values of S, W and K using Linear Regression and Ridge Regression
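To make the n-gram ranges concrete, here is a small standard-library sketch of what word n-grams (1,2) and char n-grams (1,6) produce for a single tweet. The function names are hypothetical and whitespace tokenization is an assumption. Libraries such as scikit-learn expose the same idea through the `ngram_range` and `analyzer` parameters of their vectorizers.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams for every n from n_min to n_max (inclusive)."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n_min=1, n_max=6):
    """Count character n-grams (spaces included) for every n from n_min to n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts
```

Character n-grams up to length 6 capture fragments of misspelled or stretched words that word-level features miss.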

Classifiers Used:
• Ridge Regression
• Logistic Regression
• SGD

Model:
• The different types of data were used to train the classifiers listed above, and an ensemble was created from the different predictions.
• We used approximately 10 different model-data combinations for creating the final ensemble.
• The predictions for S and W were normalized between 0 and 1 in the end.
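The blending step can be sketched as a (weighted) average of per-model prediction matrices followed by a normalization into [0, 1]. This is a sketch under assumptions: equal weights by default and clipping as the normalization are choices made for the illustration, and the function names are hypothetical. The exact scheme the team used is not described here.

```python
def average_ensemble(prediction_sets, weights=None):
    """Weighted average of several models' predictions (rows x labels)."""
    if weights is None:
        weights = [1.0] * len(prediction_sets)
    total = float(sum(weights))
    n_rows = len(prediction_sets[0])
    n_cols = len(prediction_sets[0][0])
    return [
        [sum(w * p[r][c] for w, p in zip(weights, prediction_sets)) / total
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def clip01(predictions):
    """Force every predicted value into the range [0, 1]."""
    return [[min(1.0, max(0.0, v)) for v in row] for row in predictions]
```

Regression-style models such as ridge can return values slightly below 0 or above 1, which is why a final step of this kind is useful when the targets are confidences.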

Our model gave a score of 0.1469 on the leaderboard.

In the end we averaged our predictions with Jack's and finished in 4th position.

After this competition I ended up on the first page of the Kaggle rankings: http://www.kaggle.com/users/5309/abhishek