Game of Thrones Analytics

Data is coming. Game of Thrones (GOT) is getting more and more interesting, and I thought a machine should not miss out on all the fun. So here’s how I taught a machine the Game of Phrases –

  • Scraped all the subtitle (.srt) files from the internet, up to Season 5.
  • The Lords of Text Analytics processed all the text to make some sense out of it.
  • The One True Algorithm was implemented to teach Mr. Machine all he should know.

Ok, enough with the metaphors. After the scraping part, I processed the subs in Python to extract all the text –

import os

import pysrt

files = os.listdir('s04/')
sentences = []
for f in files:
    try:
        subs = pysrt.open('s04/' + f)
    except Exception:
        # skip files that pysrt cannot parse
        continue
    # the cleaning and tokenization shown below happen here, per file

 


The files have been loaded, but what about the functions in these snippets that look unfamiliar?

The complete code is on my GitHub page.

The data is here.

You can refer to those if you want to build an implementation of your own.

Data Cleaning: Game of Thrones Subs

text = clean_text(subs.text)
sentences += review_to_sentences(text,tokenizer,remove_stopwords=True)

The above snippet ran inside the loop where files were processed one by one. The cleaning part had roughly these steps –

  1. Lowercase all the text
  2. Sentence-tokenize and then word-tokenize
  3. Since this is a Game of Thrones dataset with its own vocabulary, I had to come up with my own brand of stopwords –
    more_stops = ['would','ill','come','one','up.','up','whose','get','',' ','well','say','see','going','like','tell','want','make','know','year','go','yes','take','time','never','could','need','let','enough','many','keep','nothing','oh','look','father','think','cant','thing','still','even','heard','call','back','hear','u','ever','said','better','every','find','may','word','boy','man','lady','woman','give','must','day','done','right','good','always','little','long','seven','girl','son','brother','way','child','king','lord','mother','away','got','whats','ask','wanted','put','first','much','something','friend','sure','course','told','made','war','god','old','people','world']
    
  4. I also had to normalize some text so that variants would be interpreted as the same token. Here’s the list –
    from bs4 import BeautifulSoup

    text = text.replace('\n', ' ')

    # Strip the <font> styling tags that .srt files sometimes carry
    soup = BeautifulSoup(text, 'html.parser')
    for tag in soup.find_all('font'):
        tag.replace_with('')
    text = soup.get_text()

    # Expand contractions and normalize a few show-specific tokens
    replacements = {
        "they're": 'they are', "They're": 'They are',
        "they've": 'they have', "They've": 'They have',
        "I've": 'I have', "I'm": 'I am', "I'd": 'I would',
        "won't": 'will not',
        "don't": 'do not', "Don't": 'Do not',
        "he'll": 'he will',
        "it's": 'it is', "It's": 'It is',
        "it'll": 'it will', "It'll": 'It will',
        "we're": 'we are', "We're": 'We are',
        "we'll": 'we will', "We'll": 'We will',
        "we've": 'we have', "We've": 'We have',
        "we'd": 'we would',
        "you're": 'you are', "You're": 'You are',
        "you've": 'you have', "You've": 'You have',
        "you'll": 'you will', "You'll": 'You will',
        "you'd": 'you would', "You'd": 'You would',
        "he's": 'he is', "He's": 'He is',
        "she's": 'she is', "She's": 'She is',
        "one's": 'one is',
        "that's": 'that is', "That's": 'That is',
        "there's": 'there is', "There's": 'There is',
        "where's": 'where is', "Where's": 'Where is',
        "didn't": 'did not', "haven't": 'have not',
        "isn't": 'is not', "Isn't": 'Is not',
        "aren't": 'are not', "weren't": 'were not', "wasn't": 'was not',
        # crude lexical normalizations for the GOT vocabulary
        'men': 'man', 'lannisters': 'lannister', "robb's": 'robb',
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    
  5. I did not want to stem the data, since for this dataset stemming felt like too aggressive an approach. So I lemmatized it instead (a sketch tying all these cleaning steps together follows this list) –
    from nltk.stem import WordNetLemmatizer

    wordnet_lemmatizer = WordNetLemmatizer()
    # 'stops' is the standard English stopword set plus the custom list above
    words = [wordnet_lemmatizer.lemmatize(w) for w in words if w not in stops]  # lemmatize
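
For completeness, here’s a minimal sketch of what the review_to_sentences helper used earlier might look like, wiring these cleaning steps together. This is my own reconstruction (the real clean_text and review_to_sentences live in the GitHub repo), and it assumes NLTK’s punkt sentence tokenizer plus a combined stopword set:

import nltk.data
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Combined stopword set: NLTK's English list plus the custom GOT list above
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stops = set(stopwords.words('english')) | set(more_stops)
wordnet_lemmatizer = WordNetLemmatizer()

def review_to_sentences(text, tokenizer, remove_stopwords=False):
    # Returns one list of words per sentence, ready for word2vec
    sentences = []
    for raw in tokenizer.tokenize(text.strip()):
        words = word_tokenize(raw.lower())
        if remove_stopwords:
            words = [wordnet_lemmatizer.lemmatize(w) for w in words if w not in stops]
        if words:
            sentences.append(words)
    return sentences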
    

And now, for the algorithm…

 


Game of Thrones deserves an algorithm like word2vec, and based on what I wanted to do, it looked like the perfect match. Here’s the Wikipedia definition –

Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text.

I won’t be going into the details of the algorithm here, but you can check it out at this link.

Python Code –

import logging

from gensim.models import word2vec

# Configure the built-in logging module so that Word2Vec
# creates nice output messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
        level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality
min_word_count = 2    # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 30          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "500features_2minwords_30context"
model.save(model_name)
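
If you don’t want to retrain every time, the saved model reloads in one line. A quick sanity check might look like this (the file name matches the one saved above):

from gensim.models import word2vec

# Reload the trained model and query it directly
model = word2vec.Word2Vec.load("500features_2minwords_30context")
print(model.most_similar('stark', topn=3))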

And once the algorithm had done its bit, the results turned out to be stunningly good.

Word2Vec can be used in many ways. I used it in the context of Game of Thrones in the following ways –

  1. Input a keyword to get the context

    Code

    model.most_similar('winterfell',topn=5)

    Output

    [(u'die', 0.9999886155128479),
     (u'dead', 0.9999884366989136),
     (u'name', 0.9999882578849792),
     (u'sansa', 0.9999881982803345),
     (u'killed', 0.9999881386756897)]
    

    This is what I got across seasons for the same keywords –

    Associated Keywords

    | Keyword    | S01                                   | S02                                     | S03                                  | S04                                | S05                                |
    |------------|---------------------------------------|-----------------------------------------|--------------------------------------|------------------------------------|------------------------------------|
    | winterfell | stark, lannister, grace, hand, ser    | grace, stark, ship, night, true         | stark, grace, kill, die, sansa       | die, dead, name, sansa, killed     | north, name, thank, night, wall    |
    | stark      | hand, ser, grace, last, honor         | night, grace, sister, ship, city        | kill, grace, sword, wedding, fight   | night, dead, last, sister, ser     | fight, queen, night, house, north  |
    | tyrion     | ser, jamie, stark, kingdom, lannister | jamie, please, hard, ser, home          | wedding, grace, sansa, sword, stark  | name, die, night, ser, love        | ser, people, army, grace, death    |
    | westeros   | lannisters, life, die, sword, war     | night, life, grace, ship, people, true  | kill, die, sword, grace, life        | die, life, tyrion, night, family   | people, life, queen, world, free   |
    | cersei     | robert, landing, queen, house, stark  | city, queen, grace, stark, night        | landing, blood, life, sansa, people  | tyrion, queen, last, family, dead  | queen, people, world, army, hand   |
    | baratheon  | joffrey, hand, old, wall, war         | stannis, grace, ship, night, wall       | hand, wedding, landing, blood, stark | wall, lannister, die, night, last  | wall, world, queen, north, death   |

    The evolution of Winterfell, for example, becomes quite evident with every passing season. Season 1 was about the Starks and their honor and grace; Season 4 was more about Starks getting murdered and losing control of Winterfell. Simply amazing!
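
    The per-season columns were presumably produced by training a separate model on each season’s sentences. Here is a rough sketch of that loop, assuming a hypothetical sentences_by_season dict mapping 'S01'…'S05' to per-season sentence lists (parameter names as in the training snippet above):

    # Hypothetical sketch: one model per season, same parameters as above
    keywords = ['winterfell', 'stark', 'tyrion', 'westeros', 'cersei', 'baratheon']
    for season, sents in sorted(sentences_by_season.items()):
        m = word2vec.Word2Vec(sents, workers=num_workers, size=num_features,
                              min_count=min_word_count, window=context,
                              sample=downsampling)
        for kw in keywords:
            print(season, kw, [w for w, score in m.most_similar(kw, topn=5)])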

     

  2. Given a set of words, predict the odd man out!

    Code

    model.doesnt_match('khal greyjoy targaryen'.split())

    Output

    'greyjoy'

    | Input                            | Odd Man Out |
    |----------------------------------|-------------|
    | night, watch, wall, westeros     | westeros    |
    | khal, greyjoy, targaryen         | greyjoy     |
    | ned, robb, catelyn, baelish      | baelish     |
    | arya, robb, sansa, snow, catelyn | catelyn     |

Quite unexpected, especially that one. I am holding back the parts where it failed, but it was shocking to see it get these right with reasonable accuracy!
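
If you decide to try this yourself, the table above is easy to reproduce by looping the same call over each word set. A small sketch, reusing the trained model from earlier:

# Re-run doesnt_match over the word sets from the table above
word_sets = [
    'night watch wall westeros',
    'khal greyjoy targaryen',
    'ned robb catelyn baelish',
    'arya robb sansa snow catelyn',
]
for ws in word_sets:
    print(ws, '->', model.doesnt_match(ws.split()))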

Finally, I think I at least managed to teach the machine something about the amazing HBO offering. Please post your findings on Season 6 if you decide to give this a shot!

Author: Vivek Kalyanarangan