
Text Clustering : Get quick insights from Unstructured Data 2


In this two-part series, we explore text clustering and how to get insights from unstructured data. The goal is a tool robust enough for real-world use. The first part focuses on the motivation; the second part covers the implementation.

This post is the second part of the two-part series on how to get insights from unstructured data using text clustering. We will build this in a modular way so that it can be applied to any dataset. We will also expose the functionality as an API, so that it can serve as a plug-and-play component without disrupting existing systems.

In case you are in a hurry, you can find the full code for the project on my GitHub page.

Here is a sneak peek at what the final output looks like:

Text Clustering output

Installations

UPDATE: The Docker container is here! I have been hearing a lot of complaints from people who could not get this installed. One of the primary reasons is the environment: the OS, the Python version, and so on. With the Docker setup, all that pain is eliminated and you can set things up on your end with these simple instructions.

Docker Setup

  • Install Docker
  • Run git clone https://github.com/vivekkalyanarangan30/Text-Clustering-API
  • Open docker terminal and navigate to /path/to/Text-Clustering-API
  • Run docker build -t clustering-api .
  • Run docker run -p 8180:8180 clustering-api
  • Access http://192.168.99.100:8180/apidocs/index.html from your browser (assuming you are on Windows and docker-machine has that IP; otherwise just use localhost)

Native Setup

  • Anaconda distribution of Python 2.7 – Download from here
  • flask API python package – After installing Anaconda, go to the command prompt and type
    pip install flask
  • flasgger python package – After installing Anaconda and flask, go to the command prompt and type
    pip install flasgger

You now have all the tools you need. Download the code from here to get started with the setup.

Running

Unzip the contents, open the command prompt in the unzipped folder, and type

python CLAAS_public.py

A server will start, and you can then access the tool at http://localhost:8180/apidocs/index.html

Workflow

Unguided Clustering

This is where the actual KMeans clustering happens.

  1. It takes a CSV file as input. You also provide the name of the column that contains the unstructured text, and the number of clusters
  2. Once you click the “Try it Out” button, the inputs are sent to the API
  3. The API does the text cleaning, the TF-IDF vectorization and the clustering
  4. Once it’s done, it returns a download link to the file, which now has an additional column containing the cluster numbers
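To make the clean, vectorize and cluster steps concrete, here is a minimal sketch using pandas and scikit-learn on a modern Python 3. It is not the code from the repository: the function names `clean_text` and `cluster_column`, the output column name `cluster_num`, and the cleaning rules are all assumptions made for illustration.

```python
import re

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase and keep only letters and spaces (assumed, simplistic cleaning)."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster_column(df, text_col, n_clusters):
    """Return a copy of df with a 'cluster_num' column of KMeans labels."""
    cleaned = df[text_col].fillna("").map(clean_text)
    # Turn each document into a TF-IDF weighted bag-of-words vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    # random_state is fixed only so that repeated runs give the same labels.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    out = df.copy()
    out["cluster_num"] = model.fit_predict(tfidf)
    return out


if __name__ == "__main__":
    # Hypothetical demo data; in the tool this would come from the uploaded CSV.
    demo = pd.DataFrame({"text": [
        "Refund for my payment is delayed",
        "Payment refund failed again",
        "Cannot login, password rejected",
        "Password reset needed to login",
    ]})
    print(cluster_column(demo, "text", 2))
```

With a real file, you would load it with `pd.read_csv(...)`, call `cluster_column` with the text column name and the number of clusters, and save the result with `to_csv(...)`. The cluster numbers themselves are arbitrary labels; only which rows share a number is meaningful.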

Guided Clustering

This technique is a little more straightforward.

  1. It takes two files as input, one with the data to be clustered and the other with predefined keywords
  2. It also takes the name of the text column, just as in unguided clustering
  3. As output, it adds one column for each keyword given
  4. Each of these columns is TRUE if a document contains that keyword and FALSE if it doesn’t

This gives a sense of the presence or absence of keywords in documents, showing which documents carry signals from the keywords and which don’t.
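The keyword flagging described above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not the repository's implementation: the function name `flag_keywords` is hypothetical, and matching is assumed to be case-insensitive on whole words.

```python
import csv
import io
import re
import sys


def flag_keywords(rows, text_col, keywords):
    """Return copies of rows (dicts) with one TRUE/FALSE column per keyword."""
    flagged = []
    for row in rows:
        # Assumption: a keyword "matches" when it appears as a whole word,
        # ignoring case. Multi-word keywords are not handled in this sketch.
        words = set(re.findall(r"[a-z0-9']+", row[text_col].lower()))
        new_row = dict(row)
        for kw in keywords:
            new_row[kw] = "TRUE" if kw.lower() in words else "FALSE"
        flagged.append(new_row)
    return flagged


if __name__ == "__main__":
    # Hypothetical inputs standing in for the two uploaded files.
    data_csv = "id,text\n1,Refund not received\n2,Cannot login to my account\n"
    keywords = ["refund", "login"]

    rows = list(csv.DictReader(io.StringIO(data_csv)))
    result = flag_keywords(rows, "text", keywords)

    writer = csv.DictWriter(sys.stdout, fieldnames=list(result[0].keys()))
    writer.writeheader()
    writer.writerows(result)
```

Whole-word matching avoids flagging "login" inside an unrelated longer word; if substring matching is what you want, replace the set lookup with `kw.lower() in row[text_col].lower()`.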

Conclusion

That’s all for this two-part series on text clustering. Good enough to get started, right? It was an amazing experience writing this series. See you in the next one. Have fun!

Author: Vivek Kalyanarangan