In this two-part series, we explore text clustering and how to get insights from unstructured data. The goal is a reusable tool that can be applied to any dataset with a text column. The first part focuses on the motivation. The second part covers the implementation.
This post is the second part of the two-part series on how to get insights from unstructured data using text clustering. We will build this in a modular way so that it can be applied to any dataset. We will also expose the functionality as an API, so that it can serve as a plug-and-play component without disrupting existing systems.
- Text Clustering: How to get quick insights from Unstructured Data – Part 1: The Motivation
- Text Clustering: How to get quick insights from Unstructured Data – Part 2: The Implementation
In case you are in a hurry, you can find the full code for the project on my GitHub page.
Here is a sneak peek at what the final output looks like:
UPDATE: The Docker container is here! I have been hearing a lot of complaints from people who could not get this installed. One of the primary reasons is the environment: the OS, the Python version, and so on. With the Docker setup, all that pain is eliminated, and you can set things up on your end with these simple instructions:
- Install Docker
git clone https://github.com/vivekkalyanarangan30/Text-Clustering-API
- Open docker terminal and navigate to
docker build -t clustering-api .
docker run -p 8180:8180 clustering-api
- Access http://192.168.99.100:8180/apidocs/index.html from your browser (assuming you are on Windows and docker-machine has that IP; otherwise just use localhost)
- Anaconda distribution of Python 2.7: download from here
- Flask API Python package: after installing Anaconda, go to the command prompt and type
pip install flask
- flasgger Python package: after installing Anaconda and Flask, go to the command prompt and type
pip install flasgger
You now have the tools you need. Download the code from here to get started with the setup.
Unzip the contents, open the command prompt and type
A server will start, and you can then access the tool at http://localhost:8180/apidocs/index.html
This is where the actual KMeans clustering happens.
- It takes a CSV file as input. You also provide the name of the column that contains the unstructured text, and the number of clusters
- Once you click the “Try it Out” button, the API uses these inputs
- The API does the text cleaning, the TF-IDF vectorization and the clustering
- Once it is done, it returns a download link to the output file, which has an additional column containing the cluster numbers
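The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes Python 3 and scikit-learn, the function names `clean_text` and `cluster_rows` are hypothetical, and the cleaning rule (lowercase, letters only) is just one plausible choice.

```python
import csv
import io
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase the text and keep only letters separated by single spaces (assumed rule)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster_rows(rows, text_column, n_clusters, seed=0):
    """Return copies of the rows with an extra 'cluster' column (hypothetical helper)."""
    cleaned = [clean_text(row[text_column]) for row in rows]
    # Turn each document into a TF-IDF weighted bag-of-words vector.
    features = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    # n_init is set explicitly because its default differs between scikit-learn versions.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return [dict(row, cluster=int(label)) for row, label in zip(rows, labels)]


# Tiny made-up dataset standing in for the uploaded CSV file.
SAMPLE_CSV = """id,comment
1,The battery drains very fast
2,Battery life is poor and the battery overheats
3,Delivery was late by a week
4,Late delivery and damaged delivery box
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clustered = cluster_rows(rows, text_column="comment", n_clusters=2)
for row in clustered:
    print(row["id"], row["cluster"])
```

The cluster numbers themselves are arbitrary labels. What matters is which rows share a number, so the two battery comments should land in one cluster and the two delivery comments in the other.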
This technique is a little more straightforward.
- It takes two files as input: one with the data to be clustered and the other with predefined keywords
- It also takes the name of the text column, just as the unguided clustering does
- As output, it adds one column for each keyword given
- The value is TRUE if a document contains that word and FALSE if it doesn’t
This gives a sense of the presence or absence of keywords in the documents, showing which documents contain signals from the keywords and which do not.
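As a rough sketch of the keyword flagging described above, using only the Python 3 standard library: the function name `flag_keywords`, the single-column layout of the keywords file, and the case-insensitive whole-word matching are all assumptions made for illustration, not details confirmed by the project.

```python
import csv
import io
import re


def flag_keywords(rows, text_column, keywords):
    """Return copies of the rows with one TRUE/FALSE column per keyword (hypothetical helper)."""
    flagged = []
    for row in rows:
        # Split into lowercase words so matching is case-insensitive and whole-word.
        words = set(re.findall(r"[a-z0-9']+", row[text_column].lower()))
        new_row = dict(row)
        for keyword in keywords:
            new_row[keyword] = "TRUE" if keyword.lower() in words else "FALSE"
        flagged.append(new_row)
    return flagged


# Made-up stand-ins for the two uploaded files.
DATA_CSV = """id,comment
1,The battery drains very fast
2,Delivery was late and the box was damaged
3,Great screen but poor Battery
"""

KEYWORDS_CSV = """keyword
battery
delivery
"""

rows = list(csv.DictReader(io.StringIO(DATA_CSV)))
keywords = [r["keyword"] for r in csv.DictReader(io.StringIO(KEYWORDS_CSV))]
flagged = flag_keywords(rows, text_column="comment", keywords=keywords)
for row in flagged:
    print(row["id"], row["battery"], row["delivery"])
```

Whole-word matching keeps a keyword like "late" from matching "plate". If substring matches are wanted instead, the membership test could be replaced by a check against the lowercased text.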
That is all for this two-part series on text clustering. Good enough to get started, right? It was an amazing experience writing this series. See you in the next one. Have fun!