
Text Clustering: Get Quick Insights from Unstructured Data (Part 2)


In this two-part series, we explore text clustering and how to extract insights from unstructured data. The first part covered the motivation; this second part focuses on the implementation.

This post is the second part of the two-part series on how to get insights from unstructured data using text clustering. We will build the solution in a modular way so that it can be applied to any dataset. Moreover, we will expose its functionality as an API so that it can serve as a plug-and-play component without any disruption to existing systems.

In case you are in a hurry, you can find the full code for the project on my GitHub page.

Here is a sneak peek at how the final output will look –

Text Clustering output

Installations

UPDATE: The Docker container is here! I have heard a lot of complaints from people who could not get this installed, usually because of environment differences: the OS, Python version, etc. With the Docker setup, all that pain is eliminated and you can get things running on your end with these simple instructions!

Docker Setup

  • Install Docker
  • Run git clone https://github.com/vivekkalyanarangan30/Text-Clustering-API
  • Open a Docker terminal and navigate to /path/to/Text-Clustering-API
  • Run docker build -t clustering-api .
  • Run docker run -p 8180:8180 clustering-api
  • Access http://192.168.99.100:8180/apidocs/index.html from your browser (assuming you are on Windows and docker-machine has that IP; otherwise just use localhost)

Native Setup

  • Anaconda distribution of Python 2.7 – download from here
  • Flask Python package – after installing Anaconda, open a command prompt and run
    pip install flask
  • flasgger Python package – after installing Anaconda and Flask, run
    pip install flasgger

The tools are now ready. Download the code from here to get started with setting it up.

Running

Unzip the contents, open a command prompt, and run

python CLAAS_public.py

A server will be started and you can now access the tool at this location – http://localhost:8180/apidocs/index.html
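To give a feel for how the tool exposes clustering as a web service, here is a minimal Flask sketch. The endpoint name (`/cluster`) and the stub handler are illustrative assumptions, not the actual code in CLAAS_public.py; the real tool additionally uses flasgger to generate the Swagger UI you see at /apidocs.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/cluster", methods=["POST"])
def cluster():
    # In the real tool the uploaded CSV would be clustered here;
    # this stub just echoes the requested parameters back.
    payload = request.get_json()
    return jsonify(column=payload["column"], clusters=payload["clusters"])

if __name__ == "__main__":
    # The real tool serves on port 8180
    app.run(port=8180)
```

Posting `{"column": "text", "clusters": 3}` to `/cluster` would return the same parameters as JSON, confirming the endpoint wiring before plugging in the actual clustering logic.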

Workflow

Unguided Clustering

This is where the actual KMeans clustering happens.

  1. It takes a CSV file as input. You also need to provide the name of the column containing the unstructured text and the number of clusters
  2. Once you click the “Try it Out” button, the inputs are passed to the API
  3. The API performs the text cleaning, Tf-idf vectorization, and the clustering
  4. Once it’s done, it returns a download link for a copy of the file with an additional column containing the cluster numbers
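The steps above can be sketched in a few lines with scikit-learn. This is a simplified illustration, not the actual CLAAS_public.py implementation; the function name `cluster_texts` and the cleaning rules are assumptions for the example.

```python
import re

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_texts(df, text_column, num_clusters):
    """Clean the text, vectorize with Tf-idf, and cluster with KMeans."""
    # Basic cleaning: lowercase and strip non-alphanumeric characters
    cleaned = (
        df[text_column]
        .fillna("")
        .str.lower()
        .apply(lambda t: re.sub(r"[^a-z0-9\s]", " ", t))
    )
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(cleaned)
    km = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    # Append the cluster labels as a new column, as the API does
    df["cluster"] = km.fit_predict(tfidf)
    return df

df = pd.DataFrame({"text": ["cats purr softly", "dogs bark loudly",
                            "kittens and cats", "loud barking dogs"]})
clustered = cluster_texts(df, "text", 2)
```

The API wraps exactly this kind of pipeline behind the uploaded CSV, the column-name parameter, and the cluster-count parameter.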

Guided Clustering

This technique is a little more straightforward.

  1. It takes two files as input: one with the data to be clustered and one with predefined keywords
  2. As in unguided clustering, it also takes the name of the text column
  3. As output, it adds one column for each keyword given
  4. A cell is TRUE if the document contains that keyword and FALSE if it doesn’t

This gives a sense of the presence or absence of keywords in documents, showing which documents contain signals from the keyword list and which don’t.
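The keyword-flagging step can be sketched as follows. Again, this is an illustrative assumption of how such flags could be computed, not the tool’s actual code; `keyword_flags` is a hypothetical helper name.

```python
import pandas as pd

def keyword_flags(df, text_column, keywords):
    """Add one TRUE/FALSE column per keyword marking its presence in each document."""
    lowered = df[text_column].fillna("").str.lower()
    for kw in keywords:
        # Plain substring match, case-insensitive
        df[kw] = lowered.str.contains(kw.lower(), regex=False)
    return df

docs = pd.DataFrame({"text": ["server crashed at night", "login page is slow"]})
flagged = keyword_flags(docs, "text", ["crash", "login"])
# flagged["crash"] → [True, False]; flagged["login"] → [False, True]
```

Each keyword column then doubles as a filter: selecting rows where the column is TRUE pulls out exactly the documents carrying that signal.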

Conclusion

That was all in this two-part series on text clustering. Enough to get you started, right? It was an amazing experience penning down this series. See you in the next one. Have fun!

Author: Vivek Kalyanarangan

  • Sundar

    unable to download cluster.csv (output file)

    • Vivek Kalyanarangan

      Hi Sundar
      Are there any errors? Let me know more details so that I can help!
      Post the stack trace and I may be able to help you

      • Mark Cichonski

        I also had this same issue. Everything else, instructions, etc, works great, the app runs and gives a link for the cluster.csv, but when you click on it you get the screen below. https://uploads.disquscdn.com/images/f2084711703a29b54bf5ade8e81d753981b0a35e5dc38e029814bc206f3171ca.jpg

        • Mark Cichonski

          I actually just found the file with the cluster added. In the CLAAS_public.py, it is created as Q2.csv. Not sure where the linkage between the web link and the csv file is.

          • Vivek Kalyanarangan

That’s just an offline copy that it saves on the server.
Mark – can you try it with Chrome? I did not test it with IE 🙁

  • Josh

    Hi, I have installed the flask and flasgger, but I am getting an import error
    File “CLAAS_public.py”, line 12, in
    from stemming.porter2 import stem
    ImportError: No module named stemming.porter2.

    • vivekkalyanarangan@gmail.com

      Hi
      if you have anaconda distribution, you can install stemming by executing “pip install stemming” from your command prompt. Let me know if that works

  • Jose Roberto Estupinian

It works very well, though dealing with the imports takes a bit to resolve. Very interesting

    • vivekkalyanarangan@gmail.com

      Thanks Jose!!! I am planning to add some more documentation to the Github Readme if the getting started barrier seems to be high…

  • Phil Reed

    I’ve got it working too, with a small test set of single line micro texts for now. I shall try it again after converting longer documents to fit in one line each. My question though is about the image you have used of principle component analysis, could you please share how you did that in relation to this clustering tool? Many thanks.

    • Vivek Kalyanarangan

      Thanks Phil! It is the PCA of the document term matrix… I am currently working on enhancing the UI to render that plot. But let me know if you are looking for it on priority. I will commit a version that gives back the plot along with the output as well…

  • Vivek Kalyanarangan

    For ALL: I have wrapped the application in a docker container that will get you started hassle free within minutes! Now you have what I have – the environment, the application behavior is all standardized. Wanted to do this earlier, but better late than never!