We live in a world of unstructured data, and transforming it is essential to building a good model. Consequently, no model is complete without a systematic approach to feature engineering. This post focuses on 3 data transformation tips that will help you do better exploratory data analysis in Python.
3 Data Transformation Tips: 1 – Do your exploratory statistics
First of all, as soon as we get the data we want to fit a model. We try 10 different algorithms rather than looking at the data more carefully. Before you try your hand at a model, it is a good idea to go through your data thoroughly. Pull out each variable and let it stand naked in the form of plots and charts. Torture each piece of data until it surrenders. Do your box-plots and missing-value identification/imputation to make sense of the data. Here are some quick plots to get you started –
import pandas as pd
import matplotlib.pyplot as plt

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot
    '''
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()
Probability Distribution plots
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)
x = np.random.normal(size=100)
sns.distplot(x)
sns.distplot(x, kde=False, rug=True)
plt.show()
3 Data Transformation Tips: 2 – Pull features out of the air
Data scientists often complain about not having enough data or not having enough features. The trick, sometimes, is to pull features out of thin air; you can always regularize your models later to keep them in check. So if you have a date column, derive the date, month, year, day of week, etc. from it.
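As a quick sketch of the date idea, pandas makes this a one-liner per feature via the `.dt` accessor (the column name and dates here are made up for illustration):

```python
import pandas as pd

# Hypothetical example: expand a single date column into several features
df = pd.DataFrame({'date': pd.to_datetime(['2016-01-15', '2016-07-04', '2016-12-25'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
```

Each of these new columns is a candidate feature; a tree-based model in particular can pick up seasonality from month or day-of-week that the raw timestamp hides.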
The raw variables might not have any dependence on the predicted variable, but their transformations might. Try taking logs, squares, and square roots. Above all, experiment across variables: take differences of variables, their ratios, etc.
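A minimal sketch of these transformations, using a toy DataFrame with made-up column names:

```python
import numpy as np
import pandas as pd

# Toy data purely for illustration
df = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [2.0, 2.0, 3.0]})

df['log_a'] = np.log1p(df['a'])      # log(1 + a), safe for values near zero
df['a_sq'] = df['a'] ** 2
df['sqrt_a'] = np.sqrt(df['a'])
df['a_minus_b'] = df['a'] - df['b']  # difference across variables
df['a_over_b'] = df['a'] / df['b']   # ratio across variables
```

None of these are guaranteed to help; the point is that they are cheap to generate, so generate them and let cross-validation (or regularization) decide which survive.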
Dealing with categorical variables by creating dummy variables is the vanilla way of doing things, but it breaks down when there is a large number of categories. Instead, try building a smaller model (like a Bernoulli naïve Bayes) to compute a probability score for each category; these scores can then act as the features!
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def Binarize(columnName, df, features=None):
    df[columnName] = df[columnName].astype(str)
    if features is None:
        features = np.unique(df[columnName].values)
        print(features)
    for x in features:
        df[columnName + '_' + x] = df[columnName].map(
            lambda y: 1 if y == x else 0)
    df.drop(columnName, inplace=True, axis=1)
    return df, features

train, binfeatures = Binarize(col, train)
dummy_cols = [col + '_' + x for x in binfeatures]
nb = BernoulliNB()
nb.fit(train[dummy_cols].values, train.target.values)
train[col] = nb.predict_proba(train[dummy_cols].values)[:, 1]
train.drop(dummy_cols, inplace=True, axis=1)
train[col] = train[col].astype(float)
3 Data Transformation Tips: 3 – Feature Union
Finally, this is the most noteworthy and challenging one: putting the pieces together can be difficult. Imagine you have a dataset with 10 variables –
- 6 of them are integers and floats that you can model directly.
- And 1 of them is an xml message you need to parse in order to derive features
- Another is an image you may want to incorporate
- Also 1 of them contains free text from which you want to take word counts with bi-grams
- Finally 1 of them is free text that you want to do TF-IDF on
A little hypothetical, perhaps, but a subset of this situation can easily arise. You might first want to organize these transformations as different functions or classes, and then push the data through a FeatureUnion and a Pipeline –
from sklearn import pipeline
from sklearn.pipeline import FeatureUnion

FeatureUnion(
    transformer_list=[
        ('cst', drop_all_non_digit_cols()),
        ('xml', pipeline.Pipeline([('s1', pull_col(key='XML')),
                                   ('xmlfeature', build_xml_features())])),
        ('img', pipeline.Pipeline([('s2', pull_col(key='IMAGE')),
                                   ('image', img_flatten())])),
        ('count', pipeline.Pipeline([('s3', pull_col(key='txt1')),
                                     ('count', count)])),
        ('tfidf_', pipeline.Pipeline([('s4', pull_col(key='txt2')),
                                      ('tfidf', tfidf)])),
    ], n_jobs=-1)
Furthermore, model training can be included in the pipeline as well.
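To show what that looks like end-to-end, here is a small runnable sketch that swaps the custom transformers above for stock scikit-learn ones (PCA and univariate feature selection on the iris dataset, both stand-ins chosen just so the example runs on its own):

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# FeatureUnion concatenates the outputs of its transformers side by side;
# appending an estimator as the last Pipeline step makes the whole chain
# trainable with a single fit() call.
model = Pipeline([
    ('features', FeatureUnion([
        ('pca', PCA(n_components=2)),
        ('kbest', SelectKBest(k=2)),
    ])),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
```

The payoff is that `model` now behaves as one object: the same `fit`/`predict` interface, cross-validation over the entire chain, and no risk of applying transformations inconsistently between train and test.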
In conclusion, I have taken a stab at listing 3 data transformation tips every data scientist should know. Some of them are popular; the others, well, not so popular – and that is the whole point. Drop in any other transformation techniques you think are important…