Complete EDA and Visualization in Python: An analysis of AI companies (Pt2).
The dataset was obtained from Vineeth - AI Companies Dataset (kaggle.com).
The data was transformed in Part 1 of the series. In this part we explore the data through visualizations created with Python’s seaborn library, most of which are automated with functions. Let’s get to it.
Recap: in Part 1 the necessary modules were imported, duplicate values were dropped, null values were transformed, columns were split and modified, new columns were created, and data values were converted to the proper dtype.
Re: Importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%reload_ext autoreload
%autoreload 2
This code imports the necessary modules. The autoreload extension reloads imported modules automatically before each piece of code is executed, and %matplotlib inline tells Jupyter to display visualizations inside the notebook.
Visualizing Distributions
First, initialize a global variable “title_dict” to hold the properties that will be applied to all visualization titles.
title_dict = {'family': 'serif', 'size': 10, 'weight': 'bold'}
Next, a function is defined which takes a dataframe and a column as arguments and creates two subplots: a histogram and a KDE plot.
A histogram is a good way to display the distribution of observations by counting how many fall into each bin. A KDE plot shows the same data as a smooth, continuous probability density curve.
Two lines are added to the histogram to mark the mean and median values. The function also customizes the axes, titles and labels.
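Before defining the function, here is a minimal sketch of why both the mean and the median are worth marking. The values are made up for illustration and are not taken from the dataset; they only mimic a right-skewed column.

```python
import pandas as pd

# Hypothetical right-skewed values standing in for a column such as project sizes
values = pd.Series([1_000, 1_000, 5_000, 5_000, 10_000, 25_000, 250_000])

mean_value = values.mean()      # pulled upwards by the single large value
median_value = values.median()  # the middle observation, unaffected by it

print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
```

When the two lines sit far apart on the histogram, as they would here, the column is skewed and the median is usually the more representative summary.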
def individual_column_vizzes(df, column):
    # Exclude zero values before plotting
    filtered_df = df.loc[df[column] != 0]
    # Set the context and style before the figure is created so they apply to it
    sns.set_context("paper", font_scale=1.2)
    sns.set_style("white")
    fig, ax = plt.subplots(1, 2, sharex=True, figsize=(8, 4))
    sns.histplot(filtered_df[column], bins=40, ax=ax[0])
    sns.kdeplot(filtered_df[column], ax=ax[1], fill=True, bw_adjust=0.5)
    # Mark the mean and median on the histogram
    mean_value = filtered_df[column].mean()
    median_value = filtered_df[column].median()
    ax[0].axvline(x=mean_value, color='red', alpha=0.3, linestyle='--', label=f'Mean: {mean_value:.2f}')
    ax[0].axvline(x=median_value, color='green', alpha=0.2, linestyle='-.', label=f'Median: {median_value:.2f}')
    ax[0].legend()
    # Build readable tick labels: dollars, thousands of employees, or percentages
    xticks = ax[0].get_xticks()
    if column in ("Minimum Project Size", "Average Hourly Rate"):
        xtick_labels = [f'${x:.0f}' if x < 1000 else f'${x/1000:.0f}k' for x in xticks]
    elif column == "Number of Employees":
        xtick_labels = [f'{x/1000:.0f}k' for x in xticks]
    else:
        xtick_labels = [f'{int(x*100)}%' for x in xticks]
    # Fix the tick positions before labelling them, then restore the original limits
    xlim = ax[0].get_xlim()
    ax[0].set_xticks(xticks)
    ax[0].set_xticklabels(xtick_labels)
    ax[0].set_xlim(xlim)
    ax[0].set_title(column + ' Distribution', fontdict=title_dict)
    ax[1].set_title(column + ' KDE', fontdict=title_dict)
    for axes in ax:
        axes.set(xlabel='')
    plt.tight_layout()

individual_column_vizzes(df, "Minimum Project Size")
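The dollar tick-label branch in the function can also be expressed with matplotlib's FuncFormatter, which formats ticks as they are drawn instead of replacing the labels by hand. This is only a sketch of that alternative: the helper name dollar_label and the sample values are hypothetical, and the Agg backend line is there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside Jupyter
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def dollar_label(x, pos=None):
    """Format a tick value as dollars, switching to thousands from 1000 upwards."""
    return f'${x:.0f}' if x < 1000 else f'${x/1000:.0f}k'

fig, ax = plt.subplots()
ax.hist([500, 1_000, 5_000, 10_000, 25_000], bins=5)  # made-up values
# The formatter is applied to whatever ticks matplotlib chooses
ax.xaxis.set_major_formatter(FuncFormatter(dollar_label))
```

Because the formatter is attached to the axis, the labels stay correct even if the axis limits change later.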
Relationships among columns
A PairGrid draws a grid of subplots showing pairwise relationships among a dataframe’s numeric columns. Different plot types can be mapped to different sections of the grid (the diagonal, the upper triangle and the lower triangle) using the map_diag(), map_upper() and map_lower() methods.
sns.set_context("paper", font_scale=1)
sns.set_style("white")
g = sns.PairGrid(df, hue="Continent", palette="RdBu")
g.map_diag(sns.ecdfplot)            # diagonal: cumulative distribution of each column
g.map_upper(sns.scatterplot)        # upper triangle: scatterplots
g.map_lower(sns.kdeplot, levels=4)  # lower triangle: density contours
g = g.add_legend(fontsize=9, bbox_to_anchor=(0.99, 0.9))
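To check in advance which columns are likely to end up in the grid, the numeric columns can be listed with pandas. The sketch below uses a toy dataframe with made-up values; only the column names echo the article, and select_dtypes is an approximation of how seaborn picks numeric variables.

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Continent": ["Asia", "Europe", "Asia"],
    "Average Hourly Rate": [25.0, 50.0, 75.0],
    "Number of Employees": [10, 250, 1000],
})

# Keep only the columns with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A list like this can also be passed to PairGrid through its vars parameter to restrict the grid to a chosen subset of columns.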
Boxplots:
A boxplot is a categorical plot that compares groups on a quantitative variable, displaying the spread, skewness, quartiles and outliers.
Some columns in the dataframe have a much wider spread than others, making the values difficult to read when plotted. As such, two boxplot functions are defined: one creates boxplots on a log scale (for highly skewed columns), and the other uses the default linear x-axis.
def boxplot_on_log_scale(df, xvalue, yvalue):
    # Zeros cannot be shown on a log scale, so exclude them
    edited_df = df.loc[df[xvalue] != 0]
    # whis=(0, 100) extends the whiskers to the minimum and maximum values
    sns.boxplot(data=edited_df, x=xvalue, y=yvalue, whis=(0, 100),
                width=.5, color="#4e9daf", linewidth=.75)
    plt.xscale('log')
    plt.title(f'{xvalue} by {yvalue} Boxplot log-scaled', fontdict=title_dict)
    plt.ylabel('')
    plt.xlabel(f'{xvalue} in log form')

def boxplot_no_log_scale(df, xvalue, yvalue):
    edited_df = df.loc[df[xvalue] != 0]
    # Plot the filtered dataframe so zero values are excluded here as well
    sns.boxplot(data=edited_df, x=xvalue, y=yvalue, whis=(0, 100),
                width=.5, color="#4e9daf", linewidth=.75)
    plt.title(f'{xvalue} by {yvalue} Boxplot', fontdict=title_dict)
    plt.ylabel('')

boxplot_on_log_scale(df, "Number of Employees", "Continent")
boxplot_no_log_scale(df, "Percent AI Service Focus", "Continent")
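The numbers behind each box can be checked directly with pandas. The following sketch computes the five values a box with whis=(0, 100) is drawn from (minimum, quartiles and maximum) for each group; the dataframe is a toy example with invented values.

```python
import pandas as pd

# Toy data: one group has a value several orders of magnitude above the rest
toy = pd.DataFrame({
    "Continent": ["Asia"] * 5 + ["Europe"] * 5,
    "Number of Employees": [10, 50, 100, 500, 10_000, 5, 20, 40, 80, 160],
})

# One row per continent, one column per quantile
summary = (
    toy.groupby("Continent")["Number of Employees"]
       .quantile([0, 0.25, 0.5, 0.75, 1])
       .unstack()
)
summary.columns = ["min", "q1", "median", "q3", "max"]
print(summary)
```

In the toy data the largest Asian value is 100 times the median, which is the kind of spread that squeezes a box flat on a linear axis and motivates the log-scaled version.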
Heatmaps:
A heatmap is a matrix plot that displays trends and helps assess relationships between variables in a grid format, with color encoding the size of each value.
A function is defined which first creates a pivot table from the arguments passed and then draws it as a heatmap.
def pivot_to_heatmap(df, index_var, col, value):
    # Exclude zero values before aggregating
    filtered_df = df.loc[df[value] != 0]
    # Median of `value` for each (index_var, col) pair; empty cells become 0
    var = filtered_df.pivot_table(index=index_var, columns=col, values=value,
                                  fill_value=0, aggfunc="median")
    ax = sns.heatmap(var, linewidths=.7, cmap="Blues",
                     vmin=-20, annot=True, fmt=".0f", linecolor="black", square=True,
                     annot_kws={'fontsize': 7, 'fontweight': 'bold', 'fontfamily': 'arial'})
    ax.set(xlabel='', ylabel='')
    ax.xaxis.tick_bottom()
    plt.title(f'Heatmap: {value} of AI Companies across {index_var} by {col}',
              fontdict=title_dict, y=1.1, pad=20)
Call the function:
pivot_to_heatmap(df,"Continent","Minimum Project Size","Average Hourly Rate")
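To see what the pivot step inside the function produces before it is drawn, here is a minimal sketch on a toy dataframe. The values and the "Size Band" column are hypothetical and serve only to show the shape of the result.

```python
import pandas as pd

# Toy data; "Size Band" is an invented categorical column
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "Size Band": ["Small", "Small", "Large", "Small", "Small"],
    "Average Hourly Rate": [20, 40, 90, 30, 50],
})

# Rows are continents, columns are size bands, cells hold the median hourly rate
table = toy.pivot_table(
    index="Continent", columns="Size Band",
    values="Average Hourly Rate", aggfunc="median", fill_value=0,
)
print(table)
```

Note that fill_value=0 turns combinations with no data (Europe and Large here) into zeros, so a 0 in the heatmap can mean "no companies" and not "a median of zero".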
Geographic Distribution:
Maps make it possible to visualize distributions using geographic information in an aesthetically pleasing form. Maps can be plotted using Plotly, a Python library.
A new dataframe is created by merging the GeoPandas world dataset, “naturalearth_lowres”, with the dataframe using a right join, so that every row of the companies dataframe is kept.
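import geopandas as gpd
import plotly.express as px
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0.
# On newer versions, download the Natural Earth data and pass its path to gpd.read_file.
world_data = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Right join: keep every company row and attach the matching country record
merged_df = world_data.merge(df, how='right', left_on='name', right_on='Country')
merged_df.head(4)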
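A right join keeps company rows even when the country name has no match in the world dataset, in which case "iso_a3" is left empty and the row cannot be placed on the map. The sketch below shows one way to find such rows using the indicator option of merge; both small dataframes are stand-ins with made-up company rows.

```python
import pandas as pd

# Stand-in for the world dataset: country names and ISO codes
world = pd.DataFrame({
    "name": ["India", "United States of America", "Germany"],
    "iso_a3": ["IND", "USA", "DEU"],
})
# Stand-in for the companies dataframe, with one differently spelled country
companies = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Country": ["India", "United States", "Germany"],
})

merged = world.merge(companies, how="right", left_on="name",
                     right_on="Country", indicator=True)
# "right_only" marks company rows that found no match in the world data
unmatched = merged.loc[merged["_merge"] == "right_only", "Country"].tolist()
print(unmatched)
```

Country names that show up in such a list can be renamed in the companies dataframe before merging so that they match the spelling used by the world dataset.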
Now the map can be plotted in Plotly, using “iso_a3” as the locations. The size of each bubble corresponds to the number of observations for that country. Additionally, new columns holding each country’s median for every numeric column are created and used as hover information.
# Number of companies per country, used to size each bubble
country_counts = merged_df["Country"].value_counts().to_dict()
sizes = [country_counts.get(country, 0) for country in merged_df["Country"]]
# Median of each numeric column per country, used as hover information
country_median = merged_df.groupby('Country')[['Average Hourly Rate', 'Minimum Project Size', 'Number of Employees', 'Percent AI Service Focus']].median().reset_index()
country_median.columns = ['Country', 'Median Hourly Rate($)', 'Median Project Size($)', 'Median No. of Employees', 'Median % AI Service Focus']
merged_df2 = pd.merge(merged_df, country_median, on='Country', how='left')
map_fig = px.scatter_geo(
    merged_df2, locations="iso_a3",
    projection="orthographic", color="Continent",
    opacity=.8, hover_name="Country",
    size=sizes, size_max=40,
    # A list keeps the hover fields in a fixed order
    hover_data=["Median Project Size($)", 'Median Hourly Rate($)',
                'Median No. of Employees', 'Median % AI Service Focus'],
)
map_fig.show()
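The per-country counts and medians can also be attached as ordinary columns without a separate dictionary or merge. This is a sketch of that alternative on a toy dataframe; the values and the "Company Count" column name are hypothetical.

```python
import pandas as pd

# Toy data with made-up hourly rates
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "Germany", "Germany"],
    "Average Hourly Rate": [20, 30, 40, 60, 80],
})

# Number of rows per country, aligned back to every row
toy["Company Count"] = toy["Country"].map(toy["Country"].value_counts())
# Per-country median, broadcast to every row of that country
toy["Median Hourly Rate($)"] = (
    toy.groupby("Country")["Average Hourly Rate"].transform("median")
)
print(toy)
```

With the count stored as a column, its name could be passed to the size argument of px.scatter_geo in place of a separate list.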
That’ll be all for now. Automating with functions makes it easier to explore the columns. You can click here to view the entire code.