Complete EDA and Visualization in Python: An analysis of AI companies (Pt1).
This dataset was gotten from Vineeth- AI Companies Dataset (kaggle.com)
A good data analysis is as good as how well the data was explored. This is the first step to any data project. Exploratory data analysis involves finding patterns and relationships, identifying outliers, cleaning the data as well as preparing it for processing. Let’s get to it.
Step 1: Import the necessary modules and load the data.
The following python packages will be utilized; Pandas will be used for data manipulation, NumPy for statistical calculations, and Seaborn for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%reload_ext autoreload
%autoreload 2
#load the csv file
df = pd.read_csv ("../input/ai-companies/AI_Companies.csv", index_col= 0)
This code imports the necessary modules. It tells python to display visualizations inside the jupyter notebook and automatically reload modules before executing a new line of code.
Step 2- Overview of the data
Pandas has methods for getting information on a dataframe. The describe method which returns a summary statistic of each column shall be utilized.
df.shape
df.describe().transpose()
Step 3: Duplicates and Null Values
The describe method shows ‘Website’, a unique identifier, as containing duplicate values. To drop;
df.drop_duplicates(subset= ["Website"],keep="first",inplace = True)
The code above drops duplicate values based on the column ‘Website’. It keeps the first occurrence of any duplicate and modifies the original dataframe.
Null values: to get the sum of null values across all columns,
df.isnull().sum()
To decide if to drop the null values in ‘Location’, subset to see if it contains information in other columns of the data. The column ‘Unnamed: 7’ contains only one non-null row, subset to view.
df[df["Location"].isnull()]
df[~df["Unnamed: 7"].isnull()]
The data has been shifted to the right; this can be fixed by shifting inwards using the shift method of pandas. After this, the column created by the shift shall be dropped.
df.loc["Cloud(X)"]= df.loc["Cloud(X)"].shift(periods= -1)
df.drop(["Unnamed: 7"], axis=1, inplace= True)
Step 4: Splitting Columns
The column ‘Location’ is in the form of city, country eg Toronto, Canada, and some as country alone eg Australia. This will be split into separate columns, ‘City’ and ‘Country’.
df["City"]= None
df["Country"]= None
for lab, row in df.iterrows():
location = str(row["Location"])
if "," in location:
df.at[lab, "City"], df.at[lab, "Country"]= map(str.strip, location.split(",", 1))
else:
df.at[lab, "Country"]= location.strip()
Notice that abbreviated values in ‘Country’ are US states, subset to confirm.
df[df.loc[:,'Country'].astype(str).str.len()==2]
For a detailed analysis, the US states will be assigned a country name, ‘USA’, but first the column will be duplicated for future referencing.
df["Country_2_usastates"]= df.loc[:, "Country"]
for lab,row in df.iterrows():
if len(str(row["Country"])) == 2 :
df.at[lab,'Country'] = 'United States of America'
else:
continue
Step 5- Data Types
The quantitative columns in the dataframe contain special characters which prevents Pandas from reading it as an integer. This can be done by calling str.replace() multiple times,
df["Minimum Project Size"] = df["Minimum Project Size"].str.replace("Undisclosed","0")
df["Minimum Project Size"] = df["Minimum Project Size"].str.replace("$","")
df["Minimum Project Size"] = df["Minimum Project Size"].str.replace("+","")
df["Minimum Project Size"] = df["Minimum Project Size"].str.replace(",","")
or by applying str.replace() with a lambda function.
df["Minimum Project Size"] = df["Minimum Project Size"].apply(lambda x: \
x.replace("$", "").replace("+", "") \
.replace(",", "").replace("Undisclosed", "0") if any(char in x for char in ["$", "+", ",", "Undisclosed"]) else x)
df["Minimum Project Size"] = pd.to_numeric(df["Minimum Project Size"])
df["Minimum Project Size"].value_counts()
Also convert the column in percentage form to an integer using;
df["Percent AI Service Focus"]= df["Percent AI Service Focus"].str.rstrip('%').astype(float) / 100.0
Lastly, assign continent names to each country. This is optional but will aid further analysis and better visualization. The pycountry-convert module has inbuilt methods for this, merging the dataframe with world data from geopandas is another alternative, that will be shown in the next series (see pt2).
import pycountry_convert as pc
def country_convert (country_name):
if country_name is not None and isinstance(country_name, str):
try:
cont = pc.country_name_to_country_alpha2(country_name)
cont_code= pc.country_alpha2_to_continent_code(cont)
continent_name = pc.convert_continent_code_to_continent_name(cont_code)
return continent_name
except Exception as e:
return "Unknown"
else:
return "Invalid Country Name"
df["Continent"]= df["Country"].apply(country_convert)
df["Continent"].value_counts()
The data has been transformed successfully. The next series will focus on finding relationships through visualizations. You can click here to view the entire code.