Handling Missing Data with Python: Basics of Data Analytics with Python


Aspiring Data Science Engineer | CSE Final Year student| Python GenAI Enthusiast | Building & Blogging my journey

print('Hello')

Data Analytics is one of the most in-demand skills in today’s world. Solving your own puzzles with data feels thrilling and creative. I’ve always wanted to learn data analytics, but in truth, even after enrolling in many courses, I found myself stuck at the same starting point again and again. My mind felt so unstructured, and I didn’t know where to focus.

So, I decided to begin properly this time, one clear step at a time. Since I already know the basics of Python, I’m now focusing on Data Analytics with Python, along with practising on Kaggle.

And today, I’m starting with one of the most important basics every data analyst needs to know:
How to handle missing or null values in a dataset.

LIBRARIES

Let’s begin by importing the necessary libraries:

import pandas as pd
import numpy as np
  • NumPy - great for numerical operations; here we use it to represent missing values (np.nan)

  • pandas - helps create, clean, and analyze data in table-like formats

CREATE A SAMPLE DATASET

num = pd.DataFrame(
    {
        'LETTERS': ['TEN', 'TWENTY', 'THIRTY', 'FORTY', 'HUNDRED'],
        'NUMERICS': [10, 20, 30, 40, 100],
        'INCREMENT': [11, 21, 31, 41, 101],
        'DECREMENT': [9, np.nan, 29, 39, np.nan],  # two missing values
    },
    index=[1, 2, 3, 4, 5],
)

Here I have created a small DataFrame called num that has some missing values (NaN). When you practice, you can use a much bigger DataFrame than this. But I would recommend smaller ones, because the differences are easier to notice, especially if you tend to skim.

So what is a DataFrame?

A DataFrame is a labeled table of data that you can easily clean, analyze, and manipulate using pandas. We usually create one with the pd.DataFrame() function.

This is a dataframe.

  • np.nan - a special constant in NumPy used to represent missing or undefined values
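To see why np.nan needs its own detection methods, here is a minimal sketch of how it behaves. The small DataFrame named sample is made up for this illustration and only borrows two column names from num.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, not a special object type
print(type(np.nan))        # <class 'float'>

# NaN never compares equal to anything, not even itself,
# so `== np.nan` cannot be used to find missing values
print(np.nan == np.nan)    # False

# pd.isna() is the reliable way to test a single value
print(pd.isna(np.nan))     # True

# A column that contains NaN is stored as float64,
# even when all the other values are whole numbers
sample = pd.DataFrame({'NUMERICS': [10, 20, 30], 'DECREMENT': [9, np.nan, 29]})
print(sample.dtypes)
```

This is also why DECREMENT shows values like 9.0 instead of 9 when you print num.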

DETECTING MISSING VALUES

There are two common methods used to detect missing data in a DataFrame

  • isnull()

    • It shows where values are missing.

    • True → The value is missing

    • False → The value is not missing

    num.isnull()

  • isnull().sum()

    • .isnull() - returns a table of True/False values

    • .sum() - adds up the True values in each column, giving the number of missing values per column

    num.isnull().sum()
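The detection steps above can be sketched end to end as follows. This uses a trimmed copy of num with only the numeric columns; the variable names missing_per_column, total_missing, and rows_with_missing are my own, not part of pandas.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the sample dataset (numeric columns only)
num = pd.DataFrame(
    {
        'NUMERICS': [10, 20, 30, 40, 100],
        'INCREMENT': [11, 21, 31, 41, 101],
        'DECREMENT': [9, np.nan, 29, 39, np.nan],
    },
    index=[1, 2, 3, 4, 5],
)

# Number of missing values in each column
missing_per_column = num.isnull().sum()
print(missing_per_column)

# Total number of missing values in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)                      # 2

# Only the rows that have at least one missing value
rows_with_missing = num[num.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())   # [2, 5]

# isna() is an alias of isnull(), so both give the same table
print(num.isna().equals(num.isnull()))    # True
```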

HANDLING MISSING VALUES

After identifying missing values, the next step is to decide what to do with them.

  1. Removing missing values:

     num.dropna()
    

    The output would be:

    dropna() - by default, dropna() removes rows that contain any missing values. You can see the difference in the output, right? Rows 2 and 5 are dropped because they contained null values.
    If you want to remove entire columns that have missing values:

     num.dropna(axis=1)
    

    The entire column ‘DECREMENT’ has been dropped, because it had null values.

    Let me show you the difference

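     num_cleaned = num.dropna()          # rows with null values removed
     num_cleancol = num.dropna(axis=1)   # columns with null values removed
     print("original dataset🌕:\n", num)
     print("\nAfter removing rows with null values🌒:\n", num_cleaned)
     print("\nAfter removing columns with null values🌘:\n", num_cleancol)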
    

    output:
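dropna() also accepts a few options that give you finer control. Below is a minimal sketch on a trimmed copy of num (numeric columns only). The subset and how parameters are part of pandas; the variable names are just for this illustration.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the sample dataset (numeric columns only)
num = pd.DataFrame(
    {
        'NUMERICS': [10, 20, 30, 40, 100],
        'INCREMENT': [11, 21, 31, 41, 101],
        'DECREMENT': [9, np.nan, 29, 39, np.nan],
    },
    index=[1, 2, 3, 4, 5],
)

# Default: drop every row that has at least one missing value
rows_dropped = num.dropna()
print(rows_dropped.index.tolist())      # [1, 3, 4]

# axis=1: drop every column that has at least one missing value
cols_dropped = num.dropna(axis=1)
print(list(cols_dropped.columns))       # ['NUMERICS', 'INCREMENT']

# subset: only look at the listed columns when deciding which rows to drop
subset_dropped = num.dropna(subset=['DECREMENT'])

# how='all': drop a row only if every value in it is missing (none here)
all_missing_dropped = num.dropna(how='all')
print(all_missing_dropped.shape)        # (5, 3)

# dropna() returns a new DataFrame; the original is left unchanged
print(num.shape)                        # (5, 3)
```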

  2. Filling Missing Values:

    Instead of removing data, we can replace missing values with something meaningful or specific. We use fillna(), which replaces missing values with the value you give. You can give any value; here I chose 100. Pass it as a number, not the string '100', so the column stays numeric.

     num.fillna(100)
    

    output will be :

    Some other options you can use to fill missing values:

    Forward Fill and Backward Fill

    Instead of using a fixed number, we can fill missing values based on nearby values in the same column.

    • Forward fill (ffill) – copies the value from the cell above, so the previous value fills the missing one.

        num.ffill()

    • Backward fill (bfill) – copies the value from the cell below, so the next value fills the missing one. In num, the last row has nothing below it, so its missing value stays NaN.

        num.bfill()
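To compare the filling strategies side by side, here is a small sketch that works on the DECREMENT values from num as a standalone Series. Filling with the column mean is an extra option I am adding for comparison; it is not covered above.

```python
import numpy as np
import pandas as pd

# The DECREMENT column from the sample dataset, as a standalone Series
decrement = pd.Series([9, np.nan, 29, 39, np.nan], index=[1, 2, 3, 4, 5])

# Fixed value: every NaN becomes 100
filled_constant = decrement.fillna(100)
print(filled_constant.tolist())    # [9.0, 100.0, 29.0, 39.0, 100.0]

# Column mean (extra option): mean() skips NaN, so this is (9 + 29 + 39) / 3
filled_mean = decrement.fillna(decrement.mean())

# Forward fill: each NaN takes the previous value
filled_forward = decrement.ffill()
print(filled_forward.tolist())     # [9.0, 9.0, 29.0, 39.0, 39.0]

# Backward fill: each NaN takes the next value;
# the last NaN has no next value, so it stays NaN
filled_backward = decrement.bfill()
print(filled_backward.tolist())    # [9.0, 29.0, 29.0, 39.0, nan]
```

If a leading or trailing NaN survives a forward or backward fill, you can chain the two (for example decrement.bfill().ffill()) or fall back to a fixed value.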

That concludes today’s learning journal. This is how I learned to handle missing values using pandas. It’s one of the first steps every data analyst learns before moving on to transformations, grouping, and visualizations. I’ll be sharing more of my learning journals in the coming days.

TIP 💡: If you want to practice working with real data, explore Kaggle. It’s a great platform where you can find tons of free datasets to experiment with. You can practice cleaning, analyzing data and even build small projects.

RESOURCES

📊 Kaggle

📊 GitHub link for the notebook: Missing Value handling

📊 Watch the explanation reel: Missing data handling for beginner data analytics