The Big Data Cleaning List

A big list of ways to clean data and the code ready to copy and paste.

Sept 25th 2021 14 min read

The Big Data Cleaning List

Intro

Data Cleaning is necessary for making accurate insights, machine learning models and insights to move your business forward. Sadly data cleaning takes a long time to preform and is only slowly getting easier.

To make it easier to clean your data we collected the most common ways to clean you data. With an example, the library and most importantly the code snippet in an easy to use function with the pandas implementation. Just find what you want to clean, copy and paste, install the library and you are done.

Intro
Phone - International Validation & Format
Phone - US Domestic Validation & Format
Address - Standardize
Address - Address to Street Number and Name
Address - Address to Street Number
Address - Address to City
Address - Address to State
Address - Address to State Code
Address - Address to Zip code
Currency to float
Date Format to American Date Format (MM/DD/YYYY)
Date Format to European Date Format (YYYY/MM/DD)
Date Format to Quarter
Date Format to Timestamp
Uppercase String
Lowercase String
Strip Numbers
Strip Alpha Characters
Strip Special Characters
Full Name to First Name
Full Name to Last Name
Email To Domain Name
Reading CSV with default string data type

Phone - International Validation & Format

Parses, and validates phone numbers from various countries

Phonenumbers Library

Example
Input	+442083661177
Output	+44 20 8366 1177

import pandas as pd
import phonenumbers

def phone_to_international(inputvalue):
    """
    Standardize valid phone numbers to international format (+44 20 8366 1177)

    :param inputvalue: string of phone number
    :return: string of international phone
    """
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        try:
            phoneInstance = phonenumbers.parse(inputvalue, None)

            if(phonenumbers.is_valid_number(phoneInstance)):
                outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.INTERNATIONAL)
            else:
                outputvalue = ''
        except:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue

# Change from valid phone number to international format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_international(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Phone - US Domestic Validation & Format

Parses, and validates phone numbers from the United States

Phonenumbers Library

Example
Input	+120012301
Output	2001230101

import phonenumbers

def phone_to_nat(inputvalue):
    """
    Standardize valid phone numbers to US national format
    
    :param inputvalue: string of phone number
    :return: string of US national phone
    """
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        try:
            phoneInstance = phonenumbers.parse(inputvalue, None)
            
            if(phonenumbers.is_valid_number(phoneInstance)):
                outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.NATIONAL)
            else:
                outputvalue = ''
        except:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue
    
# Change from valid phone number to US format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_nat(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Standardize

Parses and orders the parts of the address into a consistent string

usaddress Library

Example
Input	3743 Carson Shores New Glenn, NC 21452
Output	3743 Carson Shores New Glenn NC 21452

import usaddress

def address_to_sandardize(inputvalue):
    """
    Standardize address
    
    :param inputvalue: string of address
    :return: string of standardized address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        for key in address_keys:
            result = result + ' ' + address[key]
            
        outputvalue = result.strip()
    except:
        outputvalue = ''
    return outputvalue
    
# Standardize full address
df[COLUMN_NAME] = df.apply(lambda x: address_to_sandardize(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Street Number and Name

Parses the address and returns the street number and name

usaddress Library

Example
Input	3743 Carson Shores New Glenn, NC 21452
Output	3743 Carson Shores

import usaddress

def address_to_street(inputvalue):
    """
    Standardize address and ouput just the street address
    
    :param inputvalue: string of address
    :return: string of street from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['AddressNumber','StreetName','StreetNamePostType','OccupancyType','OccupancyIdentifier']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to street address
df[COLUMN_NAME] = df.apply(lambda x: address_to_street(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Street Number

Parses the address and returns the street number and name

usaddress Library

Example
Input	3743 Carson Shores New Glenn, NC 21452
Output	3743

import usaddress

def address_to_street_num(inputvalue):
    """
    Standardize address and ouput just the street address
    
    :param inputvalue: string of address
    :return: string of street from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['AddressNumber']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to street number
df[COLUMN_NAME] = df.apply(lambda x: address_to_street_num(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to City

Parses the address and returns the city

usaddress Library

Example
Input	3743 Carson Shores New Glenn, NC 21452
Output	New Glenn

import usaddress

def address_to_city(inputvalue):
    """
    Standardize address and ouput just the city
    
    :param inputvalue: string of address
    :return: string of city from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['PlaceName']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to city
df[COLUMN_NAME] = df.apply(lambda x: address_to_city(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to State

Parses the address and returns the state

usaddress Library

Example
Input	3743 Carson Shores New Glenn, North Carolina 21452
Output	North Carolina

import usaddress

def address_to_state(inputvalue):
    """
    Standardize address and ouput just the state
    
    :param inputvalue: string of address
    :return: string of state
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['StateName']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to state
df[COLUMN_NAME] = df.apply(lambda x: address_to_state(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to State Code

Parses the address and returns the state code

usaddress Library

Example
Input	3743 Carson Shores New Glenn, North Carolina 21452
Output	NC

import usaddress

US_STATE_ABBREV = {'ALABAMA': 'AL','ALASKA': 'AK','AMERICAN SAMOA': 'AS','ARIZONA': 'AZ','ARKANSAS': 'AR','CALIFORNIA': 'CA','COLORADO': 'CO','CONNECTICUT': 'CT','DELAWARE': 'DE','DISTRICT OF COLUMBIA': 'DC','FLORIDA': 'FL','GEORGIA': 'GA','GUAM': 'GU','HAWAII': 'HI','IDAHO': 'ID','ILLINOIS': 'IL','INDIANA': 'IN','IOWA': 'IA','KANSAS': 'KS','KENTUCKY': 'KY','LOUISIANA': 'LA','MAINE': 'ME','MARYLAND': 'MD','MASSACHUSETTS': 'MA','MICHIGAN': 'MI','MINNESOTA': 'MN','MISSISSIPPI': 'MS','MISSOURI': 'MO','MONTANA': 'MT','NEBRASKA': 'NE','NEVADA': 'NV','NEW HAMPSHIRE': 'NH','NEW JERSEY': 'NJ','NEW MEXICO': 'NM','NEW YORK': 'NY','NORTH CAROLINA': 'NC','NORTH DAKOTA': 'ND','NORTHERN MARIANA ISLANDS':'MP','OHIO': 'OH','OKLAHOMA': 'OK','OREGON': 'OR','PENNSYLVANIA': 'PA','PUERTO RICO': 'PR','RHODE ISLAND': 'RI','SOUTH CAROLINA': 'SC','SOUTH DAKOTA': 'SD','TENNESSEE': 'TN','TEXAS': 'TX','UTAH': 'UT','VERMONT': 'VT','VIRGIN ISLANDS': 'VI','VIRGINIA': 'VA','WASHINGTON': 'WA','WEST VIRGINIA': 'WV','WISCONSIN': 'WI','WYOMING': 'WY'}

def address_to_state_code(inputvalue):
    """
    Standardize address and output just the state code
    
    :param inputvalue: string of address
    :return: string of state code
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        if(len(address['StateName']) > 2):
            capitalized = address['StateName'].upper().strip()
            outputvalue = US_STATE_ABBREV[capitalized]
        else:
            outputvalue = address['StateName']
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to state code
df[COLUMN_NAME] = df.apply(lambda x: address_to_state_code(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Zip code

Parses the address and returns the zip code

usaddress Library

Example
Input	3743 Carson Shores New Glenn, North Carolina 21452
Output	21452

import usaddress

def address_to_zip(inputvalue):
    """
    Standardize address and ouput just the zip code
    
    :param inputvalue: string of address
    :return: string of zip code from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        outputvalue = address['ZipCode']
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to zip code
df[COLUMN_NAME] = df.apply(lambda x: address_to_zip(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Currency to float

Turns currency into a float value

price-parser Library

Example
Input	Price: $119.00
Output	119.00

# https://pypi.org/project/price-parser/
import Price

def currency_to_float(inputvalue):
    """
    Convert currency to float
    
    :param inputvalue: string of price
    :return: float of price
    """
    try:
        price = Price.fromstring(inputvalue)
        outputvalue = price.amount_float
    except:
        outputvalue = 0
    return outputvalue

# Parse US currency to float
df[COLUMN_NAME] = df.apply(lambda x: currency_to_float(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to American Date Format (MM/DD/YYYY)

Formats a date into the american format (MM/DD/YYYY)

dateparser Library

Example
Input	9-20-2021
Output	09/20/2021

import dateparser

def date_to_american_format(inputvalue):
    """
    Converting a date string to a formatted date string
    
    :param inputvalue: string of date
    :return: formatted date string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.strftime("%m/%d/%Y")
    except:
        return np.nan

# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_american_format(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to European Date Format (YYYY/MM/DD)

Formats a date into the european format (YYYY/MM/DD)

dateparser Library

Example
Input	9-20-2021
Output	2021/09/20

import dateparser

def date_to_european_format(inputvalue):
    """
    Converting a date string to a formatted date string
    
    :param inputvalue: string of date
    :return: formatted date string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.strftime("%Y/%m/%d")
    except:
        return np.nan

# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_european_format(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to Quarter

Formats a date to a quarter

dateparser Library

Example
Input	1987-08-10
Output	3

import dateparser

def date_to_quarter(inputvalue):
    """
    Converting a date string to a quarter
    
    :param inputvalue: string of date
    :return: formatted quarter string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return pd.Timestamp(dt.date(parsed_datetime.year, parsed_datetime.month, parsed_datetime.day)).quarter
    except:
        return np.nan

# Converting a date string to a quarter
df[COLUMN_NAME] = df.apply(lambda x: date_to_quarter(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to Timestamp

Formats a date into a unix timestamp

dateparser Library

Example
Input	9-20-2021
Output	1632137676

import dateparser

def date_to_timestamp(inputvalue):
    """
    Converting a date string to a timestamp
    
    :param inputvalue: string of date
    :return: int of date as timestamp
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.timestamp()
    except:
        return ''

# Converting a date string to a timestamp
df[COLUMN_NAME] = df.apply(lambda x: date_to_timestamp(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Uppercase String

Turns the string into uppercase

Example
Input	3 brown foxes jump after 1 rabbit
Output	3 BROWN FOXES JUMP AFTER 1 RABBIT

def convert_to_uppercase(inputvalue):
    """Convert to uppercase"""
    if(pd.isna(inputvalue) == False):
        return str(inputvalue).upper()
    else:
        return inputvalue

# Convert to uppercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_uppercase(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Lowercase String

Turns the string into lowercase

Example
Input	3 BROWN FOXES JUMP AFTER 1 RABBIT
Output	3 brown foxes jump after 1 rabbit

def convert_to_lowercase(inputvalue):
    """Convert to lowercase"""
    if(pd.isna(inputvalue) == False):
        return str(inputvalue).lower()
    else:
        return inputvalue

# Convert to lowercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_lowercase(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Numbers

Strips numbers from a string

Example
Input	3 brown foxes jump after 1 rabbit
Output	brown foxes jump after rabbit

import re

def strip_numbers(inputvalue):
    """Strip numbers from value"""
    if(pd.isna(inputvalue) == False):
        return re.sub(r'\d+', '', str(inputvalue))
    else:
        return inputvalue

# Strip numbers from string
df[COLUMN_NAME] = df.apply(lambda x: strip_numbers(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Alpha Characters

Strips alpha (A-Z,a-z) characters

Example
Input	3 brown foxes jump after 1 rabbit
Output	3 1

import re

def strip_alpha(inputvalue):
    """Strip alpha characters"""
    return re.sub(r'\d+', '', str(inputvalue))

# Strip alpha characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_alpha(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Special Characters

Strips special characters (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~")

Example
Input	3 brown foxes, jump after 1 rabbit!
Output	3 brown foxes jump after 1 rabbit

import re

def strip_special(inputvalue):
    """Strip special characters"""
    return re.sub(r'\d+', '', str(inputvalue))

# Strip special characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_special(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Full Name to First Name

Parses and returns the first name from the full name

nameparser Library

Example
Input	Stephen Weber
Output	Stephen

from nameparser import HumanName

def fullname_to_first(fullname):
    """Convert full name to first name"""
    if(pd.isna(fullname) == False and isinstance(fullname, str)):
        name = HumanName(fullname)
        return getattr(name, 'first','')
    else:
        return ''

# Parse just the first name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_first(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Full Name to Last Name

Parses and returns the last name from the full name

nameparser Library

Example
Input	Stephen Weber
Output	Weber

from nameparser import HumanName

def fullname_to_last(fullname):
    """Convert full name to last name"""
    if(pd.isna(fullname) == False and isinstance(fullname, str)):
        name = HumanName(fullname)
        return getattr(name, 'surnames','')
    else:
        return ''

# Parse just the last name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_last(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Email To Domain Name

Parses an email string and returns the domain name if the domain is a valid structured domain

Example
Input	info@bitrook.com
Output	bitrook.com

import re

def email_to_domain(inputvalue):
    """Convert email to domain name"""
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        regex = '^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
        if(re.search(regex,inputvalue)):
            try:
                outputvalue = inputvalue[inputvalue.index('@') + 1 : ]
            except:
                outputvalue = ''
        else:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue
    
# Convert email to domain name
df[COLUMN_NAME] = df.apply(lambda x: email_to_domain(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Reading CSV with default string data type

Sometimes Pandas tries to read a CSV file and interprets the data a bit. Often changing phone numbers to long numbers. Here is a quick and easy way to tell Pandas to assume everything is a string so you can specify what isn't.

import pandas as pd

df = pd.read_csv('FILEPATH.csv',dtype=object)

Found something we should add to the list? Let us know at info@bitrook.com

We won't send you spam. Unsubscribe at any time.