The Big Data Cleaning List

A big list of ways to clean data and the code ready to copy and paste.

14 min read

The Big Data Cleaning List

Intro

Data Cleaning is necessary for making accurate insights, machine learning models and insights to move your business forward. Sadly data cleaning takes a long time to preform and is only slowly getting easier.

To make it easier to clean your data we collected the most common ways to clean you data. With an example, the library and most importantly the code snippet in an easy to use function with the pandas implementation. Just find what you want to clean, copy and paste, install the library and you are done.

Phone - International Validation & Format

Parses, and validates phone numbers from various countries

Phonenumbers Library

Example
Input+442083661177
Output+44 20 8366 1177
import pandas as pd
import phonenumbers

def phone_to_international(inputvalue):
    """
    Standardize valid phone numbers to international format (+44 20 8366 1177)

    :param inputvalue: string of phone number
    :return: string of international phone
    """
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        try:
            phoneInstance = phonenumbers.parse(inputvalue, None)

            if(phonenumbers.is_valid_number(phoneInstance)):
                outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.INTERNATIONAL)
            else:
                outputvalue = ''
        except:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue

# Change from valid phone number to international format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_international(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Phone - US Domestic Validation & Format

Parses, and validates phone numbers from the United States

Phonenumbers Library

Example
Input+120012301
Output2001230101
import phonenumbers

def phone_to_nat(inputvalue):
    """
    Standardize valid phone numbers to US national format
    
    :param inputvalue: string of phone number
    :return: string of US national phone
    """
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        try:
            phoneInstance = phonenumbers.parse(inputvalue, None)
            
            if(phonenumbers.is_valid_number(phoneInstance)):
                outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.NATIONAL)
            else:
                outputvalue = ''
        except:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue
    
# Change from valid phone number to US format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_nat(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Standardize

Parses and orders the parts of the address into a consistent string

usaddress Library

Example
Input3743 Carson Shores New Glenn, NC 21452
Output3743 Carson Shores New Glenn NC 21452
import usaddress

def address_to_sandardize(inputvalue):
    """
    Standardize address
    
    :param inputvalue: string of address
    :return: string of standardized address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        for key in address_keys:
            result = result + ' ' + address[key]
            
        outputvalue = result.strip()
    except:
        outputvalue = ''
    return outputvalue
    
# Standardize full address
df[COLUMN_NAME] = df.apply(lambda x: address_to_sandardize(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Street Number and Name

Parses the address and returns the street number and name

usaddress Library

Example
Input3743 Carson Shores New Glenn, NC 21452
Output3743 Carson Shores
import usaddress

def address_to_street(inputvalue):
    """
    Standardize address and ouput just the street address
    
    :param inputvalue: string of address
    :return: string of street from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['AddressNumber','StreetName','StreetNamePostType','OccupancyType','OccupancyIdentifier']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to street address
df[COLUMN_NAME] = df.apply(lambda x: address_to_street(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Street Number

Parses the address and returns the street number and name

usaddress Library

Example
Input3743 Carson Shores New Glenn, NC 21452
Output3743
import usaddress

def address_to_street_num(inputvalue):
    """
    Standardize address and ouput just the street address
    
    :param inputvalue: string of address
    :return: string of street from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['AddressNumber']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to street number
df[COLUMN_NAME] = df.apply(lambda x: address_to_street_num(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to City

Parses the address and returns the city

usaddress Library

Example
Input3743 Carson Shores New Glenn, NC 21452
OutputNew Glenn
import usaddress

def address_to_city(inputvalue):
    """
    Standardize address and ouput just the city
    
    :param inputvalue: string of address
    :return: string of city from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['PlaceName']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to city
df[COLUMN_NAME] = df.apply(lambda x: address_to_city(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to State

Parses the address and returns the state

usaddress Library

Example
Input3743 Carson Shores New Glenn, North Carolina 21452
OutputNorth Carolina
import usaddress

def address_to_state(inputvalue):
    """
    Standardize address and ouput just the state
    
    :param inputvalue: string of address
    :return: string of state
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        
        address_keys = address.keys()
        
        result = ''
        keys_we_want = ['StateName']
        for key in address_keys:
            if key in keys_we_want:
                result = result + ' ' + address[key]
        
        outputvalue = result.lstrip()
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to state
df[COLUMN_NAME] = df.apply(lambda x: address_to_state(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to State Code

Parses the address and returns the state code

usaddress Library

Example
Input3743 Carson Shores New Glenn, North Carolina 21452
OutputNC
import usaddress

US_STATE_ABBREV = {'ALABAMA': 'AL','ALASKA': 'AK','AMERICAN SAMOA': 'AS','ARIZONA': 'AZ','ARKANSAS': 'AR','CALIFORNIA': 'CA','COLORADO': 'CO','CONNECTICUT': 'CT','DELAWARE': 'DE','DISTRICT OF COLUMBIA': 'DC','FLORIDA': 'FL','GEORGIA': 'GA','GUAM': 'GU','HAWAII': 'HI','IDAHO': 'ID','ILLINOIS': 'IL','INDIANA': 'IN','IOWA': 'IA','KANSAS': 'KS','KENTUCKY': 'KY','LOUISIANA': 'LA','MAINE': 'ME','MARYLAND': 'MD','MASSACHUSETTS': 'MA','MICHIGAN': 'MI','MINNESOTA': 'MN','MISSISSIPPI': 'MS','MISSOURI': 'MO','MONTANA': 'MT','NEBRASKA': 'NE','NEVADA': 'NV','NEW HAMPSHIRE': 'NH','NEW JERSEY': 'NJ','NEW MEXICO': 'NM','NEW YORK': 'NY','NORTH CAROLINA': 'NC','NORTH DAKOTA': 'ND','NORTHERN MARIANA ISLANDS':'MP','OHIO': 'OH','OKLAHOMA': 'OK','OREGON': 'OR','PENNSYLVANIA': 'PA','PUERTO RICO': 'PR','RHODE ISLAND': 'RI','SOUTH CAROLINA': 'SC','SOUTH DAKOTA': 'SD','TENNESSEE': 'TN','TEXAS': 'TX','UTAH': 'UT','VERMONT': 'VT','VIRGIN ISLANDS': 'VI','VIRGINIA': 'VA','WASHINGTON': 'WA','WEST VIRGINIA': 'WV','WISCONSIN': 'WI','WYOMING': 'WY'}

def address_to_state_code(inputvalue):
    """
    Standardize address and output just the state code
    
    :param inputvalue: string of address
    :return: string of state code
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        if(len(address['StateName']) > 2):
            capitalized = address['StateName'].upper().strip()
            outputvalue = US_STATE_ABBREV[capitalized]
        else:
            outputvalue = address['StateName']
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to state code
df[COLUMN_NAME] = df.apply(lambda x: address_to_state_code(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Address - Address to Zip code

Parses the address and returns the zip code

usaddress Library

Example
Input3743 Carson Shores New Glenn, North Carolina 21452
Output21452
import usaddress

def address_to_zip(inputvalue):
    """
    Standardize address and ouput just the zip code
    
    :param inputvalue: string of address
    :return: string of zip code from address
    """
    try:
        address = usaddress.tag(inputvalue)
        address = address[0]
        outputvalue = address['ZipCode']
    except:
        outputvalue = ''
    return outputvalue
    
# Change full address to zip code
df[COLUMN_NAME] = df.apply(lambda x: address_to_zip(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Currency to float

Turns currency into a float value

price-parser Library

Example
InputPrice: $119.00
Output119.00
# https://pypi.org/project/price-parser/
import Price

def currency_to_float(inputvalue):
    """
    Convert currency to float
    
    :param inputvalue: string of price
    :return: float of price
    """
    try:
        price = Price.fromstring(inputvalue)
        outputvalue = price.amount_float
    except:
        outputvalue = 0
    return outputvalue

# Parse US currency to float
df[COLUMN_NAME] = df.apply(lambda x: currency_to_float(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to American Date Format (MM/DD/YYYY)

Formats a date into the american format (MM/DD/YYYY)

dateparser Library

Example
Input9-20-2021
Output09/20/2021
import dateparser

def date_to_american_format(inputvalue):
    """
    Converting a date string to a formatted date string
    
    :param inputvalue: string of date
    :return: formatted date string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.strftime("%m/%d/%Y")
    except:
        return np.nan

# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_american_format(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to European Date Format (YYYY/MM/DD)

Formats a date into the european format (YYYY/MM/DD)

dateparser Library

Example
Input9-20-2021
Output2021/09/20
import dateparser

def date_to_european_format(inputvalue):
    """
    Converting a date string to a formatted date string
    
    :param inputvalue: string of date
    :return: formatted date string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.strftime("%Y/%m/%d")
    except:
        return np.nan

# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_european_format(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to Quarter

Formats a date to a quarter

dateparser Library

Example
Input1987-08-10
Output3
import dateparser

def date_to_quarter(inputvalue):
    """
    Converting a date string to a quarter
    
    :param inputvalue: string of date
    :return: formatted quarter string
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return pd.Timestamp(dt.date(parsed_datetime.year, parsed_datetime.month, parsed_datetime.day)).quarter
    except:
        return np.nan

# Converting a date string to a quarter
df[COLUMN_NAME] = df.apply(lambda x: date_to_quarter(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Date Format to Timestamp

Formats a date into a unix timestamp

dateparser Library

Example
Input9-20-2021
Output1632137676
import dateparser

def date_to_timestamp(inputvalue):
    """
    Converting a date string to a timestamp
    
    :param inputvalue: string of date
    :return: int of date as timestamp
    """
    try:
        parsed_datetime = dateparser.parse(inputvalue)
        return parsed_datetime.timestamp()
    except:
        return ''

# Converting a date string to a timestamp
df[COLUMN_NAME] = df.apply(lambda x: date_to_timestamp(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Uppercase String

Turns the string into uppercase

Example
Input3 brown foxes jump after 1 rabbit
Output3 BROWN FOXES JUMP AFTER 1 RABBIT
def convert_to_uppercase(inputvalue):
    """Convert to uppercase"""
    if(pd.isna(inputvalue) == False):
        return str(inputvalue).upper()
    else:
        return inputvalue

# Convert to uppercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_uppercase(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Lowercase String

Turns the string into lowercase

Example
Input3 BROWN FOXES JUMP AFTER 1 RABBIT
Output3 brown foxes jump after 1 rabbit
def convert_to_lowercase(inputvalue):
    """Convert to lowercase"""
    if(pd.isna(inputvalue) == False):
        return str(inputvalue).lower()
    else:
        return inputvalue

# Convert to lowercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_lowercase(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Numbers

Strips numbers from a string

Example
Input3 brown foxes jump after 1 rabbit
Outputbrown foxes jump after rabbit
import re

def strip_numbers(inputvalue):
    """Strip numbers from value"""
    if(pd.isna(inputvalue) == False):
        return re.sub(r'\d+', '', str(inputvalue))
    else:
        return inputvalue

# Strip numbers from string
df[COLUMN_NAME] = df.apply(lambda x: strip_numbers(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Alpha Characters

Strips alpha (A-Z,a-z) characters

Example
Input3 brown foxes jump after 1 rabbit
Output3 1
import re

def strip_alpha(inputvalue):
    """Strip alpha characters"""
    return re.sub(r'\d+', '', str(inputvalue))

# Strip alpha characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_alpha(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Strip Special Characters

Strips special characters (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~")

Example
Input3 brown foxes, jump after 1 rabbit!
Output3 brown foxes jump after 1 rabbit
import re

def strip_special(inputvalue):
    """Strip special characters"""
    return re.sub(r'\d+', '', str(inputvalue))

# Strip special characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_special(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Full Name to First Name

Parses and returns the first name from the full name

nameparser Library

Example
InputStephen Weber
OutputStephen
from nameparser import HumanName

def fullname_to_first(fullname):
    """Convert full name to first name"""
    if(pd.isna(fullname) == False and isinstance(fullname, str)):
        name = HumanName(fullname)
        return getattr(name, 'first','')
    else:
        return ''

# Parse just the first name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_first(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Full Name to Last Name

Parses and returns the last name from the full name

nameparser Library

Example
InputStephen Weber
OutputWeber
from nameparser import HumanName

def fullname_to_last(fullname):
    """Convert full name to last name"""
    if(pd.isna(fullname) == False and isinstance(fullname, str)):
        name = HumanName(fullname)
        return getattr(name, 'surnames','')
    else:
        return ''

# Parse just the last name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_last(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Email To Domain Name

Parses an email string and returns the domain name if the domain is a valid structured domain

Example
Inputinfo@bitrook.com
Outputbitrook.com
import re

def email_to_domain(inputvalue):
    """Convert email to domain name"""
    if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
        regex = '^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
        if(re.search(regex,inputvalue)):
            try:
                outputvalue = inputvalue[inputvalue.index('@') + 1 : ]
            except:
                outputvalue = ''
        else:
            outputvalue = ''
    else:
        outputvalue = ''
    return outputvalue
    
# Convert email to domain name
df[COLUMN_NAME] = df.apply(lambda x: email_to_domain(x[COLUMN_NAME]),axis=1)

Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.

Reading CSV with default string data type

Sometimes Pandas tries to read a CSV file and interprets the data a bit. Often changing phone numbers to long numbers. Here is a quick and easy way to tell Pandas to assume everything is a string so you can specify what isn't.

import pandas as pd

df = pd.read_csv('FILEPATH.csv',dtype=object)

Found something we should add to the list? Let us know at info@bitrook.com


Looking for an easier way?

Clean Data 10x faster using AI with BitRook.
Download the app here for FREE.

We won't send you spam. Unsubscribe at any time.