A big list of ways to clean data and the code ready to copy and paste.
14 min read
Data Cleaning is necessary for making accurate insights, machine learning models and insights to move your business forward. Sadly data cleaning takes a long time to preform and is only slowly getting easier.
To make it easier to clean your data we collected the most common ways to clean you data. With an example, the library and most importantly the code snippet in an easy to use function with the pandas implementation. Just find what you want to clean, copy and paste, install the library and you are done.
Parses, and validates phone numbers from various countries
| Example | |
|---|---|
| Input | +442083661177 |
| Output | +44 20 8366 1177 |
import pandas as pd
import phonenumbers
def phone_to_international(inputvalue):
"""
Standardize valid phone numbers to international format (+44 20 8366 1177)
:param inputvalue: string of phone number
:return: string of international phone
"""
if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
try:
phoneInstance = phonenumbers.parse(inputvalue, None)
if(phonenumbers.is_valid_number(phoneInstance)):
outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.INTERNATIONAL)
else:
outputvalue = ''
except:
outputvalue = ''
else:
outputvalue = ''
return outputvalue
# Change from valid phone number to international format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_international(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses, and validates phone numbers from the United States
| Example | |
|---|---|
| Input | +120012301 |
| Output | 2001230101 |
import phonenumbers
def phone_to_nat(inputvalue):
"""
Standardize valid phone numbers to US national format
:param inputvalue: string of phone number
:return: string of US national phone
"""
if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
try:
phoneInstance = phonenumbers.parse(inputvalue, None)
if(phonenumbers.is_valid_number(phoneInstance)):
outputvalue = phonenumbers.format_number(phoneInstance, phonenumbers.PhoneNumberFormat.NATIONAL)
else:
outputvalue = ''
except:
outputvalue = ''
else:
outputvalue = ''
return outputvalue
# Change from valid phone number to US format, if not valid then insert blank
df[COLUMN_NAME] = df.apply(lambda x: phone_to_nat(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses and orders the parts of the address into a consistent string
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, NC 21452 |
| Output | 3743 Carson Shores New Glenn NC 21452 |
import usaddress
def address_to_sandardize(inputvalue):
"""
Standardize address
:param inputvalue: string of address
:return: string of standardized address
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
address_keys = address.keys()
result = ''
for key in address_keys:
result = result + ' ' + address[key]
outputvalue = result.strip()
except:
outputvalue = ''
return outputvalue
# Standardize full address
df[COLUMN_NAME] = df.apply(lambda x: address_to_sandardize(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the street number and name
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, NC 21452 |
| Output | 3743 Carson Shores |
import usaddress
def address_to_street(inputvalue):
"""
Standardize address and ouput just the street address
:param inputvalue: string of address
:return: string of street from address
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
address_keys = address.keys()
result = ''
keys_we_want = ['AddressNumber','StreetName','StreetNamePostType','OccupancyType','OccupancyIdentifier']
for key in address_keys:
if key in keys_we_want:
result = result + ' ' + address[key]
outputvalue = result.lstrip()
except:
outputvalue = ''
return outputvalue
# Change full address to street address
df[COLUMN_NAME] = df.apply(lambda x: address_to_street(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the street number and name
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, NC 21452 |
| Output | 3743 |
import usaddress
def address_to_street_num(inputvalue):
"""
Standardize address and ouput just the street address
:param inputvalue: string of address
:return: string of street from address
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
address_keys = address.keys()
result = ''
keys_we_want = ['AddressNumber']
for key in address_keys:
if key in keys_we_want:
result = result + ' ' + address[key]
outputvalue = result.lstrip()
except:
outputvalue = ''
return outputvalue
# Change full address to street number
df[COLUMN_NAME] = df.apply(lambda x: address_to_street_num(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the city
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, NC 21452 |
| Output | New Glenn |
import usaddress
def address_to_city(inputvalue):
"""
Standardize address and ouput just the city
:param inputvalue: string of address
:return: string of city from address
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
address_keys = address.keys()
result = ''
keys_we_want = ['PlaceName']
for key in address_keys:
if key in keys_we_want:
result = result + ' ' + address[key]
outputvalue = result.lstrip()
except:
outputvalue = ''
return outputvalue
# Change full address to city
df[COLUMN_NAME] = df.apply(lambda x: address_to_city(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the state
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, North Carolina 21452 |
| Output | North Carolina |
import usaddress
def address_to_state(inputvalue):
"""
Standardize address and ouput just the state
:param inputvalue: string of address
:return: string of state
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
address_keys = address.keys()
result = ''
keys_we_want = ['StateName']
for key in address_keys:
if key in keys_we_want:
result = result + ' ' + address[key]
outputvalue = result.lstrip()
except:
outputvalue = ''
return outputvalue
# Change full address to state
df[COLUMN_NAME] = df.apply(lambda x: address_to_state(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the state code
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, North Carolina 21452 |
| Output | NC |
import usaddress
US_STATE_ABBREV = {'ALABAMA': 'AL','ALASKA': 'AK','AMERICAN SAMOA': 'AS','ARIZONA': 'AZ','ARKANSAS': 'AR','CALIFORNIA': 'CA','COLORADO': 'CO','CONNECTICUT': 'CT','DELAWARE': 'DE','DISTRICT OF COLUMBIA': 'DC','FLORIDA': 'FL','GEORGIA': 'GA','GUAM': 'GU','HAWAII': 'HI','IDAHO': 'ID','ILLINOIS': 'IL','INDIANA': 'IN','IOWA': 'IA','KANSAS': 'KS','KENTUCKY': 'KY','LOUISIANA': 'LA','MAINE': 'ME','MARYLAND': 'MD','MASSACHUSETTS': 'MA','MICHIGAN': 'MI','MINNESOTA': 'MN','MISSISSIPPI': 'MS','MISSOURI': 'MO','MONTANA': 'MT','NEBRASKA': 'NE','NEVADA': 'NV','NEW HAMPSHIRE': 'NH','NEW JERSEY': 'NJ','NEW MEXICO': 'NM','NEW YORK': 'NY','NORTH CAROLINA': 'NC','NORTH DAKOTA': 'ND','NORTHERN MARIANA ISLANDS':'MP','OHIO': 'OH','OKLAHOMA': 'OK','OREGON': 'OR','PENNSYLVANIA': 'PA','PUERTO RICO': 'PR','RHODE ISLAND': 'RI','SOUTH CAROLINA': 'SC','SOUTH DAKOTA': 'SD','TENNESSEE': 'TN','TEXAS': 'TX','UTAH': 'UT','VERMONT': 'VT','VIRGIN ISLANDS': 'VI','VIRGINIA': 'VA','WASHINGTON': 'WA','WEST VIRGINIA': 'WV','WISCONSIN': 'WI','WYOMING': 'WY'}
def address_to_state_code(inputvalue):
"""
Standardize address and output just the state code
:param inputvalue: string of address
:return: string of state code
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
if(len(address['StateName']) > 2):
capitalized = address['StateName'].upper().strip()
outputvalue = US_STATE_ABBREV[capitalized]
else:
outputvalue = address['StateName']
except:
outputvalue = ''
return outputvalue
# Change full address to state code
df[COLUMN_NAME] = df.apply(lambda x: address_to_state_code(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses the address and returns the zip code
| Example | |
|---|---|
| Input | 3743 Carson Shores New Glenn, North Carolina 21452 |
| Output | 21452 |
import usaddress
def address_to_zip(inputvalue):
"""
Standardize address and ouput just the zip code
:param inputvalue: string of address
:return: string of zip code from address
"""
try:
address = usaddress.tag(inputvalue)
address = address[0]
outputvalue = address['ZipCode']
except:
outputvalue = ''
return outputvalue
# Change full address to zip code
df[COLUMN_NAME] = df.apply(lambda x: address_to_zip(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Turns currency into a float value
| Example | |
|---|---|
| Input | Price: $119.00 |
| Output | 119.00 |
# https://pypi.org/project/price-parser/
import Price
def currency_to_float(inputvalue):
"""
Convert currency to float
:param inputvalue: string of price
:return: float of price
"""
try:
price = Price.fromstring(inputvalue)
outputvalue = price.amount_float
except:
outputvalue = 0
return outputvalue
# Parse US currency to float
df[COLUMN_NAME] = df.apply(lambda x: currency_to_float(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Formats a date into the american format (MM/DD/YYYY)
| Example | |
|---|---|
| Input | 9-20-2021 |
| Output | 09/20/2021 |
import dateparser
def date_to_american_format(inputvalue):
"""
Converting a date string to a formatted date string
:param inputvalue: string of date
:return: formatted date string
"""
try:
parsed_datetime = dateparser.parse(inputvalue)
return parsed_datetime.strftime("%m/%d/%Y")
except:
return np.nan
# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_american_format(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Formats a date into the european format (YYYY/MM/DD)
| Example | |
|---|---|
| Input | 9-20-2021 |
| Output | 2021/09/20 |
import dateparser
def date_to_european_format(inputvalue):
"""
Converting a date string to a formatted date string
:param inputvalue: string of date
:return: formatted date string
"""
try:
parsed_datetime = dateparser.parse(inputvalue)
return parsed_datetime.strftime("%Y/%m/%d")
except:
return np.nan
# Change date to specific format
df[COLUMN_NAME] = df.apply(lambda x: date_to_european_format(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Formats a date to a quarter
| Example | |
|---|---|
| Input | 1987-08-10 |
| Output | 3 |
import dateparser
def date_to_quarter(inputvalue):
"""
Converting a date string to a quarter
:param inputvalue: string of date
:return: formatted quarter string
"""
try:
parsed_datetime = dateparser.parse(inputvalue)
return pd.Timestamp(dt.date(parsed_datetime.year, parsed_datetime.month, parsed_datetime.day)).quarter
except:
return np.nan
# Converting a date string to a quarter
df[COLUMN_NAME] = df.apply(lambda x: date_to_quarter(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Formats a date into a unix timestamp
| Example | |
|---|---|
| Input | 9-20-2021 |
| Output | 1632137676 |
import dateparser
def date_to_timestamp(inputvalue):
"""
Converting a date string to a timestamp
:param inputvalue: string of date
:return: int of date as timestamp
"""
try:
parsed_datetime = dateparser.parse(inputvalue)
return parsed_datetime.timestamp()
except:
return ''
# Converting a date string to a timestamp
df[COLUMN_NAME] = df.apply(lambda x: date_to_timestamp(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Turns the string into uppercase
| Example | |
|---|---|
| Input | 3 brown foxes jump after 1 rabbit |
| Output | 3 BROWN FOXES JUMP AFTER 1 RABBIT |
def convert_to_uppercase(inputvalue):
"""Convert to uppercase"""
if(pd.isna(inputvalue) == False):
return str(inputvalue).upper()
else:
return inputvalue
# Convert to uppercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_uppercase(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Turns the string into lowercase
| Example | |
|---|---|
| Input | 3 BROWN FOXES JUMP AFTER 1 RABBIT |
| Output | 3 brown foxes jump after 1 rabbit |
def convert_to_lowercase(inputvalue):
"""Convert to lowercase"""
if(pd.isna(inputvalue) == False):
return str(inputvalue).lower()
else:
return inputvalue
# Convert to lowercase
df[COLUMN_NAME] = df.apply(lambda x: convert_to_lowercase(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Strips numbers from a string
| Example | |
|---|---|
| Input | 3 brown foxes jump after 1 rabbit |
| Output | brown foxes jump after rabbit |
import re
def strip_numbers(inputvalue):
"""Strip numbers from value"""
if(pd.isna(inputvalue) == False):
return re.sub(r'\d+', '', str(inputvalue))
else:
return inputvalue
# Strip numbers from string
df[COLUMN_NAME] = df.apply(lambda x: strip_numbers(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Strips alpha (A-Z,a-z) characters
| Example | |
|---|---|
| Input | 3 brown foxes jump after 1 rabbit |
| Output | 3 1 |
import re
def strip_alpha(inputvalue):
"""Strip alpha characters"""
return re.sub(r'\d+', '', str(inputvalue))
# Strip alpha characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_alpha(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Strips special characters (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~")
| Example | |
|---|---|
| Input | 3 brown foxes, jump after 1 rabbit! |
| Output | 3 brown foxes jump after 1 rabbit |
import re
def strip_special(inputvalue):
"""Strip special characters"""
return re.sub(r'\d+', '', str(inputvalue))
# Strip special characters from string
df[COLUMN_NAME] = df.apply(lambda x: strip_special(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses and returns the first name from the full name
| Example | |
|---|---|
| Input | Stephen Weber |
| Output | Stephen |
from nameparser import HumanName
def fullname_to_first(fullname):
"""Convert full name to first name"""
if(pd.isna(fullname) == False and isinstance(fullname, str)):
name = HumanName(fullname)
return getattr(name, 'first','')
else:
return ''
# Parse just the first name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_first(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses and returns the last name from the full name
| Example | |
|---|---|
| Input | Stephen Weber |
| Output | Weber |
from nameparser import HumanName
def fullname_to_last(fullname):
"""Convert full name to last name"""
if(pd.isna(fullname) == False and isinstance(fullname, str)):
name = HumanName(fullname)
return getattr(name, 'surnames','')
else:
return ''
# Parse just the last name from the full name
df[COLUMN_NAME] = df.apply(lambda x: fullname_to_last(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Parses an email string and returns the domain name if the domain is a valid structured domain
| Example | |
|---|---|
| Input | info@bitrook.com |
| Output | bitrook.com |
import re
def email_to_domain(inputvalue):
"""Convert email to domain name"""
if(pd.isna(inputvalue) == False and isinstance(inputvalue, str)):
regex = '^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
if(re.search(regex,inputvalue)):
try:
outputvalue = inputvalue[inputvalue.index('@') + 1 : ]
except:
outputvalue = ''
else:
outputvalue = ''
else:
outputvalue = ''
return outputvalue
# Convert email to domain name
df[COLUMN_NAME] = df.apply(lambda x: email_to_domain(x[COLUMN_NAME]),axis=1)
Not working or need something easier? Download BitRook for free and have it do this for you automatically with no code.
Sometimes Pandas tries to read a CSV file and interprets the data a bit. Often changing phone numbers to long numbers. Here is a quick and easy way to tell Pandas to assume everything is a string so you can specify what isn't.
import pandas as pd
df = pd.read_csv('FILEPATH.csv',dtype=object)
Found something we should add to the list? Let us know at info@bitrook.com
Clean Data 10x faster using AI with BitRook.
Download the app here for FREE.
We won't send you spam. Unsubscribe at any time.