How to identify and delete dummy data in pandas? - python

Is there a way to identify the dummy data in a dataframe and delete them? In my data below, there are random characters in each column that I need to delete.
import pandas as pd
import numpy as np
data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
        'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
        'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
#Code to identify and delete dummy data
print(contact_details)
Output of the above code:
Name Address1 Address2
0 Tom High Street Park Avenue
1 AABBCC uwdfjfuf The Crescent
2 Joseph 00000 ABCXYZ
3 Krish Green Lane Highfield Road
4 XXXX Kingsway Stanley Road
5 John Church Street New Street
6 U iwefwfn 1ca2s597

Have you investigated your data? Is the "good" data always a combination of lowercase and uppercase characters? If so, you could write a function to find the dummy values, for example:
if text.lower() == text or text.upper() == text:
    # text is dummy
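Applied to your dataframe, that check could look like this (a minimal sketch; the is_dummy helper and the drop-the-whole-row policy are my assumptions, not part of your requirements):
import pandas as pd

def is_dummy(text):
    # all-lowercase or all-uppercase values are assumed to be dummy data
    return text.lower() == text or text.upper() == text

data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
        'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
        'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)

# drop any row where at least one column looks like dummy data
mask = contact_details.apply(lambda col: col.map(is_dummy))
print(contact_details[~mask.any(axis=1)])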

Without a good definition of good and bad values for each column, there's really nothing you can do automatically. There are a couple of data cleaning tricks you can use to make those values easier to find in a large dataset.
Starting with your original dataset:
import pandas as pd
data = {'Name': ['Tom', 'AABBCC', 'Joseph', 'Krish', 'XXXX', 'John', 'U'],
        'Address1': ['High Street', 'uwdfjfuf', '00000', 'Green Lane', 'Kingsway', 'Church Street', 'iwefwfn'],
        'Address2': ['Park Avenue', 'The Crescent', 'ABCXYZ', 'Highfield Road', 'Stanley Road', 'New Street', '1ca2s597']}
contact_details = pd.DataFrame(data)
The first thing you can do is get the unique values for a column to reduce the number of values you're looking through.
# get all the unique values in the 'Name' column
names = contact_details['Name'].unique()
Next you can sort them so that any near-duplicates stand out more easily; near-duplicates often come from data-entry errors.
# sort them alphabetically and then take a closer look
names.sort()
print(list(names))
So, for example, if you had seen the values ' Tom', 'Tom', and 'Tom ', you would know you need to strip whitespace from names.
contact_details['Name'] = contact_details['Name'].str.strip()
Another benefit of sorting unique values in a column is that string values that start with a number will all be at the beginning of the list, and lowercase strings will be sorted to the end. This makes a couple of your 'Address1' values stand out.
# get all the unique values in the 'Address1' column
address1 = contact_details['Address1'].unique()
address1.sort()
print(list(address1))
This gives me the list of unique values:
['00000', 'Church Street', 'Green Lane', 'High Street', 'Kingsway', 'iwefwfn', 'uwdfjfuf']
It's not clear yet if the first value is valid, but those last two look suspect. If I want to remove those, I can filter them out by selecting all rows where Address1 is not in a list of bad values.
contact_details_filtered = contact_details[~contact_details['Address1'].isin(['iwefwfn', 'uwdfjfuf'])]
print(contact_details_filtered)
This gives me the output:
Name Address1 Address2
0 Tom High Street Park Avenue
2 Joseph 00000 ABCXYZ
3 Krish Green Lane Highfield Road
4 XXXX Kingsway Stanley Road
5 John Church Street New Street
Row 2 is definitely suspect, and row 4 is questionable, but I think you get the idea of how to find and remove values that look like placeholders or just bad data.
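If you do settle on rules like that, here is a hedged sketch of automating the review (the specific rules below, all digits, all one case, or a single character, are my assumptions about what counts as a placeholder):
# flag values that are all digits, all one case, or a single character
def looks_like_dummy(value):
    v = str(value)
    return v.isdigit() or v.isupper() or v.islower() or len(v) == 1

suspect = contact_details.apply(lambda col: col.map(looks_like_dummy))
print(contact_details[suspect.any(axis=1)])  # rows to review by hand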

Related

Find duplicate values in pandas column that may contain an additional word

Suppose a dataframe with a column of company names:
namelist = ['canadian pacific railway',
'nestlé canada',
'nestlé',
'chicken farmers of canada',
'cenovus energy',
'merck frosst canada',
'food banks canada',
'canadian fertilizer institute',
'balanceco',
'bell textron canada',
'safran landing systems canada',
'airbus canada',
'airbus',
'investment counsel association of canada',
'teck resources',
'fertilizer canada',
'engineers canada',
'google',
'google canada']
import pandas as pd

s = pd.Series(namelist)
I would like to isolate all rows of company names that are duplicated, whether or not they contain the word "canada". In this example, the filtered column should contain:
'nestlé canada'
'nestlé'
'airbus canada'
'airbus'
'google'
'google canada'
The goal is to standardize the names to a single form.
I can't wholesale remove the word because there are other company names with that word I want to preserve, like "fertilizer canada" and "engineers canada".
Is there a regex pattern or another clever way to get this?
You can replace the trailing "canada" and then use duplicated to construct a Boolean mask to get the values from the original Series. (With keep=False, all duplicates are marked as True.)
s[s.str.replace(' canada$', '', regex=True).duplicated(keep=False)]
or, tolerating stray whitespace around the trailing word:
s[s.str.replace(r'\s*canada\s*$', '', regex=True).duplicated(keep=False)]
Output:
1 nestlé canada
2 nestlé
10 airbus canada
11 airbus
16 google
17 google canada
dtype: object
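Since the stated goal is to standardize the names to a single form, one possible follow-up (a sketch, assuming the form without the trailing "canada" is the canonical one) is:
# strip the trailing 'canada', then use the stripped form wherever it created a duplicate
canonical = s.str.replace(r'\s*canada\s*$', '', regex=True)
standardized = s.where(~canonical.duplicated(keep=False), canonical)
print(standardized)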

Extract country from cities in pandas

I have an array of city names. I want to group them by country name. Is there a library I can install that will do that?
e.g. array(['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago',
       'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago'])
I want to know the country they belong to and add a new country column in my dataframe.
I agree with what has been said in the comments - there is no clear way to join a city to a country when city names are not unique.
For example if we run...
import pandas as pd
df = pd.read_csv('https://datahub.io/core/world-cities/r/world-cities.csv')
df.rename(columns ={"name":"city"}, inplace=True)
print(df)
This outputs a large table of world city names together with the country each belongs to.
# create a list of city names for testing...
myCityList = ['Los Angeles', 'Detroit', 'Seattle', 'Atlanta', 'Santiago', 'Pittsburgh', 'Seoul', 'Santa Clara', 'Austin', 'Chicago']
# pull out all the rows matching a city in the test list..
df.query(f'city=={myCityList}')
This outputs the matching rows. However, something is wrong: there are more rows listed than there are items in the test city list (and Santiago is clearly listed multiple times)...
print(len(myCityList))
print(df.query(f'city=={myCityList}').shape[0])
Outputs:
10
15
Maybe the above is useful but it has to be used with caution as it's not 100% guaranteed to output the correct country for a given city.
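With that caveat, a hedged sketch of actually adding the country column is to deduplicate the reference table on the city name and merge (assuming the CSV's country column is named 'country'; keeping the first match is an arbitrary tie-break for ambiguous names like Santiago, not a real disambiguation):
my_df = pd.DataFrame({'city': myCityList})
# keep only the first country listed for each city name
lookup = df.drop_duplicates(subset='city')[['city', 'country']]
print(my_df.merge(lookup, on='city', how='left'))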

How to assign rows a number based on a level in pandas dataframe?

I have the following code:
from pandas import DataFrame
import pandas as pd
data = {'City': ['NY', 'NY', 'Arizona'], 'Doctor': ['Dr. Prof. Vera', 'Dr. Prof. Vera', 'Dr. Martin'], 'Type': ['Checked', 'Checked', 'Ordered'], 'Covid-Patient': ['yes', 'no', 'no']}
df = DataFrame(data).set_index(['City', 'Doctor', 'Type'])
df['Dr-Nr.'] = pd.Series(df.groupby(['Doctor']).cumcount()+1)
Which results in:
                                Covid-Patient  Dr-Nr.
City    Doctor         Type
NY      Dr. Prof. Vera Checked            yes       1
                       Checked             no       2
Arizona Dr. Martin     Ordered             no       1
But what I want is an individual number for each doctor in the new Dr-Nr. column.
Apparently, the grouping by level Doctor does not seem to have an effect. Any help is appreciated!
You can rank() index level Doctor:
df['Dr-Nr.'] = df.assign(d_=df.index.get_level_values('Doctor'))['d_'].rank(method='dense').astype(int)
The rank order is alphabetical here ('Dr. Martin' ranks 1, 'Dr. Prof. Vera' ranks 2), so:
                                Covid-Patient  Dr-Nr.
City    Doctor         Type
NY      Dr. Prof. Vera Checked            yes       2
                       Checked             no       2
Arizona Dr. Martin     Ordered             no       1
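An alternative to rank(), not from the answer above but standard pandas, is groupby(...).ngroup(), which numbers the groups directly; with the default sort=True the numbering is also alphabetical:
# number each Doctor group: 'Dr. Martin' -> 1, 'Dr. Prof. Vera' -> 2
df['Dr-Nr.'] = df.groupby(level='Doctor').ngroup() + 1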

Conditional extraction of a substring from a string

I've got a DataFrame with a column that is an object and contains the full address in one of the following formats:
'street name, building number',
'city, street, building number',
'city, district, street, building number'.
Regardless of the format I need to extract the name of the street and copy it to a new column. I cannot attach the original DF since all the information is in Russian. I've created a dummy DF instead:
df = pd.DataFrame({'address': ['new york city, the bronx borough, willis avenue, building 34',
                               'town of salem, main street, building 105',
                               'second boulevard, 12'],
                   'street': 0})
N.B. Different parts of one string are always separated by one comma. The substring with the street name in it always contains one of the words: 'street', 'avenue', 'boulevard'.
After several hours of Googling I've come up with something like this but to no avail:
street_list = ['street', 'avenue', 'boulevard']
for row in df:
    for x in street_list:
        if df.loc[row, 'address'].split(', ')[0].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[0]
        elif df.loc[row, 'address'].split(', ')[1].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[1]
        elif df.loc[row, 'address'].split(', ')[2].contains(x):
            df.loc[row, 'street'] = df.loc[row, 'address'].split(', ')[2]
This code doesn't work for me. Is it possible to tweak it somehow so that it works(or maybe you know a better solution)?
Please let me know if any additional information is required.
As far as I understand:
1. The street can sit in different positions among the comma-separated parts, depending on how many parts there are.
2. The street part is identified by an additional substring check.
In the code below:
Point 1 is handled by the streetMap
Point 2 is handled by the 'any' condition
import pandas as pd
df = pd.DataFrame({'address': ['new york city, the bronx borough, willis avenue, building 34',
                               'town of salem, main street, building 105',
                               'second boulevard, 12'],
                   'street': 0})
streetMap = {2:0,3:1,4:2} # Map of length of items to location of street.
street_list = ['street', 'avenue', 'boulevard']
addresses = df['address']
streets = []
for address in addresses:
    items = address.split(', ')
    streetCandidate = items[streetMap[len(items)]]
    street = streetCandidate if any([s in streetCandidate for s in street_list]) else "NA"
    streets.append(street)
df['street'] = streets
print(df)
Output:
                                             address            street
0  new york city, the bronx borough, willis avenu...     willis avenue
1           town of salem, main street, building 105       main street
2                               second boulevard, 12  second boulevard
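A vectorized alternative is str.extract (a sketch, assuming the street part never contains a comma and always includes one of the three keywords):
# capture the comma-free segment containing one of the street keywords
pattern = r'([^,]*\b(?:street|avenue|boulevard)\b[^,]*)'
df['street'] = df['address'].str.extract(pattern, expand=False).str.strip()
print(df)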

How do I get all the values of different strings that end with a certain word

My dataframe has a column called Borough which contains values like these:
"east toronto", "west toronto", "central toronto" and "west toronto", along with other region names.
Now I want a regular expression which gets me the data of every entry that ends with "toronto". How do I do that?
I tried this:
tronto_data = df_toronto[df_toronto['Borough'] = .*Toronto$].reset_index(drop=True)
tronto_data.head(7)
If the data is well formatted, you can split the string on the space and compare the final word to 'toronto'. For example:
df = pd.DataFrame({'column': ['west toronto', 'central toronto', 'some place']})
mask_df = df['column'].str.split(' ', expand=True)
which returns:
         0        1
0     west  toronto
1  central  toronto
2     some    place
You can then use that final column to work out which rows end with 'toronto'.
toronto_df = df[mask_df[1]=='toronto']
Edit:
I did not know there was a string method .endswith, which is the better way to do this. However, this solution does provide two columns, which may be useful.
As #Code_10 mentions in a comment, you can use str.endswith. Try the below:
df = pd.DataFrame({'city': ['east toronto', 'west toronto', 'other', 'central toronto']})
df_toronto = df[df['city'].str.endswith('toronto')]
#df_toronto.head()
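If the Borough values might be capitalized inconsistently (the question writes 'Toronto' with a capital T), a small hedged variant is to lowercase before comparing:
df_toronto = df[df['city'].str.lower().str.endswith('toronto')]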
