How to check dataset for typos and replace them? - python

I have a question.
Is there a way to check whether there are typos in a specific column?
I have an Excel sheet which is read with pandas.
First I need to make a unique list in Python, based on the name of the column;
second, I need to replace the wrong values with the new values.

Working in a Jupyter notebook and doing this semi-manually might be the best way. One option could be to start by creating a list of correct spelling:
correct= ['terms','that','are','spelt','correctly']
and create a subset from your data frame which does not contain the values in that list.
df[~df['columnname'].str.startswith(tuple(correct))]
You will then know how many rows are affected. You can then count the number of different variations:
df['columnname'].value_counts()
and if reasonable, you could look at the unique values, and make them into a list:
listoftypos = list(df['columnname'].unique())
print(listoftypos)
and then create a dictionary again in a semi-manual way as:
typodict = {'terma':'terms','thaaat':'that','arree':'are','speelt':'spelt','cooorrectly':'correctly'}
then iterate over your original data frame, and if a row in the column contains a value which is one of your known typos, replace it with the corresponding correct spelling from the dictionary, something like this:
for index, row in df.iterrows():
    # if the cell value is a known typo, look up its correct spelling
    if row['columnname'] in typodict:
        df.at[index, 'columnname'] = typodict[row['columnname']]
A strong caveat here though - iterating row by row like this will be slow if the dataframe is extremely large.
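As a side note (not part of the original answer), the same replacement can be done without an explicit loop; a minimal sketch, assuming the typodict defined above:
# map every known typo to its correct spelling in one vectorised call
df['columnname'] = df['columnname'].replace(typodict)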

Keep in mind that a generic spell check is a fairly tall order, but I believe this solution will fit your need with the lowest chance of false matches:
Setup:
import difflib
from itertools import permutations

import pandas as pd

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']
p_lst = list(permutations(cardinal_directions + regions, 2))
area = [''.join(i) for i in p_lst] + cardinal_directions + regions
df = pd.DataFrame({"ID": list(range(0, 9)),
                   "region": ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast',
                              'Central', 'South', 'International', 'Centrall']})
Initial DF:
ID  region
0   Midwest
1   Northwest
2   West
3   Northeast
4   East coast
5   Central
6   South
7   International
8   Centrall
Function:
def spell_check(my_str, name_bank):
    # score the input against every candidate name and return the closest match
    prcnt = []
    for y in name_bank:
        prcnt.append(difflib.SequenceMatcher(None, y, my_str.lower().strip()).ratio())
    return name_bank[prcnt.index(max(prcnt))]
Apply Function to DF:
df.region=df.region.apply(lambda x: spell_check(x, area))
Resultant DF:
ID  region
0   midwest
1   northwest
2   west
3   northeast
4   eastcoast
5   central
6   south
7   international
8   central
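If you are worried about false matches on strings that are nowhere near any known area, a possible variation (just a sketch, not part of the original approach; the 0.6 cutoff is an arbitrary choice) is to fall back to the raw value whenever the best ratio is too low:
def spell_check_with_cutoff(my_str, name_bank, cutoff=0.6):
    # same idea as spell_check, but keep the raw value when nothing is close enough
    scores = [difflib.SequenceMatcher(None, y, my_str.lower().strip()).ratio() for y in name_bank]
    best = max(scores)
    return name_bank[scores.index(best)] if best >= cutoff else my_str

df.region = df.region.apply(lambda x: spell_check_with_cutoff(x, area))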
I hope this answers your question and good luck.

Related

Python - How to search for entries in a pandas dataframe based on entries (list) in another column?

I am new to Python and dataframes. I have a big pandas dataframe I need to extract information from; I will try to explain my problem with a small example.
Say my dataframe looks like this:
name city number
Hana NYC 23
Fred London 12
Ben Paris 90
Lisa Berlin 3
Now I have a list with entries that relate to the column "number"
numbers = [3,12,23]
and I want to have the corresponding entries in another list from the "name" column
names = ['Lisa', 'Fred', 'Hana']
Is there an existing function for this problem?
df[df.number.isin(numbers)].name.tolist()
and, if you want them in exactly the same order as in your example:
df[df.number.isin(numbers)].sort_values("number").name.tolist()
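If you instead need the names to follow the order of the numbers list itself rather than sorted numeric order, one possible sketch (not from the original answers, and assuming each number appears only once in the frame) is to map numbers back to names:
# build a number -> name lookup, then read it out in the order of the numbers list
lookup = df.set_index('number')['name']
names = lookup.loc[numbers].tolist()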
You did not explain your desired output, but:
If you want to filter by one of the criteria:
df[df['number'].isin(numbers)]
will leave you with rows within the numbers array.
If you want both:
df[(df['number'].isin(numbers)) & (df['name'].isin(names))]
Don't forget that the names array must contain strings:
names = ['Lisa', 'Fred', 'Hana']

how to merge multiple datasets with differences in merge-index strings?

Hello, I am struggling to find a solution to what is probably a very common problem.
I want to merge two csv-files with soccer data. They basically store different data about the same games. Normally I would do a merge with .merge, but the problem is that the nomenclature differs for some teams in the two datasets. So for example Manchester City is called Man. City in the second data frame.
Here's roughly what df1 and df2 look like:
df1:
team1 team2 date some_value_i_want_to_compare
Manchester City Arsenal 2022-05-20 22:00:00 0.2812 5
df2:
team1 team2 date some_value_i_want_to_compare
Man. City Arsenal 2022-05-20 22:00:00 0.2812 3
Note that in the above case there are only differences in team1 but there could also be cases where team2 is slightly different. So for example in this case Arsenal could be called FC Arsenal in the second data set.
So my main question is: How could I automatically analyse the differences in the two datasets naming?
My second question is: How do I scale this for more than 2 data sets so that the number of data sets ultimately doesn't matter?
As commenters and existing answer have suggested, if the number of unique names is not too large, then you can manually extract the mismatches and correct them. That is probably the best solution unless the number of mismatches is very large.
Another case which can occur, is when you have a ground truth list of allowed indexes (for example, the list of all soccer teams in a given league), but the data may contain many different attempts at spelling or abbreviating each team. If this is similar to your situation, you can use difflib to search for the most likely match for a given name. For example:
import difflib

true_names = ['Manchester United', 'Chelsea']
mismatch_names = ['Man. Unites', 'Chlsea', 'Chelsee']
best_matches = [difflib.get_close_matches(x, true_names, n=1) for x in mismatch_names]
for old, new in zip(mismatch_names, best_matches):
    print(f"Best match for {old} is {new[0]}")
output:
Best match for Man. Unites is Manchester United
Best match for Chlsea is Chelsea
Best match for Chelsee is Chelsea
Note if the spelling is very bad, you can ask difflib to find the closest n matches using the n= keyword argument. This can help to reduce manual data cleaning work, although it is often unavoidable, at least to some degree.
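To connect this back to the merge itself, a possible follow-up sketch (under the assumption that df1 holds the canonical spellings and only team1 needs fixing; team2 would be handled the same way) is to build a rename mapping and apply it before merging:
# map each team name in df2 to its closest match among df1's names
canonical = df1['team1'].unique().tolist()
mapping = {name: (difflib.get_close_matches(name, canonical, n=1) or [name])[0]
           for name in df2['team1'].unique()}
df2['team1'] = df2['team1'].replace(mapping)
merged = df1.merge(df2, on=['team1', 'team2', 'date'])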
Hope it helps.
You could start by doing an anti-join to isolate the ones that don't match:
# Merge two team datasets
teams_join = df1.merge(df2, on='team1',
how='left', indicator=True)
# Select the team1 column where _merge is left_only
team_list = teams_join.loc[teams_join['_merge'] == 'left_only', 'team1']
# print team names in df1 with no match in df2
print(df1[df1["team1"].isin(team_list)])
This will give you all the teams in df1 without a match in df2. You could do the same for df2 (just reverse everything df1 and df2 in the previous code). Then you can take those two lists with the names that don't match and manually rename them if there are few enough of them.
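For completeness, the reverse check (a small sketch mirroring the code above) would be:
# teams in df2 with no match in df1
teams_join2 = df2.merge(df1, on='team1', how='left', indicator=True)
team_list2 = teams_join2.loc[teams_join2['_merge'] == 'left_only', 'team1']
print(df2[df2["team1"].isin(team_list2)])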

Select Random Data From Python Dataframe based on Columns data distribution & Conditions

I have 12 rows in my input data and need to select 4 random rows in a way that respects the column distributions at the time of random selection.
This is a sample data, original data contains million rows.
Input data Sample -
input_data = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
'City':['California','California','Chicago','Michigan','New York','Ohio','Michigan',
'Michigan','Ohio','Florida','New York','Washington']})
Output Data Expectation -
output_data = pd.DataFrame({'Id': ['A','A','B','C'],
'Fruit': ['Apple','Mango','Apple','Orange'],
'City':['California','Ohio','Michigan','New York']})
My random selection should consider the below three parameters -
The Id distribution: as shown in the image below, out of 4 rows, 2 should be selected from A, 1 from B and 1 from C
The Fruit distribution: 2 rows for Apple, 1 for Mango and 1 for Orange
The data should prioritize the higher frequency Cities
I am aware of sampling the data using the pandas sample function and tried that, but it gives me an unbalanced selection -
input_data.sample(n = 4)
Any leads on how to approach the problem are really appreciated!
You are prescribing probabilities on single random variables 3 times, once on the ID, once on the fruit and once on the city, whereas you need to select an ordered tuple of 3: (ID, fruit, city), and you have restrictions on the possible combinations too. In general, it is not possible. I'll explain why not, so that you can modify your question to match your needs.
Forget about how pandas helps you to make random choices and let's understand the problem mathematically first. Let's simplify the problem into 2D. Keep the fruits (apple, mango, orange) and cities (Ohio, Florida). First, let's suppose you have all the possible combinations:
unique ID  Fruit   City
0          Apple   Ohio
1          Apple   Florida
2          Mango   Ohio
3          Mango   Florida
4          Orange  Ohio
5          Orange  Florida
Then you define the probability for the different categories independently via their frequency:
   Fruit   frequency  probability
0  Apple   5          0.5
1  Mango   2          0.2
2  Orange  3          0.3

   City     frequency  probability
0  Ohio     2          0.2
1  Florida  8          0.8
Now you can represent your possible choices:
Each line in your list of possible choices is represented in the figure (its ID is written in the center of the cell together with its probability). Selecting a line from the table means generating a point in this 2D space. If you use the area of the cells to determine the probability of choosing each pair, you will get the desired 1D probability distributions, hence this representation. Of course, it is a good, intuitive choice to generate a random number in a 2D (discrete) space by generating one random number per dimension, but this is not a must. In this example the individual properties are independent, meaning that if your line's fruit property is apple, then it has a 20% or 80% probability of being from Ohio or Florida, respectively, which is equal to the original 1D distribution you prescribed for cities.
Now consider if you have an extra entry with unique ID 6 for (Orange, Florida). When you generate a point on the 2D space and it falls onto cells 5 and 6, you have the freedom to choose from the 5th or the 6th line. This case occurs if you have repeated tuples. (If your full table of all the 3 properties is considered, then you don't have repeated tuples.)
Now consider what happens if you keep the prescribed 1D probabilities but don't represent all the possibilities, e.g. by removing the entry (Apple, Florida) with ID 1. You cannot generate points on the cell with number 1 anymore, but this affects the 1D probabilities you prescribed. If you can resolve this issue by redistributing the removed 40% so that the individual category probabilities become the ones you desire, then you can select lines with the probability of the properties you want. This case occurs in your table, because not every possibility is listed.
If you can redistribute the probabilities, e.g. according to the following table (by scaling everything up by 100%/(100%-40%)), then not all the variables will be independent anymore. E.g. if it is apple, then it must have city Ohio (instead of a 20% - 80% probability share with Florida).
You mentioned that you have millions of rows. Maybe in your complete table all possible combinations can be found, and you don't need to deal with this problem. You can also extend your table so that it contains all the possible combinations, and you can later decide how to interpret the results when you selected a row not contained in your full table initially.
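As a rough, approximate sketch (not a method from the question itself: the weights here are simply the product of the marginal frequencies, which only approximates the prescribed distributions rather than hitting the exact quotas), you could let pandas sample with row weights:
# weight each row by how common its Id, Fruit and City values are, then draw 4 rows
w = (input_data.groupby('Id')['Id'].transform('size')
     * input_data.groupby('Fruit')['Fruit'].transform('size')
     * input_data.groupby('City')['City'].transform('size'))
sampled = input_data.sample(n=4, weights=w, random_state=0)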
this doesn't include the 'city' column, but it's a start:
# the usual:
import pandas as pd
import numpy as np
from random import sample
# our df:
df = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
'City':['California','California','Chicago','Michigan','New York','Ohio','Michigan',
'Michigan','Ohio','Florida','New York','Washington']})
# the fun part:
def lets_check_it(df):
    # take 2 samples with 'A' Id, one sample with 'B', one sample with 'C' and put them all in a df:
    result = pd.concat([df[df['Id']=='A'].sample(1), df[df['Id']=='A'].sample(1),
                        df[df['Id']=='B'].sample(1), df[df['Id']=='C'].sample(1)])
    # if Apple or Orange are not in results, keep on sampling:
    while ('Apple' not in result['Fruit'].value_counts().index.tolist()) | ('Orange' not in result['Fruit'].value_counts().index.tolist()):
        result = pd.concat([df[df['Id']=='A'].sample(1), df[df['Id']=='A'].sample(1),
                            df[df['Id']=='B'].sample(1), df[df['Id']=='C'].sample(1)])
    else:
        # if Apple and Orange are in results, check whether it's 2 of 'Apple' and 1 of 'Orange' - that's the result we want
        while (result['Fruit'].value_counts()['Apple'] != 2) | (result['Fruit'].value_counts()['Orange'] != 1):
            # if it's not the desired result, run the whole function again:
            return lets_check_it(df)
        else:
            # if it's the desired result, return the result:
            return result
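Called on the sample data above, usage would simply be (this line is not in the original snippet):
output_data = lets_check_it(df)
print(output_data)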
Not sure how this is going to play out time-wise with millions of rows.

How to parse suburb from unstructured address string

Python noob here.
I am working with a large dataset that includes a column with unstructured strings. I need to develop a way to create a list that includes all of the suburb names in Australia (I can source this easily). I then need a program that parses through the string, and where a sequence matches an entry in the list, it saves the substring to a new column. The dataset was appended from multiple sources, so there is no consistent structure to the strings.
As an example, the rows look like this:
GIBSON AVE PADSTOW NSW 2211
SYDNEY ROAD COBURG VIC 3058
DUNLOP ST, ROSELANDS
FOREST RD HURSTVILLE NSW 2220
UNKNOWN
JOSEPHINE CRES CHERRYBROOK NSW 2126
I would be greatly appreciative if anyone has any example code that they can share with me, or if you can point me in the right direction for the most appropriate tool/method to use.
In this example, the expected output would look like:
'Padstow'
'Coburg'
'Roselands'
'Hurstville'
''
'Cherrybrook'
EDIT:
Would this code work?
import pandas as pd
import numpy as np
suburb_list = np.genfromtxt('filepath/nsw.csv',
delimiter=',', dtype=str)
top_row = suburb_list[:].tolist()
dataset = pd.read_csv('filepath/dataset.csv')
def get_suburb(dataset.address):
for s in suburb_list:
if s in address.lower()
return s
So for a pretty simple approach, you could just use a big list with all the suburb names in lower case, and then do:
suburbs = ['padstow', 'coburg', .... many more]
def get_suburb(unstructured_string):
    # return the first suburb name found inside the string
    for s in suburbs:
        if s in unstructured_string.lower():
            return s
    return ''  # no suburb found
This will give you the first match. If you want to get fancy and maybe try to get it right in the face of misspellings etc., you could try "fuzzy" string comparison methods like the Levenshtein distance (for which you'd have to separate the string into individual words first).
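As a rough sketch of that fuzzy idea (using the standard-library difflib rather than a dedicated Levenshtein package, assuming the suburbs list above; note it compares single words, so multi-word suburbs would need extra handling):
import difflib

def get_suburb_fuzzy(unstructured_string, cutoff=0.85):
    # compare every word of the address against the suburb list and return the best close match
    for word in unstructured_string.lower().split():
        match = difflib.get_close_matches(word, suburbs, n=1, cutoff=cutoff)
        if match:
            return match[0]
    return ''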

Python - Verifying event based on values in dataframe

I've got a dataframe for which I am trying to verify an event based on other values in the dataframe.
To be more concrete, it's about UFO sightings. I've already grouped the df by date of sighting and dropped all rows with only one unique entry.
The next step would be to check, when dates are equal, whether the city also matches.
In the one case I would like to drop all lines, as the city is different; in the other I'd like to keep them, as the event has got the same time and the city is the same.
I am looking for a way to do this for my entire dataframe. Sorry if that's a stupid question, I'm very new to programming.
If you are just trying to remove duplicates of the combination of datetime, city and state then you can do the following which will keep the first row with first occurrence of each datetime, city and state combination.
df[df.duplicated(subset=['datetime', 'city', 'state']) == False]
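Equivalently (a small aside, not from the original answer), pandas has a built-in for this:
# keeps the first row of each datetime/city/state combination
df.drop_duplicates(subset=['datetime', 'city', 'state'])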
I don't think I'm understanding your problem, but I'll post this answer and we can work from there.
The corroborations column counts the number of times we have an observation with the same datetime and city/state combination. So in the example below, the 20th of December has three sightings, but two of those were in Portville, and the other was in Duluth. Thus the corroborations column for each event receives values of 2 and 1, respectively.
Similarly, even though we have four observations taking place in Portville, two of them happened on the 20th and the others on the 21st. Thus we group them as two separate events.
df = pd.DataFrame({'datetime': pd.to_datetime(['2016-12-20', '2016-12-20', '2016-12-20', '2016-12-21', '2016-12-21']),
'city': ['duluth', 'portville', 'portville', 'portville', 'portville'],
'state': ['mn', 'ny', 'ny', 'ny', 'ny']})
s = lambda x: x.shape[0]
df['corroborations'] = df.groupby(['datetime', 'city', 'state'])['city'].transform(s)
>>> df
    datetime       city state  corroborations
0 2016-12-20     duluth    mn               1
1 2016-12-20  portville    ny               2
2 2016-12-20  portville    ny               2
3 2016-12-21  portville    ny               2
4 2016-12-21  portville    ny               2
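If the goal is then to keep only corroborated sightings (an assumption about the desired output, since the question does not state it explicitly), you could filter on that column:
# keep only events reported by more than one row for the same date and place
corroborated = df[df['corroborations'] > 1]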
