I've got a dataframe in which I am trying to verify an event based on other values in the dataframe.
To be more concrete it's about UFO sightings. I've already grouped the df by date of sighting and dropped all rows with only one unique entry.
The next step is to check, when dates are equal, whether the city is also the same.
When the dates match but the cities differ, I would like to drop all of those rows.
When the dates match and the city is also the same, I'd like to keep the rows.
I am looking for a way to do this for my entire dataframe. Sorry if that's a stupid question, I'm very new to programming.
If you are just trying to remove duplicates of the combination of datetime, city and state, then you can do the following, which keeps only the first occurrence of each datetime, city and state combination.
df[~df.duplicated(subset=['datetime', 'city', 'state'])]
I don't think I'm understanding your problem, but I'll post this answer and we can work from there.
The corroborations column counts the number of times we have an observation with the same datetime and city/state combination. So in the example below, the 20th of December has three sightings, but two of those were in Portville, and the other was in Duluth. Thus the corroborations column for each event receives values of 2 and 1, respectively.
Similarly, even though we have four observations taking place in Portville, two of them happened on the 20th and the others on the 21st. Thus we group them as two separate events.
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2016-12-20', '2016-12-20', '2016-12-20', '2016-12-21', '2016-12-21']),
                   'city': ['duluth', 'portville', 'portville', 'portville', 'portville'],
                   'state': ['mn', 'ny', 'ny', 'ny', 'ny']})
# Count the rows in each (datetime, city, state) group and broadcast that count back to every row
df['corroborations'] = df.groupby(['datetime', 'city', 'state'])['city'].transform('size')
>>> df
    datetime       city state  corroborations
0 2016-12-20     duluth    mn               1
1 2016-12-20  portville    ny               2
2 2016-12-20  portville    ny               2
3 2016-12-21  portville    ny               2
4 2016-12-21  portville    ny               2
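If the goal is to keep only sightings that at least one other row corroborates (same datetime and same city/state), a minimal sketch could use duplicated with keep=False, which flags every member of a repeated group rather than only the later ones. The variable names key and corroborated are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-12-20", "2016-12-20", "2016-12-20",
                                "2016-12-21", "2016-12-21"]),
    "city": ["duluth", "portville", "portville", "portville", "portville"],
    "state": ["mn", "ny", "ny", "ny", "ny"],
})

key = ["datetime", "city", "state"]
# keep=False marks ALL rows of a repeated (datetime, city, state) group as duplicates,
# so single, uncorroborated sightings are the only rows filtered out
corroborated = df[df.duplicated(subset=key, keep=False)]
print(corroborated)
```

Here the lone Duluth sighting is dropped, while both Portville events are kept.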
I have a large data set and I want to do the following task efficiently. Suppose we have 2 data frames. For each element in df2, I want to search the first data set, df1, only in rows where the first 2 letters are in common; then the name sharing the most tokens is chosen.
Let's see in an example:
df1:
name            first_two
common work     co
summer hot      su
apple           ap
colorful fall   co
support it      su
could comp      co
df2:
name                first_two
condition work it   co
common mistakes     co
could comp work     co
summer              su
Take the first row of df2 as an example (condition work it). I want to find the row in df1 that has the same first_two and shares the most tokens with it.
The first_two of condition work it is co, so I want to search in df1 where first_two is co. The search is therefore done among common work, colorful fall and could comp. Since condition work it shares 1 token (work) with common work, that row is selected.
output:
df2:
name                first_two   match
condition work it   co          `common work`
common mistakes     co          `common work`
could comp work     co          `could comp`
summer              su          `summer hot`
appears             ap          NaN
The last row is NaN since there is no common word between appears and apple.
I did following:
df3 = (df1.groupby(['first_two'])
          .agg({'name': lambda x: ",".join(x)})
          .reset_index())
merge_ = df3.merge(df2, on='first_two', how='inner')
But now I have to search in name_x for each name_y. How do I find the element of name_x that shares the most tokens with name_y?
You have pretty much explained the most efficient method already.
1. Extract the first 2 letters using .str[:2] on the series and assign the result to a new column in both dataframes.
2. Extract the unique values of the 2-letter column from df2.
3. Inner join the result from #2 onto df1.
4. Perform a group-by count on the result of #3, sort descending by the count, and drop duplicates to get the most repeated item for each 2-letter value.
5. Left join the result of #4 onto df2.
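For comparison, here is a minimal sketch of one way to reach the expected output from the question by scoring token overlap directly, rather than following the join steps literally. The helper best_match and the candidates lookup are illustrative names, not part of either post, and the "appears" row is taken from the question's expected output.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["common work", "summer hot", "apple",
                             "colorful fall", "support it", "could comp"]})
df2 = pd.DataFrame({"name": ["condition work it", "common mistakes",
                             "could comp work", "summer", "appears"]})
df1["first_two"] = df1["name"].str[:2]
df2["first_two"] = df2["name"].str[:2]

# Map each 2-letter prefix to the list of df1 names that start with it
candidates = df1.groupby("first_two")["name"].agg(list).to_dict()

def best_match(name, first_two):
    """Return the candidate sharing the most whitespace-separated tokens, or None."""
    tokens = set(name.split())
    best, best_score = None, 0
    for cand in candidates.get(first_two, []):
        score = len(tokens & set(cand.split()))
        if score > best_score:  # strict '>' keeps the first candidate on ties
            best, best_score = cand, score
    return best

df2["match"] = [best_match(n, f) for n, f in zip(df2["name"], df2["first_two"])]
print(df2)
```

Because each name is only compared against candidates with the same prefix, the work per row stays small even when df1 is large.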
Hello I am struggling to find a solution to probably a very common problem.
I want to merge two CSV files with soccer data. They basically store different data about the same games. Normally I would use .merge, but the problem is that the naming differs for some teams between the two datasets. For example, Manchester City is called Man. City in the second data frame.
Here's roughly what df1 and df2 look like:
df1:
team1 team2 date some_value_i_want_to_compare
Manchester City Arsenal 2022-05-20 22:00:00 0.2812 5
df2:
team1 team2 date some_value_i_want_to_compare
Man. City Arsenal 2022-05-20 22:00:00 0.2812 3
Note that in the above case there are only differences in team1 but there could also be cases where team2 is slightly different. So for example in this case Arsenal could be called FC Arsenal in the second data set.
So my main question is: How could I automatically analyse the naming differences between the two datasets?
My second question is: How do I scale this for more than 2 data sets so that the number of data sets ultimately doesn't matter?
As commenters and the existing answer have suggested, if the number of unique names is not too large, you can manually extract the mismatches and correct them. That is probably the best solution unless the number of mismatches is very large.
Another case that can occur is when you have a ground-truth list of allowed names (for example, the list of all soccer teams in a given league), but the data may contain many different attempts at spelling or abbreviating each team. If this is similar to your situation, you can use difflib to search for the most likely match for a given name. For example:
import difflib

true_names = ['Manchester United', 'Chelsea']
mismatch_names = ['Man. Unites', 'Chlsea', 'Chelsee']
# n=1 returns at most one candidate per name (an empty list if nothing is close enough)
best_matches = [difflib.get_close_matches(x, true_names, n=1) for x in mismatch_names]
for old, new in zip(mismatch_names, best_matches):
    print(f"Best match for {old} is {new[0]}")
output:
Best match for Man. Unites is Manchester United
Best match for Chlsea is Chelsea
Best match for Chelsee is Chelsea
Note that if the spelling is very bad, the top match may be wrong; you can ask difflib for the closest n matches using the n= keyword argument and choose among them. This can help to reduce manual data cleaning work, although some manual cleaning is often unavoidable.
Hope it helps.
You could start by doing an anti-join to isolate the ones that don't match:
# Merge two team datasets
teams_join = df1.merge(df2, on='team1',
                       how='left', indicator=True)
# Select the team1 column where _merge is left_only
team_list = teams_join.loc[teams_join['_merge'] == 'left_only', 'team1']
# Print team names in df1 with no match in df2
print(df1[df1["team1"].isin(team_list)])
This will give you all the teams in df1 without a match in df2. You could do the same for df2 (just swap df1 and df2 in the previous code). Then you can take those two lists of names that don't match and manually rename them if there are few enough of them.
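Combining the two answers, a rough sketch could treat one dataset's names as the reference, find the names in another dataset that are missing from it, and let difflib propose a rename for each. The helpers team_names and suggest_renames, the cutoff of 0.6 and the tiny sample frames (including "FC Arsenal", which the question only mentions as a possibility) are assumptions for illustration; the suggestions should still be reviewed by hand.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({"team1": ["Manchester City"], "team2": ["Arsenal"]})
df2 = pd.DataFrame({"team1": ["Man. City"], "team2": ["FC Arsenal"]})

def team_names(df):
    """All team names used in either team column."""
    return set(df["team1"]) | set(df["team2"])

def suggest_renames(reference, other, cutoff=0.6):
    """Map each name missing from `reference` to its closest reference name (or None)."""
    suggestions = {}
    for name in sorted(other - reference):
        match = difflib.get_close_matches(name, sorted(reference), n=1, cutoff=cutoff)
        suggestions[name] = match[0] if match else None
    return suggestions

reference = team_names(df1)
renames = suggest_renames(reference, team_names(df2))
print(renames)

# Apply only the renames that actually found a candidate
confirmed = {old: new for old, new in renames.items() if new is not None}
df2_clean = df2.replace({"team1": confirmed, "team2": confirmed})
```

For more than two datasets, the same suggest_renames call can be run in a loop over a list of dataframes against a single reference set, so the number of datasets does not change the approach.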
I have a question.
Is there a way to check whether there are typos in a specific column?
I have an Excel sheet which is read using pandas.
First, I need to make a unique list in Python based on the name of the column.
Second, I need to replace the wrong values with the new values.
Working in a Jupyter notebook and doing this semi-manually might be the best way. One option could be to start by creating a list of correct spelling:
correct= ['terms','that','are','spelt','correctly']
and create a subset from your data frame which does not contain the values in that list.
df[~df['columnname'].str.startswith(tuple(correct))]
You will then know how many rows are affected. You can then count the number of different variations:
df['columnname'].value_counts()
and if reasonable, you could look at the unique values, and make them into a list:
listoftypos = list(df['columnname'].unique())
print(listoftypos)
and then create a dictionary again in a semi-manual way as:
typodict = {'terma':'term','thaaat':'that','arree':'are','speelt':'spelt','cooorrectly':'correctly'}
then iterate over your original data frame, and if the value in a row is one of the typos in the dictionary, replace it with the corresponding correct spelling, something like this:
for index, row in df.iterrows():
    if row['columnname'] in typodict:
        df.at[index, 'columnname'] = typodict[row['columnname']]
A strong caveat here though - of course, this would be something that would have to be done iteratively if the dataframe was extremely large.
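For larger frames, the row-by-row loop can usually be avoided: Series.replace accepts a dictionary and swaps exact matches in one vectorized call. A minimal sketch, assuming the typo dictionary has already been built by hand and using a made-up five-row column:

```python
import pandas as pd

df = pd.DataFrame({"columnname": ["terma", "that", "arree", "spelt", "cooorrectly"]})
typodict = {"terma": "term", "thaaat": "that", "arree": "are",
            "speelt": "spelt", "cooorrectly": "correctly"}

# Values that are keys of typodict are replaced; everything else is left untouched
df["columnname"] = df["columnname"].replace(typodict)
print(df["columnname"].tolist())
```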
Keep in mind that a generic spell check is a fairly tall order, but I believe this solution will fit your need with the lowest chance of false matches:
Setup:
import difflib
from itertools import permutations

import pandas as pd

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']
# Every ordered pair of words, joined without a space (e.g. 'northwest', 'eastcoast')
p_lst = list(permutations(cardinal_directions + regions, 2))
area = [''.join(i) for i in p_lst] + cardinal_directions + regions
df = pd.DataFrame({"ID": list(range(0, 9)), "region": ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast', 'Central', 'South', 'International', 'Centrall']})
Initial DF:
ID  region
0   Midwest
1   Northwest
2   West
3   Northeast
4   East coast
5   Central
6   South
7   International
8   Centrall
Function:
def spell_check(my_str, name_bank):
    prcnt = []
    for y in name_bank:
        # Similarity ratio between each allowed name and the cleaned input string
        prcnt.append(difflib.SequenceMatcher(None, y, my_str.lower().strip()).ratio())
    return name_bank[prcnt.index(max(prcnt))]
Apply Function to DF:
df.region=df.region.apply(lambda x: spell_check(x, area))
Resultant DF:
ID  region
0   midwest
1   northwest
2   west
3   northeast
4   eastcoast
5   central
6   south
7   international
8   central
I hope this answers your question and good luck.
I am working with COVID data and am trying to control for population and show incidence per 100,000.
I have one DataFrame with population:
**Country** **Population**
China 1389102
Israel 830982
Iran 868912
I have a second DataFrame showing COVID data:
**Date** **Country** **Confirmed**
01/Jan/2020 China 8
01/Jan/2020 Israel 3
01/Jan/2020 Iran 2
02/Jan/2020 China 15
02/Jan/2020 Israel 5
02/Jan/2020 Iran 5
I wish to perform a calculation on my COVID DataFrame using info from the population DataFrame. That is to say, to normalise cases per 100,000 for each date via:
(Chinese Datapoint/Chinese Population) * 100,000
Likewise for my other countries.
I am stumped on this one and not sure whether I should achieve my result via grouping data, zipping data, etc.
Any help welcome.
Edit: I should have added that confirmed cases are cumulative as each day goes on. So for example, for China on Jan 1st I wish to compute (8/china population)*100000, and likewise for Jan 2nd, Jan 3rd, Jan 4th... And again, likewise for each country. Essentially, I want to perform a calculation on the entire DataFrame based on data in another DataFrame.
You could merge the two dataframes and then perform the operation:
# Define the normalisation: cases per 100,000 people
def norm_cases(cases, population):
    return (cases / population) * 100000
# If the column name for country is the same in both dataframes
covid_df = covid_df.merge(population_df, on='country_column', how='left')
# For different column names, use this merge instead
# covid_df = covid_df.merge(population_df, left_on='covid_country_column', right_on='population_country_column', how='left')
covid_df['norm_cases'] = covid_df.apply(lambda x: norm_cases(x['cases_column'], x['population_column']), axis=1)
Assuming that your dataframes are called df1 and df2 and by "Datapoint" you mean the column **Confirmed**:
normed_cases = (
df2.reset_index().groupby(['**Country**', '**Date**']).sum()['**Confirmed**']
/ df1.set_index('**Country**')['**Population**'] * 100000)
1. Reset the index of df2 to make the date a column (only applicable if **Date** was the index before).
2. Group by country and date and sum the groups to get the total cases per country and date.
3. Set country as the index of df1 to allow country-aligned division.
4. Divide by population and multiply by 100,000.
I took an approach combining many of your suggestions. Step one, I merged my two dataframes. Step two, I divided my confirmed column by the population. Step three, I multiplied the same column by 100,000. There probably is a more elegant approach but this works.
covid_df = covid_df.merge(population_df, on='Country', how='left')
covid_df["Confirmed"] = covid_df["Confirmed"].divide(covid_df["Population"], axis="index")
covid_df["Confirmed"] = covid_df["Confirmed"] *100000
Suppose the dataframe with population is df_pop and the one with Covid data is df_data.
# Set Country as the index of df_pop
df_pop = df_pop.set_index('Country')
# Norm value
norm = 100000
# Calculate normalised cases
df_data['norm_cases'] = [(conf / df_pop.loc[country, 'Population']) * norm
                         for conf, country in zip(df_data.Confirmed, df_data.Country)]
You can use df1.set_index('Country').join(df2.set_index('Country')) here; after that it will be easy to perform these calculations.
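As a compact alternative to the approaches above, a minimal sketch could look up each row's population with Series.map and divide in one step, without merging at all. The frame names population_df and covid_df follow the earlier answers, and the column name per_100k is just an illustrative choice.

```python
import pandas as pd

population_df = pd.DataFrame({
    "Country": ["China", "Israel", "Iran"],
    "Population": [1389102, 830982, 868912],
})
covid_df = pd.DataFrame({
    "Date": ["01/Jan/2020"] * 3 + ["02/Jan/2020"] * 3,
    "Country": ["China", "Israel", "Iran"] * 2,
    "Confirmed": [8, 3, 2, 15, 5, 5],
})

# Series indexed by country, used as a lookup table
pop = population_df.set_index("Country")["Population"]

# map() fetches the population for each row's country; countries missing from pop give NaN
covid_df["per_100k"] = covid_df["Confirmed"] / covid_df["Country"].map(pop) * 100_000
print(covid_df)
```

Writing to a new column keeps the original cumulative Confirmed counts available for other calculations.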
I have a number of small dataframes, each with a date and stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different and the numbers in the stock column are always different. So when you call all_dfs[1] this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX Date MMM US equity
0 1/3/2000 47.19
1 1/4/2000 45.31
2 1/5/2000 46.63
3 1/6/2000 50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it returns a syntax error for the all_dfs[i] line.
for i in range (len(df.columns)):
for all_dfs[i]:
df['P_CHG1D'] = df.loc[:,0].pct_change(1) * 100
So I have 2 problems that I cannot figure out:
1. I don't know how to add columns to every dataframe in the loop. I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
2. The second part after the =, which is df['Stock Name 1'], keeps changing (in this example it is called MMM US Equity, but the next time it would be the column header of the second dataframe, so it could be IBM US Equity). Each dataframe has a different stock column name, so I don't know how to reference it properly in the loop.
I am new to python/pandas so if I am thinking about this the wrong way let me know if there is a better solution.
Consider iterating over the length of all_dfs to reference each element in the loop by its index. For the first new column, select the stock column by position with .iloc (the older .ix indexer was removed in pandas 1.0); position 2 means the third column, so use 1 instead if IDX is the index rather than a column:
for i in range(len(all_dfs)):
    # If all_dfs[i] is a slice of another frame, take .copy() first to avoid SettingWithCopyWarning
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
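A variation that avoids hard-coding the column position is sketched below. It assumes the stock price is always the last column of each frame before any new columns are added. The MMM prices come from the question; the IBM prices are made-up placeholders for illustration only.

```python
import pandas as pd

dates = ["1/3/2000", "1/4/2000", "1/5/2000", "1/6/2000"]
all_dfs = [
    pd.DataFrame({"Date": dates, "MMM US equity": [47.19, 45.31, 46.63, 50.38]}),
    pd.DataFrame({"Date": dates, "IBM US equity": [100.0, 104.0, 103.0, 107.5]}),  # made-up prices
]

for df in all_dfs:
    # Look up the stock column name before adding anything, so it is still the last column
    stock_col = df.columns[-1]
    df["P_CHG1D"] = df[stock_col].pct_change(1) * 100
    # The first row has no previous price (NaN), which compares as False and becomes 0
    df["PCHG_SIG"] = (df["P_CHG1D"] > 3).astype(int)

print(all_dfs[0])
```

Iterating over the list directly modifies each dataframe in place, so all_dfs[i] shows the new columns afterwards.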