I am working on the Olympics dataset.
This is what the dataframe looks like:
Unnamed: 0 # Summer 01 ! 02 ! 03 ! Total # Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0
I want to do the following things:
Split the country name and country code, and set the country name as the dataframe index.
Remove extra unnecessary characters from the country name.
For example, the updated dataframe should look like:
Unnamed: 0 # Summer 01 ! 02 ! 03 ! Total # Winter \
0 Afghanistan 13 0 0 2 2 0
1 Algeria 12 5 2 8 15 3
2 Argentina 23 18 24 28 70 18
3 Armenia 5 1 2 9 12 6
4 Australasia 2 3 4 5 12 0
Please show me a proper way to achieve this.
You can use regex with replace to do that (note the raw strings, so the backslashes reach the regex engine), i.e.
df = (df.replace(r'\(.+?\)|\[.+?\]\s*', '', regex=True)
        .rename(columns={'Unnamed: 0': 'Country'})
        .set_index('Country'))
Output:
Summer 01 ! 02 ! 03 ! Total Winter
Country
Afghanistan 13 0 0 2 2 0
Algeria 12 5 2 8 15 3
Argentina 23 18 24 28 70 18
Armenia 5 1 2 9 12 6
Australasia 2 3 4 5 12 0
If you don't want to rename, then use .set_index('Unnamed: 0').
Or, thanks to @Scott, a much easier solution is to split on ( and select the first element (with str.strip added to drop the trailing space), i.e.
df['Unnamed: 0'] = df['Unnamed: 0'].str.split(r'\(').str[0].str.strip()
Splitting to get two columns, Country and countryCode, and setting Country as the index:
df2 = pd.DataFrame(df['Unnamed: 0'].str.split(' ', n=1).tolist(), columns=['Country', 'countryCode']).set_index('Country')
You could also add country code as an additional info in your dataframe.
Removing the extra characters, such as [ANZ], using regex (as mentioned in the other answer):
df2 = df2.replace(r'\[.*?\]', '', regex=True)
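Alternatively, a hedged sketch using str.extract to pull the name and the code out in one pass; the pattern is my assumption about the "Name (CODE)" format visible in the sample, and the Code column is optional extra info:
# Assumes values look like "Afghanistan (AFG)" or "Australasia (ANZ) [ANZ]".
extracted = df['Unnamed: 0'].str.extract(r'^(?P<Country>[^([]+?)\s*\((?P<Code>[A-Z]+)\)')
df = (df.assign(Country=extracted['Country'], Code=extracted['Code'])
        .drop(columns='Unnamed: 0')
        .set_index('Country'))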
The objective is to subtract each row N's value from the running result of row N-1, separately within each group.
Given a df
years nchar nval
0 2019 a 1
1 2019 b 1
2 2019 c 1
3 2020 a 1
4 2020 s 4
Let's separate out the group for year 2019 and denote it df_2019.
For df_2019, we start from the constant 10.
Then, only for index 0, we do the following operation and assign the result to a new column 'B':
df_2019.loc[df_2019.index[0], 'B']= 10 - df_2019['nval'].values[0]
Whereas for the other indices N:
df_2019.loc[df_2019.index[N], 'B'] = df_2019['B'].values[N-1] - df_2019['nval'].values[N]
This will produce the following table:
years nchar nval C D B
1 2019 a 1 9
2 2019 b 1 8
3 2019 c 1 7
For the group 2020, the same computation applies. However, the only difference is that the constant value is 7, taken from the last entry of column B for 2019.
To meet this requirement, the following code was produced, with extra groups added:
import pandas as pd

year = [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2022, 2022, 2022]
nval = [1, 1, 1, 1, 4, 1, 4, 5, 6, 7]
nchar = ['a', 'b', 'c', 'a', 's', 'c', 'a', 'b', 'c', 'g']
df = pd.DataFrame(zip(year, nchar, nval), columns=['years', 'nchar', 'nval'])
print(df)

year_ls = [2019, 2020, 2022]
nspacing_total = 2
nspacing_between_df = 4
all_df = []
default_val = 10
for idx, dyear in enumerate(year_ls):
    df_ = df[df['years'] == dyear].reset_index(drop=True)
    t = pd.DataFrame([[''] * 3] * len(df_), columns=['C', 'D', 'B'])
    df_ = pd.concat([df_, t], axis=1)
    Total = df_['nval'].sum()
    # prepend one blank row (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    blank = pd.DataFrame([[''] * len(df_.columns)], columns=df_.columns)
    df_ = pd.concat([blank, df_]).reset_index(drop=True)
    if idx == 0:
        df_.loc[df_.index[0], 'B'] = default_val
    else:
        # seed this group's 'B' with the last 'B' of the previous group
        pre_df = all_df[idx - 1]
        pre_val = pre_df['B'].values[-1]
        nposi = 1
        pre_years = pre_df['years'].values[nposi]
        df_.loc[df_.index[0], 'nchar'] = f'From {pre_years}'
        df_.loc[df_.index[0], 'B'] = pre_val
    # running subtraction: B[n] = B[n-1] - nval[n]
    for ndexd in range(df_.shape[0] - 1):
        df_.loc[df_.index[ndexd + 1], 'B'] = df_['B'].values[ndexd] - df_['nval'].values[ndexd + 1]
    # append spacer rows and write the 'Total' row into the last one
    spacer = pd.DataFrame([[''] * len(df_.columns)] * nspacing_total, columns=df_.columns)
    df_ = pd.concat([df_, spacer]).reset_index(drop=True)
    df_.loc[df_.index[-1], 'nval'] = Total
    df_.loc[df_.index[-1], 'nchar'] = 'Total'
    df_.loc[df_.index[-1], 'B'] = df_['B'].values[0] - df_['nval'].values[-1]
    all_df.append(df_)
However, I wonder whether this proposal can be simplified further using pandas groupby or something similar. I really appreciate any tips.
Ultimately, I would like to express the table as below, which will be exported to Excel:
years nchar nval C D B
0 10
1 2019 a 1 9
2 2019 b 1 8
3 2019 c 1 7
4
5 Total 3 7
6
7
8
9
10 From 2019 7
11 2020 a 1 6
12 2020 s 4 2
13 2020 c 1 1
14 2020 a 4 -3
15
16 Total 10 -3
17
18
19
20
21 From 2020 -3
22 2022 b 5 -8
23 2022 c 6 -14
24 2022 g 7 -21
25
26 Total 18 -21
27
28
29
30
The code that produces the table above:
# Optional: to reproduce the table above
all_ap_df = []
for a_df in all_df:
    spacer = pd.DataFrame([[''] * len(a_df.columns)] * nspacing_between_df, columns=a_df.columns)
    all_ap_df.append(pd.concat([a_df, spacer]).reset_index(drop=True))

df = pd.concat(all_ap_df, axis=0).reset_index(drop=True)
df.loc[df.index[0], 'D'] = df['B'].values[0]
df.loc[df.index[0], 'B'] = ''
df = df.fillna('')
I think this is actually quite simple. Use groupby + cumsum:
df['B'] = 10 - df['nval'].cumsum()
Output:
>>> df
years nchar nval B
0 2019 a 1 9
1 2019 b 1 8
2 2019 c 1 7
3 2020 a 1 6
4 2020 s 4 2
In your case, chain it with groupby. Note this restarts the balance at 10 for each year, whereas the plain cumsum above carries it across years, matching the 'From <year>' requirement:
df['new'] = df.groupby('years')['nval'].cumsum().rsub(10)
Out[8]:
0 9
1 8
2 7
3 9
4 5
Name: nval, dtype: int64
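If the spaced, Excel-ready layout is still needed, here is a condensed sketch building on the cumsum idea; the spacer counts and the Total rows mimic the layout in the question, and it omits the 'From <year>' header rows for brevity (this is my sketch, not the only way):
import pandas as pd

year = [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2022, 2022, 2022]
nval = [1, 1, 1, 1, 4, 1, 4, 5, 6, 7]
nchar = ['a', 'b', 'c', 'a', 's', 'c', 'a', 'b', 'c', 'g']
df = pd.DataFrame(zip(year, nchar, nval), columns=['years', 'nchar', 'nval'])

# The running balance carries across years automatically.
df['B'] = 10 - df['nval'].cumsum()

blocks = []
for yr, g in df.groupby('years', sort=True):
    total = pd.DataFrame({'nchar': ['Total'], 'nval': [g['nval'].sum()],
                          'B': [g['B'].iloc[-1]]})
    spacer = pd.DataFrame('', index=range(2), columns=df.columns)
    blocks.append(pd.concat([g, spacer, total, spacer], ignore_index=True))

report = pd.concat(blocks, ignore_index=True).fillna('')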
In the popular University of Michigan Introduction to Data Science in Python Coursera course, I'm having difficulty completing the second question in the Week 2 assignment. Based on the df sample below:
# Summer Silver Bronze Total ... Silver.2 Bronze.2 Combined total ID
Gold ...
0 13 0 2 2 ... 0 2 2 AFG
5 12 2 8 15 ... 2 8 15 ALG
18 23 24 28 70 ... 24 28 70 ARG
1 5 2 9 12 ... 2 9 12 ARM
3 2 4 5 12 ... 4 5 12 ANZ
[5 rows x 15 columns]
The question is as follows:
Question 1
Which country has won the most gold medals in summer games?
This function should return a single string value.
The answer is 'USA'
I know this is very rudimentary, but I cannot get it. Pretty embarrassed but very frustrated.
Below are the errors I've encountered:
df['Gold'].argmax()
...
KeyError: 'Gold'
df['Gold'].idxmax()
...
KeyError: 'Gold'
max(df.idxmax())
...
TypeError: reduction operation 'argmax' not allowed for this dtype
df.ID.idxmax()
TypeError: reduction operation 'argmax' not allowed for this dtype
This works, but not within a function:
df['ID'].sort_index(axis=0,ascending=False).iloc[0]
I really appreciate any support.
Update 1
One successful attempt
thanks to @Grr! I am still very curious as to why the other methods are failing.
Update 2
Second successful attempt thanks to @alec_djinn; this approach was similar to what I had previously tried but could not figure out. Thank you!
Try it like this:
df.ID.idxmax()
I think you wanted to do the following:
df.sort_index(ascending=False, inplace=True)
df.head(1)['ID'] #or df.iloc[0]['ID']
in a function it would be:
def f(df):
    df.sort_index(ascending=False, inplace=True)  # you can sort outside the function as well
    return df.iloc[0]['ID']
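With the five-row sample above, f(df) sorts the Gold index in descending order so the row with the most golds comes first, and returns that row's ID, i.e. 'ARG'.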
It's a bit odd that that column is your index, but be that as it may, you can grab the row where the index equals its max and then reference the ID column:
df[df.index == df.index.max()].ID
Your other methods are failing as a result of the KeyError. The index name is Gold, but Gold is not in the column index, and that raises the KeyError. I.e. df['Gold'] is not possible when 'Gold' is the index; use df.index instead. You could also reset the index like so:
df = df.reset_index()
df
Gold # Summer Silver Bronze Total # Winter Gold.1 ... Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
0 0 13 0 2 2 0 0 ... 0 13 0 0 2 2 AFG
1 5 12 2 8 15 3 0 ... 0 15 5 2 8 15 ALG
2 18 23 24 28 70 18 0 ... 0 41 18 24 28 70 ARG
3 1 5 2 9 12 6 0 ... 0 11 1 2 9 12 ARM
4 3 2 4 5 12 0 0 ... 0 2 3 4 5 12 ANZ
[5 rows x 16 columns]
Then you can use df['Gold'] or df.Gold as you were attempting before as 'Gold' is now an acceptable key.
df.Gold.idxmax()
2
In my case it's 'ARG' with 18 gold medals (on this five-row sample; on the full dataset the answer is 'USA').
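Putting the pieces together, a minimal sketch of the assignment-style function; the name answer_one is hypothetical, and 'Gold' and 'ID' are the index/column names shown above:
def answer_one(df):
    # Move 'Gold' out of the index so it is a regular column.
    df = df.reset_index()
    # Row label of the max gold count, then that row's country ID.
    return df.loc[df['Gold'].idxmax(), 'ID']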
Could someone tell me how to add rows to this dataframe automatically?
I have a dataframe df:
frequency
enrollment_id event days
1 access 2 3
7 8
9 4
10 3
12 2
15 21
18 4
19 8
20 20
22 16
23 2
28 2
29 14
navigate 2 1
7 4
9 1
10 3
11 1
12 1
15 5
18 1
19 1
22 3
23 1
28 1
29 2
page_close 2 1
7 6
9 2
10 3
... ...
200881 navigate 28 1
200882 discussion 28 4
navigate 28 4
200883 access 28 2
navigate 28 2
page_close 28 1
200885 navigate 21 1
200887 access 21 3
navigate 21 2
page_close 21 1
video 21 1
200888 access 21 2
discussion 21 1
navigate 21 5
page_close 21 1
video 21 1
wiki 21 1
200889 navigate 21 1
200893 navigate 21 2
200895 navigate 21 1
200896 navigate 21 1
200897 navigate 21 1
200898 navigate 21 1
200900 navigate 21 1
200901 access 21 3
navigate 21 2
page_close 21 2
video 21 1
200904 navigate 21 1
200905 navigate 21 1
This df has a three-level index, enrollment_id, event, and days, and only one column, frequency.
event has 7 different values, like access, remove, etc.
days has 30 different values, 0 to 29 (not every event has all of 0 to 29; some events have only, for example, 0, 1, and 4).
enrollment_id has a lot of different values (maybe 100,000). Likewise, not every day has every enrollment_id.
My question is : How can I add all lost rows?
For example : If I have this
frequency
enrollment_id event days
1 access 2 3
7 8
I need to add rows for
frequency
enrollment_id event days
1 access 0 0
1 0
3 0
4 0
5 0
6 0
... ...
29 0
and I need to add rows for day 0 with all other enrollment_id values and frequency 0, and all rows for access with days 0 to 29 and enrollment_id from 1 to the maximum.
I really want to solve this. I really appreciate your help!
EDIT:
If you need to add missing days only to the last level, days, use reindex with unstack + stack:
df = (df['frequency'].unstack()
        .reindex(columns=list(range(30)), fill_value=0)
        .stack()
        .to_frame('frequency'))
If you need to add all combinations of all levels, use a new MultiIndex created by from_product:
#get all unique values of all levels
a = df.index.get_level_values('enrollment_id').unique()
b = df.index.get_level_values('event').unique()
c = df.index.get_level_values('days').unique()
Or you can use your own values in lists, keeping the same level order (enrollment_id, event, days):
a = range(1, df.index.get_level_values('enrollment_id').max() + 1)
b = ['access', 'remove']
c = range(30)
mux = pd.MultiIndex.from_product([a,b,c], names=df.index.names)
#for missing values add 0
df = df.reindex(mux, fill_value=0)
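A self-contained toy example of the from_product + reindex pattern (the small frame here is illustrative, not the question's data):
import pandas as pd

df = pd.DataFrame(
    {'frequency': [3, 8]},
    index=pd.MultiIndex.from_tuples(
        [(1, 'access', 2), (1, 'access', 7)],
        names=['enrollment_id', 'event', 'days']))

mux = pd.MultiIndex.from_product(
    [[1], ['access'], range(30)], names=df.index.names)
df = df.reindex(mux, fill_value=0)  # 30 rows; days 2 and 7 keep 3 and 8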
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: df1.C = df2.H where df1.A == df2.G.
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching key. (In this example, Patrik from df1 does not match in df2 because they have different ages and therefore will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
Name Special ability Age
0 Sara Walk on water 4
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df=df1.merge(df2,left_on="A",right_on="G")
will give you a dataframe with four columns: A and B from df1, plus the keys G and H. Dropping the redundant key and renaming:
df = df.drop(columns="G")
df.columns = ["A", "B", "C"]
will then give you the column names you want.
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed in a simpler way with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays via .values for performance, as done earlier.
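One caveat worth adding (my note, not part of the original answer): searchsorted only gives correct positions if df2.G is sorted and every value of df1.A actually occurs in it. A self-contained sketch with those preconditions made explicit, using a shortened version of the sample data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [12, 15, 45, 78, 45, 78]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

g = df2['G'].values
assert np.all(np.diff(g) > 0), "df2.G must be sorted for searchsorted"
assert np.isin(df1['A'].values, g).all(), "every df1.A value must appear in df2.G"

idx = np.searchsorted(g, df1['A'].values)
df1['C'] = df2['H'].values[idx]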
I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country, using stats about the Olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to exclude those countries that do not have at least one gold medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()
print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
You are required to pass an axis, e.g. axis=1, into the drop function. An axis of 0 means rows and 1 means columns, and 0 is the default; with axis=1 the entire column is dropped.
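That said, dropna() only removes missing values (NaN), it does not remove zeros, and calling it without assigning the result discards it. For the stated requirement, a boolean filter is the more direct tool. A hedged sketch, assuming the column names shown in the output above ('Gold' is summer gold, 'Gold.1' winter gold, 'Gold.2' combined gold):
def answer_three():
    # Keep only countries with at least one gold medal of either kind.
    only_gold = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]
    # Difference between summer and winter golds, relative to combined golds.
    return (only_gold['Gold'] - only_gold['Gold.1']) / only_gold['Gold.2']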