In the popular University of Michigan Intro to Data Science in Python Coursera course, I'm having difficulty completing the second question in the Week 2 assignment, based on the df sample below:
      # Summer  Silver  Bronze  Total  ...  Silver.2  Bronze.2  Combined total   ID
Gold                                   ...
0           13       0       2      2  ...         0         2               2  AFG
5           12       2       8     15  ...         2         8              15  ALG
18          23      24      28     70  ...        24        28              70  ARG
1            5       2       9     12  ...         2         9              12  ARM
3            2       4       5     12  ...         4         5              12  ANZ
[5 rows x 15 columns]
The question is as follows:
Question 1
Which country has won the most gold medals in summer games?
This function should return a single string value.
The answer is 'USA'
I know this is very rudimentary, but I cannot get it. Pretty embarrassed but very frustrated.
Below are the errors I've encountered:
df['Gold'].argmax()
...
KeyError: 'Gold'
df['Gold'].idxmax()
...
KeyError: 'Gold'
max(df.idxmax())
...
TypeError: reduction operation 'argmax' not allowed for this dtype
df.ID.idxmax()
TypeError: reduction operation 'argmax' not allowed for this dtype
This works, but not within a function
df['ID'].sort_index(axis=0,ascending=False).iloc[0]
I really appreciate any support.
Update 1
One successful attempt
thanks to @Grr! I'm still very curious as to why the other methods are failing.
Update 2
Second successful attempt, thanks to @alec_djinn; this approach was similar to what I had previously tried but could not figure out. Thank you!
Try it like this:
df.ID.idxmax()
I think you wanted to do the following:
df.sort_index(ascending=False, inplace=True)
df.head(1)['ID'] #or df.iloc[0]['ID']
In a function it would be:
def f(df):
    df.sort_index(ascending=False, inplace=True)  # you can sort outside the function as well
    return df.iloc[0]['ID']
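For example, on the full assignment dataframe (only a 5-row sample is shown above, so this call is hypothetical), it should return the expected answer:
>>> f(df)
'USA'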
It's a bit odd that that column is your index, but be that as it may, you could grab the row where the index value equals the max of the index and then reference the ID column.
df[df.index == df.index.max()].ID
Your other methods are failing because of the KeyError: the index is named Gold, but Gold is not among the columns, and that is what raises the KeyError. That is, df['Gold'] is not possible when 'Gold' is the index. Use df.index instead. You could also reset the index like so:
df = df.reset_index()
df
   Gold  # Summer  Silver  Bronze  Total  # Winter  Gold.1  ...  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total   ID
0     0        13       0       2      2         0       0  ...        0       13       0         0         2               2  AFG
1     5        12       2       8     15         3       0  ...        0       15       5         2         8              15  ALG
2    18        23      24      28     70        18       0  ...        0       41      18        24        28              70  ARG
3     1         5       2       9     12         6       0  ...        0       11       1         2         9              12  ARM
4     3         2       4       5     12         0       0  ...        0        2       3         4         5              12  ANZ
[5 rows x 16 columns]
Then you can use df['Gold'] or df.Gold as you were attempting before as 'Gold' is now an acceptable key.
df.Gold.idxmax()
2
In my case it's 'ARG', with 18 gold medals.
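Putting the two steps together, here is a minimal sketch of the whole assignment function (my own wrapper around the reset_index approach, assuming the frame is loaded with 'Gold' as the index as in the question):
def answer_one(df):
    # Move 'Gold' out of the index and into a regular column,
    # then look up the ID on the row with the largest gold count.
    tidy = df.reset_index()
    return tidy.loc[tidy['Gold'].idxmax(), 'ID']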
Related
I'm currently working with weekly data for different subjects, but it can have long stretches without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got somewhat close by marking a 1 whenever week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence of a streak, and I also can't filter for the longest one:
df.loc[(df['id'] == df['id'].shift()) & (df['week'] == df['week'].shift() + 1), 'streak'] = 1
With my example data, this produces:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id', df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
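To see why the grouper in the first line works, here is the chain broken out step by step (a sketch of the intermediate values; the variable names are mine):
steps = df['week'].diff(-1)      # current week minus the next week; -1 inside a streak
breaks = steps.ne(-1)            # True on the last row of each streak
starts = breaks.shift().bfill()  # shift down one row: True where a new streak starts
streak_id = starts.cumsum()      # running sum of Trues -> one label per streak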
Not as concise as @ScottBoston's, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner.
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
I am working on the Olympics dataset related to this.
This is what the dataframe looks like:
                Unnamed: 0  # Summer  01 !  02 !  03 !  Total  # Winter  \
0        Afghanistan (AFG)        13     0     0     2      2         0
1            Algeria (ALG)        12     5     2     8     15         3
2          Argentina (ARG)        23    18    24    28     70        18
3            Armenia (ARM)         5     1     2     9     12         6
4  Australasia (ANZ) [ANZ]         2     3     4     5     12         0
I want to do the following things:
1. Split the country name and country code, and set the country name as the dataframe index.
2. Remove extra unnecessary characters from the country name.
For example, the updated column should be:
    Unnamed: 0  # Summer  01 !  02 !  03 !  Total  # Winter  \
0  Afghanistan        13     0     0     2      2         0
1      Algeria        12     5     2     8     15         3
2    Argentina        23    18    24    28     70        18
3      Armenia         5     1     2     9     12         6
4  Australasia         2     3     4     5     12         0
Please show me a proper way to achieve this.
You can use regex with replace to do that, i.e.
df = df.replace(r'\s*\(.+?\)|\s*\[.+?\]', '', regex=True).rename(columns={'Unnamed: 0': 'Country'}).set_index('Country')
Output:
             # Summer  01 !  02 !  03 !  Total  # Winter
Country
Afghanistan        13     0     0     2      2         0
Algeria            12     5     2     8     15         3
Argentina          23    18    24    28     70        18
Armenia             5     1     2     9     12         6
Australasia         2     3     4     5     12         0
If you don't want to rename, then use .set_index('Unnamed: 0').
Or, thanks to @Scott, a much easier solution is to split on '(' and select the first element, i.e.
df['Unnamed: 0'] = df['Unnamed: 0'].str.split('(').str[0]
Splitting to get two columns, Country and countryCode, and setting Country as the index:
df2 = pd.DataFrame(df['Unnamed: 0'].str.split(' ', n=1).tolist(), columns=['Country', 'countryCode']).set_index('Country')
You could also keep the country code as additional info in your dataframe.
Removing the extra pieces such as [ANZ] using regex (as mentioned in the other answer):
df2 = df2.replace(r'\[.*?\]', '', regex=True)
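Another option (my own sketch, assuming the raw values always look like 'Afghanistan (AFG)' with an optional trailing '[ANZ]'): str.extract can pull the name and the code out in a single pass.
# Hypothetical one-pass extraction; the group names Country/Code are my own.
parts = df['Unnamed: 0'].str.extract(r'^(?P<Country>[^(\[]+?)\s*\((?P<Code>[A-Z]+)\)')
df = df.assign(Code=parts['Code']).set_index(parts['Country'])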
I currently have the following dataframe:
df1
3 4 5 6
0 NaN NaN Sea NaN
1 light medium light medium
2 26 41.5 15 14
3 32 40 18 29
4 41 29 19 42
And I am trying to return a new dataframe where only the Sea column and onwards remains:
df1
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
I feel I am very close with my code:
for i in range(len(df.columns)):
    if pd.Series.any(df.iloc[:, i].str.contains(pat="Sea")):
        xyz = df.columns[i]  # This is the piece of code I am having trouble with
        df = df.loc[:, [xyz:??]]
Essentially I would like to return the column index of where the word 'Sea' is contained and then create a new dataframe from that index to the length of the dataframe. Hopefully that explanation makes sense, and any help is appreciated
Step 1: Get the column name:
In [542]: c = df[df == 'Sea'].any().idxmax(); c
Out[542]: '5'
Step 2: Use df.loc to index:
In [544]: df.loc[:, c:]
Out[544]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
If df.loc[:, c:] doesn't work, you may want to fall back on a more explicit version (thanks to piRSquared for the simplification):
df.iloc[:, df.columns.get_loc(c):]
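Both steps can also be collapsed into one positional slice (a sketch reusing the same idea with df.eq):
df.iloc[:, df.columns.get_loc(df.eq('Sea').any().idxmax()):]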
Maybe you could write a little rudimentary function to do so.
def match_cut(df, to_match):
    for col in df.columns:
        if df[col].str.match(to_match).any():
            return df.loc[:, col:]
    return pd.DataFrame()
That being said, cᴏʟᴅsᴘᴇᴇᴅ's answer should be preferred, as it avoids looping over the columns the way this function does.
>>> match_cut(df, 'Sea')
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
You can try this by using list and index:
df2.iloc[:, df2.iloc[0, :].tolist().index('Sea'):]
Out[85]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to not include those countries who do not at least have one medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()

print(answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
You can pass an axis, e.g. axis=1, into the dropna function. An axis of 0 means rows and 1 means columns; 0 is the default. With axis=1, the entire column is dropped.
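For the underlying task, boolean masking may be simpler than dropna: filter the rows first, then compute. A rough sketch (my own reading, assuming 'at least one gold medal' means a gold in either games, and that Gold, Gold.1, and Gold.2 are the summer, winter, and combined gold columns as in the printout):
def answer_three(df):
    # Keep only countries with at least one gold medal anywhere.
    gold = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]
    # Difference between summer and winter golds, relative to combined golds.
    return (gold['Gold'] - gold['Gold.1']) / gold['Gold.2']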
I have a dataframe which looks like this:
Trial  Measurement  Data
0      0             12
       1              4
       2             12
1      0             12
       1             12
2      0             12
       1             12
       2            NaN
       3             12
I want to resample my data so that every trial has just two measurements
So I want to turn it into something like this:
Trial  Measurement  Data
0      0             8
       1             8
1      0            12
       1            12
2      0            12
       1            12
This rather uncommon task stems from the fact that my data has an intentional jitter on the part of the stimulus presentation.
I know pandas has a resample function, but I have no idea how to apply it to my second-level index while keeping the data in discrete categories based on the first-level index :(
Also, I wanted to iterate over my first-level indices, but apparently
for sub_df in np.arange(len(df['Trial'].max()))
won't work, because 'Trial' is an index, so pandas can't find it.
Well, it's not the prettiest I've ever seen, but from a frame looking like
>>> df
Trial Measurement Data
0 0 0 12
1 0 1 4
2 0 2 12
3 1 0 12
4 1 1 12
5 2 0 12
6 2 1 12
7 2 2 NaN
8 2 3 12
then we can manually build the two "average-like" objects and then use pd.melt to reshape the output:
g = df.groupby("Trial")["Data"]
avg = pd.DataFrame({0: g.apply(lambda x: x.head((len(x) + 1) // 2).mean()),
                    1: g.apply(lambda x: x.tail((len(x) + 1) // 2).mean())})
result = pd.melt(avg.reset_index(), "Trial", var_name="Measurement", value_name="Data")
result = result.sort_values(["Trial", "Measurement"]).set_index(["Trial", "Measurement"])
which produces
>>> result
                   Data
Trial Measurement
0     0               8
      1               8
1     0              12
      1              12
2     0              12
      1              12
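An alternative sketch (mine, not from the answer above): compute the two overlapping halves explicitly with numpy, which makes the head/tail trick easier to see.
import numpy as np
import pandas as pd

def two_bins(g):
    a = g['Data'].to_numpy()
    half = (len(a) + 1) // 2  # the halves share the middle element on odd lengths
    # NaN-aware means, matching pandas' default skipna behaviour.
    return pd.Series([np.nanmean(a[:half]), np.nanmean(a[-half:])])

out = df.groupby('Trial').apply(two_bins).stack()
out.index.names = ['Trial', 'Measurement']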