I currently have the following dataframe:
df1
3 4 5 6
0 NaN NaN Sea NaN
1 light medium light medium
2 26 41.5 15 14
3 32 40 18 29
4 41 29 19 42
And I am trying to return a new dataframe where only the 'Sea' column and everything after it remains:
df1
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
I feel I am very close with my code:
for i in range(len(df.columns)):
    if pd.Series.any(df.iloc[:, i].str.contains(pat="Sea")):
        xyz = df.columns[i]  # This is the piece of code I am having trouble with
        df = df.loc[:, [xyz:??]]
Essentially I would like to return the column index where the word 'Sea' is contained and then create a new dataframe from that column to the end of the dataframe. Hopefully that explanation makes sense, and any help is appreciated.
Step 1: Get the column name:
In [542]: c = df[df == 'Sea'].any().idxmax(); c
Out[542]: '5'
Step 2: Use df.loc to index:
In [544]: df.loc[:, c:]
Out[544]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
If df.loc[:, c:] doesn't work, you may want to fall back on a more explicit version (thanks to piRSquared for the simplification):
df.iloc[:, df.columns.get_loc(c):]
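For reference, here is a minimal end-to-end sketch of this approach; the frame is reconstructed from the question's example, so the values are assumed:

import pandas as pd

# reconstruct the example frame from the question (values assumed)
df = pd.DataFrame({
    '3': [None, 'light', 26, 32, 41],
    '4': [None, 'medium', 41.5, 40, 29],
    '5': ['Sea', 'light', 15, 18, 19],
    '6': [None, 'medium', 14, 29, 42],
})

c = df[df == 'Sea'].any().idxmax()           # first column label where 'Sea' appears
result = df.iloc[:, df.columns.get_loc(c):]  # that column and everything after it
print(result)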
Maybe you could write a little rudimentary function to do so.
def match_cut(df, to_match):
    for col in df.columns:
        if df[col].str.match(to_match).any():
            return df.loc[:, col:]
    return pd.DataFrame()
That being said, cᴏʟᴅsᴘᴇᴇᴅ's answer should be preferred, as it avoids looping over columns the way this function does.
>>> match_cut(df, 'Sea')
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
You can try this by using list and index (.ix is deprecated, so .iloc is used here):
df2.iloc[:, df2.iloc[0, :].tolist().index('Sea'):]
Out[85]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
I'm trying to create a new column that gives a rolling sum of values in the Values column. The rolling sum includes 4 rows, i.e. the current row and the next three rows. I want to do this for each type in the 'Type' column.
However, if there are fewer than 4 rows before the next type starts, I want the rolling sum to use only the remaining rows. For example, if there are 2 rows after the current row for the current type, a total of 3 rows is used for the rolling sum. See the table below showing what I'm currently getting and what I expect.
Index  Type      Value  Current Rolling Sum  Expected Rolling Sum
1      left          5                   22                    22
2      left          9                   34                    34
3      left          0                  NaN                    25
4      left          8                  NaN                    25
5      left         17                  NaN                    17
6      straight      7                   61                    61
7      straight      4                   77                    77
8      straight      0                   86                    86
9      straight     50                   97                    97
10     straight     23                  NaN                    47
11     straight     13                  NaN                    24
12     straight     11                  NaN                    11
The following line of code is what I'm currently using to get the rolling sum.
rolling_sum = df.groupby('Type', sort=False)['Value'].rolling(4, min_periods=3).sum().shift(-3).reset_index()
rolling_sum = rolling_sum.rename(columns={'Value': 'Rolling Sum'})
extracted_col = rolling_sum['Rolling Sum']
df = df.join(extracted_col)
I would really appreciate your help.
You can try running the rolling sum on the reversed values for each group and then reverse back afterward, using a min_periods of 1:
df['Rolling Sum'] = df.groupby('Type', sort=False)['Value'].apply(lambda x: x[::-1].rolling(4, min_periods=1).sum()[::-1])
Result:
Index Type Value Rolling Sum
0 1 left 5 22.0
1 2 left 9 34.0
2 3 left 0 25.0
3 4 left 8 25.0
4 5 left 17 17.0
5 6 straight 7 61.0
6 7 straight 4 77.0
7 8 straight 0 86.0
8 9 straight 50 97.0
9 10 straight 23 47.0
10 11 straight 13 24.0
11 12 straight 11 11.0
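On a recent pandas (pandas.api.indexers.FixedForwardWindowIndexer was added in 1.1), the same forward-looking window can be expressed without reversing; a sketch under that assumption:

from pandas.api.indexers import FixedForwardWindowIndexer

# window covers the current row plus the next three rows within each group
indexer = FixedForwardWindowIndexer(window_size=4)
df['Rolling Sum'] = df.groupby('Type', sort=False)['Value'].transform(
    lambda s: s.rolling(indexer, min_periods=1).sum()
)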
I'm trying to create a new column in a df. I want the new column to equal the count of the number of rows for each unique 'mother_ID', which is a different column in the df.
This is what I'm currently doing. It makes the new column, but the new column is filled with NaNs.
df.columns = ['mother_ID', 'date_born', 'mother_mass_g', 'hatchling_masses_g']
df.to_numpy()
This is how the original df appears when I print it:
count = df.groupby('mother_ID').hatchling_masses_g.count()
df['count']= count
The pic below shows what I get when I print the new df, although if I simply print(count) I get the correct counts for each mother_ID. Does anyone know what I'm doing wrong?
Use groupby transform('count'):
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
Notice the difference between groupby count and groupby transform with 'count'.
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({
    'mother_ID': np.random.choice(['a', 'b'], 10),
    'hatchling_masses_g': np.random.randint(1, 100, 10)
})
mother_ID hatchling_masses_g
0 b 63
1 a 28
2 b 31
3 b 81
4 a 8
5 a 77
6 a 16
7 b 54
8 a 81
9 a 28
groupby.count
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
mother_ID
a 6
b 4
Name: hatchling_masses_g, dtype: int64
Notice how there are only 2 rows. When assigning back to the DataFrame, which has 10 rows, pandas doesn't know how to align the data, which results in NaNs indicating missing data:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 NaN
1 a 28 NaN
2 b 31 NaN
3 b 81 NaN
4 a 8 NaN
5 a 77 NaN
6 a 16 NaN
7 b 54 NaN
8 a 81 NaN
9 a 28 NaN
Pandas tries to find 'a' and 'b' in the DataFrame's index and, since it cannot, fills the column with nothing but NaN values.
groupby.transform('count')
transform, on the other hand, will populate the entire group with the count:
counts = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
counts:
0 4
1 6
2 4
3 4
4 6
5 6
6 6
7 4
8 6
9 6
Name: hatchling_masses_g, dtype: int64
Notice 10 rows were created (one for every row in the DataFrame).
This assigns back to the dataframe nicely (since the indexes align):
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
If needed, the counts can be computed via groupby count and then joined back to the DataFrame on the group key:
counts = df.groupby('mother_ID')['hatchling_masses_g'].count().rename('count')
df = df.join(counts, on='mother_ID')
counts:
mother_ID
a 6
b 4
Name: count, dtype: int64
df:
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
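Equivalently, since the counts Series is indexed by the group key, it can be broadcast back with map instead of join:

counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
df['count'] = df['mother_ID'].map(counts)  # look up each row's mother_ID in counts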
In the popular UM Intro to DS in Py coursera course, I'm having difficulty completing the second question in the Week 2 assignment. Based on the below df sample:
# Summer Silver Bronze Total ... Silver.2 Bronze.2 Combined total ID
Gold ...
0 13 0 2 2 ... 0 2 2 AFG
5 12 2 8 15 ... 2 8 15 ALG
18 23 24 28 70 ... 24 28 70 ARG
1 5 2 9 12 ... 2 9 12 ARM
3 2 4 5 12 ... 4 5 12 ANZ
[5 rows x 15 columns]
The question is as follows:
Question 1
Which country has won the most gold medals in summer games?
This function should return a single string value.
The answer is 'USA'
I know this is very rudimentary, but I cannot get it. Pretty embarrassed but very frustrated.
Below are the errors I've encountered.
df['Gold'].argmax()
...
KeyError: 'Gold'
df['Gold'].idxmax()
...
KeyError: 'Gold'
max(df.idxmax())
...
TypeError: reduction operation 'argmax' not allowed for this dtype
df.ID.idxmax()
TypeError: reduction operation 'argmax' not allowed for this dtype
This works, but not within a function
df['ID'].sort_index(axis=0,ascending=False).iloc[0]
I really appreciate any support.
Update 1
One successful attempt, thanks to #Grr! I am still very curious as to why the other methods are failing.
Update 2
Second successful attempt, thanks to #alec_djinn. This approach was similar to what I had previously tried but could not figure out. Thank you!
Try it like this:
df.ID.idxmax()
I think you wanted to do the following:
df.sort_index(ascending=False, inplace=True)
df.head(1)['ID'] #or df.iloc[0]['ID']
in a function it would be:
def f(df):
    df.sort_index(ascending=False, inplace=True)  # you can sort outside the function as well
    return df.iloc[0]['ID']
It's a bit odd that that column is your index, but be that as it may, you could grab the row where the value of the index is equal to the max of the index and then reference the ID column.
df[df.index == df.index.max()].ID
Your other methods are failing as a result of the KeyError. The index name is 'Gold', but 'Gold' is not in the column index, and this raises the KeyError; i.e. df['Gold'] is not possible when 'Gold' is the index. Instead, use df.index. You could also reset the index like so:
df = df.reset_index()
df
Gold # Summer Silver Bronze Total # Winter Gold.1 ... Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
0 0 13 0 2 2 0 0 ... 0 13 0 0 2 2 AFG
1 5 12 2 8 15 3 0 ... 0 15 5 2 8 15 ALG
2 18 23 24 28 70 18 0 ... 0 41 18 24 28 70 ARG
3 1 5 2 9 12 6 0 ... 0 11 1 2 9 12 ARM
4 3 2 4 5 12 0 0 ... 0 2 3 4 5 12 ANZ
[5 rows x 16 columns]
Then you can use df['Gold'] or df.Gold as you were attempting before, since 'Gold' is now an acceptable key.
df.Gold.idxmax()
2
In my case it's 'ARG' with 18 gold medals, since the sample frame only contains five countries (the full dataset returns 'USA').
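Putting the pieces together, a sketch of the whole lookup as a function, under the same assumption that the frame is indexed by 'Gold':

def answer_one(df):
    # move the 'Gold' index back into a regular column, then return the ID
    # of the row with the most gold medals
    flat = df.reset_index()
    return flat.loc[flat['Gold'].idxmax(), 'ID']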
I am trying to drop NA values from a pandas dataframe.
I have used dropna() (which should drop all NA rows from the dataframe). Yet, it does not work.
Here is the code:
import pandas as pd
import numpy as np
prison_data = pd.read_csv('https://andrewshinsuke.me/docs/compas-scores-two-years.csv')
That's how you get the data frame. As the following shows, the default read_csv method does indeed convert the NA data points to np.nan.
np.isnan(prison_data.head()['out_custody'][4])
Out[2]: True
Conveniently, the head() of the DF already contains NaN values (in the column out_custody), so printing prison_data.head() gives you:
id name first last compas_screening_date sex
0 1 miguel hernandez miguel hernandez 2013-08-14 Male
1 3 kevon dixon kevon dixon 2013-01-27 Male
2 4 ed philo ed philo 2013-04-14 Male
3 5 marcu brown marcu brown 2013-01-13 Male
4 6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male
dob age age_cat race ...
0 1947-04-18 69 Greater than 45 Other ...
1 1982-01-22 34 25 - 45 African-American ...
2 1991-05-14 24 Less than 25 African-American ...
3 1993-01-21 23 Less than 25 African-American ...
4 1973-01-22 43 25 - 45 Other ...
v_decile_score v_score_text v_screening_date in_custody out_custody
0 1 Low 2013-08-14 2014-07-07 2014-07-14
1 1 Low 2013-01-27 2013-01-26 2013-02-05
2 3 Low 2013-04-14 2013-06-16 2013-06-16
3 6 Medium 2013-01-13 NaN NaN
4 1 Low 2013-03-26 NaN NaN
priors_count.1 start end event two_year_recid
0 0 0 327 0 0
1 0 9 159 1 1
2 4 0 63 0 1
3 1 0 1174 0 0
4 2 0 1102 0 0
However, running prison_data.dropna() does not change the dataframe in any way.
prison_data.dropna()
np.isnan(prison_data.head()['out_custody'][4])
Out[3]: True
df.dropna() by default returns a new DataFrame without the NaN rows, so you have to assign the result back to the variable:
df = df.dropna()
If you want it to modify the df in place, you have to say so explicitly:
df.dropna(inplace=True)
It also wasn't working because there was at least one NaN per row, so dropna() drops every row of this dataset.
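If that is the case, narrowing what dropna() looks at is the usual fix; a short sketch, using column names from the question's data:

# drop rows only when these specific columns are NaN
prison_data = prison_data.dropna(subset=['in_custody', 'out_custody'])

# or: keep any row that has at least 10 non-null values
prison_data = prison_data.dropna(thresh=10)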
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
This can also be done with more than one matching argument. (In this example, Patrik from df1 does not exist in df2 because they have different ages and therefore will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
     Name Special ability  Age
0    Sara   Walk on water    4
1  Patrik       FireBalls   83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on=['Name','Age'], right_on=['Name','Age'], how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This keeps both key columns, so it gives you a dataframe with four columns (A, B, G, H). Dropping the redundant key and renaming will then give you the column names you want:
df = df.drop(columns="G").rename(columns={"H": "C"})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach:
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but that is unlikely to be any more efficient, because working on the underlying arrays with .values, as done above, avoids pandas overhead. Note that this approach assumes df2.G is sorted and that every value of df1.A actually occurs in df2.G; otherwise searchsorted will silently return wrong positions.
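As a quick self-contained check, here is the same approach on a trimmed version of the question's data (values taken from the example above):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [12, 15, 45, 78, 45, 10]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

# valid because df2.G is sorted and every df1.A value appears in df2.G
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
print(df1)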