Pandas Dataframe replace values in a Series - python

I am trying to update my_df based on conditional selection as in:
my_df[my_df['group'] == 'A']['rank'].fillna('A+')
However, this is not persistence ... e.g: the my_df still have NaN or NaT ... and I am not sure how to do this in_place. Please advise on how to persist the the update to my_df.

Create boolean mask and assign to filtered column rank:
my_df = pd.DataFrame({'group':list('AAAABC'),
'rank':['a','b',np.nan, np.nan, 'c',np.nan],
'C':[7,8,9,4,2,3]})
print (my_df)
group rank C
0 A a 7
1 A b 8
2 A NaN 9
3 A NaN 4
4 B c 2
5 C NaN 3
m = my_df['group'] == 'A'
my_df.loc[m, 'rank'] = my_df.loc[m, 'rank'].fillna('A+')
print(my_df)
group rank C
0 A a 7
1 A b 8
2 A A+ 9
3 A A+ 4
4 B c 2
5 C NaN 3

You need to assign it back
my_df.loc[my_df['group'] == 'A','rank']=my_df.loc[my_df['group'] == 'A','rank'].fillna('A+')

Your operations are not in-place, so you need to assign back to a variable. In addition, chained indexing is not recommended.
One option is pd.Series.mask with a Boolean series:
# data from #jezrael
df['rank'].mask((df['group'] == 'A') & df['rank'].isnull(), 'A+', inplace=True)
print(df)
C group rank
0 7 A a
1 8 A b
2 9 A A+
3 4 A A+
4 2 B c
5 3 C NaN

Related

Is there a way to get the data after a specific condition in Pandas?

i want to know if there is a way to take the data from a dataframe after a specific condition, and keep taking that data until another condition is applied.
I have the following dataframe:
column_1 column_2
0 1 a
1 1 a
2 1 b
3 4 b
4 4 c
5 4 c
6 0 d
7 0 d
8 0 e
9 4 e
10 4 f
11 4 f
12 1 g
13 1 g
I want to select from this dataframe only the rows when in column_1 when it changes from 1->4 and stays 4 until it changes to another value, as follow:
column_1 column_2
3 4 b
4 4 c
5 4 c
Is there a way to do this in Pandas and not make them lists?
Another option is to find the cut off points using shift+eq; then use groupby.cummax to create a boolean filter:
df[(df['column_1'].shift().eq(1) & df['column_1'].eq(4)).groupby(df['column_1'].diff().ne(0).cumsum()).cummax()]
Output:
column_1 column_2
3 4 b
4 4 c
5 4 c
You can create helper column for groups by duplicated values new first, then test if shifted values is 1 compare with actual row and for these rows get new values. Last compare new column by filtered values for all duplicated 4 rows:
df['new'] = df['column_1'].ne(df['column_1'].shift()).cumsum()
s = df.loc[df['column_1'].shift().eq(1) & df['column_1'].eq(4), 'new']
df = df[df['new'].isin(s)]
print (df)
column_1 column_2 new
3 4 b 2
4 4 c 2
5 4 c 2

Selecting rows from one DataFrame depending on values from another

Take two data frames
print(df1)
A B
0 a 1
1 a 3
2 a 5
3 b 7
4 b 9
5 c 11
6 c 13
7 c 15
print(df2)
C D
a apple 1
b pear 1
c apple 1
So the values in column df1['A'] are the indexes of df2.
I want to select the rows in df1 where the values in column A are 'apple' in df2['C']. Resulting in:
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
Made many edits due to comments and question edits,
Basically you first extract the indexes of df2 by filtering the dataframe by values in C, then filter the df2 by indexes with isin
indexes = df2[df2['C']=='apple'].index
df1[df1['A'].isin(indexes)]
>>>
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
UPDATE
If you want to minimize memory allocation try to prevent saving information, (note. That i am not sure ot will solve your menory allocation issue because i didnt have full details of the situation and maybe even not suited enough to provide a solution):
df1[df1['A'].isin( df2[df2['C']=='apple'].index)]

apply() method: Normalize the first column by the sum of the second

I'm having trouble understanding how a function works:
""" the apply() method lets you apply an arbitrary function to the group
result. The function take a DataFrame and returns a Pandas object (a df or
series) or a scalar.
For example: normalize the first column by the sum of the second"""
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
print (df); print (df.groupby('key').apply(norm_by_data2))
(Excerpt from: "Python Data Science Handbook", Jake VanderPlas pp. 167)
Returns this:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
For me, the best way to understand how this works is by manually calculating the values.
Can someone explain how to manually arrive to the second value of the column 'data1': 0.142857
It's 1/7? but where do this values come from?
Thanks!
I got it!!
The sum of column B for each group is:
A: 5 + 3 = 8
B: 0 + 7 = 7
C: 3 + 9 = 12
For example, to arrive to 0.142857, divide 1 in the sum of group B (it's 7) : 1/7 = 0.142857

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if need remove all rows with non numeric then need to_numeric with errors='coerce' which return NaN and then is possible filter it by notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = pd.to_numeric(df['a'], errors='coerce').notnull() &
pd.to_numeric(df['b'], errors='coerce').notnull()
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If need check if some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Outer join in python Pandas

I have two data sets as following
A B
IDs IDs
1 1
2 2
3 5
4 7
How in Pandas, Numpy we can apply a join which can give me all the data from B, which is not present in A
Something like Following
B
Ids
5
7
I know it can be done with for loop, but that I don't want, since my real data is in millions, and I am really not sure how to use Panda Numpy here, something like following
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
You can use merge with parameter indicator and then boolean indexing. Last you can drop column _merge:
A = pd.DataFrame({'IDs':[1,2,3,4],
'B':[4,5,6,7],
'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
'A':[1,8,3,7],
'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7

Categories