how to filter rows based on

I have a pandas dataframe in which one of the columns (column 'b') contains strings with $ symbols:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
I want to filter the dataframe so that I only retain the rows where column 'b' holds an integer other than zero, with the $ symbol discarded.
My desired output is:
Any feedback is welcome.

In [64]: df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
In [65]: df['b'] = pd.to_numeric(df['b'].str.replace(r'\D+', '', regex=True), errors='coerce')
In [67]: df
Out[67]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
In [68]: df = df[df['b'].notnull() & df['b'].ne(0)]
In [69]: df
Out[69]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
Alternatively, we can filter with query (b == b is False only for NaN):
In [73]: df = df.query("b == b and b != 0")
In [74]: df
Out[74]:
a b
0 51 3.0
1 2 4.0
4 81 23.0

df.b=pd.to_numeric(df['b'].str.replace('$', '', regex=False),errors='coerce')
df
Out[603]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
df.loc[(df.b.notnull())&(df.b!=0),:]
Out[604]:
a b
0 51 3.0
1 2 4.0
4 81 23.0

Similarly, using pd.to_numeric (assuming your data has this same structure throughout).
df.b = pd.to_numeric(df.b.str[1:], errors='coerce')
print(df)
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
print (df.dropna(subset=['b']))
a b
0 51 3.0
1 2 4.0
3 99 0.0
4 81 23.0
If you want to filter out both NaNs and zeros, use:
print (df[df.b.notnull() & df.b.ne(0)])
a b
0 51 3.0
1 2 4.0
4 81 23.0
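Since the question asks for integers, the float results above can be cast back once the NaN and zero rows are gone. The following is only a minimal sketch, assuming every valid entry is a $ followed by digits; the names cleaned, mask and result are illustrative and not from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'a': [51, 2, 32, 99, 81],
                   'b': ['$3', '$4', '$-', '$0', '$23']})

# Strip the '$' literally (regex=False), then coerce non-numeric text to NaN.
cleaned = pd.to_numeric(df['b'].str.replace('$', '', regex=False), errors='coerce')

# Keep rows that are neither NaN nor zero, then cast the survivors to int.
mask = cleaned.notna() & cleaned.ne(0)
result = df.loc[mask].assign(b=cleaned[mask].astype(int))
print(result)
```

The cast is done after filtering because a column that still contains NaN cannot be converted to a plain integer dtype.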

Related

Python Dataframe Logical Operations on Multiple Columns using Multiple If statements

I have a big data frame with float values. I want to perform two if logical operations.
My code:
df =
A B
0 78.2 98.2
1 54.0 58.0
2 45.0 49.0
3 20.0 10.0
# I want to compare each column's data with predefined limits and assign a rank.
# For column A: rank 1 if > 70, rank 2 if between 40 and 70, rank 3 if < 40
# For column B: rank 1 if > 80, rank 2 if between 45 and 80, rank 3 if < 45
# perform the logical operation
df['A_op','B_op'] = pd.cut(df, bins=[[np.NINF, 40, 70, np.inf],[np.NINF, 45, 80, np.inf]], labels=[[3, 2, 1],[3, 2, 1]])
Present output:
ValueError: Input array must be 1 dimensional
Expected output:
df =
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3

It doesn't look like you need to use pd.cut for this. You can simply use np.select:
df["A_op"] = np.select([df["A"]>70, df["A"]<40],[1,3], 2)
df["B_op"] = np.select([df["B"]>80, df["B"]<45],[1,3], 2)
print (df)
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
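If you would still like to use pd.cut, it has to be called on one column at a time, which is what the ValueError in the question is pointing at. A rough sketch, assuming the same limits as the question; the bins dictionary is a hypothetical helper, not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [78.2, 54.0, 45.0, 20.0],
                   'B': [98.2, 58.0, 49.0, 10.0]})

# Hypothetical lookup of bin edges per column; labels run from the lowest bin up.
bins = {'A': [-np.inf, 40, 70, np.inf],
        'B': [-np.inf, 45, 80, np.inf]}

for col, edges in bins.items():
    # pd.cut works on one 1-D column at a time and returns a categorical.
    df[f'{col}_op'] = pd.cut(df[col], bins=edges, labels=[3, 2, 1]).astype(int)

print(df)
```

Note that with the default right=True a value sitting exactly on a lower limit (40 or 45) falls into the lower bin and gets rank 3, whereas the np.select version gives it rank 2. The sample data has no such values, so both give the same result here.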

After a series of trials, I found the direct answer from the select method.
My answer:
rankdf = pd.DataFrame({'Ah':[70],'Al':[40],'Bh':[80],'Bl':[45]})
hcols = ['Ah','Bh']
lcols = ['Al','Bl']
# input columns
ip_cols = ['A','B']
#create empty op columns in df
op_cols = ['A_op','B_op']
df = pd.concat([df,pd.DataFrame(columns=op_cols)])
# logic operation
df[op_cols] = np.select([df[ip_cols ]>rankdf[hcols].values, df[ip_cols]<rankdf[lcols].values],[1,3],2)
Present output:
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 3
2 45.0 49.0 2 3
3 20.0 10.0 3 3

Conditionally insert columns of one Pandas dataframe into columns of another dataframe

I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I used .merge() to add dfB's columns to dfA for both groups, then copied the values into the x and y columns for group A only, and finally dropped the columns that came from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?

combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()

We can try loc to extract/update only the part we want. And since you are merging on one column, which also has unique value on dfB, you can use set_index and loc/reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some values of dfA.label are not in dfB.uniqid. In that case, use reindex instead:
(dfB.set_index('uniqid')
.reindex(dfA.loc[mask,'label'])
[['horizontal','vertical']].values
)
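To see the difference, here is a small self-contained sketch of the reindex variant. It assumes the question's data with one change: the second label is set to 9, a made-up value that has no match in dfB. The name lookup is illustrative only.

```python
import numpy as np
import pandas as pd

# Same layout as the question, except label 9 (hypothetical) has no match in dfB.
dfA = pd.DataFrame({'label': [1, 9, 2, 4, 2, 3],
                    'group': ['A'] * 3 + ['B'] * 3,
                    'x': [np.nan] * 3 + [1, 2, 3],
                    'y': [np.nan] * 3 + [1, 2, 3]})
dfB = pd.DataFrame({'uniqid': [1, 2, 3, 4, 5, 6, 7],
                    'horizontal': [34, 41, 23, 34, 23, 43, 22],
                    'vertical': [98, 67, 19, 57, 68, 88, 77]})

mask = dfA['group'] == 'A'

# reindex returns NaN rows for labels missing from dfB instead of raising KeyError.
lookup = dfB.set_index('uniqid')[['horizontal', 'vertical']]
dfA.loc[mask, ['x', 'y']] = lookup.reindex(dfA.loc[mask, 'label']).to_numpy()
print(dfA)
```

The row with label 9 keeps NaN in x and y, while the group B rows are untouched.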

Form single Row from all rows with corresponding values in pandas

I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43

First back-fill the missing values, then select the first row with double brackets to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0

One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 42.0 98.0 65.0 43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
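Both caveats can probably be worked around. The sketch below assumes the frame is built with the two-level column labels from the question (the construction itself is my own, not from the answer): the labels then survive the sum, and min_count=1 keeps an all-NaN column as NaN.

```python
import numpy as np
import pandas as pd

# Column labels as in the question: years on the first level, letters on the second.
cols = pd.MultiIndex.from_product([[2017, 2018], ['A', 'B', 'C']])
df = pd.DataFrame([[12, np.nan, np.nan, 98, np.nan, np.nan],
                   [np.nan, 23, np.nan, np.nan, 65, np.nan],
                   [np.nan, np.nan, 45, np.nan, np.nan, 43]], columns=cols)

# min_count=1 makes a column with no valid values sum to NaN instead of 0.
row = df.sum(min_count=1).to_frame().T
print(row)
```

This only gives the intended result when each column has at most one non-NaN value; otherwise the values are added together, unlike with bfill.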

How to subtract rows in a df based on a value in another column

I am trying to calculate the difference in certain rows based on the values from other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A. So Time in B - Time in A.
I can do this manually using the iloc function but I was hoping to determine a more efficient way. Especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np
k = 5
N = 15
d = ({'Time' : np.random.randint(k, k + 100 , size=N),
'Code' : ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']})
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17

First filter with boolean indexing, then subtract with sub. Calling reset_index gives both Series a default index so that a and b align. Finally, if you want a one-column DataFrame, add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Finally, to make the new index start from 1, add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach, useful if you need to compute several differences, is to reshape with unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
# finally, drop the NaN rows
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
# subtracting with fill_value=0 treats a value missing on one side as 0, so the result has no NaNs
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64

Assuming your Code is a repeat of 'A', 'x', 'B', 'x', you can just use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the indexes to run from 1 to 4, as in your question, you can assign the previous to diff, and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
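Because the question's data comes from np.random.randint, the numbers above cannot be reproduced by rerunning it. The sketch below hard-codes the Time values printed in the question and pairs the n-th A with the n-th B by occurrence count rather than by position, so it does not rely on strict alternation. The names pair and wide are illustrative only.

```python
import pandas as pd

# Time values copied from the question's printed output, so the result is reproducible.
df = pd.DataFrame({
    'Code': ['A', 'x', 'B', 'x', 'A', 'x', 'B', 'x', 'A', 'x', 'B', 'x', 'A', 'x', 'B'],
    'Time': [89, 39, 24, 62, 83, 57, 69, 10, 87, 62, 86, 11, 54, 44, 71],
})

# Number the occurrences of each code (1st A, 2nd A, ...), then put codes in columns.
pair = df.groupby('Code').cumcount() + 1
wide = df.assign(pair=pair).pivot(index='pair', columns='Code', values='Time')

# Subtract the n-th A from the n-th B; rows 5-7 exist only for 'x', so drop their NaNs.
diff = (wide['B'] - wide['A']).dropna().astype(int).to_frame('diff')
print(diff)
```

If an A had no matching B (or the reverse), that pair would come out as NaN and be dropped rather than silently shifting the later pairs.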

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row: the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame ({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.

You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN

df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44

If you cannot use rolling, because of a MultiIndex or something else, you can try .cumsum() followed by .diff(2), which subtracts the cumulative sum from two positions earlier.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get the opposite of .diff(n) you should be able to use .cumsum().diff(n+1). The drawback is that the first n+1 results will be NaN.
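One of those leading NaNs is avoidable, since a full window already exists one row earlier. The following is a sketch of one way to recover it, assuming a plain numeric Series; the names k and windowed are my own. It shifts the cumulative sum with a fill value of 0 instead of calling diff.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8], name='a')
k = 2  # window length: sum each value with the k - 1 values before it

cs = s.cumsum()
# Subtract the cumulative sum from k rows earlier, treating "before the start" as 0.
windowed = cs - cs.shift(k, fill_value=0)
# Blank out the first k - 1 rows, where a full window is not yet available.
windowed = windowed.where(np.arange(len(s)) >= k - 1)

print(windowed)
```

For this data the result should match s.rolling(k).sum(), including the value 7 at index 1 that .cumsum().diff(2) reports as NaN.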
