I have a dataframe in pandas where one of the column (i.e., column 'b') contains strings with $ symbols:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
I want to filter the dataframe such that I only retain the rows where column'b' only returns integers other then zero and the $ symbol is discarded.
My desired output is:
Any feedback is welcome.
In [64]: df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
In [65]: df['b'] = pd.to_numeric(df['b'].str.replace(r'\D+', ''), errors='coerce')
In [67]: df
Out[67]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
In [68]: df = df[df['b'].notnull() & df['b'].ne(0)]
In [69]: df
Out[69]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
alternatively we can filter it this way:
In [73]: df = df.query("b == b and b != 0")
In [74]: df
Out[74]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
df.b=pd.to_numeric(df['b'].str.replace('$', ''),errors='coerce')
df
Out[603]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
df.loc[(df.b.notnull())&(df.b!=0),:]
Out[604]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
Similarly, using pd.to_numeric (assuming your data has this same structure throughout).
df.b = pd.to_numeric(df.b.str[1:], errors='coerce')
print(df)
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
print (df.dropna(subset=['b']))
a b
0 51 3.0
1 2 4.0
3 99 0.0
4 81 23.0
If you want to filter out both NaNs and zeros, use:
print (df[df.b.notnull() & df.b.ne(0)])
a b
0 51 3.0
1 2 4.0
4 81 23.0
Related
I have a big data frame with float values. I want to perform two if logical operations.
My code:
df =
A B
0 78.2 98.2
1 54.0 58.0
2 45.0 49.0
3 20.0 10.0
# I want to compare each column data with predefined limits and assign a rank.
# For A col, Give rank 1 if > 70, 2 if 70< > 40, 3 if < 40
# For B col, Give rank 1 if > 80, 2 if 80< > 45, 3 if < 45
# perform the logical operation
df['A_op','B_op'] = pd.cut(df, bins=[[np.NINF, 40, 70, np.inf],[np.NINF, 45, 80, np.inf]], labels=[[3, 2, 1],[3, 2, 1]])
Present output:
ValueError: Input array must be 1 dimensional
Expected output:
df =
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
It doesn't look like you need to use pd.cut for this. You can simply use np.select:
df["A_op"] = np.select([df["A"]>70, df["A"]<40],[1,3], 2)
df["B_op"] = np.select([df["B"]>80, df["B"]<45],[1,3], 2)
print (df)
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
After a series of trials, I found the direct answer from the select method.
My answer:
rankdf = pd.DataFrame({'Ah':[70],'Al':[40],'Bh':[80],'Bl':[45]})
hcols = ['Ah','Bh']
lcols = ['Al','Bl']
# input columns
ip_cols = ['A','B']
#create empty op columns in df
op_cols = ['A_op','B_op']
df = pd.concat([df,pd.DataFrame(columns=op_cols)])
# logic operation
df[op_cols] = np.select([df[ip_cols ]>rankdf[hcols].values, df[ip_cols]<rankdf[lcols].values],[1,3],2)
Present output:
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 3
2 45.0 49.0 2 3
3 20.0 10.0 3 3
I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add additional columns to the dataframe for both groups A and B and then copy data to x and y columns for group A only. And finally delete columns from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
We can try loc to extract/update only the part we want. And since you are merging on one column, which also has unique value on dfB, you can use set_index and loc/reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some of dfA.label is not in dfB.uniqueid. In which case, we need to use reindex:
(dfB.set_index('uniqid')
.reindex[dfA.loc[mask,'label']
[['horizontal','vertical']].values
)
I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back filling missing values and then select first row by double [] for one row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.NaN
df = pd.DataFrame(tmp, )
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 42.0 98.0 65.0 43.0
This is convenient because Dataframe.sum ignores NaN by default. Couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
I am trying to calculate the difference in certain rows based on the values from other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A. So Time in B - Time in A.
I can do this manually using the iloc function but I was hoping to determine a more efficient way. Especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np
k = 5
N = 15
d = ({'Time' : np.random.randint(k, k + 100 , size=N),
'Code' : ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']})
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17
First filter by boolean indexing, then subtract by sub with reset_index for default index for align Series a and b and last if want one column DataFrame add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Last for new index start from 1 add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach if need count more difference is reshape by unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
#last remove `NaN`s rows
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
#subtract with NaNs values - fill_value=0 return non NaNs values
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64
Assuming your Code is a repeat of 'A', 'x', 'B', 'x', you can just use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the indexes to run from 1 to 4, as in your question, you can assign the previous to diff, and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
I have searched the forums in search of a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row- the opposite of the .diff() function which takes the difference.
this is how I'm currently solving the problem
df = pd.DataFrame ({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]}
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to multindex or else, you can try using .cumsum(), and then .diff(-2) to sub the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The issue is that that you will get n+1 first results as NaNs