I have a big DataFrame with float values, and I want to apply two conditional (if-like) operations to it.
My code:
df =
A B
0 78.2 98.2
1 54.0 58.0
2 45.0 49.0
3 20.0 10.0
# I want to compare each column data with predefined limits and assign a rank.
# For col A, assign rank 1 if > 70, 2 if between 40 and 70, 3 if < 40
# For col B, assign rank 1 if > 80, 2 if between 45 and 80, 3 if < 45
# perform the logical operation
df['A_op','B_op'] = pd.cut(df, bins=[[np.NINF, 40, 70, np.inf],[np.NINF, 45, 80, np.inf]], labels=[[3, 2, 1],[3, 2, 1]])
Present output:
ValueError: Input array must be 1 dimensional
Expected output:
df =
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
It doesn't look like you need to use pd.cut for this. You can simply use np.select:
df["A_op"] = np.select([df["A"]>70, df["A"]<40],[1,3], 2)
df["B_op"] = np.select([df["B"]>80, df["B"]<45],[1,3], 2)
print (df)
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
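For reference, the pd.cut attempt in the question raises ValueError because pd.cut only accepts one-dimensional input. A per-column sketch with pd.cut (using the same bins and labels as the question) could look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [78.2, 54.0, 45.0, 20.0],
                   'B': [98.2, 58.0, 49.0, 10.0]})

# pd.cut only works on 1-D data, so cut each column separately
bins = {'A': [-np.inf, 40, 70, np.inf], 'B': [-np.inf, 45, 80, np.inf]}
for col in ['A', 'B']:
    # labels are given lowest interval first: below low -> 3, middle -> 2, above high -> 1
    df[col + '_op'] = pd.cut(df[col], bins=bins[col], labels=[3, 2, 1]).astype(int)

print(df)
```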
After a series of trials, I found a direct answer using the select method with broadcasting.
My answer:
rankdf = pd.DataFrame({'Ah':[70],'Al':[40],'Bh':[80],'Bl':[45]})
hcols = ['Ah','Bh']
lcols = ['Al','Bl']
# input columns
ip_cols = ['A','B']
#create empty op columns in df
op_cols = ['A_op','B_op']
df = pd.concat([df,pd.DataFrame(columns=op_cols)])
# logical operation
df[op_cols] = np.select([df[ip_cols] > rankdf[hcols].values, df[ip_cols] < rankdf[lcols].values], [1, 3], 2)
Present output:
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
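The broadcasting idea above can be written as a self-contained sketch (it assumes the column order in rankdf matches the column order in df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [78.2, 54.0, 45.0, 20.0],
                   'B': [98.2, 58.0, 49.0, 10.0]})
rankdf = pd.DataFrame({'Ah': [70], 'Al': [40], 'Bh': [80], 'Bl': [45]})

# broadcast the single row of limits against every row of df
high = rankdf[['Ah', 'Bh']].values  # shape (1, 2)
low = rankdf[['Al', 'Bl']].values

ranks = np.select([df[['A', 'B']].values > high,
                   df[['A', 'B']].values < low], [1, 3], 2)
df['A_op'], df['B_op'] = ranks[:, 0], ranks[:, 1]
print(df)
```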
I have 2 dataframes:
dfA = pd.DataFrame({'label':[1,5,2,4,2,3],
'group':['A']*3 + ['B']*3,
'x':[np.nan]*3 + [1,2,3],
'y':[np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid':[1,2,3,4,5,6,7],
'horizontal':[34,41,23,34,23,43,22],
'vertical':[98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add additional columns to the dataframe for both groups A and B and then copy data to x and y columns for group A only. And finally delete columns from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
This works as well:
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
We can use loc to extract and update only the part we want. Since you are merging on one column, which holds unique values in dfB, you can use set_index together with loc or reindex:
mask = dfA['group']=='A'
dfA.loc[ mask, ['x','y']] = (dfB.set_index('uniqid')
.loc[dfA.loc[mask,'label'],
['horizontal','vertical']]
.values
)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some values of dfA.label are not in dfB.uniqid; in that case, we need to use reindex:
(dfB.set_index('uniqid')
    .reindex(dfA.loc[mask, 'label'])
    [['horizontal', 'vertical']].values
)
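Putting the reindex variant together as a runnable sketch (using the sample frames from the question):

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({'label': [1, 5, 2, 4, 2, 3],
                    'group': ['A']*3 + ['B']*3,
                    'x': [np.nan]*3 + [1, 2, 3],
                    'y': [np.nan]*3 + [1, 2, 3]})
dfB = pd.DataFrame({'uniqid': [1, 2, 3, 4, 5, 6, 7],
                    'horizontal': [34, 41, 23, 34, 23, 43, 22],
                    'vertical': [98, 67, 19, 57, 68, 88, 77]})

mask = dfA['group'] == 'A'
# reindex tolerates labels that are missing from dfB.uniqid (they become NaN)
dfA.loc[mask, ['x', 'y']] = (dfB.set_index('uniqid')
                                .reindex(dfA.loc[mask, 'label'])
                                [['horizontal', 'vertical']].values)
print(dfA)
```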
I need to combine multiple rows into a single row, and the original dataframes looks like:
IndividualID DayID TripID JourSequence TripPurpose
200100000001 1 1 1 3
200100000001 1 2 2 31
200100000001 1 3 3 23
200100000001 1 4 4 5
200100000009 1 55 1 3
200100000009 1 56 2 12
200100000009 1 57 3 4
200100000009 1 58 4 6
200100000009 1 59 5 19
200100000009 1 60 6 2
I was trying to build some sort of 'trip chain', so basically all the journey sequences and trip purposes of one individual on a single day should be in the same row...
Ideally I was trying to convert the table to something like this:
IndividualID DayID Seq1 TripPurp1 Seq2 TripPur2 Seq3 TripPurp3 Seq4 TripPur4
200100000001 1 1 3 2 31 3 23 4 5
200100000009 1 1 3 2 12 3 4 4 6
If this is not possible, then the following format would also be fine:
IndividualID DayID TripPurposes
200100000001 1 3, 31, 23, 5
200100000009 1 3, 12, 4, 6
Are there any possible solutions? I was thinking of a for loop or while statement, but maybe that is not really a good idea.
Thanks in advance!
You can try:
df_out = (df.set_index(['IndividualID', 'DayID',
                        df.groupby(['IndividualID', 'DayID']).cumcount() + 1])
            .unstack()
            .sort_index(level=1, axis=1))
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
IndividualID DayID JourSequence_1 TripID_1 TripPurpose_1 \
0 200100000001 1 1.0 1.0 3.0
1 200100000009 1 1.0 55.0 3.0
JourSequence_2 TripID_2 TripPurpose_2 JourSequence_3 TripID_3 \
0 2.0 2.0 31.0 3.0 3.0
1 2.0 56.0 12.0 3.0 57.0
TripPurpose_3 JourSequence_4 TripID_4 TripPurpose_4 JourSequence_5 \
0 23.0 4.0 4.0 5.0 NaN
1 4.0 4.0 58.0 6.0 5.0
TripID_5 TripPurpose_5 JourSequence_6 TripID_6 TripPurpose_6
0 NaN NaN NaN NaN NaN
1 59.0 19.0 6.0 60.0 2.0
To get your second output you just need to groupby and apply list:
df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list)
TripPurpose
IndividualID DayID
200100000001 1 [3, 31, 23, 5]
200100000009 1 [3, 12, 4, 6, 19, 2]
To get your first output, you can do something like this (probably not the best approach):
df2 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list))
trip = df2['TripPurpose'].apply(pd.Series).rename(columns = lambda x: 'TripPurpose'+ str(x+1))
df3 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['JourSequence'].apply(list))
seq = df3['JourSequence'].apply(pd.Series).rename(columns = lambda x: 'seq'+ str(x+1))
pd.merge(trip,seq,on=['IndividualID','DayID'])
Note that the output is not sorted.
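For the second, comma-separated layout from the question, joining the values as strings also works (a sketch using the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'IndividualID': [200100000001]*4 + [200100000009]*6,
    'DayID': [1]*10,
    'TripID': [1, 2, 3, 4, 55, 56, 57, 58, 59, 60],
    'JourSequence': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
    'TripPurpose': [3, 31, 23, 5, 3, 12, 4, 6, 19, 2],
})

# join each group's trip purposes into one comma-separated string
out = (df.groupby(['IndividualID', 'DayID'])['TripPurpose']
         .agg(lambda s: ', '.join(map(str, s)))
         .reset_index())
print(out)
```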
I am trying to calculate the difference in certain rows based on the values from other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A. So Time in B - Time in A.
I can do this manually using the iloc function, but I was hoping to find a more efficient way, especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np
k = 5
N = 15
d = ({'Time' : np.random.randint(k, k + 100 , size=N),
'Code' : ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']})
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17
First filter by boolean indexing, then subtract with sub, calling reset_index so the Series a and b align on a default index; finally, if you want a one-column DataFrame, add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Finally, for a new index starting from 1, add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach, useful if you need to compute more differences, is to reshape with unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
# finally, subtracting and dropping the NaN rows
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
# subtracting with NaN values present - fill_value=0 keeps the non-NaN values
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64
Assuming your Code is a repeat of 'A', 'x', 'B', 'x', you can just use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the index to run from 1 to 4, as in your question, you can assign the previous expression to diff and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
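If the strict A/B alternation cannot be relied on, one alternative is to pair each B row with the most recent preceding A row via a forward fill. A sketch (using fixed Time values that match the output shown above, rather than random ones):

```python
import pandas as pd

df = pd.DataFrame({
    'Code': ['A', 'x', 'B', 'x', 'A', 'x', 'B', 'x',
             'A', 'x', 'B', 'x', 'A', 'x', 'B'],
    'Time': [89, 39, 24, 62, 83, 57, 69, 10, 87, 62, 86, 11, 54, 44, 71],
})

# carry the Time of the most recent 'A' row forward to every later row
last_a = df['Time'].where(df['Code'].eq('A')).ffill()
is_b = df['Code'].eq('B')
diff = (df.loc[is_b, 'Time'] - last_a[is_b]).reset_index(drop=True).rename('diff')
diff.index += 1
print(diff)
```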
I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row, i.e. the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd','ee','ff','gg','hh'], 'd': [1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
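As a quick cross-check, the rolling approach produces exactly the same Series as the shift-based sum from the question:

```python
import pandas as pd

df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})

# both compute the sum of each value with the next one
via_rolling = df['d'].rolling(2).sum().shift(-1)
via_shift = df['d'] + df['d'].shift(-1)

print(via_rolling.equals(via_shift))
```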
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to a MultiIndex or similar, you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get an inverse of .diff(n) you should be able to use .cumsum().diff(n+1). The issue is that the first n+1 results will be NaN.
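That identity can be sanity-checked directly: on a sample series, s.cumsum().diff(2) agrees with the direct pairwise sum s + s.shift(1) everywhere both are defined; the cumsum route just produces one extra leading NaN:

```python
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])

via_cumsum = s.cumsum().diff(2)  # sum of each value with the previous one
via_shift = s + s.shift(1)       # the same quantity, computed directly

# identical from position 2 onward; the cumsum route has one extra NaN
print(via_cumsum.iloc[2:].equals(via_shift.iloc[2:]))
```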
I have a dataframe in pandas where one of the columns (column 'b') contains strings with $ symbols:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
I want to filter the dataframe such that I only retain rows where column 'b' contains integers other than zero, with the $ symbol discarded.
My desired output is:
    a   b
0  51   3
1   2   4
4  81  23
Any feedback is welcome.
In [64]: df = pd.DataFrame({'a': [51, 2,32,99,81], 'b': ['$3', '$4','$-','$0','$23']})
In [65]: df['b'] = pd.to_numeric(df['b'].str.replace(r'\D+', '', regex=True), errors='coerce')
In [67]: df
Out[67]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
In [68]: df = df[df['b'].notnull() & df['b'].ne(0)]
In [69]: df
Out[69]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
Alternatively, we can filter it this way:
In [73]: df = df.query("b == b and b != 0")
In [74]: df
Out[74]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
df.b = pd.to_numeric(df['b'].str.replace('$', '', regex=False), errors='coerce')
df
Out[603]:
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
df.loc[(df.b.notnull())&(df.b!=0),:]
Out[604]:
a b
0 51 3.0
1 2 4.0
4 81 23.0
Similarly, using pd.to_numeric (assuming your data has this same structure throughout).
df.b = pd.to_numeric(df.b.str[1:], errors='coerce')
print(df)
a b
0 51 3.0
1 2 4.0
2 32 NaN
3 99 0.0
4 81 23.0
print (df.dropna(subset=['b']))
a b
0 51 3.0
1 2 4.0
3 99 0.0
4 81 23.0
If you want to filter out both NaNs and zeros, use:
print (df[df.b.notnull() & df.b.ne(0)])
a b
0 51 3.0
1 2 4.0
4 81 23.0
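A compact variant along the same lines, stripping the leading $ with lstrip before the numeric coercion (a sketch, not necessarily better than the answers above):

```python
import pandas as pd

df = pd.DataFrame({'a': [51, 2, 32, 99, 81],
                   'b': ['$3', '$4', '$-', '$0', '$23']})

# strip the leading '$', coerce to numbers ('$-' becomes NaN), keep nonzero rows
nums = pd.to_numeric(df['b'].str.lstrip('$'), errors='coerce')
out = df.assign(b=nums)[nums.notna() & nums.ne(0)]
print(out)
```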