I would like to create a filter that allows me to retrieve only the values of opposite signs on a certain column. (example 10, -10, 22, -22)
How can I do this? Thanks
I would like to keep only B codes whose opposite value is in A.
The exact logic and expected output are unclear (please provide an example), but you could use the absolute value and the sign as groupers:
out = (df
.assign(abs=df['col'].abs(), sign=np.sign(df['col']))
.pivot(index='abs', columns='sign')
)
output:
id col
sign -1 1 -1 1
abs
4 NaN 4.0 NaN 4.0
7 5.0 NaN -7.0 NaN
10 3.0 0.0 -10.0 10.0
22 2.0 1.0 -22.0 22.0
used input:
df = pd.DataFrame({'id': range(6),
'col': [10, 22, -22, -10, 4, -7],
})
id col
0 0 10
1 1 22
2 2 -22
3 3 -10
4 4 4
5 5 -7
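If the goal is simply to keep the rows whose sign-flipped counterpart also occurs in the column (as the 10, -10, 22, -22 example suggests), a minimal sketch under that assumption:
import pandas as pd

df = pd.DataFrame({'id': range(6),
                   'col': [10, 22, -22, -10, 4, -7]})

# keep rows whose negated value also appears somewhere in the column
out = df[df['col'].isin(-df['col'])]
# keeps 10, 22, -22, -10 and drops 4 and -7, whose opposites are absent
Note that a 0 would match itself, since 0 is its own negation.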
I'd like to know if I can do all this in one line, rather than multiple lines.
my dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
                   'B': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
                   'desired_output': [5, 5, 5, 5, 5, 5, 20, 20, 20, 20, 20, 20]})
df
ID A B desired_output
0 1 1.0 0 5
1 1 2.0 1 5
2 1 3.0 1 5
3 1 10.0 0 5
4 1 NaN 1 5
5 1 5.0 1 5
6 2 20.0 1 20
7 2 6.0 1 20
8 2 7.0 1 20
9 2 NaN 0 20
10 2 NaN 1 20
11 2 NaN 0 20
I'm trying to find the maximum value of column A for the rows where column B == 1, grouped by column ID, and transform the result directly back into the dataframe without any extra merging. Something like the following (but without getting errors!):
df['desired_output'] = df.groupby('ID').A.where(df.B == 1).transform('max') ## this gives error
The max function should ignore the NaNs as well. I wonder if I'm trying to do too much in one line, but one can hope there is a way to write this cleanly.
EDIT:
I can get a very similar output by changing the where clause:
df['desired_output'] = df.where(df.B == 1).groupby('ID').A.transform('max') ## this works but the output is not what I want
but the output is not exactly what I want: desired_output should not contain any NaN, unless all values of A are NaN where B == 1.
Here is a way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID' : [1,1,1,1,1,1,2,2,2,2,2,2],
'A': [1, 2, 3, 10, np.nan, 5 , 20, 6, 7, np.nan, np.nan, np.nan],
'B': [0,1,1,0,1,1,1,1,1,0,1,0],
'desired_output' : [5,5,5,5,5,5,20,20,20,20,20,20]
})
df['output'] = df[df.B == 1].groupby('ID').A.max()[df.ID].array
df
Result:
ID A B desired_output output
0 1 1.0 0 5 5.0
1 1 2.0 1 5 5.0
2 1 3.0 1 5 5.0
3 1 10.0 0 5 5.0
4 1 NaN 1 5 5.0
5 1 5.0 1 5 5.0
6 2 20.0 1 20 20.0
7 2 6.0 1 20 20.0
8 2 7.0 1 20 20.0
9 2 NaN 0 20 20.0
10 2 NaN 1 20 20.0
11 2 NaN 0 20 20.0
Decomposition:
df[df.B == 1] # start by filtering on B
.groupby('ID') # group by ID
.A.max() # get max values in column A
[df.ID] # recast the result on ID series shape
.array # fetch the raw values from the Series
Important note: this relies on the index being as in the given example, that is, sorted, starting from 0, with an increment of 1. If that is not the case, reset_index() your DataFrame before this operation.
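As an aside, the one-liner the question was reaching for can be made to work by applying where to the column before grouping, which also sidesteps the index assumption above. A sketch on the same data:
# mask A where B != 1, then take the group-wise max; transform broadcasts
# the result back onto the original index, so no reset_index is needed
df['output'] = df['A'].where(df['B'].eq(1)).groupby(df['ID']).transform('max')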
Suppose I have a DataFrame with some NaNs:
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
     0    1    2
0  1.0  2.0  3.0
1  4.0  NaN  NaN
2  NaN  NaN  9.0
The result should look like this, where each NaN is replaced by the previous non-NaN value in its column plus 10:
      0     1     2
0   1.0   2.0   3.0
1   4.0  12.0  13.0
2  14.0  12.0   9.0
Is there any way to do this with built-in methods, or do I have to iterate over each column?
You can use ffill() to fill the NaNs with the previous non-NaN value, and then a simple mask to increment all by 10:
result = df.ffill()
result[df.isna()] += 10
Output
0 1 2
0 1.0 2.0 3.0
1 4.0 12.0 13.0
2 14.0 12.0 9.0
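Note that with consecutive NaNs (column 1 above), each filled cell gets the last valid value plus 10, not a cumulative +10 per step, which is exactly what the expected output in the question (12 and 12) shows.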
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is DataFrame.shift:
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
So the code below should do what you're looking for (add_prefix prefixes the column names to make them unique):
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
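The std here is taken over every column of the three stacked rows, which matches the question's np.std example for this 3-column frame. If the window really must span a fixed number of columns, a NumPy sketch using sliding_window_view (NumPy 1.20+), assuming the window covers the 3 rows centred on the current row and the 3 columns ending at the current column, as the example implies:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# pad with NaN so edge windows keep the same shape:
# one row above/below, two columns to the left (window columns are j-2 .. j)
padded = np.pad(df.to_numpy(float), ((1, 1), (2, 0)), constant_values=np.nan)

# every 3x3 window; result shape is (n_rows, n_cols, 3, 3)
windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))

# NaN-aware population std over each window, like apply(np.std) above
std_df = pd.DataFrame(np.nanstd(windows, axis=(2, 3)),
                      index=df.index, columns=df.columns)

# std_df.loc[1, 'C'] == np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])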
I need to combine multiple rows into a single row, and the original dataframe looks like:
IndividualID DayID TripID JourSequence TripPurpose
200100000001 1 1 1 3
200100000001 1 2 2 31
200100000001 1 3 3 23
200100000001 1 4 4 5
200100000009 1 55 1 3
200100000009 1 56 2 12
200100000009 1 57 3 4
200100000009 1 58 4 6
200100000009 1 59 5 19
200100000009 1 60 6 2
I was trying to build some sort of 'trip chain', so basically all the journey sequences and trip purposes of one individual on a single day should be in the same row...
Ideally I was trying to convert the table to something like this:
IndividualID DayID Seq1 TripPurp1 Seq2 TripPurp2 Seq3 TripPurp3 Seq4 TripPurp4
200100000001 1 1 3 2 31 3 23 4 5
200100000009 1 1 3 2 12 3 4 4 6
If this is not possible, then the following format would also be fine:
IndividualID DayID TripPurposes
200100000001 1 3, 31, 23, 5
200100000009 1 3, 12, 4, 6
Are there any possible solutions? I was thinking of a for loop / while statement, but maybe that is not a good idea.
Thanks in advance!
You can try:
df_out = (df.set_index(['IndividualID', 'DayID',
                        df.groupby(['IndividualID', 'DayID']).cumcount() + 1])
            .unstack()
            .sort_index(level=1, axis=1))
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
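Here cumcount numbers the trips within each (IndividualID, DayID) pair; that counter becomes the third index level, and unstack pivots it into the _1, _2, ... column suffixes.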
Output:
IndividualID DayID JourSequence_1 TripID_1 TripPurpose_1 \
0 200100000001 1 1.0 1.0 3.0
1 200100000009 1 1.0 55.0 3.0
JourSequence_2 TripID_2 TripPurpose_2 JourSequence_3 TripID_3 \
0 2.0 2.0 31.0 3.0 3.0
1 2.0 56.0 12.0 3.0 57.0
TripPurpose_3 JourSequence_4 TripID_4 TripPurpose_4 JourSequence_5 \
0 23.0 4.0 4.0 5.0 NaN
1 4.0 4.0 58.0 6.0 5.0
TripID_5 TripPurpose_5 JourSequence_6 TripID_6 TripPurpose_6
0 NaN NaN NaN NaN NaN
1 59.0 19.0 6.0 60.0 2.0
To get your second output you just need to groupby and apply list:
df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list)
TripPurpose
IndividualID DayID
200100000001 1 [3, 31, 23, 5]
200100000009 1 [3, 12, 4, 6, 19, 2]
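If the comma-separated string shown in the question is preferred over a list, a small variation of the same groupby (a sketch):
out = (df.groupby(['IndividualID', 'DayID'])['TripPurpose']
         .agg(lambda s: ', '.join(s.astype(str)))
         .reset_index())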
To get your first output you can do something like this (probably not the best approach):
df2 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list))
trip = df2['TripPurpose'].apply(pd.Series).rename(columns = lambda x: 'TripPurpose'+ str(x+1))
df3 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['JourSequence'].apply(list))
seq = df3['JourSequence'].apply(pd.Series).rename(columns = lambda x: 'seq'+ str(x+1))
pd.merge(trip,seq,on=['IndividualID','DayID'])
Note that the output is not sorted.
I am struggling to get the right index (restricted to the selection) when using the pandas method xs to select specific data in my dataframe. Let me demonstrate what I am doing:
print(df)
value
idx1 idx2 idx3 idx4 idx5
10 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
7 8 6.0 ...
8 9 6.0 ...
20 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
18 19 6.0 ...
19 20 6.0 ...
# get dataframe for idx1 = 10, idx2 = 2.0, idx3 = 0.0010
print(df.xs([10,2.0,0.0010]))
value
idx4 idx5
1 2 6.0 ...
2 3 6.0 ...
3 4 6.0 ...
4 5 6.0 ...
5 6 6.0 ...
6 7 6.0 ...
7 8 6.0 ...
8 9 6.0 ...
# get the first index list of this part of the dataframe
print(df.xs([10,2.0,0.0010]).index.levels[0])
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
So I do not understand why the full list of values that occur in idx4 is returned, even though we restricted the dataframe to a part where idx4 only takes values from 1 to 8. Am I using the index method in a wrong way?
This is a known feature, not a bug. pandas preserves all of the index information. You can determine which of the levels are expressed, and at what locations, via the codes attribute (called labels in older pandas versions).
If you are looking to create an index that is fresh and just contains the information relevant to the slice you just made, you can do this:
df_new = df.xs([10,2.0,0.0010])
idx_new = pd.MultiIndex.from_tuples(df_new.index.to_series(),
                                    names=df_new.index.names)
df_new.index = idx_new
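In recent pandas versions, MultiIndex.remove_unused_levels (pandas 0.20+) does this in one step, and unique() on a level gives just the values present in the slice. A sketch on the same selection:
df_new = df.xs([10, 2.0, 0.0010])

# drop level entries that no longer occur in the sliced frame
df_new.index = df_new.index.remove_unused_levels()

# or ask directly for the values actually present in a given level
present_idx4 = df_new.index.get_level_values('idx4').unique()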