I have a question about a specific problem in pandas:
I have a DataFrame with a column containing the following values:
5
4
3
2
1
0
0
0
0
1
2
3
4
5
I want to select all the rows from the first 5 to the last 0:
5
4
3
2
1
0
0
0
0
I tried drop_duplicates, but I lose the last three zeroes.
I'm thinking about using a for loop that stops when the i-th value of the column is greater than the (i-1)-th value, but I don't know how to write such a loop over a DataFrame in pandas.
Can someone help me?
Thank you in advance, I hope I've explained the problem clearly.
You could use DataFrame.shift to compare each row with the previous one, and keep only those that are less than or equal to it. Here I use np.r_ to include the first value too:
import numpy as np
df[np.r_[True, df.col.le(df.col.shift()).to_numpy()[1:]]]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
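For a self-contained run of the above (a sketch, assuming the column is named col as in the output):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [5, 4, 3, 2, 1, 0, 0, 0, 0, 1, 2, 3, 4, 5]})
# keep rows while the values are non-increasing; np.r_ prepends True
# so the first row (which has no predecessor) is always kept
mask = np.r_[True, df.col.le(df.col.shift()).to_numpy()[1:]]
print(df[mask])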
Let us try cummax:
s = df['col']
df.loc[s.eq(5).cummax() & s[::-1].eq(0).cummax()]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
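To unpack that mask (a sketch over the same df): the forward cummax turns on at the first 5 and stays on, the reversed cummax turns on at the last 0 when scanning from the end, and ANDing the two keeps exactly the span between them:
s = df['col']
start = s.eq(5).cummax()               # True from the first 5 onward
stop = s.eq(0)[::-1].cummax()[::-1]    # True up to and including the last 0
print(df.loc[start & stop])            # same nine rows as above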
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
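If you want Method #1 to come back with a clean 0..n index like Method #2, you can chain reset_index:
df.loc[df.groupby("item")["diff"].idxmin()].reset_index(drop=True)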
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform, which broadcasts the minimal value to every row of its group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answers work great if there is only one min, or you only want one. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
import pandas as pd

def filter_group(dfg, col):
    # keep every row whose value equals the group minimum
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
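For the record, here is why groupby().filter() does not fit: it keeps or drops whole groups based on one boolean per group, so it cannot select individual rows within a group. A sketch against the df built above:
# keeps every row of each group whose minimum v1 is 0 (here: all rows),
# rather than returning only the min rows per group
df.groupby('g').filter(lambda g: g['v1'].min() == 0)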
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)
For a little more explanation:
Sort the rows by the column you want the minimum of
Drop duplicates of the group column, which keeps the first (i.e. minimum) row per item
Resort the data, because it is still ordered by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
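As written, that returns only the boolean mask (True for every row after the first per item). To actually select the first, i.e. minimum, row per item, invert it against the sorted frame, e.g.:
srt = df.sort_values(by='diff')
result = srt[~srt.duplicated(subset='item', keep='first')]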
I am a beginner, and I really need help with the following:
I need to do something similar to the following, but on a two-dimensional DataFrame: Identifying consecutive occurrences of a value
I need to use that answer, but for a two-dimensional DataFrame: I need to count at least 2 consecutive ones along the columns dimension. Here is a sample DataFrame:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2
out_2_df = (out_1_df > threshold).astype(int)
I tried the following:
out_1_df = my_df.groupby((my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd
df = pd.DataFrame({0: [1,0,1,0,0,1,1,1], 1: [0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df = ((df.diff(axis=0).eq(0) | df.diff(periods=-1, axis=0).eq(0)) & df.eq(1)).sum(axis=0)
>>> out_2_df
0    3
1    5
2    4
dtype: int64
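If you specifically want the intermediate out_1_df from the linked one-column answer, here is a sketch that applies the same run-length logic column by column (note it uses >= rather than the question's >, so the sums match the desired [3 5 4]):
import pandas as pd

my_df = pd.DataFrame({0: [1,0,1,0,0,1,1,1], 1: [0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})

def run_length_times_value(col):
    # length of the run each cell belongs to, times the cell value,
    # i.e. the column-wise analogue of the linked one-column answer
    return col.groupby((col != col.shift()).cumsum()).transform('size') * col

out_1_df = my_df.apply(run_length_times_value)
threshold = 2
out_2_df = (out_1_df >= threshold).astype(int)
print(out_2_df.sum())  # 0    3
                       # 1    5
                       # 2    4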
How do I count the number of multicolumn (thing, cond=1) event occurrences prior to every (thing, cond=any) event?
(These could be winning games of poker by player, episodes of depression by patient, or so on.) For example, row index == 3, below, contains the pair (thing, cond) = (c,2), and shows the number of prior (c,1) occurrences, which is correctly (but manually) shown in the priors column as 0. I'm interested in producing a synthetic column with the count of prior (thing, 1) events for every (thing, event) pair in my data. My data are monotonically increasing in time. The natural index in the silly DataFrame can be taken as logical ticks, if it helps. (<Narrator>: It really doesn't.)
For convenience, below is the code for my test DataFrame and the manually generated priors column, which I cannot get pandas to usefully generate, no matter which combinations of groupby, cumsum, shift, where, etc. I try. I have googled and racked my brain for days. No SO answers seem to fit the bill. The key to reading the priors column is that its entries say things like, "Before this (a,1) or (a,2) event, there have been 2 (a,1) events."
[In]:
import pandas as pd
silly = pd.DataFrame({'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'], "cond": [1,2,1,2,1,2,1,2,1,2,1,2]})
silly['priors'] = pd.Series([0,0,1,0,0,0,0,2,2,1,1,3])
silly
[Out]:
silly
thing cond priors
0 a 1 0
1 b 2 0
2 a 1 1
3 c 2 0
4 b 1 0
5 c 2 0
6 c 1 0
7 a 2 2
8 a 1 2
9 b 2 1
10 c 1 1
11 a 2 3
The closest I've come is:
[In]:
silly['priors_inc'] = silly[['thing', 'cond']].where(silly['cond'] == 1).groupby('thing').cumsum() - 1
[Out]:
silly
thing cond priors priors_inc
0 a 1 0 0.0
1 b 2 0 NaN
2 a 1 1 1.0
3 c 2 0 NaN
4 b 1 0 0.0
5 c 2 0 NaN
6 c 1 0 0.0
7 a 2 2 NaN
8 a 1 2 2.0
9 b 2 1 NaN
10 c 1 1 1.0
11 a 2 3 NaN
Note that the values that are present in the incomplete priors column are correct, but not all of the desired data are there.
Please, if at all possible, withhold any "Pythonic" answers. While my real data are small compared to most ML problems, I want to learn pandas the right way, not the toy data way with Python loops or itertools chicanery that I've seen too much of already. Thank you in advance! (And I apologize for the wall of text!)
You need to:
Cumulatively count where each "cond" is 1
Do this for each "thing"
Make sure the counts are shifted by 1.
You can do this using groupby, cumsum and shift:
(df.cond.eq(1)
.groupby(df.thing)
.apply(lambda x: x.cumsum().shift())
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
Another option, avoiding the apply, is to chain two groupby calls: the first performs the cumsum, the second does the shifting.
(df.cond.eq(1)
.groupby(df.thing)
.cumsum()
.groupby(df.thing)
.shift()
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
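A small usage note, mirroring the answer above in the question's own terms (the frame is called silly there): you can assign the result back and check it against the hand-built priors column:
silly['priors_computed'] = (silly.cond.eq(1)
                            .groupby(silly.thing)
                            .cumsum()
                            .groupby(silly.thing)
                            .shift()
                            .fillna(0, downcast='infer'))
assert silly['priors_computed'].equals(silly['priors'])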
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
No problem with the first line, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)
You need to assign back to the filtered slice, converting to a NumPy array with values to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
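For context on why the original df.iloc[i:j] = df2 silently did nothing: assigning a DataFrame aligns on index labels, and df2 keeps its pre-sort labels, so each row is written straight back to its original position. Stripping the index sidesteps the alignment; to_numpy() is the modern spelling of values:
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').to_numpy()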
I have a large time-series df (2.5 mil rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated continuous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining single 0s, making a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The run length of zeros that triggers deletion should be a parameter that can be set; in this case, > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the link below, add a new column to track the length of each consecutive run, then filter on it (the threshold 2 matches the question's "> 2" parameter):
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]
We need to build a new grouping parameter here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :], df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) flags the values equal to 0
# .diff().ne(0).cumsum() starts a new group number whenever that flag changes, so consecutive equal values end up in the same group
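Pulling the two ideas together, here is a sketch with the run length as an explicit parameter, which is what the question asked for (max_run is a name introduced here for illustration):
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]})
max_run = 2  # keep zero runs of at most this length

run_id = df['A'].ne(df['A'].shift()).cumsum()        # label each run of equal values
run_len = df.groupby(run_id)['A'].transform('size')  # length of the run each row is in
df_clean = df[~(df['A'].eq(0) & (run_len > max_run))]
print(df_clean['A'].tolist())  # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]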