I have a large time-series df (2.5 million rows) that contains 0 values, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of the zero run that triggers deletion should be a settable parameter - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if its value is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the linked answer, add a new column that tracks the length of each run of consecutive values, then use it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]  # threshold from the question: runs longer than 2
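Put together as a runnable sketch on the question's data, with the run-length threshold N as the parameter:

```python
import pandas as pd

df = pd.DataFrame({'ColA': [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]})
N = 2  # drop zero runs longer than N

# size of each run of identical consecutive values
run_size = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
result = df[~((run_size > N) & (df.ColA == 0))]
print(result.ColA.tolist())  # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]
```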
We need to build a new grouping parameter here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) marks the values equal to 0
# .diff().ne(0).cumsum() numbers the runs: consecutive equal values fall into the same group
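To see the grouping in action, here is a self-contained re-run on the question's data; the New column numbers each run, so zero rows whose (A, New) pair is unique belong to runs of length 1 and survive drop_duplicates(keep=False):

```python
import pandas as pd

s = pd.Series([1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9], name='A')
df = s.to_frame()
# mark zero vs non-zero, then number each run of consecutive equal marks
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
kept = pd.concat([df[df.A.ne(0)],
                  df[df.A.eq(0)].drop_duplicates(keep=False)]).sort_index()
print(kept.A.tolist())  # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]
```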
Index number 72 is missing from the original dataframe. I want to cut the dataframe like [0:71,:], with a condition such that when the index sequence breaks, the dataframe is automatically cut at the previous index value.
Compare the shifted index values subtracted by the original values to test whether the gap is greater than 1, reverse the ordering with [::-1], apply Series.cummax, and finally filter with boolean indexing:
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
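To make the intermediate steps visible, the same one-liner can be unpacked like this (same example data):

```python
import pandas as pd

df = pd.DataFrame({'a': range(3, 13)}).drop(3)
idx = df.index.to_series()
gaps = idx.shift(-1, fill_value=0).sub(df.index).gt(1)  # True where the next index jumps by more than 1
mask = gaps[::-1].cummax()                              # propagate True from the gap backwards
print(df[mask]['a'].tolist())  # [3, 4, 5]
```

Boolean indexing aligns by index, so reversing the mask for cummax does not disturb the selection.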
I came to this:
df = pd.DataFrame({'col':[1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])
ind = next((i for i in range(len(df)-1) if df.index[i]+1!=df.index[i+1]),len(df))+1
>>> df.iloc[:ind]
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, get the values that are equal to a normal range starting from the first index, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3
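A self-contained run of the numpy approach on that example (note it assumes numpy is imported as np):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]}, index=[-1, 0, 1, 3, 4])
# the index a gap-free frame would have, starting from the first label
expected = np.arange(df.index[0], df.index[0] + len(df))  # [-1, 0, 1, 2, 3]
# once a mismatch appears, minimum.accumulate keeps the mask False from there on
mask = np.minimum.accumulate(df.index == expected)
print(df[mask])
```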
In pandas, How can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
 A   B
 5   1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff with a comparison to your value and conversion from boolean to int, using le:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB. using a le(5) comparison and inverting it yields 1 for the first value, since the NaN produced by diff compares as False
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
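For reference, a self-contained run of the updated answer on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})
N = 5

# diff() is NaN for the first row; NaN.le(N) is False, so ~ turns it into 1
df['B'] = (~df['A'].diff().le(N) | df['A'].lt(10)).astype(int)
print(df['B'].tolist())  # [1, 1, 0, 1, 0, 1]
```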
I was a little confused by your row numbering, because with the condition A_(i+1)-A_(i) the missing value should be on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1)  # this column can be removed - it is added only to show how shift works
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # rows that fulfill either condition get 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # rows that fulfill this condition get 0
df.loc[df.index == 0, 'B'] = 1  # the first row of B is always 1
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue updates values in the given column wherever the condition (mask) is fulfilled.
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns True or False. Adding them gives True if either is True (which is simply OR). If we need AND instead, we can multiply the conditions.
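A tiny demonstration of + acting as OR and * acting as AND on boolean Series:

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])
print((a + b).tolist())  # [True, True, True, False]  -> OR
print((a * b).tolist())  # [True, False, False, False] -> AND
```

That said, the explicit | and & operators express the same thing more idiomatically.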
Use Series.diff and fill the first missing value with N, so that the Series.ge comparison is True for the first row:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only rows that have 1 unique value in column value so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped id = 2, 3 because they have more than 1 unique value in column value: 15, 13 and 10, 20 respectively.
I read this answer.
But this simply removes duplicates whereas I want to check if a given column - in this case column value has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns nan for every row since I am trying to fit n returns on n+m rows after grouping. I can write a function and apply it to every row but I was wondering if there is a smart quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
Let us use filter:
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10
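A quick check that the transform and filter approaches agree on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 3, 3],
                   'value': [10, 15, 10, 10, 13, 10, 20]})

# boolean mask aligned to every row vs. a per-group predicate
via_transform = df[df.groupby('id')['value'].transform('nunique') == 1]
via_filter = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)
print(via_transform.equals(via_filter))  # True
```

The transform version is usually faster on many groups, since filter calls a Python lambda per group.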
I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to GroupBy subject.
Then iterate in each row of the GroupBy dataframe.
If for a row 'attended' == 1, then to increase a variable sum_reactive by 1.
If the sum_reactive variable reaches == 4, then to add in a dictionary the 'subject' and 'trial' in which the variable sum_reactive reached a value of 4.
I was trying to define a function for this, but it doesn't work:
def count_attended():
    sum_reactive = 0
    dict_attended = {}
    for i, g in reactive.groupby(['subject']):
        for row in g:
            if g['attended'][row] == 1:
                sum_reactive += 1
            if sum_reactive == 4:
                dict_attended.update({g['subject']: g['trial'][row]})
                return dict_attended
    return dict_attended
I think that I don't have clear how to iterate inside each GroupBy dataframe. I'm quite new using pandas.
IIUC try,
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='records')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum does the counting of attended; then checking where this value equals 4 creates a boolean series. You can use this boolean series for boolean indexing to filter your dataframe to those rows. Lastly, with loc and column filtering, select subject and trial.
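End to end, on data shaped like the question's, the whole pipeline might look like this:

```python
import pandas as pd

df = pd.DataFrame({'subject': [1]*5 + [2]*10,
                   'trial':   [1, 3, 4, 7, 8, 1, 2, 6, 8, 9, 11, 12, 13, 14, 15],
                   'attended':[1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})

kept = df.query('attended == 1')                      # only attended trials count
hits = kept.loc[kept.groupby('subject')['attended'].cumsum() == 4,
                ['subject', 'trial']]                 # the trial where the 4th hit occurs
print(hits.to_dict(orient='records'))
```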
I have a pandas data-frame where I am trying to replace/change duplicate values with 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() to look at a value from a row up or down (or several rows, specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have an identical value to one of the 2 rows above, and then replace it with a 0.
Something like this should work (edited the code to be flexible for any number of columns and for the number of days):
numberOfDays = 3 # number of days to compare
for col in df.columns:
for x in range(1, numberOfDays):
df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
g.values[1:] = 0
return g
for c in df.columns:
df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
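A runnable sketch of the whole approach on the question's data; set_zeros is rewritten here with .iloc on a copy, which avoids writing into the group's internal buffer on newer pandas versions:

```python
import pandas as pd

idx = pd.date_range('2011-01-01', periods=9)
df = pd.DataFrame({'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
                   'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
                   'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]}, index=idx)

def set_zeros(g):
    g = g.copy()
    g.iloc[1:] = 0  # keep the first occurrence in the window, zero the repeats
    return g

for c in df.columns:
    # group by equal values AND the 3-day window, zero all but the first in each group
    df[c] = df.groupby([c, pd.Grouper(freq='3D')])[c].transform(set_zeros)

print(df)
```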