I have a pandas DataFrame in which I want to replace duplicate values with 0 (I don't want to delete them) within a certain range of days.
In the example below, I want to replace duplicate values in every column with 0 within a window of, say, 3 days (the number should be adjustable). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() to look at the value one or more rows up or down (the number of rows is the x in .shift(x)).
Combine that with .loc to select every row whose value is identical to the value in one of the previous rows of the window, and replace it with 0.
Something like this should work:
(I edited the code so that it handles any number of columns and a configurable number of days.)
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
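As a self-contained illustration of the shift-based idea, the sketch below rebuilds the sample data and collects all matches in one boolean mask before writing any zeros, so every comparison is made against the original values rather than against values that were already replaced. The helper name zero_recent_duplicates, its window parameter and the daily DatetimeIndex are assumptions made for this example. The window is counted in rows, which only equals days if there is exactly one row per day.

```python
import pandas as pd

# Sample data from the question; a daily index is assumed.
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 3, 5, 5, 4, 2, 1],
        "B": [10, 12, 10, 11, 15, 23, 21, 21, 11],
        "C": [0, 2, 0, 3, 0, 1, 4, 5, 0],
    },
    index=pd.date_range("2011-01-01", periods=9, freq="D"),
)


def zero_recent_duplicates(frame, window=3):
    """Return a copy where a value equal to one of the previous
    (window - 1) rows of the same column is replaced by 0."""
    mask = pd.DataFrame(False, index=frame.index, columns=frame.columns)
    for lag in range(1, window):
        # True where the value equals the value `lag` rows earlier.
        mask = mask | frame.eq(frame.shift(lag))
    return frame.mask(mask, 0)


result = zero_recent_duplicates(df, window=3)
print(result)
```

Because the function returns a new frame, the original df stays untouched and the call can be repeated with a different window.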
I haven't found anything better than looping over all columns, because every column leads to a different grouping.
First define a function that does what you want at the group level, i.e. sets all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g
for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for other window lengths.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
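The fixed-bin grouping described above can also be sketched without a custom function. The version below is a variant, not the answer's exact code: it labels each row with an integer bin computed from the index and uses cumcount to find repeats inside each (bin, value) group. The names days_per_bin, bins and out are hypothetical, and a DatetimeIndex with daily rows is assumed. Note that fixed bins are not a sliding window: a duplicate that straddles a bin boundary is not detected.

```python
import numpy as np
import pandas as pd

# Sample data from the question; a daily DatetimeIndex is assumed.
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 3, 5, 5, 4, 2, 1],
        "B": [10, 12, 10, 11, 15, 23, 21, 21, 11],
        "C": [0, 2, 0, 3, 0, 1, 4, 5, 0],
    },
    index=pd.date_range("2011-01-01", periods=9, freq="D"),
)

days_per_bin = 3  # hypothetical parameter name

# Integer bin label per row: 0 for the first 3 days, 1 for the next 3, ...
bins = np.asarray((df.index - df.index[0]).days // days_per_bin)

out = df.copy()
for col in df.columns:
    # cumcount() numbers the rows of each (bin, value) group from 0,
    # so anything above 0 repeats a value already seen in that bin.
    is_repeat = df.groupby([bins, df[col]]).cumcount() > 0
    out[col] = df[col].mask(is_repeat, 0)

print(out)
```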
In pandas, how can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
A B
5 1 (the first B_i)
12 1
14 0
22 1
20 0
33 1
Use diff, compare the result against your threshold with le, and convert the resulting booleans to int with astype:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: diff() yields NaN for the first row, and comparing NaN with le(5) gives False, so inverting the le(5) comparison produces 1 for the first value.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
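To make the role of the inversion easier to follow, the intermediate steps can be spelled out on the sample data. This is only an unrolled sketch of the same one-liner; the variable names delta and within are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 12, 14, 22, 20, 33]})
N = 5

# Difference to the previous row; the first row has no predecessor (NaN).
delta = df["A"].diff()

# True where the step is at most N; a comparison with NaN is False.
within = delta.le(N)

# Inverting turns both "step larger than N" and "no predecessor" into 1.
df["B"] = (~within).astype(int)
print(df)
```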
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering: if B_i is computed from A_(i+1)-A_(i), the missing value should be in the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row has no A_(i+1)).
Anyway, based on your example I assumed that we are calculating B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1) #This line can be removed - it was added only to show how shift works in the final dataframe
df['B']=''
df.loc[((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10), 'B']=1 #Update rows that fulfill one of conditions with 1
df.loc[(df['A']-df['A'].shift(1))<=5, 'B']=0 #Update rows that fulfill condition with 0
df.loc[df.index==0, 'B']=1 #Update first row in B column
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure whether it is the fastest way, but I think it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname]=newvalue updates the value in the given column wherever the condition (mask) is fulfilled.
((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10)
Each condition here returns True or False. Adding them gives True if either of them is True (which is simply OR). If we need AND, we can multiply the conditions.
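The claim about adding and multiplying conditions can be checked on two small boolean Series. This is a minimal sketch with made-up values; the operators | and & are the more conventional spelling and give the same result here.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

either = a + b  # element-wise OR for boolean Series
both = a * b    # element-wise AND for boolean Series

print(pd.DataFrame({"a": a, "b": b, "a + b": either, "a * b": both}))
```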
Use Series.diff, fill the missing first difference with N, and compare with Series.ge so that the first row evaluates to True:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only rows that have 1 unique value in column value so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped id = 2 and 3 because they have more than 1 unique value in column value (15, 13 and 10, 20 respectively).
I read this answer.
But this simply removes duplicates, whereas I want to check whether a given column (in this case, column value) has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns nan for every row since I am trying to fit n returns on n+m rows after grouping. I can write a function and apply it to every row but I was wondering if there is a smart quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
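The difference between the aggregated result and the transformed one is easiest to see by comparing their lengths. The sketch below rebuilds the question's data; the names per_group, per_row and kept are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 1, 1, 2, 3, 3], "value": [10, 15, 10, 10, 13, 10, 20]}
)

# Aggregation: one entry per id, indexed by id.
per_group = df.groupby("id")["value"].nunique()

# Transform: the group result is broadcast back, one entry per original row.
per_row = df.groupby("id")["value"].transform("nunique")

# The transformed Series shares df's index, so it works as a row filter.
kept = df[per_row == 1]
print(kept)
```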
Let us use filter:
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10
Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default; then, after Series.reset_index, add DataFrame.drop_duplicates to keep only the top value per group:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
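When two values of 'B' are equally frequent within a group (as 5 and 0 are for A = 1), which one ends up on top depends on the sort order. The sketch below is an alternative that makes the tie-break explicit by counting with size and sorting on the value as well. Breaking ties in favour of the smallest 'B' is an assumption made here, not something the question specifies.

```python
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=["A", "B", "C"])

# Count every (A, B) combination once.
counts = df_test.groupby(["A", "B"]).size().reset_index(name="freq")

# Highest count first within each A; ties go to the smallest B (assumption).
out = (
    counts.sort_values(["A", "freq", "B"], ascending=[True, False, True])
    .drop_duplicates("A")
    .rename(columns={"B": "most_freq"})
    .reset_index(drop=True)
)
print(out)
```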
I want to build on a previous question of mine.
Let's look at some Python code.
import numpy as np
import pandas as pd
mat = np.array([[1,2,3],[4,5,6]])
df_mat = pd.DataFrame(mat)
df_mat_tidy = (df_mat.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
df_mat_tidy
This takes me from a pivot table (mat) to a "tidy" (in the Tidyverse sense) version of the data that gives one variable as the column from which the number came, one variable as the row from which the number came, and one variable as the number in the pivot table at the row-column position.
Now I want to expand on that to get the row-column pair repeated the number of times the pivot table specifies. In other words, if position 1,1 has value 3 and position 2,1 has value 4, I want the data frame to go
col row
1 1
1 1
1 1
1 2
1 2
1 2
1 2
instead of
col row value
1 1 3
1 2 4
I think I know how to loop over the rows of the second example and produce that, but I want something faster.
Is there a way to "melt" the pivot table the way that I am describing?
Have a look at the section of the pandas docs entitled "Reshaping and pivot tables".
.pivot(), .pivot_table() and .melt() all exist already. It looks like you are reinventing some wheels.
You could just rebuild a DataFrame from a comprehension:
pd.DataFrame([i for j in [[[rec['V1'], rec['V2']]] * rec['value']
for rec in df_mat_tidy.to_dict(orient='records')]
for i in j], columns=['col', 'row'])
It gives as expected:
col row
0 0 0
1 0 1
2 0 1
3 0 2
4 0 2
5 0 2
6 1 0
7 1 0
8 1 0
9 1 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 1 2
16 1 2
17 1 2
18 1 2
19 1 2
20 1 2
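Since the question asks for something faster than a Python-level loop, a vectorized variant may be worth trying. The sketch below works directly on the original array with numpy.repeat instead of going through df_mat_tidy. It assumes all counts are non-negative integers and keeps the names V1 (row position) and V2 (column position) from the question. No timing was measured here.

```python
import numpy as np
import pandas as pd

mat = np.array([[1, 2, 3], [4, 5, 6]])

# Row and column position of every cell, in row-major order.
rows, cols = np.indices(mat.shape)
counts = mat.ravel()

# Repeat each (row, column) pair as many times as its cell value says.
expanded = pd.DataFrame(
    {
        "V1": np.repeat(rows.ravel(), counts),
        "V2": np.repeat(cols.ravel(), counts),
    }
)
print(expanded)
```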
I have a large time series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0s to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The run length of zeros that triggers deletion is a parameter that has to be set; in this case it is > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if its value is 0 and the previous or next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the link, add a new column that tracks the length of each consecutive run, then filter on it:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]
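A self-contained run of the run-length approach on the question's data might look like the sketch below. The threshold name max_zero_run is hypothetical; rows are dropped only when they are zero and belong to a run longer than that threshold.

```python
import pandas as pd

df = pd.DataFrame(
    {"ColA": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]}
)
max_zero_run = 2  # hypothetical name: longest run of zeros that is kept

# A new run starts whenever the value differs from the previous row.
run_id = (df["ColA"] != df["ColA"].shift()).cumsum().rename("run")

# Length of the run each row belongs to.
run_len = df.groupby(run_id)["ColA"].transform("size")

result = df[~((df["ColA"] == 0) & (run_len > max_zero_run))]
print(result["ColA"].tolist())
```

The repeated 8s form a run of length 2 as well, but they are kept because the filter only targets zeros.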
We need to build a new helper column here and then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) flags the values equal to 0
# diff().ne(0).cumsum() starts a new group number whenever that flag changes, so consecutive rows with the same flag share a group
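The grouping step can be traced on a shortened version of the data. The sketch below only illustrates how the New column and drop_duplicates(keep=False) interact; the variable names are chosen here. Because keep=False discards every row that has a duplicate, this approach removes any run of two or more zeros, not only runs longer than 2.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1]})

# 1 where the value is zero, 0 elsewhere.
is_zero = df["A"].eq(0).astype(int)

# The group number increases every time the zero flag changes.
df["New"] = is_zero.diff().ne(0).cumsum()

# Zero rows that share (A, New) with another row belong to a longer run;
# keep=False drops all of them and leaves only isolated zeros.
zeros = df[df["A"].eq(0)]
isolated = zeros.drop_duplicates(keep=False)
print(isolated)
```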