I am trying to add a running count to a pandas df.
For the values in Column A, I want to add '5' and for values in Column B I want to add '1'.
So for the df below I'm hoping to produce:
A B Total
0 0 0 0
1 0 0 0
2 1 0 5
3 1 1 6
4 1 1 6
5 2 1 11
6 2 2 12
So every increment of 1 in Column A adds '5' to the total, while every increment of 1 in Column B adds '1'.
I tried:
df['Total'] = df['A'].cumsum(axis = 0)
But this doesn't include Column B
df['Total'] = df['A'] * 5 + df['B']
As far as I can tell, you are simply doing row wise operations, not a cumulative sum. This snippet calculates the row value of A times 5 and adds the row value of B for each row. Please don't make it any more complicated than it really is.
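To check the one-liner against the table in the question, here is a minimal sketch. The frame is rebuilt by hand from the A and B columns shown above.

```python
import pandas as pd

# Columns A and B copied from the table in the question
df = pd.DataFrame({"A": [0, 0, 1, 1, 1, 2, 2],
                   "B": [0, 0, 0, 1, 1, 1, 2]})

# Row-wise: each unit in A is worth 5, each unit in B is worth 1
df["Total"] = df["A"] * 5 + df["B"]
print(df)
```

The Total column comes out as 0, 0, 5, 6, 6, 11, 12, which is the desired result.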
What is a cumulative sum (also called running total)?
From Wikipedia:
Consider the sequence < 5 8 3 2 >. What is the total of this sequence?
Answer: 5 + 8 + 3 + 2 = 18. This is arrived at by simple summation of the sequence.
Now we insert the number 6 at the end of the sequence to get < 5 8 3 2 6 >. What is the total of that sequence?
Answer: 5 + 8 + 3 + 2 + 6 = 24. This is arrived at by simple summation of the sequence. But if we regarded 18 as the running total, we need only add 6 to 18 to get 24. So, 18 was, and 24 now is, the running total. In fact, we would not even need to know the sequence at all, but simply add 6 to 18 to get the new running total; as each new number is added, we get a new running total.
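For comparison, this is what an actual running total looks like in pandas, using the sequence from the quote. It is only a small illustration; the variable names are made up.

```python
import pandas as pd

s = pd.Series([5, 8, 3, 2, 6])

# cumsum() returns the running total after each element
running = s.cumsum()
print(running.tolist())
```

The running total is 18 after the fourth element and 24 once 6 has been appended, exactly as in the quoted example.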
Related
In pandas, How can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
A    B
5    1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff, compare the result to your threshold with le, invert it, and convert the boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB. diff() returns NaN for the first row and NaN.le(5) is False, so using a le(5) comparison with inversion gives 1 for the first value
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
updated answer, simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
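To make the intermediate steps visible, here is a self-contained sketch of the same idea on the example data. The names step and big_jump are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 12, 14, 22, 20, 33]})

# Difference to the previous row; the first row has no predecessor, so it is NaN
step = df["A"].diff()

# NaN <= 5 evaluates to False, so after inversion the first row is True
big_jump = ~step.le(5)

# Second condition combined with OR
df["B"] = (big_jump | df["A"].lt(10)).astype(int)
print(df)
```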
I was a little confused by your row numbering: if B_i is calculated from A_(i+1)-A_(i), the missing value should fall on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row has no A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1) #This line can be removed - it was added only to show how shift works in the final dataframe
df['B']=''
df.loc[((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10), 'B']=1 #Update rows that fulfill one of conditions with 1
df.loc[(df['A']-df['A'].shift(1))<=5, 'B']=0 #Update rows that fulfill condition with 0
df.loc[df.index==0, 'B']=1 #Update first row in B column
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure it is the fastest way, but I think it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname]=newvalue updates the value in the given column wherever the condition (mask) is fulfilled
((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10)
Each condition here returns True or False. Adding them gives True if either of them is True (which is simply OR). If we need AND, we can multiply the conditions.
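A quick way to convince yourself of this is to compare the arithmetic forms with the logical operators on two small boolean Series. This is only a sketch; a and b are made-up examples.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

added = a + b        # behaves like OR on boolean Series
multiplied = a * b   # behaves like AND on boolean Series

print(added.tolist(), (a | b).tolist())
print(multiplied.tolist(), (a & b).tolist())
```

The results agree, although | and & are probably the more common way to write such masks in pandas because they state the intent directly.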
Use Series.diff, fill the first missing value with N so that it passes the greater-or-equal comparison done by Series.ge, and combine the result with the second condition:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
I have a dataframe like this:
df1
a b c
0 1 2 [bg10, ng45, fg56]
1 4 5 [cv10, fg56]
2 7 8 [bg10, ng45, fg56]
3 7 8 [fg56, fg56]
4 7 8 [bg10]
I would like to count the total number of occurrences of each type in column 'c'. I would then like to return the value of column 'b' for the values in column 'c' that have a total count of '1'.
The expected output is something like this:
c b total_count
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
I have tried the 'collections' library and a 'for' loop (I understand it's not best practice in pandas), but I think I'm missing some fundamental understanding of lists within cells and how to perform analyses like these.
Thank you for taking my question into consideration.
I would use apply in the following way.
First I create the df:
import pandas as pd
df1=pd.DataFrame({"b":[2,5,8,8], "c":[['bg10', 'ng45', 'fg56'],['cv10', 'fg56'],['bg10', 'ng45', 'fg56'],['fg56', 'fg56']]})
Next, use apply to count the number of (non-unique) items in each list and save it in a new column:
df1["count_c"]=df1.c.apply(len)
you will get the following:
b c count_c
0 2 [bg10, ng45, fg56] 3
1 5 [cv10, fg56] 2
2 8 [bg10, ng45, fg56] 3
3 8 [fg56, fg56] 2
To get the rows where the count is larger than a threshold:
df1[df1["count_c"]>2]["b"]
note: if you want to count only unique values in each list in column c you should use:
df1["count_c"]=df1.c.apply(lambda x: len(set(x)))
EDIT
To count the total number of each item, I would try this:
First, let's unpack all the lists so that every item gets its own row:
new_df1=(df1.c.apply(lambda x: pd.Series(x))).stack().reset_index(level=1,drop=True).to_frame("c").join(df1[["b"]],how="left")
then get the total counts of each item in the list and add it to a new col:
counts_dict=new_df1.c.value_counts().to_dict()
new_df1["total_count_c"]=new_df1.c.map(counts_dict)
new_df1.head()
c b total_count_c
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
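On pandas 0.25 or later, DataFrame.explode can do the unpacking step in one call. The following is a sketch under that version assumption, using the same four-row frame as above; long_df and only_once are names chosen just for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "b": [2, 5, 8, 8],
    "c": [["bg10", "ng45", "fg56"], ["cv10", "fg56"],
          ["bg10", "ng45", "fg56"], ["fg56", "fg56"]],
})

# One row per list item; the original index and column b are repeated
long_df = df1.explode("c")

# Total number of occurrences of each item across all lists
long_df["total_count_c"] = long_df["c"].map(long_df["c"].value_counts())

# Values of b for the items that occur exactly once overall
only_once = long_df.loc[long_df["total_count_c"] == 1, ["c", "b"]]
print(long_df.head())
print(only_once)
```

With this data only cv10 occurs once, so only_once contains a single row with b equal to 5.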
I have a pandas data-frame where I am trying to replace/ change the duplicate values to 0 (don't want to delete the values) within a certain range of days.
So, in example given below, I want to replace duplicate values in all columns with 0 within a range of let's say 3 (the number can be changed) days. Desired result is also given below
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() to look at the value in a row above or below (or several rows away, specified by the number x in .shift(x)).
You can combine that with .loc to select all rows whose value is identical to the value in either of the 2 rows above, and then replace it with 0.
Something like this should work:
(Edited the code so that it works for any number of columns and any number of days.)
numberOfDays = 3 # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        # zero out values that repeat the value found x rows earlier
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
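For anyone who wants to reproduce this, here is a self-contained sketch of the shift approach. It assumes the dates in the question are consecutive days (January 1 to January 9, 2011), and the function name zero_recent_repeats is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [2, 2, 2, 3, 5, 5, 4, 2, 1],
     "B": [10, 12, 10, 11, 15, 23, 21, 21, 11],
     "C": [0, 2, 0, 3, 0, 1, 4, 5, 0]},
    index=pd.date_range("2011-01-01", periods=9, freq="D"),  # assumed daily dates
)

def zero_recent_repeats(frame, window):
    """Set a value to 0 if it equals a value from the previous window - 1 rows."""
    out = frame.copy()
    for col in out.columns:
        for lag in range(1, window):
            out.loc[out[col] == out[col].shift(lag), col] = 0
    return out

result = zero_recent_repeats(df, window=3)
print(result)
```

Working on a copy keeps the original frame intact, which makes it easier to try different window sizes.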
I can't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g
for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
I'm a new Python user (making the shift from VBA) and am having trouble figuring out Python's loop function. I have a dataframe df, and I want to create a column of variables based on some condition being met in another column, based on a loop. Something like the below:
cycle = 5
dummy = 1
for i = 1 to cycle
    if df["high"].iloc[i] >= df["exit"].iloc[i] and
       df["low"].iloc[i] <= df["exit"].iloc[i] then
        df["signal"] = dummy
        break
    elif i = cycle
        df["signal"] = cycle + 1
        break
    else:
        dummy = dummy + 1
next i
Basically trying to find in which column over the next columns up to the cycle variable are the conditions in the if statement met, and if they're never met, assign cycle + 1. So df["signal"] will be a column of numbers ranging 1 -> (cycle + 1). Also, there are some NaN values in df["exit"], not sure how that affects the loop.
I've found fairly extensive documentation on row iterations on the site, I feel like this is close to where I need to get to, but can't figure out how to adapt it. Thanks for any advice!
EDIT: INCLUDED DATA SAMPLE FROM EXCEL CELLS BELOW:
high low EXIT test signal/(OUTPUT COLUMN)
4 3 4 1 1
2 2 2 1 1
2 3 5 0 6
4 3 1 0 5
2 5 2 0 4
5 5 1 0 3
3 1 5 0 2
5 1 5 1 1
1 1 4 0 0
EDIT 2: FURTHER CLARIFICATION AROUND SCRIPT
Once the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met in the loop, it should terminate for that particular instance/row.
EDIT 3: EXPECTED OUTPUT
The expected output is the df["signal"] column - it is the first instance in the loop where the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met in any given row. The output in df["signal"] is effectively i from the loop, or the given iteration.
Here is how I would solve the problem. The column 'gr' must not exist before doing this:
# first check all the rows meeting the conditions and add 1 in a temporary column gr
df.loc[(df["high"] >= df["exit"]) & (df["low"] <= df["exit"]), 'gr'] = 1
# manipulate column gr to use groupby after
df['gr'] = df['gr'].cumsum().bfill()
# use cumcount after groupby to recalculate signal
df.loc[:,'signal'] = df.groupby('gr').cumcount(ascending=False).add(1)
# cut the value in signal to the value cycle + 1
df.loc[df['signal'] > cycle, 'signal'] = cycle + 1
# drop column gr
df = df.drop(columns='gr')
and you get
high low exit signal
0 4 3 4 1
1 2 2 2 1
2 2 3 5 6
3 4 3 1 5
4 2 5 2 4
5 5 5 1 3
6 3 1 5 2
7 5 1 5 1
8 1 1 4 1
Note: The last row is not handled properly, because no later row meets the condition, and I am not sure how this looks in the full data or how you want to handle it. You may consider adding df = df.dropna(subset=['gr']) after the line starting with df['gr'] = ... to drop these last rows; it's up to you.
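Here is a self-contained sketch of the same grouping idea on the sample data from the question. It assumes rows after the last match should simply be dropped, and it uses a separate Series (called group here, a name chosen for the example) instead of a temporary column.

```python
import pandas as pd

cycle = 5
df = pd.DataFrame({
    "high": [4, 2, 2, 4, 2, 5, 3, 5, 1],
    "low":  [3, 2, 3, 3, 5, 5, 1, 1, 1],
    "exit": [4, 2, 5, 1, 2, 1, 5, 5, 4],
})

# Rows where the exit price lies between low and high
hit = (df["high"] >= df["exit"]) & (df["low"] <= df["exit"])

# Give every row the id of the next hit at or after it; rows after the last hit stay NaN
group = hit.cumsum().where(hit).bfill()

# Assumption: rows with no later hit are dropped
valid = group.notna()
result = df[valid].copy()

# Count backwards inside each group: the hit row is 1, the row before it is 2, ...
result["signal"] = (
    result.groupby(group[valid])
          .cumcount(ascending=False)
          .add(1)
          .clip(upper=cycle + 1)   # cap at cycle + 1
)
print(result)
```

For the first eight rows this gives 1, 1, 6, 5, 4, 3, 2, 1, matching the signal column in the question. The ninth row has no later match and is left out.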
I have a large time series df (2.5mil rows) that contain 0 values in a given row, some of which are legitimate. However if there are repeated continuous occurrences of zero values I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of zero values before deletion being a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove the row if it is 0 and either previous or next row in same column is 0. You can use shift to look for previous and next value and compare with current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following the example in the link, add a new column that tracks the length of each consecutive run and then check it when filtering:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>2) & (df.ColA==0))]
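The run-length filter can be wrapped in a small function so that the threshold becomes a parameter. This is a sketch on the list from the question; drop_long_zero_runs and max_run are names made up for the example.

```python
import pandas as pd

def drop_long_zero_runs(df, col, max_run):
    """Drop zeros that sit in a run of more than max_run consecutive zeros."""
    # A new run starts wherever the value differs from the previous row
    run_id = (df[col] != df[col].shift()).cumsum()
    # Length of the run that each row belongs to
    run_len = df[col].groupby(run_id).transform("size")
    return df[~((df[col] == 0) & (run_len > max_run))]

df = pd.DataFrame(
    {"ColA": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]}
)
result = drop_long_zero_runs(df, "ColA", max_run=2)
print(result["ColA"].tolist())
```

With max_run=2 the single zeros survive and the runs of three and four zeros are removed, which is the list asked for in the question.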
We need to build a new helper column here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
#df.A.eq(0) flags the values equal to 0
#diff().ne(0).cumsum() starts a new group number every time that flag changes, so each consecutive run of zeros (or of non-zeros) shares one group number.
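To see how the group numbers come about, the helper column can be built step by step on the list from the question. This is just a sketch; is_zero and changed are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]}
)

is_zero = df["A"].eq(0).astype(int)   # 1 where A is 0, else 0
changed = is_zero.diff().ne(0)        # True where the zero/non-zero flag flips
df["New"] = changed.cumsum()          # running count of flips = group number
print(df)
```

The first row counts as a change because its diff is NaN, so numbering starts at 1. The three zeros at positions 6 to 8 all get group 4 and the four zeros at positions 15 to 18 all get group 8, which is why drop_duplicates(keep=False) removes them.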