Dataframe: calculate difference in dates column by another column - python

I'm trying to calculate a running difference on the date column, depending on the "event" column. That is, I want to add another column with the date difference between consecutive 1s in the event column (it contains only 0s and 1s).
So far I've come up with this half-working solution.
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
|    1 |     0 |        3 |
|    2 |     0 |        3 |
|    3 |     1 |        3 |
|    4 |     0 |        6 |
|    5 |     0 |        6 |
|    6 |     0 |        6 |
|    7 |     0 |        6 |
|    8 |     0 |        6 |
|    9 |     1 |        6 |
|   10 |     0 |        4 |
|   11 |     0 |        4 |
|   12 |     0 |        4 |
|   13 |     1 |        4 |
|   14 |     0 |        2 |
|   15 |     1 |        2 |
+------+-------+----------+

Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].bfill()
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].bfill()
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').bfill()
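The same date-diff idea can also be written without the merge, by reindexing the per-event durations back onto the full frame and backfilling (a sketch, equivalent to the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'date': range(1, 18),
                   'event': [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]})

# Dates at which an event fired
event_dates = df.loc[df['event'] == 1, 'date']

# Gap between consecutive events; the first event is measured from 0
durations = event_dates.diff().fillna(event_dates)

# Put each duration back on its event row, then backfill it over the
# rows leading up to that event (rows after the last event stay NaN)
df['duration'] = durations.reindex(df.index).bfill()
```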

Related

Transposing group of data in pandas dataframe

I have a large dataframe like this:
|type| qt | vol|
|----|---- | -- |
| A | 1 | 10 |
| A | 2 | 12 |
| A | 1 | 12 |
| B | 3 | 11 |
| B | 4 | 20 |
| B | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
How can I transpose the dataframe so the groups sit side by side, like this?
|A. |B. |C. |
|--------------|--------------|--------------|
|type| qt | vol|type| qt | vol|type| qt | vol|
|----|----| ---|----|----| ---|----|----| ---|
| A | 1 | 10 | B | 3 | 11 | C | 4 | 20 |
| A | 2 | 12 | B | 4 | 20 | C | 4 | 20 |
| A | 1 | 12 | B | 4 | 20 | C | 4 | 20 |
| C | 4 | 20 |
You can group the dataframe on type then create key-value pairs of groups inside a dict comprehension, finally use concat along axis=1 and pass the optional keys parameter to get the final result:
d = {k:g.reset_index(drop=True) for k, g in df.groupby('type')}
pd.concat(d.values(), keys=d.keys(), axis=1)
Alternatively you can use groupby + cumcount to create a sequential counter per group, then create a multilevel index having two levels where the first level is counter and second level is column type itself, finally use stack followed by unstack to reshape:
c = df.groupby('type').cumcount()
df.set_index([c, df['type'].values]).stack().unstack([1, 2])
A B C
type qt vol type qt vol type qt vol
0 A 1 10 B 3 11 C 4 20
1 A 2 12 B 4 20 C 4 20
2 A 1 12 B 4 20 C 4 20
3 NaN NaN NaN NaN NaN NaN C 4 20
This is pretty much a pivot by one column:
(df.assign(idx=df.groupby('type').cumcount())
.pivot(index='idx',columns='type', values=df.columns)
.swaplevel(0,1, axis=1)
.sort_index(axis=1)
)
Output:
type A B C
qt type vol qt type vol qt type vol
idx
0 1 A 10 3 B 11 4 C 20
1 2 A 12 4 B 20 4 C 20
2 1 A 12 4 B 20 4 C 20
3 NaN NaN NaN NaN NaN NaN 4 C 20
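As a self-contained sketch of the first (concat) approach, rebuilt from the sample table above — shorter groups are padded with NaN against the longest group:

```python
import pandas as pd

df = pd.DataFrame({
    'type': list('AAABBBCCCC'),
    'qt':   [1, 2, 1, 3, 4, 4, 4, 4, 4, 4],
    'vol':  [10, 12, 12, 11, 20, 20, 20, 20, 20, 20],
})

# One sub-frame per type, re-indexed from 0 so the rows line up side by side
d = {k: g.reset_index(drop=True) for k, g in df.groupby('type')}

# keys= turns the group names into the first level of the column MultiIndex
wide = pd.concat(d.values(), keys=d.keys(), axis=1)
```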

Add column with average value grouped by column

I want to replace column values in a dataframe with the mean (excluding zeros) of that column, grouped by another column.
Dataframe df is like:
ID | TYPE | rate
-------------
1 | A | 0 <- Replace this
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 0 <- Replace this
7 | C | 8
8 | C | 2
9 | D | 0 <- Replace this
I have to replace the values in rate where rate == 0:
df.loc[df['rate']==0, 'rate'] = ?
with average value for that TYPE.
Average(without zeros) value for every type is:
A = 2/1 = 2
B = 2/1 = 2
C = (1 + 1 + 8 + 2)/4 = 3
D = 0 (default value when there isn't information for type)
Expected result:
ID | TYPE | rate
-------------
1 | A | 2 <- Changed
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 3 <- Changed
7 | C | 8
8 | C | 2
9 | D | 0 <- Changed
You could mask the rate column, GroupBy the TYPE and transform with the mean, which excludes NaNs. Then use fillna to replace the values in the masked series:
ma = df.rate.mask(df.rate.eq(0))
df['rate'] = ma.fillna(ma.groupby(df.TYPE).transform('mean').fillna(0))
ID TYPE rate
0 1 A 2.0
1 2 B 2.0
2 3 C 1.0
3 4 A 2.0
4 5 C 1.0
5 6 C 3.0
6 7 C 8.0
7 8 C 2.0
8 9 D 0.0
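The masking approach above, as a runnable sketch built from the sample table:

```python
import pandas as pd

df = pd.DataFrame({'ID': range(1, 10),
                   'TYPE': list('ABCACCCCD'),
                   'rate': [0, 2, 1, 2, 1, 0, 8, 2, 0]})

# Hide the zeros so they don't drag the group means down
masked = df['rate'].mask(df['rate'].eq(0))

# Mean of the non-zero rates per TYPE, broadcast back to every row; a type
# with no non-zero rates (D) has a NaN mean, which falls back to 0
group_mean = masked.groupby(df['TYPE']).transform('mean').fillna(0)

df['rate'] = masked.fillna(group_mean)
```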

Dataframe conditional column subtract until zero

This is different than the usual 'subtract until 0' questions on here as it is conditional on another column. This question is about creating that conditional column.
This dataframe consists of three columns.
Column 'quantity' tells you how much to add/subtract.
Column 'in' tells you when to subtract.
Column 'cumulative_in' tells you how much you have.
+----------+----+---------------+
| quantity | in | cumulative_in |
+----------+----+---------------+
| 5 | 0 | |
| 1 | 0 | |
| 3 | 1 | 3 |
| 4 | 1 | 7 |
| 2 | 1 | 9 |
| 1 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
| 1 | -1 | |
| 2 | 0 | |
| 1 | 0 | |
| 2 | 0 | |
| 3 | 0 | |
| 3 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
+----------+----+---------------+
Whenever column 'in' equals -1, starting from next row I want to create a column 'out' (0/1) that tells it to keep subtracting until 'cumulative_in' reaches 0. Doing it by hand,
Column 'out' tells you when to keep subtracting.
Column 'cumulative_subtracted' tells you how much you have already subtracted.
I subtract column 'cumulative_in' by 'cumulative_subtracted' until it reaches 0, the output looks something like this:
+----------+----+---------------+-----+-----------------------+
| quantity | in | cumulative_in | out | cumulative_subtracted |
+----------+----+---------------+-----+-----------------------+
| 5 | 0 | | | |
| 1 | 0 | | | |
| 3 | 1 | 3 | | |
| 4 | 1 | 7 | | |
| 2 | 1 | 9 | | |
| 1 | 0 | | | |
| 1 | 0 | | | |
| 3 | 0 | | | |
| 1 | -1 | | | |
| 2 | 0 | 7 | 1 | 2 |
| 1 | 0 | 6 | 1 | 3 |
| 2 | 0 | 4 | 1 | 5 |
| 3 | 0 | 1 | 1 | 8 |
| 3 | 0 | 0 | 1 | 9 |
| 1 | 0 | | | |
| 3 | 0 | | | |
+----------+----+---------------+-----+-----------------------+
I couldn't find a vectorized solution to this (I would love to see one), but the problem is not that hard when going through row by row. I hope your dataframe is not too big!
First set up the data.
data = {
    "quantity": [5, 1, 3, 4, 2, 1, 1, 3, 1, 2, 1, 2, 3, 3, 1, 3],
    "in": [0, 0, 1, 1, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
    "cumulative_in": [
        np.nan, np.nan, 3, 7, 9, np.nan, np.nan, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
    ],
}
Then set up the dataframe and the extra columns. I used np.nan for 'out', but 0 was easier for 'cumulative_subtracted':
df = pd.DataFrame(data)
df['out'] = np.nan
df['cumulative_subtracted'] = 0
Set the initial variables
last_in = 0.
reduce = False
Go through the dataframe row by row, unfortunately.
for i in df.index:
    # check if it is necessary to adjust the last_in value
    if not np.isnan(df.at[i, "cumulative_in"]) and not reduce:
        last_in = df.at[i, "cumulative_in"]
    # check for -1 and switch reduce on
    elif df.at[i, "in"] == -1:
        reduce = True
    # while reduce is on, implement the reductions
    elif reduce:
        df.at[i, "out"] = 1
        if df.at[i, "quantity"] <= last_in:
            last_in -= df.at[i, "quantity"]
            df.at[i, "cumulative_in"] = last_in
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + df.at[i, "quantity"]
            )
        else:  # quantity > last_in: drain the pool and stop reducing
            df.at[i, "cumulative_in"] = 0
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + last_in
            )
            last_in = 0
            reduce = False
This works for the data given, and hopefully for all your dataset.
print(df)
quantity in cumulative_in out cumulative_subtracted
0 5 0 NaN NaN 0
1 1 0 NaN NaN 0
2 3 1 3.0 NaN 0
3 4 1 7.0 NaN 0
4 2 1 9.0 NaN 0
5 1 0 NaN NaN 0
6 1 0 NaN NaN 0
7 3 0 NaN NaN 0
8 1 -1 NaN NaN 0
9 2 0 7.0 1.0 2
10 1 0 6.0 1.0 3
11 2 0 4.0 1.0 5
12 3 0 1.0 1.0 8
13 3 0 0.0 1.0 9
14 1 0 NaN NaN 0
15 3 0 NaN NaN 0
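For the single episode shown in this example (one -1 followed by rows that drain the pool, with no new 1s in between), the loop can be replaced by a clipped cumulative sum. A sketch under that assumption:

```python
import pandas as pd

quantity = pd.Series([5, 1, 3, 4, 2, 1, 1, 3, 1, 2, 1, 2, 3, 3, 1, 3])
flag = pd.Series([0, 0, 1, 1, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0])

pool = (quantity * flag.eq(1)).sum()   # total accumulated (9 here)
start = flag.eq(-1).idxmax() + 1       # first row after the -1

# Running amount subtracted, capped at the pool size
subtracted = quantity.iloc[start:].cumsum().clip(upper=pool)

# 'out' is 1 while something was still left to subtract on entering the row
out = (subtracted.shift(fill_value=0) < pool).astype(int)

# What remains of the pool: the 'cumulative_in' column
remaining = pool - subtracted
```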
It is not clear to me what happens when the quantity to subtract has not yet reached zero and another '1' appears in the 'in' column.
Still, here is a rough solution for the simple case:
import pandas as pd
import numpy as np
size = 20
df = pd.DataFrame(
    {
        "quantity": np.random.randint(1, 6, size),
        "in": np.full(size, np.nan),
    }
)
# These are just to place a random 1 and -1 into 'in', not important
df.loc[np.random.choice(df.iloc[:size//3, :].index, 1), 'in'] = 1
df.loc[np.random.choice(df.iloc[size//3:size//2, :].index, 1), 'in'] = -1
df.loc[np.random.choice(df.iloc[size//2:, :].index, 1), 'in'] = 1
# Forward-fill each 1/-1 entry over the missing values that follow it,
# up to the next 1/-1 entry.
df.loc[:, 'in'] = df['in'].ffill()
# Calculates the cumulative sum with a negative value for subtractions
df["cum_in"] = (df["quantity"] * df['in']).cumsum()
# Subtraction indicator and cumulative column
df['out'] = (df['in'] == -1).astype(int)
df["cumulative_subtracted"] = df.loc[df['in'] == -1, 'quantity'].cumsum()
# Remove values when the 'cum_in' turns to negative
df.loc[
    df["cum_in"] < 0, ["in", "cum_in", "out", "cumulative_subtracted"]
] = np.nan
print(df)

Pandas, how to count the occurrence within a grouped dataframe and create a new column?

How do I get the count of each value within a group using pandas?
In the table below I have a Group column and a Value column, and I want to generate a new column called Count, which should contain the total number of occurrences of that value within the group.
My df dataframe is as follows (without the count column):
-------------------------
| Group| Value | Count? |
-------------------------
| A | 10 | 3 |
| A | 20 | 2 |
| A | 10 | 3 |
| A | 10 | 3 |
| A | 20 | 2 |
| A | 30 | 1 |
-------------------------
| B | 20 | 3 |
| B | 20 | 3 |
| B | 20 | 3 |
| B | 10 | 1 |
-------------------------
| C | 20 | 2 |
| C | 20 | 2 |
| C | 10 | 2 |
| C | 10 | 2 |
-------------------------
I can get the counts using this:
df.groupby(['Group', 'Value']).Value.count()
but this is just for viewing; I am having difficulty putting the results back into the dataframe as new columns.
Using transform:
df['Count?'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
Try a merge:
df
Group Value
0 A 10
1 A 20
2 A 10
3 A 10
4 A 20
5 A 30
6 B 20
7 B 20
8 B 20
9 B 10
10 C 20
11 C 20
12 C 10
13 C 10
g = df.groupby(['Group', 'Value']).Group.count()\
.to_frame('Count?').reset_index()
df = df.merge(g)
df
Group Value Count?
0 A 10 3
1 A 10 3
2 A 10 3
3 A 20 2
4 A 20 2
5 A 30 1
6 B 20 3
7 B 20 3
8 B 20 3
9 B 10 1
10 C 20 2
11 C 20 2
12 C 10 2
13 C 10 2
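Putting the transform answer together as a self-contained snippet (column names matching the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAABBBBCCCC'),
    'Value': [10, 20, 10, 10, 20, 30, 20, 20, 20, 10, 20, 20, 10, 10],
})

# transform('count') broadcasts the per-(Group, Value) size back onto
# every original row, so no merge is needed
df['Count?'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
```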

Pandas: replace zero value with value of another column

How do I replace a zero value in a column with the value from the same row of another column, where the previous row's value in that column is also zero, i.e. replace only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 0 | 0 |
| 1 | 5 | 0 | 0 |
| 2 | 3 | 4 | 0 |
| 3 | 2 | 0 | 3 |
| 4 | 1 | 8 | 1 |
+----+-----+-----+----+
Replace the zero values in b and c with the values of a where the previous value is also zero:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 2 | 2 |
| 1 | 5 | 5 | 5 |
| 2 | 3 | 4 | 3 |
| 3 | 2 | 0 | 3 | <-- zero in this row is not replaced because of
| 4 | 1 | 8 | 1 | non-zero value (4) in row before it.
+----+-----+-----+----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
...: .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
...: columns=df.columns, index=df.index))
...: .astype(int)
...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (which is built as three concatenated copies of column a):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
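The same rule can also be written with a vectorized mask instead of apply (a sketch; shift(fill_value=0) makes the first row count as having a zero above it):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 5, 3, 2, 1],
                   'b': [0, 0, 4, 0, 8],
                   'c': [0, 0, 0, 3, 1]})

# A zero is replaced only if the cell above it is also zero, i.e. no
# non-zero value has been seen in the row immediately before it
cols = ['b', 'c']
replace = df[cols].eq(0) & df[cols].shift(fill_value=0).eq(0)

# mask(..., axis=0) aligns column 'a' row-wise as the replacement values
df[cols] = df[cols].mask(replace, df['a'], axis=0)
```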
