I have a very long time series indicating whether a day was dry (no rain) or wet. Part of the time series is shown here:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 1
2009-05-20 0
2009-05-21 1
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
....
I need to find dry periods, meaning periods of successive dry days (more than one dry day in a row). Therefore I would like to change the value of DryDay from 1 to 0 when there is only one isolated dry day. Like this:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 0
2009-05-20 0
2009-05-21 0
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
...
Can anyone help me how to do this with Pandas?
There might be a better way, but here is one:
df['DryDay'] = ((df['DryDay'] == 1) & ((df['DryDay'].shift() == 1) | (df['DryDay'].shift(-1) == 1))).astype(int)
Date DryDay
0 2009-05-07 0
1 2009-05-08 0
2 2009-05-09 1
3 2009-05-10 1
4 2009-05-11 1
5 2009-05-12 1
6 2009-05-13 1
7 2009-05-14 0
8 2009-05-15 0
9 2009-05-16 0
10 2009-05-17 0
11 2009-05-18 0
12 2009-05-20 0
13 2009-05-21 0
14 2009-05-22 0
15 2009-05-23 1
16 2009-05-24 1
17 2009-05-25 1
18 2009-05-26 0
19 2009-05-27 0
20 2009-05-28 1
21 2009-05-29 1
22 2009-05-30 0
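The one-liner above can be unpacked into named steps. Here is a minimal, self-contained sketch of the same shift-based idea (the sample values are made up for brevity):

```python
import pandas as pd

# Keep a 1 only when its neighbour (the row before or after) is also 1,
# so isolated dry days are zeroed out.
df = pd.DataFrame({'DryDay': [0, 1, 0, 1, 1, 1, 0, 1]})

is_dry = df['DryDay'].eq(1)
has_dry_neighbour = df['DryDay'].shift().eq(1) | df['DryDay'].shift(-1).eq(1)
df['DryDay'] = (is_dry & has_dry_neighbour).astype(int)

print(df['DryDay'].tolist())  # [0, 0, 0, 1, 1, 1, 0, 0]
```

The isolated 1s at positions 1 and 7 become 0, while the run of three 1s survives.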
Try this:
((df1.DryDay.rolling(2,min_periods=1).sum()>1)|(df1.DryDay.iloc[::-1].rolling(2,min_periods=1).sum()>1)).astype(int)
Out[95]:
0 0
1 0
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 1
16 1
17 1
18 0
19 0
20 1
21 1
22 0
Name: DryDay, dtype: int32
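For comparison, a sketch of another common pattern for the same goal (not the answer's code): label each run of equal values with a cumsum trick, then zero out dry runs shorter than two days via the run's size.

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1, 1, 0])

runs = s.ne(s.shift()).cumsum()              # id of each run of equal values
run_len = s.groupby(runs).transform('size')  # length of the run each row is in
result = ((s == 1) & (run_len > 1)).astype(int)

print(result.tolist())  # [0, 0, 0, 1, 1, 1, 0]
```

This generalises easily if you later want a minimum run length other than 2.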
Related
I have a DataFrame with a column called No.. I need to count the number of consecutive 0s in column No.: the first 0 of a run is recorded as 1, the second as 2, and so on; whenever a 1 is encountered, the counter resets. The result should be saved in a column called count.
What should I do?
An example of my DataFrame is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
Generate pseudo-groups with cumsum, then build within-group counters with groupby + cumsum:
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).cumsum()
Output:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
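The cumsum-based counter can be checked on a small hand-made series; this is just the answer's two lines wrapped in a runnable snippet:

```python
import pandas as pd

df = pd.DataFrame({'No.': [0, 1, 0, 0, 0, 1, 0]})

# Every non-zero starts a new pseudo-group; within each group,
# cumulatively count the zeros.
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).cumsum()

print(df['count'].tolist())  # [1, 0, 1, 2, 3, 0, 1]
```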
I'm stuck on this issue: whenever the number 1 appears in the column "values", the column "trigger" should display the number 1 in the next 5 cells.
Please consider the following example:
Index values
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
The expected result should be as follows:
Index values trigger
1 0 0
2 0 0
3 1 0
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 0
10 0 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 0
18 0 0
19 0 0
20 0 0
Use Series.ffill with a limit:
m = df['values'].eq(1)
df['trigger'] = df['values'].where(m).ffill(limit=5).mask(m).fillna(0, downcast='int')
Or
df['trigger'] = (df['values'].shift().where(lambda x: x.eq(1))
.ffill(limit=4).fillna(0, downcast='int'))
Output
print(df)
Index values trigger
0 1 0 0
1 2 0 0
2 3 1 0
3 4 0 1
4 5 0 1
5 6 0 1
6 7 0 1
7 8 0 1
8 9 0 0
9 10 0 0
10 11 1 0
11 12 0 1
12 13 0 1
13 14 0 1
14 15 0 1
15 16 0 1
16 17 0 0
17 18 0 0
18 19 0 0
19 20 0 0
You could use .fillna(df['values']) instead if you want to keep the 1s from the values column.
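A minimal runnable sketch of the ffill(limit=...) idea, using a window of 3 rows after each 1 (rather than 5) to keep the example short, and a plain astype(int) instead of the downcast argument:

```python
import pandas as pd

df = pd.DataFrame({'values': [0, 1, 0, 0, 0, 0, 0, 1, 0]})

m = df['values'].eq(1)
# Keep only the 1s, forward-fill them at most 3 rows, then blank out the
# trigger rows themselves and fill the rest with 0.
df['trigger'] = df['values'].where(m).ffill(limit=3).mask(m).fillna(0).astype(int)

print(df['trigger'].tolist())  # [0, 0, 1, 1, 1, 0, 0, 0, 1]
```

Note that the second 1 sits fewer than 3 rows from the end, so its window is simply cut short.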
I want to create a column that increments by 1 for every row where diffs is not NaT. If the value is NaT, the increment should reset.
Below is an example dataframe:
x y min z o diffs
0 0 0 0 1 1 NaT
1 0 0 0 2 1 00:00:01
2 0 0 0 6 1 00:00:04
3 0 0 0 11 1 00:00:05
4 0 0 0 14 0 NaT
5 0 0 2 18 0 NaT
6 0 0 2 41 1 NaT
7 0 0 2 42 0 NaT
8 0 0 8 13 1 00:00:54
9 0 0 8 16 1 00:00:03
10 0 0 8 17 1 00:00:01
11 0 0 8 20 0 NaT
12 0 0 8 32 1 NaT
This is my expected output:
x y min z o diffs increment
0 0 0 0 1 1 NaT 0
1 0 0 0 2 1 00:00:01 1
2 0 0 0 6 1 00:00:04 2
3 0 0 0 11 1 00:00:05 3
4 0 0 0 14 0 NaT 0
5 0 0 2 18 0 NaT 0
6 0 0 2 41 1 NaT 0
7 0 0 2 42 0 NaT 0
8 0 0 8 13 1 00:00:54 1
9 0 0 8 16 1 00:00:03 2
10 0 0 8 17 1 00:00:01 3
11 0 0 8 20 0 NaT 0
12 0 0 8 32 1 NaT 0
Use numpy.where: where values are not missing, set them to a counter built with cumcount over groups of consecutive non-missing values; set the missing positions to 0:
m = df['diffs'].notnull()
df['increment'] = np.where(m, df.groupby(m.ne(m.shift()).cumsum()).cumcount()+1, 0)
print (df)
x y min z o diffs increment
0 0 0 0 1 1 NaT 0
1 0 0 0 2 1 00:00:01 1
2 0 0 0 6 1 00:00:04 2
3 0 0 0 11 1 00:00:05 3
4 0 0 0 14 0 NaT 0
5 0 0 2 18 0 NaT 0
6 0 0 2 41 1 NaT 0
7 0 0 2 42 0 NaT 0
8 0 0 8 13 1 00:00:54 1
9 0 0 8 16 1 00:00:03 2
10 0 0 8 17 1 00:00:01 3
11 0 0 8 20 0 NaT 0
12 0 0 8 32 1 NaT 0
If performance is important, alternative solution:
b = m.cumsum()
df['increment'] = b-b.mask(m).ffill().fillna(0).astype(int)
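The performance-oriented variant can be sanity-checked on a plain series, with NaN standing in for the NaT values:

```python
import numpy as np
import pandas as pd

diffs = pd.Series([np.nan, 1.0, 2.0, np.nan, 3.0, 4.0, 5.0])

m = diffs.notna()
b = m.cumsum()
# Subtracting the cumulative count frozen at the last missing row
# resets the counter after every gap.
increment = b - b.mask(m).ffill().fillna(0).astype(int)

print(increment.tolist())  # [0, 1, 2, 0, 1, 2, 3]
```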
I have a dataframe which has Dates and public holidays
Date WeekNum Public_Holiday
1/1/2015 1 1
2/1/2015 1 0
3/1/2015 1 0
4/1/2015 1 0
5/1/2015 1 0
6/1/2015 1 0
7/1/2015 1 0
8/1/2015 2 0
9/1/2015 2 0
10/1/2015 2 0
11/1/2015 2 0
12/1/2015 2 0
13/1/2015 2 0
I have to create a conditional column named Public_Holiday_Week, which should be 1 if that particular week contains a public holiday.
And I want to see an output like this
Date WeekNum Public_Holiday Public_Holiday_Week
1/1/2015 1 1 1
2/1/2015 1 0 1
3/1/2015 1 0 1
4/1/2015 1 0 1
5/1/2015 1 0 1
6/1/2015 1 0 1
7/1/2015 1 0 1
8/1/2015 2 0 0
9/1/2015 2 0 0
10/1/2015 2 0 0
11/1/2015 2 0 0
12/1/2015 2 0 0
13/1/2015 2 0 0
I tried using np.where
df['Public_Holiday_Week'] = np.where(df['Public_Holiday']==1,1,0)
But it assigns 0 to the other days of a week that does contain a public holiday.
Do I have to apply rolling here? Appreciate your help.
To improve performance, avoid groupby: get all WeekNum values containing at least one 1, then select matching rows with isin, and finally cast the boolean mask to ints:
weeks = df.loc[df['Public_Holiday'].eq(1), 'WeekNum']
df['Public_Holiday_Week'] = df['WeekNum'].isin(weeks).astype(int)
print (df)
Date WeekNum Public_Holiday Public_Holiday_Week
0 1/1/2015 1 1 1
1 2/1/2015 1 0 1
2 3/1/2015 1 0 1
3 4/1/2015 1 0 1
4 5/1/2015 1 0 1
5 6/1/2015 1 0 1
6 7/1/2015 1 0 1
7 8/1/2015 2 0 0
8 9/1/2015 2 0 0
9 10/1/2015 2 0 0
10 11/1/2015 2 0 0
11 12/1/2015 2 0 0
12 13/1/2015 2 0 0
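The isin-based flag is easy to verify on a toy frame; this just restates the two lines above in a self-contained form:

```python
import pandas as pd

df = pd.DataFrame({'WeekNum': [1, 1, 1, 2, 2],
                   'Public_Holiday': [1, 0, 0, 0, 0]})

# Week numbers that contain at least one holiday, then a membership test.
weeks = df.loc[df['Public_Holiday'].eq(1), 'WeekNum']
df['Public_Holiday_Week'] = df['WeekNum'].isin(weeks).astype(int)

print(df['Public_Holiday_Week'].tolist())  # [1, 1, 1, 0, 0]
```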
As pointed out by @Mohamed Thasin ah, it is possible to group by calendar week if necessary, but that produces a different output because the week numbers differ:
df['weeks'] = pd.to_datetime(df['Date'], dayfirst=True).dt.week
weeks = df.loc[df['Public_Holiday'].eq(1), 'weeks']
df['Public_Holiday_Week'] = df['weeks'].isin(weeks).astype(int)
print (df)
Date WeekNum Public_Holiday weeks Public_Holiday_Week
0 1/1/2015 1 1 1 1
1 2/1/2015 1 0 1 1
2 3/1/2015 1 0 1 1
3 4/1/2015 1 0 1 1
4 5/1/2015 1 0 2 0
5 6/1/2015 1 0 2 0
6 7/1/2015 1 0 2 0
7 8/1/2015 2 0 2 0
8 9/1/2015 2 0 2 0
9 10/1/2015 2 0 2 0
10 11/1/2015 2 0 2 0
11 12/1/2015 2 0 3 0
12 13/1/2015 2 0 3 0
Use resample and skip the use of the WeekNum column altogether.
df.assign(
Public_Holiday_Week=
df.resample('W-Wed', on='Date').Public_Holiday.transform('max')
)
Date WeekNum Public_Holiday Public_Holiday_Week
0 2015-01-01 1 1 1
1 2015-01-02 1 0 1
2 2015-01-03 1 0 1
3 2015-01-04 1 0 1
4 2015-01-05 1 0 1
5 2015-01-06 1 0 1
6 2015-01-07 1 0 1
7 2015-01-08 2 0 0
8 2015-01-09 2 0 0
9 2015-01-10 2 0 0
10 2015-01-11 2 0 0
11 2015-01-12 2 0 0
12 2015-01-13 2 0 0
groupby and max, with map:
df['Public_Holiday_Week'] = df.WeekNum.map(df.groupby('WeekNum').Public_Holiday.max())
print(df)
Date WeekNum Public_Holiday Public_Holiday_Week
0 1/1/2015 1 1 1
1 2/1/2015 1 0 1
2 3/1/2015 1 0 1
3 4/1/2015 1 0 1
4 5/1/2015 1 0 1
5 6/1/2015 1 0 1
6 7/1/2015 1 0 1
7 8/1/2015 2 0 0
8 9/1/2015 2 0 0
9 10/1/2015 2 0 0
10 11/1/2015 2 0 0
11 12/1/2015 2 0 0
12 13/1/2015 2 0 0
groupby and transform, with max
df['Public_Holiday_Week'] = df.groupby('WeekNum').Public_Holiday.transform('max')
Thankfully, this will generalise nicely when grouping by month-year:
df['Public_Holiday_Week'] = (
df.groupby(['WeekNum', df.Date.str.split('/', 1).str[1]])
.Public_Holiday.transform('max')
)
print(df)
Date WeekNum Public_Holiday Public_Holiday_Week
0 1/1/2015 1 1 1
1 2/1/2015 1 0 1
2 3/1/2015 1 0 1
3 4/1/2015 1 0 1
4 5/1/2015 1 0 1
5 6/1/2015 1 0 1
6 7/1/2015 1 0 1
7 8/1/2015 2 0 0
8 9/1/2015 2 0 0
9 10/1/2015 2 0 0
10 11/1/2015 2 0 0
11 12/1/2015 2 0 0
12 13/1/2015 2 0 0
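The transform('max') variant can likewise be checked on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'WeekNum': [1, 1, 2, 2],
                   'Public_Holiday': [0, 1, 0, 0]})

# Broadcast each week's maximum back to every row of that week.
df['Public_Holiday_Week'] = df.groupby('WeekNum').Public_Holiday.transform('max')

print(df['Public_Holiday_Week'].tolist())  # [1, 1, 0, 0]
```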
I got a data CSV file from GitHub and used pd.read_csv() to read it. It automatically creates a row index like this:
label repeattrips id offer_id never_bought_company \
0 1 5 86246 1208251 0
1 1 16 86252 1197502 0
2 0 0 12682470 1197502 1
3 0 0 12996040 1197502 1
4 0 0 13089312 1204821 0
5 0 0 13179265 1197502 1
6 0 0 13251776 1200581 0
But when I create my own CSV file and read it:
label gender age_range action0 action1 action2 action3 first \
0 0 2 1 0 1 0 2 1
0 0 4 0 0 1 0 1 1
0 1 2 8 0 1 0 9 1
1 0 2 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
the label column is treated as the row index in my output.
Even if I add a serial number at the front of every line of my data, it still doesn't solve the problem. Like this:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2 1
1 0 0 4 0 0 1 0 1 1
2 0 1 2 8 0 1 0 9 1
3 1 0 2 0 0 1 0 1 1
4 0 1 5 0 0 1 0 1 1
5 0 1 5 0 0 1 0 1 1
6 0 0 7 5 0 1 0 6 1
7 0 0 7 1 0 1 0 2 1
I don't know if I saved it correctly. My CSV data looks like this (with serial numbers added), and the GitHub file has a similar format:
label gender age_range action0 action1 action2 action3 first second third fourth sirstrate secondrate thirdrate fourthrate total_cat total_brand total_time total_items users_appear users_items users_cats users_brands users_times users_action0 users_action1 users_action2 users_action3 merchants_appear merchants_items merchants_cats merchants_brands merchants_times merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 0 0 2 1 0 1 0 2 1 1 0 0.0224719101124 0.5 0.5 0 1 1 1 1 89 71 22 45 17 87 0 2 0 46 34 11 16 3 38 4 2 2
1 0 0 4 0 0 1 0 1 1 1 0 0.00469483568075 0.0232558139535 0.0232558139535 0.0 1 1 1 1 213 102 47 44 30 170 0 36 7 103 58 25 23 6 81 0 22 0
2 0 1 2 8 0 1 0 9 1 1 0 0.0157342657343 0.0181818181818 0.0181818181818 0.0 2 2 1 5 572 393 111 158 60 517 0 15 40 119 70 24 20 17 106 6 7 0
3 1 0 2 0 0 1 0 1 1 1 0 0.0142857142857 0.0769230769231 0.0769230769231 0.0 1 1 1 1 70 33 19 15 15 57 0 11 2 27 17 11 15 11 18 0 2 7
4 0 1 5 0 0 1 0 1 1 1 0 0.025641025641 0.2 0.2 0.0 1 1 1 1 39 32 16 29 14 34 0 4 1 133 88 26 25 11 128 0 5 0
Each whole line ends up in a single cell, rather than each item of the line in its own cell.
Could you tell me how to solve this?
You'll need to provide code to get more substantive help, since it's unclear why you're facing a problem. For example, the data you pasted at the bottom reads in just fine with pd.read_clipboard(), and pd.read_csv() should also work as long as you set it up with a whitespace separator:
In [2]: pd.read_clipboard()
Out[2]:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2
1 0 0 4 0 0 1 0 1
2 0 1 2 8 0 1 0 9
3 1 0 2 0 0 1 0 1
4 0 1 5 0 0 1 0 1
second third ... users_action3 merchants_appear \
0 1 1 ... 0 46
1 1 1 ... 7 103
2 1 1 ... 40 119
3 1 1 ... 2 27
4 1 1 ... 1 133
merchants_items merchants_cats merchants_brands merchants_times \
0 34 11 16 3
1 58 25 23 6
2 70 24 20 17
3 17 11 15 11
4 88 26 25 11
merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 38 4 2 2
1 81 0 22 0
2 106 6 7 0
3 18 0 2 7
4 128 0 5 0
[5 rows x 37 columns]
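For the read_csv route, a hedged sketch of reading whitespace-separated data (the in-memory string stands in for the actual file, whose path is unknown). With a proper separator, the first column is parsed as data rather than as the index:

```python
import io
import pandas as pd

# Three header fields and three fields per data row: with sep=r'\s+'
# pandas parses every column as data and builds a default RangeIndex.
text = """label gender age_range
0 2 1
0 4 0
1 2 8
"""
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(df.columns.tolist())  # ['label', 'gender', 'age_range']
print(df.shape)             # (3, 3)
```

If a file's header has one fewer field than its data rows, read_csv infers the first column as the index, which matches the symptom described above; either fix the header or pass index_col=0 explicitly.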