I used pivot_table to group a DataFrame named df:
UserID eventA eventB eventC ... date days
1 77 4 9 2015-11-01 2
1 3 1 1 2015-12-30 60
1 37 1 2 2016-04-23 174
1 6 2 2 2016-06-12 225
2 42 6 7 2015-09-07 130
... ... ... ... ...
Then I drop date and pivot:
df = df.drop('date', axis=1).pivot_table(index='UserID', columns='days', fill_value=0)
eventA \
day 1 2 3 4 5 6 7
UserID
1 0 77 0 0 0 0 0
2 0 6 0 0 0 0 9
3 0 0 0 0 12 0 0
4 0 0 0 0 0 0 3
5 0 0 0 33 0 0 0
... eventG \
days 8 9 10 ... 769 770
UserID ...
1 0 12 113 ... 0 0
2 0 0 0 ... 0 0
3 0 0 32 ... 66 0
4 0 0 0 ... 0 0
5 0 0 0 ... 0 43
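As a quick sanity check, here is a minimal runnable sketch of this kind of pivot on tiny made-up data (the values are illustrative only; note pivot_table is called without passing df again, and with a numeric fill_value):

```python
import pandas as pd

# Tiny made-up sample in the same shape as df above (values illustrative only)
df = pd.DataFrame({
    'UserID': [1, 1, 2],
    'eventA': [77, 3, 42],
    'eventB': [4, 1, 6],
    'days': [2, 60, 130],
})

# Users as rows, days as columns, one column group per event,
# and 0 wherever a user has no record for a given day
wide = df.pivot_table(index='UserID', columns='days', fill_value=0)
print(wide)
```

The result has a MultiIndex on the columns, so individual cells are addressed as, e.g., `wide.loc[1, ('eventA', 2)]`.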
On the other hand, I have another DataFrame with UserID, Start_date, and End_date.
Each user may have multiple records.
For each user, days is computed as (date - first Start_date).
UserID Start_date End_date
1 2015-10-31 2015-12-21
1 2016-01-01 2016-07-21
2 2015-05-01 2016-10-01
3 2015-05-22 2015-08-22
3 2015-09-09 2015-11-01
3 2016-03-31 2016-07-24
4 2016-10-31 2016-12-21
So here is the problem: for each user, I want to find every time range during which he is not inside any (Start_date, End_date) interval, expressed in days, and pad those days in df with None.
For example, UserID 3 has three ranges, and I wish to mark the gaps between them as None: [2015-08-22, 2015-09-09] relative to 2015-05-22 (the 30th day to the 111th day), [2015-11-01, 2016-03-31] relative to 2015-05-22 (the 164th day to the 314th day), and [2016-07-24, end] relative to 2015-05-22 (the 429th day to the 770th day).
The final dataframe should be similar like this,
eventA \
day 1 2 3 4 5 6 7
UserID
1 0 77 0 0 0 0 0
2 0 6 0 0 0 0 9
3 0 0 0 0 12 0 0
4 0 0 0 0 0 0 3
5 0 0 0 33 0 0 0
... eventG \
days 8 9 10 ... 112 113 114 ... 769 770
UserID ... ...
1 0 12 113 ... 0 2 4 None None
2 0 0 0 ... 12 0 3 None None
3 0 0 32 ... None None None 66 0
4 0 0 0 ... 5 1 0 None None
5 0 0 0 ... None None 2 0 43
Hope I made this question clear.
Looking for someone who could help me!
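No answer is attached here, but one possible sketch of the masking step, using a tiny made-up wide frame and coverage table (the column layout and the start_day/end_day offsets are assumptions about the setup above, not the asker's actual data):

```python
import pandas as pd
import numpy as np

# Tiny illustrative wide frame: users 1-2, days 1-6, one event column group
days = [1, 2, 3, 4, 5, 6]
wide = pd.DataFrame(
    [[0, 77, 0, 5, 0, 1], [0, 6, 9, 0, 2, 0]],
    index=pd.Index([1, 2], name='UserID'),
    columns=pd.MultiIndex.from_product([['eventA'], days]),
)

# Coverage ranges per user, already converted to day offsets (assumed format)
ranges = pd.DataFrame({
    'UserID': [1, 1, 2],
    'start_day': [1, 5, 2],
    'end_day': [2, 6, 4],
})

# For each user, mark every day column outside all coverage ranges as None
out = wide.astype(object)
day_level = wide.columns.get_level_values(1)
for uid, grp in ranges.groupby('UserID'):
    covered = np.zeros(len(day_level), dtype=bool)
    for _, r in grp.iterrows():
        covered |= (day_level >= r['start_day']) & (day_level <= r['end_day'])
    out.loc[uid, ~covered] = None
print(out)
```

Casting to object dtype first lets the frame hold None alongside the integer counts.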
Related
I have a DataFrame with four columns: pid, track, cluster, and num_track. My goal is to create a new DataFrame that keeps track and pid on each row and adds a column for each unique value of cluster holding its per-pid count.
Here is a sample dataframe:
pid track cluster num_track
0 1 6 4
0 2 1 4
0 3 6 4
0 4 3 4
1 5 10 3
1 6 10 3
1 7 1 4
2 8 9 5
2 9 11 5
2 10 2 5
2 11 2 5
2 12 2 5
So my desired output would be:
pid track cluster num_track c1 c2 c3 c4 c5 c6 c7 ... c12
0 1 6 4 1 0 1 0 0 2 0 0
0 2 1 4 1 0 1 0 0 2 0 0
0 3 6 4 1 0 1 0 0 2 0 0
0 4 3 4 1 0 1 0 0 2 0 0
1 5 10 3 1 0 0 0 0 0 0 0
1 6 10 3 1 0 0 0 0 0 0 0
1 7 1 3 1 0 0 0 0 0 0 0
2 8 9 5 0 3 0 0 0 0 0 0
2 9 11 5 0 3 0 0 0 0 0 0
2 10 2 5 0 3 0 0 0 0 0 0
2 11 2 5 0 3 0 0 0 0 0 0
2 12 2 5 0 3 0 0 0 0 0 0
I hope I have presented my question correctly; if anything is unclear, tell me! I don't have enough rep to set up a bounty yet, but I could repost when I do.
Any help would be appreciated!!
You can use crosstab with reindex, then concat back to the original df:
s = pd.crosstab(df.pid, df.cluster).reindex(df.pid)
s.index = df.index
df = pd.concat([df, s.add_prefix('c')], axis=1)
df
Out[209]:
pid track cluster num_track c1 c2 c3 c6 c9 c10 c11
0 0 1 6 4 1 0 1 2 0 0 0
1 0 2 1 4 1 0 1 2 0 0 0
2 0 3 6 4 1 0 1 2 0 0 0
3 0 4 3 4 1 0 1 2 0 0 0
4 1 5 10 3 1 0 0 0 0 2 0
5 1 6 10 3 1 0 0 0 0 2 0
6 1 7 1 4 1 0 0 0 0 2 0
7 2 8 9 5 0 3 0 0 1 0 1
8 2 9 11 5 0 3 0 0 1 0 1
9 2 10 2 5 0 3 0 0 1 0 1
10 2 11 2 5 0 3 0 0 1 0 1
11 2 12 2 5 0 3 0 0 1 0 1
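A self-contained version of the above, rebuilt from the question's sample data so it can be run directly (only the axis=1 keyword is spelled out explicitly):

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    'pid': [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2],
    'track': range(1, 13),
    'cluster': [6, 1, 6, 3, 10, 10, 1, 9, 11, 2, 2, 2],
    'num_track': [4, 4, 4, 4, 3, 3, 4, 5, 5, 5, 5, 5],
})

# Count clusters per pid, then broadcast those counts back onto every row
s = pd.crosstab(df.pid, df.cluster).reindex(df.pid)
s.index = df.index
df = pd.concat([df, s.add_prefix('c')], axis=1)
print(df)
```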
I want to create a column that increments by 1 for every row whose diffs value is not NaT. If the value is NaT, I want the increment to reset to 0.
Below is an example dataframe:
x y min z o diffs
0 0 0 0 1 1 NaT
1 0 0 0 2 1 00:00:01
2 0 0 0 6 1 00:00:04
3 0 0 0 11 1 00:00:05
4 0 0 0 14 0 NaT
5 0 0 2 18 0 NaT
6 0 0 2 41 1 NaT
7 0 0 2 42 0 NaT
8 0 0 8 13 1 00:00:54
9 0 0 8 16 1 00:00:03
10 0 0 8 17 1 00:00:01
11 0 0 8 20 0 NaT
12 0 0 8 32 1 NaT
This is my expected output:
x y min z o diffs increment
0 0 0 0 1 1 NaT 0
1 0 0 0 2 1 00:00:01 1
2 0 0 0 6 1 00:00:04 2
3 0 0 0 11 1 00:00:05 3
4 0 0 0 14 0 NaT 0
5 0 0 2 18 0 NaT 0
6 0 0 2 41 1 NaT 0
7 0 0 2 42 0 NaT 0
8 0 0 8 13 1 00:00:54 1
9 0 0 8 16 1 00:00:03 2
10 0 0 8 17 1 00:00:01 3
11 0 0 8 20 0 NaT 0
12 0 0 8 32 1 NaT 0
Use numpy.where: where diffs is not missing, take a counter built with cumcount over each consecutive run of non-missing values; otherwise use 0:
m = df['diffs'].notnull()
df['increment'] = np.where(m, df.groupby(m.ne(m.shift()).cumsum()).cumcount()+1, 0)
print (df)
x y min z o diffs increment
0 0 0 0 1 1 NaT 0
1 0 0 0 2 1 00:00:01 1
2 0 0 0 6 1 00:00:04 2
3 0 0 0 11 1 00:00:05 3
4 0 0 0 14 0 NaT 0
5 0 0 2 18 0 NaT 0
6 0 0 2 41 1 NaT 0
7 0 0 2 42 0 NaT 0
8 0 0 8 13 1 00:00:54 1
9 0 0 8 16 1 00:00:03 2
10 0 0 8 17 1 00:00:01 3
11 0 0 8 20 0 NaT 0
12 0 0 8 32 1 NaT 0
If performance is important, an alternative solution:
b = m.cumsum()
df['increment'] = b-b.mask(m).ffill().fillna(0).astype(int)
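A runnable version of the first solution, with a short made-up diffs column (the timedelta values are illustrative):

```python
import pandas as pd
import numpy as np

# Tiny diffs column: NaT resets the counter, non-NaT increments it
df = pd.DataFrame({'diffs': pd.to_timedelta(
    [None, '1s', '4s', None, None, '54s', '3s', None])})

m = df['diffs'].notnull()
# Number the rows within each consecutive run of non-missing values;
# rows where diffs is NaT get 0
df['increment'] = np.where(
    m, df.groupby(m.ne(m.shift()).cumsum()).cumcount() + 1, 0)
print(df['increment'].tolist())  # [0, 1, 2, 0, 0, 1, 2, 0]
```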
I have a very long time series indicating whether a day was dry (no rain) or wet. Part of the time series is shown here:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 1
2009-05-20 0
2009-05-21 1
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
....
I need to find dry periods, meaning runs of successive dry days (more than one dry day in a row). Therefore I would like to change the value of DryDay from 1 to 0 wherever there is only a single isolated dry day, like this:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 0
2009-05-20 0
2009-05-21 0
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
...
Can anyone show me how to do this with pandas?
There might be a better way, but here is one:
df['DryDay'] = ((df['DryDay'] == 1)
                & ((df['DryDay'].shift() == 1) | (df['DryDay'].shift(-1) == 1))).astype(int)
Date DryDay
0 2009-05-07 0
1 2009-05-08 0
2 2009-05-09 1
3 2009-05-10 1
4 2009-05-11 1
5 2009-05-12 1
6 2009-05-13 1
7 2009-05-14 0
8 2009-05-15 0
9 2009-05-16 0
10 2009-05-17 0
11 2009-05-18 0
12 2009-05-20 0
13 2009-05-21 0
14 2009-05-22 0
15 2009-05-23 1
16 2009-05-24 1
17 2009-05-25 1
18 2009-05-26 0
19 2009-05-27 0
20 2009-05-28 1
21 2009-05-29 1
22 2009-05-30 0
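A self-contained run of this shift-based approach on a short made-up DryDay series:

```python
import pandas as pd

# Made-up series: isolated 1s at positions 1 and 7, a run of 1s at 3-5
df = pd.DataFrame({'DryDay': [0, 1, 0, 1, 1, 1, 0, 1]})

# Keep a 1 only when it has a dry neighbour on either side
df['DryDay'] = ((df['DryDay'] == 1)
                & ((df['DryDay'].shift() == 1)
                   | (df['DryDay'].shift(-1) == 1))).astype(int)
print(df['DryDay'].tolist())  # [0, 0, 0, 1, 1, 1, 0, 0]
```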
Try this:
((df1.DryDay.rolling(2, min_periods=1).sum() > 1)
 | (df1.DryDay.iloc[::-1].rolling(2, min_periods=1).sum() > 1)).astype(int)
Out[95]:
0 0
1 0
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 1
16 1
17 1
18 0
19 0
20 1
21 1
22 0
Name: DryDay, dtype: int32
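A self-contained run of the rolling approach on a short made-up DryDay series, split over a few lines for readability:

```python
import pandas as pd

# Made-up series: isolated 1s at positions 1 and 7, a run of 1s at 3-5
df1 = pd.DataFrame({'DryDay': [0, 1, 0, 1, 1, 1, 0, 1]})

# A 1 survives if either the forward or the backward 2-day rolling sum
# exceeds 1, i.e. it has at least one dry neighbour
fwd = df1.DryDay.rolling(2, min_periods=1).sum() > 1
bwd = df1.DryDay.iloc[::-1].rolling(2, min_periods=1).sum() > 1
result = (fwd | bwd).astype(int)
print(result.tolist())  # [0, 0, 0, 1, 1, 1, 0, 0]
```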
I got a CSV data file from GitHub and used pd.read_csv() to read it. It automatically creates an index (serial numbers) like this:
label repeattrips id offer_id never_bought_company \
0 1 5 86246 1208251 0
1 1 16 86252 1197502 0
2 0 0 12682470 1197502 1
3 0 0 12996040 1197502 1
4 0 0 13089312 1204821 0
5 0 0 13179265 1197502 1
6 0 0 13251776 1200581 0
But when I create my own CSV file and read it:
label gender age_range action0 action1 action2 action3 first \
0 0 2 1 0 1 0 2 1
0 0 4 0 0 1 0 1 1
0 1 2 8 0 1 0 9 1
1 0 2 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
The label column is treated as the index in my output.
Adding a serial number at the front of every line of my data still didn't solve the problem, like this:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2 1
1 0 0 4 0 0 1 0 1 1
2 0 1 2 8 0 1 0 9 1
3 1 0 2 0 0 1 0 1 1
4 0 1 5 0 0 1 0 1 1
5 0 1 5 0 0 1 0 1 1
6 0 0 7 5 0 1 0 6 1
7 0 0 7 1 0 1 0 2 1
I don't know if I saved it correctly. My CSV data looks like this (with serial numbers added), and the GitHub file has a similar format:
label gender age_range action0 action1 action2 action3 first second third fourth sirstrate secondrate thirdrate fourthrate total_cat total_brand total_time total_items users_appear users_items users_cats users_brands users_times users_action0 users_action1 users_action2 users_action3 merchants_appear merchants_items merchants_cats merchants_brands merchants_times merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 0 0 2 1 0 1 0 2 1 1 0 0.0224719101124 0.5 0.5 0 1 1 1 1 89 71 22 45 17 87 0 2 0 46 34 11 16 3 38 4 2 2
1 0 0 4 0 0 1 0 1 1 1 0 0.00469483568075 0.0232558139535 0.0232558139535 0.0 1 1 1 1 213 102 47 44 30 170 0 36 7 103 58 25 23 6 81 0 22 0
2 0 1 2 8 0 1 0 9 1 1 0 0.0157342657343 0.0181818181818 0.0181818181818 0.0 2 2 1 5 572 393 111 158 60 517 0 15 40 119 70 24 20 17 106 6 7 0
3 1 0 2 0 0 1 0 1 1 1 0 0.0142857142857 0.0769230769231 0.0769230769231 0.0 1 1 1 1 70 33 19 15 15 57 0 11 2 27 17 11 15 11 18 0 2 7
4 0 1 5 0 0 1 0 1 1 1 0 0.025641025641 0.2 0.2 0.0 1 1 1 1 39 32 16 29 14 34 0 4 1 133 88 26 25 11 128 0 5 0
Each whole line ends up in a single cell, rather than each value of the line in its own cell.
Could you tell me how to solve this?
You'll need to provide code to get more substantive help since it's unclear why you're facing a problem. For example, copying the data you pasted at the bottom reads in just fine with pd.read_clipboard(), and pd.read_csv() should also work fine as long as you set it up with a space separator:
In [2]: pd.read_clipboard()
Out[2]:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2
1 0 0 4 0 0 1 0 1
2 0 1 2 8 0 1 0 9
3 1 0 2 0 0 1 0 1
4 0 1 5 0 0 1 0 1
second third ... users_action3 merchants_appear \
0 1 1 ... 0 46
1 1 1 ... 7 103
2 1 1 ... 40 119
3 1 1 ... 2 27
4 1 1 ... 1 133
merchants_items merchants_cats merchants_brands merchants_times \
0 34 11 16 3
1 58 25 23 6
2 70 24 20 17
3 17 11 15 11
4 88 26 25 11
merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 38 4 2 2
1 81 0 22 0
2 106 6 7 0
3 18 0 2 7
4 128 0 5 0
[5 rows x 37 columns]
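If the file itself is whitespace-delimited, a hedged sketch of reading it with an explicit separator (the snippet uses a tiny made-up fragment, with io.StringIO standing in for the real file path):

```python
import io
import pandas as pd

# A whitespace-delimited fragment like the one in the question (values made up)
text = """label gender age_range action0
0 0 2 1
0 0 4 0
1 0 2 0
"""

# sep=r'\s+' treats any run of whitespace as one delimiter;
# index_col=False stops pandas from promoting the first column to the index
df = pd.read_csv(io.StringIO(text), sep=r'\s+', index_col=False)
print(df)
```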
I have a pandas DataFrame that looks like:
Payment_Method DCASH_T3M DCASH_T3M_3D PAYPAL Unknown combined
day_name
2013-08-27 0 0 0 1 1
2013-08-28 0 0 0 4 4
2013-08-29 0 0 0 17 17
2013-08-30 0 0 0 4 4
2013-09-02 0 0 0 3 3
2013-09-03 0 0 0 1 1
2013-09-04 0 0 0 3 3
2013-09-05 0 0 0 1 1
2013-09-06 0 0 0 5 5
2013-09-09 0 0 0 2 2
2013-09-10 0 0 0 5 5
2013-09-11 0 0 0 18 18
2013-09-12 0 0 0 6 6
2013-09-13 0 0 0 13 13
2013-09-16 0 0 0 19 19
.....
I would like to sum up all days in the same week, so that I get a new row with the sum for each week. I also need the same by month.
Thanks.
This is what df.resample is all about.
As long as you have a proper time series as the index, try df.resample('W').sum() for weekly and df.resample('M').sum() for monthly.
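A runnable sketch with made-up daily counts; note that in modern pandas the aggregation is spelled .resample(...).sum() rather than the old how=sum argument:

```python
import pandas as pd

# Daily counts indexed by a DatetimeIndex (values are illustrative)
idx = pd.to_datetime(['2013-08-27', '2013-08-28', '2013-09-02', '2013-09-03'])
df = pd.DataFrame({'combined': [1, 4, 3, 1]}, index=idx)

# Aggregate into weekly (weeks ending Sunday) and monthly bins
weekly = df.resample('W').sum()
monthly = df.resample('M').sum()
print(weekly)
print(monthly)
```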