I get a data CSV file from GitHub and use pd.read_csv() to read it. It automatically creates an index like this:
label repeattrips id offer_id never_bought_company \
0 1 5 86246 1208251 0
1 1 16 86252 1197502 0
2 0 0 12682470 1197502 1
3 0 0 12996040 1197502 1
4 0 0 13089312 1204821 0
5 0 0 13179265 1197502 1
6 0 0 13251776 1200581 0
But when I create my own CSV file and read it:
label gender age_range action0 action1 action2 action3 first \
0 0 2 1 0 1 0 2 1
0 0 4 0 0 1 0 1 1
0 1 2 8 0 1 0 9 1
1 0 2 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
0 1 5 0 0 1 0 1 1
the label column is treated as the index in my output.
Adding an index number at the front of every line of my data still didn't solve the problem. Like this:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2 1
1 0 0 4 0 0 1 0 1 1
2 0 1 2 8 0 1 0 9 1
3 1 0 2 0 0 1 0 1 1
4 0 1 5 0 0 1 0 1 1
5 0 1 5 0 0 1 0 1 1
6 0 0 7 5 0 1 0 6 1
7 0 0 7 1 0 1 0 2 1
I don't know if I saved it correctly. My CSV data looks like this (with index numbers added), and the GitHub file has a similar format:
label gender age_range action0 action1 action2 action3 first second third fourth sirstrate secondrate thirdrate fourthrate total_cat total_brand total_time total_items users_appear users_items users_cats users_brands users_times users_action0 users_action1 users_action2 users_action3 merchants_appear merchants_items merchants_cats merchants_brands merchants_times merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 0 0 2 1 0 1 0 2 1 1 0 0.0224719101124 0.5 0.5 0 1 1 1 1 89 71 22 45 17 87 0 2 0 46 34 11 16 3 38 4 2 2
1 0 0 4 0 0 1 0 1 1 1 0 0.00469483568075 0.0232558139535 0.0232558139535 0.0 1 1 1 1 213 102 47 44 30 170 0 36 7 103 58 25 23 6 81 0 22 0
2 0 1 2 8 0 1 0 9 1 1 0 0.0157342657343 0.0181818181818 0.0181818181818 0.0 2 2 1 5 572 393 111 158 60 517 0 15 40 119 70 24 20 17 106 6 7 0
3 1 0 2 0 0 1 0 1 1 1 0 0.0142857142857 0.0769230769231 0.0769230769231 0.0 1 1 1 1 70 33 19 15 15 57 0 11 2 27 17 11 15 11 18 0 2 7
4 0 1 5 0 0 1 0 1 1 1 0 0.025641025641 0.2 0.2 0.0 1 1 1 1 39 32 16 29 14 34 0 4 1 133 88 26 25 11 128 0 5 0
The whole line ends up in a single cell, rather than each item of the line in its own cell.
Could you tell me how to solve this?
You'll need to provide code to get more substantive help, since it's unclear why you're facing a problem. For example, the data you pasted at the bottom reads in just fine with pd.read_clipboard(), and pd.read_csv() should also work fine as long as you set it up with a whitespace separator:
In [2]: pd.read_clipboard()
Out[2]:
label gender age_range action0 action1 action2 action3 first \
0 0 0 2 1 0 1 0 2
1 0 0 4 0 0 1 0 1
2 0 1 2 8 0 1 0 9
3 1 0 2 0 0 1 0 1
4 0 1 5 0 0 1 0 1
second third ... users_action3 merchants_appear \
0 1 1 ... 0 46
1 1 1 ... 7 103
2 1 1 ... 40 119
3 1 1 ... 2 27
4 1 1 ... 1 133
merchants_items merchants_cats merchants_brands merchants_times \
0 34 11 16 3
1 58 25 23 6
2 70 24 20 17
3 17 11 15 11
4 88 26 25 11
merchants_action0 merchants_action1 merchants_action2 merchants_action3
0 38 4 2 2
1 81 0 22 0
2 106 6 7 0
3 18 0 2 7
4 128 0 5 0
[5 rows x 37 columns]
Say I had a dataframe column of ones and zeros, and I wanted to group by clusters where the value is 1. Using groupby would ordinarily yield two groups: a single group of zeros and a single group of ones.
df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1],columns=['clusters'])
print(df)
clusters
0 1
1 1
2 1
3 0
4 0
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 1
16 1
for k, g in df.groupby(by=df.clusters):
    print(k, g)
0 clusters
3 0
4 0
5 0
6 0
9 0
10 0
11 0
13 0
1 clusters
0 1
1 1
2 1
7 1
8 1
12 1
14 1
15 1
16 1
So in effect, I need a new column with a unique identifier for each cluster of 1s; hence we would end up with:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Any help welcome. Thanks.
Let us use ngroup:
m = df['clusters'].eq(0)
df['unique'] = df.groupby(m.cumsum()[~m]).ngroup() + 1
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Using a mask:
m = df['clusters'].eq(0)
df['unique'] = m.ne(m.shift()).mask(m, False).cumsum().mask(m, 0)
output:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
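The mask approach can be checked end to end on the sample frame. A self-contained sketch, with the intermediate comments being my reading of each step:

```python
import pandas as pd

df = pd.DataFrame([1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1],
                  columns=['clusters'])

m = df['clusters'].eq(0)                   # True on the zero rows
starts = m.ne(m.shift()).mask(m, False)    # True only where a run of ones begins
df['unique'] = starts.cumsum().mask(m, 0)  # number the runs, zero elsewhere

print(df['unique'].tolist())
# [1, 1, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 0, 4, 4, 4]
```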
I'm stuck on this issue: whenever the number 1 appears in column "values", the column "trigger" should display the number 1 in the next 5 cells.
Please consider the following example:
Index values
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
The expected result should be as follows:
Index values trigger
1 0 0
2 0 0
3 1 0
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 0
10 0 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 0
18 0 0
19 0 0
20 0 0
Use Series.ffill:
m = df['values'].eq(1)
df['trigger'] = (df['values'].where(m).ffill(limit=5).mask(m)
                 .fillna(0, downcast='int'))
Or
df['trigger'] = (df['values'].shift().where(lambda x: x.eq(1))
.ffill(limit=4).fillna(0, downcast='int'))
Output
print(df)
Index values trigger
0 1 0 0
1 2 0 0
2 3 1 0
3 4 0 1
4 5 0 1
5 6 0 1
6 7 0 1
7 8 0 1
8 9 0 0
9 10 0 0
10 11 1 0
11 12 0 1
12 13 0 1
13 14 0 1
14 15 0 1
15 16 0 1
16 17 0 0
17 18 0 0
18 19 0 0
19 20 0 0
You could use .fillna(df['values']) if you want to keep the values of the values column.
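Put together as a runnable sketch (substituting .fillna(0).astype(int) for the downcast argument, which is deprecated in recent pandas):

```python
import pandas as pd

df = pd.DataFrame({'values': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
                              1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

m = df['values'].eq(1)
# Keep only the 1s, carry them forward at most 5 rows, then blank out the
# trigger rows themselves and fill everything else with 0.
df['trigger'] = (df['values'].where(m).ffill(limit=5).mask(m)
                 .fillna(0).astype(int))

print(df['trigger'].tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```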
Simple enough, but the formatting of the data is giving me a headache:
I have the sales data of items for each day of the week. Each row is a different item, and the number corresponds to how many were sold on that day.
I want to make a mosaic plot to see the distribution of item sales across days (so seven 'columns', one for each day). I'm using statsmodels' mosaic.
How should I reformat this? It's already a dataframe, and I'm using the mosaic(...) function.
0 Fri Mon Sat Sun Thu Tue Wed
0 2 3 3 1 2 3 4
1 2 2 0 1 1 3 1
2 0 0 1 0 2 0 2
3 4 1 3 0 6 1 4
4 0 1 0 0 1 6 0
5 0 0 0 0 0 0 0
6 14 8 13 9 14 25 40
7 3 11 4 4 5 7 7
8 16 24 20 18 22 32 41
9 0 0 0 0 0 0 0
10 0 1 8 6 1 0 8
11 0 1 2 2 3 4 3
12 10 5 3 7 11 13 22
13 3 10 2 5 6 6 15
14 5 4 7 7 12 10 17
15 0 6 0 2 1 3 3
16 1 0 0 1 8 2 6
17 3 6 7 4 7 15 24
18 0 0 0 1 1 0 3
19 0 0 0 0 0 1 0
20 0 1 0 0 0 0 0
21 6 3 2 4 10 19 14
22 2 5 4 2 4 12 9
23 7 6 5 8 12 9 11
24 0 2 0 0 0 2 2
25 0 2 0 0 1 3 1
26 0 2 0 0 0 1 0
27 0 0 0 0 0 3 0
28 5 0 2 3 1 2 5
29 0 0 0 0 0 0 1
30 0 0 0 0 0 0 1
31 4 4 2 0 5 7 4
32 2 3 4 1 7 3 5
33 1 1 1 0 4 2 2
34 0 0 0 0 0 0 1
35 1 3 3 3 8 6 13
36 1 2 0 0 2 6 6
37 2 5 0 3 0 7 2
38 0 0 0 0 1 0 0
39 1 2 2 0 0 4 3
40 1 4 0 3 1 8 5
41 2 1 2 1 2 1 3
42 0 0 0 0 0 0 0
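If the goal is just one tile per day sized by total sales, little reshaping may be needed: mosaic accepts a dict of category-to-count, so summing each column first should be enough. A sketch on a hypothetical two-row subset of the data (the plotting call is left commented out so the aggregation can be checked on its own):

```python
import pandas as pd
# from statsmodels.graphics.mosaicplot import mosaic

days = ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']
df = pd.DataFrame([[2, 3, 3, 1, 2, 3, 4],
                   [2, 2, 0, 1, 1, 3, 1]], columns=days)

totals = df.sum()          # one aggregate count per day
print(totals.tolist())
# [4, 5, 3, 2, 3, 6, 5]

# mosaic(totals.to_dict())  # one tile per day, width proportional to sales
```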
I have a very long time series indicating whether a day was dry (no rain) or wet. Part of the time series is shown here:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 1
2009-05-20 0
2009-05-21 1
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
....
I need to find dry periods, meaning runs of successive dry days (more than one dry day in a row). Therefore I would like to change the value of DryDay from 1 to 0 wherever there is only a single isolated dry day. Like this:
Date DryDay
2009-05-07 0
2009-05-08 0
2009-05-09 1
2009-05-10 1
2009-05-11 1
2009-05-12 1
2009-05-13 1
2009-05-14 0
2009-05-15 0
2009-05-16 0
2009-05-17 0
2009-05-18 0
2009-05-20 0
2009-05-21 0
2009-05-22 0
2009-05-23 1
2009-05-24 1
2009-05-25 1
2009-05-26 0
2009-05-27 0
2009-05-28 1
2009-05-29 1
2009-05-30 0
...
Can anyone help me do this with pandas?
There might be a better way, but here is one:
df['DryDay'] = ((df['DryDay'] == 1) &
                ((df['DryDay'].shift() == 1) | (df['DryDay'].shift(-1) == 1))).astype(int)
Date DryDay
0 2009-05-07 0
1 2009-05-08 0
2 2009-05-09 1
3 2009-05-10 1
4 2009-05-11 1
5 2009-05-12 1
6 2009-05-13 1
7 2009-05-14 0
8 2009-05-15 0
9 2009-05-16 0
10 2009-05-17 0
11 2009-05-18 0
12 2009-05-20 0
13 2009-05-21 0
14 2009-05-22 0
15 2009-05-23 1
16 2009-05-24 1
17 2009-05-25 1
18 2009-05-26 0
19 2009-05-27 0
20 2009-05-28 1
21 2009-05-29 1
22 2009-05-30 0
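The one-liner above can be checked end to end on the 0/1 column alone (dates omitted for brevity):

```python
import pandas as pd

dry = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0]
df = pd.DataFrame({'DryDay': dry})

# A day stays "dry" only if it is dry AND a neighbouring day (before or
# after) is also dry; isolated dry days become 0.
df['DryDay'] = ((df['DryDay'] == 1) &
                ((df['DryDay'].shift() == 1) | (df['DryDay'].shift(-1) == 1))
                ).astype(int)

print(df['DryDay'].tolist())
# [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
```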
Try this:
((df1.DryDay.rolling(2, min_periods=1).sum() > 1) |
 (df1.DryDay.iloc[::-1].rolling(2, min_periods=1).sum() > 1)).astype(int)
Out[95]:
0 0
1 0
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 1
16 1
17 1
18 0
19 0
20 1
21 1
22 0
Name: DryDay, dtype: int32
I used pivot_table to group a dataframe named df:
UserId eventA eventB eventC ... date days
1 77 4 9 2015-11-01 2
1 3 1 1 2015-12-30 60
1 37 1 2 2016-04-23 174
1 6 2 2 2016-06-12 225
2 42 6 7 2015-09-07 130
... ... ... ... ...
I drop the date column, then:
df = df.pivot_table(index='UserID', columns='days', fill_value=0)
eventA \
day 1 2 3 4 5 6 7
UserID
1 0 77 0 0 0 0 0
2 0 6 0 0 0 0 9
3 0 0 0 0 12 0 0
4 0 0 0 0 0 0 3
5 0 0 0 33 0 0 0
... eventG \
days 8 9 10 ... 769 770
msno ...
1 0 12 113 ... 0 0
2 0 0 0 ... 0 0
3 0 0 32 ... 66 0
4 0 0 0 ... 0 0
5 0 0 0 ... 0 43
On the other hand, I have another dataframe with UserID, start date, and end date.
Each user may have multiple records.
I computed days as [date - first Start_date] for each user.
UserID Start_date End_date
1 2015-10-31 2015-12-21
1 2016-01-01 2016-07-21
2 2015-05-01 2016-10-01
3 2015-05-22 2015-08-22
3 2015-09-09 2015-11-01
3 2016-03-31 2016-07-24
4 2016-10-31 2016-12-21
So the problem is: for each user, I want to find every time range in which he is not within any (Start_date, End_date) range, mark those days as None, and pad them into df.
For example, userID 3 has three ranges, and I wish to mark [2015-08-22 to 2015-05-22, 2015-09-09 to 2015-05-22] -- the 30th day to the 111th day, [2015-11-01 to 2015-05-22, 2016-03-31 to 2015-05-22] -- the 164th day to the 314th day, and [2016-07-24 to 2015-05-22, \ ] -- the 429th day to the 770th day -- as None.
The final dataframe should be similar like this,
eventA \
day 1 2 3 4 5 6 7
UserID
1 0 77 0 0 0 0 0
2 0 6 0 0 0 0 9
3 0 0 0 0 12 0 0
4 0 0 0 0 0 0 3
5 0 0 0 33 0 0 0
... eventG \
days 8 9 10 ... 112 113 114... 769 770
msno ... ...
1 0 12 113 ... 0 2 4 None None
2 0 0 0 ... 12 0 3 None None
3 0 0 32 ... None None None 66 0
4 0 0 0 ... 5 1 0 None None
5 0 0 0 ... None None 2 0 43
I hope I made this question clear.
Looking for someone who could help me!
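One possible sketch of the padding step on a toy version of the problem. All names and shapes here are illustrative assumptions (the day offsets are taken as already computed from the dates, and the wide frame stands in for the pivoted event counts): build, per user, a boolean mask of the day columns covered by that user's ranges, and set everything outside it to None.

```python
import numpy as np
import pandas as pd

# Toy wide frame: rows are users, columns are day offsets (a stand-in for
# the pivoted event counts above).
wide = pd.DataFrame([[1, 2, 3, 4, 5],
                     [6, 7, 8, 9, 10]],
                    index=pd.Index([1, 3], name='UserId'),
                    columns=[1, 2, 3, 4, 5])

# Toy ranges table, with start/end already converted to day offsets.
ranges = pd.DataFrame({'UserId': [1, 3, 3],
                       'start_day': [1, 1, 5],
                       'end_day': [3, 2, 5]})

out = wide.astype(object)          # object dtype so cells can hold None
days = np.array(wide.columns)
for uid, grp in ranges.groupby('UserId'):
    covered = np.zeros(len(days), dtype=bool)
    for _, r in grp.iterrows():
        covered |= (days >= r['start_day']) & (days <= r['end_day'])
    out.loc[uid, ~covered] = None  # pad days outside every range for this user

print(out)  # user 1: days 4-5 padded; user 3: days 3-4 padded
```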