Assign Week Number beginning at 1 based on Dates starting in November - python

I have a date column with dates starting in the month of November. Below is a sample:
| dt |
|------------|
| 11/13/2017 |
| 11/13/2017 |
| 11/13/2017 |
| 11/13/2017 |
| 11/20/2017 |
| 11/20/2017 |
| 11/27/2017 |
| 11/27/2017 |
| 11/27/2017 |
| 12/4/2017 |
| 12/11/2017 |
| 12/18/2017 |
| 12/18/2017 |
| 12/25/2017 |
| 1/1/2018 |
| 1/8/2018 |
I want to get a week number from the dates, but the week number should be 1 for 11/13/2017, 2 for 11/20/2017, and continue to increase through 1/8/2018. How can I achieve this in Python?

You can do:
df['week'] = (df['dt'] - df['dt'].min())//pd.to_timedelta('7D') + 1
Output:
dt week
0 2017-11-13 1
1 2017-11-13 1
2 2017-11-13 1
3 2017-11-13 1
4 2017-11-20 2
5 2017-11-20 2
6 2017-11-27 3
7 2017-11-27 3
8 2017-11-27 3
9 2017-12-04 4
10 2017-12-11 5
11 2017-12-18 6
12 2017-12-18 6
13 2017-12-25 7
14 2018-01-01 8
15 2018-01-08 9
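Note that this assumes df['dt'] is already a datetime column. If it is still stored as strings like the sample, convert it first; a minimal sketch (the frame construction here is a hypothetical reconstruction of a few sample rows):

import pandas as pd

# Hypothetical reconstruction of a few of the sample rows as strings
df = pd.DataFrame({'dt': ['11/13/2017', '11/20/2017', '12/4/2017', '1/8/2018']})

# Parse the month/day/year strings into datetimes before doing the arithmetic
df['dt'] = pd.to_datetime(df['dt'], format='%m/%d/%Y')

df['week'] = (df['dt'] - df['dt'].min()) // pd.to_timedelta('7D') + 1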

Let us do the floor division on the timedelta directly. PS: just don't name your column dt, since dt is the name of the pandas datetime accessor (so df.dt attribute access becomes ambiguous):
((df['dt']-df['dt'].min())//7).dt.days+1
Out[300]:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 6
12 6
13 7
14 8
15 9
Name: dt, dtype: int64
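Equivalently, you can take the day count first and then floor-divide by 7 (a minimal variant of the same idea):

(df['dt'] - df['dt'].min()).dt.days // 7 + 1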

Related

Python: How to create an html table from a list of strings

Can you please suggest a Python module that will help convert the provided list to an HTML table?
list looks like this:
[' Date Jan 31 Jan 30 Jan 29 Jan 28 Jan 27 Jan 26 Jan 25 Jan 24 Jan 23 Jan 22 ',
' ID1 6 7 6 7 7 7 5 7 7 7 ',
' ID2 6 6 4 6 7 5 6 7 5 6 ',
' ID3 6 7 6 5 6 7 6 2 7 5 ',
' ID4 4 5 3 6 7 5 7 7 7 5 ',
' ID5 0 0 0 0 0 0 0 0 0 0 ',
' ID6 0 0 0 0 0 0 0 0 0 0 ',
' ID7 0 0 0 0 0 0 0 0 0 0 ']
Thanks in advance.
The data is a little messy, and this could definitely be cleaned up to be more efficient, but as a hacky way of getting what you need, this should do the trick:
from prettytable import PrettyTable
x = PrettyTable()
data = [' Date Jan 31 Jan 30 Jan 29 Jan 28 Jan 27 Jan 26 Jan 25 Jan 24 Jan 23 Jan 22 ', ' ID1 6 7 6 7 7 7 5 7 7 7 ', ' ID2 6 6 4 6 7 5 6 7 5 6 ', ' ID3 6 7 6 5 6 7 6 2 7 5 ', ' ID4 4 5 3 6 7 5 7 7 7 5 ', ' ID5 0 0 0 0 0 0 0 0 0 0 ', ' ID6 0 0 0 0 0 0 0 0 0 0 ', ' ID7 0 0 0 0 0 0 0 0 0 0 ']
clean_data = []
header_list = []
for row in data:
    clean_data.append(row.strip().split())  # iterate through the original list, strip outer spaces, then split on whitespace
data_to_merge = clean_data[0] # get first row from clean list
data_to_merge = data_to_merge[1:] # remove date from list
alternate_join = map(' '.join, zip(data_to_merge[::2], data_to_merge[1::2])) # join alternately to get Jan 31, Jan 30 etc rather than Jan, 31, Jan, 30
converted_to_list = list(alternate_join) # convert map to list
converted_to_list.insert(0, clean_data[0][0]) # put the date header item back in the beginning of the list
x.field_names = converted_to_list # add the field names to PrettyTable
x.add_rows(clean_data[1:]) # add the rows to PrettyTable
print(x) # Done!
# +------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
# | Date | Jan 31 | Jan 30 | Jan 29 | Jan 28 | Jan 27 | Jan 26 | Jan 25 | Jan 24 | Jan 23 | Jan 22 |
# +------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
# | ID1 | 6 | 7 | 6 | 7 | 7 | 7 | 5 | 7 | 7 | 7 |
# | ID2 | 6 | 6 | 4 | 6 | 7 | 5 | 6 | 7 | 5 | 6 |
# | ID3 | 6 | 7 | 6 | 5 | 6 | 7 | 6 | 2 | 7 | 5 |
# | ID4 | 4 | 5 | 3 | 6 | 7 | 5 | 7 | 7 | 7 | 5 |
# | ID5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# | ID6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# | ID7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# +------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
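Since the question asks for an HTML table, note that PrettyTable can also render the same object as HTML via its get_html_string method; a minimal follow-up:

html = x.get_html_string()  # returns the table as an HTML <table> string
print(html)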

Dataframe: calculate difference in dates column by another column

I'm trying to calculate a running difference on the date column based on the event column.
That is, I want to add another column with the date difference between successive 1s in the event column (it contains only 0s and 1s).
So far I have come up with this half-working, crappy solution.
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')
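Note that fillna(method='bfill') is deprecated in recent pandas (2.1+). If you are on a newer version, the equivalent spelling uses the bfill method directly, e.g.:

df['idx'] = df['idx'].bfill()             # instead of df['idx'].fillna(method='bfill')
df['duration'] = df['duration'].bfill()   # same replacement for the duration column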

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like this:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
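For reference, a minimal construction of the sample frame (a sketch assuming the column names and values in the tables above):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01',
                            '2019-04-01', '2019-05-01', '2019-06-01',
                            '2019-07-01', '2019-08-01', '2019-09-01']),
    'type': ['A'] * 6 + ['B'] * 3,
    'column1': [1, 1, 1, 0, 1, 1, 1, 1, 0],
})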
You want the cumsum of 'column1' grouped by 'type' plus the cumsum of a Boolean Series that resets the grouping at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0

Pandas DataFrame Group and Rollup in one operation

I have a Pandas DataFrame with two columns: the "close_time" of a trade (datetime format) and the "net_profit" from that trade. I have shared some sample data below. I need to find the count of total trades and the count of profitable trades by day. So, for example, the output would look like:
+------------+--------------+-------------------------+
| Close_day  | Total_Trades | Total_Profitable_Trades |
+------------+--------------+-------------------------+
| 2014-11-03 | 5            | 4                       |
+------------+--------------+-------------------------+
Can this be done using something like groupby? How?
+------------------------------------+
| close_time net_profit |
+------------------------------------+
| 0 2014-10-31 14:41:41 20.84 |
| 1 2014-11-03 10:50:59 238.74 |
| 2 2014-11-03 11:05:10 491.32 |
| 3 2014-11-03 12:31:06 55.87 |
| 4 2014-11-03 14:31:34 -402.29 |
| 5 2014-11-03 20:33:29 164.18 |
| 6 2014-11-04 16:30:24 -296.96 |
| 7 2014-11-04 23:59:21 281.86 |
| 8 2014-11-04 23:59:34 -296.37 |
| 9 2014-11-05 10:14:42 517.55 |
| 10 2014-11-05 20:38:49 350.35 |
| 11 2014-11-07 11:23:31 710.13 |
| 12 2014-11-07 11:23:38 137.55 |
| 13 2014-11-11 19:00:01 201.97 |
| 14 2014-11-11 19:00:15 -484.77 |
| 15 2014-11-12 23:41:04 -1346.71 |
| 16 2014-11-12 23:41:25 514.30 |
| 17 2014-11-13 18:55:34 103.34 |
| 18 2014-11-13 18:55:43 -180.37 |
| 19 2014-11-26 17:10:59 -1756.69 |
+------------------------------------+
Setup
Make sure that your close_time is datetime by using
df.close_time = pd.to_datetime(df.close_time)
You can use groupby and agg here:
out = (df.groupby(df.close_time.dt.date)
         .net_profit.agg(['count', lambda x: x.gt(0).sum()])
         .astype(int))
out.columns = ['trades', 'profitable_trades']
            trades  profitable_trades
close_time
2014-10-31 1 1
2014-11-03 5 4
2014-11-04 3 1
2014-11-05 2 2
2014-11-07 2 2
2014-11-11 2 1
2014-11-12 2 1
2014-11-13 2 1
2014-11-26 1 0
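On pandas 0.25+ the same thing can be written with named aggregation, which avoids renaming the columns afterwards; a hedged alternative sketch:

out = (df.groupby(df.close_time.dt.date)
         .net_profit
         .agg(trades='count',
              profitable_trades=lambda s: s.gt(0).sum())
         .astype(int))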

Pandas: create rows for each unique value of a column, even with missing data

Note: I had difficulty wording the title of my question, so if you can think of something better to help other people with a similar question, please let me know and I will change it.
Current Data
Stored as a Pandas DataFrame
print(df)
week | site | vol
1 | a | 10
2 | a | 11
3 | a | 2
1 | b | 55
2 | b | 1
1 | c | 69
2 | c | 66
3 | c | 23
Notice that site b has no data for week 3
Goal
week | site | vol
1 | a | 10
2 | a | 11
3 | a | 2
1 | b | 55
2 | b | 1
3 | b | 0
1 | c | 69
2 | c | 66
3 | c | 23
Essentially, I want to create rows for all of the unique combinations of week and site. If the original data doesn't have a vol for a week-site combo, then it gets a 0.
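For reference, a minimal construction of the sample frame (a sketch assuming the values shown above), usable with both answers below:

import pandas as pd

df = pd.DataFrame({
    'week': [1, 2, 3, 1, 2, 1, 2, 3],
    'site': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'vol':  [10, 11, 2, 55, 1, 69, 66, 23],
})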
Using stack with unstack
df.set_index(['week','site']).unstack('week',fill_value=0).stack().reset_index()
Out[424]:
site week vol
0 a 1 10
1 a 2 11
2 a 3 2
3 b 1 55
4 b 2 1
5 b 3 0
6 c 1 69
7 c 2 66
8 c 3 23
You can use crosstab and stack:
pd.crosstab(df.site, df.week, df.vol, aggfunc='first').fillna(0).stack().reset_index(name='vol')
Output:
site week vol
0 a 1 10.0
1 a 2 11.0
2 a 3 2.0
3 b 1 55.0
4 b 2 1.0
5 b 3 0.0
6 c 1 69.0
7 c 2 66.0
8 c 3 23.0
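Note that fillna(0) promotes vol to float (hence the 10.0, 11.0, ... above). If you want integer counts back, cast at the end; a small follow-up sketch:

(pd.crosstab(df.site, df.week, df.vol, aggfunc='first')
   .fillna(0)
   .stack()
   .reset_index(name='vol')
   .astype({'vol': int}))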
