How can I customize the week number in Python?

Currently, the week numbers for the period '2020-5-6' to '2020-5-19' are 20 and 21.
How do I customize them so that the week numbers are 1 and 2 instead, with subsequent periods numbered accordingly?
My code:
import pandas as pd
df = pd.DataFrame({'Date':pd.date_range('2020-5-6', '2020-5-19')})
df['Period'] = df['Date'].dt.to_period('W-TUE')
df['Week_Number'] = df['Period'].dt.week
df.head()
print(df)
My output:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 20
1 2020-05-07 2020-05-06/2020-05-12 20
2 2020-05-08 2020-05-06/2020-05-12 20
3 2020-05-09 2020-05-06/2020-05-12 20
...
11 2020-05-17 2020-05-13/2020-05-19 21
12 2020-05-18 2020-05-13/2020-05-19 21
13 2020-05-19 2020-05-13/2020-05-19 21
What I want:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 1
1 2020-05-07 2020-05-06/2020-05-12 1
2 2020-05-08 2020-05-06/2020-05-12 1
3 2020-05-09 2020-05-06/2020-05-12 1
...
11 2020-05-17 2020-05-13/2020-05-19 2
12 2020-05-18 2020-05-13/2020-05-19 2
13 2020-05-19 2020-05-13/2020-05-19 2
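One way to get a 1-based week number (not from the original thread; a sketch) is to number the distinct periods in order of appearance with `pd.factorize`:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-5-6', '2020-5-19')})
df['Period'] = df['Date'].dt.to_period('W-TUE')
# factorize assigns each distinct period a code in order of appearance,
# starting at 0, so adding 1 gives week numbers 1, 2, 3, ...
df['Week_Number'] = pd.factorize(df['Period'])[0] + 1
print(df)
```

Because the codes keep incrementing, later periods automatically become week 3, 4, and so on.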

Related

Remove duplicate dataframe column items just at the beginning while keeping the last entry

The dataframe below is what I'm trying to plot, but each column starts with several duplicate entries. I want to drop those initial repeated values in each column, keeping only the last one, so they don't appear in the graph (duplicates in the middle or at the end can stay).
Could someone please help me solve this?
Code I tried (this removes rows only when the whole row is duplicated):
df = df.drop_duplicates(subset=df.columns[1:], keep='last')
df = df.groupby((df.shift() != df).cumsum()).filter(lambda x: len(x) < 5)
Input:
Date Build1 Build2 Build3 Build4 Build5 Build6
2022-11-26 00:00:00 30 30 30 30 30 30
2022-11-27 00:00:00 30 30 30 30 30 30
2022-11-28 00:00:00 30 30 30 30 30 30
2022-11-29 00:00:00 30 30 30 30 30 30
2022-11-30 00:00:00 30 30 30 30 30 30
2022-12-01 00:00:00 28 30 30 30 30 30
2022-12-02 00:00:00 25 30 30 30 30 30
2022-12-03 00:00:00 25 30 30 30 30 30
2022-12-04 00:00:00 22 28 30 30 30 30
2022-12-05 00:00:00 22 26 30 30 30 30
2022-12-06 00:00:00 22 23 30 30 30 30
2022-12-07 00:00:00 22 22 30 30 30 30
2022-12-08 00:00:00 22 20 30 30 30 30
2022-12-09 00:00:00 22 20 25 30 30 30
2022-12-10 00:00:00 22 20 23 30 30 30
2022-12-11 00:00:00 22 20 23 30 30 30
2022-12-12 00:00:00 22 20 18 30 30 30
2022-12-13 00:00:00 22 20 14 30 30 30
2022-12-14 00:00:00 22 20 11 30 30 30
2022-12-15 00:00:00 22 20 10 27 30 30
2022-12-16 00:00:00 22 20 10 20 30 30
2022-12-17 00:00:00 22 20 10 20 30 30
2022-12-18 00:00:00 22 20 10 20 30 30
2022-12-19 00:00:00 22 20 10 13 30 30
2022-12-20 00:00:00 22 20 10 2 30 30
2022-12-21 00:00:00 22 20 10 2 19 30
2022-12-22 00:00:00 22 20 10 2 11 30
2022-12-23 00:00:00 22 20 10 2 4 30
2022-12-24 00:00:00 22 20 10 2 0 30
2022-12-25 00:00:00 22 20 10 2 0 22
2022-12-26 00:00:00 22 20 10 2 0 15
2022-12-27 00:00:00 22 20 10 2 0 15
2022-12-28 00:00:00 22 20 10 2 0 9
Expected output:
Date Build1 Build2 Build3 Build4 Build5 Build6
2022-11-26 00:00:00
2022-11-27 00:00:00
2022-11-28 00:00:00
2022-11-29 00:00:00
2022-11-30 00:00:00 30
2022-12-01 00:00:00 28
2022-12-02 00:00:00 25
2022-12-03 00:00:00 25 30
2022-12-04 00:00:00 22 28
2022-12-05 00:00:00 22 26
2022-12-06 00:00:00 22 23
2022-12-07 00:00:00 22 22
2022-12-08 00:00:00 22 20 30
2022-12-09 00:00:00 22 20 25
2022-12-10 00:00:00 22 20 23
2022-12-11 00:00:00 22 20 23
2022-12-12 00:00:00 22 20 18
2022-12-13 00:00:00 22 20 14
2022-12-14 00:00:00 22 20 11 30
2022-12-15 00:00:00 22 20 10 27
2022-12-16 00:00:00 22 20 10 20
2022-12-17 00:00:00 22 20 10 20
2022-12-18 00:00:00 22 20 10 20
2022-12-19 00:00:00 22 20 10 13
2022-12-20 00:00:00 22 20 10 2 30
2022-12-21 00:00:00 22 20 10 2 19
2022-12-22 00:00:00 22 20 10 2 11
2022-12-23 00:00:00 22 20 10 2 4
2022-12-24 00:00:00 22 20 10 2 0 30
2022-12-25 00:00:00 22 20 10 2 0 22
2022-12-26 00:00:00 22 20 10 2 0 15
2022-12-27 00:00:00 22 20 10 2 0 15
2022-12-28 00:00:00 22 20 10 2 0 9
You can simply do
import numpy as np
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
df.where(~is_duplicate, np.nan)
which gives
Date Build1 Build2 Build3 Build4 Build5
0 2022-11-26 00:00:00 30 30 NaN NaN NaN
1 2022-11-27 00:00:00 30 30 NaN NaN NaN
2 2022-11-28 00:00:00 30 30 NaN NaN NaN
3 2022-11-29 00:00:00 30 30 NaN NaN NaN
4 2022-11-30 00:00:00 30 30 NaN NaN NaN
5 2022-12-01 00:00:00 28 30 NaN NaN NaN
6 2022-12-02 00:00:00 25 30 NaN NaN NaN
7 2022-12-03 00:00:00 25 30 NaN NaN NaN
8 2022-12-04 00:00:00 22 30 NaN NaN NaN
9 2022-12-05 00:00:00 22 30 NaN NaN NaN
10 2022-12-06 00:00:00 22 30 NaN NaN NaN
11 2022-12-07 00:00:00 22 30 NaN NaN NaN
12 2022-12-08 00:00:00 22 30 NaN NaN NaN
13 2022-12-09 00:00:00 22 25 30.0 NaN NaN
14 2022-12-10 00:00:00 22 23 30.0 NaN NaN
15 2022-12-11 00:00:00 22 23 30.0 NaN NaN
16 2022-12-12 00:00:00 22 18 30.0 NaN NaN
17 2022-12-13 00:00:00 22 14 30.0 NaN NaN
18 2022-12-14 00:00:00 22 11 30.0 NaN NaN
19 2022-12-15 00:00:00 22 10 27.0 30.0 NaN
20 2022-12-16 00:00:00 22 10 20.0 30.0 NaN
21 2022-12-17 00:00:00 22 10 20.0 30.0 NaN
22 2022-12-18 00:00:00 22 10 20.0 30.0 NaN
23 2022-12-19 00:00:00 22 10 13.0 30.0 NaN
24 2022-12-20 00:00:00 22 10 2.0 30.0 NaN
25 2022-12-21 00:00:00 22 10 2.0 19.0 30.0
26 2022-12-22 00:00:00 22 10 2.0 11.0 30.0
27 2022-12-23 00:00:00 22 10 2.0 4.0 30.0
28 2022-12-24 00:00:00 22 10 2.0 0.0 30.0
29 2022-12-25 00:00:00 22 10 2.0 0.0 22.0
30 2022-12-26 00:00:00 22 10 2.0 0.0 15.0
31 2022-12-27 00:00:00 22 10 2.0 0.0 15.0
32 2022-12-28 00:00:00 22 10 2.0 0.0 9.0
or
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
print(df.where(~is_duplicate, ''))
which gives:
Date Build1 Build2 Build3 Build4 Build5
0 2022-11-26 00:00:00 30 30
1 2022-11-27 00:00:00 30 30
2 2022-11-28 00:00:00 30 30
3 2022-11-29 00:00:00 30 30
4 2022-11-30 00:00:00 30 30
5 2022-12-01 00:00:00 28 30
6 2022-12-02 00:00:00 25 30
7 2022-12-03 00:00:00 25 30
8 2022-12-04 00:00:00 22 30
9 2022-12-05 00:00:00 22 30
10 2022-12-06 00:00:00 22 30
11 2022-12-07 00:00:00 22 30
12 2022-12-08 00:00:00 22 30
13 2022-12-09 00:00:00 22 25 30
14 2022-12-10 00:00:00 22 23 30
15 2022-12-11 00:00:00 22 23 30
16 2022-12-12 00:00:00 22 18 30
17 2022-12-13 00:00:00 22 14 30
18 2022-12-14 00:00:00 22 11 30
19 2022-12-15 00:00:00 22 10 27 30
20 2022-12-16 00:00:00 22 10 20 30
21 2022-12-17 00:00:00 22 10 20 30
22 2022-12-18 00:00:00 22 10 20 30
23 2022-12-19 00:00:00 22 10 13 30
24 2022-12-20 00:00:00 22 10 2 30
25 2022-12-21 00:00:00 22 10 2 19 30
26 2022-12-22 00:00:00 22 10 2 11 30
27 2022-12-23 00:00:00 22 10 2 4 30
28 2022-12-24 00:00:00 22 10 2 0 30
29 2022-12-25 00:00:00 22 10 2 0 22
30 2022-12-26 00:00:00 22 10 2 0 15
31 2022-12-27 00:00:00 22 10 2 0 15
32 2022-12-28 00:00:00 22 10 2 0 9
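The masking idea can be checked on a small frame (the column names and values here are illustrative, not the full data from the question). Note that `pd.Series.duplicated` keeps the first occurrence in each row by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2022-11-26', '2022-11-27', '2022-11-28'],
    'Build1': [30, 30, 28],
    'Build2': [30, 30, 30],
})
# within each row, mark values that repeat an earlier value in that row
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
masked = df.where(~is_duplicate, np.nan)
print(masked)
```

In the first two rows Build2 repeats Build1's 30 and is blanked out, while in the last row the values differ and both survive.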

Pandas Drop rows for current Year-month

My dataframe has multiple years and months in "yyyy-mm-dd" format.
I would like to dynamically drop all rows for the current year-month from the df.
You could use a simple strftime comparison, keeping only the rows whose %Y%m is not equal to the year-month of the current date.
df1 = df.loc[
df['Date'].dt.strftime('%Y%m') != pd.Timestamp('today').strftime('%Y%m')]
Example
d = pd.date_range('01 oct 2021', '01 dec 2021',freq='d')
df = pd.DataFrame(d,columns=['Date'])
print(df)
Date
0 2021-10-01
1 2021-10-02
2 2021-10-03
3 2021-10-04
4 2021-10-05
.. ...
57 2021-11-27
58 2021-11-28
59 2021-11-29
60 2021-11-30
61 2021-12-01
print(df1)
Date
0 2021-10-01
1 2021-10-02
2 2021-10-03
3 2021-10-04
4 2021-10-05
5 2021-10-06
6 2021-10-07
7 2021-10-08
8 2021-10-09
9 2021-10-10
10 2021-10-11
11 2021-10-12
12 2021-10-13
13 2021-10-14
14 2021-10-15
15 2021-10-16
16 2021-10-17
17 2021-10-18
18 2021-10-19
19 2021-10-20
20 2021-10-21
21 2021-10-22
22 2021-10-23
23 2021-10-24
24 2021-10-25
25 2021-10-26
26 2021-10-27
27 2021-10-28
28 2021-10-29
29 2021-10-30
30 2021-10-31
61 2021-12-01
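An equivalent sketch using `to_period('M')` instead of string formatting (same filtering idea as the strftime answer above, just comparing monthly periods directly):

```python
import pandas as pd

d = pd.date_range('2021-10-01', '2021-12-01', freq='D')
df = pd.DataFrame({'Date': d})
# keep only rows whose year-month differs from the current year-month
current_month = pd.Timestamp('today').to_period('M')
df1 = df[df['Date'].dt.to_period('M') != current_month]
print(df1)
```

Comparing `Period` objects avoids building intermediate strings and reads a little more explicitly as "same calendar month".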

calculate time in days based on another date column and its first date in pandas

I have a df as shown below
Date t_factor
2020-02-01 5
2020-02-06 14
2020-02-09 23
2020-02-03 23
2020-03-11 38
2020-02-20 29
2020-02-13 30
2020-02-29 100
2020-03-26 70
From this I would like to create a column called time_in_days, calculated from the first date in the Date column, as shown below.
Note: the t_factor column is unused here.
Expected Output:
Date t_factor time_in_days
2020-02-01 5 1
2020-02-06 14 6
2020-02-09 23 9
2020-02-03 23 3
2020-03-11 38 40
2020-02-20 29 20
2020-02-13 30 13
2020-02-29 100 29
2020-03-26 70 55
Subtract the dates from the first date to get the delta.
# If you have a column of strings,
# df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['time_in_days_actual'] = (df['Date'] - df.at[0, 'Date']).dt.days + 1
df
Date t_factor time_in_days time_in_days_actual
0 2020-02-01 5 1 1
1 2020-02-06 14 6 6
2 2020-02-09 23 9 9
3 2020-02-03 23 3 3
4 2020-03-11 38 40 40
5 2020-02-20 29 20 20
6 2020-02-13 30 13 13
7 2020-02-29 100 29 29
8 2020-03-26 70 55 55
In [25]: from datetime import datetime
In [26]: a = ["2020-02-01", "2020-02-03", "2020-02-13", "2020-02-29", "2020-03-26"]
In [27]: df = pd.DataFrame(a, columns=["Date"])
In [28]: start_date = datetime.strptime(df.iloc[0]["Date"],"%Y-%m-%d")
In [29]: df["time_in_days"] = df["Date"].apply(lambda x: (datetime.strptime(x,"%Y-%m-%d") - start_date).days+1)
In [30]: df
Out[30]:
Date time_in_days
0 2020-02-01 1
1 2020-02-03 3
2 2020-02-13 13
3 2020-02-29 29
4 2020-03-26 55
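The same result can be had without manual `strptime` calls; a vectorized sketch using pandas' own parsing, with the same `.days + 1` convention as above:

```python
import pandas as pd

a = ["2020-02-01", "2020-02-03", "2020-02-13", "2020-02-29", "2020-03-26"]
df = pd.DataFrame(a, columns=["Date"])
# parse once, then subtract the first date from the whole column
dates = pd.to_datetime(df["Date"])
df["time_in_days"] = (dates - dates.iloc[0]).dt.days + 1
print(df)
```

This avoids the per-row `apply` and keeps the date parsing in one place.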
Try converting the column to datetime format first, then use something like this:
import pandas as pd
lis = '''2020-02-01
2020-02-06
2020-02-09
2020-02-03
2020-02-11
2020-02-20
2020-02-13
2020-02-29
2020-02-26'''.split()  # build a list of date strings
dt = pd.to_datetime(lis)
diff = dt[6]-dt[0]
print(diff.days)
Should do the trick.
import numpy as np
df = pd.DataFrame({'date': dt, 'random_col': np.random.randn(len(dt))})
df['date_diff'] = df['date'].apply(lambda x: x - df.iloc[0, 0])
df

python pandas mean by hour of day

I'm working with the following dataset with hourly counts in columns. The dataframe has more than 1400 columns and 100 rows.
My dataset looks like this:
CITY 2019-10-01 00:00 2019-10-01 01:00 2019-10-01 02:00 .... 2019-12-01 12:00
Wien 15 16 16 .... 14
Graz 11 11 11 .... 10
Innsbruck 12 12 10 .... 12
....
How can I aggregate these hourly columns into daily columns, like this:
CITY 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
(or 1 day) (or 2 day) (or 3 day) (or 72 day)
Wien 14 15 16 .... 12
Graz 13 12 14 .... 10
Innsbruck 13 12 12 .... 12
....
I would like the average of all hours of the day to be in the column of the one day.
The data type is:
type(df.columns[0])
out: str
type(df.columns[1])
out: pandas._libs.tslibs.timestamps.Timestamp
Thanks for your help!
I would do something like this:
days = df.columns[1:].to_series().dt.normalize()
df.set_index('CITY').groupby(days, axis=1).mean()
Output:
2019-10-01 2019-12-01
CITY
Wien 15.666667 14.0
Salzburg 12.000000 14.0
Graz 11.000000 10.0
Innsbruck 11.333333 12.0
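In recent pandas versions, `axis=1` for `groupby` is deprecated; a sketch of the same daily mean via transpose-and-resample, on a small illustrative frame (not the asker's full data):

```python
import pandas as pd

cols = pd.date_range('2019-10-01 00:00', periods=6, freq='h')
df = pd.DataFrame([[15, 16, 16, 14, 13, 12],
                   [11, 11, 11, 10, 9, 9]],
                  index=pd.Index(['Wien', 'Graz'], name='CITY'),
                  columns=cols)
# put the timestamps on the index, take daily means, transpose back
daily = df.T.resample('D').mean().T
print(daily)
```

`resample('D')` also fills in any missing calendar days, which the groupby approach does not.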

How to plot a pandas multiindex dataFrame with all xticks

I have a pandas dataFrame like this:
content
date
2013-12-18 12:30:00 1
2013-12-19 10:50:00 1
2013-12-24 11:00:00 0
2014-01-02 11:30:00 1
2014-01-03 11:50:00 0
2013-12-17 16:40:00 10
2013-12-18 10:00:00 0
2013-12-11 10:00:00 0
2013-12-18 11:45:00 0
2013-12-11 14:40:00 4
2010-05-25 13:05:00 0
2013-11-18 14:10:00 0
2013-11-27 11:50:00 3
2013-11-13 10:40:00 0
2013-11-20 10:40:00 1
2008-11-04 14:49:00 1
2013-11-18 10:05:00 0
2013-08-27 11:00:00 0
2013-09-18 16:00:00 0
2013-09-27 11:40:00 0
date being the index.
I reduce the values to months using:
dataFrame = dataFrame.groupby([lambda x: x.year, lambda x: x.month]).agg([sum])
which outputs:
content
sum
2006 3 66
4 65
5 48
6 87
7 37
8 54
9 73
10 74
11 53
12 45
2007 1 28
2 40
3 95
4 63
5 56
6 66
7 50
8 49
9 18
10 28
Now when I plot this dataFrame, I want the x-axis to show every month/year as a tick. I have tried setting xticks, but it doesn't seem to work. How can this be achieved? This is my current plot using dataFrame.plot():
You can use set_xticks() and set_xticklabels():
import numpy as np
import pandas as pd
idx = pd.date_range("2013-01-01", periods=1000)
val = np.random.rand(1000)
s = pd.Series(val, idx)
g = s.groupby([s.index.year, s.index.month]).mean()
ax = g.plot()
ax.set_xticks(range(len(g)));
ax.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90);
output: a plot with one year-month label at every tick (image not reproduced here)
