I have a dataframe df as below:
Student_id Date_of_visit(d/m/y)
1 1/4/2020
1 30/12/2019
1 26/12/2019
2 3/1/2021
2 10/1/2021
3 4/5/2020
3 22/8/2020
How can I get a bar graph with month-year on the x-axis (e.g. x-ticks: Dec 2019, Jan 2020, Feb 2020) and, on the y-axis, the total number of students (count) who visited in each month?
Convert the values to datetimes, then use DataFrame.resample with Resampler.size for the counts, and create the new datetime format with DatetimeIndex.strftime:
df['Date_of_visit'] = pd.to_datetime(df['Date_of_visit'], dayfirst=True)
s = df.resample('M', on='Date_of_visit')['Student_id'].size()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 2
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 2
Name: Student_id, dtype: int64
If you need to count only unique Student_id values, use Resampler.nunique:
s = df.resample('M', on='Date_of_visit')['Student_id'].nunique()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 1
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 1
Name: Student_id, dtype: int64
Finally, plot with Series.plot.bar:
s.plot.bar()
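For reference, here is a self-contained sketch of the same monthly count built from the question's sample data. It swaps resample for a groupby on monthly periods (Series.dt.to_period), which avoids the 'M' offset alias that recent pandas releases deprecate in favour of 'ME'. The variable names (months, counts, full) are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Student_id": [1, 1, 1, 2, 2, 3, 3],
    "Date_of_visit": ["1/4/2020", "30/12/2019", "26/12/2019",
                      "3/1/2021", "10/1/2021", "4/5/2020", "22/8/2020"],
})
df["Date_of_visit"] = pd.to_datetime(df["Date_of_visit"], dayfirst=True)

# Group visits by calendar month (monthly periods instead of resample)
months = df["Date_of_visit"].dt.to_period("M")
counts = df.groupby(months)["Student_id"].size()

# Add the months without any visits so the axis has no gaps
full = pd.period_range(counts.index.min(), counts.index.max(), freq="M")
counts = counts.reindex(full, fill_value=0)

# Human-readable labels such as "Dec 2019"
counts.index = counts.index.strftime("%b %Y")
print(counts)
```

Calling counts.plot.bar() on the result should give the same chart as above, assuming matplotlib is installed.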
I have one dataframe which looks like below:
                    Date_1                   Date_2
0               5 Dec 2017               5 Dec 2017
1              14 Dec 2017              14 Dec 2017
2              15 Dec 2017              15 Dec 2017
3  18 Dec 2017 21 Dec 2017  18 Dec 2017 21 Dec 2017
4              22 Dec 2017              22 Dec 2017
Conditions to be checked:
Check whether any row contains two dates, like row 3. If so, split them into two separate rows.
Convert both columns to datetime.
I am trying to do the conversion like below:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But I get the error below:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017
After extracting the dates with a regex and findall, your problem becomes an unnesting problem (unnesting below is a helper function that is not part of pandas and is not shown here):
s=df.apply(lambda x : x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
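Since the unnesting helper is not shown, here is a minimal sketch of the same idea using only built-in pandas methods. It assumes every date looks like "18 Dec 2017" (so a much simpler pattern than the one above is enough), that both columns hold the same number of dates in each row, and that pandas 1.3 or later is available for multi-column DataFrame.explode. Unlike the output above, it also renumbers the index.

```python
import pandas as pd

# Sample data copied from the question
values = ["5 Dec 2017", "14 Dec 2017", "15 Dec 2017",
          "18 Dec 2017 21 Dec 2017", "22 Dec 2017"]
df = pd.DataFrame({"Date_1": values, "Date_2": values})

# Assumed pattern: day, three-letter month, four-digit year
pattern = r"\d{1,2} [A-Za-z]{3} \d{4}"

# Each cell becomes a list of the date strings found in it
lists = df.apply(lambda col: col.str.findall(pattern))

# One row per list element; both columns are exploded in lockstep
out = lists.explode(["Date_1", "Date_2"]).reset_index(drop=True)

# Convert both columns to datetime
out = out.apply(pd.to_datetime, format="%d %b %Y")
print(out)
```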
I'm having a problem presenting data in the required way. My dataframe is formatted and then sorted by 'Site ID'. I need to present the data by Site ID with all date instances grouped alongside.
I'm 90% there in terms of how I want it to look using pivot_table
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
however the date column is not sorted.
(The tiny example output appears sorted however the ****Thu Jan 11 2018 10:43:20 entry**** illustrates my issue on large data sets)
I cannot figure out how to present it like below but also with the dates sorted per site ID
Any help is gratefully accepted
df = pd.DataFrame.from_dict([{'Site Ref': '1234567', 'Site Name': 'Building A', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Tue Jan 09 2018 10:43:20', 'Duration': 70}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Wed Jan 10 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1212345', 'Site Name':'Building C', 'Date': 'Fri Jan 12 2018 10:43:20', 'Duration': 100}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Thu Jan 11 2018 10:43:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Fri Jan 12 2018 12:22:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Mon Jan 15 2018 11:43:20', 'Duration': 90}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Wed Jan 17 2018 10:43:20', 'Duration': 220}])
df = pd.DataFrame(df, columns=['Site Ref', 'Site Name', 'Date', 'Duration'])
df = df.sort_values(by=['Site Ref'])
df
Site Ref Site Name Date Duration
5 1123456 Building D Thu Jan 11 2018 10:43:20 80
6 1123456 Building D Fri Jan 12 2018 12:22:20 80
7 1123456 Building D Mon Jan 15 2018 11:43:20 90
8 1123456 Building D Wed Jan 17 2018 10:43:20 220
4 1212345 Building C Fri Jan 12 2018 10:43:20 100
0 1234567 Building A Mon Jan 08 2018 10:43:20 120
1 1245678 Building B Mon Jan 08 2018 10:43:20 120
2 1245678 Building B Tue Jan 09 2018 10:43:20 70
3 1245678 Building B Wed Jan 10 2018 10:43:20 120
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
df_pivot
Site Ref Site Name  Date
1123456  Building D Fri Jan 12 2018 12:22:20       80
                    Mon Jan 15 2018 11:43:20       90
                    ****Thu Jan 11 2018 10:43:20   80****
                    Wed Jan 17 2018 10:43:20      220
1212345  Building C Fri Jan 12 2018 10:43:20      100
1234567  Building A Mon Jan 08 2018 10:43:20      120
1245678  Building B Mon Jan 08 2018 10:43:20      120
                    Tue Jan 09 2018 10:43:20       70
                    Wed Jan 10 2018 10:43:20      120
It's sorted lexicographically, because Date has object (string) dtype
Workaround - add a new column of datetime dtype, use it before Date in the pivot_table and drop it afterwards:
In [74]: (df.assign(x=pd.to_datetime(df['Date']))
            .pivot_table(index=['Site Ref','Site Name', 'x', 'Date'])
            .reset_index(level='x', drop=True))
Out[74]:
                                             Duration
Site Ref Site Name  Date
1123456  Building D Thu Jan 11 2018 10:43:20       80
                    Fri Jan 12 2018 12:22:20       80
                    Mon Jan 15 2018 11:43:20       90
                    Wed Jan 17 2018 10:43:20      220
1212345  Building C Fri Jan 12 2018 10:43:20      100
1234567  Building A Mon Jan 08 2018 10:43:20      120
1245678  Building B Mon Jan 08 2018 10:43:20      120
                    Tue Jan 09 2018 10:43:20       70
                    Wed Jan 10 2018 10:43:20      120
You need to convert your dates to datetime values rather than strings. Something like the following would work on your current pivot table:
df_pivot.reset_index(inplace=True)
df_pivot['Date'] = pd.to_datetime(df_pivot['Date'])
df_pivot.sort_values(by=['Site Ref', 'Date'], inplace=True)
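As a rough end-to-end sketch of the same idea, the Date column can also be converted before pivoting, so that pivot_table sorts it chronologically by itself. The rows below are a subset of the question's data, and the explicit format string is an assumption based on how those sample dates look (it also assumes an English locale for the day and month names).

```python
import pandas as pd

# A subset of the rows from the question
rows = [
    {"Site Ref": "1245678", "Site Name": "Building B", "Date": "Mon Jan 08 2018 10:43:20", "Duration": 120},
    {"Site Ref": "1245678", "Site Name": "Building B", "Date": "Tue Jan 09 2018 10:43:20", "Duration": 70},
    {"Site Ref": "1123456", "Site Name": "Building D", "Date": "Thu Jan 11 2018 10:43:20", "Duration": 80},
    {"Site Ref": "1123456", "Site Name": "Building D", "Date": "Fri Jan 12 2018 12:22:20", "Duration": 80},
    {"Site Ref": "1123456", "Site Name": "Building D", "Date": "Mon Jan 15 2018 11:43:20", "Duration": 90},
]
df = pd.DataFrame(rows)

# Parse the strings into real datetimes (format assumed from the sample data)
df["Date"] = pd.to_datetime(df["Date"], format="%a %b %d %Y %H:%M:%S")

# With a datetime level, the pivot index is sorted chronologically per site
df_pivot = df.pivot_table(index=["Site Ref", "Site Name", "Date"], values="Duration")
print(df_pivot)
```

The dates are then displayed in ISO form (2018-01-11 10:43:20) rather than the original text format.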
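Sort the values by Site Ref, then take the groupby mean with sort=False so the groups keep their order of appearance. Note that this does not sort the dates themselves; it relies on the rows of each site already being in chronological order: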
df.sort_values('Site Ref').groupby(['Site Ref','Site Name','Date'],sort=False).mean()
                                             Duration
Site Ref Site Name  Date
1123456  Building D Thu Jan 11 2018 10:43:20       80
                    Fri Jan 12 2018 12:22:20       80
                    Mon Jan 15 2018 11:43:20       90
                    Wed Jan 17 2018 10:43:20      220
1212345  Building C Fri Jan 12 2018 10:43:20      100
1234567  Building A Mon Jan 08 2018 10:43:20      120
1245678  Building B Mon Jan 08 2018 10:43:20      120
                    Tue Jan 09 2018 10:43:20       70
                    Wed Jan 10 2018 10:43:20      120
I have this dataframe:
date value
1 Thu 17th Nov 2016 385.943800
2 Fri 18th Nov 2016 1074.160340
3 Sat 19th Nov 2016 2980.857860
4 Sun 20th Nov 2016 1919.723960
5 Mon 21st Nov 2016 884.279340
6 Tue 22nd Nov 2016 869.071070
7 Wed 23rd Nov 2016 760.289260
8 Thu 24th Nov 2016 2481.689270
9 Fri 25th Nov 2016 2745.990070
10 Sat 26th Nov 2016 2273.413250
11 Sun 27th Nov 2016 2630.414900
12 Mon 28th Nov 2016 817.322310
13 Tue 29th Nov 2016 1766.876030
14 Wed 30th Nov 2016 469.388420
I would like to change the format of the date column to this format YYYY-MM-DD. The dataframe consists of more than 200 rows, and every day new rows will be added, so I need to find a way to do this automatically.
This link is not helping because it sets the dates like this dates = ['30th November 2009', '31st March 2010', '30th September 2010'] and I can't do it for every row. Anyone knows a way to solve this?
Dateutil will do this job.
from dateutil import parser
print(df)
df2 = df.copy()
df2.date = df2.date.apply(lambda x: parser.parse(x))
df2
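If the goal is literally a YYYY-MM-DD string, one possible pandas-only sketch is to strip the ordinal suffix (st, nd, rd, th) and then parse with an explicit format. The rows below are a few of the question's rows, and the format string is an assumption based on how those dates look.

```python
import pandas as pd

# A few rows copied from the question
df = pd.DataFrame({
    "date": ["Thu 17th Nov 2016", "Mon 21st Nov 2016",
             "Tue 22nd Nov 2016", "Wed 23rd Nov 2016"],
    "value": [385.943800, 884.279340, 869.071070, 760.289260],
})

# Drop the ordinal suffix that directly follows the day number
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse, then render as YYYY-MM-DD strings
df["date"] = pd.to_datetime(cleaned, format="%a %d %b %Y").dt.strftime("%Y-%m-%d")
print(df)
```

Because the whole column is converted in one step, the same lines can be rerun whenever new rows are added. Leave out the final .dt.strftime call to keep real datetime values instead of strings.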
I want to write a program that compares the current date with a number of dates that I have.
My data is:
12 JUN 2016
21 MAR 1989
15 MAR 1958
15 SEP 1958
23 OCT 1930
15 SEP 1928
10 MAR 2010
23 JAN 1928
15 NOV 1925
26 AUG 2009
29 APR 1987
20 JUL 1962
10 MAY 1960
13 FEB 1955
10 MAR 1956
3 MAR 2010
14 NOV 1958
4 AUG 1985
24 AUG 1956
15 FEB 1955
19 MAY 1987
30 APR 1990
8 SEP 2014
18 JAN 2012
14 DEC 1960
1 AUG 1998
7 SEP 1963
9 MAR 2012
1 MAY 1990
14 MAY 1985
15 JUN 1945
5 APR 1995
26 FEB 1987
13 DEC 1983
15 AUG 2009
16 SEP 1980
16 JAN 2005
19 JUN 2011
Now how can I compare these to the current date to check that a date does not exceed the current date (i.e. 13/JUN/2016)?
Please help me! Thank you.
You have to create a datetime object from the string data. You can do that by parsing the date string with the strptime method.
from datetime import datetime
mydate = datetime.strptime("19 JUN 2011", "%d %b %Y")
And then use the object to compare it with today's date.
print(mydate < datetime.today())
True
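To check the whole list rather than a single value, the same strptime call can be applied to each string. This is a small sketch using a few of the dates from the question; the names raw_dates, parsed and not_in_future are illustrative only.

```python
from datetime import datetime

# A few of the dates from the question
raw_dates = ["12 JUN 2016", "21 MAR 1989", "3 MAR 2010", "19 JUN 2011"]

today = datetime.today()

# Parse every string with the same format as above
parsed = [datetime.strptime(s, "%d %b %Y") for s in raw_dates]

# True for each date that does not exceed the current date
not_in_future = [d <= today for d in parsed]

for s, ok in zip(raw_dates, not_in_future):
    print(s, "is not in the future" if ok else "is in the future")
```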