How to split pandas column into two columns with strings and ints

I'm looking to split the column Date range into two columns, a starting date and an ending date. However, str.split doesn't seem to work because it does not recognise the '-'. Any advice?
I tried using
'''
ebola1 = pd.DataFrame(ebola['Date range'].str.split('-',1).to_list(),columns = ['start date','end date'])
'''
However, it does not return the two columns I expect.
So (1) it doesn't recognize the '-', (2) how do I distinguish between 'Jun-Nov 1976' and 'Oct 2001-Mar 2002', and (3) how do I include the new columns in the existing table?
Thanks for the help!

The separator in your data is an en dash (–), not a hyphen (-), so split on the en dash with Series.str.split and expand=True to get a DataFrame:
import numpy as np
import pandas as pd
data = ['Jun–Nov 1976', 'Sep–Oct 1976', 'Jun 1977', 'Jul–Oct 1979', 'Nov 1994', 'Nov 1994–Feb 1995', 'Jan–Jul 1995', 'Jan–Mar 1996', 'Jul 1996–Jan 1997', 'Oct 2000–Feb 2001', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Oct–Dec 2003', 'Apr–Jun 2004']
ebola = pd.DataFrame(data, columns=['Date range'])
# n=1 splits at most once; pandas 2.0+ requires n to be passed as a keyword
ebola1 = ebola['Date range'].str.split('–', n=1, expand=True)
ebola1.columns = ['start date','end date']
Then use numpy.where to append the year taken from end date (via Series.str.extract) to start date, but only in rows where start date has no year yet (tested with Series.str.contains):
# True where start date already contains digits, i.e. a year
mask = ebola1['start date'].str.contains(r'\d')
# first run of digits in end date (NaN where there is no end date)
years = ebola1['end date'].str.extract(r'(\d+)', expand=False)
ebola1['start date'] = np.where(mask,
                                ebola1['start date'],
                                ebola1['start date'] + ' ' + years)
print(ebola1)
   start date  end date
0    Jun 1976  Nov 1976
1    Sep 1976  Oct 1976
2    Jun 1977      None
3    Jul 1979  Oct 1979
4    Nov 1994      None
5    Nov 1994  Feb 1995
6    Jan 1995  Jul 1995
7    Jan 1996  Mar 1996
8    Jul 1996  Jan 1997
9    Oct 2000  Feb 2001
10   Oct 2001  Mar 2002
11   Oct 2001  Mar 2002
12   Oct 2001  Mar 2002
13   Oct 2001  Mar 2002
14   Oct 2001  Mar 2002
15   Dec 2002  Apr 2003
16   Dec 2002  Apr 2003
17   Dec 2002  Apr 2003
18   Oct 2003  Dec 2003
19   Apr 2004  Jun 2004
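The question also asks how to add the new columns to the existing table, which the answer above leaves open. A minimal sketch, assuming a three-row sample of the data and assuming that a single-month entry such as 'Jun 1977' should end in the same month it starts (that fill rule is an assumption for illustration, not something the source states):

```python
import pandas as pd

# Small sample of the "Date range" values; the separator is an en dash (U+2013).
ebola = pd.DataFrame({"Date range": ["Jun\u2013Nov 1976", "Jun 1977", "Oct 2001\u2013Mar 2002"]})

parts = ebola["Date range"].str.split("\u2013", n=1, expand=True)
start, end = parts[0], parts[1]

# Borrow the year from the end date where the start date has none.
year = end.str.extract(r"(\d{4})", expand=False)
start = start.where(start.str.contains(r"\d"), start + " " + year)

# Assumption: an entry without a range ends in the month it starts.
end = end.fillna(start)

# Assign straight onto the original frame, converting to real datetimes.
ebola["start date"] = pd.to_datetime(start, format="%b %Y")
ebola["end date"] = pd.to_datetime(end, format="%b %Y")
print(ebola)
```

Storing datetimes rather than strings means the two columns can later be sorted or subtracted directly. The "%b" month names assume an English locale.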

Related

Pandas: How to draw bar graph on month over counts

I have a dataframe df as below:
Student_id Date_of_visit(d/m/y)
1 1/4/2020
1 30/12/2019
1 26/12/2019
2 3/1/2021
2 10/1/2021
3 4/5/2020
3 22/8/2020
How can I draw a bar graph with month-year on the x-axis (e.g. ticks: Dec 2019, Jan 2020, Feb 2020) and, on the y-axis, the total number of student visits (count) in each month?

Convert the values to datetimes, count the rows per month with DataFrame.resample and Resampler.size, then reformat the index with DatetimeIndex.strftime:
df['Date_of_visit'] = pd.to_datetime(df['Date_of_visit'], dayfirst=True)
# 'M' is the month-end alias; pandas 2.2+ deprecates it in favour of 'ME'
s = df.resample('M', on='Date_of_visit')['Student_id'].size()
s.index = s.index.strftime('%b %Y')
print(s)
Date_of_visit
Dec 2019 2
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 2
Name: Student_id, dtype: int64
To count only the unique Student_id values in each month, use Resampler.nunique:
s = df.resample('M', on='Date_of_visit')['Student_id'].nunique()
s.index = s.index.strftime('%b %Y')
print(s)
Date_of_visit
Dec 2019 1
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 1
Name: Student_id, dtype: int64
Finally, plot the result with Series.plot.bar:
s.plot.bar()
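For a version that runs end to end, here is a minimal sketch of the same monthly count built on monthly periods instead of resample. This is an alternative technique, not the answer's method; it assumes the column is named Date_of_visit and rebuilds the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Student_id": [1, 1, 1, 2, 2, 3, 3],
    "Date_of_visit": ["1/4/2020", "30/12/2019", "26/12/2019",
                      "3/1/2021", "10/1/2021", "4/5/2020", "22/8/2020"],
})
df["Date_of_visit"] = pd.to_datetime(df["Date_of_visit"], format="%d/%m/%Y")

# Map every visit to its calendar month and count the rows per month.
months = df["Date_of_visit"].dt.to_period("M")
counts = df.groupby(months).size()

# Add the months without visits so the x-axis has no gaps.
full_range = pd.period_range(months.min(), months.max(), freq="M")
counts = counts.reindex(full_range, fill_value=0)

# Human-readable tick labels such as "Dec 2019".
counts.index = counts.index.strftime("%b %Y")
print(counts)
# counts.plot.bar()  # uncomment when matplotlib is installed
```

The reindex step does the work that resample does implicitly: it inserts zero rows for empty months.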

How to split one row into multiple and apply datetime on dataframe column?

I have one dataframe which looks like below:
                    Date_1                   Date_2
0               5 Dec 2017               5 Dec 2017
1              14 Dec 2017              14 Dec 2017
2              15 Dec 2017              15 Dec 2017
3  18 Dec 2017 21 Dec 2017  18 Dec 2017 21 Dec 2017
4              22 Dec 2017              22 Dec 2017
Conditions to be checked:
Check whether any row contains two dates, like row 3. If so, split it into two separate rows.
Convert both columns to datetime.
I am trying to do this as follows:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But I get the error below:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017

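Extract the dates with a regex and Series.str.findall. Each cell then holds a list of dates, so the rest is an unnesting problem (unnesting is a user-defined helper whose definition is not shown here):
# findall returns, for every cell, the list of date-like substrings it contains
s=df.apply(lambda x : x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
# one row per list element, then convert every column to datetime
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)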
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
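Because the unnesting helper is not defined in the answer, here is a minimal sketch of the same idea using DataFrame.explode, which accepts a list of columns in pandas 1.3 and later. The simplified pattern below is an assumption: it only matches dates written as "D Mon YYYY", as in the sample data, and both columns must yield the same number of dates in each row:

```python
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["5 Dec 2017", "18 Dec 2017 21 Dec 2017", "22 Dec 2017"],
    "Date_2": ["5 Dec 2017", "18 Dec 2017 21 Dec 2017", "22 Dec 2017"],
})

# Simplified pattern (assumption): day, three-letter month, four-digit year.
pattern = r"\d{1,2} [A-Za-z]{3} \d{4}"

# Each cell becomes a list of the dates found in it.
lists = df.apply(lambda col: col.str.findall(pattern))

# Explode both columns together: one output row per list element.
out = lists.explode(["Date_1", "Date_2"], ignore_index=True)

# Convert every column from strings to datetimes.
out = out.apply(pd.to_datetime, format="%d %b %Y")
print(out)
```

Passing ignore_index=True renumbers the rows 0..n-1, which matches the expected output in the question, rather than repeating the original index for split rows.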

pandas DataFrame pivot with sort

I'm having a problem presenting data in the required way. My dataframe is formatted and then sorted by 'Site ID'. I need to present the data by Site ID with all date instances grouped alongside.
I'm 90% there in terms of how I want it to look using pivot_table
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
however the date column is not sorted.
(The tiny example output appears sorted however the ****Thu Jan 11 2018 10:43:20 entry**** illustrates my issue on large data sets)
I cannot figure out how to present it like below but also with the dates sorted per site ID
Any help is gratefully accepted
df = pd.DataFrame.from_dict([{'Site Ref': '1234567', 'Site Name': 'Building A', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Tue Jan 09 2018 10:43:20', 'Duration': 70}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Wed Jan 10 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1212345', 'Site Name':'Building C', 'Date': 'Fri Jan 12 2018 10:43:20', 'Duration': 100}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Thu Jan 11 2018 10:43:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Fri Jan 12 2018 12:22:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Mon Jan 15 2018 11:43:20', 'Duration': 90}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Wed Jan 17 2018 10:43:20', 'Duration': 220}])
df = pd.DataFrame(df, columns=['Site Ref', 'Site Name', 'Date', 'Duration'])
df = df.sort_values(by=['Site Ref'])
df
Site Ref Site Name Date Duration
5 1123456 Building D Thu Jan 11 2018 10:43:20 80
6 1123456 Building D Fri Jan 12 2018 12:22:20 80
7 1123456 Building D Mon Jan 15 2018 11:43:20 90
8 1123456 Building D Wed Jan 17 2018 10:43:20 220
4 1212345 Building C Fri Jan 12 2018 10:43:20 100
0 1234567 Building A Mon Jan 08 2018 10:43:20 120
1 1245678 Building B Mon Jan 08 2018 10:43:20 120
2 1245678 Building B Tue Jan 09 2018 10:43:20 70
3 1245678 Building B Wed Jan 10 2018 10:43:20 120
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
df_pivot
Site Ref Site Name  Date
1123456  Building D Fri Jan 12 2018 12:22:20       80
                    Mon Jan 15 2018 11:43:20       90
                    ****Thu Jan 11 2018 10:43:20   80****
                    Wed Jan 17 2018 10:43:20      220
1212345  Building C Fri Jan 12 2018 10:43:20      100
1234567  Building A Mon Jan 08 2018 10:43:20      120
1245678  Building B Mon Jan 08 2018 10:43:20      120
                    Tue Jan 09 2018 10:43:20       70
                    Wed Jan 10 2018 10:43:20      120

It's sorted lexicographically because Date has object (string) dtype.
Workaround: add a temporary column of datetime dtype, place it before Date in the pivot_table index, and drop that index level afterwards:
In [74]: (df.assign(x=pd.to_datetime(df['Date']))
            .pivot_table(index=['Site Ref','Site Name', 'x', 'Date'])
            .reset_index(level='x', drop=True))
Out[74]:
                                              Duration
Site Ref Site Name  Date
1123456  Building D Thu Jan 11 2018 10:43:20        80
                    Fri Jan 12 2018 12:22:20        80
                    Mon Jan 15 2018 11:43:20        90
                    Wed Jan 17 2018 10:43:20       220
1212345  Building C Fri Jan 12 2018 10:43:20       100
1234567  Building A Mon Jan 08 2018 10:43:20       120
1245678  Building B Mon Jan 08 2018 10:43:20       120
                    Tue Jan 09 2018 10:43:20        70
                    Wed Jan 10 2018 10:43:20       120

You need to convert your dates to datetime values rather than strings. Something like the following would work on your current pivot table:
df_pivot.reset_index(inplace=True)
df_pivot['Date'] = pd.to_datetime(df_pivot['Date'])
df_pivot.sort_values(by=['Site Ref', 'Date'], inplace=True)
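The conversion can also happen before pivoting, so that pivot_table sorts the Date level chronologically by itself. A minimal sketch, assuming a four-row subset of the question's data and an explicit format string for its date layout:

```python
import pandas as pd

df = pd.DataFrame({
    "Site Ref": ["1123456", "1123456", "1123456", "1212345"],
    "Site Name": ["Building D", "Building D", "Building D", "Building C"],
    "Date": ["Thu Jan 11 2018 10:43:20", "Fri Jan 12 2018 12:22:20",
             "Mon Jan 15 2018 11:43:20", "Fri Jan 12 2018 10:43:20"],
    "Duration": [80, 80, 90, 100],
})

# Parse the strings once; datetimes sort chronologically, strings alphabetically.
df["Date"] = pd.to_datetime(df["Date"], format="%a %b %d %Y %H:%M:%S")

# pivot_table sorts its index levels, so dates now come out in time order per site.
df_pivot = pd.pivot_table(df, index=["Site Ref", "Site Name", "Date"], values="Duration")
print(df_pivot)
```

The trade-off is that the index then displays ISO-style timestamps instead of the original text; the workaround with a temporary sort column keeps the original strings.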

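Sort the rows by Site Ref, then take the grouped mean with sort=False so that groupby does not re-sort the keys. This keeps the dates in the order in which they appear after sorting by Site Ref; it does not sort them chronologically:
df.sort_values('Site Ref').groupby(['Site Ref','Site Name','Date'],sort=False).mean()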
                                              Duration
Site Ref Site Name  Date
1123456  Building D Thu Jan 11 2018 10:43:20        80
                    Fri Jan 12 2018 12:22:20        80
                    Mon Jan 15 2018 11:43:20        90
                    Wed Jan 17 2018 10:43:20       220
1212345  Building C Fri Jan 12 2018 10:43:20       100
1234567  Building A Mon Jan 08 2018 10:43:20       120
1245678  Building B Mon Jan 08 2018 10:43:20       120
                    Tue Jan 09 2018 10:43:20        70
                    Wed Jan 10 2018 10:43:20       120

Change date format in pandas dataframe

I have this dataframe:
date value
1 Thu 17th Nov 2016 385.943800
2 Fri 18th Nov 2016 1074.160340
3 Sat 19th Nov 2016 2980.857860
4 Sun 20th Nov 2016 1919.723960
5 Mon 21st Nov 2016 884.279340
6 Tue 22nd Nov 2016 869.071070
7 Wed 23rd Nov 2016 760.289260
8 Thu 24th Nov 2016 2481.689270
9 Fri 25th Nov 2016 2745.990070
10 Sat 26th Nov 2016 2273.413250
11 Sun 27th Nov 2016 2630.414900
12 Mon 28th Nov 2016 817.322310
13 Tue 29th Nov 2016 1766.876030
14 Wed 30th Nov 2016 469.388420
I would like to change the format of the date column to this format YYYY-MM-DD. The dataframe consists of more than 200 rows, and every day new rows will be added, so I need to find a way to do this automatically.
This link is not helping because it sets the dates by hand, like this: dates = ['30th November 2009', '31st March 2010', '30th September 2010'], and I can't do that for every row. Does anyone know a way to solve this?

dateutil can parse these strings:
from dateutil import parser
print(df)
df2 = df.copy()
# parser.parse returns a datetime for each date string
df2['date'] = df2['date'].apply(parser.parse)
df2
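The question asks specifically for the YYYY-MM-DD format. A minimal pandas-only sketch, assuming every value follows the "Thu 17th Nov 2016" layout shown above: strip the ordinal suffix with a regex, parse with an explicit format, then format the result as text. The four sample rows are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["Thu 17th Nov 2016", "Mon 21st Nov 2016",
             "Tue 22nd Nov 2016", "Wed 23rd Nov 2016"],
    "value": [385.9438, 884.27934, 869.07107, 760.28926],
})

# Remove "st", "nd", "rd" or "th" only when it directly follows a digit.
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Parse with an explicit format, then render as YYYY-MM-DD strings.
parsed = pd.to_datetime(cleaned, format="%a %d %b %Y")
df["date"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```

Because this works on the whole column at once, rows added later are handled by rerunning the same three lines. If the column should stay a datetime for further calculations, keep `parsed` and skip the strftime step. The "%a" and "%b" names assume an English locale.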

comparing date using dd/mm/yy format in python

I want to write a program that compares the current date with a set of dates that I have.
My data is:
12 JUN 2016
21 MAR 1989
15 MAR 1958
15 SEP 1958
23 OCT 1930
15 SEP 1928
10 MAR 2010
23 JAN 1928
15 NOV 1925
26 AUG 2009
29 APR 1987
20 JUL 1962
10 MAY 1960
13 FEB 1955
10 MAR 1956
3 MAR 2010
14 NOV 1958
4 AUG 1985
24 AUG 1956
15 FEB 1955
19 MAY 1987
30 APR 1990
8 SEP 2014
18 JAN 2012
14 DEC 1960
1 AUG 1998
7 SEP 1963
9 MAR 2012
1 MAY 1990
14 MAY 1985
15 JUN 1945
5 APR 1995
26 FEB 1987
13 DEC 1983
15 AUG 2009
16 SEP 1980
16 JAN 2005
19 JUN 2011
How can I compare each of these with the current date to check that it does not exceed the current date (i.e. 13/JUN/2016)?
Please help me! Thank you.

You have to create a datetime object from each date string. You can do that by parsing the string with the datetime.strptime method:
from datetime import datetime
mydate = datetime.strptime("19 JUN 2011", "%d %b %Y")
Then compare that object with today's date:
print(mydate < datetime.today())
True
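To apply the same check to the whole list, a minimal sketch follows. It uses a fixed reference date (13 Jun 2016, the "current date" from the question) so the result is reproducible; swap in datetime.today() for live use. The last entry, "1 JAN 2030", is a made-up value added only to show a failing case.

```python
from datetime import datetime

# Three dates from the question plus one hypothetical future date.
raw_dates = ["12 JUN 2016", "21 MAR 1989", "3 MAR 2010", "1 JAN 2030"]

# Fixed reference date for reproducibility; use datetime.today() in practice.
reference = datetime(2016, 6, 13)

# %d accepts one- or two-digit days; %b matches month abbreviations case-insensitively.
parsed = [datetime.strptime(text, "%d %b %Y") for text in raw_dates]

# True where the date does not exceed the reference date.
not_exceeding = [d <= reference for d in parsed]

for text, ok in zip(raw_dates, not_exceeding):
    print(text, "->", "ok" if ok else "after the reference date")
```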
