Simple way to extract date from text in Pandas - python

The excerpt from data:
Givent the following example of pandas dataframe:
df =
index date
7838 2012 January
7790 2012 January
7853 2015 September
7889 2016 March
7928 2015 October
7847 1999 January
7884 2006 January
7826 1992 January
Is there a simple (and pythonic) way to convert free text into a standard date time variable? Something like:
df =
index date
7838 2012-01-01
7790 2012-01-01
7853 2015-09-01
7889 2016-03-01
7928 2015-10-01
7847 1999-01-01
7884 2006-01-01
7826 1992-01-01

Use pd.to_datetime() to convert from text to date type. You can glean the appropriate date types from this list.
df['date'] = pd.to_datetime(df['date'], format='%Y %B')

to_datetime handles this fine without any specific format specifier:
In [83]:
pd.to_datetime(df['date'])
Out[83]:
0 2012-01-01
1 2012-01-01
2 2015-09-01
3 2016-03-01
4 2015-10-01
5 1999-01-01
6 2006-01-01
7 1992-01-01
Name: date, dtype: datetime64[ns]

Related

Convert 3 columns from dataframe to date

I have dataframe like this:
I want to convert the 'start_year', 'start_month', 'start_day' columns to date
and the columns 'end_year', 'end_month', 'end_day' to another date
There is a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date string using type cast, and str.zfill:
OUTPUT:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
Here's an approach
simulate some data as your data was an image
use apply against each row to row series using datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
"start_year": np.random.choice(range(2018, 2022), 10),
"start_month": np.random.choice(range(1, 13), 10),
"start_day": np.random.choice(range(1, 28), 10),
"end_year": np.random.choice(range(2018, 2022), 10),
"end_month": np.random.choice(range(1, 13), 10),
"end_day": np.random.choice(range(1, 28), 10),
}
)
df = df.apply(
lambda r: r.append(pd.Series({f"{startend}_date": dt.datetime(*(r[f"{startend}_{part}"]
for part in ["year", "month", "day"]))
for startend in ["start", "end"]})),
axis=1)
df
start_year
start_month
start_day
end_year
end_month
end_day
start_date
end_date
0
2018
9
6
2020
1
3
2018-09-06 00:00:00
2020-01-03 00:00:00
1
2018
11
6
2020
7
2
2018-11-06 00:00:00
2020-07-02 00:00:00
2
2021
8
13
2020
11
2
2021-08-13 00:00:00
2020-11-02 00:00:00
3
2021
3
15
2021
3
6
2021-03-15 00:00:00
2021-03-06 00:00:00
4
2019
4
13
2021
11
5
2019-04-13 00:00:00
2021-11-05 00:00:00
5
2021
2
5
2018
8
17
2021-02-05 00:00:00
2018-08-17 00:00:00
6
2020
4
19
2020
9
18
2020-04-19 00:00:00
2020-09-18 00:00:00
7
2020
3
27
2020
10
20
2020-03-27 00:00:00
2020-10-20 00:00:00
8
2019
12
23
2018
5
11
2019-12-23 00:00:00
2018-05-11 00:00:00
9
2021
7
18
2018
5
10
2021-07-18 00:00:00
2018-05-10 00:00:00
An interesting feature of pandasonic to_datetime function is that instead of
a sequence of strings you can pass to it a whole DataFrame.
But in this case there is a requirement that such a DataFrame must have columns
named year, month and day. They can be also of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start from defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next part should be to remove columns with parts of both "start"
and "end" dates. Your choice.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
return x.split('_')[1]

pandas.core.indexing.IndexingError: Too many indexers

I want to extract electricity consumption for Site 2
>>> df4 = pd.read_excel(xls, 'Elec Monthly Cons')
>>> df4
Site Unnamed: 1 2014-01-01 00:00:00 2014-02-01 00:00:00 2014-03-01 00:00:00 ... 2017-08-01 00:00:00 2017-09-01 00:00:00 2017-10-01 00:00:00 2017-11-01 00:00:00 2017-12-01 00:00:00
0 Site Profile JAN 2014 FEB 2014 MAR 2014 ... AUG 2017 SEP 2017 OCT 2017 NOV 2017 DEC 2017
1 Site 1 NHH 10344 NaN NaN ... NaN NaN NaN NaN NaN
2 Site 2 HH 258351 229513 239379 ... NaN NaN NaN NaN NaN
type
type(df4)
<class 'pandas.core.frame.DataFrame'>
My goal is to take out the numerical value but I do not know how to set the index properly. What I have tried so far does not work at all.
df1 = df.loc[idx[:,1:2],:]
But
raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
It seems that I do not understand indexing. Does the series type play any role?
df.head
<bound method NDFrame.head of Site Site 2
Unnamed: 1 HH
EDIT
print (df.index)
Index([ 'Site', 'Unnamed: 1', 2014-01-01 00:00:00,
2014-02-01 00:00:00, 2014-03-01 00:00:00, 2014-04-01 00:00:00,
2014-05-01 00:00:00, 2014-06-01 00:00:00, 2014-07-01 00:00:00,
How to solve this?
In my opinion is necessary remove :, because it means select all columns, but Series have no column.
Also it seems no MultiIndex, so then need:
df1 = df.iloc[1:2]
There is problem first 2 rows are headers, so for MultiIndex DataFrame need:
df4 = pd.read_excel(xls, 'Elec Monthly Cons', header=[0,1], index_col=[0,1])
And then for select use:
idx = pd.IndexSlice
df1 = df.loc[:, idx[:,'FEB 2014':'MAR 2014']]

Python Spliting Date/Time into Month, Date, Year, Time columns

I have a data frame with a "Date/Time" column looks like this:
import pandas as pd
example = {'Date/Time' : ['4/1/2014 0:11:00', '4/1/2014 0:17:00', '4/1/2014 0:21:00', '4/1/2014 0:28:00']}
df = pd.DataFrame(example)
What I want to do is to split the column into 3 different columns (Month, Date, Year, and Time).
I have tried to use regex based on the code that I used for my other similar problem which was to split a column with "gender, email, phone number' into each.
# create gender, phone, email columns by splitting leader contact
df3[['gender','phone','email']]=df3['Contact'].str.extract('\(([A-Z])\)\s?(\d{3}-\d{3}-\d{4})?\s?(.*)', expand = False)
I basically tried to manipulate extract() part. However, it was too hard for me to figure out. I don't mind to use regex or other 'dates/time' packages.
It will be so helpful for my study if anyone can help.
I will post the most straightforward way to do this
df['date'] = pd.to_datetime(df['Date/Time'])
df['month'], df['day'], df['year'], df['time'] = (
df.date.dt.month, df.date.dt.day, df.date.dt.year, df.date.dt.time)
Date/Time date month day year time
0 4/1/2014 0:11:00 2014-04-01 00:11:00 4 1 2014 00:11:00
1 4/1/2014 0:17:00 2014-04-01 00:17:00 4 1 2014 00:17:00
2 4/1/2014 0:21:00 2014-04-01 00:21:00 4 1 2014 00:21:00
3 4/1/2014 0:28:00 2014-04-01 00:28:00 4 1 2014 00:28:00
df.assign(**{ x: getattr(pd.to_datetime(df['Date/Time']).dt, x.lower()) for x in ['Month', 'Date', 'Year', 'Time']})
Date/Time Month Date Year Time
0 4/1/2014 0:11:00 4 2014-04-01 2014 00:11:00
1 4/1/2014 0:17:00 4 2014-04-01 2014 00:17:00
2 4/1/2014 0:21:00 4 2014-04-01 2014 00:21:00
3 4/1/2014 0:28:00 4 2014-04-01 2014 00:28:00

python pandas series loc value from multi index

I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value from a date. Something like:
myseries.loc('2015-10-01') and it returns -0.053565
The index are tuples in the form (2016, 1, 2016-01-04)
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565|
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print s.xs('2015-10-01',level=2,axis=0)
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print s.xs(7,level=1,axis=0)
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64

How to rearrange a date in python

I have a column in a pandas data frame looking like:
test1.Received
Out[9]:
0 01/01/2015 17:25
1 02/01/2015 11:43
2 04/01/2015 18:21
3 07/01/2015 16:17
4 12/01/2015 20:12
5 14/01/2015 11:09
6 15/01/2015 16:05
7 16/01/2015 21:02
8 26/01/2015 03:00
9 27/01/2015 08:32
10 30/01/2015 11:52
This represents a time stamp as Day Month Year Hour Minute. I would like to rearrange the date as Year Month Day Hour Minute. So that it would look like:
test1.Received
Out[9]:
0 2015/01/01 17:25
1 2015/01/02 11:43
...
Just use pd.to_datetime:
In [33]:
import pandas as pd
pd.to_datetime(df['date'])
Out[33]:
index
0 2015-01-01 17:25:00
1 2015-02-01 11:43:00
2 2015-04-01 18:21:00
3 2015-07-01 16:17:00
4 2015-12-01 20:12:00
5 2015-01-14 11:09:00
6 2015-01-15 16:05:00
7 2015-01-16 21:02:00
8 2015-01-26 03:00:00
9 2015-01-27 08:32:00
10 2015-01-30 11:52:00
Name: date, dtype: datetime64[ns]
In your case:
pd.to_datetime(test1['Received'])
should just work
If you want to change the display format then you need to parse as a datetime and then apply `datetime.strftime:
In [35]:
import datetime as dt
pd.to_datetime(df['date']).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[35]:
index
0 01/01/15 17:25:00
1 02/01/15 11:43:00
2 04/01/15 18:21:00
3 07/01/15 16:17:00
4 12/01/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
So the above is now showing month/day/year, in your case the following should work:
pd.to_datetime(test1['Received']).apply(lambda x: dt.datetime.strftime(x, '%y/%m/%d %H:%M:%S'))
EDIT
it looks like you need to pass param dayfirst=True to to_datetime:
In [45]:
pd.to_datetime(df['date'], format('%d/%m/%y %H:%M:%S'), dayfirst=True).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[45]:
index
0 01/01/15 17:25:00
1 01/02/15 11:43:00
2 01/04/15 18:21:00
3 01/07/15 16:17:00
4 01/12/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
Pandas has this in-built, you can specify your datetime format
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html.
use infer_datetime_format
>>> import pandas as pd
>>> i = pd.date_range('20000101',periods=100)
>>> df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))
>>> pd.to_datetime(df.year*10000 + df.month*100 + df.day, format='%Y%m%d')
0 2000-01-01
1 2000-01-02
...
98 2000-04-08
99 2000-04-09
Length: 100, dtype: datetime64[ns]
you can use the datetime functions to convert from and to strings.
# converts to date
datetime.strptime(date_string, 'DD/MM/YYYY HH:MM')
and
# converts to your requested string format
datetime.strftime(date_string, "YYYY/MM/DD HH:MM:SS")

Categories