I currently have a 'Date' column serving as my index for a pandas dataframe that is of the form:
January
February
....
Year2
January
February
...
Year3
(It came from a PDF table extractor.) Is there an easy way to separate the years and months, pairing each month with its proper year, or to build a proper datetime column to serve as my index?
Right now I am thinking of applying a function that checks whether a value is numeric and, if so, copies it to another column and deletes the original row, but there should be an easier way.
All are objects, but the years are in numeric form, whereas the months are in long string form.
Thank you very much in advance.
Using ffill with to_numeric
df['Year'] = pd.to_numeric(df.MixCol, errors='coerce').ffill().astype(int)
df = df.loc[pd.to_numeric(df.MixCol, errors='coerce').isnull()]
df
Out[86]:
MixCol Year
1 January 2017
2 February 2017
4 January 2018
5 February 2018
Data input
MixCol
2017
January
February
2018
January
February
2019
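A self-contained version of the steps above, using the sample input (the column name MixCol is taken from that sample):

```python
import pandas as pd

df = pd.DataFrame({'MixCol': ['2017', 'January', 'February',
                              '2018', 'January', 'February', '2019']})

# Rows that parse as numbers are years; forward-fill them down to the month rows.
df['Year'] = pd.to_numeric(df.MixCol, errors='coerce').ffill().astype(int)

# Keep only the month rows (the ones that did NOT parse as numbers).
df = df.loc[pd.to_numeric(df.MixCol, errors='coerce').isnull()]
print(df)
```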
Related
I am cleaning up a dataframe that has date of birth and date of death as a string. There are multiple formats of dates in those columns. Some contain just year (which is all I need). These are the formats of dates:
Jan 10 2020
1913
10/8/2019
June 14th 1980
All I need is the year from each date. I have not been having any luck with pandas to_datetime since a significant portion of the rows only have year to begin with.
Is there a way for me to pull just year from the strings so that I can get each column to look like:
2020
1913
2019
1980
The simplest way is to use a parser which will accept these and other formats:
import pandas as pd
from dateutil import parser
df = pd.DataFrame({"mydates":["Jan 10 2020", "1913", "10/8/2019", "June 14th 1980"]})
df['years'] = df['mydates'].apply(parser.parse).dt.strftime('%Y')
print(df)
You can use str.extract (note that it requires a capturing group):
df['BirthDate'] = df['BirthDate'].str.extract(r'(\d{4})')
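A runnable sketch of the str.extract approach on the question's sample dates; (\d{4}) captures the first 4-digit run, and expand=False (my addition) returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'BirthDate': ['Jan 10 2020', '1913', '10/8/2019', 'June 14th 1980']})

# Capture the first run of exactly four digits in each string.
df['year'] = df['BirthDate'].str.extract(r'(\d{4})', expand=False)
print(df['year'].tolist())
```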
I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically.
So far I have set up different regexes for fully numerical dates (01/01/1989), dates with a month name (either Mar 12 1989 or March 1989 or 12 Mar 1989), and dates where only the year is given (see code below).
pat1=r'(\d{0,2}[/-]\d{0,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2=r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|June|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3=r'((?<!\d)(\d{4})(?!\d))'
finalpat=pat1 + "|"+ pat2 + "|" + pat3
df2=df1.str.extractall(finalpat).groupby(level=0).first()
I now got a dataframe with the different regex expressions above in different columns that I need to transform in usable times.
The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe.
If there were only one of the two formats (Month dd YYYY or dd Month YYYY), I could easily do this:
df3=df2.copy()
dico={"Jan":'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
df3[1] = df3[1].str.replace(r"(?<=[A-Z][a-z]{2})\w*", "")  # keep only the first 3 letters of each month name
for key, item in dico.items():  # replace each month abbreviation by its number
    df3[1] = df3[1].str.replace(key, item)
df3[1] = df3[1].str.replace(r"^(\d{1,2}/\d{4})", r'01/\g<1>')  # add 01 as the day if none given
df3[1] = pd.to_datetime(df3[1], format='%d/%m/%Y').dt.strftime('%Y%m%d')
where df3[1] is the column of interest. I use a dictionary to change Month to their number and get my dates as I want them.
The problem is that with two date formats (Mar 12 1989 and 12 Mar 1989), one of the two will be wrongly transformed.
Is there a way to discriminate between the date formats and apply different transformations accordingly?
Thanks a lot
problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989
and Mar 1989 (no day) in the same column of my dataframe.
pandas.to_datetime can cope with that; consider the following example:
import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
df['d_dt'] = pd.to_datetime(df.d_str)
print(df)
output
d_str d_dt
0 Mar 12 1989 1989-03-12
1 12 Mar 1989 1989-03-12
2 Mar 1989 1989-03-01
Now you can sort using d_dt, as it has type datetime64[ns], but keep in mind that a missing day is treated as the 1st day of the given month. Be warned, though: parsing may go wrong if your data contain dates in middle-endian format (mm/dd/yy).
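A sketch of the sorting step building on the example above; parsing element-wise with apply is my own addition, to sidestep strict single-format inference in newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'d_str': ["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})

# Parse each string individually; mixed formats are then handled one at a time.
df['d_dt'] = df['d_str'].apply(pd.to_datetime)

# Chronological sort on the datetime column.
df_sorted = df.sort_values('d_dt').reset_index(drop=True)
print(df_sorted)
```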
So I am really new to this and struggling with something, which I feel should be quite simple.
I have a Pandas Dataframe containing two columns: Fiscal Week (str) and Amount sold (int).
   Fiscal Week  Amount sold
0      2019031           24
1      2019041           47
2      2019221           34
3      2019231           46
4      2019241           35
My problem is the fiscal week column. It contains strings describing the fiscal year and week. The fiscal year for this purpose starts on October 1st and ends on September 30th. So basically, 2019031 is the Monday (the 1 at the end) of the third week of October 2019. And 2019221 would be the 2nd week of March 2020.
The issue is that I want to turn this data into timeseries later. But I can't do that with the data in string format - I need it to be in date time format.
I actually added the 1s at the end of all these strings using
df['Fiscal Week']= df['Fiscal Week'].map('{}1'.format)
so that I can then turn it into a proper date:
df['Fiscal Week'] = pd.to_datetime(df['Fiscal Week'], format="%Y%W%w")
as I couldn't figure out how to do it with just the weeks and no day defined.
This, of course, returns the following:
  Fiscal Week  Amount sold
0  2019-01-21           24
1  2019-01-28           47
2  2019-06-03           34
3  2019-06-10           46
4  2019-06-17           35
As expected, this is clearly not what I need: by the definition of the fiscal year, week 1 is not in January at all but in October.
Is there some simple solution to get the dates to what they are actually supposed to be?
Ideally I would like the final format to be e.g. 2019-03 for the first entry. So basically exactly like the string but in some kind of date format, that I can then work with later on. Alternatively, calendar weeks would also be fine.
Assuming you have a data frame with fiscal dates of the form 'YYYYWW', where YYYY = the calendar year of the start of the fiscal year and WW = the number of weeks into the year, you can convert to calendar dates as follows:
def getCalendarDate(fy_date: str):
    f_year = fy_date[0:4]
    f_week = fy_date[4:]
    fys = pd.to_datetime(f'{f_year}/10/01', format='%Y/%m/%d')  # fiscal year start
    return fys + pd.to_timedelta(int(f_week), "W")
You can then use this function to create the column of calendar dates as follows:
df['Calendar Date'] = list(getCalendarDate(x) for x in df['Fiscal Week'].to_list())
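A runnable sketch of this answer end to end; the sample 'YYYYWW' strings are my assumption, derived from the question's data with the trailing weekday digit removed:

```python
import pandas as pd

# Convert a 'YYYYWW' fiscal date (fiscal year starts Oct 1) to a calendar date.
def getCalendarDate(fy_date: str):
    f_year = fy_date[0:4]
    f_week = fy_date[4:]
    fys = pd.to_datetime(f'{f_year}/10/01', format='%Y/%m/%d')  # fiscal year start
    return fys + pd.to_timedelta(int(f_week), "W")

df = pd.DataFrame({'Fiscal Week': ['201903', '201904', '201922']})
df['Calendar Date'] = list(getCalendarDate(x) for x in df['Fiscal Week'].to_list())
print(df)
```

Note that week arithmetic simply adds WW * 7 days to October 1st; whether that matches your organization's exact fiscal-week convention (e.g. week numbering starting at 0 or 1) is worth double-checking.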
I have a dataframe (df) with a column ('date') in datetime format YYYY-MM-DD. I am trying to create a new column that returns the policy year, which always starts on April 1st; thus the policy year for January through March will always be the prior calendar year. Some dates are rather old, so setting up individual date ranges for the sample below wouldn't be ideal.
The dataframe would look like this
df['date']
2020-12-10
2021-02-10
2019-03-31
and output should look like this
2020
2020
2018
I now know how to get the year using df['date'].dt.year. However, I am having trouble getting the dataframe to convert each year to the respective policy year so that if df['date'].dt.month >= 4 then df['date'].dt.year, else df['date'].dt.year - 1
I am not quite sure how to set this up exactly. I have been trying to avoid creating multiple columns just to hold a bool for month >= 4. I've gone so far as to set up this, but I get a ValueError stating the truth value of the series is ambiguous:
def PolYear(x):
    y = x.dt.month
    if y >= 4:
        return x.dt.year
    else:
        return x.dt.year - 1

df['Pol_Year'] = PolYear(df['date'])
I wasn't sure if this was the right way to go about it, so I also tried a df.loc format for >= 4 and < 4, but got an error that the lengths of key and value are not equal. I definitely think I'm missing something super simple.
I previously had mentioned 'fiscal year', but this is incorrect.
Quang Hoand had the right idea but used the incorrect frequency in the call to to_period(freq). For your purposes you want to use the following code:
df.date.dt.to_period('Q-MAR').dt.qyear
This will give you:
0 2021
1 2021
2 2019
Name: date, dtype: int64
Q-MAR defines fiscal year end in March
These values are the correct fiscal years (fiscal years use the year in which they end, not the one in which they begin [reference]). If you want the output using the year in which they begin, it's simple:
df.date.dt.to_period('Q-MAR').dt.qyear - 1
Giving you
0 2020
1 2020
2 2018
Name: date, dtype: int64
qyear docs
This is qyear:
df.date.dt.to_period('Q').dt.qyear
Output:
0 2020
1 2021
2 2019
Name: date, dtype: int64
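The same policy year can also be computed with plain arithmetic on year and month, without to_period; a sketch using the question's sample dates:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-12-10', '2021-02-10', '2019-03-31'])})

# Policy year = calendar year, minus 1 for Jan-Mar (policy year starts Apr 1).
df['Pol_Year'] = df['date'].dt.year - (df['date'].dt.month < 4).astype(int)
print(df['Pol_Year'].tolist())
```

This yields the year in which each policy year begins, matching the asker's desired output directly.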
I'm parsing a date column that contains irregular date formats that pandas can't interpret. Dates include different languages for days, months, and years, as well as varying formats. The date entries often include timestamps as well. (Bonus: would separating them by string/regex with lambda/loops be the fastest method?) What's the best option and workflow to tackle these several tens of thousands of date entries?
The entries are unknown to pandas and dateutil.parser.
Examples include:
19.8.2017, 21:23:32
31/05/2015 19:41:56
Saturday, 18. May
11 - 15 July 2001
2019/4/28 下午6:29:28
1 JuneMay 2000
19 aprile 2008 21:16:37 GMT+02:00
Samstag, 15. Mai 2010 20:55:10
So 23 Jun 2007 23:45 CEST
28 August 1998
30 June 2001
1 Ноябрь 2008 г. 18:46:59
Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time)
May-28-11 6:56:08 PM
Sat Jun 26 2010 21:55:54 GMT+0200 (West-Europa (zomertijd))
lunedì 5 maggio 2008 9.30.33
"ValueError: ('Unknown string format:', '1 JuneMay 2000')"
I realize this may be a cumbersome and undesirable task. Luckily the dates are currently nonessential to my project so they may be left alone, but a solution would be favorable. Any and all replies are appreciated, thank you.
Line by line, many of your dates work:
>>> pd.to_datetime('19.8.2017, 21:23:32')
Timestamp('2017-08-19 21:23:32')
But there are several issues:
since your formats are irregular, pandas cannot guess whether 01-02-2019 is the first of February 2019 or the second of January 2019; I don't know if you can either,
some of your examples cannot be converted to a date at all. Saturday, 18. May: which year?
months and days appear in different languages (aprile looks Italian, Samstag is German),
some of your examples work once the parenthesized content is removed:
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200') # works
Timestamp('2011-06-18 19:46:46-0200', tz='pytz.FixedOffset(-120)')
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ') # doesn't work.
...
ValueError: ('Unknown string format:', 'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')
Certainly you cannot convert every date into a timestamp; I would create a new column with the correctly parsed dates and leave the others as NaT.
For example:
date
02-01-2019
Saturday, 18. May
will become:
date                 new_date
02-01-2019           Timestamp('2019-02-01 00:00:00')
Saturday, 18. May    NaT
For this I would first strip the parenthesized part from the initial column:
df2 = df.assign(
    date2=lambda x: x['date'].str.split('(').str[0],
    new_date=lambda x: x['date2'].apply(lambda y: pd.to_datetime(y, errors='coerce')))  # parse row by row, NaT on failure
# This will work with python >= 3.6
Afterwards, you can see what remains by looking at the rows left as NaT.
For translation, you can try replacing words, but that will be really tedious.
This is really slow (due to the row-by-row apply), but if your data are not consistent you cannot work directly on the column.
I hope it will help.
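A runnable sketch of this approach; the unparseable row '1 JuneMay 2000' is borrowed from the question, and parsing element-wise via apply is my assumption, to keep behavior consistent across pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'date': ['02-01-2019',
                            '1 JuneMay 2000',
                            'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time)']})

# Drop any trailing "(...)" content, then strip surrounding whitespace.
df['date2'] = df['date'].str.split('(').str[0].str.strip()

# Parse row by row; anything unparseable becomes NaT instead of raising.
df['new_date'] = df['date2'].apply(lambda y: pd.to_datetime(y, errors='coerce'))
print(df[['date', 'new_date']])
```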