parse multiple date format pandas - python

I've got stuck with the following formats:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in the format %Y-%m.
What I tried was:
def parse(date):
    if len(date) <= 5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date'] = parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; a format like 2001-12-25 came back as None.
So, how can I do it? Thank you!

We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
First we parse the regular datetimes into a new series called s:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to your dates which are missing a day.
We then reapply pd.to_datetime with the other format and use it to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we re-assign the result to your dataframe.
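If the column mixes more than two formats, the same fillna pattern generalizes to a loop over formats. A minimal sketch (the format list here is just the two from the question; most specific format first):
import pandas as pd

df = pd.DataFrame({'date': ['2001-12-25', '2002-9-27', '2001-2-24',
                            '2001-5-3', '200510', '20078']})

formats = ['%Y-%m-%d', '%Y%m']
s = pd.Series(pd.NaT, index=df.index, dtype='datetime64[ns]')
for fmt in formats:
    # each pass only fills rows that are still NaT
    s = s.fillna(pd.to_datetime(df['date'], format=fmt, errors='coerce'))
df['date_fixed'] = s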

You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard("\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only year and month were supplied.
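If you actually need the %Y-%m form the question asks for rather than a full date, you can format the parsed column afterwards. A small sketch, assuming df['Dates'] is the datetime64 column produced above (year_month is just an illustrative name):
df['year_month'] = df['Dates'].dt.strftime('%Y-%m')   # plain strings like '2001-12'
# or, to keep something date-like that still sorts and groups correctly:
df['year_month'] = df['Dates'].dt.to_period('M')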

Related

Pandas Converting date string (only month and year) to datetime

I am trying to convert a date string column to datetime. In the original dataframe the data type is string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:
DATE
0 081972
1 051967
2 101964
3 041975
4 071976
I tried:
df['DATE'].apply(pd.to_datetime(format='%m%Y'))
and
pd.to_datetime(df['DATE'],format='%m%Y')
I got a runtime error both times.
Then
df['DATE'].apply(pd.to_datetime)
It worked for the other columns (not shown, in DDMMYYYY format), but generated future dates for df['DATE'] because it reads the dates as MMDDYY instead of MMYYYY.
DATE
0 1972-08-19
1 2067-05-19
2 2064-10-19
3 1975-04-19
4 1976-07-19
Expected output:
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
If this question is a duplicate, please direct me to the original one; I wasn't able to find any suitable answer.
Thank you all in advance for your help
First, if an error is raised, some values obviously do not match the format. You can test for them with the errors='coerce' parameter and Series.isna, because the values that do not match come back as missing values:
print (df)
DATE
0 81972
1 51967
2 101964
3 41975
4 171976 <-changed data
print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0 1972-08-01
1 1967-05-01
2 1964-10-01
3 1975-04-01
4 NaT
Name: DATE, dtype: datetime64[ns]
print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
DATE
4 171976
Solution for the changed data, converting to datetimes and then to month periods with Series.dt.to_period:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 NaT
Solution with original data:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
I would have done:
df['date_formatted'] = pd.to_datetime(
    dict(
        year=df['DATE'].str[2:],
        month=df['DATE'].str[:2],
        day=1
    )
)
Maybe this helps. Works for your sample data.
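If the YYYY-MM layout from the expected output is wanted, the assembled datetimes can be converted with dt.to_period as in the previous answer. A sketch, assuming DATE holds zero-padded MMYYYY strings as in the sample:
import pandas as pd

df = pd.DataFrame({'DATE': ['081972', '051967', '101964', '041975', '071976']})
df['date_formatted'] = pd.to_datetime(
    dict(year=df['DATE'].str[2:], month=df['DATE'].str[:2], day=1)
).dt.to_period('M')                      # 1972-08, 1967-05, ...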

pandas to_datetime convert datetime string to 0

I have a column in a df which contains datetime strings:
inv_date
24/01/2008
15/06/2007 14:55:22
08/06/2007 18:26:12
15/08/2007 14:53:25
15/02/2008
07/03/2007
13/08/2007
I used pd.to_datetime with format %d%m%Y to convert the strings into datetime values:
pd.to_datetime(df.inv_date, errors='coerce', format='%d%m%Y')
I got
inv_date
24/01/2008
0
0
0
15/02/2008
07/03/2007
13/08/2007
The format is inferred from inv_date as the most common datetime format. I am wondering how to avoid converting 15/06/2007 14:55:22, 08/06/2007 18:26:12 and 15/08/2007 14:53:25 to 0s, and instead get 15/06/2007, 08/06/2007 and 15/08/2007.
Use the regular pd.to_datetime call, then use .dt.date:
>>> pd.to_datetime(df.inv_date).dt.date
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
Or, as @ChrisA mentioned, you can also use the following; the inferred format is already fine here, so I skipped passing one explicitly:
>>> pd.to_datetime(df.inv_date.str[:10], errors='coerce')
0 2008-01-24
1 2007-06-15
2 2007-08-06
3 2007-08-15
4 2008-02-15
5 2007-07-03
6 2007-08-13
Name: inv_date, dtype: object
You can also try this:
df = pd.read_csv('myfile.csv', parse_dates=['inv_date'], dayfirst=True)
df['inv_date'].dt.strftime('%d/%m/%Y')
0 24/01/2008
1 15/06/2007
2 08/06/2007
3 15/08/2007
4 15/02/2008
5 07/03/2007
6 13/08/2007
Hope this will help too.
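One caveat with the first two answers: without dayfirst=True, pd.to_datetime guesses month-first where it can, which is why 08/06/2007 shows up above as 2007-08-06 rather than 2007-06-08. If the column is already in the DataFrame rather than being read with read_csv, a sketch combining the slicing idea with day-first parsing:
import pandas as pd

df = pd.DataFrame({'inv_date': ['24/01/2008', '15/06/2007 14:55:22',
                                '08/06/2007 18:26:12', '15/08/2007 14:53:25',
                                '15/02/2008', '07/03/2007', '13/08/2007']})
# keep only the date part, then parse it day-first
df['inv_date'] = pd.to_datetime(df['inv_date'].str[:10], dayfirst=True, errors='coerce')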

Error converting data type float to datetime format

I would like to convert the data type float below to datetime format:
df
Date
0 NaN
1 NaN
2 201708.0
4 201709.0
5 201700.0
6 201600.0
Name: Cred_Act_LstPostDt_U324123, dtype: float64
pd.to_datetime(df['Date'],format='%Y%m.0')
ValueError: time data 201700.0 does not match format '%Y%m.0' (match)
How could I transform these rows without month information to use yyyy01 as the default?
You can use pd.Series.str.replace to clean up your month data:
s = [x.replace('00.0', '01.0') for x in df['Date'].astype(str)]
df['Date'] = pd.to_datetime(s, format='%Y%m.0', errors='coerce')
print(df)
Date
0 NaT
1 NaT
2 2017-08-01
4 2017-09-01
5 2017-01-01
6 2016-01-01
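The same cleanup can stay vectorised with Series.str.replace, which is the method the first sentence of this answer mentions. A sketch on the sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': [np.nan, np.nan, 201708.0, 201709.0, 201700.0, 201600.0]})
s = df['Date'].astype(str).str.replace('00.0', '01.0', regex=False)   # NaN becomes 'nan'
df['Date'] = pd.to_datetime(s, format='%Y%m.0', errors='coerce')      # 'nan' coerces to NaT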
Create a string from the float using .astype(str), slice it at the fourth character and insert a hyphen with str.cat. Then you can use format='%Y-%m'.
However, this may still fail if the data contains an invalid month number such as 00.
string = df['Date'].astype(str)                       # e.g. "201708.0"
s = string.str[:4].str.cat(string.str[4:6], sep='-')  # e.g. "2017-08"
pd.to_datetime(s, format='%Y-%m')

Frequency of events in a week

My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if an event happens on a Monday and the next event happens on a Thursday for the same user, that makes two events in the same week; however, if one event happens on a Saturday and the next event happens on the following Monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
                           "2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
        "idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
                   "72986744521"],
        "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"],as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
This is how I got what I was looking for:
Step 1 from suggested answers was skipped because timestamps were already in pandas datetime form.
Step 2: create column for week of year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count values with size()
df.groupby(['idpass','type','week']).size()
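To get the exact layout from the question (idpass, week, type, frequency), the same call can be finished with reset_index. A small sketch (the column name frequency is taken from the expected output):
out = df.groupby(['idpass', 'week', 'type']).size().reset_index(name='frequency')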
My suggestion would be to do this:
Make sure your timestamp is a pandas datetime and add a frequency column:
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count:
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert the time to a week of the year:
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
The final result then has one row per idpass, type and week, with the counts in the freq column.
EDIT:
You are right: in your case we should do step 3 before step 2. If you want to do that, remember that the groupby changes, so step 2 finally becomes:
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
and step 3:
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is not a time anymore, but just an int representing a week.
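One caveat on the strftime('%U') approach: %U numbers weeks from Sunday and returns zero-padded strings. If Monday-based ISO week numbers are wanted instead, a minimal sketch assuming pandas 1.1 or newer:
df['week'] = df['HoraDTRetirada'].dt.isocalendar().week   # Monday-based ISO week number
You can then group by idpass, type and week exactly as above.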

Manipulating data from csv using pandas

Here is a question about data handling with pandas. What I am looking to do is fetch two columns from a csv file and manipulate the data before finally saving it.
The csv file looks like :
year month
2007 1
2007 2
2007 3
2007 4
2008 1
2008 3
This is my current code:
records = pd.read_csv(path)
frame = pd.DataFrame(records)
combined = datetime(frame['year'].astype(int), frame['month'].astype(int), 1)
The error is :
TypeError: cannot convert the series to "<type 'int'>"
Any thoughts?
datetime won't operate on a pandas Series (column of a dataframe). You can use to_datetime or you could use datetime within apply. Something like the following should work:
In [9]: df
Out[9]:
year month
0 2007 1
1 2007 2
2 2007 3
3 2007 4
4 2008 1
5 2008 3
In [10]: pd.to_datetime(df['year'].astype(str) + '-'
                        + df['month'].astype(str)
                        + '-1')
Out[10]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Or use apply:
In [11]: df.apply(lambda x: datetime(x['year'],x['month'],1),axis=1)
Out[11]:
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
dtype: datetime64[ns]
Another edit: you can also do most of the date parsing with read_csv, but then you need to adjust the day after you read it in (note: my data is in a string named 'data'):
In [12]: df = pd.read_csv(StringIO(data), header=0,
                          parse_dates={'date': ['year', 'month']})
In [13]: df['date'] = df['date'].values.astype('datetime64[M]')
In [14]: df
Out[14]:
date
0 2007-01-01
1 2007-02-01
2 2007-03-01
3 2007-04-01
4 2008-01-01
5 2008-03-01
I had a similar issue. The answer below assumes that you have Year, Month and Day columns in your DataFrame:
df['Date'] = df[['Year', 'Month', 'Day']].apply(lambda s : datetime.datetime(*s),axis = 1)
The first part selects the Year, Month and Day columns from the DataFrame; the second part applies the datetime function row-wise to the data.
If you do not have the day in your data, as appears to be the case here, just do:
df['Day'] = 1
to place the day there as well. There is probably a neater way to do it in code, but this is a quick workaround. You can always drop the Day column afterwards if you don't want it.
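A related shortcut: pandas can assemble datetimes straight from year/month/day columns passed as a DataFrame, so the apply can be avoided altogether. A sketch on the sample csv data (assign(day=1) supplies the missing day):
import pandas as pd

df = pd.DataFrame({'year': [2007, 2007, 2007, 2007, 2008, 2008],
                   'month': [1, 2, 3, 4, 1, 3]})
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))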
