Good Afternoon,
I have a huge dataset where time information is stored as a float64 (or integer) in one column of the dataframe, in the format 'ddmmyyyy' (e.g. 20 January 2020 would be the float 20012020.0). I need to convert it into a datetime like 'dd-mm-yyyy'. I saw the function to_datetime, but I can't really manage to obtain what I want. Does someone know how to do it?
Massimo
You could try converting to a string and then to the date format you want, like this:
# The first step is to change the type of the column.
# To get rid of the trailing .0, convert first to int and then to string.
df["date"] = df["date"].astype(int)
df["date"] = df["date"].astype(str)
for i, row in df.iterrows():
    # Look the value up by index label so this also works without a default index
    x = df.at[i, "date"]
    # In case the date looks like 20012020 (two-digit day)
    if len(x) == 8:
        df.at[i, "date"] = "{}-{}-{}".format(x[:2], x[2:4], x[4:])
    # In case the date looks like 1012020 (one-digit day)
    else:
        df.at[i, "date"] = "0{}-{}-{}".format(x[0], x[1:3], x[3:])
Edit:
As you said, this solution only works if the month always comes in 2 digits.
Added the missing variable in the loop.
Added the column type changes before entering the loop.
Let me know if this helps!
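A vectorized alternative to the loop above, as a rough sketch: it assumes the column is called "date" as in the answer, that the month is always two digits, and that a missing leading zero on the day is the only irregularity. The sample values and the "date_text" column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the ddmmyyyy float form described in the question
df = pd.DataFrame({"date": [20012020.0, 1012020.0, 31122019.0]})

# Drop the trailing .0, then left-pad with zeros so every value has 8 digits
as_text = df["date"].astype("int64").astype(str).str.zfill(8)

# Parse the whole column at once as real datetimes
df["date"] = pd.to_datetime(as_text, format="%d%m%Y")

# Optional: a 'dd-mm-yyyy' string view of the same dates
df["date_text"] = df["date"].dt.strftime("%d-%m-%Y")
```

Keeping the datetime column and only formatting it for display usually makes later sorting and filtering easier than storing the formatted strings.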
I am working with a date column in this form:
Date
1871.01
1871.02
...
1871.10
1871.11
So to convert the column to a datetimeindex, I use:
df["Date"].apply(lambda x: datetime.strptime(str(x), "%Y.%m"))
however the column is converted to:
Date
1871-01-01
1871-02-01
...
1871-01-01
1871-11-01
Does anyone have any idea of what causes this, where all "10"s convert to "01"s? Is there a better way to do this given my inputs are floats?
If the column is stored as a float, 1871.10 and 1871.1 are exactly the same number, so converting it to a string gives the shortest form, '1871.1'. strptime then reads that as January (month 1).
So you should format the string with two forced decimal digits:
df["Date"].apply(lambda x: datetime.strptime("{:.2f}".format(x), "%Y.%m"))
Note: the float format is a poor choice for dates. The real solution is to correct it at the source (e.g. when you read the input file, tell the read function that the column is a date, not a float).
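One way to act on that note, sketched under the assumption that the data comes from a CSV file: read the column as text so the trailing zero of "1871.10" is never lost, then parse it. The in-memory file and the "Value" column are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real input file
raw = io.StringIO("Date,Value\n1871.01,1.0\n1871.10,2.0\n1871.11,3.0\n")

# dtype=str keeps "1871.10" as text instead of turning it into the float 1871.1
df = pd.read_csv(raw, dtype={"Date": str})

# Now the month is unambiguous and can be parsed directly
df["Date"] = pd.to_datetime(df["Date"], format="%Y.%m")
```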
I have 2 columns, month and day, in my dataframe, both of the object datatype. I want to sort them in ascending order (Jan, Feb, Mar), but in order to do that I need to convert them to a date format. I tried using the following code, and some more, but nothing seems to work.
ff['month'] = dt.datetime.strptime(ff['month'],format='%b')
and
ff['month'] = pd.to_datetime(ff['month'], format="%b")
Any help would be appreciated. Thank you
This works to convert Month Names to Integers:
import datetime as dt
ff['month'] = [dt.datetime.strptime(m, "%b").month for m in ff['month']]
(Basically, you're just passing strings one by one to the first function you mentioned, to make it work.)
You can then manipulate (e.g. sort) them.
Working with dataframe:
ff['month'] = ff['month'].apply(lambda x: dt.datetime.strptime(x, "%b"))
ff = ff.sort_values(by=['month'])
ff['month'] = ff['month'].apply(lambda x: x.strftime("%b"))
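Another option that avoids the round trip through datetime is an ordered categorical, sketched below. The sample rows are invented, the day column is assumed to be numeric here, and calendar.month_abbr is assumed to return English abbreviations (it depends on the locale).

```python
import calendar

import pandas as pd

# Hypothetical sample data; the real frame has 'month' and 'day' columns
ff = pd.DataFrame({"month": ["Mar", "Jan", "Feb", "Jan"], "day": [3, 15, 7, 2]})

# ['Jan', 'Feb', ..., 'Dec']; index 0 of month_abbr is an empty string
month_order = list(calendar.month_abbr)[1:]

# An ordered categorical sorts by calendar position instead of alphabetically
ff["month"] = pd.Categorical(ff["month"], categories=month_order, ordered=True)
ff = ff.sort_values(["month", "day"]).reset_index(drop=True)
```

The month values stay readable ("Jan", "Feb", ...) while still sorting in calendar order.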
I have a dataframe column in the format of 20180531.
I need to split this properly, so that I get 2018/05/31.
This is a dataframe column that I have and I need to deal with it in a datetime format.
Currently this column is identified as int64 type
I'm not sure how efficient it'll be, but you can convert it to a string and then use pd.to_datetime with format=..., e.g.:
df['actual_datetime'] = pd.to_datetime(df['your_column'].astype(str), format='%Y%m%d')
As Emma points out - the astype(str) is redundant here and just:
df['actual_datetime'] = pd.to_datetime(df['your_column'], format='%Y%m%d')
will work fine.
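For completeness, a small sketch that goes all the way to the 2018/05/31 form asked for in the question. The sample values and the "as_text" column name are made up; the astype(str) step is kept only to be explicit about the input type.

```python
import pandas as pd

# Hypothetical int64 column in yyyymmdd form
df = pd.DataFrame({"your_column": [20180531, 20190101]})

# Parse into real datetimes
df["actual_datetime"] = pd.to_datetime(df["your_column"].astype(str), format="%Y%m%d")

# Render with slashes only when a string is actually needed
df["as_text"] = df["actual_datetime"].dt.strftime("%Y/%m/%d")
```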
Assuming the integer dates would always be fixed width at 8 digits, you may try:
df['dt'] = df['dt_int'].astype(str).str.replace(r'(\d{4})(\d{2})(\d{2})', r'\1-\2-\3', regex=True)
I have the following datatable, which I would like to filter by dates greater than "2019-01-01". The problem is that the dates are strings.
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']})
This is my best attempt.
dt_dates[f.days_date > datetime.strptime(f.days_date, "2019-01-01")]
this returns the error
TypeError: strptime() argument 1 must be str, not Expr
what is the best way to filter dates in python's datatable?
Reference
python datatable
f-expressions
Your datetime syntax is incorrect, for converting a string to a datetime.
What you're looking for is:
dt_dates[f.days_date > datetime.strptime("2019-01-01", "%Y-%m-%d")]
Here the first argument of strptime is the date string and the second argument is the date format.
However, let's take a step back, because this isn't the right way to do it.
First, we should convert all the dates in your Frame to datetimes. I'll be honest, I've never used datatable, but the syntax looks extremely similar to a pandas DataFrame.
In a dataframe, we can do the following:
df_date['days_date'] = df_date['days_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
This goes through each row of the 'days_date' column and converts each string into a datetime.
From there, we can use a filter to get the relevant rows:
df_date = df_date[df_date['days_date'] > datetime.strptime("2019-01-01", "%Y-%m-%d")]
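Put together as a self-contained pandas sketch (this is pandas, not datatable; the frame is rebuilt from the question's sample values, and pd.to_datetime and pd.Timestamp are used in place of strptime as a vectorized alternative):

```python
import pandas as pd

# Same sample values as in the question, but in a pandas DataFrame
df_date = pd.DataFrame({"days_date": ["2019-01-01", "2019-01-02", "2019-01-03"]})

# Convert the string column to datetimes in one call
df_date["days_date"] = pd.to_datetime(df_date["days_date"], format="%Y-%m-%d")

# Keep only the rows strictly after the cut-off date
later = df_date[df_date["days_date"] > pd.Timestamp("2019-01-01")]
```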
datatable version 1.0.0 introduced native support for date and time data types. Note the difference between these two ways to initialize data:
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']})
dt_dates.stypes
> (stype.str32,)
and
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']}, stype="date32")
dt_dates.stypes
> (stype.date32,)
The latter frame contains a days_date column of type datatable.Type.date32, which represents a calendar date. One can then filter by date as follows:
split_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d")
dt_split_date = dt.time.ymd(split_date.year, split_date.month, split_date.day)
dt_dates[dt.f.days_date > dt_split_date, :]
Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.
Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?
Pass errors='coerce' and then drop the NaT rows with dropna:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')
df_cons = df_cons.dropna(subset=['YYYYMM'])
The duff month values will get converted to NaT values:
In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')
Out[36]: NaT
Alternatively you could filter them out before the conversion
df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13','YYYYMM'], format='%Y%m', errors='coerce')
although the filtered Series is shorter than the column, so the assignment relies on index alignment and leaves NaT in the filtered-out rows. Just passing errors='coerce' is the simpler solution.
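To make the alignment point concrete, here is a small sketch with invented values. The "parsed" column name and the sample data are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample with one month-13 "annual average" row
df_cons = pd.DataFrame({"YYYYMM": ["201601", "201613", "201602"]})

# Filter first, then convert only the remaining rows
keep = df_cons["YYYYMM"].str[-2:] != "13"
converted = pd.to_datetime(df_cons.loc[keep, "YYYYMM"], format="%Y%m")

# Assigning back aligns on the index, so the filtered-out row becomes NaT
df_cons["parsed"] = converted
```

The month-13 row is therefore still in the frame after the assignment, which is why a dropna (or filtering the frame itself) is still needed.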
Clean up the dataframe first.
df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')]
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
May I suggest turning the column into a period index, if the YYYYMM column is unique in your dataset.
First turn YYYYMM into index, then convert it to monthly period.
df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')
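A runnable sketch of the period-index idea, assuming the month-13 rows are removed and the column is parsed to datetimes first (to_period needs a DatetimeIndex). The sample values and the "value" column are made up.

```python
import pandas as pd

# Hypothetical sample data with one month-13 row
df_cons = pd.DataFrame(
    {"YYYYMM": ["201601", "201602", "201613"], "value": [1.0, 2.0, 1.5]}
)

# Remove the annual-average rows, then parse what is left
df_cons = df_cons[~df_cons["YYYYMM"].str.endswith("13")].copy()
df_cons["YYYYMM"] = pd.to_datetime(df_cons["YYYYMM"], format="%Y%m")

# Use the dates as the index and convert them to monthly periods
df_cons = df_cons.set_index("YYYYMM").to_period("M")
```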