Converting a column into pandas.datetime or time series - python

I have a dataframe that looks like this, but with multiple records:
ID Date
1 {'day': 20, 'year': 2018, 'month': 9}
I am trying to convert everything in the Date column into pandas datetime format. I tried looping through the data and converting each entry as follows, but I am getting an error saying that the formats don't match.
for index, row in df.iterrows():
    x = row['Date']
    pd.to_datetime(pd.Series(x), format="'day': %d, 'year': %y, 'month': %m",
                   dayfirst=True)
When running df.to_dict(), this is the output:
{'ID': {0: '1'}, 'Date':{0: "{'day': 20, 'year': 2018, 'month': 9}"}}

Steps

1. Convert the column of dictionaries to a list of dictionaries.
2. Convert the list of dictionaries to a DataFrame.
3. Pass the DataFrame to pd.to_datetime - this is the cool part! pd.to_datetime accepts a DataFrame if it has appropriately named columns. As it happens, your dictionaries have exactly the keys ('day', 'year', 'month') that pd.to_datetime needs, and they become the column names in step 2.
4. Assign the result back to 'Date'.
df.assign(Date=pd.to_datetime(pd.DataFrame(df.Date.tolist())))
ID Date
0 1 2018-09-20
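
For reference, here is a self-contained sketch of those steps, with the extra assumption (based on the df.to_dict() output above) that the Date values are stored as strings and therefore need ast.literal_eval first:

import ast

import pandas as pd

df = pd.DataFrame({'ID': ['1'], 'Date': ["{'day': 20, 'year': 2018, 'month': 9}"]})

parsed = df['Date'].apply(ast.literal_eval)       # step 1: strings -> dictionaries
date_parts = pd.DataFrame(parsed.tolist())        # step 2: columns day, year, month
df = df.assign(Date=pd.to_datetime(date_parts))   # steps 3-4: parse and assign

print(df)
#   ID       Date
# 0  1 2018-09-20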

I was able to solve it with this code:
df['Date'] = pd.to_datetime(df['Date'],format="{'day': %d, 'year': %Y, 'month': %m}")
The pd.Series wrapper was not needed.
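
As a quick check, here is that one-liner run against a small frame built from the sample above (assuming the Date values are stored as strings in exactly that form):

import pandas as pd

df = pd.DataFrame({'ID': ['1'], 'Date': ["{'day': 20, 'year': 2018, 'month': 9}"]})
df['Date'] = pd.to_datetime(df['Date'], format="{'day': %d, 'year': %Y, 'month': %m}")

print(df.dtypes)
# ID              object
# Date    datetime64[ns]
# dtype: object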

Related

Pandas Calculating Worked Duration and adding new column at the end of existing csv

I am pretty new to pandas. I have data for employees whose start and end dates are given as day, month and year, stored in a column of lists.
Here is the data format, as extracted from the CSV column:
data = [
    {
        "starts_at": {
            "day": 1,
            "month": 8,
            "year": 2021
        },
        "ends_at": None
    },
    {
        "starts_at": {
            "day": 1,
            "month": 9,
            "year": 2020
        },
        "ends_at": {
            "day": 30,
            "month": 4,
            "year": 2021
        }
    },
    {
        "starts_at": None,
        "ends_at": {
            "day": 30,
            "month": 4,
            "year": 2021
        }
    }
]
Basically, if ends_at is None then the user is currently working (ongoing), and if an end date is specified then the user has ended their contract with the company.
There is also some faulty data, such as start_date being None while end_date has a value; I handled those cases too, but all in a plain-Python way.
I was told to do this the pandas way, but I think I am missing pandas techniques and am instead using nested for loops.
Here is how I got my hands dirty with the pythonic way:
import pandas as pd
from datetime import datetime
from datetime import date
from dateutil import relativedelta as rdelta

today = date.today()
df = pd.read_csv('/home/navaraj/Downloads/profile-details.csv')
pd.set_option('display.max_colwidth', None)
df["experiences"] = df["experiences"].apply(eval)
print(df['experiences'])
for k in df["experiences"]:
    for x in k:
        starts = x.get('starts_at')
        if starts is not None:
            ends = x.get('ends_at')
            end_date_day = end_date_month = end_date_year = None
            status = None
            if ends is None:
                # no end date: the contract is still ongoing, measure up to today
                ends = today
                status = "On going"
            else:
                end_date_day = ends['day']
                end_date_month = ends['month']
                end_date_year = ends['year']
                ends = datetime.strptime(f'{end_date_year}-{end_date_month}-{end_date_day}', '%Y-%m-%d').date()
                status = "ended"
            starts_day = starts['day']
            starts_month = starts['month']
            starts_year = starts['year']
            started = datetime.strptime(f'{starts_year}-{starts_month}-{starts_day}', '%Y-%m-%d').date()
            rd = rdelta.relativedelta(ends, started)
            result = "{0.years} years and {0.months} months".format(rd)
            print(result, status)
Problem and expectation:
I just want that last line of data, i.e. the result and working status (Ongoing or Ended), to be attached at the end of the current CSV file for every user. Any help would be really awesome.
Assuming 'attached at the end' means at the end of each row, defining a function to use in a call to pd.apply is one pandas way to avoid explicitly looping over the df.
Below is for 'status'; the same can be done for 'result' (see the sketch after the output below).
function
def determine_status(row: pd.Series) -> str:
    if row['starts_at']:
        if row['ends_at']:
            return "ended"
        return "On going"
applying to the df with this call, setting a new column equal to the result
df.loc[:,'status'] = df.apply(determine_status, axis=1)
output
starts_at ends_at status
0 {'day': 1, 'month': 8, 'year': 2021} None On going
1 {'day': 1, 'month': 9, 'year': 2020} {'day': 30, 'month': 4, 'year': 2021} ended
2 None {'day': 30, 'month': 4, 'year': 2021} None
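
For the 'result' column, a similar apply-based sketch could compute the duration with dateutil's relativedelta, mirroring the original loop; the helper name compute_result and the inline DataFrame construction below are illustrative assumptions:

from datetime import date

import pandas as pd
from dateutil import relativedelta as rdelta

df = pd.DataFrame([
    {'starts_at': {'day': 1, 'month': 8, 'year': 2021}, 'ends_at': None},
    {'starts_at': {'day': 1, 'month': 9, 'year': 2020},
     'ends_at': {'day': 30, 'month': 4, 'year': 2021}},
    {'starts_at': None, 'ends_at': {'day': 30, 'month': 4, 'year': 2021}},
])

def compute_result(row: pd.Series) -> str:
    starts = row['starts_at']
    if not isinstance(starts, dict):
        return None            # no start date: nothing to compute
    if isinstance(row['ends_at'], dict):
        ends = row['ends_at']
        ended = date(ends['year'], ends['month'], ends['day'])
    else:
        ended = date.today()   # ongoing: measure up to today
    started = date(starts['year'], starts['month'], starts['day'])
    rd = rdelta.relativedelta(ended, started)
    return f"{rd.years} years and {rd.months} months"

df.loc[:, 'result'] = df.apply(compute_result, axis=1)
print(df[['result']])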

Pandas dataframe conversion of series of 03Mar2020 date format to 2020-03-03

I'm not able to convert input
Dates = {'dates': ['05Sep2009','13Sep2011','21Sep2010']}
to the desired output
Dates = {'dates': ['2009-09-05', '2011-09-13', '2010-09-21']}
using a Pandas DataFrame.
data = {'dates': ['05Sep2009','13Sep2011','21Sep2010']}
df = pd.DataFrame(data, columns=['dates'])
df['dates'] = pd.to_datetime(df['dates'], format='%Y%m%d')
print (df)
Output:
ValueError: time data '05Sep2009' does not match format '%Y%m%d' (match)
I'm new to this library. Help is appreciated.
Currently the months are abbreviated and are not numeric, so you can't use %m.
To convert abbreviated months and get the expected output use %b, like this:
df['dates'] = pd.to_datetime(df['dates'], format='%d%b%Y')
Update: to convert the DataFrame back to a dictionary you can use the to_dict() function, but first, to get the desired output, you need to convert the column from datetime back to string type. You can achieve it like this:
df['dates'] = df['dates'].astype(str)
df.to_dict('list')
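
Putting it together, here is a short end-to-end sketch of this answer (using dt.strftime instead of astype(str) just to make the output string format explicit):

import pandas as pd

data = {'dates': ['05Sep2009', '13Sep2011', '21Sep2010']}
df = pd.DataFrame(data, columns=['dates'])

# %d = day, %b = abbreviated month name, %Y = four-digit year
df['dates'] = pd.to_datetime(df['dates'], format='%d%b%Y')
df['dates'] = df['dates'].dt.strftime('%Y-%m-%d')

print(df.to_dict('list'))
# {'dates': ['2009-09-05', '2011-09-13', '2010-09-21']}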
You must change %m to %b, because %m expects the month as a zero-padded decimal number, while your dataframe has abbreviated month names. Try this code:
data = {'dates': ['05Sep2009','13Sep2011','21Sep2010']}
df = pd.DataFrame(data, columns=['dates'])
df['convert of dates'] = pd.to_datetime(df['dates'], format='%d%b%Y')
display(df)
You can also check this link for the other format codes:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

How do you convert the two-digit year integer to four digits in Pandas?

There are three columns in my dataset that I wanted to combine into one. I did so like this:
from datetime import date
data['DATE'] = data.apply(lambda x: date(int(x['Yr']), int(x['Mo']), int(x['Dy'])), axis=1)
And then I dropped those three columns 'Yr', 'Mo', 'Dy'.
The problem is that I'm getting something like this:
DATE
0061-01-01
0061-01-02
0061-01-03
0061-01-04
0061-01-05
, where I expected it to be something like this:
DATE
1961-01-01
1961-01-02
1961-01-03
1961-01-04
1961-01-05
So, before creating the 'DATE' column, I had to convert the two-digit 'Yr' column into four digits manually.
def yr_fx(df):
    for i in range(len(df['Yr'])):
        df['Yr'][i] = '19' + str(df['Yr'][i])
I created the above function to do the job for me, but the problem is that it takes way too long to execute, around 2-3 minutes. It also shows this warning:
C:\Users\abc\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
This is separate from the ipykernel package so we can avoid doing imports until
I want to know the efficient way of doing this.
IIUC,
df = pd.DataFrame({"Yr": 61, "Mo": 12, "Dy": 15}, index=[0])
df["Date"] = pd.to_datetime(
df["Yr"].astype(str) + "-" + df["Mo"].astype(str) + "-" + df["Dy"].astype(str)
)
df["Date"] = df["Date"] + pd.DateOffset(years=-100)
print(df)
Result:
Yr Mo Dy Date
0 61 12 15 1961-12-15
An alternative way: we can make use of the fact that pandas.to_datetime can interpret year, month and day properly if they are your column names. We'll also use assign to add 1900 to the year inline.
df = pd.DataFrame({"Yr": 61, "Mo": 12, "Dy": 15}, index=[0])
pd.to_datetime(df[['Yr', 'Mo', 'Dy']]
               .rename(columns={'Yr': 'year',
                                'Mo': 'month',
                                'Dy': 'day'})
               .assign(year=lambda x: x['year'] + 1900))
[out]
0 1961-12-15
dtype: datetime64[ns]
According to the Python datetime docs (https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior), you should also be able to use a lower-case y to indicate that the year has only two digits. Then you can reformat to a four-digit year with dt.strftime and an upper-case Y. For my data it assumed year 2000 and above, so you may have to write a lambda function if you have dates before 2000.
data['DATE_reformatted'] = pd.to_datetime(data['DATE'], format="%y-%m-%d").dt.strftime("%Y-%m-%d")
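
If some of the two-digit years should fall before 2000, one possible adjustment (the "not later than today" cutoff below is an assumption, not part of the original answer) is to push any future-parsed dates back a century:

import pandas as pd

s = pd.Series(['61-01-01', '61-01-02'])
parsed = pd.to_datetime(s, format='%y-%m-%d')   # %y maps 61 to 2061

# anything parsed later than today gets shifted back by 100 years
parsed = parsed.where(parsed <= pd.Timestamp.now(),
                      parsed - pd.DateOffset(years=100))

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['1961-01-01', '1961-01-02']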

How to enforce datetime format (%Y-%m-%d %H:%M:%S) when the time is 00:00:00 when using pandas to_datetime() function?

I have to save one record of a dataframe with a datetime column to a CSV. The datetime column is in the %Y-%m-%d %H:%M:%S format. When the record has a value like 2019-11-12 00:00:00 in the date column, the value is saved as '2019-11-12'. Is there a way to enforce the datetime format?
pandas version : 0.24.2
#eg
d = {'run_date': ['2019-11-11 02:30:00','2019-11-12 00:00:00'], 'value': [40, 45]}
df = pd.DataFrame(data=d)
display(pd.to_datetime(df['run_date'],format='%Y-%m-%d %H:%M:%S')[-1:])
You can also specify the datetime format while saving to csv.
Please try df.iloc[-1:].to_csv("test_datetime_format.csv", index=False, date_format='%Y-%m-%d %H:%M:%S') and see if that works.
I've added .iloc[-1:] to select the record you had issues with, but it should also work on the complete dataframe if you wish.
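A minimal runnable sketch of that suggestion, reusing the example data from the question and writing to sys.stdout just to show the CSV text that would be produced:

import sys

import pandas as pd

d = {'run_date': ['2019-11-11 02:30:00', '2019-11-12 00:00:00'], 'value': [40, 45]}
df = pd.DataFrame(data=d)
df['run_date'] = pd.to_datetime(df['run_date'], format='%Y-%m-%d %H:%M:%S')

# date_format stops the midnight timestamp from collapsing to a bare date
df.iloc[-1:].to_csv(sys.stdout, index=False, date_format='%Y-%m-%d %H:%M:%S')
# run_date,value
# 2019-11-12 00:00:00,45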

Selecting Data from Last Week in Python

I have a large database and I am looking to read only the last week for my python code.
My first problem is that the column with the received date and time is not in the format for datetime in pandas. My input (Column 15) looks like this:
recvd_dttm
1/1/2015 5:18:32 AM
1/1/2015 6:48:23 AM
1/1/2015 13:49:12 PM
From the Time Series / Date functionality in the pandas library I am looking at basing my code off of the "Week()" function shown in the example below:
In [87]: d
Out[87]: datetime.datetime(2008, 8, 18, 9, 0)
In [88]: d - Week()
Out[88]: Timestamp('2008-08-11 09:00:00')
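
For reference, that same offset also works directly on pandas Timestamps; a tiny sketch, assuming df already holds the recvd_dttm column shown above:

import pandas as pd

cutoff = pd.Timestamp.now() - pd.offsets.Week()
recent = df[pd.to_datetime(df['recvd_dttm']) >= cutoff]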
I have tried ordering the date this way:
df = pd.read_csv('MYDATA.csv')
orderdate = datetime.datetime.strptime(df['recvd_dttm'], '%m/%d/%Y').strftime('%Y %m %d')
however I am getting this error
TypeError: must be string, not Series
Does anyone know a simpler way to do this, or how to fix this error?
Edit: The dates are not necessarily in order. Also, sometimes there is a faulty entry in the database, like a date of 9/03/2015 (in the future) that someone mistyped. I need to be able to ignore those.
import datetime as dt
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
# get first and last datetime for final week of data
range_max = df['recvd_dttm'].max()
range_min = range_max - dt.timedelta(days=7)
# take slice with final week of data
sliced_df = df[(df['recvd_dttm'] >= range_min) &
               (df['recvd_dttm'] <= range_max)]
You may iterate over the dates to convert by using a list comprehension:
orderdate = [datetime.datetime.strptime(ttm, '%m/%d/%Y %I:%M:%S %p').strftime('%Y %m %d') for ttm in df['recvd_dttm']]
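
To also ignore the mistyped future dates mentioned in the edit, one possible sketch (the parsing format and the use of "now" as the cutoff are assumptions, not part of either answer):

import pandas as pd

# parse the full timestamp format used in the sample data;
# rows that don't match (or are mistyped) become NaT
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'],
                                  format='%m/%d/%Y %I:%M:%S %p',
                                  errors='coerce')

# drop unparseable rows and anything dated in the future
valid = df[df['recvd_dttm'].notna() & (df['recvd_dttm'] <= pd.Timestamp.now())]

# keep only the final week of the remaining data
last_week = valid[valid['recvd_dttm'] >= valid['recvd_dttm'].max() - pd.Timedelta(days=7)]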
