Python: Reading excel file but Index should be DateTime not Sequential Numbers - python

Hey, I am loading data from an Excel sheet. The sheet has 5 columns. The first column is a DateTime, and the next 4 are datasets corresponding to that time. Here is the code:
import os
import numpy as np
import pandas as pd
df = pd.read_excel (r'path\test.xlsx', sheet_name='2018')
I thought it would load it in such that the DateTime is the index, but instead it has another column called Index which is just a set of numbers going from 0 up to the end of the array. How do I have the DateTime column be the index and remove the other column?

Try this after you read the Excel file; it takes two extra lines:
# Assuming the column is named Datetime and its values look like 07/15/2020, 12:24:45.
# The format string must match the text exactly, including the comma;
# if your date-time strings look different, change the format to match.
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
# Set the parsed column as a DatetimeIndex
df = df.set_index(pd.DatetimeIndex(df['Datetime']))

There is a solution for this problem:
import pandas as pd
df = pd.read_excel (r'path\test.xlsx', sheet_name='2018')
df = df.set_index('timestamp')  # Assuming the name of your datetime column is timestamp
You can try this method for setting the Datetime column as the index.
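As a minimal sketch of the idea in these answers, the snippet below uses a small in-memory frame in place of the Excel file. The column name Datetime, the data columns a and b, and the sample values are assumptions for illustration only. It parses the strings and then promotes the column to the index, which also removes it from the regular columns. When reading a real workbook, the index_col parameter of read_excel can select the index column at load time instead.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: one datetime column plus data columns
df = pd.DataFrame({
    "Datetime": ["07/15/2020 12:24:45", "07/15/2020 12:25:45", "07/15/2020 12:26:45"],
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# Parse the strings; the format must match the text exactly
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%m/%d/%Y %H:%M:%S")

# Promote the parsed column to the index (it is dropped from the columns)
df = df.set_index("Datetime")
print(df.index)
```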

Related

convert month-day string to datetime

I have a dataframe with a column for month-day in the format of '1-29' ( no data for year). I want to convert it from a string to datetime.
I have produced a sample dataframe as follows;
import pandas as pd
df = pd.DataFrame({'id':[78,18,94,55,68,57,78,8],
'monthday':['1-29','1-28','1-27','1-19','1-28','1-19','1-29','1-28']})
I have tried
df['month_day']= [datetime.strptime(x, '%m-%d') for x in df.monthday]
but the output inserts a year and I end up with 1900-01-29.
Also, with my full dataframe I end up with the error 'ValueError: day is out of range for month'.
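One possible approach, sketched under assumptions: a month and day alone cannot form a datetime, so some year has to be supplied. strptime falls back to 1900, which is not a leap year, so a value such as '2-29' would raise 'day is out of range for month' (that the full data contains such a value is a guess). Appending a leap year such as 2000 avoids the error; the year is an arbitrary placeholder, and the '2-29' row below was added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [78, 18, 94, 55],
                   'monthday': ['1-29', '1-28', '2-29', '1-19']})  # '2-29' is illustrative

# Append an arbitrary leap year so that 2-29 is a valid date
df['month_day'] = pd.to_datetime(df['monthday'] + '-2000', format='%m-%d-%Y')

# Month and day stay available without relying on the placeholder year
df['month'] = df['month_day'].dt.month
df['day'] = df['month_day'].dt.day
print(df)
```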

Sorting datetime; pandas

I have a big Excel file with a datetime column whose values are stored as strings. The column looks like this:
ingezameldop
2022-10-10 15:51:18
2022-10-10 15:56:19
I have found two ways of trying to do this, however they do not work.
First (nice way):
import pandas as pd
from datetime import datetime
from datetime import date
dagStart = datetime.strptime(str(date.today())+' 06:00:00', '%Y-%m-%d %H:%M:%S')
dagEind = datetime.strptime(str(date.today())+' 23:00:00', '%Y-%m-%d %H:%M:%S')
data = pd.read_excel('inzamelbestand.xlsx', index_col=9)
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
data.to_excel("oefenexcel.xlsx")
However, this gives me an Excel file identical to the original one. I can't seem to fix this.
Second way (sketchy):
import pandas as pd
from datetime import datetime
from datetime import date
df = pd.read_excel('inzamelbestand.xlsx', index_col=9)
# uitfilteren dag van vandaag
dag = str(date.today())
dag1 = dag[8]+dag[9]
vgl = df['ingezameldop']
vgl2 = vgl.str[8]+vgl.str[9]
df = df.loc[vgl2 == dag1]
# uitfilteren vanaf 6 uur 's ochtends
# str11 str12 = uur
df.to_excel("oefenexcel.xlsx")
This one works for filtering on the exact day, but not for the hours. I tried the same approach (taking the 11th and 12th characters of the string), but I can't use comparison operators (>=) on those strings, so I can't filter for times after 6.
You can replace this line of code
data = data.loc[pd.to_datetime(data['ingezameldop']).dt.time.between(dagStart.time(), dagEind.time())]
with a comparison of (hour, minute) tuples, which is true only for records inside the time window:
(dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
Here ingezameldop is a single value from the column. dagStart, dagEind and the column values must all be datetime objects, so convert the column first:
data['ingezameldop'] = pd.to_datetime(data['ingezameldop'])
To apply the comparison to each element of the column, wrap it in a function:
def in_window(ingezameldop, dagStart, dagEind):
    # Compare (hour, minute) tuples for one timestamp
    return (dagStart.hour, dagStart.minute) <= (ingezameldop.hour, ingezameldop.minute) < (dagEind.hour, dagEind.minute)
then apply it to the column and keep the matching rows:
data['filter'] = data['ingezameldop'].apply(in_window, dagStart=dagStart, dagEind=dagEind)
data = data.loc[data['filter']]
apply calls the function once per element of the series, and each element must be a datetime.
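The same time-window check can also be written without apply. The sketch below is one way to do it, assuming a small hypothetical sample in place of the Excel file: it converts each timestamp to minutes since midnight, so the window test becomes a plain integer comparison on the whole column.

```python
import pandas as pd

# Hypothetical sample standing in for the Excel column
data = pd.DataFrame({"ingezameldop": [
    "2022-10-10 05:30:00",
    "2022-10-10 15:51:18",
    "2022-10-10 15:56:19",
    "2022-10-10 23:30:00",
]})

stamps = pd.to_datetime(data["ingezameldop"])

# Minutes since midnight for every row
minutes = stamps.dt.hour * 60 + stamps.dt.minute

# Keep rows from 06:00 (inclusive) up to 23:00 (exclusive)
in_window = (minutes >= 6 * 60) & (minutes < 23 * 60)

filtered = data.loc[in_window]
print(filtered)
```

Note that this filters on the time of day only; restricting to a particular date would need an additional condition on the date part.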

How to create a "duration" column from two "dates" columns?

I have two columns ("basecamp_date" and "highpoint_date") in my "expeditions" dataframe, they have a start date (basecamp_date) and an end date ("highpoint_date") and I would like to create a new column that expresses the duration between these two dates but I have no idea how to do it.
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
Convert the columns to datetimes in read_csv, then subtract them and use Series.dt.days to get whole days:
file = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
expeditions = pd.read_csv(file, parse_dates=['basecamp_date','highpoint_date'])
expeditions['diff'] = expeditions['highpoint_date'].sub(expeditions['basecamp_date']).dt.days
You can convert those columns to datetime and then subtract them to get the duration:
tstart = pd.to_datetime(expeditions['basecamp_date'])
tend = pd.to_datetime(expeditions['highpoint_date'])
expeditions['duration'] = tend - tstart  # subtracting two datetime Series gives a Series of Timedelta values
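A small self-contained sketch of the subtraction, using made-up dates rather than the real expeditions file (the values and the duration_days column name are assumptions for illustration). A missing date becomes NaT, and the resulting duration is missing as well.

```python
import pandas as pd

# Made-up rows; the third one has no basecamp date
expeditions = pd.DataFrame({
    "basecamp_date": ["2020-04-01", "2020-05-10", None],
    "highpoint_date": ["2020-04-20", "2020-05-12", "2020-06-01"],
})

tstart = pd.to_datetime(expeditions["basecamp_date"])
tend = pd.to_datetime(expeditions["highpoint_date"])

# Timedelta column, plus the same duration as a number of whole days
expeditions["duration"] = tend - tstart
expeditions["duration_days"] = expeditions["duration"].dt.days
print(expeditions)
```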

How to read in a csv with datetime format DD/mm/YYYY HH:mm:ss in pandas

I am trying to read in a csv into a pandas dataframe. One of the columns is a datetime in the format DD/mm/yyyy HH:mm:ss
eg 01/02/2019 12:04:40
How would I do this to ensure pandas is reading all the date times correctly?
Thanks
You can use the dayfirst argument of the read_csv pandas function. Supposing your csv file is called your_csv_filename.csv and your date column is called name_of_date_column:
import pandas as pd
date_column = ['name_of_date_column']
with open('your_csv_filename.csv') as file:
    df = pd.read_csv(file, parse_dates=date_column, dayfirst=True)
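If you prefer to rule out any guessing, an alternative sketch is to read the column as text and parse it with an explicit format. The CSV content and the column names when and value below are hypothetical; the first row is the ambiguous 01/02/2019 from the question, which should come out as 1 February.

```python
import io
import pandas as pd

# Hypothetical CSV content in DD/mm/YYYY HH:mm:ss format
csv_text = "when,value\n01/02/2019 12:04:40,1\n13/02/2019 08:00:00,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# An explicit format leaves no ambiguity between day and month
df["when"] = pd.to_datetime(df["when"], format="%d/%m/%Y %H:%M:%S")
print(df["when"].dt.month.tolist())
```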

Comparing dates in an Excel sheet to a certain fixed date and printing a value

I'm trying to compare dates from an Excel sheet to a fixed date such as 30 June 2019: if the date in the Excel sheet is before this date, print "Y", else print "N".
I'm very new at Pandas.
I have tried importing the file but no idea how to iterate through each row and how to compare dates to a static date
import pandas as pd
import numpy as np
from datetime import date
from pandas import ExcelWriter
df = pd.read_excel(r'Date compare.xlsx', sheet_name= 'Sheet1')
df["Date"] = pd.to_datetime(df["Date"], format="%d%m%Y")
df["End Date"] = pd.to_datetime(df["End Date"], format="%m%d%Y")
Supposing the 'Date' column in your Excel sheet is formatted as a date, you can add a new column FLAG by comparing the Date column with the Timestamp you need.
import numpy as np
import pandas as pd
df = pd.read_excel(r'Date compare.xlsx', sheet_name='Sheet1')
# "Y" for dates before 30 June 2019, otherwise "N"
df["FLAG"] = np.where(df["Date"] < pd.Timestamp("2019-06-30"), "Y", "N")
I would first make sure that the dates are recognized as Timestamp when reading the Excel file with parse_dates=True. Then you can make the comparison by converting each Timestamp to datetime.date through .date() and defining your threshold date with datetime.date(2019, 6, 30). To do so, define a function and apply it to the Date column:
import datetime
import pandas as pd
# Import data and define threshold date
df=pd.read_excel(r'Date compare.xlsx', parse_dates=True, sheet_name= 'Sheet1')
mydate = datetime.date(2019, 6, 30)
# Define function
def compare(date):
    if date.date() >= mydate:
        val = "N"
    else:
        val = "Y"
    return val
# Apply to all elements
df["check"] = df['Date'].apply(compare)
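For a quick check without an Excel file, the comparison can be sketched on a small in-memory frame. The three sample dates are made up to sit just before, on, and just after the threshold, and the FLAG column name follows the first answer.

```python
import numpy as np
import pandas as pd

# Made-up dates around the threshold
df = pd.DataFrame({"Date": pd.to_datetime(["2019-06-29", "2019-06-30", "2019-07-01"])})

threshold = pd.Timestamp("2019-06-30")

# "Y" when the date is strictly before the threshold, otherwise "N"
df["FLAG"] = np.where(df["Date"] < threshold, "Y", "N")
print(df)
```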
