Fill in the date when only the start date and continuous hours are known? (pandas, Python)

I have a dataframe built from a license log file. The log file records entries only by continuous hours, and the start date appears only in the file header. So every time the hour resets to 0, a new day should begin. How can I solve this in Python?
Here is an example of what I have.
Left is the current structure, right is the expected output:

I immediately thought of a loop solution; there might be more pythonic ways though.
import pandas as pd
from datetime import timedelta
df=pd.read_csv('date_example.csv', parse_dates=['Date'])
for idx, row in df.iloc[1:].iterrows():
    if df.loc[idx,'Hour'] == 0:
        df.loc[idx,'Date']= df.loc[idx-1,'Date']+timedelta(days=1)
    else:
        df.loc[idx,'Date'] = df.loc[idx-1, 'Date']
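One possible loop-free variant is sketched below. It assumes the hour counter only ever drops when a new day starts and that no day is missing from the log; the sample data and the names `start_date` and `day_offset` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the hour column restarts twice
df = pd.DataFrame({
    "Hour": [0, 1, 23, 0, 1, 23, 0],
    "License_Count": [5, 6, 7, 8, 9, 10, 11],
})
start_date = pd.Timestamp("2021-10-28")  # would come from the log header

# A drop in the hour value marks the first row of a new day;
# the running total of those drops is the day offset of each row
day_offset = (df["Hour"].diff() < 0).cumsum()

# Add the offset (in days) to the start date
df["Date"] = start_date + pd.to_timedelta(day_offset, unit="D")
print(df)
```

Because the offset is computed from drops in the hour value rather than from `Hour == 0`, this sketch also handles a day whose first logged hour is not 0.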

You didn't add the raw data, so I created a similar example.
This solution assumes there are no days without data.
import pandas as pd
import datetime
import numpy as np
# example data
data = [[datetime.datetime(2021,10,28), 0,5], [np.nan, 1, 6], [np.nan, 23, 7], [np.nan, 1, 8]]
# pass a flat list of names; a nested list would create MultiIndex columns
df = pd.DataFrame(data, columns = ['Date', 'Hour','License_Count'])
for i in range(1, len(df)):
    if df.iat[i,1] > df.iat[i-1,1]:
        # hour increased: still the same day
        df.loc[i,'Date'] = df.iat[i-1,0]
    else:
        # hour dropped or repeated: a new day has started
        df.loc[i,'Date'] = df.iat[i-1,0] + datetime.timedelta(days=1)

I did this by applying the function below.
import pandas as pd
from datetime import timedelta
df["Date"] = pd.to_datetime(df["Date"])
temp=df.copy()
def func(x):
    # the first row already holds the start date, so leave it unchanged
    if x.name == 0:
        return
    # 'Hours' is the hour column; adjust the name to match your data
    if x['Hours'] == 0:
        # hour reset to 0: a new day begins
        temp.loc[x.name, 'Date'] = temp.loc[x.name - 1, 'Date'] + timedelta(days=1)
    else:
        temp.loc[x.name, 'Date'] = temp.loc[x.name - 1, 'Date']
df.apply(func, axis = 1)
print(temp)
"temp" is your desired output.

I used an Excel sheet, input.xlsx, that is similar to your input. The date automatically starts at hour 0, so I didn't use the column with the hours. This assumes the log has exactly one row for every consecutive hour.
The output is then stored in output.xlsx.
import pandas as pd
from datetime import timedelta
df = pd.read_excel("input.xlsx")
date = df['Date'][0]
for index, row in df.iterrows():
    # use .loc to avoid chained assignment
    df.loc[index, 'Date'] = date
    date += timedelta(hours=1)
df.to_excel("output.xlsx")
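Under the same assumption (one row per consecutive hour, starting at hour 0 of the start date), the timestamps could also be built in one step without a loop. This is only a sketch; the frame `log` and its contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical log with 26 hourly rows, so it crosses midnight once
log = pd.DataFrame({"License_Count": range(26)})
start = pd.Timestamp("2021-10-28")  # start date taken from the log header

# Row n is n hours after the start date
log["Date"] = start + pd.to_timedelta(np.arange(len(log)), unit="h")
print(log.tail(3))
```

If the log can skip hours, this approach no longer applies and the hour column has to be used instead.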

Related

Remove the weekend days from the event log - Pandas

Could you please help me with the following problem?
I need to remove the weekend days from the dataframe (attached link: dataframe_running_example). I can get a list of all the weekend days between the min and max dates in the event log, but I cannot filter the df based on the "list_excluded" list.
from datetime import timedelta, date
import pandas as pd
#Data Loading
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
def daterange(date1, date2):
    for n in range(int ((date2 - date1).days)+1):
        yield date1 + timedelta(n)
#start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
weekdays = [6,7]
#List comprehension
list_excluded = [dt for dt in daterange(start_dt, end_dt) if dt.isoweekday() in weekdays]
df.info()
df_excluded = pd.DataFrame(list_excluded).rename({0: 'timestamp_excluded'}, axis='columns')
df_excluded["ts_excluded"] = df_excluded["timestamp_excluded"].dt.date
df[~df["timestamp_date"].isin(df_excluded["ts_excluded"])]
The issue has been resolved: I used the pd.bdate_range() function.
from datetime import timedelta, date
import pandas as pd
import numpy as np
# Load the data
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
# Timestamp range: start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
bus_days = pd.bdate_range(start_dt, end_dt)
df["timestamp_date"] = pd.to_datetime(df["timestamp_date"])
df['Is_Business_Day'] = df['timestamp_date'].isin(bus_days)
df[df["Is_Business_Day"]]
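For the simpler case where only Saturdays and Sundays need to go (no holidays), a shorter filter on the day of the week may be enough. The sketch below uses made-up events; in pandas, `dt.dayofweek` counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical events: Friday, Saturday, Sunday, Monday
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-05 10:00", "2024-01-06 11:00",
        "2024-01-07 12:00", "2024-01-08 09:00",
    ]),
    "event": ["a", "b", "c", "d"],
})

# Keep only rows whose day of week is Monday (0) through Friday (4)
weekdays_only = events[events["timestamp"].dt.dayofweek < 5]
print(weekdays_only)
```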

Sort by date with Excel file and Pandas

I am trying to sort my Excel file by the date column. When the code runs, it converts the cells from text strings to datetimes and sorts them, but only within the same month. That is, when I have dates from both October and September, the sort is done month by month.
I have been all over Google and YouTube.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(sheet1['Call_DateTime'], axis=1, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
I would like it to sort oldest to newest.
Updated code; this works.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(['Call_DateTime'], axis=0, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
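The effect of the fix can be reproduced without an Excel file. The sketch below uses invented sample values and assumes a month/day/year text format; it shows that the same column sorts differently as text than as datetimes, and that `sort_values` takes the column name rather than the column itself.

```python
import pandas as pd

# Hypothetical call log with dates stored as text
calls = pd.DataFrame({
    "Call_DateTime": ["10/03/2023", "09/28/2023", "01/05/2024"],
})

# Sorting the raw strings compares character by character
as_text = calls.sort_values("Call_DateTime")["Call_DateTime"].tolist()

# Converting first gives a true oldest-to-newest order
calls["Call_DateTime"] = pd.to_datetime(calls["Call_DateTime"], format="%m/%d/%Y")
as_dates = calls.sort_values("Call_DateTime", ascending=True)

print(as_text)
print(as_dates)
```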

Comparing dates in an Excel sheet to a certain fixed date and printing a value

I'm trying to compare dates from an Excel sheet to a fixed date such as 30 June 2019: if the date in the Excel sheet is before that date, print "Y", otherwise print "N".
I'm very new to pandas.
I have tried importing the file, but I have no idea how to iterate through each row or how to compare the dates to a static date.
import pandas as pd
import numpy as np
from datetime import date
from pandas import ExcelWriter
df = pd.read_excel(r'Date compare.xlsx', sheet_name= 'Sheet1')
df{"Date"} = pd.to_date(df["Date"],format="%d%m%Y")
pd.to_date(df["End Date"],format="%m%d%Y")
Assuming the 'Date' column in your Excel sheet is formatted as a date, you can add a new FLAG column by comparing the Date column with the Timestamp you need.
import numpy as np
import pandas as pd
df = pd.read_excel(r'Date compare.xlsx', sheet_name='Sheet1')
# "Y" when the date is before 30 June 2019, otherwise "N"
df["FLAG"] = np.where(df["Date"] < pd.Timestamp("2019-06-30"), "Y", "N")
I would first make sure that the dates are recognized as Timestamp when reading the Excel file, using parse_dates=True. Then you can compare by converting each Timestamp to datetime.date through .date() and defining your threshold date with datetime.date(2019, 6, 30). To do so, you can define a function and apply it to the Date column:
import datetime
import pandas as pd
# Import data and define threshold date
df=pd.read_excel(r'Date compare.xlsx', parse_dates=True, sheet_name= 'Sheet1')
mydate = datetime.date(2019, 6, 30)
# Define function
def compare(date):
    if date.date() >= mydate:
        val = "N"
    else:
        val = "Y"
    return val
# Apply to all elements
df["check"] = df['Date'].apply(compare)
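To check the comparison logic without the Excel file, the same idea can be tried on a small inline frame. This is a sketch with invented dates; `cutoff` is a name chosen here, and the column is assumed to already hold datetimes.

```python
import numpy as np
import pandas as pd

# Hypothetical dates: one before, one on, and one after the cutoff
sheet = pd.DataFrame({
    "Date": pd.to_datetime(["2019-06-29", "2019-06-30", "2019-07-01"]),
})
cutoff = pd.Timestamp("2019-06-30")

# Vectorized comparison: "Y" strictly before the cutoff, "N" otherwise
sheet["check"] = np.where(sheet["Date"] < cutoff, "Y", "N")
print(sheet)
```

Whether a date equal to the cutoff should count as "Y" or "N" depends on the requirement; here it is treated as "N", matching the function above.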

KeyError: 'Date'

import pandas as pd
import numpy as np
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
infy = get_history(symbol='INFY', start = start, end = end)
infy.index = pd.to_datetime(infy.index)
infy.head()
infy_volume = infy.groupby(infy['Date'].dt.year).reset_index().Volume.sum()
The error raised is KeyError: 'Date', but infy_volume should be a multi-index Series with two index levels, Year and Month.
Here the date column is the index, so use
infy.groupby(infy.index.year).Volume.sum().reset_index()
If you want to group by year and month, use
infy_volume = infy.groupby([infy.index.year, infy.index.month]).Volume.sum()
infy_volume.index = infy_volume.index.set_names(['Year', 'Month'])
print(infy_volume)
# infy_volume.reset_index()
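Since `get_history` needs network access, the grouping itself can be tried on a small stand-in frame. The sketch below uses invented volumes and a plain DatetimeIndex in place of the real price history; `prices` and `volume` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded price history
prices = pd.DataFrame(
    {"Volume": [100, 150, 200, 300]},
    index=pd.to_datetime(["2015-01-05", "2015-01-20", "2015-02-03", "2016-01-04"]),
)

# Group on the year and month taken from the index, not from a column
volume = prices.groupby([prices.index.year, prices.index.month])["Volume"].sum()

# Name the two index levels
volume.index = volume.index.set_names(["Year", "Month"])
print(volume)
```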

how to slice dates from a dataframe using standard input function?

I saw the documentation on Indexing and selecting data, which uses hard-coded values to slice a range of data from a dataframe.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('d1.csv')
df['time']=pd.to_datetime(df['time'], unit='ns')
df = df.drop(columns='name')
df['Time'] = df['time'].dt.time
df['date'] = df['time'].dt.date
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df= df.loc['2018-07-04':'2018-07-05']
But I need to select a range of data based on standard input. How can this be done?
Rather than hard-coding df = df.loc['2018-07-04':'2018-07-05'], the console should prompt "Enter the start date:" and "Enter the stop date:", and I should then get only the data within the selected date range.
I tried the following:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('d1.csv')
df['time']=pd.to_datetime(df['time'], unit='ns')
df = df.drop(columns='name')
df['Time'] = df['time'].dt.time
df['date'] = df['time'].dt.date
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
Starting_Date = input(" Please Enter the Starting_Date : ")
Ending_Date = input(" Please Enter the Ending_Date : ")
data = df[Starting_Date:Ending_Date]
But this doesn't work. Could you please take a look?
Please try this. The dates must be entered in year-month-day format, for example '2018-08-16'.
from datetime import datetime
a = input('Starting_Date: ')
b = input('Ending_Date :')
starting_date = datetime.strptime(a, "%Y-%m-%d").date()
ending_date = datetime.strptime(b, "%Y-%m-%d").date()
df.loc[starting_date:ending_date]
I hope that works for you.
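The parsing and slicing can also be wrapped in a small helper so it can be tried without typing at a prompt. This is a sketch under assumptions: `slice_by_dates` is a hypothetical helper, the sample frame is invented, and the text is assumed to be in year-month-day format. It strips stray spaces and sorts the index first, since label slicing on an unsorted DatetimeIndex can fail.

```python
import pandas as pd

def slice_by_dates(frame, start_text, end_text):
    """Return the rows of frame whose date index lies between the two dates."""
    # Strip whitespace typed around the date, then parse as year-month-day
    start = pd.to_datetime(start_text.strip(), format="%Y-%m-%d")
    end = pd.to_datetime(end_text.strip(), format="%Y-%m-%d")
    # Label slicing on a DatetimeIndex includes both end points
    return frame.sort_index().loc[start:end]

# Hypothetical data indexed by date
sample = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(["2018-07-03", "2018-07-04", "2018-07-05", "2018-07-06"]),
)
selected = slice_by_dates(sample, " 2018-07-04", "2018-07-05 ")
print(selected)
```

In an interactive script, the two text arguments would come from `input("Starting_Date: ")` and `input("Ending_Date: ")`.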
