Sort by date with Excel file and Pandas - python

I am trying to sort my Excel file by the date column. When the code runs, it converts the cells from text strings to datetimes and sorts them, but only within the same month. That is, when I have dates from both October and September, the sort is only applied within each month.
I have been all over Google and YouTube.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(sheet1['Call_DateTime'], axis=1, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
I would like it to sort oldest to newest.

Updated code; this works.
import pandas as pd
import datetime
from datetime import timedelta
x = datetime.datetime.now()
excel_workbook = 'data.xlsx'
sheet1 = pd.read_excel(excel_workbook, sheet_name='RAW DATA')
sheet1['Call_DateTime'] = pd.to_datetime(sheet1['Call_DateTime'])
sheet1.sort_values(['Call_DateTime'], axis=0, ascending=True, inplace=True)
sheet1['SegmentDuration'] = pd.to_timedelta(sheet1['SegmentDuration'], unit='s')
sheet1['SegmentDuration'] = timedelta(hours=0.222)
sheet1.style.apply('h:mm:ss', column=['SegmentDuration'])
sheet1.to_excel("S4x Output"+x.strftime("%m-%d")+".xlsx", index = False)
print("All Set!!")
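For anyone comparing the two versions: the change that matters is in sort_values, which expects column names in `by` and axis=0 to reorder rows, whereas the first attempt passed the Series itself with axis=1. Here is a minimal sketch on made-up in-memory data (the sample values, the `Agent` column and the date format are assumptions, not taken from the real workbook):

```python
import pandas as pd

# Hypothetical sample rows standing in for the 'RAW DATA' sheet.
calls = pd.DataFrame({
    "Call_DateTime": ["10/02/2019 09:15", "09/28/2019 14:00", "10/01/2019 08:30"],
    "Agent": ["A", "B", "C"],
})

# Parse the text into real datetimes; the format string is an assumption
# about how the dates are written in the sheet.
calls["Call_DateTime"] = pd.to_datetime(calls["Call_DateTime"], format="%m/%d/%Y %H:%M")

# Pass the column name (not the Series) and sort rows, oldest first.
calls = calls.sort_values("Call_DateTime", ascending=True).reset_index(drop=True)
print(calls)
```

Because the column holds datetimes rather than strings, September rows come before October rows regardless of how the text was written.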

Related

Remove the weekend days from the event log - Pandas

Could you please help me with the following problem?
I need to remove the weekend days from the dataframe (attached link: dataframe_running_example). I can get a list of all the weekend days between the min and max dates pulled from the event log; however, I cannot filter the df based on the "list_excluded" list.
from datetime import timedelta, date
import pandas as pd
#Data Loading
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)
#start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
weekdays = [6,7]
#List comprehension
list_excluded = [dt for dt in daterange(start_dt, end_dt) if dt.isoweekday() in weekdays]
df.info()
df_excluded = pd.DataFrame(list_excluded).rename({0: 'timestamp_excluded'}, axis='columns')
df_excluded["ts_excluded"] = df_excluded["timestamp_excluded"].dt.date
df[~df["timestamp_date"].isin(df_excluded["ts_excluded"])]
The issue has been resolved: I used the pd.bdate_range() function.
from datetime import timedelta, date
import pandas as pd
import numpy as np
#Data loading
df= pd.read_csv("running-example.csv", delimiter=";")
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["timestamp_date"] = df["timestamp"].dt.date
#Timestamp range: start_dt & end_dt
start_dt = df["timestamp"].min()
end_dt = df["timestamp"].max()
print("Start_dt: {} & end_dt: {}".format(start_dt, end_dt))
bus_days = pd.bdate_range(start_dt, end_dt)
df["timestamp_date"] = pd.to_datetime(df["timestamp_date"])
df['Is_Business_Day'] = df['timestamp_date'].isin(bus_days)
df[df["Is_Business_Day"]!=False]
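A possible alternative that avoids building a date range at all: filter on the day of the week directly. This is only a sketch on made-up timestamps (the sample values are assumptions; the real data comes from running-example.csv). In pandas, dt.dayofweek numbers Monday as 0 and Sunday as 6, so values below 5 are weekdays.

```python
import pandas as pd

# Hypothetical event log: a Friday, a Saturday, a Sunday and a Monday.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-06-28 10:00",
        "2019-06-29 11:00",
        "2019-06-30 12:00",
        "2019-07-01 09:00",
    ])
})

# Keep only rows whose timestamp falls on Monday (0) through Friday (4).
weekdays_only = events[events["timestamp"].dt.dayofweek < 5]
print(weekdays_only)
```

Unlike the isin() approach, this does not depend on the time-of-day part of the timestamps matching the generated range.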

Comparing dates in an Excel sheet to a certain fixed date and printing a value

I'm trying to compare dates from an Excel sheet to a fixed date such as 30 June 2019: if the date in the Excel sheet is before this date, print "Y", otherwise print "N".
I'm very new at Pandas.
I have tried importing the file, but I have no idea how to iterate through each row or how to compare the dates to a static date.
import pandas as pd
import numpy as np
from datetime import date
from pandas import ExcelWriter
df = pd.read_excel(r'Date compare.xlsx', sheet_name= 'Sheet1')
df["Date"] = pd.to_datetime(df["Date"], format="%d%m%Y")
df["End Date"] = pd.to_datetime(df["End Date"], format="%m%d%Y")
Supposing the 'Date' column in your Excel sheet is formatted as a date, you can add a new column FLAG by comparing the Date column with the Timestamp you need.
import numpy as np
import pandas as pd
df = pd.read_excel(r'Date compare.xlsx', sheet_name='Sheet1')
# "Y" when the date is before 30 June 2019, otherwise "N"
df["FLAG"] = np.where(df["Date"] < pd.Timestamp("2019-06-30"), "Y", "N")
I would first make sure that the dates are recognized as Timestamp when reading the Excel file with parse_dates=True. You can then compare them by converting each Timestamp to datetime.date through .date() and defining your threshold date as datetime.date(2019, 6, 30). To do so, define a function and apply it to the Date column:
import datetime
import pandas as pd
# Import data and define threshold date
df=pd.read_excel(r'Date compare.xlsx', parse_dates=True, sheet_name= 'Sheet1')
mydate = datetime.date(2019, 6, 30)
# Define function
def compare(date):
    if date.date() >= mydate:
        val = "N"
    else:
        val = "Y"
    return val
# Apply to all elements
df["check"] = df['Date'].apply(compare)
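If the column already holds Timestamps, the row-by-row function can also be expressed as a single vectorized comparison. A small sketch on in-memory data (the three sample dates are made up for illustration; the column names follow the question):

```python
import numpy as np
import pandas as pd

# Hypothetical dates around the cutoff, standing in for the Excel column.
df = pd.DataFrame({"Date": pd.to_datetime(["2019-06-29", "2019-06-30", "2019-07-01"])})

cutoff = pd.Timestamp("2019-06-30")

# "Y" for dates strictly before the cutoff, "N" otherwise.
df["check"] = np.where(df["Date"] < cutoff, "Y", "N")
print(df)
```

Note that a date equal to the cutoff gets "N" here, which matches the >= test in the function version.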

pandas - get a dataframe for every day

I have a DataFrame with dates in the index. I make a subset of the DataFrame for every day. Is there any way to write a function or a loop to generate these steps automatically?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize
import datetime as dt
#Get the channel feeds from ThingSpeak
response = requests.get("https://api.thingspeak.com/channels/518038/feeds.json?api_key=XXXXXX&results=500")
#Convert Json object to Python object
response_data = response.json()
channel_head = response_data["channel"]
channel_bottom = response_data["feeds"]
#Create DataFrame with Pandas
df = pd.DataFrame(channel_bottom)
#rename Parameters
df = df.rename(columns={"field1":"PM 2.5","field2":"PM 10"})
#Drop all entries with at least one NaN
df = df.dropna(how="any")
#Convert time to datetime object
df["created_at"] = df["created_at"].apply(lambda x:dt.datetime.strptime(x,"%Y-%m-%dT%H:%M:%SZ"))
#Set dates as Index
df = df.set_index(keys="created_at")
#Make a DataFrame for every day
df_2018_12_07 = df.loc['2018-12-07']
df_2018_12_06 = df.loc['2018-12-06']
df_2018_12_05 = df.loc['2018-12-05']
df_2018_12_04 = df.loc['2018-12-04']
df_2018_12_03 = df.loc['2018-12-03']
df_2018_12_02 = df.loc['2018-12-02']
Supposing you run this on the first day of the following week (that is, exporting Monday to Sunday on the next Monday), you can do it as follows:
from datetime import date, timedelta
today = date.today()
day = today - timedelta(days=7)  # so, if today is Monday, we start on the Monday before
while day < today:
    df1 = df.loc[str(day)]
    df1.to_csv('mypath' + str(day) + '.csv')  # so that the export files have different names
    day = day + timedelta(days=1)
You can use:
from datetime import date
today = str(date.today())
df = df.loc[today]
and schedule the script using any scheduler such as crontab.
You can create dictionary of DataFrames - then select by keys for DataFrame:
dfs = dict(tuple(df.groupby(df.index.strftime('%Y-%m-%d'))))
print (dfs['2018-12-07'])
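To make the dictionary approach concrete, here is a small self-contained sketch with made-up readings (the values and timestamps are assumptions; only the "PM 2.5" column name comes from the question). Each key is a date string and each value is the sub-DataFrame for that day:

```python
import pandas as pd

# Hypothetical sensor readings indexed by timestamp.
idx = pd.to_datetime(["2018-12-06 08:00", "2018-12-06 20:00", "2018-12-07 09:30"])
df = pd.DataFrame({"PM 2.5": [10.0, 12.5, 9.0]}, index=idx)

# Group rows by calendar day and collect one DataFrame per day.
daily = {day: frame for day, frame in df.groupby(df.index.strftime("%Y-%m-%d"))}

for day, frame in daily.items():
    print(day, len(frame))
```

This avoids creating one variable per day by hand, and new days show up in the dictionary automatically.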

KeyError: 'Date'

import pandas as pd
import numpy as np
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
infy = get_history(symbol='INFY', start = start, end = end)
infy.index = pd.to_datetime(infy.index)
infy.head()
infy_volume = infy.groupby(infy['Date'].dt.year).reset_index().Volume.sum()
The error shown is KeyError: 'Date', but infy_volume should be a multi-index Series with two index levels: Year and Month.
Here the date column is the index, so use:
infy.groupby(infy.index.year).Volume.sum().reset_index()
To group by both year and month, use:
infy_volume = infy.groupby([infy.index.year, infy.index.month]).Volume.sum()
infy_volume.index = infy_volume.index.rename('Month', level=1)
print(infy_volume)
# infy_volume.reset_index()
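Since nsepy needs network access, here is the same grouping on a tiny made-up frame so the result can be checked by eye (the dates and volumes are invented; set_names is used here to label both levels at once, which is a small variation on the rename call above):

```python
import pandas as pd

# Hypothetical daily volumes with a DatetimeIndex, standing in for the INFY data.
idx = pd.to_datetime(["2015-01-05", "2015-01-20", "2015-02-03", "2016-01-04"])
prices = pd.DataFrame({"Volume": [100, 150, 200, 50]}, index=idx)

# Group on the index's year and month; the result has a two-level MultiIndex.
volume = prices.groupby([prices.index.year, prices.index.month])["Volume"].sum()
volume.index = volume.index.set_names(["Year", "Month"])
print(volume)
```

A single month can then be looked up with a tuple, for example volume.loc[(2015, 1)].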

how to slice dates from a dataframe using standard input function?

I saw the documentation on Indexing and selecting data, which uses a hard-coded method to slice a range of data from a dataframe.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('d1.csv')
df['time']=pd.to_datetime(df['time'], unit='ns')
df = df.drop('name', axis=1)
df['Time'] = df['time'].dt.time
df['date'] = df['time'].dt.date
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df= df.loc['2018-07-04':'2018-07-05']
But I need to select a range of data from standard input. How can it be done?
Rather than using df = df.loc['2018-07-04':'2018-07-05'], the console should prompt "Enter the start date:" and "Enter the stop date:", and I should then get only the data in the selected date range.
I tried the following:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('d1.csv')
df['time']=pd.to_datetime(df['time'], unit='ns')
df = df.drop('name', axis=1)
df['Time'] = df['time'].dt.time
df['date'] = df['time'].dt.date
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
Starting_Date = input(" Please Enter the Starting_Date : ")
Ending_Date = input(" Please Enter the Ending_Date : ")
data = df[Starting_Date:Ending_Date]
But this doesn't work. Kindly have a look at it.
Please try this. The date is in format year-month-day for example '2018-08-16'.
from datetime import datetime
a = input('Starting_Date: ')
b = input('Ending_Date :')
starting_date = datetime.strptime(a, "%Y-%m-%d").date()
ending_date = datetime.strptime(b, "%Y-%m-%d").date()
df.loc[starting_date:ending_date]
Hope that works for you :)
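As a variation, the typed text can be parsed with pd.to_datetime so the slice bounds are Timestamps, the same type the DatetimeIndex holds. The sketch below wraps this in a helper (select_range is a hypothetical name, and the demo frame is made up); in a real script the two strings would come from input():

```python
import pandas as pd

def select_range(frame, start_text, end_text):
    """Return the rows of `frame` whose DatetimeIndex lies in [start, end]."""
    # Expecting year-month-day text such as '2018-08-16'.
    start = pd.to_datetime(start_text, format="%Y-%m-%d")
    end = pd.to_datetime(end_text, format="%Y-%m-%d")
    # Sort first so that label slicing on the index is well defined.
    return frame.sort_index().loc[start:end]

# Hypothetical data with one row per day.
idx = pd.to_datetime(["2018-07-03", "2018-07-04", "2018-07-05", "2018-07-06"])
demo = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

selected = select_range(demo, "2018-07-04", "2018-07-05")
print(selected)
```

Label slicing with .loc includes both end points, so the rows for the start and stop dates are both returned.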
