Python code to average values during certain time periods in monthly data - python

Hello everyone, I have a CSV file which contains a month's worth of data in hourly intervals. I need to get an average value of one of the columns for the time interval of 12:00am-3:00am for the entire month. I am using pandas.DataFrame to try and do this.
Sample of data I am using
DateTime current voltage
11/1/2014 12:00 1.122061402 4.058617834
11/1/2014 1:00 1.120534925 4.060912132
11/1/2014 2:00 1.119349897 4.058656072
11/1/2014 3:00 1.118277733 4.060912132
11/1/2014 4:00 1.120365636 4.060912132
11/1/2014 5:00 1.120365636 4.060912132
I'd like to average column 2 from 12am-3am every day for the entire month. I think a conditional statement on the time would be a good option, but I am unsure how to implement that conditional statement on date/time data.

I will assume that you have already imported the file into a Pandas dataframe named df.
Confirm that your "DateTime" field is being recognized by pandas as a DateTime by checking the value of df.dtypes. If not, recast e.g. with:
df['DateTime'] = pd.to_datetime(df['DateTime'])
Double-check that times like 12 AM, 1 PM, etc. are being handled properly. (You have not indicated anything to distinguish 12 AM from 12 PM etc. in your dataset.) If not, you will need to devise an appropriate method to correct them or re-export them from the original source.
Create a DatetimeIndex from your DateTime field:
df = df.set_index(pd.DatetimeIndex(df['DateTime']))
Now take Dmitry's suggestion (lightly modified):
>>> df.between_time('0:00', '3:00').resample('1D').mean()
The index of the result will show the beginning of the time interval being averaged.
Edited to take into account new info in the comments.
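To make the steps above concrete, here is a minimal sketch on made-up data (two days of hourly readings, with values chosen so the result is easy to check by hand). The column name current comes from the question; the data itself and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: 48 hourly readings starting 2014-11-01 00:00.
index = pd.Timestamp("2014-11-01") + pd.to_timedelta(range(48), unit="h")
df = pd.DataFrame({"current": [float(i) for i in range(48)]}, index=index)

# Keep only rows between midnight and 3 am (both endpoints are included by default).
night = df.between_time("0:00", "3:00")

# One average per day, plus a single average over the whole period.
daily_mean = night["current"].resample("1D").mean()
overall_mean = night["current"].mean()

print(daily_mean)
print(overall_mean)
```

Note that between_time needs a DatetimeIndex, which is why the index is set before filtering.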

Related

Distinguish between, for example, month February (02) and date 02 in pandas date column in Python

I am new to Python and I am working on a project where I work with timeseries data. I have a pandas dataframe containing the date of my dataset, a small example can be seen below (dates ranging for a whole year):
result_time: 2021-01-01 00:00:08, 2021-01-01 00:00:18, 2021-01-01 00:00:28...
I am processing this column in order to determine if the specific date is a weekday or not. When processing moves to the second day of January, i.e: 2021-02-01 12:07:17, 2021-02-01 12:07:27, 2021-02-01 12:07:37, and so on, the day part of the date (02) is considered as month February. I have tried to make it work but with no luck.
For example, I tried the following, but nothing works. Any advice would be much appreciated!
df_uci['result_time1'] = df_uci['result_time'].dt.strftime('%YYYY-%dd-%mm %HH:%mm:%ss')
df_uci['result_time1'] = pd.to_datetime(df_uci['result_time1'])
df_uci['Weekday1'] = df_uci['result_time1'].dt.day_name()
Try using the pd.to_datetime() function with the format parameter, which defines how your data should be interpreted. (In particular, it seems your day comes before your month.)
In your case this should do it:
df_uci['result_time_as_datetime'] = pd.to_datetime(df_uci['result_time'], format="%Y-%d-%m %H:%M:%S")
df_uci['Weekday1'] = df_uci['result_time_as_datetime'].dt.day_name()
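As a quick check of the difference, here is a small sketch that parses the same strings with and without an explicit format. It assumes the column holds strings in year-day-month order (if the column is already a datetime dtype, the format argument has no effect); the two sample values are taken from the question.

```python
import pandas as pd

# Assumption: raw strings in year-day-month order, as described in the question.
s = pd.Series(["2021-01-01 00:00:08", "2021-02-01 12:07:17"])

# Without a format, pandas reads these as ISO year-month-day.
default = pd.to_datetime(s)

# With an explicit format, the middle field is read as the day.
explicit = pd.to_datetime(s, format="%Y-%d-%m %H:%M:%S")

print(default.dt.month.tolist())       # months under the default reading
print(explicit.dt.month.tolist())      # months under the day-first reading
print(explicit.dt.day_name().tolist())
```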

Date difference: different results in Excel vs. Python

I have a pandas dataframe with two date columns that include timestamps (I want to keep the timestamps).
I want to get the difference in days between those two dates. I used the line below, and it works just fine.
mergethetwo['diff_days']=(mergethetwo['todaydate']-mergethetwo['LastLogon']).dt.days
My doubt is that when I computed the difference between those two dates in Excel, it gave me a different number.
In Python, for example, the difference between
5/15/2020 1:48:00 PM (LastLogon) and 6/21/2020 12:00:00 AM (todaydate) is 36.
However, in Excel, using
=DATEDIF(LastLogon,todaydate,"d")
the difference between 5/15/2020 1:48:00 PM and 6/21/2020 12:00:00 AM is 37 days!
Why the difference? Which one should I trust? As I have 30,000+ rows, I can't go through all of them to confirm.
Appreciate your support
Thank you
Excel DATEDIF with "D" seems to count "started" days (dates, as the name of the function says...); whilst the Python timedelta gives the actual delta in time - 36.425 days:
import pandas as pd
td = pd.to_datetime("6/21/2020 12:00:00 AM")-pd.to_datetime("5/15/2020 1:48:00 PM")
# Timedelta('36 days 10:12:00')
td.days
# 36
td.total_seconds() / 86400
# 36.425
You will get the same result if you do todaydate-LastLogon in Excel, without using any function.
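If what you actually want is the calendar-day count that DATEDIF reported, one possible approach (a sketch, not the only way) is to drop the time of day with .dt.normalize() before subtracting. The column names are replaced here by two stand-alone Series built from the values in the question.

```python
import pandas as pd

fmt = "%m/%d/%Y %I:%M:%S %p"
last_logon = pd.Series(pd.to_datetime(["5/15/2020 1:48:00 PM"], format=fmt))
today = pd.Series(pd.to_datetime(["6/21/2020 12:00:00 AM"], format=fmt))

# Whole days of elapsed time (what the original subtraction gives).
elapsed_days = (today - last_logon).dt.days

# Difference between the calendar dates, ignoring the time of day.
calendar_days = (today.dt.normalize() - last_logon.dt.normalize()).dt.days

print(elapsed_days.tolist(), calendar_days.tolist())
```

Which of the two to trust depends on whether you mean elapsed time or calendar dates; both are correct answers to different questions.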

Pandas read and parse Excel data that shows as a datetime, but shouldn't be a datetime

I have a system I am reading from that implemented a time tracking function in a pretty poor way - It shows the tracked working time as [hh]:mm in the cell. Now this is problematic when attempting to read this data because when you click that cell the data bar shows 11:00:00 PM, but what that 23:00 actually represents is 23 hours of time spent and not 11PM. So whenever the time is 24:00 or more you end up with 1/1/1900 12:00:00 AM and on up ( 25:00 = 1/1/1900 01:00:00 AM).
So pandas picks up the 11:00:00 PM or 1/1/1900 01:00:00 AM when it comes into the dataframe. I am at a loss as to how I would turn this back into an int and get the number of hours in whole-number format: 24, 25, 32, etc.
Can anyone help me figure out how to turn this horribly formatted data into the number of hours in int format?
If you want 1/1/1900 01:00:00 AM to represent 25 hours of elapsed time then this tells me your reference timestamp is 12/31/1899 00:00:00. Try the following:
time_delta = pd.Timestamp('1/1/1900 01:00:00 AM') - pd.Timestamp('12/31/1899 00:00:00')
# returns Timedelta('1 days 01:00:00')
You can get the total number of seconds by using the Timedelta.total_seconds() method:
time_delta.total_seconds()
# returns 90000.0
and then you could get the number of hours with
time_delta.total_seconds() / 3600.0
# returns 25.0
So try subtracting pd.Timestamp('12/31/1899 00:00:00') from your DatetimeIndex based on the year 1900 to get a TimedeltaIndex. You can then leave your TimedeltaIndex as is or convert it to a Float64Index with TimedeltaIndex.total_seconds().
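Applied to a whole column at once, the subtraction could look like the sketch below. It assumes the values have already been read as 1900-based datetimes; plain time-of-day values such as 23:00 would need separate handling. The Series name planned and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical column as pandas might read it from the spreadsheet.
planned = pd.Series(pd.to_datetime([
    "1900-01-01 00:00:00",
    "1900-01-01 01:00:00",
    "1900-01-02 08:00:00",
]))

# Reference timestamp suggested above.
reference = pd.Timestamp("1899-12-31 00:00:00")

# Elapsed time since the reference, expressed as whole hours.
hours = ((planned - reference).dt.total_seconds() / 3600).astype(int)

print(hours.tolist())
```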
pandas is not at fault; it is Excel that is interpreting the data wrongly.
Set that column's format to text so it won't be interpreted as a date,
then save the file and open it through pandas, and it should work fine.
Otherwise, export as CSV and try to open that in pandas.
Here is where I ended and it does work:
for i in range(len(df['Planned working time'])):
    pwt = df['Planned working time'][i]
    if len(str(pwt).split(' ')) > 1:
        # Datetime-like value such as "1900-01-01 01:00:00": 24 * day + hour
        if str(pwt).split(' ')[0].split('-')[0] == '1900':
            workint = 24 * int(str(pwt).split(' ')[0].split('-')[2]) + int(str(pwt).split(' ')[1].split(':')[0])
    elif len(str(pwt).split(' ')) == 1:
        # Plain time value such as "23:00:00": take the hour part
        if str(pwt).split(' ')[0].split('-')[0] != '1900':
            workint = int(str(pwt).split(' ')[0].split(':')[0])
    # df.set_value was removed in pandas 1.0; df.at is its replacement
    df.at[i, 'Planned working time'] = workint
Any suggested improvements are welcome, but this results in the correct int values in all cases. Tested on over 14K rows of data. This would likely have to be refined if there were minutes, but there are no cases where minutes show up in the data, and the UI on the front end doesn't appear to actually allow minutes.

Resample pandas time series 30 mins for 9:15 as start time

I am trying to resample OHLC data to 30 mins. The market data starts at 9:15 and I would like the resampled intervals to be 9:15-9:45 and so on. But I am only able to get the data resampled as 9:00-9:30.
Paste Bin link to 1 min market data
pd.DataFrame(download_data).set_index('date')['close'].resample('30T').ohlc()
As you see in the picture the start time is 9:00 and not 9:15...
There is one more way of doing it, you can use the base argument of resample:
pd.DataFrame(download_data).set_index('date')['close'].resample('30T', base=15).ohlc()
The solution is to add the loffset parameter to resample:
loffset : timedelta
Adjust the resampled time labels
df = (pd.DataFrame(download_data)
        .set_index('date')['close']
        .resample('30T', loffset='15min')
        .ohlc())
print (df)
open high low close
date
2018-11-05 09:15:00+05:30 25638.25 25641.85 25589.3 25630.00
2018-11-05 09:45:00+05:30 25622.00 25745.00 25622.0 25714.85
2018-11-05 10:15:00+05:30 25720.05 25740.00 25692.9 25717.00
2018-11-05 10:45:00+05:30 25698.30 25744.75 25667.9 25673.95
2018-11-05 11:15:00+05:30 25680.30 25690.45 25642.9 25655.90
#TFA the use of origin='start' may not be that useful, especially if the first timestamp in the data does not fall on the 9:15 am grid. For example, if we are converting 1-minute data to 15-minute candles and the data starts at 9:37 am, the 15-minute candles will start from 9:37 am, which is not what we need.
The better solution I could find was to use origin=<timestamp>, so something like origin=datetime.fromisoformat('1970-01-01 09:15:00+05:30') does the trick.
Overall code:
from datetime import datetime
pd.DataFrame(download_data).set_index('date')['close'].resample('30T', origin=datetime.fromisoformat('1970-01-01 09:15:00+05:30')).ohlc()
Though the answers provided here work as expected, it must be noted that both loffset and base are deprecated since version 1.1.0. The best and simplest way now is
pd.DataFrame(download_data).set_index('date')['close'].resample('30T', origin='start').ohlc()
This will set the start time to the first timestamp available in the dataframe.
I will summarize the answers and will add another option for origin.
So yeah, the resample method had an argument base as a starting position, so it was possible to write it like:
data.resample('24h', base=9.5).ohlc()
I am using random data just to show how it is done. The base takes floats, so I assumed 9.5 would be 9:30 and it seems to work.
I say it was possible to use the base argument because it is already deprecated, although it still produces nice output:
Now there is a new, clear argument named origin in the resample method. It takes a string as input: as another answer mentions, it accepts keywords like 'start' or 'end', but it also accepts a date string, so we just put a date with the time 9:30 here and this is what we get:
data.resample('24h', origin='2011-12-31 09:30:00').ohlc()
And here is the output:
So just try it with your data.
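To compare the two origin choices discussed in the answers, here is a small self-contained sketch on made-up data. It assumes pandas 1.1 or later (where origin exists), uses a timezone-naive index for simplicity, and starts the 1-minute data at 9:20 on purpose so that the first timestamp is off the 9:15 grid. The values are just a counter, not real prices.

```python
import pandas as pd

# Hypothetical 1-minute "close" series from 09:20 to 09:59.
index = pd.Timestamp("2018-11-05 09:20") + pd.to_timedelta(range(40), unit="min")
close = pd.Series(range(40), index=index)

# Bins anchored on a fixed 09:15 grid, regardless of where the data starts.
aligned = close.resample("30min", origin=pd.Timestamp("1970-01-01 09:15")).ohlc()

# Bins anchored on the first timestamp in the data.
from_start = close.resample("30min", origin="start").ohlc()

print(aligned)
print(from_start)
```

With the fixed origin the bin labels are 09:15 and 09:45; with origin='start' they follow the first row and become 09:20 and 09:50.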

Python CSV data analysis based on date time

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
Here are the requirements: “What is the LATEST movement type for each serial number?”
I need to parse the CSV file and for each UNIQUE serial number, take the movement type that has the LATEST “posting date”.
As an example, for Serial Number 2LMXK1 the latest posting date/time is 1/5/15 at 14:00.
Here is basically what I will need to obtain:
“Serial Number 2LMXK1 has a movement type 301 and was last updated 1/5/15 14:00”.
I have started with some code that parses the CSV file and creates a dictionary.
#Import modules
import csv
import pandas as pd
fields = ['Serial number','Movement type','Posting date']
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields)
dc = df.to_dict()
#print (df['Serial number'])
for value in dc.items():
    print(value)
This code works to parse the CSV and create a dictionary.
However, I need help with the date comparison and filtering techniques. How may I create another dictionary that only lists unique serial numbers with the latest posting date? Once I have created a new filtered data dictionary I can use that to import into our asset management database. The idea is that I will use python to analyze and manipulate the data before importing into our system.
Pandas is a useful library for more than just reading csv files. In fact, you don't need the csv library at all here (it's not being used in the code sample you posted)
First you need to make sure the dates are read in as dates, by using the parse_dates parameter of the read_csv function. Then you can use pandas' grouping functionality.
# parse the 3rd column (index 2) as dates
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields, parse_dates=[2])
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
To create the string that you want, you can then iterate through the rows of last_movement:
for index, row in last_movement.iterrows():
    print('Serial Number {} has a movement type {} and was last updated {}'
          .format(index, row['Movement type'], row['Posting date']))
Which will produce the following:
Serial Number 2LMXK1 has a movement type 301 and was last updated 2015-01-05 14:00:00
Serial Number BR83GP has a movement type 301 and was last updated 2015-01-09 15:30:00
Serial Number JEMLP3 has a movement type 203 and was last updated 2015-01-07 17:30:00
Side note: Pandas should be able to read the column headings for you, so you shouldn't need the usecols parameter
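Since the question asks for a dictionary in the end, here is one possible variant of the same idea, sketched on a subset of the sample data. It uses groupby with idxmax instead of sorting, and it assumes the dates are month/day/year; both choices are mine, not something the question confirms.

```python
import io
import pandas as pd

# Subset of the sample data from the question, inlined so the sketch is self-contained.
csv_text = """Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,301,1/5/15 14:00
2LMXK1,202,1/5/15 13:00
JEMLP3,101,1/6/15 9:00
JEMLP3,203,1/7/15 17:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
"""

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: month/day/year order.
df["Posting date"] = pd.to_datetime(df["Posting date"], format="%m/%d/%y %H:%M")

# Row label of the latest posting date within each serial number.
latest_rows = df.loc[df.groupby("Serial number")["Posting date"].idxmax()]

# Map each serial number to its latest movement type.
latest = latest_rows.set_index("Serial number")["Movement type"].to_dict()
print(latest)
```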
How you build the dict or sort the list depends a little on what you want, but for the parsing side of things, to convert a string into a date object so you can do sane comparisons, you probably want the datetime class in the datetime module (yes, datetime.datetime).
It's got a strptime() function that will do exactly that:
import datetime
datetime.datetime.strptime(r"1/5/15 13:00", "%d/%m/%y %H:%M")
# I've assumed you have a Day/Month/Year format
The only bit of oddness is the format specifier, which is documented here:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
(Note that where it talks about zero-padding, that applies to output; it will parse non-zero-padded numbers fine.)
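Putting strptime to work without pandas, a rough sketch of the "latest movement per serial number" dictionary could look like this. The rows are a hand-copied subset of the sample data, and the format string here assumes month/day/year; swap it for "%d/%m/%y %H:%M" if your dates are day-first as assumed above.

```python
from datetime import datetime

# Hypothetical rows, as they might come out of csv.reader (all values are strings).
rows = [
    ("2LMXK1", "101", "1/5/15 9:00"),
    ("2LMXK1", "301", "1/5/15 14:00"),
    ("2LMXK1", "202", "1/5/15 13:00"),
    ("JEMLP3", "102", "1/7/15 10:00"),
    ("JEMLP3", "203", "1/7/15 17:30"),
]

DATE_FORMAT = "%m/%d/%y %H:%M"  # assumption: month/day/year

latest = {}
for serial, movement, posted in rows:
    when = datetime.strptime(posted, DATE_FORMAT)
    # Keep the row only if it is newer than what we have stored for this serial.
    if serial not in latest or when > latest[serial][1]:
        latest[serial] = (movement, when)

for serial, (movement, when) in latest.items():
    print(serial, movement, when)
```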
