I'm working with futures timeseries where the trading day starts at 17:00:00 CT and ends at 15:15:00 CT of the next day. To account for this I make a change in the index; however, when pivoting the dataframe this change seems to be ignored...
Let's look at it with an example:
import datetime

import numpy as np
import pandas as pd

# Dummy Data
rng = pd.date_range('1/1/2011', periods=5000, freq='min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
df = pd.DataFrame(ts, columns=['ts'])
df_1 = df.resample('5min').last()
# Change index to account for futures hours
df_1.index = pd.to_datetime(df_1.index.values + np.where((df_1.index.time >= datetime.time(17)), pd.offsets.Day(1).nanos, 0))
# Pivoting df_1 and making some formatting changes
df_2 = pd.pivot_table(df_1, index=df_1.index.date, columns=df_1.index.time, values='ts')
df_2.columns = df_2.columns.map(lambda t: t.strftime('%H%M'))
df_2_cols = df_2.columns.tolist()
for i in range(len(df_2_cols)):
    if df_2_cols[i][0] == '0':
        df_2_cols[i] = df_2_cols[i][1:4]
df_2.columns = df_2_cols  # write the reformatted labels back
After doing all this, the dataframe is in the shape and format I want, but the first column, corresponding to the first timestamp of the day, is 00:00:00 instead of 17:00:00 as I intended with the index modification.
How can I fix this?
After pivoting, the columns get sorted chronologically, but you can reorder them. Say the columns are already formatted, so we look for '1700':
pos = np.nonzero(df_2.columns == '1700')[0][0]
(np.nonzero returns a tuple of arrays, hence those double [0]). Then
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns = new_cols)
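Equivalently, the rotation can be done in one step with np.roll (a small sketch; like the snippet above, it assumes the column labels have already been reformatted to 'HHMM' strings):
pos = np.nonzero(df_2.columns == '1700')[0][0]
# rotate the column order so that '1700' becomes the first column
df_2 = df_2.iloc[:, np.roll(np.arange(df_2.shape[1]), -pos)]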
I am new to pandas, so I assume I must be missing something obvious...
Summary:
I have a DataFrame with 300K+ rows. I retrieve a row of new data which may or may not be related to an existing subset of rows in the DF (identified by Group ID), either retrieve the existing Group ID or generate a new one, and finally insert the row with that Group ID.
Pandas seems very slow for this.
Please advise: what am I missing / should I be using something else?
Details:
Columns are (example):
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Each groupID can have many unique timeStamp values.
groupID is internally generated:
Either an existing one is reused (by matching the row to existing data, say by column 'D'),
or a new groupID is generated.
Thus (in my view at least) I cannot do updates/inserts in bulk; I have to do it row by row.
I used an SQL DB analogy and created an index as the concatenation of groupID and timeStamp (I have tried a MultiIndex but it seems even slower).
Finally I insert/update using .loc[ind, columnName]
Code:
import pandas as pd
import numpy as np
import time
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
columnTypeDict = {'groupID':'int64','timeStamp':'int64'}
startID = 1234567
df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)
fID = list(range(startID,startID+300000))
df['groupID'] = fID
ts = [1000000000]*150000 + [10000000001]*150000
df['timeStamp'] = ts
indx = [str(i) + str(j) for i, j in zip(fID, ts)]
df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')
startTime = time.time()
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    #print(ind)
    df.loc[ind, 'A'] = 1
print(df)
print(time.time()-startTime,"secs")
If the index value already exists, it's fast, but if it doesn't, 10,000 inserts take 140 seconds.
I think accessing a dataframe row by row is a relatively expensive operation.
You can temporarily save these values and use them to build a dataframe that is merged with the original one, as follows:
startTime = time.time()
temporary_idx = []
temporary_values = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)

# create a dataframe with the new values and join it with the original dataframe
df = df.drop(columns=["A"])\
    .merge(
        pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
        how="outer", right_index=True, left_index=True
    )
print(df)
print(time.time()-startTime,"secs")
When I benchmarked this, it takes less than 2 seconds to execute.
I don't know exactly what your real use case is, but this covers the case of inserting column A as you stated in your example. If your use case is more complex than that, there might be a better solution.
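If the real workload also needs to fill several columns per new row, a minimal sketch of the same idea (the column names are the ones from the question, the values are made up): collect whole rows in plain Python and concatenate once at the end. Note this only covers genuinely new index values; rows that already exist would still need a separate df.update(new_df) step.
new_rows = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    ind = int(str(groupID) + str(timeStamp))
    # collect complete rows as dicts instead of writing them one .loc call at a time
    new_rows.append({'Index': ind, 'groupID': groupID, 'timeStamp': timeStamp, 'A': 1})
new_df = pd.DataFrame(new_rows).set_index('Index')
# one concat for the rows whose index is not present yet, instead of thousands of single inserts
df = pd.concat([df, new_df[~new_df.index.isin(df.index)]])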
I have a dataframe on which I would like to perform some analysis. As an easy example of what I would like to achieve, take the dataframe:
data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new dataframe from this. The new dataframe should contain two columns: the entire date span (so it should also include 2017-02-14) and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates as so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = dateRange, columns = ['datum'])
My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed to achieve it. I assume this needs to be done quite often and that I am overcomplicating things...
Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
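For the sample data above, this should give the following (a quick sketch of the expected result; 2017-02-14 gets a count of 0 from the fillna):
print(df2)
#        datum  counts
# 0 2017-02-13       3
# 1 2017-02-14       0
# 2 2017-02-15       1
# 3 2017-02-16       1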
I have the following data in the format below.
I next perform recasting, groupby, and averaging (see code) to reduce the data dimensionality.
df_mod=pd.read_csv('wet_bulb_hr.csv')
#Mod Date
df_mod['wbt_date'] = pd.to_datetime(df_mod['wbt_date'])
#Mod Time
df_mod['wbt_time'] = df_mod['wbt_time'].astype('int')
df_mod['wbt_date'] = df_mod['wbt_date'] + \
pd.to_timedelta(df_mod['wbt_time']-1, unit='h')
df_mod['wet_bulb_temperature'] = \
df_mod['wet_bulb_temperature'].astype('float')
df = df_mod
df = df.drop(['wbt_time','_id'], axis = 1)
#df_novel = df.mean()
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
After writing to an output file, I get an output that looks like this.
Investigating further, I can understand why: all my processing has resulted in a DataFrame where the year and month now live in the groupby index, but what I really need is the two wbt_date columns to be exported as well. This does not happen because of the groupby.
My question: how do I generate an index and turn the two groupby wbt_date levels into a single new column, so that the output is:
You can flatten the MultiIndex to an Index of YYYY-MM strings with a list comprehension:
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
df.index = [f'{y}-{m}' for y, m in df.index]
df = df.rename_axis('date').reset_index()
Or use month periods via Series.dt.to_period:
df = df.groupby(df.wbt_date.dt.to_period('m')).mean().reset_index()
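If the period should also end up as a plain YYYY-MM string column, a small follow-up sketch (assuming the reset column keeps the name wbt_date):
df['wbt_date'] = df['wbt_date'].astype(str)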
Try this,
# rename the existing index; on reset it will get added as a new column
df.index.rename("wbt_year", inplace=True)
df.reset_index(inplace=True)
df['month'] = df['wbt_year'].astype(str) + "-" + df['wbt_date'].astype(str)
Output,
>>> df['month']
0 2019-0
1 2018-1
2 2017-2
Please see the data here: screenshot from Google Colab.
I am trying to assign the time 19:00 (7 pm) to all records of the column "Beginn_Zeit". For now I put in the float 19.00. Now I need to convert it to a time format so that I can subsequently merge it with a date from the column "Beginn_Datum". Once I have this merged column, I need to paste its value into all records with NaT in a different column, "Delta2".
dfd['Beginn'] = pd.to_datetime(df['Beginn'], dayfirst=True)
dfd['Ende'] = pd.to_datetime(df['Ende'], dayfirst=True)
dfd['Delta2'] = dfd['Ende']-dfd['Beginn']
dfd.Ende.fillna(dfd.Beginn,inplace=True)
dfd['Beginn_Datum'] = dfd['Beginn'].dt.date
dfd["Beginn_Zeit"] = 19.00
Edited to better match your updated example.
from datetime import time, datetime
dfd['Beginn_Zeit'] = time(19,0)
# create new column combining date and time
new_col = dfd.apply(lambda row: datetime.combine(row['Beginn_Datum'], row['Beginn_Zeit']), axis=1)
# replace null values in Delta2 with new combined dates
dfd.loc[dfd['Delta2'].isnull(), 'Delta2'] = new_col
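A vectorized alternative sketch, in case the row-wise apply gets slow on larger frames (it assumes Beginn_Datum holds plain dates, as created above):
new_col = pd.to_datetime(dfd['Beginn_Datum']) + pd.Timedelta(hours=19)
dfd.loc[dfd['Delta2'].isnull(), 'Delta2'] = new_col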
New to pandas.
I have a DataFrame with the columns:
A B C Date1 Date2 D, with multiple rows of values. I want to divide the entire DataFrame into multiple dataframes based on quarters, i.e. (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec). I am trying to use only the Date1 column values for this. I tried the following so far:
data_q = data.groupby(pandas.TimeGrouper(freq = '3M'))
The dates are in the form 2009-11-03.
There are a few ways to do this.
I would first ensure that the Date1 column is a datetime type by checking its .dtype attribute,
e.g. df['Date1'].dtype
If it's not, cast to datetime object using:
df.Date1 = pd.to_datetime(df.Date1)
Add a quarters column for eventual data frame slicing:
df['quarters'] = df.Date1.dt.quarter
Create your data frames:
q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2]
q3 = df[df.quarters == 3]
q4 = df[df.quarters == 4]
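If you'd rather not write out four variables by hand, a compact sketch collecting the slices in a dict keyed by quarter:
quarter_frames = {q: frame for q, frame in df.groupby('quarters')}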
So the approach that appears easiest to me is to convert Date1 to your index, then groupby on the quarter.
df2 = df.set_index('Date1')
quardfs = list(df2.groupby(df2.index.quarter))
This will leave you with quardfs, which is a list of (quarter, DataFrame) pairs.
If you don't want to set Date1 to an index, you can also copy it out of the DataFrame and use it:
quars = pd.DatetimeIndex(df['Date1']).quarter
quardfs = list(df.groupby(quars))
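As above, each element of quardfs is a (quarter, DataFrame) pair, so a quick usage sketch:
for quarter, frame in quardfs:
    print(quarter, frame.shape)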