Time Series Data Reformat - python

I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe so that it is indexed by the date and uses the time as the header (i.e. 0:00, 1:00, ... , 23:00). The dataframe will be filled in with the value.
Here is the dataframe I currently have.
Essentially I'd like to move the index to a single day and show the hours through the columns.
Thanks,

Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[:, 0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
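For a self-contained check of the pivot approach, here is a minimal sketch. The sample values and the relabelling of the columns to the 0:00, 1:00 style are assumptions for illustration, not taken from the asker's data. Note that pivot raises a ValueError if a (Date, Time) pair appears more than once; pivot_table, which aggregates duplicates, is an alternative in that case.

```python
import pandas as pd

# Made-up sample in the long [Date, Time, Total] layout
df = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "Time": ["00:00:00", "01:00:00", "00:00:00", "01:00:00"],
    "Total": [1.0, 2.0, 3.0, 4.0],
})

# One row per date, one column per time of day
wide = df.pivot(index="Date", columns="Time", values="Total")

# Optional: relabel "00:00:00" -> "0:00", "01:00:00" -> "1:00", ...
wide.columns = [f"{int(t[:2])}:{t[3:5]}" for t in wide.columns]
print(wide)
```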

You could try this.
Split the time part to get only the hour. Add hr to it.
import pandas as pd
df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57], ['2019-01-01', '01:00:00', -137.57], ['2019-01-02', '00:00:00', -147.57]], columns=['Date', 'Time', 'Totals'])
# Keep only the hour from Time and prefix it with 'hr'
df['hours'] = df['Time'].apply(lambda x: 'hr' + str(int(x.split(':')[0])))
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN

Related

Extract a min value and corresponding row information based on date time index

I'm unable to filter my dataframe to give me the details I need. I'm trying to obtain the min value for any given 24 hour period but keep the detail in the filtered rows, like the day name.
Below is the head of the df dataframe. It has a DatetimeIndex named t; I've pulled the hour column as an integer from the index time information, and Low is a float. When I run the code shown between the two dataframes below, df2 is the second dataframe I get returned.
df = pd.read_csv(r'C:/Users/Oliver/Documents/Data/eurusd1hr.csv')
df.drop(columns=(['Close', 'Open', 'High', 'Volume']), inplace=True)
df['t'] = (df['Date'] + ' ' + df['Time'])
df.drop(columns = ['Date', 'Time'], inplace=True)
df['t'] = pd.DatetimeIndex(df.t)
df.set_index('t', inplace=True)
df['hour'] = df.index.hour
df['day'] = df.index.day_name()
Low hour day
t
2003-05-05 03:00:00 1.12154 3 Monday
2003-05-05 04:00:00 1.12099 4 Monday
2003-05-05 05:00:00 1.12085 5 Monday
2003-05-05 06:00:00 1.12049 6 Monday
2003-05-05 07:00:00 1.12079 7 Monday
df2 = df.between_time('00:00', '23:00').groupby(pd.Grouper(freq='d')).min()
Low hour day
t
2003-05-05 1.12014 3.0 Monday
2003-05-06 1.12723 0.0 Tuesday
2003-05-07 1.13265 0.0 Wednesday
2003-05-08 1.13006 0.0 Thursday
2003-05-09 1.14346 0.0 Friday
I want to keep the corresponding hour in the hour column and also the hour information in the original index, in the same way the day name has been maintained.
I was expecting the index and hour column to keep the information.
I've tried adding a second Grouper but failed. I have also tried to reset the index. Any help would be gratefully received. Thanks.
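One possible way to get this, sketched under assumptions: a groupby followed by min() takes the minimum of every column independently, which is probably why the hour column shows 0.0 and the index loses its time part. Using idxmin on the Low column instead returns the timestamp of each daily minimum, and those labels can be used to select the full original rows. The data below is made up for illustration and is not the asker's EURUSD file.

```python
import pandas as pd

# Made-up hourly lows spanning two days
idx = pd.to_datetime([
    "2003-05-05 03:00", "2003-05-05 04:00", "2003-05-05 05:00",
    "2003-05-06 00:00", "2003-05-06 01:00", "2003-05-06 02:00",
])
df = pd.DataFrame(
    {"Low": [1.12154, 1.12099, 1.12120, 1.12800, 1.12723, 1.12790]},
    index=idx,
)
df.index.name = "t"
df["hour"] = df.index.hour
df["day"] = df.index.day_name()

# For each calendar day, find the timestamp at which Low is smallest ...
min_idx = df.groupby(df.index.normalize())["Low"].idxmin()

# ... then select those complete rows, so hour and day belong to that Low
df2 = df.loc[min_idx.to_numpy()]
print(df2)
```

Because whole rows are selected, the index keeps the full timestamp of each minimum and the hour column matches it.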

Get several previous rows of a dataframe while using iterrow

While using iterrows(), I would like to create a "temporary" dataframe which would include several previous rows (not consecutive) from my initial dataframe, identified using the index.
For each step of iterrows(), I will create the "temporary" dataframe including 4 previous prices from the initial df, with the prices separated by 4 hours. I will then calculate the average of these prices. The objective is to be able to change the number of prices and the gap between prices easily.
I tried several ways to get the previous rows but without success. I understand that, as my index is a timestamp, I need to use timedelta, but it doesn't work.
My initial dataframe "df":
Price
timestamp
2022-04-01 00:00:00 124.39
2022-04-01 01:00:00 121.46
2022-04-01 02:00:00 118.75
2022-04-01 03:00:00 121.95
2022-04-01 04:00:00 121.15
... ...
2022-04-09 13:00:00 111.46
2022-04-09 14:00:00 110.90
2022-04-09 15:00:00 109.59
2022-04-09 16:00:00 110.25
2022-04-09 17:00:00 110.88
My code :
from datetime import timedelta
df_test = None
dt_test = pd.DataFrame(columns=['Price','Price_Avg'])
dt_Avg = None
dt_Avg = pd.DataFrame(columns=['PreviousPrices'])
for index, row in df.iterrows():
    Price = row['Price']
    #Creation of a temporary Df to stock 4 previous prices, each price separated by 4 hours :
    for i in range (0,4):
        delta = 4*(i+1)
        PrevPrice = df.loc[(index-timedelta(hours= delta)),'Price']
        myrow_dt_Avg = {'PreviousPrices': PrevPrice}
        dt_Avg = dt_Avg.append(myrow_dt_Avg, ignore_index=True)
    #Calculation of the Avg of the 4 previous prices :
    Price_Avg = dt_Avg['PreviousPrices'].sum()/4
    #Clear dt_Avg :
    dt_Avg = dt_Avg[0:0]
    myrow_df_test = {'Price':Price,'Price_Avg': Price_Avg}
    df_test = df_test.append(myrow_df_test, ignore_index=True)
dt_test
The line PrevPrice = df.loc[(index-timedelta(hours= delta)),'Price'] is failing. Do you have any idea why?
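A likely cause, stated as an assumption since no traceback is shown: for the earliest rows the timestamp 4, 8, 12 or 16 hours before does not exist in the index, and .loc raises a KeyError for a missing label. Separately, DataFrame.append was removed in pandas 2.0. A loop-free sketch that avoids both issues shifts the series by a time offset and averages the shifted copies. The helper name lagged_average and the sample prices are hypothetical.

```python
import pandas as pd

def lagged_average(prices, n_lags=4, gap_hours=4):
    """Mean of the prices found gap_hours, 2*gap_hours, ... before each timestamp."""
    lagged = [
        # Move every value forward in time by k*gap_hours, then line the
        # result up with the original timestamps (missing lags become NaN)
        prices.shift(freq=pd.Timedelta(hours=gap_hours * k)).reindex(prices.index)
        for k in range(1, n_lags + 1)
    ]
    # skipna=False: the average is NaN unless all n_lags earlier prices exist
    return pd.concat(lagged, axis=1).mean(axis=1, skipna=False)

# Made-up hourly prices 0.0, 1.0, ..., 23.0
idx = pd.date_range("2022-04-01", periods=24, freq="h")
df = pd.DataFrame({"Price": [float(i) for i in range(24)]}, index=idx)
df["Price_Avg"] = lagged_average(df["Price"])
print(df.tail())
```

Changing the number of prices or the gap then only means passing different n_lags and gap_hours values.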

How do I delete the first year of data by variable from a pandas dataframe?

I have a pandas dataframe with various stock symbols and prices by day, and I'm looking to remove the first year of data for each symbol.
Current dataframe:
Date Symbol Price
2009-01-01 00:00:00 A $10.00
2009-01-02 00:00:00 A $11.00
...
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2009-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Goal dataframe:
Date Symbol Price
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2010-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Any help is appreciated, thanks!
You can use the Date column to get only the year and then use it to drop rows.
If Date is stored as a string then you could try to get the year using
df["Year"] = df["Date"].str[:4]
and filter using string "2009"
df = df[ df["Year"] != "2009" ]
If Date is stored as a datetime object then you may need something like
df["Year"] = df["Date"].dt.year
and filter using integer 2009
df = df[ df["Year"] != 2009 ]
but I'm not sure.
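The filter above hard-codes 2009, which fits the sample where every symbol starts in 2009. If symbols can start on different dates, one possible variation (an assumption about the intent, not part of the answer above) is to compute each symbol's first date and keep only rows at least one year after it. The sample rows below are made up.

```python
import pandas as pd

# Made-up data: symbol A starts in 2009, symbol B starts in 2010
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2009-01-01", "2009-06-01", "2010-01-01", "2011-01-01",
        "2010-03-01", "2011-02-28", "2011-03-01",
    ]),
    "Symbol": ["A", "A", "A", "A", "B", "B", "B"],
    "Price": [10.0, 11.0, 12.0, 15.0, 100.0, 150.0, 200.0],
})

# First observed date of each symbol, repeated on every row of that symbol
first_date = df.groupby("Symbol")["Date"].transform("min")

# Keep rows dated one year or more after the symbol's first date
cutoff = first_date + pd.DateOffset(years=1)
result = df[df["Date"] >= cutoff]
print(result)
```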

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas or should I write separate logic?
Once you have a dataframe sorted by l_date, you can create a continuous dummy date (dum_date) column and group by a 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today().normalize(), periods=df.shape[0])
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))
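As a small illustration of the second approach, here is a sketch that applies it to the sample dates from the question and summarizes each group. The summary columns (min, max, count) are chosen only to make the grouping visible and are an assumption about what is useful.

```python
import pandas as pd

# Sample dates from the question
df = pd.DataFrame({"l_date": pd.to_datetime([
    "2015-04-18", "2015-04-20", "2015-04-20", "2015-04-21",
    "2015-04-27", "2015-04-30", "2015-05-07", "2015-05-08",
])})

n = 2  # number of consecutive rows per group
df = df.sort_values("l_date").reset_index(drop=True)

# Rows 0..n-1 get group 1, rows n..2n-1 get group 2, and so on
df["grouping"] = df.index // n + 1

summary = df.groupby("grouping")["l_date"].agg(["min", "max", "count"])
print(summary)
```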

adding column with per-row computed time difference from group start?

(newbie to python and pandas)
I have a data set of 15 to 20 million rows, each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,] But I'm stuck very early ...
I start with something like this:
rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
'2000-01-01 08:15', '2000-01-02 18:00',
'2000-01-02 17:00', '2000-03-01 08:00',
'2000-03-01 08:20','2000-01-02 18:00'])
uid=Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc=Series(['','x1','A123','1.23','','','','u3'])
df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid, but can be duplicated (two uid can be seen at the same time, but any one uid is seen only once at any one timestamp)
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen but it's all NaN and shows a type of float for type(df.sinceseen[0]) - though, if I just print the Series (in iPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError on the "ugroup.index - ugroup.index.min()" even though each of the two operands is a Timestamp.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].transform(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
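To go one step further toward the "visits per day after first visit" series described in the question, a possible continuation is sketched below. It uses the uid and timestamp values from the question; the column names day_offset and visits are introduced here for illustration only.

```python
import pandas as pd

# uid and ts values taken from the question
df = pd.DataFrame({
    "uid": ["u1", "u2", "u1", "u2", "u1", "u2", "u2", "u3"],
    "ts": pd.to_datetime([
        "2000-01-01 08:00", "2000-01-02 08:00",
        "2000-01-01 08:15", "2000-01-02 18:00",
        "2000-01-02 17:00", "2000-03-01 08:00",
        "2000-03-01 08:20", "2000-01-02 18:00",
    ]),
})

# Timedelta back to each user's first observation (no sorting needed)
first_seen = df.groupby("uid")["ts"].transform("min")
df["since_seen"] = df["ts"] - first_seen

# Whole days elapsed since the first visit
df["day_offset"] = df["since_seen"].dt.days

# Number of visits per user and per day offset
visits = df.groupby(["uid", "day_offset"]).size()
print(visits)
```

Selecting visits.loc["u2"] then gives one user's series indexed by days after the first visit, which can be plotted directly.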
