I'm unable to filter my DataFrame to give me the details I need. I'm trying to obtain the minimum Low for any given 24-hour period, but keep the detail in the filtered rows, like the day name.
Below is the head of the df DataFrame. It has a DatetimeIndex t; the hour column is an integer pulled from the index's time information, and Low is a float. When I run the code shown between the two frames below, df2 is the second DataFrame I get back:
import pandas as pd

df = pd.read_csv(r'C:/Users/Oliver/Documents/Data/eurusd1hr.csv')
df.drop(columns=(['Close', 'Open', 'High', 'Volume']), inplace=True)
df['t'] = (df['Date'] + ' ' + df['Time'])
df.drop(columns = ['Date', 'Time'], inplace=True)
df['t'] = pd.DatetimeIndex(df.t)
df.set_index('t', inplace=True)
df['hour'] = df.index.hour
df['day'] = df.index.day_name()
Low hour day
t
2003-05-05 03:00:00 1.12154 3 Monday
2003-05-05 04:00:00 1.12099 4 Monday
2003-05-05 05:00:00 1.12085 5 Monday
2003-05-05 06:00:00 1.12049 6 Monday
2003-05-05 07:00:00 1.12079 7 Monday
df2 = df.between_time('00:00', '23:00').groupby(pd.Grouper(freq='d')).min()
Low hour day
t
2003-05-05 1.12014 3.0 Monday
2003-05-06 1.12723 0.0 Tuesday
2003-05-07 1.13265 0.0 Wednesday
2003-05-08 1.13006 0.0 Thursday
2003-05-09 1.14346 0.0 Friday
I want to keep the corresponding hour in the hour column, and also keep the hour information in the original index, the way the day name has been maintained. I was expecting the index and the hour column to keep that information.
I've tried adding a second Grouper, but failed. I have also tried resetting the index. Any help would be gratefully received, thanks.
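A possible approach (a sketch, not from the original post): rather than taking min() of every column, take idxmin() of Low within each day and use those timestamps to select the full original rows, so the index, hour and day columns all survive. The hourly data below is made up to stand in for eurusd1hr.csv:

```python
import numpy as np
import pandas as pd

# made-up hourly Lows standing in for eurusd1hr.csv: two days, lowest at 12:00 each day
idx = pd.date_range('2003-05-05 00:00', periods=48, freq='h', name='t')
df = pd.DataFrame({'Low': np.abs(np.arange(48) % 24 - 12) + 1.12}, index=idx)
df['hour'] = df.index.hour
df['day'] = df.index.day_name()

# idxmin() per calendar day returns the timestamp of each day's lowest Low;
# .loc then keeps the complete original rows, index and all
daily_min_rows = df.loc[df.groupby(df.index.date)['Low'].idxmin()]
```

Grouping on df.index.date keeps whole calendar days together, so each selected row still carries its own timestamp, hour and day name.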
While using iterrows(), I would like to create a "temporary" DataFrame which includes several previous rows (not consecutive) from my initial DataFrame, identified using the index.
For each step of iterrows(), I create the "temporary" DataFrame holding 4 previous prices from the initial df, all prices separated by 4 hours, and then calculate the average of these prices. The objective is to be able to change the number of prices and the gap between prices easily.
I have tried several ways to get the previous rows, but without success. I understand that, as my index is a timestamp, I need to use timedelta, but it doesn't work.
My initial dataframe "df":
Price
timestamp
2022-04-01 00:00:00 124.39
2022-04-01 01:00:00 121.46
2022-04-01 02:00:00 118.75
2022-04-01 03:00:00 121.95
2022-04-01 04:00:00 121.15
... ...
2022-04-09 13:00:00 111.46
2022-04-09 14:00:00 110.90
2022-04-09 15:00:00 109.59
2022-04-09 16:00:00 110.25
2022-04-09 17:00:00 110.88
My code :
from datetime import timedelta
import pandas as pd

df_test = pd.DataFrame(columns=['Price', 'Price_Avg'])
dt_Avg = pd.DataFrame(columns=['PreviousPrices'])

for index, row in df.iterrows():
    Price = row['Price']
    # Creation of a temporary df to stock the 4 previous prices, each separated by 4 hours:
    for i in range(0, 4):
        delta = 4 * (i + 1)
        PrevPrice = df.loc[index - timedelta(hours=delta), 'Price']
        myrow_dt_Avg = {'PreviousPrices': PrevPrice}
        dt_Avg = dt_Avg.append(myrow_dt_Avg, ignore_index=True)
    # Calculation of the average of the 4 previous prices:
    Price_Avg = dt_Avg['PreviousPrices'].sum() / 4
    # Clear dt_Avg:
    dt_Avg = dt_Avg[0:0]
    myrow_df_test = {'Price': Price, 'Price_Avg': Price_Avg}
    df_test = df_test.append(myrow_df_test, ignore_index=True)

df_test
The line PrevPrice = df.loc[index - timedelta(hours=delta), 'Price'] is the one failing; do you have any idea why?
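Not from the original post, but a possible explanation and workaround: for the earliest timestamps there is no row 4, 8, 12 or 16 hours back, so the .loc lookup fails. Below is a sketch that avoids iterrows() entirely, assuming (as in the sample) exactly one row per hour, so shift(k) looks back k hours; the prices are made up:

```python
import pandas as pd

# made-up hourly prices shaped like the df in the question
idx = pd.date_range('2022-04-01 00:00', periods=24, freq='h', name='timestamp')
df = pd.DataFrame({'Price': range(100, 124)}, index=idx)

n_prices, gap_hours = 4, 4  # easy to change, as the question requires
# shift(k) on strictly hourly rows looks back k hours; rows without a full
# history simply get NaN instead of raising an error
df['Price_Avg'] = sum(df['Price'].shift(k * gap_hours)
                      for k in range(1, n_prices + 1)) / n_prices
```

Rows near the start of the data get NaN rather than an error, which matches having no full 16-hour history there.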
I have a pandas dataframe with various stock symbols and prices by day, and I'm looking to remove the first year of data for each symbol.
Current dataframe:
Date Symbol Price
2009-01-01 00:00:00 A $10.00
2009-01-02 00:00:00 A $11.00
...
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2009-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Goal dataframe:
Date Symbol Price
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2010-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Any help is appreciated, thanks!
You can use the column Date to get only the year, and then use it to drop rows.
If Date is a string, then you could try to get the year using
df["Year"] = df["Date"].str[:4]
and filter using the string "2009":
df = df[ df["Year"] != "2009" ]
If Date is kept as a datetime object, then you may need something like
df["Year"] = df["Date"].dt.year
and filter using the integer 2009:
df = df[ df["Year"] != 2009 ]
but I'm not sure.
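Since the question asks for the first year of each symbol, and symbols may not all start in 2009, here is a hedged sketch of a per-symbol variant of the same year-filtering idea (the data below is hypothetical, shaped like the example):

```python
import pandas as pd

# hypothetical rows shaped like the question's example; B starts in 2011 here
df = pd.DataFrame({
    'Date': pd.to_datetime(['2009-01-01', '2010-01-01', '2019-01-01',
                            '2011-06-01', '2012-06-01']),
    'Symbol': ['A', 'A', 'A', 'B', 'B'],
    'Price': [10.0, 12.0, 15.0, 100.0, 200.0],
})

# transform('min') broadcasts each symbol's earliest date back to its rows,
# so every symbol loses only its own first calendar year
first_year = df.groupby('Symbol')['Date'].transform('min').dt.year
out = df[df['Date'].dt.year != first_year]
```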
I have a column in a DataFrame which contains non-continuous dates. I need to group these dates with a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result:
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have an l_date-sorted DataFrame, you can create a continuous dummy date column (dum_date) and group by a 2D frequency on it:
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
Or, if you are fine with groupings other than dates, a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))
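As a runnable illustration of the second approach, using the sample dates from the question and n = 2:

```python
import pandas as pd

# the eight sample dates from the question (already normalized to midnight)
dates = pd.to_datetime(['2015-04-18', '2015-04-20', '2015-04-20', '2015-04-21',
                        '2015-04-27', '2015-04-30', '2015-05-07', '2015-05-08'])
df = pd.DataFrame({'l_date': dates})

n = 2
df = df.sort_values(by='l_date')
# integer-divide the row position by n so every n consecutive rows share one label
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]
group_sizes = df.groupby('grouping').size()
```

Each group then holds exactly n rows regardless of the gaps between the dates.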
(Newbie to Python and pandas.)
I have a data set of 15 to 20 million rows; each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visits-per-day pattern of each user, normalized to their first visit. So I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e. I need to get a series indexed by a timedelta, with values of visits in the period ending with that delta ([0:1, 3:5, 4:2, 6:8]). But I'm stuck very early...
I start with something like this:
import pandas as pd
from pandas import Series, DataFrame

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
'2000-01-01 08:15', '2000-01-02 18:00',
'2000-01-02 17:00', '2000-03-01 08:00',
'2000-03-01 08:20','2000-01-02 18:00'])
uid=Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc=Series(['','x1','A123','1.23','','','','u3'])
df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid, but can be duplicated (two uid can be seen at the same time, but any one uid is seen only once at any one timestamp)
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen column, but it's all NaN, and type(df.sinceseen[0]) shows float, though if I just print the Series (in IPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis", and it seems like apply() should work, but
def fg(ugroup):
ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError on the ugroup.index - ugroup.index.min(), even though each of the two operands is a Timestamp.
So I'm flailing. Can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
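For what it's worth, a sketch of the same computation with transform, which broadcasts each uid's first timestamp back to every row and needs no sort or per-group apply (the data is the question's toy set):

```python
import pandas as pd

# the toy data from the question
rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
df = pd.DataFrame({'uid': ['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u2', 'u3'],
                   'ts': rng})

# transform('min') returns a series aligned with df, holding each row's
# group minimum, so the subtraction yields a timedelta per observation
df['since_seen'] = df['ts'] - df.groupby('uid')['ts'].transform('min')
```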