Find the latest datetime for each date in a dataframe PANDAS - python

I have a folder on my computer containing ~8500 .csv files, each named after a stock ticker. Within each .csv file there is a 'timestamp' and a 'users_holding' column. The 'timestamp' column is set up as a datetime index, and its entries are hourly readings for each day, e.g. 2019-12-01 01:50, 2020-01-01 02:55... 2020-01-01 01:45, etc. Each timestamp has a corresponding integer representing the number of users holding at that time. I want to write a for loop that iterates through all of the .csv files and tallies up the total users holding across all files for the latest time of every day, starting on February 1st, 2020 (2020-02-01) and running to the last day in each .csv file. The folder updates daily, so I can't hard-code an end date.
This is the for loop I have set up to establish each ticker as a dataframe:
import glob
import pandas as pd

path = 'C:\\Users\\N****\\Desktop\\r******\\t**\\p*********\\'
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    df = pd.read_csv(filename, header=0, parse_dates=['timestamp'], index_col='timestamp')
If anyone could show me how to write the for loop that finds the latest entry for each date and tallies up that number for each day, that would be amazing.
Thank you!

First, create a data frame with a Datetime index (in one-hour steps):
import numpy as np
import pandas as pd
idx = pd.date_range(start='2020-01-01', end='2020-01-31', freq='H')
data = np.arange(len(idx) * 3).reshape(len(idx), 3)
columns = ['ticker-1', 'ticker-2', 'ticker-3']
df = pd.DataFrame(data=data, index=idx, columns=columns)
print(df.head())
                     ticker-1  ticker-2  ticker-3
2020-01-01 00:00:00         0         1         2
2020-01-01 01:00:00         3         4         5
2020-01-01 02:00:00         6         7         8
2020-01-01 03:00:00         9        10        11
2020-01-01 04:00:00        12        13        14
Then, group by the index, keeping the year-month-day part but dropping hours-minutes-seconds. The aggregation function is .last():
result = (df.groupby(by=df.index.strftime('%Y-%m-%d'))
            [['ticker-1', 'ticker-2', 'ticker-3']]
            .last()
         )
print(result.head())
            ticker-1  ticker-2  ticker-3
2020-01-01        69        70        71
2020-01-02       141       142       143
2020-01-03       213       214       215
2020-01-04       285       286       287
2020-01-05       357       358       359
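Applied back to the setup in the question (one CSV per ticker, taking each day's last 'users_holding' reading from 2020-02-01 onward and summing across files), a minimal sketch could look like this. The read_csv call and column names come from the question; accumulating the per-day totals with Series.add(fill_value=0) is my own assumption about how the tallies should be combined, and it assumes each file's timestamp index is sorted:

import glob
import pandas as pd

path = 'C:\\Users\\N****\\Desktop\\r******\\t**\\p*********\\'
all_files = glob.glob(path + "/*.csv")

daily_totals = None
for filename in all_files:
    df = pd.read_csv(filename, header=0, parse_dates=['timestamp'], index_col='timestamp')
    # last 'users_holding' reading of each calendar day, from 2020-02-01 onward
    s = df.loc['2020-02-01':, 'users_holding']
    daily_last = s.groupby(s.index.date).last()
    # running total across all tickers (assumed aggregation)
    daily_totals = daily_last if daily_totals is None else daily_totals.add(daily_last, fill_value=0)

print(daily_totals)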

Related

Pandas question - calculated dataframe column

I have half-hourly data held within a pandas dataframe as follows:
DateTime Open High Low Close Volume
0 2005-09-06 17:00:00 1103.00 1103.50 1103.00 1103.25 744
I want to add a column to this data called "Daily_Open", which is simply the open price for that given day at 14:30. Let's say that for each day in question there are 10 half-hour rows before moving on to the next day's data, and so on. The desired column would merely show the 14:30 open price of that particular day, repeated for all relevant rows. In T-SQL, I would do this either with a correlated subquery or with a join on the date part of the DateTime column. I have tried the following code:
data = pd.read_csv("ESHalf.txt")
data.rename(columns={"Close/Last": "Close"}, inplace=True)
data.columns = ["DateTime", "Open", "High", "Low", "Close", "Volume"]
data["DateTime"] = pd.to_datetime(data["DateTime"])
data["Date"] = data["DateTime"].dt.date
open_cond = (data["DateTime"].dt.hour == 14) & (data["DateTime"].dt.minute == 30)
data["Daily_Open"] = data["Open"][open_cond]
This successfully extracts the values in question, but when applied to the original dataframe it produces NaN for every row whose time component is not 14:30. I have a feeling I should be using apply or transform in some way. Any ideas?
Many thanks,
You can mask the non-14:30 values and transform with the first valid value per group:
from datetime import time

# ensure datetime
df['DateTime'] = pd.to_datetime(df['DateTime'])

# locate target time
mask = df['DateTime'].dt.time.eq(time(14, 30))

df['Daily_Open'] = (df['Open'].where(mask)
                    .groupby(df['DateTime'].dt.date)
                    .transform('first')
                    )
example (with more dummy rows):
             DateTime    Open    High     Low    Close  Volume  Daily_Open
0 2005-09-06 14:30:00  1000.0  2000.0   500.0  1200.00     700      1000.0
1 2005-09-06 17:00:00  1103.0  1103.5  1103.0  1103.25     744      1000.0
2 2005-09-07 14:30:00  1200.0  2000.0   500.0  1200.00     700      1200.0
3 2005-09-07 17:00:00  1103.0  1103.5  1103.0  1103.25     744      1200.0
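If you prefer something closer to the SQL-style join mentioned in the question, here is a rough sketch along those lines (column names as in the question; it assumes exactly one 14:30 row per day and a frame that does not yet have a Daily_Open column):

from datetime import time
import pandas as pd

# key the 14:30 rows by calendar date, then merge them back onto every row of that date
df['Date'] = df['DateTime'].dt.date
opens = (df.loc[df['DateTime'].dt.time.eq(time(14, 30)), ['Date', 'Open']]
           .rename(columns={'Open': 'Daily_Open'}))
df = df.merge(opens, on='Date', how='left')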

Pulling start date, end date, and mean quantity for unbalanced dataset

I have a dataset (seen in the image) that consists of cities (column "IBGE"), dates, and quantities (column "QTD"). I am trying to extract three things into a new column: start date per "IBGE", end date per "IBGE", and mean per "code".
Also, before doing so, should I change the index of my dataset?
The panel data is unbalanced, so different "IBGE" values have different start and end dates, and mean. How could I go about creating a new data frame with the following information separated in columns? I want the dataframe to look like this:
CODE   Start       End         Mean QTD
10001  2020-01-01  2022-01-01  604
10002  2019-09-01  2021-10-01  1008
10003  2019-02-01  2020-12-01  568
10004  2020-03-01  2021-05-01  223
...    ...         ...         ...
99999  2020-02-01  2022-04-01  9394
I am thinking that maybe a for or while loop could pull that information together, but I am not sure how to write the code.
Try with groupby and named aggregations:
# convert DATE column to datetime if needed
df["DATE"] = pd.to_datetime(df["DATE"])

output = df.groupby("IBGE").agg(Start=("DATE", "min"),
                                End=("DATE", "max"),
                                Mean_QTD=("QTD", "mean"))
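If you also want the result shaped like the table in the question (the code as a regular column rather than the index), a possible follow-up; renaming IBGE to CODE is my own assumption to match the desired header:

output = (df.groupby("IBGE")
            .agg(Start=("DATE", "min"), End=("DATE", "max"), Mean_QTD=("QTD", "mean"))
            .reset_index()                       # IBGE back as a column
            .rename(columns={"IBGE": "CODE"}))   # match the desired column name
print(output.head())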

Daily python data to weekly via pandas

I want to take a CSV that shows daily data and create a new sheet that includes the same data in a weekly view.
Currently I have:
#import required libraries
import pandas as pd
from datetime import datetime
#read the daily data file
looker_data = pd.read_csv("jan21_looker_upload.csv")
#convert date column into datetime object
looker_data['Date'] = looker_data['Date'].astype('datetime64[ns]')
#convert daily data to weekly
weekly_data = looker_data.groupby("URL").resample('W-Mon', label='right', closed='right', on='Date').sum().reset_index().sort_values(by='Date')
##Export in Excel
weekly_data.to_excel("jan21-looker.xlsx")
The code works but removes specific data points that I would like to keep in the new view. For reference, the existing CSV looks something like this:
Date | URL | Sessions | Conversions
1/14/21 | example.com/ | 110333. | 330
But when I run the code I get:
URL | Date | Conversions
example.com/ | 1/14/21 | 330
Is there something I am missing that will make the output include all of the data in a weekly view? All help is appreciated!
I synthesized data along the lines you describe, resampled it in the same way, and additionally put the column order back in place with a final loc[]:
import datetime as dt
import numpy as np
import pandas as pd

d = pd.date_range(dt.date(2021, 1, 20), "today")
df = pd.DataFrame({
    "Date": d,
    "URL": np.random.choice(["example.com/", "google.com/", "bigbank.com/"], len(d)),
    "Sessions": np.random.randint(3000, 300000, len(d)),
    "Conversations": np.random.randint(200, 500, len(d))
})

dfw = (df.groupby("URL").resample('W-Mon', label='right', closed='right', on='Date').sum()
       # cleanup - index as columns, order of rows & columns
       .reset_index().sort_values("Date").loc[:, df.columns]
       )
                  Date           URL  Sessions  Conversations
0  2021-01-25 00:00:00  bigbank.com/    187643            226
4  2021-01-25 00:00:00  example.com/    454543            966
7  2021-01-25 00:00:00   google.com/    143307            574
1  2021-02-01 00:00:00  bigbank.com/    335614            904
5  2021-02-01 00:00:00  example.com/    260055            623
8  2021-02-01 00:00:00   google.com/    396096            866
2  2021-02-08 00:00:00  bigbank.com/    743231           1143
6  2021-02-08 00:00:00  example.com/    562073           1206
9  2021-02-08 00:00:00   google.com/    229929            472
3  2021-02-15 00:00:00  bigbank.com/    327898            747
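As for why Sessions disappears from the original output: a plausible cause (an assumption based on the "110333." value shown in the question, not something I can verify from here) is that the Sessions column is read as text, and sum() silently drops non-numeric columns. Coercing it to numeric before resampling should keep it:

# hedged fix for the question's own code, assuming Sessions was parsed as text
looker_data['Sessions'] = pd.to_numeric(looker_data['Sessions'], errors='coerce')
weekly_data = (looker_data.groupby('URL')
               .resample('W-Mon', label='right', closed='right', on='Date')
               .sum()
               .reset_index()
               .sort_values('Date')
               .loc[:, looker_data.columns])   # restore the original column order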

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all rows for January 1 I want a new column with the price at noon on January 1, for all rows for January 2 the price at noon on January 2, etc.
Existing dataframe, with the desired 12amT column to be filled in:
Date     Time   Last_Price  12amT
1/1/19   08:00  100             ?
1/1/19   08:01  101             ?
1/1/19   08:02  100.50          ?
...
31/1/19  21:00  106             ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this.
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper:
For pandas 0.24.1+:

df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].to_numpy()[0])

For older pandas, use:

df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].values[0])
MVCE:
df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019',periods=(48*60), freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12)&(x.index.minute==0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
I'm not sure why you have two DateTime columns, so I made my own example to demonstrate:
ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price':np.random.random(len(ind)) + 100}, index=ind)
def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df

df.groupby(df.index.day).apply(noon_price).reindex(ind)
The groupby/apply fills every row of a given day with that day's noon price; the final reindex just restores the original chronological order.
To add a column with the next day's noon price, you can shift the column back by 24 rows (24 hourly rows = one day), like this:
df['T+1'] = df.noon_price.shift(-24)
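For the layout actually shown in the question (separate 'Date' and 'Time' string columns rather than a DatetimeIndex), a map-based sketch that avoids the row-by-row loop; it assumes there is exactly one '12:00' row per date:

# take the 12:00 row of each date and map it back onto every row of that date
noon = df.loc[df['Time'] == '12:00'].set_index('Date')['Last_Price']
df['12amT'] = df['Date'].map(noon)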

Iterate over pd df with date column by week python

I have a one month DataFrame with a datetime object column and a bunch of functions I want to apply to it - by week. So I want to loop over the DataFrame and apply the functions to each week. How do I iterate over weekly time periods?
My DataFrame looks like this:
Here is some code that generates random dates:
np.random.seed(123)
n = 500
df = pd.DataFrame(
    {'date': pd.to_datetime(
        pd.DataFrame({'year': np.random.choice(range(2017, 2019), size=n),
                      'month': np.random.choice(range(1, 2), size=n),
                      'day': np.random.choice(range(1, 28), size=n)
                      })
    )}
)
df['random_num'] = np.random.choice(range(0, 1000), size=n)
My week length is inconsistent (sometimes I have 1,000 tweets per week, sometimes 100,000). Could someone please give me an example of how to loop over this dataframe by week? (I don't need aggregation or groupby functions.)
If you really don't want to use groupby and aggregations then:
for week in df['date'].dt.week.unique():
    this_weeks_data = df[df['date'].dt.week == week]
This will, of course, go wrong if you have data from more than one year.
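If the data does span more than one year, a sketch that keys on ISO year and week instead (Series.dt.isocalendar() needs pandas 1.1+; on older versions you can combine dt.year and dt.week the same way):

iso = df['date'].dt.isocalendar()
for (year, week), this_weeks_data in df.groupby([iso['year'], iso['week']]):
    # apply your per-week functions to this_weeks_data here
    print(year, week, len(this_weeks_data))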
Given your sample dataframe
date random_num
0 2017-01-01 214
1 2018-01-19 655
2 2017-01-24 663
3 2017-01-26 723
4 2017-01-01 974
First, you can set the index to the datetime column as follows:
df.set_index(df.date, inplace=True)
df.drop('date', axis=1, inplace=True)
This sets the index to the date column and drops the original column. You will get
>>> df.head()
date random_num
2017-01-01 214
2018-01-19 655
2017-01-24 663
2017-01-26 723
2017-01-01 974
Then you can use the pandas groupby function to group the data as per your frequency and apply any function of your choice.
# To group by week and count the number of occurrences
>>> df.groupby(pd.Grouper(freq='W')).count().head()
date random_num
2017-01-01 11
2017-01-08 65
2017-01-15 55
2017-01-22 66
2017-01-29 45
# To group by week and sum the random numbers per week
>>> df.groupby(pd.Grouper(freq='W')).sum().head()
date random_num
2017-01-01 7132
2017-01-08 33916
2017-01-15 31028
2017-01-22 31509
2017-01-29 22129
You can also apply any generic function myFunction by using the apply method of pandas
df.groupby(pd.Grouper(freq='W')).apply(myFunction)
If you want to apply a function myFunction to any specific column columnName after grouping, you can also do that as follows
df.groupby(pd.Grouper(freq='W'))[columnName].apply(myFunction)
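For example, with the frame above, the weekly mean of random_num could be computed like this (a concrete instance of the same pattern, not part of the original answer):

weekly_mean_random = df.groupby(pd.Grouper(freq='W'))['random_num'].mean()
print(weekly_mean_random.head())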
[SOLVED FOR MULTIPLE YEARS]
pd.Grouper(freq='W') works fine, but I have sometimes come across undesired behaviour in how the weeks are split when they don't divide evenly, which is why I sometimes prefer to do the week split by hand, as shown in this example.
So, given a dataset that spans multiple years:
import numpy as np
import pandas as pd
import datetime
# Create dataset
np.random.seed(123)
n = 100000
date = pd.to_datetime({
    'year': np.random.choice(range(2017, 2020), size=n),
    'month': np.random.choice(range(1, 13), size=n),
    'day': np.random.choice(range(1, 28), size=n)
})
random_num = np.random.choice(range(0, 1000), size=n)
df = pd.DataFrame({'date': date, 'random_num': random_num})
Such as:
print(df.head())
date random_num
0 2019-12-11 413
1 2018-06-08 594
2 2019-08-06 983
3 2019-10-11 73
4 2017-09-19 32
First create a helper index that allows you to iterate by week (considering the year as well):
df['grp_idx'] = df['date'].apply(
    lambda x: '%s-%s' % (x.year, '{:02d}'.format(x.week)))
print(df.head())
date random_num grp_idx
0 2019-12-11 413 2019-50
1 2018-06-08 594 2018-23
2 2019-08-06 983 2019-32
3 2019-10-11 73 2019-41
4 2017-09-19 32 2017-38
Then just apply your function that performs a computation on the weekly subset, something like this:
def something_to_do_by_week(week_data):
    """
    Computes the mean random value.
    """
    return week_data['random_num'].mean()

weekly_mean = df.groupby('grp_idx').apply(something_to_do_by_week)
print(weekly_mean.head())
grp_idx
2017-01 515.875668
2017-02 487.226704
2017-03 503.371681
2017-04 497.717647
2017-05 475.323420
Once you have your weekly metrics, you'll probably want to get back to actual dates, which are more useful than year-week indices:
def from_year_week_to_date(year_week):
    """
    Convert a 'YYYY-WW' label back to an approximate date (year start plus week * 7 days).
    """
    year, week = year_week.split('-')
    year, week = int(year), int(week)
    date = pd.to_datetime('%s-01-01' % year)
    date += datetime.timedelta(days=week * 7)
    return date
weekly_mean.index = [from_year_week_to_date(x) for x in weekly_mean.index]
print(weekly_mean.head())
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
2017-02-05 475.323420
dtype: float64
Finally, you can make nice plots with interpretable dates.
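A minimal plotting sketch (the original answer showed a figure at this point; matplotlib is assumed):

import matplotlib.pyplot as plt

weekly_mean.plot()
plt.ylabel('weekly mean of random_num')
plt.show()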
Just as a sanity check, the computation using pd.Grouper(freq='W') gives me almost the same results (somehow it adds an extra week at the beginning of the pd.Series)
df.set_index('date').groupby(
pd.Grouper(freq='W')
).mean().head()
Out[27]:
random_num
date
2017-01-01 532.736364
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
