I want to take a CSV that shows daily data and create a new sheet that includes the same data in a weekly view.
Currently I have:
#import required libraries
import pandas as pd
from datetime import datetime
#read the daily data file
looker_data = pd.read_csv("jan21_looker_upload.csv")
#convert date column into datetime object
looker_data['Date'] = looker_data['Date'].astype('datetime64[ns]')
#convert daily data to weekly
weekly_data = looker_data.groupby("URL").resample('W-Mon', label='right', closed='right', on='Date').sum().reset_index().sort_values(by='Date')
##Export in Excel
weekly_data.to_excel("jan21-looker.xlsx")
The code works but removes specific data points that I would like to keep in the new view. For reference, the existing CSV looks something like this:
Date | URL | Sessions | Conversions
1/14/21 | example.com/ | 110333 | 330
But when I run the code I get:
URL | Date | Conversions
example.com/ | 1/14/21 | 330
Is there something I am missing that will make the output include all the data in the weekly view? All help is appreciated!
Here is synthesized data like the sample you note, resampled the same way; a final .loc[] puts the column order back in place. (If your Sessions column is being dropped, make sure it is read in as numeric: .sum() silently excludes non-numeric columns.)
import datetime as dt
import numpy as np
import pandas as pd

# synthesize daily data like the sample in the question
d = pd.date_range(dt.date(2021, 1, 20), "today")
df = pd.DataFrame({
    "Date": d,
    "URL": np.random.choice(["example.com/", "google.com/", "bigbank.com/"], len(d)),
    "Sessions": np.random.randint(3000, 300000, len(d)),
    "Conversions": np.random.randint(200, 500, len(d)),
})
# resample to weekly, then clean up: index back to columns,
# rows ordered by Date, columns back in the original order via loc[]
dfw = (df.groupby("URL").resample('W-Mon', label='right', closed='right', on='Date').sum()
       .reset_index().sort_values("Date").loc[:, df.columns])
   Date        URL           Sessions  Conversions
0  2021-01-25  bigbank.com/    187643          226
4  2021-01-25  example.com/    454543          966
7  2021-01-25  google.com/     143307          574
1  2021-02-01  bigbank.com/    335614          904
5  2021-02-01  example.com/    260055          623
8  2021-02-01  google.com/     396096          866
2  2021-02-08  bigbank.com/    743231         1143
6  2021-02-08  example.com/    562073         1206
9  2021-02-08  google.com/     229929          472
3  2021-02-15  bigbank.com/    327898          747
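To tie this back to the original code: if Sessions is being read in as text rather than numbers (a common reason a column vanishes from a .sum()), a hedged fix is to coerce it to numeric before resampling; the export then keeps all columns:

# hedged sketch: coerce Sessions to numeric so .sum() keeps the column
looker_data['Sessions'] = pd.to_numeric(looker_data['Sessions'], errors='coerce')
weekly_data = (looker_data.groupby("URL")
               .resample('W-Mon', label='right', closed='right', on='Date')
               .sum().reset_index().sort_values('Date')
               .loc[:, looker_data.columns])
weekly_data.to_excel("jan21-looker.xlsx", index=False)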
The following code produces a list of dates that exclude 6/18/2021 & 12/31/2022:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import pandas as pd

holidayList = ['2019-04-19', '2020-04-10', '2021-01-18', '2019-01-21', '2022-01-17', '2022-04-15', '2019-03-20', '2020-04-30']
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar(), holidays=holidayList)
oneMonth = pd.read_csv("1M SOFR.csv", names=["1 Month", "trash"])
oneMonthDates = pd.DataFrame({'Date': pd.date_range(start='01/03/2019', end='10/11/2022', freq=us_bd)})
oneMonth = oneMonth.drop("trash", axis=1)
oneM = pd.concat([oneMonthDates, oneMonth], axis=1)
print(oneM)
I understand that 6/18/2021 & 12/31/2022 are excluded because of USFederalHolidayCalendar(), but I would like to add them back, since those specific days were not actually observed.
I've considered using pd.concat() to append the days I want onto the column after the initial generation, but I get the following when I do:
Date
0 2019-01-03 00:00:00
1 2019-01-04 00:00:00
2 2019-01-07 00:00:00
3 2019-01-08 00:00:00
4 2019-01-09 00:00:00
.. ...
938 2022-10-06 00:00:00
939 2022-10-07 00:00:00
940 2022-10-11 00:00:00
941 2022-12-31
942 2021-6-18
(I do not want the '00:00:00' with the dates)
Any help would be appreciated; I've been stumped for a while, so this is my last resort.
Thanks.
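A possible sketch (not from the original thread; it reuses oneMonthDates from the question): append the two skipped dates with pd.concat(), sort, and convert the column to plain dates so the 00:00:00 component disappears:

import pandas as pd

# the two days that were skipped but actually observed
extra = pd.DataFrame({'Date': pd.to_datetime(['2021-06-18', '2022-12-31'])})
oneMonthDates = (pd.concat([oneMonthDates, extra], ignore_index=True)
                 .sort_values('Date')
                 .reset_index(drop=True))
# datetime.date objects print without a time component
oneMonthDates['Date'] = oneMonthDates['Date'].dt.date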
I have several years of hourly data, with start and end dates in datetime format. I now want to make a new column for each year in the data, in which the mean across columns A to Z is calculated and stored for each hour. All years should have the same format, so the leap year is ignored. To summarize, I have the following data:
input_data:
datetime | A | B | C | D | ... | Z |
---------------------|---|---|---|---| --- |---|
2015-01-01 00:00:00 |123| 23| 67|189| ... | 78|
................... |...|...|...|...| ... |...|
2021-06-01 00:00:00 |345| 87|456| 89| ... | 23|
where 2015-01-01 00:00:00 is the start date and 2021-06-01 08:00:00 is the end date. I would like to get something like:
output:
datetime | 2015 | 2016 | 2017| 2018 | ... | 2021 |
----------------|---------|---------|---------|-----------|-----|----------|
01-01 00:00:00 |mean(A:Z)| mean(A:Z)| mean(A:Z)|mean(A:Z)| ... | mean(A:Z)|
................|.........|..........|..........|.........| ... |..........|
12-31 23:00:00 |mean(A:Z)| mean(A:Z)|mean(A:Z)| mean(A:Z)| ... | mean(A:Z)|
where mean(A:Z) is the mean of columns A to Z for that hour. I would like to avoid iterating over each hour of each year. How can I best achieve this? Sorry if the question is too simple, but I am currently stuck.
IIUC, you can use:
# Update
out = (df.assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S'),
                 year=df['datetime'].dt.year.values)
         .set_index(['datetime', 'year']).mean(axis=1)
         .unstack('year'))
print(out)
# Alternative
# out = (df.set_index('datetime').mean(axis=1).to_frame('mean')
#          .assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S').values,
#                  year=df['datetime'].dt.year.values)
#          .pivot(index='datetime', columns='year', values='mean'))
# Output
year 2015 2016 2017
datetime
01-01 00:00:00 259.000000 420.000000 263.333333
01-01 01:00:00 263.000000 205.333333 169.000000
01-01 02:00:00 342.000000 268.000000 302.000000
01-01 03:00:00 63.000000 243.000000 220.000000
01-01 04:00:00 299.333333 282.666667 421.666667
... ... ... ...
12-31 19:00:00 82.666667 215.000000 84.333333
12-31 20:00:00 316.000000 367.000000 237.666667
12-31 21:00:00 319.666667 170.666667 275.666667
12-31 22:00:00 119.666667 263.666667 325.333333
12-31 23:00:00 252.666667 300.000000 94.666667
[8784 rows x 3 columns]
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2015-01-01', '2017-12-31 23:00:00', freq='H', name='datetime')
df = pd.DataFrame(np.random.randint(1, 500, (len(dti), 3)),
                  index=dti, columns=list('ABC')).reset_index()
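One loose end: the question asked to ignore the leap year, and the 8784-row output above still contains Feb 29. A hedged one-liner to drop those rows (the index here is the '%m-%d %H:%M:%S' strings built above):

# drop leap-day rows so every year has the same 8760 hourly slots
out = out[~out.index.str.startswith('02-29')]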
I would start by creating a new column for the year in the original data frame:
input_data['year'] = input_data['datetime'].dt.year
Then I would use the groupby method with a for loop to calculate the means, as follows:
output = pd.DataFrame()
output['datetime'] = input_data['datetime']
for name, group in input_data.groupby(['year']):
    # drop the grouping and timestamp columns, then average the rest row-wise
    group = group.drop(['year', 'datetime'], axis=1)
    output[name] = group.mean(axis=1).reset_index(drop=True)
[Output shown as an image in the original answer]
That being said, I am making an assumption here, based on your question, that the leap year is to be ignored and that all years have the same format and number of samples. If you have any further questions, or the years don't have the same number of samples, please tell me.
I have a csv like this:
date,asin,ordered,forecast
2020-05-31,AAAAAA,300,1000
2020-05-31,BBBBBB,500,2000
...
2020-06-28,AAAAAA,980,1500
2020-06-28,BBBBBB,1900,2500
I want to find the date + 28 days and add a new column with the corresponding ordered value. For example, adding 28 days to 2020-05-31 gives 2020-06-28. So for date = 2020-05-31 and asin = AAAAAA, a new column new will hold the number 980 (from the ordered column), which corresponds to the same asin on that later date (2020-06-28):
date,asin,ordered,forecast,new
2020-05-31,AAAAAA,300,1000,980
2020-05-31,BBBBBB,500,2000,1900
...
2020-06-28,AAAAAA,980,1500, <this value will look for the date 28 days after 2020-06-28 and asin AAAAAA and get that ordered value>
2020-06-28,BBBBBB,1900,2500 <this value will look for the date 28 days after 2020-06-28 and asin BBBBBB and get that ordered value>
So far I have gotten the +28 days part by doing df['date'] + pd.DateOffset(days=28) but I don't know how to search for the new date and asin elsewhere in the dataframe and bring in the ordered value to the current row.
Try this. I am using the 4 rows of the dataframe that are visible in your question.
First we add the new date:
from datetime import timedelta

df['date'] = pd.to_datetime(df['date'])  # make sure the column is datetime
df['new_date'] = df['date'] + timedelta(days=28)
Then we merge df with itself on the columns indicated in the command. This matches new_date up to date (for each asin separately) and brings the corresponding ordered value into the right row. Then I clean it up a bit; you may want to run this step by step to see what's going on:
(df.merge(df[['date', 'asin', 'ordered']],
          left_on=['new_date', 'asin'], right_on=['date', 'asin'],
          how='left', suffixes=('', '_new'))
   .drop(columns=['date_new'])
   .rename(columns={'ordered_new': 'new'})
)
Output:
date asin ordered forecast new_date new
-- ------------------- ------ --------- ---------- ------------------- -----
0 2020-05-31 00:00:00 AAAAAA 300 1000 2020-06-28 00:00:00 980
1 2020-05-31 00:00:00 BBBBBB 500 2000 2020-06-28 00:00:00 1900
2 2020-06-28 00:00:00 AAAAAA 980 1500 2020-07-26 00:00:00 nan
3 2020-06-28 00:00:00 BBBBBB 1900 2500 2020-07-26 00:00:00 nan
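A hedged aside on the design choice: the self-merge works even when the dates are irregular. If every asin's rows are known to be spaced exactly 28 days apart, a per-group shift would be a simpler alternative:

# only valid if consecutive rows per asin are exactly 28 days apart
df = df.sort_values(['asin', 'date'])
df['new'] = df.groupby('asin')['ordered'].shift(-1)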
I have a folder on my computer that contains ~8500 .csv files named after various stock tickers. Each .csv file has a 'timestamp' and a 'users_holding' column. The 'timestamp' column is set up as a datetime index, with hourly entries for each day, e.g. 2019-12-01 01:50, 2020-01-01 02:55, 2020-01-01 01:45, etc. Each timestamp has a corresponding integer representing the number of users holding at that time.
I want to create a for loop that iterates through all of the .csv files and tallies up the total users holding across all files, taking the latest entry of every day from February 1st, 2020 (2020-02-01) until the last day in each .csv file. The folder updates daily, so I can't really have an end date.
This is the for loop I have set up to establish each ticker as a dataframe:
import glob
import pandas as pd

path = 'C:\\Users\\N****\\Desktop\\r******\\t**\\p*********\\'
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    df = pd.read_csv(filename, header=0, parse_dates=['timestamp'], index_col='timestamp')
If anyone could show me how to write the for loop that finds the latest entry for each date and tallies up that number for each day, that would be amazing.
Thank you!
First, create a data frame with a Datetime index (in one-hour steps):
import numpy as np
import pandas as pd
idx = pd.date_range(start='2020-01-01', end='2020-01-31', freq='H')
data = np.arange(len(idx) * 3).reshape(len(idx), 3)
columns = ['ticker-1', 'ticker-2', 'ticker-3']
df = pd.DataFrame(data=data, index=idx, columns=columns)
print(df.head())
ticker-1 ticker-2 ticker-3
2020-01-01 00:00:00 0 1 2
2020-01-01 01:00:00 3 4 5
2020-01-01 02:00:00 6 7 8
2020-01-01 03:00:00 9 10 11
2020-01-01 04:00:00 12 13 14
Then, group by the index, keeping year-month-day but dropping hours-minutes-seconds. The aggregation function is .last():
result = (df.groupby(by=df.index.strftime('%Y-%m-%d'))
            [['ticker-1', 'ticker-2', 'ticker-3']]
            .last())
print(result.head())
ticker-1 ticker-2 ticker-3
2020-01-01 69 70 71
2020-01-02 141 142 143
2020-01-03 213 214 215
2020-01-04 285 286 287
2020-01-05 357 358 359
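To connect this back to the question's loop over ~8500 files, here is a hedged sketch (assuming each file really has 'timestamp' and 'users_holding' columns as described) that takes the last reading per day from 2020-02-01 onward and sums across files:

import glob
import pandas as pd

total = None
for filename in all_files:  # all_files from the question's glob call
    df = (pd.read_csv(filename, parse_dates=['timestamp'], index_col='timestamp')
          .sort_index())
    # last entry of each calendar day, from 2020-02-01 onward
    daily_last = df.loc['2020-02-01':, 'users_holding'].resample('D').last().dropna()
    # running per-day total across all tickers
    total = daily_last if total is None else total.add(daily_last, fill_value=0)
print(total)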
I have a column of dates with 2 million rows. The format is 'Year-Month-Day', e.g. '2019-11-28'. Each time I load the document I have to convert the column (which takes a long time) by doing:
pd.to_datetime(df['old_date'])
I would like to rearrange the order to 'Month-Day-Year' so that I wouldn't have to change the format of the column each time I load it.
I tried doing:
df_1['new_date'] = df_1['old_date'].dt.month+'-'+df_1['old_date'].dt.day+'-'+df_1['old_date'].dt.year
But I received the following error: 'unknown type str32'
Could anyone help me?
Thanks!
You could use pandas.Series.dt.strftime (documentation) to change the format of your dates. In the code below I create a column with your old date format and add a new column using this method (note the new column holds strings, not datetimes):
import pandas as pd
df = pd.DataFrame({'old format': pd.date_range(start = '2020-01-01', end = '2020-06-30', freq = 'd')})
df['new format'] = df['old format'].dt.strftime('%m-%d-%Y')
Output:
old format new format
0 2020-01-01 01-01-2020
1 2020-01-02 01-02-2020
2 2020-01-03 01-03-2020
3 2020-01-04 01-04-2020
4 2020-01-05 01-05-2020
5 2020-01-06 01-06-2020
6 2020-01-07 01-07-2020
7 2020-01-08 01-08-2020
8 2020-01-09 01-09-2020
9 2020-01-10 01-10-2020
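A hedged aside on the original complaint (the slow conversion on every load): parsing the column while reading avoids the separate pd.to_datetime() step; 'data.csv' here is a placeholder filename:

# 'data.csv' is a placeholder; parse_dates converts the column at read time
df = pd.read_csv('data.csv', parse_dates=['old_date'])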