I would like to create a loop that creates multiple lists with different names.
I have a dataframe loaded from an Excel file that I am trying to filter by month (ideally the lists should be named 1, 2, 3, etc.).
Each month should produce its own list.
In the end I need to loop through those lists again to compute the average and the count (len).
If you have any questions, let me know.
import pandas as pd
#read data
excel = 'short data.xlsx'
data = pd.read_excel(excel, parse_dates=['Closed Date Time'])
df = pd.DataFrame(data)
# data.info()
#Format / delete time from date column
data['Closed Date Time'] = pd.to_datetime(data['Closed Date Time'])
df['Close_Date'] = data['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
#loop to create multiple lists
times = 12
for i in range(1, times + 1):
    if i <= 9:
        month = df[df['Close_Date'].dt.strftime('%Y-%m') == f'2018-0{i}']
    else:
        month = df[df['Close_Date'].dt.strftime('%Y-%m') == f'2018-{i}']
Creating lists with different names is usually the wrong idea.
You should rather create a single list with sublists (and indexes instead of names), or a single dictionary with the names as keys. Or, even better, create a single DataFrame with all the values (in rows or columns); it will be more useful for the next calculations.
And all of this may not even need a for-loop.
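For example, a minimal sketch of the dictionary idea (it assumes your df with the Close_Date column from the question, and a Price column like in the example data used later in this answer):
# one sub-DataFrame per month, keyed by 'YYYY-MM'
months = {key: sub for key, sub in df.groupby(df['Close_Date'].dt.strftime('%Y-%m'))}

# later: loop over the dict to get the average and the count for every month
for key, sub in months.items():
    print(key, sub['Price'].mean(), len(sub))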
But I think you may do it in a different way. You could create a column with the year and month:
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
And later use groupby() to run a function on every month, without any for-loop:
averages = df.groupby('Year_Month').mean()
sizes = df.groupby('Year_Month').size()
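Note: depending on your pandas version, calling mean() on the whole group may complain about non-numeric columns (like User in the example data below); in that case select the numeric column first, e.g. df.groupby('Year_Month')['Price'].mean(), or pass numeric_only=True.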
Minimal working code with example data:
import pandas as pd
#df = pd.read_excel('short data.xlsx', parse_dates=['Closed Date Time'])
data = {
'Closed Date Time': ['2022.10.25 01:00', '2022.10.24 01:00', '2018.10.25 01:00', '2018.10.24 01:00', '2018.10.23 01:00'],
'Price': [1, 2, 3, 4, 5],
'User': ['A','A','A','B','C'],
}
df = pd.DataFrame(data)
print(df)
df['Closed Date Time'] = pd.to_datetime(df['Closed Date Time'])
df['Close_Date'] = df['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
print(df)
print('\n--- averages ---\n')
averages = df.groupby('Year_Month').mean()
print(averages)
print('\n--- sizes ---\n')
sizes = df.groupby('Year_Month').size()
print(sizes)
Result:
Closed Date Time Price
0 2022.10.25 01:00 1
1 2022.10.24 01:00 2
2 2018.10.25 01:00 3
3 2018.10.24 01:00 4
4 2018.10.23 01:00 5
Closed Date Time Price Close_Date Year_Month
0 2022-10-25 01:00:00 1 2022-10-25 2022-10
1 2022-10-24 01:00:00 2 2022-10-24 2022-10
2 2018-10-25 01:00:00 3 2018-10-25 2018-10
3 2018-10-24 01:00:00 4 2018-10-24 2018-10
4 2018-10-23 01:00:00 5 2018-10-23 2018-10
--- averages ---
Price
Year_Month
2018-10 4.0
2022-10 1.5
--- sizes ---
Year_Month
2018-10 3
2022-10 2
dtype: int64
EDIT:
data = df.groupby('Year_Month').agg({'Price':['mean','size']})
print(data)
Result:
Price
mean size
Year_Month
2018-10 4.0 3
2022-10 1.5 2
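The agg call returns a MultiIndex on the columns (Price over mean and size). If flat column names are more convenient, named aggregation gives them directly; an equivalent sketch:
data = df.groupby('Year_Month').agg(mean=('Price', 'mean'), size=('Price', 'size'))
print(data)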
EDIT:
An example with .groupby() and .apply() to run a more complex function on every group.
Later it uses .to_dict() and .plot()
import pandas as pd
#df = pd.read_excel('short data.xlsx', parse_dates=['Closed Date Time'])
data = {
'Closed Date Time': ['2022.10.25 01:00', '2022.10.24 01:00', '2018.10.25 01:00', '2018.10.24 01:00', '2018.10.23 01:00'],
'Price': [1, 2, 3, 4, 5],
'User': ['A','A','A','B','C'],
}
df = pd.DataFrame(data)
#print(df)
df['Closed Date Time'] = pd.to_datetime(df['Closed Date Time'])
df['Close_Date'] = df['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
#print(df)
def calculate(group):
    #print(group)
    #print(group['Price'].mean())
    #print(group['User'].unique().size)
    result = {
        'Mean': group['Price'].mean(),
        'Users': group['User'].unique().size,
        'Div': group['Price'].mean() / group['User'].unique().size,
    }
    return pd.Series(result)
data = df.groupby('Year_Month').apply(calculate)
print(data)
print('--- dict ---')
print(data.to_dict())
#print(data.to_dict('dict'))
print('--- records ---')
print(data.to_dict('records'))
print('--- list ---')
print(data.to_dict('list'))
print('--- index ---')
print(data.to_dict('index'))
import matplotlib.pyplot as plt
data.plot(kind='bar', rot=0)
plt.show()
Result:
Mean Users Div
Year_Month
2018-10 4.0 3.0 1.333333
2022-10 1.5 1.0 1.500000
--- dict ---
{'Mean': {'2018-10': 4.0, '2022-10': 1.5}, 'Users': {'2018-10': 3.0, '2022-10': 1.0}, 'Div': {'2018-10': 1.3333333333333333, '2022-10': 1.5}}
--- records ---
[{'Mean': 4.0, 'Users': 3.0, 'Div': 1.3333333333333333}, {'Mean': 1.5, 'Users': 1.0, 'Div': 1.5}]
--- list ---
{'Mean': [4.0, 1.5], 'Users': [3.0, 1.0], 'Div': [1.3333333333333333, 1.5]}
--- index ---
{'2018-10': {'Mean': 4.0, 'Users': 3.0, 'Div': 1.3333333333333333}, '2022-10': {'Mean': 1.5, 'Users': 1.0, 'Div': 1.5}}
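If you still want one object per month (what the original loop tried to build), the 'index' orientation above already provides it: data.to_dict('index')['2018-10'] returns {'Mean': 4.0, 'Users': 3.0, 'Div': 1.3333333333333333} for that month.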
For every customer_id I have several start dates and end dates.
When a customer has several overlapping date ranges, I would like to reduce them to one line with the minimum start date and the maximum end date of those overlapping ranges.
Here's my example data frame:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-03-01
1 2020-01-03 2020-05-04
1 2020-01-05 2020-06-01
1 2020-01-07 2020-02-02
1 2020-09-03 2020-09-05
1 2020-09-04 2020-09-04
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
This is what the end result should look like:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-06-01
1 2020-09-03 2020-09-05
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
I've tried the following already, but that didn't really work out:
Find date range overlap in python
Here's sample code that generated these examples:
import pandas as pd
df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-03', '2020-05-04'],
[1, '2020-01-05', '2020-06-01'],
[1, '2020-01-02', '2020-03-01'],
[1, '2020-01-07', '2020-02-02'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-09-04', '2020-09-04'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df = df.sort_values(by=['customer_id', 'start_date', 'end_date'])
expected_df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-02', '2020-06-01'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
expected_df['start_date'] = pd.to_datetime(expected_df['start_date'])
expected_df['end_date'] = pd.to_datetime(expected_df['end_date'])
expected_df = expected_df.sort_values(by=['customer_id', 'start_date', 'end_date'])
Henry Ecker pointed me in the right direction by treating this problem as a graph:
Pandas combining rows based on dates
The code only needed a very small bit of rewriting to get the right answer:
from scipy.sparse.csgraph import connected_components
def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.start_date.values
    end = data.end_date.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})
df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')
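The boolean matrix marks every pair of ranges that overlap, and connected_components then groups chains of transitively overlapping ranges, so taking the minimum start and maximum end per group yields the merged ranges. As a rough sanity check (a sketch, assuming the df and expected_df defined in the question above, and that both end up with the same dtypes and row order after sorting):
result = (df.groupby(['customer_id'])
            .apply(reductionFunction)
            .reset_index('customer_id')
            .reset_index(drop=True))

# compare against the expected frame after sorting both the same way
pd.testing.assert_frame_equal(
    result.sort_values(['customer_id', 'start_date']).reset_index(drop=True),
    expected_df.sort_values(['customer_id', 'start_date']).reset_index(drop=True),
)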
I have a start date and an end date, and I would like to have the date range between start and end restricted to a specific day (e.g. the 10th day of every month).
Example:
start_date = '2020-01-03'
end_date = '2020-10-19'
wanted_result = ['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',...,'2020-10-10', '2020-10-19']
I currently have a solution which creates all the dates between start_date and end_date and then subsamples only the dates on the 10th, but I do not like it; I think it is too cumbersome. Any ideas?
import pandas as pd
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[[0]].append(dates[dates.day == querydate])
If you also need the first and last values, add Index.isin with the first and last values; this keeps all values unique, with no duplicates if the first or last day is the 10th:
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19'],
dtype='datetime64[ns]', freq=None)
If you need a list:
print (list(dates.strftime('%Y-%m-%d')))
['2020-01-03', '2020-01-10', '2020-02-10', '2020-03-10',
'2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10',
'2020-08-10', '2020-09-10', '2020-10-10', '2020-10-19']
Changed sample data:
start_date = '2020-01-10'
end_date = '2020-10-10'
querydate = 10
dates = pd.date_range(start=start_date, end=end_date)
dates = dates[dates.isin(dates[[0,-1]]) | (dates.day == querydate)]
print (dates)
DatetimeIndex(['2020-01-10', '2020-02-10', '2020-03-10', '2020-04-10',
'2020-05-10', '2020-06-10', '2020-07-10', '2020-08-10',
'2020-09-10', '2020-10-10'],
dtype='datetime64[ns]', freq=None)
Try this:
dates = pd.Series(
    [pd.to_datetime(start_date)]
    + [i for i in pd.date_range(start=start_date, end=end_date) if i.day == 10]
    + [pd.to_datetime(end_date)]
).drop_duplicates()
print(dates)
Output:
0 2020-01-03
1 2020-01-10
2 2020-02-10
3 2020-03-10
4 2020-04-10
5 2020-05-10
6 2020-06-10
7 2020-07-10
8 2020-08-10
9 2020-09-10
10 2020-10-10
11 2020-10-19
dtype: datetime64[ns]
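The trailing drop_duplicates plays the same role as the Index.isin trick above: it removes the duplicate entry that would otherwise appear when start_date or end_date already falls on the 10th.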
I have a DataFrame like below:
rng = pd.date_range('2020-12-11', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'status': ['active', 'active', 'finished', 'finished', 'active'] })
And I need to create 2 new columns in this DataFrame:
New1 = number of days from the "Date" column until today, for status 'active'
New2 = number of days from the "Date" column until today, for status 'finished'
Use Series.rsub to subtract from the right side, with today as a Timestamp floored to the day with Timestamp.floor. Then convert the timedeltas to days with Series.dt.days and assign the new columns by condition with Series.where:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng,
'status': ['active', 'active', 'finished', 'finished', 'active'] })
days = df['Date'].rsub(pd.Timestamp('now').floor('d')).dt.days
df['New1'] = days.where(df['status'].eq('active'))
df['New2'] = days.where(df['status'].eq('finished'))
print (df)
Date status New1 New2
0 2020-12-01 active 13.0 NaN
1 2020-12-02 active 12.0 NaN
2 2020-12-03 finished NaN 11.0
3 2020-12-04 finished NaN 10.0
4 2020-12-05 active 9.0 NaN
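Note that Series.where fills the non-matching rows with NaN, which is why New1 and New2 come out as floats. If whole numbers are preferred, the nullable integer dtype can be used, e.g. df['New1'] = days.where(df['status'].eq('active')).astype('Int64').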
I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the column date, then use pd.date_range to generate the dates with a one-year offset frequency and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
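If actual Timestamps are needed rather than strings, drop the .strftime('%Y-%m-%d') call and keep the resulting DatetimeIndex (or call .tolist() on it to get a list of Timestamp objects).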
Here you go:
import pandas as pd
# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF and populating it with k years
years_df = pd.DataFrame()
for i in range(1, k+1):
    row = {'dates': [str(year1 + i) + '-01-01']}
    years_df = years_df.append(pd.DataFrame(row))
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
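Note that DataFrame.append was removed in pandas 2.0. On current versions the same frame can be built without the loop, for example (a sketch using the k and year1 defined above):
years_df = pd.DataFrame({'dates': [f'{year1 + i}-01-01' for i in range(1, k + 1)]})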
I have two dataframes that I want to merge on the date, id and time variables in order to compute a duration.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'time_1': ['08:00:00', '09:00:00', '10:30:00', '12:45']})
df1
id date time
0 a 02-02-2015 08:00:00
1 a 02-02-2015 09:00:00
2 a 02-02-2015 10:30:00
3 a 02-02-2015 12:45:00
-------------------------------------------------------------------------------------------------
df2 = pd.DataFrame({
    'id': ['a']*7,
    'date': ['02-02-2015']*7,
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38']})
df2
id date time
0 a 02-02-2015 08:00:00
1 a 02-02-2015 08:09:00
2 a 02-02-2015 08:04:01
3 a 02-02-2015 08:52:36
4 a 02-02-2015 09:00:00
5 a 02-02-2015 10:30:00
6 a 02-02-2015 11:23:38
The rule that I want my merge to follow is that each row in df2 needs to go with the closest previous time in df1.
The intermediate result would be
intermediateResult = pd.DataFrame({
    'id': ['a']*8,
    'date': ['02-02-2015']*8,
    'time_1': ['08:00:00', '08:00:00', '08:00:00', '08:00:00', '09:00:00', '10:30:00', '10:30:00', '12:45'],
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38', np.nan]})
intermediateResult
id date time_1 time_2
0 a 02-02-2015 08:00:00 08:00:00
1 a 02-02-2015 08:00:00 08:09:00
2 a 02-02-2015 08:00:00 08:04:01
3 a 02-02-2015 08:00:00 08:52:36 # end
4 a 02-02-2015 09:00:00 09:34:25 # end
5 a 02-02-2015 10:30:00 10:30:00
6 a 02-02-2015 10:30:00 11:23:38 # end
7 a 02-02-2015 12:45 NaN
Finally, I want to get the time difference between the latest time_2 of each period (indicated with the comment # end) and its corresponding time_1.
The final result would look like this
finalResult = pd.DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'Duration': ['00:52:36', '00:34:25', '00:53:38', np.nan]})
finalResult
id date Duration
0 a 02-02-2015 00:52:36
1 a 02-02-2015 00:34:25
2 a 02-02-2015 00:53:38
3 a 02-02-2015 NaN
I tried different merge methods and came to the same answer; eventually I used merge_asof with direction='backward' as per your request. Unfortunately it is not exactly like yours, in the sense that I have no NaN. Happy to help further if you give more information on how you end up with NaN in one row.
# Join date to time and coerce to datetime
df1['datetime'] = pd.to_datetime(df1.date.str.cat(df1.time_1, sep=' '))
df2['datetime'] = pd.to_datetime(df2.date.str.cat(df2.time_2, sep=' '))
df2['time_2'] = df2['time_2'].apply(lambda x: (x[-5:]))  # strip hours from time_2; I anticipate using it as a duration
# Sort to allow merge_asof
df1 = df1.sort_values('datetime')
df2 = df2.sort_values('datetime')
# Merge the dataframes, joining on datetime to the nearest hour
df3 = pd.merge_asof(df2, df1, on='datetime', by='id', tolerance=pd.Timedelta('2H'),
                    allow_exact_matches=True, direction='backward').dropna()
#df3 = df2.merge(df1, left_on=df2.datetime.dt.hour, right_on=df1.datetime.dt.hour, how='left').drop(columns=['key_0', 'id_y', 'date_y']).fillna(0)  # alternative merge
df3.set_index('datetime', inplace=True)  # set datetime as index
df3['minutes'] = df3.index.minute  # extract the minute of each row; it looks like you want the highest minute in each hour
# Group by hour; idxmax selects the index with the highest minutes in each hour, then unwanted rows and columns are dropped
finalResult = df3.loc[df3.groupby([df3.index.hour, df3.date_x])['minutes'].idxmax()].reset_index().drop(columns=['datetime', 'time_1', 'date_y', 'minutes'])
finalResult.columns = ['id', 'date', 'Duration(min)']
finalResult
Using the solution suggested by @wwnde, I've found one that scales better to my real data set:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'time_1': ['08:00:00', '09:00:00', '10:30:00', '12:45:00']
})

df2 = pd.DataFrame({
    'id': ['a']*7,
    'date': ['02-02-2015',
             '02-02-2015',
             '03-02-2015',  # small change here relative to the df in my first post
             '02-02-2015',
             '02-02-2015',
             '02-02-2015',
             '02-02-2015'],
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38']
})
----------------------------------------------------
def preproDf(df1, df2, time_1, time_2, _id, date):
    '''
    Preprocess the dataframes for the following operations
    df1: pd.DataFrame, left dataframe
    df2: pd.DataFrame, right dataframe
    time_1: str, name of the time column in the left dataframe
    time_2: str, name of the time column in the right dataframe
    _id: str, name of the id variable. Should be the same for both dataframes
    date: str, name of the date variable. Should be the same for both dataframes
    return: None (both dataframes are modified in place)
    '''
    df2[time_2] = df2[time_2].apply(pd.to_datetime)
    df1[time_1] = df1[time_1].apply(pd.to_datetime)
    # sort in place to allow merge_asof (a plain assignment here would only rebind the local names)
    df1.sort_values([_id, date, time_1], inplace=True)
    df2.sort_values([_id, date, time_2], inplace=True)
def processDF(df1, df2, time_1, time_2, _id, date):
    # initialisation: merge the first group so rslt starts with the right columns
    groupKeys = list(df2.groupby([_id, date]).groups.keys())
    dfGroup = groupKeys[0]
    group = df2.groupby([_id, date]).get_group(dfGroup)
    rslt = pd.merge_asof(group, df1, left_on=time_2, right_on=time_1, by=[_id, date],
                         tolerance=pd.Timedelta('2H'), allow_exact_matches=True,
                         direction='backward')  # .dropna()
    # for loop to stack the remaining groups into an array
    for group in groupKeys[1:]:  # iteration starts at the second element
        group = df2.groupby([_id, date]).get_group(group)
        item = pd.merge_asof(group, df1, left_on=time_2, right_on=time_1, by=[_id, date],
                             tolerance=pd.Timedelta('2H'), allow_exact_matches=True,
                             direction='backward')  # .dropna()
        rslt = np.vstack((rslt, item))
    rslt = pd.DataFrame(rslt, columns=item.columns)
    # creating the timeDifference variable
    rslt['timeDifference'] = rslt[time_2] - rslt[time_1]
    # getting the actual result: keep the largest difference per (_id, date, time_1) period
    rslt = rslt.groupby([_id, date, time_1]).timeDifference.max()
    rslt = pd.DataFrame(rslt).reset_index()
    rslt.rename({time_1: 'openTime'}, axis='columns')  # note: not assigned back, so the column keeps its original name
    return rslt
The result:
preproDf(df1, df2, 'time_1', 'time_2', 'id', 'date')
processDF(df1, df2, 'time_1', 'time_2', 'id', 'date')
id date time_1 screenOnDuration
0 a 02-02-2015 2020-05-29 08:00:00 00:52:36
1 a 02-02-2015 2020-05-29 09:00:00 00:34:25
2 a 02-02-2015 2020-05-29 10:30:00 00:53:38
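A possible simplification (a sketch, not tested against the real data set): merge_asof already supports the by argument, so the per-group loop in processDF can usually be replaced by a single call on fully sorted frames, assuming preproDf has already converted time_1 and time_2 to datetimes:
rslt = pd.merge_asof(
    df2.sort_values('time_2'), df1.sort_values('time_1'),
    left_on='time_2', right_on='time_1', by=['id', 'date'],
    tolerance=pd.Timedelta('2H'), allow_exact_matches=True, direction='backward',
)
rslt['timeDifference'] = rslt['time_2'] - rslt['time_1']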