I have a few pandas Series with a PeriodIndex of varying frequency. I'd like to filter these based on another PeriodIndex whose frequency is in principle unknown (specified directly in the example below as selectionA or selectionB, but in practice taken from another series).
I've found 3 approaches, each with its own downside, shown in the example below. Is there a better way?
import numpy as np
import pandas as pd
y = pd.Series(np.random.random(4), index=pd.period_range('2018', '2021', freq='A'), name='speed')
q = pd.Series(np.random.random(16), index=pd.period_range('2018Q1', '2021Q4', freq='Q'), name='speed')
m = pd.Series(np.random.random(48), index=pd.period_range('2018-01', '2021-12', freq='M'), name='speed')
selectionA = pd.period_range('2018Q3', '2020Q2', freq='Q') #subset of y, q, and m
selectionB = pd.period_range('2014Q3', '2015Q2', freq='Q') #not subset of y, q, and m
#Comparing some options:
#1: filter method
#2: slicing
#3: selection based on boolean comparison
#1: problem when frequencies unequal: always returns empty series
yA_1 = y.filter(selectionA, axis=0) #Fail: empty series
qA_1 = q.filter(selectionA, axis=0)
mA_1 = m.filter(selectionA, axis=0) #Fail: empty series
yB_1 = y.filter(selectionB, axis=0)
qB_1 = q.filter(selectionB, axis=0)
mB_1 = m.filter(selectionB, axis=0)
#2: problem when frequencies unequal: wrong selection and error instead of empty result
yA_2 = y[selectionA[0]:selectionA[-1]]
qA_2 = q[selectionA[0]:selectionA[-1]]
mA_2 = m[selectionA[0]:selectionA[-1]] #Fail: selects 22 months instead of 24
yB_2 = y[selectionB[0]:selectionB[-1]] #Fail: error
qB_2 = q[selectionB[0]:selectionB[-1]]
mB_2 = m[selectionB[0]:selectionB[-1]] #Fail: error
#3: works, but very verbose
yA_3 = y[(y.index >= selectionA[0].start_time) & (y.index <= selectionA[-1].end_time)]
qA_3 = q[(q.index >= selectionA[0].start_time) & (q.index <= selectionA[-1].end_time)]
mA_3 = m[(m.index >= selectionA[0].start_time) & (m.index <= selectionA[-1].end_time)]
yB_3 = y[(y.index >= selectionB[0].start_time) & (y.index <= selectionB[-1].end_time)]
qB_3 = q[(q.index >= selectionB[0].start_time) & (q.index <= selectionB[-1].end_time)]
mB_3 = m[(m.index >= selectionB[0].start_time) & (m.index <= selectionB[-1].end_time)]
Many thanks
I've solved it by adding start_time and end_time to the slice range:
yA_2fixed = y[selectionA[0].start_time: selectionA[-1].end_time]
qA_2fixed = q[selectionA[0].start_time: selectionA[-1].end_time]
mA_2fixed = m[selectionA[0].start_time: selectionA[-1].end_time] #now has 24 rows
yB_2fixed = y[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
qB_2fixed = q[selectionB[0].start_time: selectionB[-1].end_time]
mB_2fixed = m[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
But if there's a more concise way to write this, I'm still all ears. I especially would like to know if it's possible to do this filtering in a way that is more 'native' to the PeriodIndex, i.e., not converting it into datetime instances first with the start_time and end_time attributes.
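One direction that seems more PeriodIndex-native (a sketch on my part, not a verified pandas idiom): map the selection's end points onto each series' own frequency with Period.asfreq, and slice with Periods rather than Timestamps.
# Sketch: assumes each series has a sorted PeriodIndex; asfreq(freq, how='start'/'end')
# maps the first/last selected period onto the series' own frequency.
def select_periods(series, selection):
    freq = series.index.freqstr                     # e.g. 'A-DEC', 'Q-DEC', 'M'
    start = selection[0].asfreq(freq, how='start')  # first overlapping period
    end = selection[-1].asfreq(freq, how='end')     # last overlapping period
    return series[start:end]

mA_4 = select_periods(m, selectionA)  # expected: the same 24 rows as mA_2fixed
yB_4 = select_periods(y, selectionB)  # expected: empty, like yB_2fixed
Note that for series coarser than the selection (like y here) this keeps any period that merely overlaps the selection window, so the edges can differ from the start_time/end_time slices above.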
Related data: https://github.com/zero-jack/data/blob/main/hy_data.csv#L7
Goal
Get the idxmax from the last n rows for each group.
Try
df = df.assign(
    l6d_highest_date=lambda x: x.groupby('hy_code')['high'].transform(lambda x: x.rolling(6).idxmax())
)
AttributeError: 'Rolling' object has no attribute 'idxmax'
Note: week_date is the index.
My solution is based on converting the argmax computed on each sliding window: for each date, this information lets you infer which date that window's maximum refers to.
import numpy as np
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/zero-jack/data/main/hy_data.csv",
    sep=",", index_col="week_date"
)
def rolling_idmax(series, n):
    # first compute the index within each sliding window
    ids = series.rolling(n).apply(np.argmax)
    # 0 <= ids <= n-1
    # how many rows have passed since the sliding-window maximum?
    ids = n - 1 - ids
    # 0 <= ids <= n-1
    # subtract `ids` from the actual positions
    ids = np.arange(len(series)) - ids
    # 0 <= ids <= len(series)-1
    # convert the positions stored in `ids` to the corresponding dates (series.index)
    ids.loc[~ids.isna()] = series.index[ids.dropna().astype(int)]
    # "2005-06-10" <= ids <= "2022-03-04"
    return ids
df["l6d_highest_date"] = df.groupby("hy_code").high.apply(rolling_idmax, 6)
Based on this answer, I got the following workaround. Note that the linked answer can only handle series with the default index; I add x.index[global_index] to deal with a non-default index.
window_size = 6

def get_idxmax_in_rolling(x: pd.Series):
    local_index = x.rolling(window_size).apply(np.argmax)[window_size-1:].astype(int)  # local index, removed nan before astype()
    global_index = local_index + np.arange(len(x) - window_size + 1)
    # return list(x.index[global_index]) + [np.nan]*(window_size-1)
    return [np.nan]*(window_size-1) + list(x.index[global_index])  # add nan back
df = df.assign(l6d_highest_date=lambda x: x.groupby('hy_code')['high'].transform(get_idxmax_in_rolling))
You can apply idxmax (for older versions of pandas before 1.0.0 you need to pass raw=False). The only caveat is that rolling must return a float (see linked docs), not a Timestamp. That's why you need to temporarily reset the index, get the idxmax values and the corresponding week_dates, and then restore the index:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/zero-jack/data/main/hy_data.csv', index_col='week_date', parse_dates=True)
df = df.reset_index()
df['l6d_highest_date'] = df.groupby('hy_code')['high'].transform(lambda x: x.rolling(6).apply(pd.Series.idxmax))
df.loc[df.l6d_highest_date.notna(), 'l6d_highest_date'] = df.loc[df.loc[df.l6d_highest_date.notna(), 'l6d_highest_date'].values, 'week_date'].values
df = df.set_index('week_date')
I have the below script, which aims to create a "merge based on a partial match" functionality, since this is not possible with the normal .merge() function to the best of my knowledge.
The below works / returns the desired result, but unfortunately, it's incredibly slow to the point that it's almost unusable where I need it.
I've been looking around at other Stack Overflow posts covering similar problems, but haven't yet been able to find a faster solution.
Any thoughts on how this could be accomplished would be appreciated!
import pandas as pd
df1 = pd.DataFrame(
    ['https://wwww.example.com/hi', 'https://wwww.example.com/tri',
     'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi'],
    columns=['pages']
)
df2 = pd.DataFrame(['hi', 'bi', 'geo'], columns=['ngrams'])
def join_on_partial_match(full_values=None, matching_criteria=None):
    # Renaming the first column of each frame by position
    full_values.columns.values[0] = "full"
    matching_criteria.columns.values[0] = "ngram_match"
    # Creating a matching column so all rows match on the join (cross join)
    full_values['join'] = 1
    matching_criteria['join'] = 1
    dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)
    # Dropping the 'join' column we created to join the 2 tables
    matching_criteria = matching_criteria.drop('join', axis=1)
    # Identifying matches and returning bool values based on whether a match exists
    dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)
    # Filtering the dataset to only 'True' rows
    final = dfFull[dfFull['match'] == True]
    final = final.drop('match', axis=1)
    return final
join = join_on_partial_match(full_values=df1,matching_criteria=df2)
print(join)
>> full ngram_match
0 https://wwww.example.com/hi hi
7 https://wwww.example.com/bi bi
9 https://wwww.example.com/hihibi hi
10 https://wwww.example.com/hihibi bi
For anyone who is interested, I ended up figuring out 2 ways to do this.
The first returns all matches (i.e., it duplicates the input value and matches it with all partial matches).
The second only returns the first match.
Both are extremely fast. I just ended up using a pretty simple masking script:
import time

def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
    """Takes two Series objects and returns a dataframe with all matching values (duplicating the full value).

    Args:
        full_values: The series that contains the full values for the matching pair.
        matching_criteria: The series that contains the partial values for the matching pair.

    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_join1 = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full")
    full_values = full_values.drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        # df = df.loc[n, 'match']
        output.append(df_copy)
    final = pd.concat(output)
    end_join1 = time.time() - start_join1
    end_join1 = str(round(end_join1, 2))
    len_join1 = len(final)
    return final
def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
    """Takes two Series objects and returns a dataframe with only the first matching value per full value.

    Args:
        full_values: The series that contains the full values for the matching pair.
        matching_criteria: The series that contains the partial values for the matching pair.

    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_singlejoin = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full").drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        output.append(df_copy)
    final = pd.concat(output)
    # leaves us with only the 1st match for each URL
    final = final.drop_duplicates(subset=['full'])
    end_singlejoin = time.time() - start_singlejoin
    end_singlejoin = str(round(end_singlejoin, 2))
    len_singlejoin = len(final)
    return final
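For reference, a usage sketch with the df1/df2 frames defined at the top of the question (my assumption: they are passed as the single columns, since both helpers expect Series, and the frames are fresh, i.e. not yet renamed by the earlier join_on_partial_match):
# Positional arguments map to full_values and matching_criteria respectively.
all_matches = partial_match_join_all_matches_returned(df1['pages'], df2['ngrams'])
first_only = partial_match_join_first_match_returned(df1['pages'], df2['ngrams'])
print(all_matches)
print(first_only)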
I am trying to improve the performance of a piece of code in which I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as per the below:
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate', 'Item', 'StartDate', 'EndDate', 'Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)

r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have doing this calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and PeriodEndDate are created as datetime; see the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item
(the columns compared on equality) and sorted by the index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function, which computes the mean for the rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine
and got a result almost 10 times shorter.
Check it on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
I am trying to slice a pandas.Series at specified time stamps. From other SO questions I got the following workflow:
import pandas as pd
x = ... # some time data
y = ... # some value data
lower_limit_x = pd.to_datetime(x.index) >= pd.to_datetime('2019-01-23 20:59:04')
upper_limit_x = pd.to_datetime(x.index) <= pd.to_datetime('2019-01-23 21:37:44')
lower_limit_y = pd.to_datetime(y.index) >= pd.to_datetime('2019-01-23 20:59:04')
upper_limit_y = pd.to_datetime(y.index) <= pd.to_datetime('2019-01-23 21:37:44')
mask_x = lower_limit_x & upper_limit_x
mask_y = lower_limit_y & upper_limit_y
sliced_x = x[mask_x]
sliced_y = y[mask_y]
However, if I start with the following data set, which spans from approx. 2019-01-23 20:45 to 2019-01-23 04:00, the resulting data seems to be empty. If I check
sliced_y.values
the result is an empty array.
How can I successfully slice my data by time stamps?
You can create a single dataframe, then use the loc accessor:
df = pd.DataFrame(y.values, index=x.values)
sliced_df = df.loc['2019-01-23 20:59:04': '2019-01-23 21:37:44']
sliced_df is now a single dataframe and you can access your x and y coordinates as follows:
sliced_times = sliced_df.index
sliced_values = sliced_df.iloc[:, 0].values
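If you prefer to stay with a Series, the same idea works directly (a sketch, assuming x holds the timestamps and y the values as in the question):
# Build a Series with a DatetimeIndex, sort it, then slice by time with .loc.
s = pd.Series(y.values, index=pd.to_datetime(x.values)).sort_index()
sliced = s.loc['2019-01-23 20:59:04':'2019-01-23 21:37:44']
sliced_times, sliced_values = sliced.index, sliced.values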
I have 3 different CSV files. Each has 70 rows and 430 columns. I want to create and save a boolean result file (with the same shape) that puts True where the condition is met.
One file contains temperature data, one wind data and one RH data. The condition is: [(t>=35) & (w>=7) & (rh<30)]
I want the saved file to be a 0-and-1 file that shows in which cells the condition has been met (1) or not (0). The problem is that the results are not correct! I really appreciate your help.
import numpy as np
import pandas as pd
dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)

result_set = []
for i in range(0, dft.shape[1]):
    t = dft[i]
    w = dfw[i]
    rh = dfrh[i]
    result = np.empty(dft.shape, dtype=bool)
    result = result[(t >= 35) & (w >= 7) & (rh < 30)]
    result_set = np.append(result_set, result)
np.savetxt("D:/result.csv", result_set, delimiter=",")
You can generate boolean Series by testing each column of the frame. You then simply concatenate the columns back into a DataFrame object.
import pandas as pd
data = pd.read_csv('data.csv')
bool_temp = data['temperature'] > 22
bool_week = data['week'] > 5
bool_humid = data['humidity'] > 50
data_tmp = [bool_humid, bool_temp, bool_week]
df = pd.concat(data_tmp, axis=1, keys=[s.name for s in data_tmp])
The dummy data:
temperature,week,humidity
25,3,80
29,4,60
22,4,20
20,5,30
2,7,80
30,9,80
are written to data.csv
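If a single 0/1 file is the goal rather than three separate boolean columns, a possible follow-up (my addition, reusing the boolean Series above) is to combine them and cast to int:
# Assumption: the overall condition is the AND of the three tests above;
# astype(int) turns True/False into 1/0 before writing the file.
condition = (bool_temp & bool_week & bool_humid).astype(int)
condition.to_frame('condition_met').to_csv('condition.csv', index=False)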
Give this a shot.
This is a proxy problem for yours, with random arrays from [0,100] in the same shape as your CSV.
import numpy as np
dft = np.random.rand(70,430)*100.
dfrh = np.random.rand(70,430)*100.
dfw = np.random.rand(70,430)*100.
result_set = []
for i in range(dft.shape[0]):
    result = ((dft[i] >= 35) & (dfw[i] >= 7) & (dfrh[i] < 30))
    result_set.append(result)
np.savetxt("result.csv", result_set, delimiter = ",")
The critical problem with your code is:
result=np.empty(dft.shape,dtype=bool)
result=result[(t>=35) & (w>=7) & (rh<30)]
This does not do what you think it's doing. You (i) initialize an empty array (which will have garbage values), and then you (ii) apply your boolean mask to it. So, now you have a garbage array masked into another garbage array according to your specified boolean rules.
As an example...
In [5]: a = np.array([1,2,3,4,5])
In [6]: mask = np.array([True,False,False,False,True])
In [7]: a[mask]
Out[7]: array([1, 5])
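One small note on the np.savetxt call in the proxy example above (my addition): by default savetxt writes the booleans with a float format, so if a strict 0/1 file is wanted, an integer format can be passed.
# Assumption: result_set is the list of boolean rows built in the loop above;
# fmt="%d" writes them as 0/1 integers instead of floats.
np.savetxt("result.csv", np.asarray(result_set, dtype=int), delimiter=",", fmt="%d")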