I have this DataFrame.
High Close
Close Time
2022-10-23 21:41:59.999 19466.02 19461.29
2022-10-23 21:42:59.999 19462.48 19457.83
2022-10-23 21:43:59.999 19463.13 19460.09
2022-10-23 21:44:59.999 19465.15 19463.76
I'm attempting to check whether Close at a later date (up to 600 rows later, but no more) goes above the Close of an earlier date while High stays below the High of that same earlier date. If so, I want to get the location of both the earlier and the later date and write those values into new columns in the DataFrame.
Expected output:
High Close LC HC HH LH
Close Time
2022-10-23 21:41:59.999 19466.02 19461.29 19461.29 NaN 19466.02 NaN
2022-10-23 21:42:59.999 19462.48 19457.83 NaN NaN NaN NaN
2022-10-23 21:43:59.999 19463.13 19460.09 NaN NaN NaN NaN
2022-10-23 21:44:59.999 19465.15 19463.76 NaN 19463.76 NaN 19465.15
This is the code I have tried
# Checking if conditions are met
for i in range(len(df)):
    for a in range(i, 600):
        if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
            # Creating new DataFrame columns
            df['LC'] = df.iloc[i, 1]
            df['HC'] = df.get_loc[i, 1]
            df['HH'] = df.get_loc[a, 0]
            df['LH'] = df.get_loc[a, 0]
        else:
            continue
This line:
if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
is causing this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe I should be using any() instead of a plain if statement, but I am unsure of how to apply it. I also think there may be an issue with the way I am using df.get_loc[], but I am unsure. I'm a pandas beginner, so if the fix is obvious I apologize.
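For example, this minimal sketch (with made-up numbers) reproduces the ambiguity and shows how any() collapses a boolean Series into a single value:
import pandas as pd

s = pd.Series([19461.29, 19457.83, 19460.09])
cond = s < 19461.0     # a boolean Series, not a single True/False

# if cond: ...         # raises "The truth value of a Series is ambiguous"
if cond.any():         # True if at least one element is True
    print("condition met somewhere in the window")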
Here is an image (a candlestick chart) to help visualise what I am attempting to do: I want to check whether HC is higher than LC and LH is lower than HH, and if so add that data to new columns in the DataFrame.
Here is an additional way I tried to achieve the desired output
idx_close, idx_high = map(df.columns.get_loc, ["Close", "High"])

# Check conditions
for i in range(len(df)):
    bool_l = [((df.iloc[i, idx_close] < df.iloc[a, idx_close]) &
               (df.iloc[i, idx_high] > df.iloc[a, idx_high])
               ).any() for a in range(i, 600)]
    # Creating new DataFrame columns
    df.loc[i, 'LC'] = df.iloc[i, 1]
    df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1]
    df.loc[i, 'HH'] = df.iloc[i, 0]
    df.loc[bool_l, 'LH'] = df.iloc[bool_l, 0]
And on the line df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1] I get the error IndexError: Boolean index has wrong length: 600 instead of 2867. I assume the error comes from range(i, 600), but I don't know how to get around it.
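To illustrate the mismatch itself with toy data (the numbers mirror my case):
import pandas as pd

df_demo = pd.DataFrame({'v': range(2867)})
mask = [True] * 600    # one boolean per a in range(i, 600)

# .loc needs one boolean per row of the whole frame, so this raises
# IndexError: Boolean index has wrong length: 600 instead of 2867
# df_demo.loc[mask, 'v']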
As mentioned by @Jimpsoni, the question is a little unclear in defining what you mean by LC, HC, HH and LH. I will use the definitions below to answer your question:
C_t is the close price on a given day, t
H_t is the high price on a given day, t
Then if I understand correctly, you want to check the following two conditions:
close condition: is there a future close price in the next 600 days which is higher than the current close price?
high condition: is there a future high price in the next 600 days which is lower than the high price from today?
And then for each day t, you want to find the first day in the future where both the conditions above are satisfied simultaneously.
With this understanding, I sketch a solution below.
Part 1: Setting up sample data (do this in your question next time)
import pandas as pd
import numpy as np
np.random.seed(2022)
# make example data
close = np.sin(range(610)) + 10
high = close + np.random.rand(*close.shape)
close[2] += 100 # to ensure the close condition cannot be met
dates = pd.date_range(end='2022-06-30', periods=len(close))
# insert into pd.DataFrame
df = pd.DataFrame(index=dates, data=np.array([close, high]).T, columns=['close', 'high'])
Part 2: Understand how to use rolling functions
Since at each point in time you want to subset your dataframe to a particular view in a rolling manner (always looking 600 rows forward), it makes sense to use the built-in .rolling method of pandas dataframes here. The .rolling() method is similar to a groupby in the sense that it allows you to iterate over different subsets of the dataframe without explicitly writing the loop. rolling by default looks backwards, so we need a forward-looking indexer to achieve the forward window. Note that you can also achieve the same forward window with some shifting, but it is less intuitive. The chunk below demonstrates how both methods give you the same result:
# create the rolling window object
# the simplest solution is to use the fixed size forward window as below
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=600)
method_1 = df.rolling(indexer, min_periods=600).sum()
# can also achieve the same thing by doing the rolling computation and then shifting backward
method_2 = df.rolling(600, min_periods=600).sum().shift(-599)
# confirm they are the same
(method_1.dropna() == method_2.dropna()).mean().mean() # returns 1.0
Part 3: Write the logic that will be applied to each rolling view
Once we have the rolling frame, we can simply apply a function to each rolling view of the dataframe. Here is the function; some comments follow below:
def check_conditions(ser, df_full):
    df_chunk = df_full.loc[ser.index, :]
    today_close, today_high = df_chunk.iloc[0, :2]
    future_close_series, future_high_series = df_chunk.iloc[1:, 0], df_chunk.iloc[1:, 1]
    close_condition = today_close < future_close_series
    high_condition = today_high > future_high_series
    both_condition = close_condition & high_condition
    if both_condition.sum() == 0:
        return 1
    first_date_satisfied = future_close_series.index[np.where(both_condition)][0]
    future_vals = df_chunk.loc[first_date_satisfied, ['close', 'high']].values
    df_full.loc[ser.index[0], ['Future_Close', 'Future_High', 'Future_Date']] = np.append(future_vals, first_date_satisfied)
    return 1
Some comments: First, notice that the function takes two arguments; the first is a series, and the second is the full dataframe. Unfortunately, .rolling() only acts on a single column / row at a time. This is in contrast to .groupby(), which allows access to the full dataframe produced by each group. To get around this, I use a pattern proposed in this answer, where we use the index of the series (the series is the rolling view on a single column of the dataframe) to subset the full dataframe appropriately. We then check the conditions on the appropriately subset dataframe with all columns available, and modify that dataframe when the conditions match. This may not be particularly memory efficient, as pointed out in the comments to the answer linked above.
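To see that pattern in isolation, here is a stripped-down sketch (toy frame, hypothetical column names) of using the rolling series' index to recover the corresponding rows of the full dataframe:
import pandas as pd

df_full = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})

def on_view(ser, full):
    # `ser` is the rolling view of ONE column; its index recovers the
    # matching rows of the full frame, with all columns available.
    chunk = full.loc[ser.index]
    return float((chunk['a'] * chunk['b']).sum())

df_full['prod_sum'] = df_full['a'].rolling(2).apply(lambda s: on_view(s, df_full), raw=False)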
Final part: run the function
Here we set up the df_out dataframe (passed to the function as df_full), and then use rolling and apply the function to generate the output:
df_out = df.copy()
df_out[['Future_Close','Future_High','Future_Date']] = np.nan
_ = df.rolling(indexer, min_periods=600).apply(lambda x: check_conditions(x, df_out))
df_out
In df_out, the extra columns will be NaN if no day in the next 600 days meets the criteria. If a day does meet the criteria, the close and high price on that day, as well as the date, are attached.
Hope it helps.
Using a regular for loop to iterate over a dataframe is slow; instead, you should use pandas' built-in methods, because they are much faster.
Series has a built-in method diff() that iterates over that series and compares every value to the previous one. So you could get two new Series that each hold information about whether the previous close (or high) was less than today's:
# You need numpy for this, if already imported, ignore this line
import numpy as np
# Make series for close and high
close = df["Close"].diff()
high = df["High"].diff()
I think you want a lower "high" value and a higher "close" value, but with this logic it doesn't generate the desired output.
# Then create new columns
df['LC'] = np.where((close > 0) & (high < 0), np.NaN, df["Close"])
df['HC'] = np.where((close > 0) & (high < 0), np.NaN, df["High"])
df['HH'] = np.where((close > 0) & (high < 0), np.NaN, df["Close"])
df['LH'] = np.where((close > 0) & (high < 0), np.NaN, df["High"])
If you provide more information about what LC, HC, HH and LH are supposed to be, or provide more examples, I can help you get correct results.
I have a problem with coloring a dataframe and exporting it to Excel. I have two df's with the same shape and index. The first contains only the numbers 0, 1, 2 and 3. The second contains various numbers and strings.
What I want is to color the second df based on the numbers in the first df. For that purpose I made the two functions you can see below.
def apply_color(x):
    colors = {0: 'transparent', 1: 'grey', 2: 'yellow', 3: 'red'}
    return df1.applymap(lambda val: 'background-color: {}'.format(colors.get(val, '')))

def coloring(dfInt, df):
    df1 = dfInt
    df2 = df.style.apply(apply_color, axis=None)
    return df2
I have a big df; inside it I stored some info and two other df's, at position 7 (the df with numbers) and position 8 (the df with various numbers and strings).
x = 0
a = []
writer = pd.ExcelWriter("%s.xlsx" % file)
for i in df["Dimension"]:
    dfExport = df.iat[x, 8]
    dfExportColor = df.iat[x, 7]
    sheet_name = i
    # a.append(dfExportColor)
    # a.append(dfExport)
    dfa = coloring(dfExportColor, dfExport)
    dfa.to_excel(writer, sheet_name=sheet_name)
    x += 1
writer.save()
If I run the code, the first three loop iterations are OK. On the fourth it gives me a ValueError:
ValueError: Function <function apply_color at 0x0000025A60925990> created invalid index labels.
Usually, this is the result of the function returning a DataFrame which contains invalid labels, or returning an incorrectly shaped, list-like object which cannot be mapped to labels, possibly due to applying the function along the wrong axis.
Result index has shape: (1232,)
Expected index shape: (28484,)
But! I added the list "a" to the code to collect all the df's. If I manually use the last two (those that caused the error), the code works!
df1 = a[6]
df2 = a[7]
x = coloring(df1,df2)
writer=pd.ExcelWriter("x.xlsx")
x.to_excel(writer)
writer.save()
And at this point, if I restart the for loop, it fails on the first iteration. Then, if I again use the "manual" code for the df's from the loop, it works. And if I then restart the for loop, it fails again on the fourth iteration, and so on.
I have been trying to fix this for the last 24 hours and I have no idea what more I can do.
Please, does anyone know how to fix it?
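For reference, here is a stripped-down, self-contained sketch of the pattern I am using (hypothetical frames); it works when the color-code frame has exactly the same shape and labels as the styled frame, which is what the error message is complaining about:
import pandas as pd

# df_values is what gets displayed; df_codes holds the 0-3 color codes.
df_values = pd.DataFrame({'a': ['x', 'y'], 'b': [1.5, 'z']})
df_codes = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})

colors = {0: '', 1: 'grey', 2: 'yellow', 3: 'red'}

def apply_color(x, codes):
    # With axis=None the function must return CSS strings with the SAME
    # shape/labels as x; a mismatch raises "created invalid index labels".
    return codes.applymap(
        lambda v: 'background-color: {}'.format(colors[v]) if colors.get(v) else '')

styled = df_values.style.apply(apply_color, axis=None, codes=df_codes)
styled.to_excel('styled.xlsx')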
I've been struggling for the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined, and I focused on performance as well. Since I'm not the only one struggling with these questions, I thought I'd provide my solutions here with working examples, hoping they help any existing/future pandas users.
Important notes
The combination of apply & rolling in pandas has a very strong output requirement: you have to return one single value. You cannot return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don't understand why 'apply & rolling' has this requirement, because without rolling, 'apply' doesn't have it. It must be due to some internal pandas functions.
The combination of 'apply & rolling' with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows, and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can't handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working, but:
Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
When combining 'rolling & groupby', it returns a multi-index series which can't easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True), as answered & commented here (Python - rolling functions for GroupBy object); a small sketch follows below.
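As a small standalone sketch of that shape problem (toy frame, hypothetical names):
import pandas as pd

df_g = pd.DataFrame({'group': [0, 0, 1, 1], 'v': [1.0, 2.0, 3.0, 4.0]})

rolled = df_g.groupby('group')['v'].rolling(2).sum()
# `rolled` carries a MultiIndex of (group, original row index), so it
# doesn't align when assigned as a new column. Dropping the index fixes it:
df_g['rolled'] = rolled.reset_index(drop=True)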
You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
Code examples
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Input 1 column, output 1 column
Basic
def func_i1_o1(x):
    return x + 1

df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
    return x[0] + x[1]

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
Input 2 columns, output 1 column
Basic
def func_i2_o1(x):
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.
def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Input 1 column, output 2 columns
Basic
You could use a 'normal' solution by returning pd.Series:
def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
    return x + 1, x + 2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling this means you'd have to manually set NaN's/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and solves the automatic NaNs/grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Input 2 columns, output 2 columns
Basic
I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.
def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, and in my_df2_Group only the second group (group 1 of my_df2 contains the value 23).
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to
                                                     # work with Sample[1] in the for doesn't work
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
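A side note on why the original loop appeared to do nothing while this version works: Sample[1] = ... rebinds the list slot to a brand-new object, while the inplace operations above mutate the very DataFrame that my_df1_Group still points to. A minimal sketch of the rebinding behaviour:
import pandas as pd

my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])
SampleList = [[None, my_df1_Group]]

for Sample in SampleList:
    # Rebinding: SampleList now holds a filtered COPY, but the name
    # my_df1_Group still refers to the original, unfiltered frame.
    Sample[1] = Sample[1].query('Group_Value < 60')

print(len(SampleList[0][1]))  # 1 -> the copy is filtered
print(len(my_df1_Group))      # 2 -> the original is untouched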
I've got a dataframe containing country names & their percentage of energy output.
I need to add a new column that assigns a 1 or 0, based on whether the country's energy output is above or below the median of energy output. Some dummy code is:
import pandas as pd

def answer():
    df = pd.DataFrame({'name': ['china', 'america', 'canada'], 'output': [33.2, 15.0, 5.0]})
    df['newcol'] = df.where(df['output'] > df['output'].median(), 1, 0)
    return df['newcol']

answer()
The code raises:
ValueError: Wrong number of items passed 2, placement implies 1
I feel like this is an incredibly simple fix, but I'm new to working with Pandas. Please help end my frustration.
@Vaishali explains why pd.DataFrame.where didn't work as you expected and suggested you use np.where instead, which is very good advice.
I'll offer up that you could have simply converted your boolean result to integers.
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['china', 'america', 'canada'],
    'output': [33.2, 15.0, 5.0]
})
Option 1
df['newcol'] = (df['output'] > df['output'].median()).astype(int)
Option 2
Or faster yet by using the underlying numpy arrays
o = df['output'].values
df['newcol'] = (o > np.median(o)).astype(int)
You don't need a loop, as the solution is vectorized.
df['newcol'] = np.where((df['output'] > df['output'].median()), 1, 0)
name output newcol
0 china 33.2 1
1 america 15.0 0
2 canada 5.0 0
For the error "wrong number of items passed": df.where works a little differently from np.where. It returns an object of the same shape as self, whose entries are taken from self where cond is True and from other otherwise. So in your case it returns a dataframe with two columns instead of a Series, and hence when you try to assign that dataframe to a single column, you get the error message.
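A minimal sketch of the difference, using the question's numbers:
import numpy as np
import pandas as pd

s = pd.Series([33.2, 15.0, 5.0])
cond = s > s.median()

# Series.where keeps the original value where cond is True and fills in
# `other` where it is False -- it does not take separate true/false values:
s.where(cond, 0)      # 33.2, 0.0, 0.0

# np.where takes explicit true/false values and returns a plain array:
np.where(cond, 1, 0)  # array([1, 0, 0])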