Faster method to multiply column lookup values with vectorization - python

I have two DataFrames: one contains values and is the working dataset (postsolutionDF), while the other is simply a reference lookup table (factorimportpcntDF). The goal is to add a column to postsolutionDF (new column name = num_predict) that contains the product of the lookup values for each row of postsolutionDF, multiplied by 2700. For example, on the first row the working values are 0.5, 2, -6. The equivalent lookup values for these are 0.1182, 0.2098, and 0.8455. Their product is 0.0209, which when multiplied by 2700 is 56.61, as shown in the output.
The code below works for this simplified example, but it is very slow on the real data (1.6MM rows x 15 numbered columns). I'm sure there is a better way to do this by removing the 'for k in range' loop, but I'm struggling with how, since I'm already using apply on the rows. I've found many tangential solutions but nothing that has worked for my situation yet. Thanks for any help.
import pandas as pd
import numpy as np
postsolutionDF = pd.DataFrame({'SCRN': ['2019-01-22-0000001', '2019-01-22-0000002', '2019-01-22-0000003'],
                               '1': 0.5, '2': 2, '3': [-6, 1.0, 8.0]})
postsolutionDF = postsolutionDF[['SCRN', '1', '2', '3']]
print('printing initial postsolutionDF..')
print(postsolutionDF)
factorimportpcntDF = pd.DataFrame({'F1_Val': [0.5, 1, 1.5, 2], 'F1_Pcnt': [0.1182, 0.2938, 0.4371, 0.5433],
                                   'F2_Val': [2, 3, np.nan, np.nan], 'F2_Pcnt': [0.2098, 0.7585, np.nan, np.nan],
                                   'F3_Val': [-6, 1, 8, np.nan], 'F3_Pcnt': [0.8455, 0.1753, 0.072, np.nan]})
print('printing factorimportpcntDF..')
print(factorimportpcntDF)
def zero_filter(row):  # row is a Series
    inner_value = 1
    for k in range(1, 4):  # number of columns in postsolutionDF with numeric headers, dynamic in actual code
        inner_value *= factorimportpcntDF.loc[factorimportpcntDF['F' + str(k) + '_Val'] == row[0 + k], 'F' + str(k) + '_Pcnt'].values[0]
    inner_value *= 2700
    return inner_value
postsolutionDF['num_predict'] = postsolutionDF.apply(zero_filter, axis=1)
print('printing new postsolutionDF..')
print(postsolutionDF)
Print Output:
C:\ProgramData\Anaconda3\python.exe C:/Users/Eric/.PyCharmCE2017.3/config/scratches/scratch_5.py
printing initial postsolutionDF..
SCRN 1 2 3
0 2019-01-22-0000001 0.5 2 -6.0
1 2019-01-22-0000002 0.5 2 1.0
2 2019-01-22-0000003 0.5 2 8.0
printing factorimportpcntDF..
F1_Pcnt F1_Val F2_Pcnt F2_Val F3_Pcnt F3_Val
0 0.1182 0.5 0.2098 2.0 0.8455 -6.0
1 0.2938 1.0 0.7585 3.0 0.1753 1.0
2 0.4371 1.5 NaN NaN 0.0720 8.0
3 0.5433 2.0 NaN NaN NaN NaN
printing new postsolutionDF..
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
Process finished with exit code 0

I'm not sure how to do this in native pandas, but if you go back to numpy, it is pretty easy.
The numpy.interp function is designed to interpolate between values in the lookup table, but if the input values exactly match the values in the lookup table (like yours do), it becomes just a simple lookup instead of an interpolation.
postsolutionDF['1new'] = np.interp(postsolutionDF['1'].values, factorimportpcntDF['F1_Val'], factorimportpcntDF['F1_Pcnt'])
postsolutionDF['2new'] = np.interp(postsolutionDF['2'].values, factorimportpcntDF['F2_Val'], factorimportpcntDF['F2_Pcnt'])
postsolutionDF['3new'] = np.interp(postsolutionDF['3'].values, factorimportpcntDF['F3_Val'], factorimportpcntDF['F3_Pcnt'])
postsolutionDF['num_predict'] = postsolutionDF['1new'] * postsolutionDF['2new'] * postsolutionDF['3new'] * 2700
postsolutionDF.drop(columns=['1new', '2new', '3new'], inplace=True)
Gives the output:
In [167]: postsolutionDF
Out[167]:
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I had to pad out factorimportpcntDF so all the columns had 4 values; otherwise looking up the highest value for a column wouldn't work. You can just repeat the last value, or split it into 3 separate lookup tables if you prefer, in which case the columns could be different lengths.
factorimportpcntDF = pd.DataFrame({'F1_Val' : [0.5, 1, 1.5, 2], 'F1_Pcnt' : [0.1182, 0.2938, 0.4371, 0.5433],
'F2_Val' : [2, 3, 3, 3], 'F2_Pcnt' : [0.2098, 0.7585, 0.7585, 0.7585],
'F3_Val' : [-6, 1, 8, 8], 'F3_Pcnt' : [0.8455, 0.1753, 0.072, 0.072]})
Note that the documentation specifies that your F1_Val etc. columns need to be in increasing order (yours already are, just an FYI). Otherwise np.interp will still run, but won't necessarily give sensible results.
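Since the real data has 15 numbered columns and the count is dynamic, the same np.interp idea can be looped over the column index instead of writing one line per column. A minimal sketch, assuming the numbered columns are named '1' through 'n' and the lookup columns follow the Fk_Val / Fk_Pcnt pattern above:
num_factor_cols = 3  # 15 in the real data; adjust as needed

num_predict = np.full(len(postsolutionDF), 2700.0)
for k in range(1, num_factor_cols + 1):
    # drop the NaN padding so each per-factor lookup table holds only real rows
    lookup = factorimportpcntDF[[f'F{k}_Val', f'F{k}_Pcnt']].dropna()
    num_predict *= np.interp(postsolutionDF[str(k)].values,
                             lookup[f'F{k}_Val'], lookup[f'F{k}_Pcnt'])

postsolutionDF['num_predict'] = num_predict
Dropping the NaN rows per factor also means you don't need to pad factorimportpcntDF by hand.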

Related

Using np.where to return the mean of df rows based on criteria

Let's suppose I have this following code
import pandas as pd
import numpy as np
flag = pd.DataFrame({'flag': [ [], ['red'], ['red, green'], ['red, blue'], ['blue'] ]})
colors_values = pd.DataFrame({'red': [1, 1, 1, 1, 1], 'green': [2, 2, 2, 2, 2], 'blue': [4, 4, 4, 4, 4]})
I have a one-column df called 'flag' in which each row contains a list of colors (red, green, blue), and another df 'colors_values' with those color names as columns. They have the same number of rows.
My goal is to use np.where to return the mean of the values for each row of 'colors_values' based on 'flag'. The output would be something like this:
0    NaN
1    1.0
2    1.5
3    2.5
4    4.0
If there is a better/faster way to do it instead of using np.where, I'd like to know.
Pandas merge is pretty fast; if you allow a bit of ramp-up time you could do a merge/groupby:
df_flag = flag.explode('flag').reset_index()
df_colors = colors_values.reset_index().melt(ignore_index=False, var_name='flag').reset_index()
df_flag = df_flag.merge(df_colors, on=['index', 'flag'], how='left')
df_grouped = df_flag.groupby(['index'])['value'].mean()
Fast solution
from sklearn.preprocessing import MultiLabelBinarizer
# encode the colors into indicator variables
mask = MultiLabelBinarizer().fit_transform(flag['flag'])
# mask the color values where indicator is zero then calculate mean
result = colors_values.sort_index(axis=1).mask(mask == 0).mean(axis=1)
Result
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
dtype: float64
You can match the color names between the two dataframes row by row as shown below:
means = colors_values.apply(lambda x: x[flag.iloc[x.name][0]].mean(), axis=1)
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
You could use str.get_dummies() and multiply by the colors_values df:
(flag['flag']
 .str[0]                      # take the 'red, blue'-style string out of each list
 .str.get_dummies(sep=', ')   # one 0/1 indicator column per color
 .mul(colors_values)          # keep the value where the indicator is 1, 0 elsewhere
 .where(lambda x: x.ne(0))    # turn the zeros into NaN so they don't affect the mean
 .mean(axis=1))
Output:
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0

Filling pandas DataFrame with values from another DataFrame of different shape

I have a dataframe df2 containing four columns: A, B, C, D. I want to fill this dataframe with the values from another data frame temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4).
What I want to obtain is these values filled into the four columns row by row (three rows of four values).
Any idea on how to do this?
If the length of the values modulo 4 equals 0, select the first row as a Series with DataFrame.iloc (df here is a one-row DataFrame holding the temp values), convert it to a NumPy array, and reshape it with -1 (the number of rows is inferred) and 4 (the number of columns):
print (len(df.iloc[0]) % 4)
0
df2 = pd.DataFrame(df.iloc[0].to_numpy().reshape(-1, 4), columns=list('ABCD'))
print (df2)
A B C D
0 1.0 2.0 6.5 8.0
1 3.0 4.0 6.6 7.8
2 5.0 6.0 5.0 4.0
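If temp is just a flat tuple or list rather than a one-row DataFrame, the same reshape can be applied directly to a NumPy array; a minimal sketch under that assumption:
import numpy as np
import pandas as pd

temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4)
assert len(temp) % 4 == 0  # must divide evenly into 4 columns
df2 = pd.DataFrame(np.array(temp).reshape(-1, 4), columns=list('ABCD'))
print(df2)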

How to get next rows after filtered rows in Pandas

I have a DataFrame called data. I wrote this filter to filter the rows:
data[data["Grow"] >= 1.5]
It returned some rows like these:
PriceYesterday Open High Low
------------------------------------------------------------------
7 6888.0 6881.66 7232.0 6882.0
53 7505.0 7555.72 7735.0 7452.0
55 7932.0 8093.08 8120.0 7974.0
64 7794.0 7787.29 8001.0 7719.0
...
As you see, there are some rows at indexes 7, 53, 55, .... Now I want to get the rows at indexes 8, 54, 56, ... too. Is there any straightforward way to do this? Thanks
You can use Index.intersection to avoid an error when the last row matches and the shifted index values therefore don't exist in the DataFrame:
data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'Grow': [0, 8, 2, 0.4, 2, 3.3],
})
df1 = data[data["Grow"] >= 1.5]
print (df1)
A B Grow
1 b 5 8.0
2 c 4 2.0
4 e 5 2.0
5 f 4 3.3
df2 = data.loc[data.index.intersection(df1.index + 1)]
print (df2)
A B Grow
2 c 4 2.0
3 d 5 0.4
5 f 4 3.3
Another idea is to select by shifted values with Series.shift:
df1 = data[data["Grow"] >= 1.5]
df2 = data[data["Grow"].shift() >= 1.5]
print (df2)
A B Grow
2 c 4 2.0
3 d 5 0.4
5 f 4 3.3
df1 = data[data["Grow"] >= 1.5]
df2 = data.loc[df1.index + 1]
print (df2)
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([6], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
You should create a mask, and then shift that mask by one:
import numpy as np
df = pd.DataFrame({'a': np.random.random(20)})
print(df)
mask = df['a']>0.8
print("items that fit the mask:")
print(df.loc[mask])
print("items following these:")
print(df.loc[mask.shift().fillna(False)])
In your specific case I believe it would be
data.loc[(data["Grow"] >= 1.5).shift().fillna(False)]
data[data.shift()["Grow"] >= 1.5]
The shift moves every cell one step down the frame. So this says: give me those entries whose predecessor matches my criteria.
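If you want both the matched rows and the rows immediately after them (the "too" in the question suggests this), the two index sets can be combined; a small sketch building on the intersection idea above, assuming a default integer index like in the question:
matched = data.index[data["Grow"] >= 1.5]
# union of the matched rows and their successors, dropping any successor past the last row
wanted = matched.union(matched + 1).intersection(data.index)
print(data.loc[wanted])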

Pandas fillna() not filling values from series

I'm trying to fill missing values in a column in a DataFrame with the value from another DataFrame's column. Here's the setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [2, 3, 5, np.nan, np.nan],
    'b': [10, 11, 13, 14, 15]
})
df2 = pd.DataFrame({
    'x': [1]
})
I can of course do this and it works:
df['a'] = df['a'].fillna(1)
However, this results in the missing values not being filled:
df['a'] = df['a'].fillna(df2['x'])
And this results in an error:
df['a'] = df['a'].fillna(df2['x'].values)
How can I use the value from df2['x'] to fill in missing values in df['a']?
If you can guarantee df2['x'] only has a single element, then use .item:
df['a'] = df['a'].fillna(df2.values.item())
Or,
df['a'] = df['a'].fillna(df2['x'].item())
df
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15
Otherwise, this isn't possible unless they're either the same length and/or index-aligned.
As a rule of thumb, either:
pass a scalar, or
pass a dictionary mapping the index of each NaN value to its replacement value (e.g., df.a.fillna({3: 1, 4: 1})), or
pass an index-aligned Series (the last two options are sketched below).
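A minimal sketch of the last two options, using the df from the question (either one fills rows 3 and 4 with 1):
# option 2: map the index label of each NaN to its replacement value
df['a'] = df['a'].fillna({3: 1, 4: 1})

# option 3: an index-aligned Series (same index labels as df['a'])
fill = pd.Series(1, index=df.index)
df['a'] = df['a'].fillna(fill)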
I think one general solution is to select the first value with [0] to get a scalar:
print (df2['x'].values[0])
1
df['a'] = df['a'].fillna(df2['x'].values[0])
#similar solution for select by loc
#df['a'] = df['a'].fillna(df2.loc[0, 'x'])
print (df)
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15

python pandas Ignore Nan in integer comparisons

I am trying to create dummy variables based on integer comparisons in series where NaN is common. A > comparison raises errors if there are any NaN values, but I want the comparison to return a NaN. I understand that I could use fillna() to replace NaN with a value that I know will be false, but I would hope there is a more elegant way to do this. I would need to change the value in fillna() if I used less than, or used a variable that could be positive or negative, and that is one more opportunity to create errors. Is there any way to make 30 < NaN return NaN?
To be clear, I want this:
df['var_dummy'] = df[df['var'] >= 30].astype('int')
to return a null if var is null, 1 if it is 30+, and 0 otherwise. Currently I get ValueError: cannot reindex from a duplicate axis.
Here's a way:
s1 = pd.Series([1, 3, 4, 2, np.nan, 5, np.nan, 7])
s2 = pd.Series([2, 1, 5, 5, np.nan, np.nan, 2, np.nan])
(s1 < s2).mask(s1.isnull() | s2.isnull(), np.nan)
Out:
0 1.0
1 0.0
2 1.0
3 1.0
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
This masks the boolean array returned from (s1 < s2) wherever either input is NaN, returning NaN there instead. But you cannot have NaNs in a boolean array, so the result is cast as float.
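Applied to the single-column case from the question (using the df['var'] column named in the post), the same masking idea would be:
# 1.0 where var >= 30, 0.0 otherwise, NaN where var is NaN
df['var_dummy'] = (df['var'] >= 30).astype(float).mask(df['var'].isnull())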
Solution 1
df['var_dummy'] = 1 * df.loc[~pd.isnull(df['var']), 'var'].ge(30)
Solution 2
df['var_dummy'] = df['var'].apply(lambda x: np.nan if x!=x else 1*(x>30))
x!=x is equivalent to math.isnan()
You can use the notna() method. Here is an example:
import pandas as pd
list1 = [12, 34, -4, None, 45]
list2 = ['a', 'b', 'c', 'd', 'e']
# Calling DataFrame constructor on above lists
df = pd.DataFrame(list(zip(list1, list2)), columns =['var1','letter'])
#Assigning new dummy variable:
df['var_dummy'] = df['var1'][df['var1'].notna()] >= 30
# or you can also use: df['var_dummy'] = df.var1[df.var1.notna()] >= 30
df
Will produce the below output:
var1 letter var_dummy
0 12.0 a False
1 34.0 b True
2 -4.0 c False
3 NaN d NaN
4 45.0 e True
So the new dummy variable has NaN value for the original variable's NaN rows.
The only thing that does not match your request is that the dummy variable takes False and True values instead of 0 and 1, but you can easily reassign the values.
One thing, however, you cannot change is that the new dummy variable has to be float type because it contains NaN value, which by itself is a special float value.
More information about NaN floats can be found here:
How can I check for NaN values?
and here:
https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
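As a side note (assuming a reasonably recent pandas is available), the nullable Int64 dtype can hold missing values without the fallback to float, if integer 0/1 codes are preferred:
# 1 where var1 >= 30, 0 otherwise, <NA> where var1 is missing (requires pandas >= 1.0)
df['var_dummy'] = (df['var1'] >= 30).astype(float).mask(df['var1'].isna()).astype('Int64')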
