##1
M_members = [1000, 1450, 1900]
M = pd.DataFrame(M_members)
##2
a_h_members = [0.4, 0.6, 0.8]
a_h = pd.DataFrame(a_h_members)
##3
d_h_members = [0.1, 0.2]
d_h = pd.DataFrame(d_h_members)
The output I want is a DataFrame of this form:
1000 0.4 0.1
1000 0.4 0.2
1000 0.6 0.1
1000 0.6 0.2
1000 0.8 0.1
1000 0.8 0.2
1450 0.4 0.1
1450 0.4 0.2
1450 0.6 0.1
1450 0.6 0.2
1450 0.8 0.1
1450 0.8 0.2
1900 0.4 0.1
1900 0.4 0.2
1900 0.6 0.1
1900 0.6 0.2
1900 0.8 0.1
1900 0.8 0.2
I actually want to do this for more dataframes than these three.
Use itertools.product
>>> import itertools
>>> pd.DataFrame(itertools.product(*[M_members, a_h_members, d_h_members]))
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
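If you want named columns instead of the default 0, 1, 2, you can pass them directly to the DataFrame constructor; a small sketch (the column names M, a_h, d_h here are just illustrative):
import itertools
import pandas as pd

M_members = [1000, 1450, 1900]
a_h_members = [0.4, 0.6, 0.8]
d_h_members = [0.1, 0.2]

# name the columns up front instead of renaming afterwards
out = pd.DataFrame(itertools.product(M_members, a_h_members, d_h_members),
                   columns=['M', 'a_h', 'd_h'])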
Depending on your data size, expand_grid from pyjanitor may help with performance:
# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'a':M, 'b':a_h, 'c':d_h}
jn.expand_grid(others = others)
a b c
0 0 0
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
You can drop a column level, or flatten it:
jn.expand_grid(others = others).droplevel(axis = 1, level = 1)
a b c
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
If you're starting from the DataFrames themselves, you can use a repeated cross merge (note that how='cross' requires pandas >= 1.2):
from functools import reduce

dfs = [M, a_h, d_h]
out = (reduce(lambda a, b: a.merge(b, how='cross'), dfs)
       .set_axis(range(len(dfs)), axis=1))
Output:
0 1 2
0 1000 0.4 0.1
1 1000 0.4 0.2
2 1000 0.6 0.1
3 1000 0.6 0.2
4 1000 0.8 0.1
5 1000 0.8 0.2
6 1450 0.4 0.1
7 1450 0.4 0.2
8 1450 0.6 0.1
9 1450 0.6 0.2
10 1450 0.8 0.1
11 1450 0.8 0.2
12 1900 0.4 0.1
13 1900 0.4 0.2
14 1900 0.6 0.1
15 1900 0.6 0.2
16 1900 0.8 0.1
17 1900 0.8 0.2
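Another option that stays within pandas is pd.MultiIndex.from_product, which builds the Cartesian product directly from the lists; a minimal sketch, equivalent to the itertools version above:
import pandas as pd

M_members = [1000, 1450, 1900]
a_h_members = [0.4, 0.6, 0.8]
d_h_members = [0.1, 0.2]

# build the product as a MultiIndex, then flatten it into ordinary columns
out = pd.MultiIndex.from_product([M_members, a_h_members, d_h_members]).to_frame(index=False)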
I have a dataframe that looks like this
a b c d
0 0.6 -0.4 0.2 0.7
1 0.8 0.2 -0.2 0.3
2 -0.1 0.5 0.5 -0.4
3 0.8 -0.6 -0.7 -0.2
And I wish to create column 'e' such that it displays the column number of the first instance in each row where the value is less than 0
So the goal result will look like this
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
I can do this in Excel using a MATCH(True)-type function, but am struggling to make progress in Pandas.
Thanks for any help.
You can use np.argmax:
import numpy as np

# boolean mask of where the values are less than 0
a = df.values < 0
# if a row is all non-negative return 0, otherwise the 1-based position of the first negative
df['e'] = np.where(a.any(axis=1), np.argmax(a, axis=1) + 1, 0)
Output:
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Something like idxmin with np.sign (this assumes every row contains at least one negative value):
import numpy as np
df['e'] = df.columns.get_indexer(np.sign(df).idxmin(axis=1)) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Get the first True with idxmax, combined with get_indexer_for to get the column numbers:
df["e"] = df.columns.get_indexer_for(df.lt(0, axis=1).idxmax(axis=1).array) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
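One caveat on the idxmax approaches: idxmax returns the first column when a row contains no True at all, so a row with no negative values would wrongly get e = 1. If that case can occur in your data, a small guard (a sketch, not part of the original answers) fixes it:
import numpy as np

neg = df.lt(0)
# 0 for rows with no negative value, otherwise the 1-based position of the first one
df['e'] = np.where(neg.any(axis=1),
                   df.columns.get_indexer_for(neg.idxmax(axis=1)) + 1,
                   0)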
I have a data frame as shown below
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin
1 0.4 S1 1 0.2 0.2 0.2 0.2
2 0.3 S1 2 0.5 -0.2 0.2 0.7
3 0.8 S1 3 0.5 0.3 0.5 1.2
4 0.3 S1 4 0.8 -0.5 0.0 2.0
5 0.6 S1 5 0.4 0.2 0.2 2.4
6 0.8 S1 6 0.2 0.6 0.8 2.6
7 0.9 S1 7 0.1 0.8 1.4 2.7
8 0.4 S1 8 0.5 -0.1 1.3 3.2
9 0.6 S1 9 0.1 0.5 1.8 3.3
12 0.9 S2 1 0.9 0.0 0.0 0.9
13 0.5 S2 2 0.4 0.1 0.1 1.3
14 0.3 S2 3 0.1 0.2 0.3 1.4
15 0.7 S2 4 0.4 0.3 0.6 1.8
20 0.7 S2 5 0.1 0.6 1.2 1.9
16 0.6 S2 6 0.3 0.3 1.5 2.2
17 0.8 S2 7 0.5 0.3 1.8 2.7
19 0.3 S2 8 0.8 -0.5 1.3 3.5
where:
df['ns_w'] = df['no_show'] - df['walkin']
c_ns_w is the cumulative sum of ns_w:
df['c_ns_w'] = df.groupby(['Session'])['ns_w'].cumsum()
c_walkin is the cumulative sum of walkin:
df['c_walkin'] = df.groupby(['Session'])['walkin'].cumsum()
From the above I would like to calculate two columns called u_c_ns_w and u_c_walkin.
Whenever u_c_walkin > 0.9, create a new row with no_show = 0, walkin = 0, and all other values the same as the row above, with B_ID = walkin1, walkin2, etc., and subtract 1 from the subsequent u_c_walkin values.
At the same time, whenever u_c_ns_w > 0.8, add a new row with B_ID = overbook1, overbook2, etc., with no_show = 0.5, walkin = 0, ns_w = 0.5, all other values the same as the row above, and subtract 0.5 from the subsequent u_c_ns_w values.
Expected output:
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin u_c_walkin u_c_ns_w
1 0.4 S1 1 0.2 0.2 0.2 0.2 0.2 0.2
2 0.3 S1 2 0.5 -0.2 0.2 0.7 0.7 0.2
3 0.8 S1 3 0.5 0.3 0.5 1.2 1.2 0.5
walkin1 0.0 S1 3 0.0 0.3 0.5 1.2 0.2 0.5
4 0.3 S1 4 0.8 -0.5 0.0 2.0 1.0 0.0
walkin2 0.0 S1 4 0.0 -0.5 0.0 2.0 0.0 0.0
5 0.6 S1 5 0.4 0.2 0.2 2.4 0.4 0.2
6 0.8 S1 6 0.2 0.6 0.8 2.6 0.6 0.8
7 0.9 S1 7 0.1 0.8 1.4 2.7 0.7 1.4
overbook1 0.5 S1 7 0.0 0.5 1.4 2.7 0.7 0.9
8 0.4 S1 8 0.5 -0.1 1.3 3.2 1.2 0.8
walkin3 0.0 S1 8 0.0 -0.1 1.3 3.2 0.2 0.8
9 0.6 S1 9 0.1 0.5 1.8 3.3 0.1 1.3
overbook2 0.5 S1 9 0.0 0.5 1.8 3.3 0.1 0.8
12 0.9 S2 1 0.9 0.0 0.0 0.9 0.9 0.0
13 0.5 S2 2 0.4 0.1 0.1 1.3 1.3 0.1
walkin1 0.0 S2 2 0.0 0.1 0.1 1.3 0.3 0.1
14 0.3 S2 3 0.1 0.2 0.3 1.4 0.4 0.3
15 0.7 S2 4 0.4 0.3 0.6 1.8 0.8 0.6
20 0.7 S2 5 0.1 0.6 1.2 1.9 0.9 1.2
overbook1 0.5 S2 5 0.0 0.5 1.2 1.9 0.9 0.7
16 0.6 S2 6 0.3 0.3 1.5 2.2 1.2 1.0
walkin2 0.0 S2 6 0.3 0.3 1.5 2.2 0.2 1.0
overbook2 0.5 S2 6 0.0 0.5 1.5 2.2 0.2 0.5
17 0.8 S2 7 0.5 0.3 1.8 2.7 0.7 0.8
19 0.3 S2 8 0.8 -0.5 1.3 3.5 1.5 0.3
walkin3 0.0 S2 8 0.8 -0.5 1.3 3.5 0.5 0.3
I tried the code below to create the walkin rows, but I was not able to create the overbook rows.
def create_u_columns(ser):
    arr_ns = ser.to_numpy()
    # array for the later insert
    arr_idx = np.zeros(len(ser), dtype=int)
    walkin_id = 1
    for i in range(len(arr_ns)-1):
        if arr_ns[i] > 0.8:
            # subtract 1 from the subsequent cumulative values
            arr_ns[i+1:] -= 1
            # mark this row for the later insert
            arr_idx[i] = walkin_id
            walkin_id += 1
    # return a dataframe with both columns
    return pd.DataFrame({'u_cumulative': arr_ns, 'mask_idx': arr_idx}, index=ser.index)

df[['u_c_walkin', 'mask_idx']] = df.groupby(['Session'])['c_walkin'].apply(create_u_columns)
# select the rows
df_toAdd = df.loc[df['mask_idx'].astype(bool), :].copy()
# replace the values as wanted
df_toAdd['no_show'] = 0
df_toAdd['walkin'] = 0
df_toAdd['B_ID'] = 'walkin' + df_toAdd['mask_idx'].astype(str)
df_toAdd['u_c_walkin'] -= 1
# add 0.5 to the index for the later sort
df_toAdd.index += 0.5
new_df = (pd.concat([df, df_toAdd]).sort_index()
            .reset_index(drop=True).drop('mask_idx', axis=1))
You can modify the function this way to do both checks at the same time. Please verify that these are exactly the conditions you want to apply for the walkin and overbook rows.
def create_columns(dfg):
    arr_walkin = dfg['c_walkin'].to_numpy()
    arr_ns = dfg['c_ns_w'].to_numpy()
    # arrays for the later inserts
    arr_idx_walkin = np.zeros(len(arr_walkin), dtype=int)
    arr_idx_ns = np.zeros(len(arr_ns), dtype=int)
    walkin_id = 1
    overbook_id = 1
    for i in range(len(arr_ns)):
        # condition on c_walkin
        if arr_walkin[i] > 0.9:
            # subtract 1 from the subsequent cumulative values
            arr_walkin[i+1:] -= 1
            # mark this row for the later insert
            arr_idx_walkin[i] = walkin_id
            walkin_id += 1
        # condition on c_ns_w
        if arr_ns[i] > 0.8:
            # subtract 0.5 from the subsequent cumulative values
            arr_ns[i+1:] -= 0.5
            # mark this row for the later insert
            arr_idx_ns[i] = overbook_id
            overbook_id += 1
    # return a dataframe with all four columns
    return pd.DataFrame({'u_c_walkin': arr_walkin,
                         'u_c_ns_w': arr_ns,
                         'mask_idx_walkin': arr_idx_walkin,
                         'mask_idx_ns': arr_idx_ns}, index=dfg.index)

df[['u_c_walkin', 'u_c_ns_w', 'mask_idx_walkin', 'mask_idx_ns']] = \
    df.groupby(['Session'])[['c_walkin', 'c_ns_w']].apply(create_columns)

# select the rows for walkin
df_walkin = df.loc[df['mask_idx_walkin'].astype(bool), :].copy()
# replace the values as wanted
df_walkin['no_show'] = 0
df_walkin['walkin'] = 0
df_walkin['B_ID'] = 'walkin' + df_walkin['mask_idx_walkin'].astype(str)
df_walkin['u_c_walkin'] -= 1
# add 0.2 to the index for the later sort
df_walkin.index += 0.2

# select the rows for ns_w
df_ns = df.loc[df['mask_idx_ns'].astype(bool), :].copy()
# replace the values as wanted
df_ns['no_show'] = 0.5
df_ns['walkin'] = 0
df_ns['ns_w'] = 0.5
df_ns['B_ID'] = 'overbook' + df_ns['mask_idx_ns'].astype(str)
df_ns['u_c_ns_w'] -= 0.5
# add 0.4 to the index for the later sort
df_ns.index += 0.4

new_df = (pd.concat([df, df_walkin, df_ns]).sort_index()
            .reset_index(drop=True)
            .drop(['mask_idx_walkin', 'mask_idx_ns'], axis=1))
and you get:
print(new_df)
B_ID no_show Session slot_num walkin ns_w c_ns_w c_walkin \
0 1 0.4 S1 1 0.2 0.2 0.2 0.2
1 2 0.3 S1 2 0.5 -0.2 0.2 0.7
2 3 0.8 S1 3 0.5 0.3 0.5 1.2
3 walkin1 0.0 S1 3 0.0 0.3 0.5 1.2
4 4 0.3 S1 4 0.8 -0.5 0.0 2.0
5 walkin2 0.0 S1 4 0.0 -0.5 0.0 2.0
6 5 0.6 S1 5 0.4 0.2 0.2 2.4
7 6 0.8 S1 6 0.2 0.6 0.8 2.6
8 7 0.9 S1 7 0.1 0.8 1.4 2.7
9 overbook1 0.5 S1 7 0.0 0.5 1.4 2.7
10 8 0.4 S1 8 0.5 -0.1 1.3 3.2
11 walkin3 0.0 S1 8 0.0 -0.1 1.3 3.2
12 9 0.6 S1 9 0.1 0.5 1.8 3.3
13 overbook2 0.5 S1 9 0.0 0.5 1.8 3.3
14 12 0.9 S2 1 0.9 0.0 0.0 0.9
15 13 0.5 S2 2 0.4 0.1 0.1 1.3
16 walkin1 0.0 S2 2 0.0 0.1 0.1 1.3
17 14 0.3 S2 3 0.1 0.2 0.3 1.4
18 15 0.7 S2 4 0.4 0.3 0.6 1.8
19 20 0.7 S2 5 0.1 0.6 1.2 1.9
20 overbook1 0.5 S2 5 0.0 0.5 1.2 1.9
21 16 0.6 S2 6 0.3 0.3 1.5 2.2
22 walkin2 0.0 S2 6 0.0 0.3 1.5 2.2
23 overbook2 0.5 S2 6 0.0 0.5 1.5 2.2
24 17 0.8 S2 7 0.5 0.3 1.8 2.7
25 19 0.3 S2 8 0.8 -0.5 1.3 3.5
26 walkin3 0.0 S2 8 0.0 -0.5 1.3 3.5
u_c_walkin u_c_ns_w
0 0.2 0.2
1 0.7 0.2
2 1.2 0.5
3 0.2 0.5
4 1.0 0.0
5 0.0 0.0
6 0.4 0.2
7 0.6 0.8
8 0.7 1.4
9 0.7 0.9
10 1.2 0.8
11 0.2 0.8
12 0.3 1.3
13 0.3 0.8
14 0.9 0.0
15 1.3 0.1
16 0.3 0.1
17 0.4 0.3
18 0.8 0.6
19 0.9 1.2
20 0.9 0.7
21 1.2 1.0
22 0.2 1.0
23 1.2 0.5
24 0.7 0.8
25 1.5 0.3
26 0.5 0.3
I want to read every nth row of a list of DataFrames and create a new DataFrame by appending all the nth rows.
Let's say we have the following DataFrames:
>>> df1
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 -0.1 -0.9 0.2 -0.7
2 0.7 -3.3 -1.1 -0.4
>>> df2
A B C D
0 1.4 -0.7 1.5 -1.3
1 1.6 1.4 1.4 0.2
2 -1.4 0.2 -1.7 0.7
>>> df3
A B C D
0 0.3 -0.5 -1.6 -0.8
1 0.2 -0.5 -1.1 1.6
2 -0.3 0.7 -1.0 1.0
I have used the following approach to get the desired df:
df = pd.DataFrame()
df_list = [df1, df2, df3]
for i in range(len(df1)):
    for x in df_list:
        df = df.append(x.loc[i], ignore_index=True)
Here's the result:
>>> df
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
I was just wondering if there is a pandas way of rewriting this code which would do the same thing (maybe by using .iterrows, pd.concat, pd.join, or pd.merge)?
Cheers
Update
Simply appending one df after another is not what I am looking for here.
The code should do:
df.row1 = df1.row1
df.row2 = df2.row1
df.row3 = df3.row1
df.row4 = df1.row2
df.row5 = df2.row2
df.row6 = df3.row2
...
For a single output dataframe, you can concatenate and sort by index:
res = pd.concat([df1, df2, df3]).sort_index().reset_index(drop=True)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
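One caveat: sort_index is not guaranteed to be stable by default, so rows sharing an index value may not keep the df1, df2, df3 order. If you want to make the interleaving explicit, you can request a stable sort (a minor variation on the line above):
# force a stable sort so ties keep their concatenation order (df1, then df2, then df3)
res = pd.concat([df1, df2, df3]).sort_index(kind='stable').reset_index(drop=True)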
For a dictionary of dataframes, you can concatenate and then group by index:
res = dict(tuple(pd.concat([df1, df2, df3]).groupby(level=0)))
With the dictionary defined as above, each value represents a row number. For example, res[0] will give the first row from each input dataframe.
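For example, with the sample frames above, res[0] collects the first row of each input (all sharing index 0):
>>> res[0]
     A    B    C    D
0 -0.8 -2.8 -0.3 -0.1
0  1.4 -0.7  1.5 -1.3
0  0.3 -0.5 -1.6 -0.8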
There is pd.concat:
df = pd.concat([df1, df2, df3]).reset_index(drop=True)
As recommended by Jez:
df = pd.concat([df1, df2, df3], ignore_index=True)
Try:
>>> df1 = pd.DataFrame({'A': [-0.8, -0.1, 0.7],
...                     'B': [-2.8, -0.9, -3.3],
...                     'C': [-0.3, 0.2, -1.1],
...                     'D': [-0.1, -0.7, -0.4]})
>>>
>>> df2 = pd.DataFrame({'A': [1.4, 1.6, -1.4],
...                     'B': [-0.7, 1.4, 0.2],
...                     'C': [1.5, 1.4, -1.7],
...                     'D': [-1.3, 0.2, 0.7]})
>>>
>>> df3 = pd.DataFrame({'A': [0.3, 0.2, -0.3],
...                     'B': [-0.5, -0.5, 0.7],
...                     'C': [-1.6, -1.1, -1.0],
...                     'D': [-0.8, 1.6, 1.0]})
>>> df = pd.concat([df1, df2, df3], ignore_index=True)
>>> print(df)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 -0.1 -0.9 0.2 -0.7
2 0.7 -3.3 -1.1 -0.4
3 1.4 -0.7 1.5 -1.3
4 1.6 1.4 1.4 0.2
5 -1.4 0.2 -1.7 0.7
6 0.3 -0.5 -1.6 -0.8
7 0.2 -0.5 -1.1 1.6
8 -0.3 0.7 -1.0 1.0
OR
df = pd.concat([df1, df2, df3], axis=0, join='outer', ignore_index=True)
Note:
axis: whether to concatenate along rows (0) or columns (1).
join: either 'inner' or 'outer' (concat accepts only these two). With 'outer', the result keeps the union of the columns, sorted lexicographically when the frames' columns differ.
ignore_index: whether to keep the original row labels. By default False; if True, the original index labels are discarded and replaced with 0, 1, ..., n-1.
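As a quick illustration of join, using two throwaway frames that share only column A:
df_a = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_b = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
pd.concat([df_a, df_b], join='outer', ignore_index=True)  # columns A, B, C; NaN where a frame lacks a column
pd.concat([df_a, df_b], join='inner', ignore_index=True)  # only the shared column A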
You can concatenate them keeping their original indexes as a column this way:
df_total = pd.concat([df1.reset_index(), df2.reset_index(),
                      df3.reset_index()])
>> df_total
index A B C D
0 0 -0.8 -2.8 -0.3 -0.1
1 1 -0.1 -0.9 0.2 -0.7
2 2 0.7 -3.3 -1.1 -0.4
0 0 1.4 -0.7 1.5 -1.3
1 1 1.6 1.4 1.4 0.2
2 2 -1.4 0.2 -1.7 0.7
0 0 0.3 -0.5 -1.6 -0.8
1 1 0.2 -0.5 -1.1 1.6
2 2 -0.3 0.7 -1.0 1.0
Then you can make a multiindex dataframe and order by index:
df_joined = df_total.reset_index(drop=True).reset_index()
>> df_joined
level_0 index A B C D
0 0 0 -0.8 -2.8 -0.3 -0.1
1 1 1 -0.1 -0.9 0.2 -0.7
2 2 2 0.7 -3.3 -1.1 -0.4
3 3 0 1.4 -0.7 1.5 -1.3
4 4 1 1.6 1.4 1.4 0.2
5 5 2 -1.4 0.2 -1.7 0.7
6 6 0 0.3 -0.5 -1.6 -0.8
7 7 1 0.2 -0.5 -1.1 1.6
8 8 2 -0.3 0.7 -1.0 1.0
>> df_joined = df_joined.set_index(['index', 'level_0']).sort_index()
>> df_joined
A B C D
index level_0
0 0 -0.8 -2.8 -0.3 -0.1
3 1.4 -0.7 1.5 -1.3
6 0.3 -0.5 -1.6 -0.8
1 1 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
7 0.2 -0.5 -1.1 1.6
2 2 0.7 -3.3 -1.1 -0.4
5 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
You can put all this in a flat dataframe just by doing:
>> pd.DataFrame(df_joined.values, columns = df_joined.columns)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
Hi, I'm not sure if this has already been asked before, but I couldn't find an answer that satisfied me.
If you had X, Y, and temperature data (see below for an example), how would you plot the temperature contour of the data in Python?
X Y Temp
0 0 23
0.1 0 23
0.2 0 23
0.3 0 23
0.4 0 23
0.5 0 23
0.6 0 23
0.7 0 23
0.8 0 23
0.9 0 23
1 0 23
0 0.1 23
0.1 0.1 23
0.2 0.1 23
0.3 0.1 23
0.4 0.1 23
0.5 0.1 23
0.6 0.1 23
0.7 0.1 23
0.8 0.1 23
0.9 0.1 23
1 0.1 23
0 0.2 24
0.1 0.2 24
0.2 0.2 24
0.3 0.2 24
0.4 0.2 24
0.5 0.2 24
0.6 0.2 24
0.7 0.2 24
0.8 0.2 24
0.9 0.2 24
1 0.2 24
0 0.3 23
0.1 0.3 23
0.2 0.3 23
0.3 0.3 23
0.4 0.3 23
0.5 0.3 23
0.6 0.3 23
0.7 0.3 23
0.8 0.3 23
0.9 0.3 23
1 0.3 23
0 0.4 25
0.1 0.4 25
0.2 0.4 25
0.3 0.4 25
0.4 0.4 25
0.5 0.4 25
0.6 0.4 25
0.7 0.4 25
0.8 0.4 25
0.9 0.4 25
1 0.4 25
0 0.5 23
0.1 0.5 23
0.2 0.5 23
0.3 0.5 23
0.4 0.5 23
0.5 0.5 23
0.6 0.5 23
0.7 0.5 23
0.8 0.5 23
0.9 0.5 23
1 0.5 23
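A minimal sketch of one way to do this with matplotlib (the filename data.txt and the whitespace-separated layout are assumptions): plt.tricontourf triangulates the scattered points and draws filled contours directly from the three flat columns; for data on a regular grid you could also pivot to 2-D arrays and use plt.contourf.
import matplotlib.pyplot as plt
import pandas as pd

# assumed: a whitespace-separated file holding the X, Y, Temp columns shown above
df = pd.read_csv('data.txt', sep=r'\s+')

fig, ax = plt.subplots()
tcf = ax.tricontourf(df['X'], df['Y'], df['Temp'])
fig.colorbar(tcf, label='Temp')
ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.show()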
The following extract is from a 500-row table that I'm trying to build a numpy lookup function for. My problem is that the values are non-linear.
The user enters a density, volume, and content, so the function will be:
def capacity_lookup(density, volume, content):
For example, a typical user entry would be capacity_lookup(47, 775, 41.3). The function should interpolate between the volumes 45 and 50, the densities 600 and 800, and the contents 40 and 45.
The table extract is:
Volume  Density             Content
                  <=30    35    40    45   >=50
45.0 <=100 0.1 1.8 0.9 2.0 0.3
45.0 200 1.5 1.6 1.4 2.4 3.0
45.0 400 0.4 2.1 0.9 1.8 2.5
45.0 600 1.3 0.8 0.2 1.7 1.9
45.0 800 0.6 0.9 0.8 0.4 0.2
45.0 1000 0.3 0.8 0.5 0.3 1.0
45.0 1200 0.6 0.0 0.6 0.2 0.2
45.0 1400 0.6 0.4 0.3 0.7 0.1
45.0 >=1600 0.3 0.0 0.6 0.1 0.3
50.0 <=100 0.1 0.0 0.5 0.9 0.2
50.0 200 1.3 0.4 0.8 0.2 2.7
50.0 400 0.4 0.1 0.7 1.3 1.7
50.0 600 0.8 0.7 0.1 1.2 1.6
50.0 800 0.5 0.3 0.4 0.2 0.0
50.0 1000 0.2 0.4 0.4 0.2 0.3
50.0 1200 0.4 0.0 0.0 0.2 0.0
50.0 1400 0.0 0.3 0.1 0.5 0.1
50.0 >=1600 0.1 0.0 0.0 0.0 0.2
55.0 <=100 0.0 0.0 0.4 0.6 0.1
55.0 200 0.8 0.3 0.7 0.1 1.2
55.0 400 0.3 0.1 0.3 1.1 0.7
55.0 600 0.4 0.3 0.0 0.6 0.1
55.0 800 0.0 0.0 0.0 0.2 0.0
55.0 1000 0.2 0.1 0.2 0.1 0.3
55.0 1200 0.1 0.0 0.0 0.1 0.0
55.0 1400 0.0 0.2 0.0 0.2 0.1
55.0 >=1600 0.0 0.0 0.0 0.0 0.1
Question
How can I store the 500-row table so I can interpolate its non-linear data and get the correct value based on user input?
Clarifications
If the user inputs the vector (775, 47, 41.3), the program should return an interpolated value between the following four vectors: (45.0, 600, 0.2, 1.7), (45.0, 800, 0.8, 0.4), (50.0, 600, 0.1, 1.2), and (50.0, 800, 0.4, 0.2)
Assume data will be pulled from a DB as a numpy array of your design
The first difficulty I found was the <= and >=, which I handled by duplicating the extremities for Density and changing their values to very close dummy values (99 and 1601), which will not affect the interpolation.
Volume  Density             Content
                  <=30    35    40    45   >=50
45.0 99 0.1 1.8 0.9 2.0 0.3
45.0 100 0.1 1.8 0.9 2.0 0.3
45.0 200 1.5 1.6 1.4 2.4 3.0
45.0 400 0.4 2.1 0.9 1.8 2.5
45.0 600 1.3 0.8 0.2 1.7 1.9
45.0 800 0.6 0.9 0.8 0.4 0.2
45.0 1000 0.3 0.8 0.5 0.3 1.0
45.0 1200 0.6 0.0 0.6 0.2 0.2
45.0 1400 0.6 0.4 0.3 0.7 0.1
45.0 1600 0.3 0.0 0.6 0.1 0.3
45.0 1601 0.3 0.0 0.6 0.1 0.3
50.0 99 0.1 0.0 0.5 0.9 0.2
50.0 100 0.1 0.0 0.5 0.9 0.2
50.0 200 1.3 0.4 0.8 0.2 2.7
50.0 400 0.4 0.1 0.7 1.3 1.7
50.0 600 0.8 0.7 0.1 1.2 1.6
50.0 800 0.5 0.3 0.4 0.2 0.0
50.0 1000 0.2 0.4 0.4 0.2 0.3
50.0 1200 0.4 0.0 0.0 0.2 0.0
50.0 1400 0.0 0.3 0.1 0.5 0.1
50.0 1600 0.1 0.0 0.0 0.0 0.2
50.0 1601 0.1 0.0 0.0 0.0 0.2
55.0 99 0.0 0.0 0.4 0.6 0.1
55.0 100 0.0 0.0 0.4 0.6 0.1
55.0 200 0.8 0.3 0.7 0.1 1.2
55.0 400 0.3 0.1 0.3 1.1 0.7
55.0 600 0.4 0.3 0.0 0.6 0.1
55.0 800 0.0 0.0 0.0 0.2 0.0
55.0 1000 0.2 0.1 0.2 0.1 0.3
55.0 1200 0.1 0.0 0.0 0.1 0.0
55.0 1400 0.0 0.2 0.0 0.2 0.1
55.0 1600 0.0 0.0 0.0 0.0 0.1
55.0 1601 0.0 0.0 0.0 0.0 0.1
Then, as #Jaime already pointed out, you have to find 8 vertices in order to do the tri-linear interpolation.
The following algorithm will give you the points:
import numpy as np

def get_8_points(filename, vi, di, ci):
    a = np.loadtxt(filename, skiprows=2)
    # flatten the table: one (volume, density, content, value) row per cell
    vol = a[:, 0].repeat(a.shape[1]-2)
    den = a[:, 1].repeat(a.shape[1]-2)
    # FIXME: maybe you have to change the next line
    con = np.tile(np.array([30., 35., 40., 45., 50.]), a.shape[0])
    val = a[:, 2:].ravel()
    # pick the two closest grid values along each axis
    u = np.unique(vol)
    diff = np.absolute(u - vi)
    vols = u[diff.argsort()][:2]
    u = np.unique(den)
    diff = np.absolute(u - di)
    dens = u[diff.argsort()][:2]
    u = np.unique(con)
    diff = np.absolute(u - ci)
    cons = u[diff.argsort()][:2]
    check = np.in1d(vol, vols) & np.in1d(den, dens) & np.in1d(con, cons)
    points = np.vstack((vol[check], den[check], con[check], val[check]))
    return points.T
Using your example:
vi, di, ci = 47, 775, 41.3
points = get_8_points(filename, vi, di, ci)
#array([[ 4.50e+01, 6.00e+02, 4.00e+01, 2.00e-01],
# [ 4.50e+01, 6.00e+02, 4.50e+01, 1.70e+00],
# [ 4.50e+01, 8.00e+02, 4.00e+01, 8.00e-01],
# [ 4.50e+01, 8.00e+02, 4.50e+01, 4.00e-01],
# [ 5.00e+01, 6.00e+02, 4.00e+01, 1.00e-01],
# [ 5.00e+01, 6.00e+02, 4.50e+01, 1.20e+00],
# [ 5.00e+01, 8.00e+02, 4.00e+01, 4.00e-01],
# [ 5.00e+01, 8.00e+02, 4.50e+01, 2.00e-01]])
Now you can perform the tri-linear interpolation...
To complement Saullo's answer, here's how to do trilinear interpolation. You basically interpolate the cube into a square, then the square into a segment, and the segment into a point; the order of the interpolations does not alter the final result. Saullo's numbering scheme is already the right one: the base vertex is number 0, increasing the last dimension adds 1 to the vertex number, the second-to-last adds 2, and the first dimension adds 4. So from his vertex-returning function, you could do the following:
import numpy as np

coords = np.array([47, 775, 41.3])
ndim = len(coords)
# You would get this with a call to:
# vertices = get_8_points(filename, *coords)
vertices = np.array([[4.50e+01, 6.00e+02, 4.00e+01, 2.00e-01],
                     [4.50e+01, 6.00e+02, 4.50e+01, 1.70e+00],
                     [4.50e+01, 8.00e+02, 4.00e+01, 8.00e-01],
                     [4.50e+01, 8.00e+02, 4.50e+01, 4.00e-01],
                     [5.00e+01, 6.00e+02, 4.00e+01, 1.00e-01],
                     [5.00e+01, 6.00e+02, 4.50e+01, 1.20e+00],
                     [5.00e+01, 8.00e+02, 4.00e+01, 4.00e-01],
                     [5.00e+01, 8.00e+02, 4.50e+01, 2.00e-01]])
for dim in range(ndim):
    vtx_delta = 2**(ndim - dim - 1)
    for vtx in range(vtx_delta):
        # collapse the pair of vertices along this dimension into one
        vertices[vtx, -1] += ((vertices[vtx + vtx_delta, -1] -
                               vertices[vtx, -1]) *
                              (coords[dim] - vertices[vtx, dim]) /
                              (vertices[vtx + vtx_delta, dim] -
                               vertices[vtx, dim]))

print(vertices[0, -1])  # prints 0.55075
The loop reuses the vertices array for the intermediate interpolations leading to the final value, which ends up stored in vertices[0, -1]; you would have to make a copy of the vertices array if you need it afterwards.
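For reuse, the loop can be wrapped in a small function that copies the vertex array first (a sketch based on the code above, not part of the original answer):
import numpy as np

def trilinear(vertices, coords):
    # interpolate the value column of an (8, ndim+1) vertex array at coords
    v = np.asarray(vertices, dtype=float).copy()  # keep the caller's array intact
    ndim = len(coords)
    for dim in range(ndim):
        vtx_delta = 2**(ndim - dim - 1)
        for vtx in range(vtx_delta):
            v[vtx, -1] += ((v[vtx + vtx_delta, -1] - v[vtx, -1]) *
                           (coords[dim] - v[vtx, dim]) /
                           (v[vtx + vtx_delta, dim] - v[vtx, dim]))
    return v[0, -1]

print(trilinear(vertices, coords))  # 0.55075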