Efficiently find index of DataFrame values in array - python

I have a DataFrame that resembles:
x  y   z
--------------
0  A  10
0  D  13
1  X  20
...
and I have two sorted arrays containing every possible value of x and y:
x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]
so I wrote a function:
import numpy as np

def lookup(record, lookup_list, lookup_attr):
    # binary-search the sorted lookup list for the record's attribute value
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))
and then call:
df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)
# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]
but is there a more performant way to do this? And can it be done for multiple columns at once, returning a DataFrame rather than a Series?
I tried:
np.where(np.in1d(x_values, df.x))[0]
but this removes duplicate values and that is not desired.
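For what it's worth, np.searchsorted is itself vectorised over arrays, so the per-row apply can be avoided entirely. A minimal sketch, assuming both lookup arrays really are sorted (as np.searchsorted requires):

import numpy as np

# searchsorted accepts a whole array of needles at once
x_idx = np.searchsorted(x_values, df['x'].to_numpy())
y_idx = np.searchsorted(y_values, df['y'].to_numpy())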

You can convert your index arrays to pd.Index objects to make lookup fast(er).
u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
   x  y
0  0  1
1  0  2
2  1  3
Where,
x_values
# [0, 1]
y_values
# ['a', 'A', 'D', 'X']
As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.
val_list = [x_values, y_values]  # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)

pd.DataFrame({
    f'{c}': idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})
   x  y
0  0  1
1  0  2
2  1  3
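Note that pd.Index.get_indexer returns -1 for any value that is not found (and it requires the index to be unique), so the result doubles as a cheap sanity check:

u = pd.Index([0, 1])
u.get_indexer([0, 1, 99])
# array([ 0,  1, -1])  -> 99 is not in the index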

Update: using a Series with .loc; you can also try reindex.
pd.Series(range(len(x_values)), index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
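A sketch of the reindex variant alluded to above (same mapping Series, aligned to df.x rather than indexed with .loc):

mapper = pd.Series(range(len(x_values)), index=x_values)
mapper.reindex(df.x).tolist()
# [0, 0, 1]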

Check if a value in one column is in a list in another column using pd.isin()

I have a DataFrame as below
df = pd.DataFrame({
    'x': range(0, 5),
    'y': [[0, 2], [3, 4], [2, 3], [3, 4], [7, 9]]
})
I would like to test, for each row, whether the value in x is in the list given in column y:
df[df.x.isin(df.y)]
so I would end up with:
   x       y
0  0  [0, 2]
2  2  [2, 3]
3  3  [3, 4]
Not sure why isin() does not work in this case.
df.x.isin(df.y) checks, for each element of x (e.g. 0), whether it equals one of the values of df.y taken whole (e.g. is 0 equal to the list [0, 2]? No), and so on. It never looks inside the lists.
With this, you can just do a for loop:
df[[x in y for x, y in zip(df['x'], df['y'])]]
Let us try explode together with index.unique and loc:
out = df.loc[df.explode('y').query('x==y').index.unique()]
Out[217]:
   x       y
0  0  [0, 2]
2  2  [2, 3]
3  3  [3, 4]
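For intuition, the intermediate exploded frame looks like this (the index repeats, one row per list element), after which query('x==y') compares scalars and index.unique() maps the matches back to the original rows:

df.explode('y')
#    x  y
# 0  0  0
# 0  0  2
# 1  1  3
# 1  1  4
# 2  2  2
# 2  2  3
# 3  3  3
# 3  3  4
# 4  4  7
# 4  4  9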
Just another solution:
result = (
    df
    .assign(origin_y=df.y)
    .explode('y')
    .query('x==y')
    .drop(columns=['y'])
    .rename(columns={'origin_y': 'y'})  # without columns=, rename would target the index
)
   x       y
0  0  [0, 2]
2  2  [2, 3]
3  3  [3, 4]

Splitting Pandas dataframe

I'd like to split my time-series data into X and y by shifting the data. The dummy dataframe (first two rows, as typed back in by the answer below) looks like:
   x-2  x-1  x0  x1  x2
0    3    0   5   7   1
1    2    3   0   5   6
i.e. if the time step equals 2, X and y look like: X=[3,0] -> y=[5],
X=[0,5] -> y=[7] (this should be applied to all samples (rows)).
I wrote the function below, but it returns empty matrices when I pass a pandas dataframe to it.
def create_dataset(dataset, time_step=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - time_step - 1):
        a = dataset.iloc[:, i:(i + time_step)]
        dataX.append(a)
        dataY.append(dataset.iloc[:, i + time_step])
    return np.array(dataX), np.array(dataY)
Thank you for any solutions.
Here is an example that replicates the expected output, IIUC. (Incidentally, the original function comes up empty because len(dataset) counts rows: with a 2-row frame, range(len(dataset) - time_step - 1) is range(-1), which never iterates.)
import pandas as pd

# function to process each row
def process_row(s):
    assert isinstance(s, pd.Series)
    return pd.concat([
        s.rename('timestep'),
        s.shift(-1).rename('x_1'),
        s.shift(-2).rename('x_2'),
        s.shift(-3).rename('y')
    ], axis=1).dropna(how='any', axis=0).astype(int)
# test case for the example
process_row(pd.Series([2, 3, 0, 5, 6]))

# type in first two rows of the data frame
df = pd.DataFrame(
    {'x-2': [3, 2], 'x-1': [0, 3],
     'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})

# perform the transformation
ts = list()
for idx, row in df.iterrows():
    t = process_row(row)
    t.index = [idx] * t.index.size
    ts.append(t)
print(pd.concat(ts))
# results
   timestep  x_1  x_2  y
0         3    0    5  7
0         0    5    7  1
1         2    3    0  5   <-- first part of expected results
1         3    0    5  6   <-- second part
Do you mean something like this:
df = df.shift(periods=-2, axis='columns')
# you can also pass a fill_value parameter
df = df.shift(periods=-2, axis='columns', fill_value=0)
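A quick sketch of what that shift does to the two-row frame from the previous answer (values move two columns to the left; fill_value plugs the vacated tail):

import pandas as pd

df = pd.DataFrame({'x-2': [3, 2], 'x-1': [0, 3],
                   'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})
print(df.shift(periods=-2, axis='columns', fill_value=0))
#    x-2  x-1  x0  x1  x2
# 0    5    7   1   0   0
# 1    0    5   6   0   0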

way to unpack data within dataframe

I have a df that is "packed", and I'm trying to find a way to unpack it into multiple columns and rows:
input as a df with multiple lists within a column
all_labels  values                      labels
[A,B,C]     [[10,1,3],[5,6,3],[0,0,0]]  [X,Y,Z]
desired output: unpacked df
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
I tried this for the all_labels and labels columns, but I'm not sure how to do it for the values column:
df.labels.apply(pd.Series)
df.all_labels.apply(pd.Series)
Setup
packed = pd.DataFrame({
    'all_labels': [['A', 'B', 'C']],
    'values': [[[10, 1, 3], [5, 6, 3], [0, 0, 0]]],
    'labels': [['X', 'Y', 'Z']]
})
Keep It Simple
The first three positional arguments of the DataFrame constructor are data, index, and columns, which map straight onto the three packed cells:
pd.DataFrame(packed['values'][0], packed['all_labels'][0], packed['labels'][0])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
rename and dict unpacking
The columns are so close to the argument names of the dataframe constructor, I couldn't resist...
rnm = {'all_labels': 'index', 'values': 'data', 'labels': 'columns'}
pd.DataFrame(**packed.rename(columns=rnm).loc[0])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
Without rename and list unpacking instead
Making sure to list the column names in the same order the arguments are expected by the pandas.DataFrame constructor:
pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
Bonus Material
The pandas.DataFrame.to_dict method will return a dictionary that looks similar to this.
df = pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
df.to_dict('split')
{'index': ['A', 'B', 'C'],
'columns': ['X', 'Y', 'Z'],
'data': [[10, 1, 3], [5, 6, 3], [0, 0, 0]]}
We could wrap that in another DataFrame constructor call to get back something very similar to what we started with.
pd.DataFrame([df.to_dict('split')])
index columns data
0 [A, B, C] [X, Y, Z] [[10, 1, 3], [5, 6, 3], [0, 0, 0]]
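And, mirroring the dict-unpacking idea above, the same 'split' dict rebuilds the unpacked frame directly, since its keys match the constructor's argument names:

pd.DataFrame(**df.to_dict('split'))
#     X  Y  Z
# A  10  1  3
# B   5  6  3
# C   0  0  0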

Finding smallest possible difference between lists of unequal length

I have a dataframe with two columns A and B that contains lists:
import pandas as pd
df = pd.DataFrame({"A" : [[1,5,10], [], [2], [1,2]],
"B" : [[15, 2], [], [6], []]})
I want to construct a third column C that is defined such that it is equal to the smallest possible difference between list-elements in A and B if they are non-empty, and 0 if one or both of them are empty.
For the first row the smallest difference is 1 (we take the absolute value), for the second row it is 0 because both lists are empty, for the third row it is 4, and for the fourth row it is 0 again due to one empty list, so we ultimately end up with:
df["C"] = [1, 0, 4, 0]
This isn't easily vectorisable, since you have object dtype series of lists. You can use a list comprehension with itertools.product:
from itertools import product
zipper = zip(df['A'], df['B'])
df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) for vals in zipper]
# alternative:
# df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) \
# for vals in df[['A', 'B']].values]
print(df)
#             A        B  C
# 0  [1, 5, 10]  [15, 2]  1
# 1          []       []  0
# 2         [2]      [6]  4
# 3      [1, 2]       []  0
You can use the following list comprehension, checking the min difference over the cartesian product (itertools.product) of both columns. The all(a) guard yields 0 when either list is empty, since min() of an empty sequence would raise:
from itertools import product
[min(abs(i - j) for i, j in product(*a)) if all(a) else 0 for a in df.values]
[1, 0, 4, 0]
df['C'] = df.apply(lambda row: min([abs(x - y) for x in row['A'] for y in row['B']], default=0), axis=1)
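(Here min's default=0 covers the empty-list rows without an explicit guard, since the comprehension then produces nothing to minimise.)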
I just want to introduce unnesting again:
df['Diff'] = (unnesting(df[['B']], ['B'])
              .join(unnesting(df[['A']], ['A']))
              .eval('C=B-A').C.abs().min(level=0))
df.Diff=df.Diff.fillna(0).astype(int)
df
Out[60]:
            A        B  Diff
0  [1, 5, 10]  [15, 2]     1
1          []       []     0
2         [2]      [6]     4
3      [1, 2]       []     0
FYI
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each index label once per list element in the exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten each exploded column, then stitch the columns back together
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')
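For intuition, a quick look at what the helper produces on column A alone, assuming the question's df (empty lists simply contribute no rows):

unnesting(df[['A']], ['A'])
#       A
# 0   1.0
# 0   5.0
# 0  10.0
# 2   2.0
# 3   1.0
# 3   2.0
# floats, because np.concatenate promotes around the empty lists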
I think this works
def diff(a, b):
    if len(a) > 0 and len(b) > 0:
        return min([abs(i - j) for i in a for j in b])
    return 0

df['C'] = df.apply(lambda x: diff(x.A, x.B), axis=1)
df
            A        B  C
0  [1, 5, 10]  [15, 2]  1
1          []       []  0
2         [2]      [6]  4
3      [1, 2]       []  0

How to add rows for all missing values of one multi-index's level?

Suppose that I have the following dataframe df, indexed by a 3-level multi-index:
In [52]: df
Out[52]:
          C
L0 L1 L2
0  w  P   1
   y  P   2
      R   3
1  x  Q   4
      R   5
   z  S   6
Code to create the DataFrame:
idx = pd.MultiIndex(levels=[[0, 1], ['w', 'x', 'y', 'z'], ['P', 'Q', 'R', 'S']],
                    labels=[[0, 0, 0, 1, 1, 1], [0, 2, 2, 1, 1, 3], [0, 0, 2, 1, 2, 3]],
                    names=['L0', 'L1', 'L2'])
df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6]}, index=idx)
The possible values for the L2 level are 'P', 'Q', 'R', and 'S', but some of these values are missing for particular combinations of values for the remaining levels. For example, the combination (L0=0, L1='w', L2='Q') is not present in df.
I would like to add enough rows to df so that, for each combination of values for the levels other than L2, there is exactly one row for each of the L2 level's possible values. For the added rows, the value of the C column should be 0.
IOW, I want to expand df so that it looks like this:
          C
L0 L1 L2
0  w  P   1
      Q   0
      R   0
      S   0
   y  P   2
      Q   0
      R   3
      S   0
1  x  P   0
      Q   4
      R   5
      S   0
   z  P   0
      Q   0
      R   0
      S   6
REQUIREMENTS:
the operation should leave the types of the columns unchanged;
the operation should add the smallest number of rows needed to complete only the specified level (L2)
Is there a simple way to perform this expansion?
Supposing L2 initially contains all the possible values you need, you can use the unstack/stack trick:
df.unstack('L2', fill_value=0).stack(level=1)
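A quick check of the trick, assuming the setup code above: unstack pushes L2 into the columns, fill_value=0 plugs the holes, and stacking it back restores the row MultiIndex. Because no NaN is ever introduced, the C column should keep its integer dtype, satisfying the first requirement:

out = df.unstack('L2', fill_value=0).stack(level=1)
print(out['C'].dtype)  # int64 -- dtype unchanged
print(out.shape)       # (16, 1) -- 4 (L0, L1) pairs x 4 L2 values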
