Slicing Dataframe with elements as lists - python

My dataframe has lists as elements, and I want a more efficient way to check some conditions. My dataframe looks like this:
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
2 300 [3]
I want to get only those rows which have 1 in col_b.
I have tried the naive way:
temp_list = list()
for i in range(len(df1.index)):
    if 1 in df1.iloc[i, 1]:
        temp_list.append(df1.iloc[i, 0])
This takes a lot of time on big dataframes. How can I make the search more efficient?

df[df.col_b.apply(lambda x: 1 in x)]
Results in:
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]

Use boolean indexing with a list comprehension and loc to select column col_a:
a = df1.loc[[1 in x for x in df1['col_b']], 'col_a'].tolist()
print (a)
[100, 200]
If you need to select the first column by position:
a = df1.iloc[[1 in x for x in df1['col_b']], 0].tolist()
print (a)
[100, 200]
If you need all matching rows:
df2 = df1[[1 in x for x in df1['col_b']]]
print (df2)
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
Another solution with sets and isdisjoint:
df2 = df1[~df1['col_b'].map({1}.isdisjoint)]
print (df2)
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
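For completeness, a sketch of one more vectorized option (my addition, not from the answers above; it assumes pandas >= 0.25, where Series.explode was introduced):
# Explode the lists so each element gets its own row (the index repeats),
# test each element against 1, then ask per original row whether any matched.
mask = df1['col_b'].explode().eq(1).groupby(level=0).any()
df2 = df1[mask]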

You can use a list comprehension to check whether 1 is present in a given list, and use the result to perform boolean indexing on the dataframe:
df.loc[[1 in i for i in df.col_b], :]
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
Here's another approach using sets (a caveat: the set round-trip can reorder elements or drop duplicates, so the inequality test can misfire on lists that aren't already sorted and duplicate-free):
df[df.col_b.ne(df.col_b.map(set).sub({1}).map(list))]
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]

I experimented with this approach:
# eval() is only needed if col_b holds strings like "[1, 2, 3]" rather than real lists
df['col_b'] = df.apply(lambda x: eval(x['col_b']), axis=1)
s = df['col_b']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is equivalent
d = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.concat([df, d], axis=1)
print(df)
print('...')
print(df[1.0])
That gave me indicator columns like this at the end (note the column is literally named by the number 1.0):
id col_a col_b 1.0 2.0 3.0
0 1 100 (1, 2, 3) 1 1 1
1 2 200 (1, 2) 1 1 0
2 3 300 3 0 0 1
...
0 1
1 1
2 0
Name: 1.0, dtype: uint8
To print out the result:
df.loc[df[1.0]==1, ['id', 'col_a', 'col_b']]
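On newer pandas, a leaner sketch of the same one-hot idea (my addition; apply(pd.Series) is slow on big frames, and Series.explode needs pandas >= 0.25):
# Explode the lists to one element per row, dummy-encode the elements,
# then collapse back to one indicator row per original index.
d = pd.get_dummies(df['col_b'].explode()).groupby(level=0).max()
df = pd.concat([df, d], axis=1)
The indicator columns come out named by the element values themselves.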

Related

List in dataframe is different to the order it appears in the original list?

So let's say I have a list of lists:
data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0 , 3, 4], [0, 0, 0, 4]]
When I am trying to output this into a dataframe, the dataframe is appearing as follows:
df = pd.DataFrame(data)
Current (incorrect) output:
list 1  list 2  list 3  list 4
     1       2       3       4
     2       3       4       0
     3       4       0       0
     4       0       0       0
This is the output I am hoping for:
list 1  list 2  list 3  list 4
     1       0       0       0
     2       2       0       0
     3       3       3       0
     4       4       4       4
Does anyone have any suggestion on how to fix this?
Each sublist in data is supposed to become a column in df, not a row, so you have to transpose data and then turn it into a df, like this:
import numpy as np
df = pd.DataFrame(np.array(data).T)
Alternatively, if data contains strings, you could do:
df = pd.DataFrame(list(map(list, zip(*data))))
which transposes the list of lists and then turns into a df.
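A quick check of both routes (my sketch; the 'list N' column names are added here to match the desired output and are not part of the original answer):
import numpy as np
import pandas as pd

data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0, 3, 4], [0, 0, 0, 4]]
cols = ['list 1', 'list 2', 'list 3', 'list 4']

# Route 1: transpose with numpy (fine for homogeneous numeric data).
df = pd.DataFrame(np.array(data).T, columns=cols)

# Route 2: pure-Python transpose, also safe for mixed/string data.
df2 = pd.DataFrame(list(map(list, zip(*data))), columns=cols)

print(df)
#    list 1  list 2  list 3  list 4
# 0       1       0       0       0
# 1       2       2       0       0
# 2       3       3       3       0
# 3       4       4       4       4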
IIUC, you want to "push" the 0s to the end.
You can use sorted with a custom key.
The exact output you expect is ambiguous since the data is symmetric, but you can transpose if needed:
data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0 , 3, 4], [0, 0, 0, 4]]
df = (pd.DataFrame([sorted(l, key=lambda x: x == 0) for l in data])
        .T  # remove the ".T" if you don't want to transpose
      )
NB. If you want to push the 0s to the end and sort the other values, use lambda x: (x==0, x) as key
Output:
0 1 2 3
0 1 2 3 4
1 2 3 4 0
2 3 4 0 0
3 4 0 0 0

Calculate sum based on multiple rows from list column for each row in pandas dataframe

I have a dataframe that looks something like this:
df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
id col_to_sum list_col
0 0 0.557736 []
1 1 0.147333 [1]
2 2 0.538681 [1, 2, 3]
3 3 0.040329 [2]
4 4 0.984439 [3, 1]
In reality I have more columns and ~30000 rows but the extra columns are irrelevant for this. Note that all the list elements are from the id column and that the id column is not necessarily the same as the index.
I want to make a new column that for each row sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:
id col_to_sum list_col sum
0 0 0.557736 [] 0.000000
1 1 0.147333 [1] 0.147333
2 2 0.538681 [1, 2, 3] 0.726343
3 3 0.040329 [2] 0.538681
4 4 0.984439 [3, 1] 0.187662
I have found a way to do this, but it requires looping through the entire dataframe and is quite slow on the larger df with ~30000 rows (~6 min). The way I found was:
df['sum'] = 0
for i in range(len(df)):
    mask = df['id'].isin(df['list_col'].iloc[i])
    df.loc[i, 'sum'] = df.loc[mask, 'col_to_sum'].sum()
Ideally I would want a vectorized way to do this but I haven't been able to do it. Any help is greatly appreciated.
I'm using non-random values in this demo because they're easier to reproduce and check.
I'm also using an id-column ([0, 1, 3, 2, 4]) that is not identical to the index.
Setup:
>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
id col_to_sum list_col
0 0 1 []
1 1 2 [1]
2 3 3 [1, 2, 3]
3 2 4 [2]
4 4 5 [3, 1]
Solution:
df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()
Output:
>>> df
id col_to_sum list_col sum
0 0 1 [] 0
1 1 2 [1] 2
2 3 3 [1, 2, 3] 9
3 2 4 [2] 4
4 4 5 [3, 1] 5
You can use a lambda that takes each row's list_col and positionally indexes col_to_sum with it; note that iloc selects by position, so this matches ids correctly only when id equals the row position:
df['sum_col'] = df['list_col'].apply(lambda x: df['col_to_sum'].iloc[x].sum())
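For a fully vectorized alternative, here is a sketch (my addition; it assumes every value in list_col appears in the id column, and pandas >= 0.25 for explode):
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 3, 2, 4],
                   'col_to_sum': [1, 2, 3, 4, 5],
                   'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})

# One row per (original row, referenced id); empty lists explode to NaN.
exploded = df['list_col'].explode()

# Map each referenced id to its col_to_sum value, then sum back per
# original row index; the all-NaN groups from empty lists sum to 0.
id_to_val = df.set_index('id')['col_to_sum']
df['sum'] = exploded.map(id_to_val).groupby(level=0).sum()
This avoids one .loc lookup per row, which is where the apply-based solutions spend most of their time.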

pandas dataframe select list value from another column

Everyone! I have a pandas dataframe like this:
A B
0 [1,2,3] 0
1 [2,3,4] 1
As we can see, each value in the A column is a list and the B column holds an index. I want to get a C column that is the element of A indexed by B:
A B C
0 [1,2,3] 0 1
1 [2,3,4] 1 3
Is there any elegant method to solve this? Thank you!
Use list comprehension with indexing:
df['C'] = [x[y] for x, y in df[['A','B']].to_numpy()]
Or DataFrame.apply, but it should be slower for a large DataFrame:
df['C'] = df.apply(lambda x: x.A[x.B], axis=1)
print (df)
A B C
0 [1, 2, 3] 0 1
1 [2, 3, 4] 1 3
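A similar comprehension with zip (my addition) avoids materializing the intermediate numpy array, which can matter on large frames:
# Pair each list with its index directly, no to_numpy() round-trip.
df['C'] = [a[b] for a, b in zip(df['A'], df['B'])]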

Create a new column in the dataframe that has values of other columns for that row in a list

I want to change this dataframe into one where the values of each row are collected into a list in a new column (the before and after frames were posted as images, which are no longer available). How should I use the apply function to achieve this?
Try this:
df['bbox'] = df.apply(lambda x: [y for y in x], axis=1)
so for a df that looks like:
In [15]: df
Out[15]:
a b c
0 1 3 1
1 2 4 1
2 3 5 1
3 4 6 1
you'll get:
In [16]: df['bbox'] = df.apply(lambda x: [y for y in x], axis=1)
In [17]: df
Out[17]:
a b c bbox
0 1 3 1 [1, 3, 1]
1 2 4 1 [2, 4, 1]
2 3 5 1 [3, 5, 1]
3 4 6 1 [4, 6, 1]
Hope this helps!
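An apply-free variant that is usually much faster (my addition; it assumes the a, b, c columns from the example above should all go into the list):
# .values.tolist() converts the rows to plain Python lists in one shot.
df['bbox'] = df[['a', 'b', 'c']].values.tolist()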
As per your example, to achieve the required result you need to convert each row into a list and add that list to a new DataFrame. Once you have added the list column, you can apply whatever calculation you want on it (your output DataFrame values differ from the input DataFrame, so presumably some calculation is done on each cell or row).
import pandas as pd
data = {'x': [121, 216, 49], 'y': [204, 288, 449], 'w': [108, 127, 184]}
df = pd.DataFrame(data, columns=['x', 'y', 'w'])
new_data = [[row.to_list()] for i, row in df.iterrows()]
new_df = pd.DataFrame(new_data, columns=['bbox'])  # columns must be a list, not a bare string
print(new_df)
bbox
0 [121, 204, 108]
1 [216, 288, 127]
2 [49, 449, 184]

Why can't I get the correct mask column in pandas

For example, if I have a data frame
x f
0 0 [0, 1]
1 1 [3]
2 2 [2, 3, 4]
3 3 [3, 6]
4 4 [3, 5]
I want to remove the rows where the value in column x is not in the list in column f. I tried with where and apply, but I can't get the expected results. I got the table below, and I want to know why rows 0, 2, 3 are 0 instead of 1:
x f mask
0 0 [0, 1] 0
1 1 [3] 0
2 2 [2, 3, 4] 0
3 3 [3, 6] 0
4 4 [4, 5] 0
Does anyone know why? And what should I do to handle this number-vs-list case?
df1 = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'f': [[0, 1], [3], [2, 3, 4], [3, 6], [3, 5]]}, index=[0, 1, 2, 3, 4])
df1['mask'] = np.where(df1.x.values in df1.f.values, 1, 0)
Here it is necessary to test the values by pairs: df1.x.values in df1.f.values is evaluated once for the whole arrays and produces a single boolean, so every row gets the same value. Solution with in in a list comprehension:
df1['mask'] = np.where([a in b for a, b in df1[['x', 'f']].values],1,0)
Or with DataFrame.apply and axis=1:
df1['mask'] = np.where(df1.apply(lambda x: x.x in x.f, axis=1),1,0)
print (df1)
x f mask
0 0 [0, 1] 1
1 1 [3] 0
2 2 [2, 3, 4] 1
3 3 [3, 6] 1
4 4 [3, 5] 0
IIUC, expand the lists into a padded DataFrame, then use isin:
pd.DataFrame(df1.f.tolist()).isin(df1.x).any(axis=1).astype(int)
Out[10]:
0 1
1 0
2 1
3 1
4 0
dtype: int32
df1['mask'] = pd.DataFrame(df1.f.tolist()).isin(df1.x).any(axis=1).astype(int)
