Hi everyone! I have a pandas DataFrame like this:
A B
0 [1,2,3] 0
1 [2,3,4] 1
As you can see, column A holds a list and column B holds an index value. I want a C column whose value is the element of A at position B:
A B C
0 [1,2,3] 0 1
1 [2,3,4] 1 3
Is there any elegant method to solve this? Thank you!
Use list comprehension with indexing:
df['C'] = [x[y] for x, y in df[['A','B']].to_numpy()]
Or DataFrame.apply, though it should be slower for large DataFrames:
df['C'] = df.apply(lambda x: x.A[x.B], axis=1)
print (df)
A B C
0 [1, 2, 3] 0 1
1 [2, 3, 4] 1 3
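If you want to check the performance claim yourself, here is a rough timing sketch (the synthetic frame and the use of timeit are my own assumptions; exact numbers depend on your machine):
import timeit
import numpy as np
import pandas as pd

# build a synthetic frame: 100k rows of 3-element lists plus an index column
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': [list(r) for r in rng.integers(0, 10, size=(100_000, 3))],
                   'B': rng.integers(0, 3, size=100_000)})

# list comprehension over the two columns as a NumPy object array
t_comp = timeit.timeit(lambda: [x[y] for x, y in df[['A', 'B']].to_numpy()], number=10)

# row-wise apply
t_apply = timeit.timeit(lambda: df.apply(lambda r: r.A[r.B], axis=1), number=10)

print(f'list comprehension: {t_comp:.2f}s, apply: {t_apply:.2f}s')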
I have a dataframe like this
df = pd.DataFrame({'a' : [1,1,0,0], 'b': [0,1,1,0], 'c': [0,0,1,1]})
I want to get
a b c
a 2 1 0
b 1 2 1
c 0 1 2
where a, b, c are the column names, and each value counts the rows that have a 1 in both columns: filtering on one column being 1 and counting the 1s in the other.
For example, when df.a == 1, we count a = 2, b = 1, c = 0, etc.
I made a loop to solve
matrix = []
for name, values in df.iteritems():
    matrix.append(pd.DataFrame(df.groupby(name, as_index=False).apply(lambda x: x[x == 1].count())).values.tolist()[1])
pd.DataFrame(matrix)
But I think there must be a simpler solution, isn't there?
You appear to want the matrix product, so leverage DataFrame.dot:
df.T.dot(df)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
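This works because every entry of df is 0 or 1, so the (i, j) entry of df.T.dot(df) is the sum over rows of df[i] * df[j], and each term is 1 exactly when that row has a 1 in both columns. A quick sanity check against an explicit pairwise count (just an illustration, not needed in practice):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0, 0], 'b': [0, 1, 1, 0], 'c': [0, 0, 1, 1]})

# count rows where both columns are 1, for every pair of columns
manual = pd.DataFrame({j: {i: int(((df[i] == 1) & (df[j] == 1)).sum()) for i in df.columns}
                       for j in df.columns})

print((manual.values == df.T.dot(df).values).all())  # True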
Alternatively, if you want the same level of performance without the overhead of pandas, you could compute the product with np.dot:
v = df.values
pd.DataFrame(v.T.dot(v), index=df.columns, columns=df.columns)
Or, if you want to get cute,
(lambda a, c: pd.DataFrame(a.T.dot(a), c, c))(df.values, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
np.einsum
Not as pretty as df.T.dot(df) but how often do you see np.einsum amirite?
pd.DataFrame(np.einsum('ij,ik->jk', df, df), df.columns, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
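The subscripts 'ij,ik->jk' sum over the shared row index i, which is the same transposed product as above, so all of these answers agree (a quick check, purely illustrative):
import numpy as np

v = df.values
print(np.array_equal(np.einsum('ij,ik->jk', v, v), v.T.dot(v)))  # True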
You can do the multiplication using the @ operator for NumPy arrays.
df = pd.DataFrame(df.values.T @ df.values, df.columns, df.columns)
Numpy matmul
np.matmul(df.values.T,df.values)
Out[87]:
array([[2, 1, 0],
       [1, 2, 1],
       [0, 1, 2]], dtype=int64)
#pd.DataFrame(np.matmul(df.values.T,df.values), df.columns, df.columns)
I want to change my dataframe so that each row's values are combined into a list in a new bbox column.
How should I use the apply function to achieve this?
Try this:
df['bbox'] = df.apply(lambda x: [y for y in x], axis=1)
so for a df that looks like:
In [15]: df
Out[15]:
a b c
0 1 3 1
1 2 4 1
2 3 5 1
3 4 6 1
you'll get:
In [16]: df['bbox'] = df.apply(lambda x: [y for y in x], axis=1)
In [17]: df
Out[17]:
a b c bbox
0 1 3 1 [1, 3, 1]
1 2 4 1 [2, 4, 1]
2 3 5 1 [3, 5, 1]
3 4 6 1 [4, 6, 1]
Hope this helps!
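As a side note, if every column should go into the list anyway, a vectorized alternative to apply is to take the rows straight from the underlying array (a sketch; it assumes you want all three columns, in column order):
df['bbox'] = df[['a', 'b', 'c']].values.tolist()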
As per your example, to achieve the required result you need to convert each row into a list and add that list to a new DataFrame. Once you have added the list to the DataFrame, apply whatever calculation you want on it (your output DataFrame values differ from the input DataFrame, so I assume you have done some calculation on each cell or row).
import pandas as pd
data = {'x':[121,216,49],'y':[204,288,449],'w':[108,127,184]}
df = pd.DataFrame(data,columns=['x','y','w'])
new_data = [[row.to_list()] for i, row in df.iterrows()]
new_df = pd.DataFrame(new_data, columns=['bbox'])
print(new_df)
bbox
0 [121, 204, 108]
1 [216, 288, 127]
2 [49, 449, 184]
My dataframe has lists as elements, and I want a more efficient way to check some conditions.
My dataframe looks like this
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
2 300 [3]
I want to get only those rows which have 1 in col_b.
I have tried the naive way
temp_list = list()
for i in range(len(df1.index)):
    if 1 in df1.iloc[i, 1]:
        temp_list.append(df1.iloc[i, 0])
This takes a lot of time for big dataframes. How can I make the search more efficient for dataframes like this?
df[df.col_b.apply(lambda x: 1 in x)]
Results in:
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
Use boolean indexing with a list comprehension and loc to select column col_a:
a = df1.loc[[1 in x for x in df1['col_b']], 'col_a'].tolist()
print (a)
[100, 200]
If you need to select the first column by position:
a = df1.iloc[[1 in x for x in df1['col_b']], 0].tolist()
print (a)
[100, 200]
If you need all rows:
df2 = df1[[1 in x for x in df1['col_b']]]
print (df2)
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
Another solution with sets and isdisjoint:
df2 = df1[~df1['col_b'].map(set({1}).isdisjoint)]
print (df2)
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
You can use a list comprehension to check if 1 is present in a given list, and use the result to perform boolean indexing on the dataframe:
df.loc[[1 in i for i in df.col_b], :]
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
Here's another approach using sets:
df[df.col_b.ne(df.col_b.map(set).sub({1}).map(list))]
col_a col_b
0 100 [1, 2, 3]
1 200 [2, 1]
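To unpack that one-liner, here is a sketch of the intermediate steps: map(set) turns each list into a set, sub({1}) removes 1 via set difference, and ne keeps the rows where that removal actually changed something. Note it relies on the listified set matching the original list whenever 1 is absent.
as_sets = df.col_b.map(set)               # {1, 2, 3}, {1, 2}, {3}
without_one = as_sets.sub({1}).map(list)  # [2, 3], [2], [3]
mask = df.col_b.ne(without_one)           # True, True, False
print(df[mask])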
I experimented with this approach:
df['col_b'] = df.apply(lambda x: eval(x['col_b']), axis=1)
s = df['col_b']
d = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, d], axis=1)
print(df)
print('...')
print(df[1.0])
That gave me indicator columns like this at the end (note the new column is literally named 1.0, as a number):
id col_a col_b 1.0 2.0 3.0
0 1 100 (1, 2, 3) 1 1 1
1 2 200 (1, 2) 1 1 0
2 3 300 3 0 0 1
...
0 1
1 1
2 0
Name: 1.0, dtype: uint8
To print out the result:
df.loc[df[1.0]==1, ['id', 'col_a', 'col_b']]
I have a list of columns to create:
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zeros:
df[new_cols] = 0
I get the error:
"['new_1', 'new_2', 'new_3'] not in index"
which is true, but unfortunate, as I want to create them...
EDIT: This is a duplicate of this question: Add multiple empty columns to pandas DataFrame. However, I am keeping this one too because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer over there.
EDIT 2: While the accepted answer is the simplest, interesting one-liner solutions were posted below.
You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
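One such method, as a one-liner sketch using DataFrame.reindex (fill_value fills the newly added columns):
df = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)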
Use assign with a dictionary:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
print (df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for?
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
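For instance, a quick usage sketch on a hypothetical two-row frame:
df = pd.DataFrame({'A': [1, 2]})
df = df.assign(**dict(zip(new_cols, values)))
print(df)
#    A  new_1  new_2  new_3
# 0  1      0      0      0
# 1  2      0      0      0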
Try looping through the column names and creating each column:
for col in new_cols:
    df[col] = 0
We can use the apply function to loop through a list column in the dataframe and assign each of its elements to a new field.
For instance, take a dataframe with a list column named keys holding values like
[10,20,30]
In your case, since it's all 0, we can directly assign 0 instead of looping through. But if we have values, we can populate them as below:
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])
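If the list column is long, instead of one apply per element you can expand it in a single step (a sketch, assuming the keys column holds equal-length lists):
new_cols = ['new_01', 'new_02', 'new_03']
df[new_cols] = pd.DataFrame(df['keys'].tolist(), index=df.index)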
I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: if I get 2 as input my code is supposed to search for 2 in the dataframe and when it finds it returns the value of the other column. In the above example my code would return 0 and 3. I know that I can simply look at each row and check if any of the elements is equal to 2 but I was wondering if there is one-liner for such a problem.
UPDATE: None of the columns are index columns.
Thanks
>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64
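Note that the example in the question wants matches from either column (returning 0 and 3 for an input of 2). A minimal sketch for that, assuming the two columns are named A and B:
n = 2
result = pd.concat([df.loc[df['A'] == n, 'B'], df.loc[df['B'] == n, 'A']])
print(result.tolist())  # [3, 0]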
You may need this:
n_input = 2
df[(df == n_input).any(1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
t = [df.loc[lambda df: df['A'] == 2]]
t