Multiply all elements of a column in a pandas dataframe - python

I have a pandas dataframe:
Idx  A  B  C
  1  2  5  1
  2  1  2  2
  3  3  1  1
  4  2  3  0
I want to calculate the product of all elements in each column, e.g.
- P_A = 2*1*3*2 = 12
- P_B = 5*2*1*3 = 30
- P_C = 1*2*1*0 = 0
Ideally, the result would be in a list format [P_A, P_B, P_C].
What is the most efficient way to compute this?

Try:
>>> df[['A', 'B', 'C']].prod().tolist()
[12, 30, 0]
>>>
Or:
>>> df.set_index('Idx').prod().tolist()
[12, 30, 0]
>>>
Or with filter (a negative lookahead, so only the Idx column is excluded):
>>> df.filter(regex='^(?!Idx$)').prod().tolist()
[12, 30, 0]
>>>
Or with iloc:
>>> df.iloc[:, 1:].prod().tolist()
[12, 30, 0]
>>>
Or with drop:
>>> df[df.columns.drop('Idx')].prod().tolist()
[12, 30, 0]
>>>
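Since the question asks for the most efficient approach: dropping to the raw numpy array avoids some pandas overhead (a sketch; the column selection is assumed to match the question):
>>> import numpy as np
>>> np.prod(df[['A', 'B', 'C']].to_numpy(), axis=0).tolist()
[12, 30, 0]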

You can apply numpy.prod (numpy.product was a deprecated alias and was removed in NumPy 2.0):
import numpy as np
np.prod(df.set_index('Idx'))
output:
A    12
B    30
C     0
dtype: int64
as list:
products = np.prod(df.set_index('Idx')).to_list()

Related

Conflate sets of pandas columns into a single column

I have a dataframe that looks like this:
n objects id x y Vx Vy id.1 x.1 ... Vx.40 Vy.40 ...
0 41 1 2 3 4 5 17 3 ... 5 6 ...
1 21 1 2 3 4 5 17 3 ... 0 0 ...
2 36 1 2 3 4 5 17 3 ... 0 0 ...
My goal is to conflate the contents of every set of id, x, y, Vx, and Vy columns into a single column.
I.e. the end result should look like this:
n objects object_0 object_1 object_40 ...
0 41 [1,2,3,4,5] [17,3,...] ... [...5,6] ...
1 21 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
2 36 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
I am kind of at a loss as to how to achieve that. My only idea was hardcoding it like
df['object_0'] = df[['id', 'x', 'y', 'Vx', 'Vy']].values.tolist()
df.drop(['id', 'x', 'y', 'Vx', 'Vy'], axis=1, inplace=True)
for i in range(1, 41):
    df[f'object_{i}'] = df[[f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}']].values.tolist()
    df.drop([f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}'], axis=1, inplace=True)
but that is not a good option, as the number (and names) of repeating columns varies between dataframes. What is consistent is that the number of objects per row is listed, and every object has the same number of elements (i.e. there are no cases of columns going like id.26, y.26, Vx.26, id.27, Vy.27, id.28, ...).
I suppose I could find the number of objects via something like
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col ])
and then dig out the number and names of cols per object by
[ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]
but at that point this all starts seeming a bit too cluttered and hacky.
Is there a cleaner way to do that, one that works irrespective of the number of objects, of columns per object, and (ideally) of column names? Any help would be appreciated!
EDIT:
This does work, but is there a more elegant way of doing it?
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col])
obj_col_names = [ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]
df['object_0'] = df[obj_col_names].values.tolist()
df.drop(obj_col_names, axis=1, inplace=True)
for i in range(1, last_obj+1):
    current_col_set = [ "".join([col, f'.{i}']) for col in obj_col_names ]
    df[f'object_{i}'] = df[current_col_set].values.tolist()
    df.drop(current_col_set, axis=1, inplace=True)
This solution relabels the columns into numbered groups, then does a groupby on those labels along the column axis and converts each group into lists.
Starting with
n objects id x y Vx Vy id.1 x.1 y.1 Vx.1 Vy.1
0 0 41 1 2 3 4 5 17 3 3 4 5
1 1 21 1 2 3 4 5 17 3 3 4 5
2 2 36 1 2 3 4 5 17 3 3 4 5
Then
nb_cols = df.shape[1] - 2
nb_groups = int(df.columns[-1].split('.')[1]) + 1
cols_per_group = nb_cols // nb_groups
group_cols = np.arange(nb_cols) // cols_per_group
explode_cols = list(np.arange(nb_groups))
pd.concat([df.loc[:, :'objects'].reset_index(drop=True),
           df.loc[:, 'id':].set_axis(group_cols, axis=1)
             .groupby(level=0, axis=1)
             .apply(lambda x: x.values).to_frame().T.explode(explode_cols)
             .reset_index(drop=True)
             .rename(columns=lambda x: 'object_' + str(x))
          ], axis=1)
Result
n objects object_0 object_1
0 0 41 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
1 1 21 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
2 2 36 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
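Note that groupby(..., axis=1) is deprecated in recent pandas releases. Here is a minimal sketch of the same idea with a plain loop over the '.N' suffixes (conflate and id_cols are hypothetical names; the columns are assumed to follow pandas' name / name.N duplicate pattern):
import pandas as pd

def conflate(df, id_cols=('n', 'objects')):
    data = df.drop(columns=list(id_cols))
    # map each column to its group number: 'id' -> 0, 'id.1' -> 1, ...
    groups = [int(c.split('.')[1]) if '.' in c else 0 for c in data.columns]
    out = df[list(id_cols)].copy()
    for g in sorted(set(groups)):
        cols = [c for c, grp in zip(data.columns, groups) if grp == g]
        out[f'object_{g}'] = data[cols].values.tolist()
    return out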

Changing the values in a column using python

import pandas as pd
import numpy as np
data_A=pd.read_csv('D:/data_A.csv')
data_A has a column named power.
The power column should only contain 0 and 1; its dtype is int64.
I want to make sure that there are only 0s and 1s in the power column.
So, if there are numbers other than 0 and 1 in the power column, I want to set those values to 0. How can I do this?
You can use DataFrame.loc to conditionally access a group of rows and columns.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})
>>> df
power
0 1
1 0
2 1
3 2
4 5
5 6
6 0
7 1
>>> df.loc[~(df["power"].isin([1, 0])), "power"] = 0
>>> df
power
0 1
1 0
2 1
3 0
4 0
5 0
6 0
7 1
The condition ~(df["power"].isin([1, 0])) returns a Boolean Series which can be used to select the rows whose 'power' is not equal to 1 or 0.
You could also use a list comprehension if your dataframe is small.
data_A.power = [x if x == 1 else 0 for x in data_A.power]
Or numpy for a longer column (this solution assumes you don't have negative values)
import numpy as np
power_np = np.array(data_A.power)
power_np[power_np > 1] = 0
data_A.power = power_np
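A vectorized pandas alternative that also catches negative values is Series.where, which keeps values where the condition holds and substitutes the fallback elsewhere (a sketch on the same data_A frame):
data_A['power'] = data_A['power'].where(data_A['power'].isin([0, 1]), 0)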
Try this:
import pandas as pd

# example df
p = [1, 0, 3, 4, 's']
data_A = pd.DataFrame(p, columns=['power'])

def convert_row(row):
    if row == 1 or row == 0:
        return row
    else:
        return 0

data_A['power'] = data_A['power'].apply(convert_row)
print(data_A)

How to remove rows from DataFrame

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inliers). Is there a pythonic way to remove the DataFrame rows that match the indices of the elements of the ndarray marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5))
arr = np.random.choice([1, -1], 5)
>>> df
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317 0.342051 -0.787938 0.464699
2 -0.965481 0.727143 -0.887149 -0.430592 -2.074865
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
4 0.228538 0.799445 -0.217787 0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
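Equivalently, since the array only contains -1 and 1, you can mask on the outlier label directly (a sketch; the mask's length must equal the number of rows):
new_df = df[arr != -1]                    # keep rows not marked as outliers
new_df = new_df.reset_index(drop=True)    # optional: renumber the surviving rows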

Find integer row-index from pandas index

The following code finds the indices where df['A'] == 1
import pandas as pd
import numpy as np
import random

index = list(range(10))
random.shuffle(index)
df = pd.DataFrame(np.zeros((10, 1)).astype(int), columns=['A'], index=index)
df.iloc[3:6, 0] = 1
df.iloc[6:, 0] = 2
print(df)
print(df.loc[df['A'] == 1].index.tolist())
It returns the pandas index correctly. How do I get the integer positions ([3, 4, 5]) instead, using the pandas API?
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
[3, 7, 1]
what about?
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
or (depending on your goals):
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Demo:
In [11]: df
Out[11]:
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Here is one way:
df.reset_index().index[df.A == 1].tolist()
This re-indexes the data frame with [0, 1, 2, ...], then extracts the integer index values based on the boolean mask df.A == 1.
Edit: credits to @Max for the index[df.A == 1] idea.
No need for numpy, you're right. Just pure python with a listcomp:
Just find the indexes where the values are 1
print([i for i,x in enumerate(df['A'].values) if x == 1])
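If numpy is acceptable after all, np.where returns the same positions in one vectorized call (a sketch on the same frame):
import numpy as np
print(np.where(df['A'].values == 1)[0].tolist())  # [3, 4, 5]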

Pandas column as index for numpy array

How can I use a pandas column as an index for a numpy array? Say I have
>>> from numpy import arange
>>> import pandas as pd
>>> grid = arange(10, 20)
>>> df = pd.DataFrame([0, 1, 1, 5], columns=['i'])
I would like to do
>>> df['j'] = grid[df['i']]
IndexError: unsupported iterator index
What is a short and clean way to actually perform this operation?
Update
To be precise, I want an additional column that holds the values that correspond to the indices stored in the first column: df['j'][0] = grid[df['i'][0]] for row 0, and so on.
expected output:
index i j
0 0 10
1 1 11
2 1 11
3 5 15
Parallel Case: Numpy-to-Numpy
Just to show where the idea comes from, in standard python / numpy, if you have
>>> keys = [0, 1, 1, 5]
>>> grid = arange(10,20)
>>> grid[keys]
array([10, 11, 11, 15])
Which is exactly what I want to do. Only that my keys are not stored in a vector, they are stored in a column.
This is a numpy bug that surfaced with pandas 0.13.0 / numpy 1.8.0.
You can do:
In [5]: grid[df['i'].values]
Out[5]: array([10, 11, 11, 15])
In [6]: Series(grid)[df['i']]
Out[6]:
i
0    10
1    11
1    11
5    15
dtype: int64
This matches your output. You can assign an array to a column, as long as the length of the array/list is the same as the frame (otherwise how would you align it?)
In [14]: grid[keys]
Out[14]: array([10, 11, 11, 15])
In [15]: df['j'] = grid[df['i'].values]
In [17]: df
Out[17]:
i j
0 0 10
1 1 11
2 1 11
3 5 15
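In current pandas the explicit conversion with .to_numpy() (the modern replacement for the .values accessor) is the safest spelling (a sketch):
df['j'] = grid[df['i'].to_numpy()]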
