Multiply two Pandas dataframes with same shape and same columns names - python

I have two dataframes A, B with NxM shape. I want to multiply both such that each element of A is multiplied with respective element of B.
e.g:
A,B = input dataframes
C = final dataframe
I want C[i][j] = A[i][j]*B[i][j] for i=1..N and j=1..M
I searched but couldn't find an exact solution.

I think you can use:
C = A * B
Another solution uses mul:
C = A.mul(B)
Sample:
print(A)
   a  b
0  1  3
1  2  4
2  3  7
print(B)
   a  b
0  2  3
1  1  4
2  3  2
print(A * B)
   a   b
0  2   9
1  2  16
2  9  14
print(A.mul(B))
   a   b
0  2   9
1  2  16
2  9  14
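One caveat worth a quick sketch: both `A * B` and `A.mul(B)` align on index and column labels, not on position. The frames below reuse the sample values; `B_shifted` is a hypothetical frame, introduced only for illustration, whose index does not match `A`.

```python
import pandas as pd

A = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 7]})
B = pd.DataFrame({'a': [2, 1, 3], 'b': [3, 4, 2]})

# Same labels on both frames: element-wise product.
C = A * B

# Hypothetical frame with the same values but index labels 10, 11, 12.
B_shifted = B.copy()
B_shifted.index = [10, 11, 12]

# No index label is shared, so every cell of the aligned result is NaN.
misaligned = A * B_shifted

# Multiplying the underlying arrays ignores labels and works by position.
by_position = pd.DataFrame(A.to_numpy() * B_shifted.to_numpy(),
                           index=A.index, columns=A.columns)
```

If the two frames are known to have the same shape but possibly different labels, the array-based form is the one that matches C[i][j] = A[i][j]*B[i][j].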
Timings with A and B of length 300k:
A = pd.concat([A]*100000).reset_index(drop=True)
B = pd.concat([B]*100000).reset_index(drop=True)
print(A * B)
print(A.mul(B))
In [218]: %timeit A * B
The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.57 ms per loop
In [219]: %timeit A.mul(B)
100 loops, best of 3: 3.56 ms per loop

Related

How can I know the type of values in column names and list? [duplicate]

I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns that are included in that list, even if not all elements of the list are present in the dataframe? Here is some sample data which generates the above error:
df = pd.DataFrame( [[0,1,2]], columns=list('ABC') )
lst = list('ARB')
data = df[lst] # error: not in index
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d:
data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
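A few other ways; the list comprehension is much faster. Note that in pandas 2.0 and later, & on an Index no longer performs a set intersection, so the first variant only works on older versions.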
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from @jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
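To make the comparison concrete, here is a small sketch of three of the approaches above run against the question's sample data. The names `by_intersection`, `by_filter`, `wanted` and `by_comprehension` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=list('ABC'))
lst = list('ARB')  # 'R' is not a column of df

# Index.intersection keeps only the names present in both.
by_intersection = df[df.columns.intersection(lst)]

# filter(items=...) silently skips names that are not columns.
by_filter = df.filter(items=lst)

# Iterating over lst (not df.columns) keeps the order given in lst.
wanted = [c for c in lst if c in df.columns]
by_comprehension = df[wanted]
```

All three return the columns A and B here; they can differ in column order when `lst` is not in the same order as the dataframe's columns.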
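Unpacking the list with * has also been suggested:
data = df[[*lst]]
However, [*lst] is just a copy of lst, so this is equivalent to df[lst] and still raises a KeyError when a name in the list is not a column.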
Please try this:
Syntax: DataFrame[[list of columns]]
For example: df[['a','b']]
a
Out[5]:
a b c
0 1 2 3
1 12 3 44
x is the list of required columns to select:
x = ['a','b']
This gives the required slice:
a[x]
Out[7]:
a b
0 1 2
1 12 3
Performance:
%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Frequency of repetitive position in pandas data frame

Hi, I am trying to find repeated positions in the following data frame:
data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]
I want to compare each position with the one in the previous row: if it is the same, I assign 0; if it differs from the previous row, I assign 1.
My expected outcome is as follows:
That is, I want to group by (league, Team and week) and work out the frequency.
Can anyone advise how to do this in pandas?
Thanks,
Zep
Use diff, and compare against 0:
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
print(df)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['frequency'] = df['position'].diff().abs().fillna(0,downcast='infer')
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To extend this answer to work in a groupby, use
v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0
df['frequency'] = v.ne(0).astype(int)
Use diff and abs with fillna:
data['frequency'] = data['position'].diff().abs().fillna(0,downcast='infer')
print(data)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
Using groupby gives all zeros, since you are then comparing within groups rather than across the whole dataframe:
data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0,downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
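The all-zero result arises because every (league, Team, week) combination holds exactly one row, so no row has a predecessor inside its group. As a sketch, assuming the comparison should instead restart for each team (grouping by `league` and `Team` only, which is an assumption and not something the question states), the code below marks a change only against the previous week of the same team. The name `step` is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'league': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Team': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'position': [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Difference to the previous row within each team; the first row of a
# team has no predecessor and yields NaN.
step = data.groupby(['league', 'Team'])['position'].diff()

# NaN (no predecessor) and 0 (unchanged) both count as "no change".
data['frequency'] = step.fillna(0).ne(0).astype(int)
```

With this grouping, row 6 (team Z, week 1) becomes 0 instead of 1, because it is no longer compared with the last row of team Y.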

merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]
I think the simplest is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map with a Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
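As a quick check, the two approaches can be run side by side on the question's data. This is only a sketch; `via_transform`, `first_by_id` and `via_map` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
          [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
    columns=['id1', 'id2', 'x'],
)

# transform('first') broadcasts each group's first x back to every row.
via_transform = df.groupby('id1')['x'].transform('first')

# drop_duplicates keeps the first row per id1; map then looks x up by id1.
first_by_id = df.drop_duplicates('id1').set_index('id1')['x']
via_map = df['id1'].map(first_by_id)

df['first_x'] = via_transform
```

One difference to keep in mind: `first` skips missing values by default, whereas `drop_duplicates` takes the first row as it is, so the two can disagree when a group's first `x` is NaN.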
The first is the shortest and fastest solution:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'id1': np.random.randint(10000,size=N),
'x':np.random.randint(10000,size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])

Is there any column match or row match function in python?

I have two data frames, say:
Dataframe A with a column 'name':
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns, name and value:
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the values in dataframe B according to the name column in dataframe A.
I am expecting a final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
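A third possibility, not used in the answer above and offered only as a sketch, is a left merge, which keeps the row order of the left frame. The frames are rebuilt from the question's sample values; `result` and `reordered` are names introduced here for illustration.

```python
import pandas as pd

dfA = pd.DataFrame({'name': [4, 2, 1, 3]})
dfB = pd.DataFrame({'name': [3, 2, 4, 1], 'value': [5, 6, 7, 8]})

# how='left' keeps every row of dfA in its original order and pulls in
# the matching value from dfB.
result = dfA.merge(dfB, on='name', how='left')

# The first option from the answer, for comparison.
reordered = dfB.set_index('name').loc[dfA['name']].reset_index()
```

Unlike the `.loc` lookup, the left merge does not raise when a name in `dfA` is absent from `dfB`; the corresponding value is simply NaN.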
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()  # .ix was removed in pandas 1.0; use .loc on current versions
1 loop, best of 3: 1.66 s per loop
