How to combine 2 dataframes using the dot product - python

I have 2 dataframes:
import pandas as pd

df_1 = pd.DataFrame({"c1": [2, 3, 5, 0],
                     "c2": [1, 0, 5, 2],
                     "c3": [8, 1, 5, 1]},
                    index=[1, 2, 3, 4])
df_2 = pd.DataFrame({"u1": [1, 0, 1, 0],
                     "u2": [-1, 0, 1, 1]},
                    index=[1, 2, 3, 4])
For every combination of "c" and "u", I want to calculate the dot product, e.g. with np.dot().
For example, the value of c1-u1 is calculated like this: 2*1 + 3*0 + 5*1 + 0*0 = 7
The resulting dataframe should look like this:
    u1  u2
c1   7   3
c2   6   6
c3  13  -2
Is there an "elegant" way of solving this or is iterating through the 2 dataframes the only way?

Do you mean:
df_1.T @ df_2
# or equivalently
# df_1.T.dot(df_2)
Output:
    u1  u2
c1   7   3
c2   6   6
c3  13  -2

We can do matrix multiplication using the pandas dot method:
df_1.T.dot(df_2)
Output:
    u1  u2
c1   7   3
c2   6   6
c3  13  -2
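Both answers compute the same thing; a self-contained sketch with the question's data, showing that the `@` operator and `.dot()` agree:

```python
import pandas as pd

df_1 = pd.DataFrame({"c1": [2, 3, 5, 0],
                     "c2": [1, 0, 5, 2],
                     "c3": [8, 1, 5, 1]},
                    index=[1, 2, 3, 4])
df_2 = pd.DataFrame({"u1": [1, 0, 1, 0],
                     "u2": [-1, 0, 1, 1]},
                    index=[1, 2, 3, 4])

# Transposing df_1 turns its columns (c1..c3) into rows; the shared
# index (1..4) then becomes the axis the dot product sums over.
result = df_1.T @ df_2
assert result.equals(df_1.T.dot(df_2))
print(result)
```

No iteration needed: pandas aligns on the shared index and performs one matrix multiplication.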

Related

How to sort dataframe based on column whose entries consist of letters and numbers?

I have a dataframe like this:
import pandas as pd
import pandas as pd

df = pd.DataFrame(
    {
        'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
        'ignore': range(6)
    }
)

   pos  ignore
0   A1       0
1  B03       1
2   A2       2
3  B01       3
4   A3       4
5  B02       5
Which I would like to sort according to pos whereby
it should be first sorted according to the number and then according to the letter and
leading 0s should be ignored,
so the desired outcome is
   pos  ignore
0   A1       0
1  B01       3
2   A2       2
3  B02       5
4   A3       4
5  B03       1
I currently do it like this:
df[['let', 'num']] = df['pos'].str.extract(
    '([A-Za-z]+)([0-9]+)'
)
df['num'] = df['num'].astype(int)
df = (
    df.sort_values(['num', 'let'])
      .drop(['let', 'num'], axis=1)
      .reset_index(drop=True)
)
That works, but what I don't like is that I need two temporary columns I later have to drop again. Is there a more straightforward way of doing it?
You can use argsort with zfill and first sort on the numbers as 01, 02, 03 etc. This way you don't have to assign / drop columns:
val = df['pos'].str.extract(r'(\D+)(\d+)')
df.loc[(val[1].str.zfill(2) + val[0]).argsort()]

   pos  ignore
0   A1       0
3  B01       3
2   A2       2
5  B02       5
4   A3       4
1  B03       1
Here's one way:
import re

def extract_parts(x):
    # sort key: number first (as int), then letter, so 'B01' -> (1, 'B')
    groups = re.match('([A-Za-z]+)([0-9]+)', x)
    return (int(groups[2]), groups[1])

df.reindex(df.pos.transform(extract_parts).sort_values().index).reset_index(drop=True)
Output:
   pos  ignore
0   A1       0
1  B01       3
2   A2       2
3  B02       5
4   A3       4
5  B03       1
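On pandas 1.1 or newer there is a third option that avoids both temporary columns and manual index juggling: `sort_values` accepts a `key` callable. A minimal sketch under that version assumption:

```python
import pandas as pd

df = pd.DataFrame({'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
                   'ignore': range(6)})

def sort_key(s: pd.Series) -> pd.Series:
    # Build a key like "01A", "03B": zero-padded number first, then letter,
    # so leading zeros in the original strings are ignored.
    parts = s.str.extract(r'([A-Za-z]+)([0-9]+)')
    return parts[1].astype(int).astype(str).str.zfill(2) + parts[0]

out = df.sort_values('pos', key=sort_key, ignore_index=True)
print(out)
```

`ignore_index=True` resets the index in the same call, matching the desired output.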

Pandas: all possible combinations of rows

I have a DataFrame looking like..
ID c1 c2 cX
r1 2 3 ..
r2 8 9 ..
rY ..
I want to generate a new DataFrame with all possible (two-part) combinations of rows whilst concatenating the columns of the two combined rows (so that the new DF would have twice as much columns). The result should look like:
ID c1_r1 c1_r2 c2_r1 c2_r2 cX_rA
r1_r2 2 8 3 9 ..
r1_r3 .. .. .. ..
rA_rB ..
The ID name isn't important (it could even be a MultiIndex) nor is the order of the columns of importance.
How to approach this?
Consider df
c1 c2
ID
r1 2 3
r2 8 9
r3 0 7
I'd do it like this
from itertools import combinations

a, b = map(list, zip(*combinations(df.index, 2)))
print(a, b, sep='\n')

['r1', 'r1', 'r2']
['r2', 'r3', 'r3']

Then use pd.concat:
d = pd.concat(
    [df.loc[a].reset_index(), df.loc[b].reset_index()],
    keys=['a', 'b'], axis=1
)
d
      a           b
     ID c1 c2   ID c1 c2
0    r1  2  3   r2  8  9
1    r1  2  3   r3  0  7
2    r2  8  9   r3  0  7
Finally, tie up loose ends
d.set_index([('a', 'ID'), ('b', 'ID')]).rename_axis(['a', 'b'])
       a      b
      c1 c2  c1 c2
a  b
r1 r2  2  3   8  9
   r3  2  3   0  7
r2 r3  8  9   0  7
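The steps above fit into one runnable sketch (using the small three-row `df` from the answer):

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'c1': [2, 8, 0], 'c2': [3, 9, 7]},
                  index=pd.Index(['r1', 'r2', 'r3'], name='ID'))

# All 2-row combinations of the index, split into a "left" and a "right" half.
a, b = map(list, zip(*combinations(df.index, 2)))

# Align the two halves side by side; keys= labels the two column groups.
d = pd.concat([df.loc[a].reset_index(), df.loc[b].reset_index()],
              keys=['a', 'b'], axis=1)

# Promote the two ID columns to a row MultiIndex.
d = d.set_index([('a', 'ID'), ('b', 'ID')]).rename_axis(['a', 'b'])
print(d)
```

Each row of `d` holds one pair of original rows, with the pair's IDs as a MultiIndex.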

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
   A   B
0  1  b0
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
The first row which starts with a is
df[df.B.str.startswith("a")]
   A   B
1  2  a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
How can this be done?
One option is to create a mask and forward-fill it:
mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.ffill().fillna(0).astype(int) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]

   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
If I understand your question correctly, here is how you do it :
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# value in column A of the first row beginning with "a" (here 2)
index = df[df.B.str.startswith("a")].values.tolist()[0][0]
desired_df = pd.concat([df.A[index - 1:], df.B[index - 1:]], axis=1)
print(desired_df)
and you get:
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
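A shorter variant of the mask idea, not in the answers above: `cummax()` on a boolean Series stays True from the first True onward, which is exactly the selection wanted here.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 7, 8],
                   'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})

# startswith gives [F, T, F, F, T, F, F]; cummax turns that into
# [F, T, T, T, T, T, T] -- the first match and every row after it.
out = df[df.B.str.startswith('a').cummax()]
print(out)
```

Unlike the index-range approach, this simply returns an empty frame when no row starts with "a".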

How to use two different functions within crosstab/pivot_table in pandas?

Using pandas, is it possible to compute a single cross-tabulation (or pivot table) containing values calculated from two different functions?
import pandas as pd
import numpy as np

c1 = np.repeat(['a', 'b'], [50, 50], axis=0)
c2 = list('xy' * 50)
c3 = np.repeat(['G1', 'G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4 = np.repeat([1, 2], [50, 50], axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)
df = pd.DataFrame({'c1': c1, 'c2': c2, 'c3': c3, 'c4': c4, 'val': val})

frequencyTable = pd.crosstab([df.c1, df.c2], [df.c3, df.c4])
meanVal = pd.crosstab([df.c1, df.c2], [df.c3, df.c4], values=df.val, aggfunc=np.mean)
So, both the rows and the columns are the same in both tables, but what I'd really like is a table with both frequencies and mean values:
c3 G1 G2
c4 1 2 1 2
c1 c2 freq val freq val freq val freq val
a x 6 0.624931 5 0.582268 8 0.528231 6 0.362804
y 7 0.493890 8 0.465741 3 0.613126 7 0.312894
b x 9 0.488255 5 0.804015 6 0.722640 5 0.369480
y 6 0.462653 4 0.506791 5 0.583695 10 0.517954
You can give a list of functions:
pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
If you want the table as shown in your question, you will have to rearrange the levels a bit:
In [42]: table = pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
In [43]: table
Out[43]:
len mean
c3 G1 G2 G1 G2
c4 1 2 1 2 1 2 1 2
c1 c2
a x 4 6 8 7 0.303036 0.414474 0.624900 0.425234
y 5 5 8 7 0.543363 0.480419 0.583499 0.637657
b x 10 6 4 5 0.400279 0.436929 0.442924 0.287572
y 6 8 5 6 0.400427 0.623319 0.764506 0.408708
In [44]: table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
Out[44]:
c3 G1 G2
c4 1 2 1 2
len mean len mean len mean len mean
c1 c2
a x 4 0.303036 6 0.414474 8 0.624900 7 0.425234
y 5 0.543363 5 0.480419 8 0.583499 7 0.637657
b x 10 0.400279 6 0.436929 4 0.442924 5 0.287572
y 6 0.400427 8 0.623319 5 0.764506 6 0.408708
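The answer's two steps can be run end to end; a sketch with a fixed seed (string aggfunc names 'count' and 'mean' are used here instead of `len`/`np.mean`, which avoids deprecation warnings in recent pandas):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'c1': np.repeat(['a', 'b'], 50),
                   'c2': list('xy' * 50),
                   'c3': rng.permutation(np.repeat(['G1', 'G2'], 50)),
                   'c4': rng.permutation(np.repeat([1, 2], 50)),
                   'val': rng.random(100)})

# One crosstab, two aggregations: counts and means side by side.
table = pd.crosstab([df.c1, df.c2], [df.c3, df.c4],
                    values=df.val, aggfunc=['count', 'mean'])

# Move the aggfunc level innermost so each c3/c4 cell carries a
# (count, mean) pair, as in the question's desired layout.
table = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
print(table)
```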

problems with MultiIndex

I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on Stack Overflow.
=== multi.csv ===
h1,main,h3,sub,h5
a,A,1,A1,1
b,B,2,B1,2
c,B,3,A1,3
d,A,4,B2,4
e,A,5,B3,5
f,B,6,A2,6
=== multi.py ===
#!/usr/bin/env python
import pandas as pd

df1 = pd.read_csv('multi.csv')
df2 = df1.pivot(index='main', columns='sub').stack()
print(df2)
=== output ===
          h1  h3  h5
main sub
A    A1    a   1   1
     B2    d   4   4
     B3    e   5   5
B    A1    c   3   3
     A2    f   6   6
     B1    b   2   2
This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub column entry in row e to B2, then B2 is no longer unique in the group of A rows and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".
I expected the shape of the sub index to behave like the shape of the primary index, where duplicates are indicated with blank entries under the first row entry.
=== expected output ===
          h1  h3  h5
main sub
A    A1    a   1   1
     B2    d   4   4
           e   5   5
B    A1    c   3   3
     A2    f   6   6
     B1    b   2   2
So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?
Rather than do a pivot*, just set_index directly (this works for both examples):
In [11]: df
Out[11]:
  h1 main  h3 sub  h5
0  a    A   1  A1   1
1  b    B   2  B1   2
2  c    B   3  A1   3
3  d    A   4  B2   4
4  e    A   5  B2   5
5  f    B   6  A2   6

In [12]: df.set_index(['main', 'sub'])
Out[12]:
          h1  h3  h5
main sub
A    A1    a   1   1
B    B1    b   2   2
     A1    c   3   3
A    B2    d   4   4
     B2    e   5   5
B    A2    f   6   6
*You're not really doing a pivot here anyway, it just happens to work in the above case.
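A self-contained version of this approach (data as in the modified example, including the duplicate ('A', 'B2') pair); sorting the index afterwards also produces the grouped display the question hoped for:

```python
import pandas as pd

df = pd.DataFrame({'h1': list('abcdef'),
                   'main': ['A', 'B', 'B', 'A', 'A', 'B'],
                   'h3': [1, 2, 3, 4, 5, 6],
                   'sub': ['A1', 'B1', 'A1', 'B2', 'B2', 'A2'],
                   'h5': [1, 2, 3, 4, 5, 6]})

# set_index happily accepts duplicate (main, sub) pairs -- no reshape involved.
out = df.set_index(['main', 'sub']).sort_index()
print(out)
```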
