I have a DataFrame that looks like this:
ID c1 c2 cX
r1 2 3 ..
r2 8 9 ..
rY ..
I want to generate a new DataFrame with all possible combinations of two rows, concatenating the columns of the two combined rows (so the new DataFrame has twice as many columns). The result should look like this:
ID c1_r1 c1_r2 c2_r1 c2_r2 cX_rA
r1_r2 2 8 3 9 ..
r1_r3 .. .. .. ..
rA_rB ..
The ID name isn't important (it could even be a MultiIndex), nor is the order of the columns.
How should I approach this?
Consider df
c1 c2
ID
r1 2 3
r2 8 9
r3 0 7
I'd do it like this
from itertools import combinations
a, b = map(list, zip(*combinations(df.index, 2)))
print(a, b, sep='\n')
['r1', 'r1', 'r2']
['r2', 'r3', 'r3']
Then use pd.concat
d = pd.concat(
[df.loc[a].reset_index(), df.loc[b].reset_index()],
keys=['a', 'b'], axis=1
)
d
a b
ID c1 c2 ID c1 c2
0 r1 2 3 r2 8 9
1 r1 2 3 r3 0 7
2 r2 8 9 r3 0 7
Finally, tie up loose ends
d.set_index([('a', 'ID'), ('b', 'ID')]).rename_axis(['a', 'b'])
a b
c1 c2 c1 c2
a b
r1 r2 2 3 8 9
r3 2 3 0 7
r2 r3 8 9 0 7
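If you're on pandas 1.2+, another option is a self cross-join via merge(how='cross'), then keeping each unordered pair of distinct rows once. This is a sketch of the same idea without itertools; the suffixes '_a'/'_b' are my own naming:

```python
import pandas as pd

df = pd.DataFrame({'c1': [2, 8, 0], 'c2': [3, 9, 7]},
                  index=pd.Index(['r1', 'r2', 'r3'], name='ID'))

# Cross-join the frame with itself (pandas >= 1.2), then keep each
# unordered pair of distinct rows exactly once via ID_a < ID_b.
left = df.reset_index().add_suffix('_a')
right = df.reset_index().add_suffix('_b')
pairs = left.merge(right, how='cross')
pairs = pairs[pairs['ID_a'] < pairs['ID_b']].set_index(['ID_a', 'ID_b'])
print(pairs)
```

The ID_a < ID_b filter relies on the index values being comparable; for a non-ordered index you could compare positional row numbers instead.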
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into lists and then expand them with tolist, but since this appends sequentially, it does not work correctly. For example, if the URL has 3 levels, the mapping gets messed up:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
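Another way to keep the alignment stable, no matter how many levels a domain has, is to split into lists and index from the right; a sketch (the column names TLD/SLD are my own, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                   "C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})

# Split each URL into its labels, then index from the right so that
# position -1 is always the TLD and position -2 always the SLD.
labels = df["C"].str.split(r"\[\.\]")
df["TLD"] = labels.str[-1]
df["SLD"] = labels.str[-2]
print(df[["C", "SLD", "TLD"]])
```

For a 2-level domain like xyz[.]com, position -2 is simply the first label, so no special-casing is needed.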
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame(
{
'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
'ignore': range(6)
}
)
pos ignore
0 A1 0
1 B03 1
2 A2 2
3 B01 3
4 A3 4
5 B02 5
which I would like to sort by pos, whereby
it should first be sorted by the number and then by the letter, and
leading 0s should be ignored,
so the desired outcome is
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
I currently do it like this:
df[['let', 'num']] = df['pos'].str.extract(
    r'([A-Za-z]+)([0-9]+)'
)
df['num'] = df['num'].astype(int)
df = (
df.sort_values(['num', 'let'])
.drop(['let', 'num'], axis=1)
.reset_index(drop=True)
)
That works, but what I don't like is that I need two temporary columns I later have to drop again. Is there a more straightforward way of doing it?
You can use argsort with zfill and first sort on the numbers as 01, 02, 03 etc. This way you don't have to assign / drop columns:
val = df['pos'].str.extract(r'(\D+)(\d+)')
df.loc[(val[1].str.zfill(2) + val[0]).argsort()]
pos ignore
0 A1 0
3 B01 3
2 A2 2
5 B02 5
4 A3 4
1 B03 1
Here's one way:
import re
def extract_parts(x):
    groups = re.match('([A-Za-z]+)([0-9]+)', x)
    return (int(groups[2]), groups[1])
df.reindex(df.pos.transform(extract_parts).sort_values().index).reset_index(drop=True)
Output
Out[1]:
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
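On pandas 1.1+ you can also avoid the temporary columns entirely with sort_values' key parameter; a sketch (the helper name pos_key is my own):

```python
import pandas as pd

df = pd.DataFrame({'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
                   'ignore': range(6)})

def pos_key(s: pd.Series) -> pd.Series:
    # Build one sortable string per value: zero-padded number first,
    # then the letter, e.g. 'B01' -> '01B', 'A2' -> '02A'.
    parts = s.str.extract(r'(\D+)(\d+)')
    return parts[1].str.zfill(2) + parts[0]

out = df.sort_values('pos', key=pos_key, ignore_index=True)
print(out)
```

zfill(2) assumes the numbers have at most two digits; widen the padding if they can be longer.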
I have 2 dataframes:
df_1 = pd.DataFrame({"c1":[2,3,5,0],
"c2":[1,0,5,2],
"c3":[8,1,5,1]},
index=[1,2,3,4])
df_2 = pd.DataFrame({"u1":[1,0,1,0],
"u2":[-1,0,1,1]},
index=[1,2,3,4])
For every combination of "c" and "u", I want to calculate the dot product, e.g. with np.dot().
For example, the value of c1-u1 is calculated like this: 2*1 + 3*0 + 5*1 + 0*0 = 7
The resulting dataframe should look like this:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
Is there an "elegant" way of solving this or is iterating through the 2 dataframes the only way?
Do you mean:
df_1.T # df_2
# or equivalently
# df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
We can do matrix multiplication using pandas dot function.
df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
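Equivalently, you can drop to NumPy for the multiplication and re-attach the labels yourself; a sketch:

```python
import pandas as pd

df_1 = pd.DataFrame({"c1": [2, 3, 5, 0], "c2": [1, 0, 5, 2],
                     "c3": [8, 1, 5, 1]}, index=[1, 2, 3, 4])
df_2 = pd.DataFrame({"u1": [1, 0, 1, 0], "u2": [-1, 0, 1, 1]},
                    index=[1, 2, 3, 4])

# (3 x 4) @ (4 x 2) -> one dot product per (c, u) pair.
res = pd.DataFrame(df_1.to_numpy().T @ df_2.to_numpy(),
                   index=df_1.columns, columns=df_2.columns)
print(res)
```

Unlike DataFrame.dot, this skips index alignment, so make sure the rows of the two frames are already in the same order.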
Using pandas, is it possible to compute a single cross-tabulation (or pivot table) containing values calculated from two different functions?
import pandas as pd
import numpy as np
c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4 = np.repeat([1, 2], [50, 50], axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)
df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})
frequencyTable = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
meanVal = pd.crosstab([df.c1,df.c2],[df.c3,df.c4],values=df.val,aggfunc=np.mean)
So, both the rows and the columns are the same in both tables, but what I'd really like is a table with both frequencies and mean values:
c3 G1 G2
c4 1 2 1 2
c1 c2 freq val freq val freq val freq val
a x 6 0.624931 5 0.582268 8 0.528231 6 0.362804
y 7 0.493890 8 0.465741 3 0.613126 7 0.312894
b x 9 0.488255 5 0.804015 6 0.722640 5 0.369480
y 6 0.462653 4 0.506791 5 0.583695 10 0.517954
You can give a list of functions:
pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
If you want the table as shown in your question, you will have to rearrange the levels a bit:
In [42]: table = pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
In [43]: table
Out[43]:
len mean
c3 G1 G2 G1 G2
c4 1 2 1 2 1 2 1 2
c1 c2
a x 4 6 8 7 0.303036 0.414474 0.624900 0.425234
y 5 5 8 7 0.543363 0.480419 0.583499 0.637657
b x 10 6 4 5 0.400279 0.436929 0.442924 0.287572
y 6 8 5 6 0.400427 0.623319 0.764506 0.408708
In [44]: table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
Out[44]:
c3 G1 G2
c4 1 2 1 2
len mean len mean len mean len mean
c1 c2
a x 4 0.303036 6 0.414474 8 0.624900 7 0.425234
y 5 0.543363 5 0.480419 8 0.583499 7 0.637657
b x 10 0.400279 6 0.436929 4 0.442924 5 0.287572
y 6 0.400427 8 0.623319 5 0.764506 6 0.408708
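The same table can also be built with pivot_table, which accepts the same list of aggregation functions; a sketch using string aggfunc names (which recent pandas prefers over the raw NumPy functions) and a seeded generator so the numbers are reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'c1': np.repeat(['a', 'b'], 50),
                   'c2': list('xy' * 50),
                   'c3': rng.permutation(np.repeat(['G1', 'G2'], 50)),
                   'c4': rng.permutation(np.repeat([1, 2], 50)),
                   'val': rng.random(100)})

# 'count' supplies the frequencies, 'mean' the averaged values.
table = pd.pivot_table(df, values='val', index=['c1', 'c2'],
                       columns=['c3', 'c4'], aggfunc=['count', 'mean'])
# Interleave count/mean under each (c3, c4) column, as in the question.
interleaved = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
print(interleaved)
```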
I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on StackOverflow.
=== multi.csv ===
h1,main,h3,sub,h5
a,A,1,A1,1
b,B,2,B1,2
c,B,3,A1,3
d,A,4,B2,4
e,A,5,B3,5
f,B,6,A2,6
=== multi.py ===
#!/usr/bin/env python
import pandas as pd
df1 = pd.read_csv('multi.csv')
df2 = df1.pivot(index='main', columns='sub').stack()
print(df2)
=== output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
B3 e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub column entry in row e to B2, then B2 is no longer unique in the group of A rows and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".
I expected the sub index to behave like the primary index, where duplicates are indicated by blank entries under the first row of a group.
=== expected output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?
Rather than do a pivot*, just set_index directly (this works for both examples):
In [11]: df
Out[11]:
h1 main h3 sub h5
0 a A 1 A1 1
1 b B 2 B1 2
2 c B 3 A1 3
3 d A 4 B2 4
4 e A 5 B2 5
5 f B 6 A2 6
In [12]: df.set_index(['main', 'sub'])
Out[12]:
h1 h3 h5
main sub
A A1 a 1 1
B B1 b 2 2
A1 c 3 3
A B2 d 4 4
B2 e 5 5
B A2 f 6 6
*You're not really doing a pivot here anyway, it just happens to work in the above case.
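If you also want the grouped display from the question (repeated outer labels blanked), sort the index after setting it; a sketch with the row-e-changed-to-B2 data:

```python
import pandas as pd

df = pd.DataFrame({'h1': list('abcdef'),
                   'main': ['A', 'B', 'B', 'A', 'A', 'B'],
                   'h3': [1, 2, 3, 4, 5, 6],
                   'sub': ['A1', 'B1', 'A1', 'B2', 'B2', 'A2'],
                   'h5': [1, 2, 3, 4, 5, 6]})

# sort_index groups equal outer labels together; printing then blanks
# the repeats, which is the display the question expected.
out = df.set_index(['main', 'sub']).sort_index()
print(out)
```

Note that duplicate (main, sub) pairs remain in the index; that is fine for display and selection, but operations that require a unique index (such as unstack) will still refuse them.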