How to use two different functions within crosstab/pivot_table in pandas? - python

Using pandas, is it possible to compute a single cross-tabulation (or pivot table) containing values calculated from two different functions?
import pandas as pd
import numpy as np
c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)
df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})
frequencyTable = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
meanVal = pd.crosstab([df.c1,df.c2],[df.c3,df.c4],values=df.val,aggfunc=np.mean)
So, both the rows and the columns are the same in both tables, but what I'd really like is a table with both frequencies and mean values:
c3       G1                            G2
c4        1              2              1              2
c1 c2  freq       val freq       val freq       val freq       val
a  x      6  0.624931    5  0.582268    8  0.528231    6  0.362804
   y      7  0.493890    8  0.465741    3  0.613126    7  0.312894
b  x      9  0.488255    5  0.804015    6  0.722640    5  0.369480
   y      6  0.462653    4  0.506791    5  0.583695   10  0.517954

You can give a list of functions:
pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
If you want the table as shown in your question, you will have to rearrange the levels a bit:
In [42]: table = pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
In [43]: table
Out[43]:
       len                  mean
c3      G1      G2            G1                  G2
c4       1   2   1   2         1         2         1         2
c1 c2
a  x     4   6   8   7  0.303036  0.414474  0.624900  0.425234
   y     5   5   8   7  0.543363  0.480419  0.583499  0.637657
b  x    10   6   4   5  0.400279  0.436929  0.442924  0.287572
   y     6   8   5   6  0.400427  0.623319  0.764506  0.408708
In [44]: table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
Out[44]:
c3      G1                          G2
c4       1             2             1             2
       len      mean len      mean len      mean len      mean
c1 c2
a  x     4  0.303036   6  0.414474   8  0.624900   7  0.425234
   y     5  0.543363   5  0.480419   8  0.583499   7  0.637657
b  x    10  0.400279   6  0.436929   4  0.442924   5  0.287572
   y     6  0.400427   8  0.623319   5  0.764506   6  0.408708
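To get the freq/val labels from the question rather than len/mean, the function level can be renamed before reordering. The following is only a sketch on a small hand-made frame (the data and the single-level row/column keys are assumptions for illustration). It uses the string aggregators "count" and "mean"; note that "count" ignores missing values, so it matches len only when val has no NaN.

```python
import pandas as pd

# Small illustrative frame (assumed data, not the random frame from the question)
df = pd.DataFrame({
    "c1": ["a", "a", "a", "b", "b", "b"],
    "c3": ["G1", "G1", "G2", "G1", "G2", "G2"],
    "val": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
})

# One crosstab, two aggregations; the function names become the top column level
table = pd.crosstab(df.c1, df.c3, values=df.val, aggfunc=["count", "mean"])

# Rename the function level, then move it below c3 and sort the columns
table = table.rename(columns={"count": "freq", "mean": "val"}, level=0)
table = table.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(table)
```

With more than one column key, reorder_levels as in the answer above does the same job as swaplevel here.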

Related

How to combine 2 dataframes, using the dot product

I have 2 dataframes:
df_1 = pd.DataFrame({"c1":[2,3,5,0],
"c2":[1,0,5,2],
"c3":[8,1,5,1]},
index=[1,2,3,4])
df_2 = pd.DataFrame({"u1":[1,0,1,0],
"u2":[-1,0,1,1]},
index=[1,2,3,4])
For every combination of "c" and "u", I want to calculate the dot product, e.g. with np.dot().
For example, the value of c1-u1 is calculated like this: 2*1 + 3*0 + 5*1 + 0*0 = 7
The resulting dataframe should look like this:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
Is there an "elegant" way of solving this or is iterating through the 2 dataframes the only way?
Do you mean:
df_1.T @ df_2
# or equivalently
# df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
We can do matrix multiplication using the pandas dot function.
df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
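As a quick check, the pandas result can be compared with plain np.dot on the underlying arrays. This is a minimal sketch using the two frames from the question; one thing to keep in mind is that DataFrame.dot and @ align on labels, so the columns of df_1.T (the original index 1..4) have to match the index of df_2.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"c1": [2, 3, 5, 0],
                     "c2": [1, 0, 5, 2],
                     "c3": [8, 1, 5, 1]},
                    index=[1, 2, 3, 4])
df_2 = pd.DataFrame({"u1": [1, 0, 1, 0],
                     "u2": [-1, 0, 1, 1]},
                    index=[1, 2, 3, 4])

# Label-aware matrix product: rows are the "c" columns, columns are the "u" columns
result = df_1.T @ df_2

# The same numbers from NumPy, without the labels
expected = np.dot(df_1.to_numpy().T, df_2.to_numpy())
print(result)
```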

Substitute values of one dataframe with values from another dataframe based on a condition

I have a df which looks like
floor id p1 p2 p3
L1 1 5 6 7
L1 2 5 8 3
L2 1 4 2 1
L2 2 4 5 4
and df2
floor id p1 p2 p4
L1 1 6 6 5
L1 2 9 8 5
L2 1 5 5 5
L2 2 4 5 5
How do I replace the values of p1 and p2 in my df, for a particular floor and id, with the respective values from df2?
We can also use DataFrame.merge
df1 = (df1[df1.columns.difference(['p1','p2'])].merge(df2,
on =['floor','id'],
how ='left')
.fillna(df1)[df1.columns])
print(df1)
floor id p1 p2 p3
0 L1 1 6 6 7
1 L1 2 9 8 3
2 L2 1 5 5 1
3 L2 2 4 5 4
Merge can be used for this particular problem:
# left join
df = df.merge(df2, on=['floor', 'id'], how='left')
# fill missing values with corresponding original df values
df['p1_y'] = df['p1_y'].fillna(df['p1_x']).astype(int)
df['p2_y'] = df['p2_y'].fillna(df['p2_x']).astype(int)
# drop unnecessary columns
df.drop(['p1_x', 'p2_x', 'p4'], axis=1, inplace=True)
df.rename(columns={'p1_y': 'p1', 'p2_y': 'p2'}, inplace=True)
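A different route from the merge answers above, offered only as a sketch: index both frames by the key columns and let DataFrame.update overwrite the matching cells. The frames below are rebuilt from the tables in the question; the names keys and out are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"floor": ["L1", "L1", "L2", "L2"],
                   "id": [1, 2, 1, 2],
                   "p1": [5, 5, 4, 4],
                   "p2": [6, 8, 2, 5],
                   "p3": [7, 3, 1, 4]})
df2 = pd.DataFrame({"floor": ["L1", "L1", "L2", "L2"],
                    "id": [1, 2, 1, 2],
                    "p1": [6, 9, 5, 4],
                    "p2": [6, 8, 5, 5],
                    "p4": [5, 5, 5, 5]})

keys = ["floor", "id"]            # hypothetical helper name for the join keys
out = df.set_index(keys)          # new frame, df itself is left untouched

# update() works in place and aligns on the (floor, id) index;
# rows missing from df2 would simply keep their original values
out.update(df2.set_index(keys)[["p1", "p2"]])
out = out.reset_index()
print(out)
```

Depending on the pandas version, update may turn the updated integer columns into floats, so an astype(int) at the end may be wanted.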

Randomly choose two values without repetition in dataframe

Consider a dataframe df with N columns and M rows:
>>> df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
>>> df
a b c d e
0 4 4 5 5 7
1 9 3 8 8 1
2 2 8 1 8 5
3 9 5 1 2 7
4 3 5 8 2 3
5 2 8 8 2 8
6 3 1 7 2 6
7 4 1 5 6 3
8 5 4 4 9 5
9 3 7 5 6 6
I want to randomly choose two columns and then randomly choose one particular row (this would give me two values of the same row). I can achieve this using
>>> df.sample(2, axis=1).sample(1,axis=0)
e a
1 3 5
I want to perform this K times like below :
>>> for i in range(5):
... df.sample(2, axis=1).sample(1,axis=0)
...
e a
1 3 5
d b
2 1 9
e b
4 8 9
c b
0 6 5
e c
1 3 5
I want to ensure that I do not choose the same two values (by choosing the same two columns and same row) in any of the trials. How would I achieve this?
I want to then perform a bitwise XOR operation on the two chosen values in each trial as well. For example, 3 ^ 5, 1 ^ 9 , .. and count all the bit differences in the chosen values.
You can create a list of all (index, column-pair) tuples and then sample from it without replacement.
Sample Data
import pandas as pd
import numpy as np
from itertools import combinations, product
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
#df = df.reset_index() #if index contains duplicates
Code
K = 5
choices = np.array(list(product(df.index, combinations(df.columns, 2))), dtype=object)
idx = choices[np.random.choice(len(choices), K, replace=False)]
#array([[9, ('a', 'e')],
# [2, ('a', 'e')],
# [1, ('a', 'c')],
# [3, ('b', 'e')],
# [8, ('d', 'e')]], dtype=object)
Then you can decide how exactly you want your output, but something like this is close to what you show:
pd.concat([df.loc[myid[0], list(myid[1])].reset_index().T for myid in idx])
# 0 1
#index a e
#9 4 8
#index a e
#2 1 1
#index a c
#1 7 1
#index b e
#3 2 3
#index d e
#8 5 7
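The XOR part of the question is not covered above. A rough sketch of how it might be done, assuming the values are small non-negative integers as in the sample data: keep the candidates as a plain Python list, draw K distinct positions, then XOR each pair and count the set bits. The names rng, picked, trials and total_bits are made up for this example, and the seed is arbitrary.

```python
from itertools import combinations, product

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame(rng.integers(1, 10, (10, 5)), columns=list("abcde"))

K = 5
# Every (row label, (col_x, col_y)) candidate, each listed exactly once
choices = list(product(df.index, combinations(df.columns, 2)))

# Drawing positions without replacement guarantees no repeated trial
picked = rng.choice(len(choices), size=K, replace=False)

trials = []
total_bits = 0
for i in picked:
    row, (col_x, col_y) = choices[i]
    x, y = int(df.at[row, col_x]), int(df.at[row, col_y])
    diff = bin(x ^ y).count("1")      # number of differing bits
    trials.append((row, col_x, col_y, x, y, diff))
    total_bits += diff

print(trials)
print("total differing bits:", total_bits)
```

This only prevents picking the same row with the same two columns twice; two different trials can still happen to contain equal numbers.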

Pandas: all possible combinations of rows

I have a DataFrame looking like..
ID c1 c2 cX
r1 2 3 ..
r2 8 9 ..
rY ..
I want to generate a new DataFrame with all possible (two-part) combinations of rows, concatenating the columns of the two combined rows (so that the new DF would have twice as many columns). The result should look like:
ID c1_r1 c1_r2 c2_r1 c2_r2 cX_rA
r1_r2 2 8 3 9 ..
r1_r3 .. .. .. ..
rA_rB ..
The ID name isn't important (it could even be a MultiIndex) nor is the order of the columns of importance.
How to approach this?
Consider df
c1 c2
ID
r1 2 3
r2 8 9
r3 0 7
I'd do it like this
from itertools import combinations
a, b = map(list, zip(*combinations(df.index, 2)))
print(a, b, sep='\n')
['r1', 'r1', 'r2']
['r2', 'r3', 'r3']
Then use pd.concat
d = pd.concat(
[df.loc[a].reset_index(), df.loc[b].reset_index()],
keys=['a', 'b'], axis=1
)
d
    a          b
   ID c1 c2   ID c1 c2
0  r1  2  3   r2  8  9
1  r1  2  3   r3  0  7
2  r2  8  9   r3  0  7
Finally, tie up loose ends
d.set_index([('a', 'ID'), ('b', 'ID')]).rename_axis(['a', 'b'])
       a      b
      c1 c2  c1 c2
a  b
r1 r2  2  3   8  9
   r3  2  3   0  7
r2 r3  8  9   0  7
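If flat column names and a single combined ID are preferred over a MultiIndex, a variation along these lines might work. It is only a sketch: the _a/_b suffixes and the "r1_r2" style labels are naming choices made up here, loosely following the layout asked for in the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"c1": [2, 8, 0], "c2": [3, 9, 7]},
                  index=pd.Index(["r1", "r2", "r3"], name="ID"))

# First and second member of every unordered pair of row labels
a, b = map(list, zip(*combinations(df.index, 2)))

# Suffix the columns so both halves can sit side by side without a MultiIndex
left = df.loc[a].add_suffix("_a").reset_index(drop=True)
right = df.loc[b].add_suffix("_b").reset_index(drop=True)

pairs = pd.concat([left, right], axis=1)
pairs.index = [f"{x}_{y}" for x, y in zip(a, b)]
print(pairs)
```

The number of pairs grows quadratically with the number of rows, so this is only practical for fairly small frames.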

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and use it for selection:
mask = df.B.str.startswith("a").astype(float)
mask[mask == 0] = np.nan
df[mask.ffill().fillna(0) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Use idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
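One caveat with idxmax: if no row starts with "a", the mask is all False and idxmax returns the first label, so the whole frame comes back. A cumulative maximum over the boolean mask avoids that. This is a sketch of that alternative, not something taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 5, 6, 7, 8],
                   "B": ["b0", "a0", "c0", "c1", "a1", "b1", "b2"]})

mask = df.B.str.startswith("a")

# cummax turns the mask True from the first match onwards
result = df[mask.cummax()]
print(result)

# With no match at all, the mask stays False and the result is empty
no_match = df[df.B.str.startswith("z").cummax()]
print(no_match.empty)
```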
If I understand your question correctly, here is how you do it :
df = pd.DataFrame(data={'A':[1,2,3,5,6,7,8],
'B' : ['b0','a0','c0','c1','a1','b1','b2']})
# index of the item beginning with a
index = df[df.B.str.startswith("a")].index[0]
desired_df = pd.concat([df.A[index:], df.B[index:]], axis=1)
print(desired_df)
and you get:
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
