Joining/merging multiple dataframes without a matching index - python

I have a complete data set (pandas==1.1.5) that looks like this:
all_data_set = [
('A','Area1','AA','A B D E'),
('B','Area1','AA','A B D E'),
('C','Area2','BB','C'),
('D','Area1','CC','A B D E'),
('E','Area1','CC','A B D E'),
('F','Area3','BB','F'),
('G','Area4','AA','G H'),
('H','Area4','CC','G H'),
('I','Area5','BB','I'),
('J','Area6','AA','J L'),
('L','Area6','CC','J L'),
('M','Area5','BB','M')
]
all_df = pd.DataFrame(data = all_data_set, columns = ['Name','Area','Type','Group'])
Name Area Type Group
0 A Area1 AA A B D E
1 B Area1 AA A B D E
2 C Area2 BB C
3 D Area1 CC A B D E
4 E Area1 CC A B D E
5 F Area3 BB F
6 G Area4 AA G H
7 H Area4 CC G H
8 I Area5 BB I
9 J Area6 AA J L
10 L Area6 CC J L
11 M Area5 BB M
From this data set I created 3 df's grouped by Type:
aa_df = all_df.loc[all_df['Type']=='AA']
aa_df = aa_df.rename(columns={'Group':'AA group'})
bb_df = all_df.loc[all_df['Type']=='BB']
bb_df = bb_df.rename(columns={'Group':'BB group'})
cc_df = all_df.loc[all_df['Type']=='CC']
cc_df = cc_df.rename(columns={'Group':'CC group'})
Name Area Type AA group
0 A Area1 AA A B D E
1 B Area1 AA A B D E
6 G Area4 AA G H
9 J Area6 AA J L
Name Area Type BB group
2 C Area2 BB C
5 F Area3 BB F
8 I Area5 BB I
11 M Area5 BB M
Name Area Type CC group
3 D Area1 CC A B D E
4 E Area1 CC A B D E
7 H Area4 CC G H
10 L Area6 CC J L
My goal is to join them following these rules:
- All Members are grouped by matching Area, i.e. Area1 has Names A B D E
- AA Members are only Type = AA, i.e. of A B D E only A and B are AA Type
- CC Members are only Type = CC
- BB Members are always single and are also AA and CC Members
The resulting df should look like this:
   Name   Area Type All Members AA Members CC Members
0     A  Area1   AA     A B D E        A B        D E
1     B  Area1   AA     A B D E        A B        D E
2     C  Area2   BB           C          C          C
3     D  Area1   CC     A B D E        A B        D E
4     E  Area1   CC     A B D E        A B        D E
5     F  Area3   BB           F          F          F
6     G  Area4   AA         G H          G          H
7     H  Area4   CC         G H          G          H
8     I  Area5   BB           I          I          I
9     J  Area6   AA         J L          J          L
10    L  Area6   CC         J L          J          L
11    M  Area5   BB           M          M          M
I'm lost on how to join the 3 types of DFs since I don't have a shared index between them. I was thinking I need some type of isin to look back at all_df and reference the group. But the group is just as you see it: names separated by spaces, so I think I need to convert it to a list, maybe?
Is there a way to do this using pandas, or will I need a series of loops and lookups?

I think you don't need your grouped dfs. You can compute the members with groupby, then use the created df to look up AA and CC members. Finally, fill NA values with Name:
import pandas as pd
all_data_set = [
('A','Area1','AA','A B D E'),
('B','Area1','AA','A B D E'),
('C','Area2','BB','C'),
('D','Area1','CC','A B D E'),
('E','Area1','CC','A B D E'),
('F','Area3','BB','F'),
('G','Area4','AA','G H'),
('H','Area4','CC','G H'),
('I','Area5','BB','I'),
('J','Area6','AA','J L'),
('L','Area6','CC','J L'),
('M','Area5','BB','M')
]
all_df = pd.DataFrame(data = all_data_set, columns = ['Name','Area','Type','Group'])
members_df = all_df.groupby(['Area', 'Type']).agg({'Name': list})
#print(members_df)

def get_members(row, typ):
    try:
        return " ".join(members_df.loc[(row['Area'], typ), 'Name'])
    except KeyError:
        return None

all_df['AA members'] = all_df.apply(lambda x: get_members(x, 'AA'), axis=1)
all_df['CC members'] = all_df.apply(lambda x: get_members(x, 'CC'), axis=1)
# filling na values
all_df.loc[all_df['AA members'].isna(), 'AA members'] = all_df['Name']
all_df.loc[all_df['CC members'].isna(), 'CC members'] = all_df['Name']
print(all_df)
Output:
   Name   Area Type    Group AA members CC members
0     A  Area1   AA  A B D E        A B        D E
1     B  Area1   AA  A B D E        A B        D E
2     C  Area2   BB        C          C          C
3     D  Area1   CC  A B D E        A B        D E
4     E  Area1   CC  A B D E        A B        D E
5     F  Area3   BB        F          F          F
6     G  Area4   AA      G H          G          H
7     H  Area4   CC      G H          G          H
8     I  Area5   BB        I          I          I
9     J  Area6   AA      J L          J          L
10    L  Area6   CC      J L          J          L
11    M  Area5   BB        M          M          M
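For comparison, the same result can be sketched with a plain merge instead of apply: build the per-Area member strings once with groupby + unstack, merge them back on Area, and fill the gaps with Name. The column names AA members / CC members follow the answer above.

```python
import pandas as pd

all_df = pd.DataFrame(
    [('A','Area1','AA','A B D E'), ('B','Area1','AA','A B D E'),
     ('C','Area2','BB','C'),       ('D','Area1','CC','A B D E'),
     ('E','Area1','CC','A B D E'), ('F','Area3','BB','F'),
     ('G','Area4','AA','G H'),     ('H','Area4','CC','G H'),
     ('I','Area5','BB','I'),       ('J','Area6','AA','J L'),
     ('L','Area6','CC','J L'),     ('M','Area5','BB','M')],
    columns=['Name', 'Area', 'Type', 'Group'])

# one space-joined member string per (Area, Type), pivoted so Type becomes columns
members = (all_df.groupby(['Area', 'Type'])['Name']
                 .agg(' '.join)
                 .unstack())                      # columns: AA, BB, CC

out = all_df.merge(members[['AA', 'CC']],
                   left_on='Area', right_index=True, how='left')
out = out.rename(columns={'AA': 'AA members', 'CC': 'CC members'})

# BB names have no AA/CC rows in their area, so fall back to the Name itself
out['AA members'] = out['AA members'].fillna(out['Name'])
out['CC members'] = out['CC members'].fillna(out['Name'])
print(out)
```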

Related

Pandas replace Na using merge or join

I want to replace NaN in column A based on shared values in column B, so rows with x in column B get 1 in column A and rows with y in column B get 2 in column A:
     A  B  C  D  E
0    1  x  d  e  q
1  NaN  x  v  s  f
2  NaN  x  v  e  j
3    2  y  w  e  v
4  NaN  y  b  d  g
Use groupby.transform('first'), optionally combined with convert_dtypes to get integers back instead of floats:
df['A'] = df.groupby('B')['A'].transform('first').convert_dtypes()
output:
A B C D E
0 1 x d e q
1 1 x v s f
2 1 x v e j
3 2 y w e v
4 2 y b d g
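As a self-contained sketch of that one-liner (the frame below is rebuilt from the question; columns C, D and E are omitted since they don't affect the result):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, np.nan, 2, np.nan],
    'B': ['x', 'x', 'x', 'y', 'y'],
})

# every row takes the first non-null A value seen within its B group;
# convert_dtypes turns the float result back into a nullable integer column
df['A'] = df.groupby('B')['A'].transform('first').convert_dtypes()
print(df['A'].tolist())  # [1, 1, 1, 2, 2]
```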

How to transpose only a specific amount and add it to the existing dataframe

I searched the internet for a solution to my problem, but I could not find one.
I have the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
and I want to add the following dataframe to the existing one:
pos1 pos2 pos3
0 A B C
1 A B C
2 A B C
3 D E F
4 D E F
5 D E F
6 G H I
7 G H I
8 G H I
So that I get the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A B C
10 A B C
11 A B C
12 D E F
13 D E F
14 D E F
15 G H I
16 G H I
17 G H I
I know that the number of rows is always a multiple of the number of columns. That means if I have 4 columns, the rows should be 4, 8, 12, 16, etc. In my example there are 3 columns and 9 rows.
What I then want to do is transpose the rows into columns, but only that many rows at a time. So I want the first 3 rows to be transposed with the columns, then the next 3 rows, and so forth.
I have now the following code:
import pandas as pd
import io
s = """pos1 pos2 pos3
A A A
B B B
C C C
D D D
E E E
F F F
G G G
H H H
I I I
"""
df = pd.read_csv(io.StringIO(s), delim_whitespace=True)
final_df = df.copy()
index_values = final_df.index.values
value = 0
while value < len(df.index):
    sub_df = df[value:value+3]
    sub_df.columns = index_values[value: value + 3]
    sub_df = sub_df.T
    sub_df.columns = df.columns
    final_df = pd.concat([final_df, sub_df])
    value += len(df.columns)
final_df = final_df.reset_index(drop=True)
print(final_df)
The code I have now is slow because of the loop.
Is it possible to obtain the same result without the loop?
You can use the underlying numpy array with ravel, then reshape with the order='F' parameter (column-major order) and the pandas.DataFrame constructor.
Then concat the output with the original dataframe:
pd.concat([df,
pd.DataFrame(df.to_numpy().ravel().reshape(df.shape, order='F'),
columns=df.columns)
], ignore_index=True)
output:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A D G
10 A D G
11 A D G
12 B E H
13 B E H
14 B E H
15 C F I
16 C F I
17 C F I
This is somewhat efficient if you want to use pandas only:
for value in range(1, int(len(df.index)/3)):
    df.loc[len(df)+value*value] = df.iloc[(value*3)-3:value*3, 0:1].T.values[0]
    df.loc[len(df)+value*value+1] = df.iloc[(value*3)-3:value*3, 0:1].T.values[0]
    df.loc[len(df)+value*value+2] = df.iloc[(value*3)-3:value*3, 0:1].T.values[0]
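Note that the order='F' reshape shown earlier interleaves columns across blocks (rows like A D G), while the question's expected output transposes each 3-row block in place (rows like A B C). If the blockwise behaviour is what's wanted, a reshape/swapaxes sketch, assuming the row count is an exact multiple of the column count:

```python
import io
import pandas as pd

s = """pos1 pos2 pos3
A A A
B B B
C C C
D D D
E E E
F F F
G G G
H H H
I I I
"""
df = pd.read_csv(io.StringIO(s), sep=r"\s+")

n = len(df.columns)                          # block size = number of columns
arr = df.to_numpy().reshape(-1, n, n)        # split into (n x n) blocks
blocks = arr.swapaxes(1, 2).reshape(-1, n)   # transpose each block, restack

out = pd.concat([df, pd.DataFrame(blocks, columns=df.columns)],
                ignore_index=True)
print(out.iloc[9].tolist())  # ['A', 'B', 'C']
```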

How do I merge categories for crosstab in pandas?

Suppose my pandas dataframe has 3 categories for variable X: [A, B, C] and 2 categories for variable Y:[D,E]. I want to cross-tab this, with something like:
+--------+----------------------+-----+
| X/Y | D | E |
+--------+----------------------+-----+
| A or B | count(X=A or B, Y=D) | ... |
| C | count(X=C),Y=D) | ... |
+--------+----------------------+-----+
Is this what you are looking for?
import pandas as pd
import numpy as np
x = np.random.choice(['A', 'B', 'C'], size=10)
y = np.random.choice(['D', 'E'], size=10)
df = pd.DataFrame({'X':x, 'Y':y})
df.head()
Output:
X Y
0 A D
1 B D
2 B E
3 B D
4 A E
Dataframe modifications:
df['X'] = df['X'].apply(lambda x: 'A or B' if x == 'A' or x == 'B' else x)
Crosstab application:
pd.crosstab(df.X, df.Y)
Output:
Y D E
X
A or B 1 3
C 4 2
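If you'd rather not modify the frame at all, the merged categories can be passed to crosstab on the fly via replace; a sketch with a small fixed sample (the answer above used random data, so counts differ):

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'B', 'C', 'C', 'A'],
                   'Y': ['D', 'D', 'E', 'D', 'E', 'E']})

# map A and B onto one category on the fly; df itself stays untouched
merged = df['X'].replace({'A': 'A or B', 'B': 'A or B'})
tab = pd.crosstab(merged, df['Y'])
print(tab)
```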
You can use pandas.pivot_table() for this purpose. This should do the trick (df refers to the input dataframe):
import numpy as np
df["catX"]=np.where(df["X"].isin(["A","B"]), "AB", np.where(df["X"]=="C", "C", "other"))
df2=df.pivot_table(index="catX", columns="Y", aggfunc='count', values="X")
Sample output:
#input - df with extra categorical column - catX
X Y catX
0 A D AB
1 B D AB
2 C E C
3 B E AB
4 C D C
5 B D AB
6 C D C
7 A E AB
8 A D AB
9 A E AB
10 C E C
11 C E C
12 A E AB
#result:
Y D E
catX
AB 4 4
C 2 3

How to split a string and assign as column name for a pandas dataframe?

I have a dataframe which has a single column like this:
a;d;c;d;e;r;w;e;o
--------------------
0 h;j;r;d;w;f;g;t;r
1 a;f;c;x;d;e;r;t;y
2 b;h;g;t;t;t;y;u;f
3 g;t;u;n;b;v;d;s;e
When I split it I get this:
0 1 2 3 4 5 6 7 8
------------------------------
0 h j r d w f g t r
1 a f c x d e r t y
2 b h g t t t y u f
3 g t u n b v d s e
I need to assign a d c d e r w e o as the column names instead of 0 1 2 3 4 5 6 7 8.
I tried :
df = dataframe
df = df.iloc[:,0].str.split(';')
res = pd.DataFrame(df.columns.tolist())
res = pd.DataFrame(df.values.tolist())
The values get assigned to each column, but not the column headers. What should I do?
I think you need to create a new DataFrame with the expand=True parameter and then assign the new column names:
res = df.iloc[:,0].str.split(';', expand=True)
res.columns = df.columns[0].split(';')
print (res)
a d c d e r w e o
0 h j r d w f g t r
1 a f c x d e r t y
2 b h g t t t y u f
3 g t u n b v d s e
But if the file contains only this one column of data, maybe you just need sep=';' in read_csv:
res = pd.read_csv(file, sep=';')
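A quick end-to-end check of the approach above, rebuilt from the question's data with io.StringIO standing in for the file:

```python
import io
import pandas as pd

raw = """a;d;c;d;e;r;w;e;o
h;j;r;d;w;f;g;t;r
a;f;c;x;d;e;r;t;y
"""

# default sep=',' finds no commas, so this reproduces the question's
# single-column frame whose header is the whole 'a;d;c;...' string
df = pd.read_csv(io.StringIO(raw))

res = df.iloc[:, 0].str.split(';', expand=True)
res.columns = df.columns[0].split(';')
print(res)
```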

How do I flatten a pandas dataframe keeping index and column names

I have a pandas dataframe as
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
A B C D
E -0.585995 1.325598 -1.172405 -2.810322
F -2.282079 -1.203231 -0.304155 -0.119221
G -0.739126 1.114628 0.381701 -0.485394
H 1.162010 -1.472594 1.767941 1.450582
I 0.119481 0.097139 -0.091432 -0.415333
J 1.266389 0.875473 1.787459 -1.149971
How can I flatten this array, whilst keeping the column and index IDs as here:
E A -0.585995
E B 1.325598
E C -1.172405
E D -2.810322
F A ...
F B ...
...
...
J D -1.149971
It doesn't matter what order the values occur in...
ndarray.flatten() can be used to flatten df.values into a 1D array, but then I lose the order of the index and columns...
Use stack + reset_index:
df = df.stack().reset_index()
df.columns = ['a','b','c']
print (df)
a b c
0 E A -0.585995
1 E B 1.325598
2 E C -1.172405
3 E D -2.810322
4 F A -2.282079
5 F B -1.203231
6 F C -0.304155
7 F D -0.119221
8 G A -0.739126
9 G B 1.114628
10 G C 0.381701
11 G D -0.485394
12 H A 1.162010
13 H B -1.472594
14 H C 1.767941
15 H D 1.450582
16 I A 0.119481
17 I B 0.097139
18 I C -0.091432
19 I D -0.415333
20 J A 1.266389
21 J B 0.875473
22 J C 1.787459
23 J D -1.149971
Numpy solution with numpy.tile + numpy.repeat + numpy.ravel:
b = np.tile(df.columns, len(df.index))
a = np.repeat(df.index, len(df.columns))
c = df.values.ravel()
df = pd.DataFrame({'a':a, 'b':b, 'c':c})
print (df)
a b c
0 E A -0.585995
1 E B 1.325598
2 E C -1.172405
3 E D -2.810322
4 F A -2.282079
5 F B -1.203231
6 F C -0.304155
7 F D -0.119221
8 G A -0.739126
9 G B 1.114628
10 G C 0.381701
11 G D -0.485394
12 H A 1.162010
13 H B -1.472594
14 H C 1.767941
15 H D 1.450582
16 I A 0.119481
17 I B 0.097139
18 I C -0.091432
19 I D -0.415333
20 J A 1.266389
21 J B 0.875473
22 J C 1.787459
23 J D -1.149971
Timings:
In [103]: %timeit (df.stack().reset_index())
1000 loops, best of 3: 1.26 ms per loop
In [104]: %timeit (pd.DataFrame({'a':np.repeat(df.index, len(df.columns)), 'b':np.tile(df.columns, len(df.index)), 'c':df.values.ravel()}))
1000 loops, best of 3: 436 µs per loop
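A third option with the same content (row order differs) is melt on the reset index; a small sketch with a tiny fixed frame, not benchmarked here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  index=list('EF'), columns=list('ABCD'))

# melt keeps the index as an ordinary column: one row per (index, column) pair
long = (df.reset_index()
          .melt(id_vars='index', var_name='b', value_name='c')
          .rename(columns={'index': 'a'}))
print(long.sort_values(['a', 'b']))
```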
