Pandas hierarchical columns - python

Say I have two dataframes. Is it possible to concatenate them column-wise, but with the second one appearing under a single top-level column label in the concatenated dataframe?
Pictorially, I'm looking for:
df_A:
C1  C2  C3
 1   2   3
11  22  33
df_B:
D1  D2  D3
 3   4   5
33  44  55
Concatenated:
C1  C2  C3   df_B
             D1  D2  D3
 1   2   3    3   4   5
11  22  33   33  44  55

You can construct a MultiIndex to create a DataFrame with the desired appearance:
import pandas as pd
df_A = pd.DataFrame([(1,2,3), (11,22,33)], columns=['C1', 'C2', 'C3'])
df_B = pd.DataFrame([(3,4,5), (33,44,55)], columns=['D1', 'D2', 'D3'])
result = pd.concat([df_A, df_B], axis=1)
result.columns = pd.MultiIndex.from_tuples(
    [(col, '') for col in df_A] + [('df_B', col) for col in df_B])
print(result)
yields
   C1  C2  C3  df_B
                 D1  D2  D3
0   1   2   3     3   4   5
1  11  22  33    33  44  55
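As an aside, a broadly similar layout can be produced directly with the keys argument of pd.concat. This is only a sketch, and it is not identical: keys puts a top level over both frames, so df_A's columns come out as ('', 'C1') rather than ('C1', '') as above.
result2 = pd.concat([df_A, df_B], axis=1, keys=['', 'df_B'])
print(result2)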

Related

How can I sum two dataframes' totals in a new dataframe?

I have the following code:
df_A = pd.DataFrame({'a1': [2, 2, 3, 5, 6],
                     'a2': [8, 6, 3, 5, 2],
                     'a3': [7, 4, 3, 0, 6]})
df_B = pd.DataFrame({'b1': [9, 5, 3, 7, 6],
                     'b2': [0, 6, 4, 5, 3],
                     'b3': [7, 8, 8, 0, 10]})
This looks like:
   a1  a2  a3
0   2   8   7
1   2   6   4
2   3   3   3
3   5   5   0
4   6   2   6
and:
   b1  b2  b3
0   9   0   7
1   5   6   8
2   3   4   8
3   7   5   0
4   6   3  10
I want to have the sum of each column so I did:
total_A = df_A.sum()
total_B = df_B.sum()
The outcome for total_A was:
     0
a1  18
a2  24
a3  20
for total_B:
     0
b1  30
b2  18
b3  33
Both totals then need to be summed as well, but I am getting NaNs.
I would prefer to get a df with columns named
total_1, total_2, total_3
holding the total value for each column:
total_1  total_2  total_3
     48       42       53
So 48 is the sum of column a1 plus column b1, 42 is the sum of a2 plus b2, and 53 is the sum of a3 plus b3.
Can someone help me please?
The indexes are not aligned, so pandas won't sum a1 with b1. You need to align them, and there are several ways to do so.
You can use the underlying numpy data of df_B to avoid index alignment:
df_A.sum()+df_B.sum().values
or relabel df_B's columns to match those of df_A:
df_A.add(df_B.set_axis(df_A.columns, axis=1)).sum()
output:
a1 48
a2 42
a3 53
dtype: int64
or rename both to a common set of column names:
(df_A
 .rename(columns=lambda x: x.replace('a', 'total_'))
 .add(df_B.rename(columns=lambda x: x.replace('b', 'total_')))
 .sum()
)
output:
total_1 48
total_2 42
total_3 53
dtype: int64
or as a numpy array:
(df_A.to_numpy()+df_B.to_numpy()).sum(0)
output:
array([48, 42, 53])
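If you literally want the one-row frame with total_1, total_2, total_3 columns shown in the question, here is a small sketch built on the numpy variant (the column names are simply taken from the question):
totals = df_A.to_numpy().sum(axis=0) + df_B.to_numpy().sum(axis=0)
result = pd.DataFrame([totals], columns=['total_1', 'total_2', 'total_3'])
print(result)
which prints:
   total_1  total_2  total_3
0       48       42       53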

Removing duplicate rows in a dataframe with some conditions on data in a particular column

I have the following dataframe, df
Index  time  block  cell
    0     9     25    c1
    1     9     25    c1
    2    33     35    c2
    3    47      4    c1
    4    47     17    c2
    5   100     21    c1
    6   120     21    c1
    7   120     36    c2
The duplicates are to be dropped based on the time column. However, there is a condition:
- if two or more rows with the same time also have the same cell (for example, index 0 and index 1 both have c1), then keep only one of those rows;
- if two or more rows with the same time have different cells (e.g. index 3 and 4, or index 6 and 7), then keep all the rows for that duplicated time.
The resulting data frame will be as follows: df_result =
Index  time  block  cell
    0     9     25    c1
    2    33     35    c2
    3    47      4    c1
    4    47     17    c2
    5   100     21    c1
    6   120     21    c1
    7   120     36    c2
I tried:
df.drop_duplicates('time')
You can achieve this by binning the original DataFrame into categories and then running drop_duplicates() within each category.
import pandas as pd

df = pd.DataFrame({'time': [9, 9, 33, 47, 47, 100, 120, 120],
                   'block': [25, 25, 35, 4, 17, 21, 21, 36],
                   'cell': 'c1;c1;c2;c1;c2;c1;c1;c2'.split(';')})
categories = df['cell'].astype('category').unique()
df2 = pd.DataFrame()
for category in categories:
    df2 = pd.concat([df2, df[df['cell'] == category].drop_duplicates(keep='first')])
df2 = df2.sort_index()
This will result in df2 being
   time  block cell
0     9     25   c1
2    33     35   c2
3    47      4   c1
4    47     17   c2
5   100     21   c1
6   120     21   c1
7   120     36   c2
You can group by one of the desired columns, then drop the duplicates on the other column within each group:
df = pd.DataFrame({'time': [9, 9, 33, 47, 47, 100, 120, 120],
                   'block': [25, 25, 35, 4, 17, 21, 21, 36],
                   'cell': ['c1', 'c1', 'c2', 'c1', 'c2', 'c1', 'c1', 'c2']})
grouped = df.groupby('time')
final_df = pd.DataFrame({'time': [], 'block': [], 'cell': []})
for ind, gr in grouped:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    final_df = pd.concat([final_df, gr.drop_duplicates('cell')])
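For reference, if a duplicate is defined as "same time and same cell" (which is what both answers above effectively implement), the whole operation collapses to a single call using the subset parameter of drop_duplicates:
final_df = df.drop_duplicates(subset=['time', 'cell'], keep='first')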

How to expand a JSON column into separate columns and join them back to the original dataframe?

import pandas as pd
inp = [{'c1': 10, 'cols': {'c2': 20, 'c3': 'str1'}, 'c4': '41'},
       {'c1': 11, 'cols': {'c2': 20, 'c3': 'str2'}, 'c4': '42'},
       {'c1': 12, 'cols': {'c2': 20, 'c3': 'str3'}, 'c4': '43'}]
df = pd.DataFrame(inp)
print (df)
The df is:
   c1  c4                      cols
0  10  41  {'c2': 20, 'c3': 'str1'}
1  11  42  {'c2': 20, 'c3': 'str2'}
2  12  43  {'c2': 20, 'c3': 'str3'}
The cols column holds dict (JSON-like) values.
I need to "json_decode" the cols column, i.e. expand it into its own columns, changing df to:
   c1  c4  c2    c3
0  10  41  20  str1
1  11  42  20  str2
2  12  43  20  str3
How to do it?
Thanks in advance!
Use pd.io.json.json_normalize
pd.io.json.json_normalize(inp)
Outputs
   c1  c4  cols.c2 cols.c3
0  10  41       20    str1
1  11  42       20    str2
2  12  43       20    str3
If you have a pd.DataFrame, convert back using to_dict
pd.io.json.json_normalize(df.to_dict('records'))
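Note that since pandas 1.0 this function is also available at the top level as pd.json_normalize. If you then want plain c2/c3 names rather than the cols. prefix, one option is to strip the prefix afterwards, e.g.:
out = pd.json_normalize(inp)
out.columns = out.columns.str.replace('cols.', '', regex=False)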
Use DataFrame.pop to extract the column, convert it to a list of dictionaries, pass that to the DataFrame constructor, and finally join the result back to the original with DataFrame.join:
df = df.join(pd.DataFrame(df.pop('cols').values.tolist(), index=df.index))
print (df)
   c1  c4  c2    c3
0  10  41  20  str1
1  11  42  20  str2
2  12  43  20  str3
You can use:
df = df.join(pd.DataFrame.from_dict(df['cols'].tolist()))
df.drop('cols', axis=1, inplace=True)
print(df)
Output:
   c1  c4  c2    c3
0  10  41  20  str1
1  11  42  20  str2
2  12  43  20  str3

Search for a word and insert an empty row

I am new to Python and am developing some code.
I want to search for a word in a column and, if a match is found, insert an empty row below that row.
My code so far is:
if df.columnname == 'total':
    df.insert
Could someone please help me?
Do give the following a try:
>>> df
    id Label
0    1     A
1    2     B
2    3     B
3    4     B
4    5     A
5    6     B
6    7     A
7    8     A
8    9     C
9   10     C
10  11     C
# Create a separate series with the ids of the rows to be duplicated
df1 = df.loc[df['Label'] == 'B', 'id']
# Join it back, interleave with sort_index and reset the index
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
>>> df
    id Label
0    1     A
1    2     B
2    2   NaN
3    3     B
4    3   NaN
5    4     B
6    4   NaN
7    5     A
8    6     B
9    6   NaN
10   7     A
11   8     A
12   9     C
13  10     C
14  11     C
Use the code below:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
count = 0
for index, row in df1.iterrows():
    if row["Column1"] == 'total':
        # insert a blank row (one space per cell) just below the match
        df1 = pd.DataFrame(np.insert(df1.values, index + 1 + count,
                                     values=[" "] * len(df1.columns), axis=0),
                           columns=df1.columns)
        count += 1
print(df1)
Input:
  Column1 Column2 Column3 Column4
0      A0      B0      C0      D0
1   total      B1      C1      D1
2   total      B2      C2      D2
3      A3      B3      C3      D3
Output:
  Column1 Column2 Column3 Column4
0      A0      B0      C0      D0
1   total      B1      C1      D1
2
3   total      B2      C2      D2
4
5      A3      B3      C3      D3
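If the frame is large, the row-by-row loop can be avoided by computing every insertion position up front and calling np.insert once. This is only a sketch of that idea (using empty strings for the blank cells is an assumption), starting again from the original df1 above:
import numpy as np
import pandas as pd

# positions just below each row whose Column1 equals the search word
positions = np.flatnonzero(df1['Column1'].to_numpy() == 'total') + 1
blank_row = [''] * len(df1.columns)
out = pd.DataFrame(np.insert(df1.to_numpy(), positions, blank_row, axis=0),
                   columns=df1.columns)
print(out)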

How to find value of one column in Dataframe in another Dataframe [duplicate]

This question already has answers here:
pandas - filter dataframe by another dataframe by row elements
(7 answers)
Closed 4 years ago.
I have two dataframes namely:
df1:
col1 col2
A1 20
B1 22
A2 23
B2 24
df2:
Column1 Column2
A1 20
A2 23
A3 25
A4 28
B1 22
B2 24
B3 27
B4 33
Now I want to return all rows of df2 whose values also appear in df1.
Hence, the output should be:
df2:
A1 20
B1 22
A2 23
B2 24
You can use merge:
df2.merge(df1,left_on=['Column1','Column2'],right_on=['col1','col2'],how='left').dropna()[df2.columns]
Out[446]:
  Column1  Column2
0      A1       20
1      A2       23
4      B1       22
5      B2       24
Or use tuples with isin:
df2[df2.apply(tuple,1).isin(df1.apply(tuple,1))]
Out[453]:
  Column1  Column2
0      A1       20
1      A2       23
4      B1       22
5      B2       24
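As a footnote, the row-wise apply(tuple, 1) can get slow on larger frames. Here is a sketch of the same membership test via a MultiIndex (assuming the column order of df1 matches that of df2):
mask = pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df1))
result = df2[mask]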
You need the same column names in order to perform a full inner merge:
df1.rename(columns=dict(zip(df1.columns, df2.columns))).merge(df2)
Output:
  Column1  Column2
0      A1       20
1      B1       22
2      A2       23
3      B2       24
You can use merge:
import pandas as pd
df1 = pd.DataFrame({'col1': ['A1', 'B1', 'A2', 'B2'], 'col2': [20, 22, 23, 24]})
df2 = pd.DataFrame({'Column1': ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4'], 'Column2': [20, 23, 25, 28, 22, 24, 27, 33]})
df3 = df1.merge(df2, left_on='col1', right_on='Column1', how='left')
df4 = df3[['col1','Column2']]
print(df4)
  col1  Column2
0   A1       20
1   B1       22
2   A2       23
3   B2       24
