Pandas: Add an empty row after every index in a MultiIndex dataframe - python

Consider below df:
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36
Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43
Cdf  EAD       34   33   23
     ADA       12   34   25
How can I add an empty row after each Name index?
Expected output:
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25

Use a custom function to add the empty rows inside GroupBy.apply:
def f(x):
    x.loc[('', ''), :] = ''
    return x
Or:
def f(x):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return pd.concat([x, pd.DataFrame('', columns=df.columns, index=[(x.name, '')])])

df = df.groupby(level=0, group_keys=False).apply(f)
print(df)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
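For reference, here is the first approach assembled into a runnable sketch; the sample frame is rebuilt by hand, so the numbers are the ones from the question:

```python
import pandas as pd

# Rebuild the sample MultiIndex frame from the question
idx = pd.MultiIndex.from_tuples(
    [('Abc', 'DS'), ('Abc', 'DMS'), ('Abc', 'ADA'),
     ('Bcd', 'BA'), ('Bcd', 'EAD'), ('Bcd', 'DS'),
     ('Cdf', 'EAD'), ('Cdf', 'ADA')],
    names=['Name', 'Subject'])
df = pd.DataFrame(
    [[45, 43, 34], [43, 23, 45], [32, 46, 36],
     [45, 35, 37], [23, 45, 12], [23, 35, 43],
     [34, 33, 23], [12, 34, 25]],
    index=idx, columns=['IA1', 'IA2', 'IA3'])

def f(x):
    # enlarge each Name group with one row of empty strings
    x.loc[('', ''), :] = ''
    return x

out = df.groupby(level=0, group_keys=False).apply(f)
print(out)
```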

Another way: build the extra index entries with pd.MultiIndex.from_product, combine them with the original index via Index.union, sort, and then use df.reindex with fill_value=''.
idx = df.index.union(pd.MultiIndex.from_product((df.index.levels[0], [''])), sort=False)
out = df.reindex(sorted(idx, key=lambda x: x[0]), fill_value='')
print(out)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS      43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
We pass sort=False to Index.union so the original order is retained; sorting the combined index on the first element of each tuple (Python's sort is stable) then returns:
sorted(idx,key=lambda x:x[0])
[('Abc', 'DS'),
('Abc', 'DMS'),
('Abc', 'ADA'),
('Abc', ''),
('Bcd', 'BA'),
('Bcd', 'EAD'),
('Bcd', 'DS'),
('Bcd', ''),
('Cdf', 'EAD'),
('Cdf', 'ADA'),
('Cdf', '')]

# reset index
dfn = df.reset_index()
# find the border idx of 'Name', [2, 5, 7]
idx_list = dfn.drop_duplicates('Name', keep='last').index
# use the border idx, create an empty df, and append to the origin df, then sort the index
df_append = pd.DataFrame('', index=idx_list, columns=dfn.columns)
# DataFrame.append was removed in pandas 2.0; use pd.concat, with a stable
# sort so each blank row lands after its group
obj = pd.concat([dfn, df_append]).sort_index(kind='stable').set_index(['Name', 'Subject'])
print(obj)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25

Related

Venn Diagram for each row in DataFrame

I have a set of data that looks like this:
Exp # ID Q1 Q2 All IDs Q1 unique Q2 unique Overlap Unnamed: 8
0 1 58 32 58 58 14 40 18 18
1 2 55 38 44 55 28 34 10 10
2 4 95 69 83 95 37 51 32 32
3 5 92 68 84 92 31 47 37 37
4 6 0 0 0 0 0 0 0 0
5 7 71 52 65 71 27 40 25 25
6 8 84 69 69 84 39 39 30 30
7 10 65 35 63 65 17 45 18 18
8 11 90 72 72 90 39 39 33 33
9 14 88 84 80 88 52 48 32 32
10 17 89 56 75 89 30 49 26 26
11 19 83 56 70 83 32 46 24 24
12 20 94 72 83 93 35 46 37 37
13 21 73 57 56 73 38 37 19 19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df ['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels=('Q1', 'Q2'), set_colors=('purple', 'skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
    plt.show()
    plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think the issue arises when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
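One way forward, sketched below with made-up sample values: venn2's subsets argument expects a tuple of numbers, but combined holds comma-separated strings, which is what triggers the ufunc 'absolute' error. Parsing each string into a tuple of ints and pairing each experiment with its own tuple (instead of nesting the loops) gives one diagram per experiment. The plotting calls are commented out so the sketch runs without matplotlib_venn installed:

```python
exp_no = [1, 2, 6]                                  # sample experiment numbers
combined = ['14,40,18', '28,34,10', '0,0,0']        # sample 'combined' strings

# venn2 wants (Q1-only, Q2-only, overlap) as numbers, not one string
subsets = [tuple(int(v) for v in s.split(',')) for s in combined]

for exp, subset in zip(exp_no, subsets):
    # one figure per experiment (uncomment with matplotlib_venn installed):
    # plt.figure(figsize=(4, 4))
    # plt.title(str(exp))
    # venn2(subsets=subset, set_labels=('Q1', 'Q2'),
    #       set_colors=('purple', 'skyblue'), alpha=0.7)
    # plt.savefig(f'{exp}output.png')
    # plt.close()
    pass
```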

Pandas groupby: remove duplicates

input: (CSV file)
name subject internal_1_marks internal_2_marks final_marks
abc python 45 50 47
pqr java 45 46 46
pqr python 40 33 37
xyz java 45 43 49
xyz node 40 30 35
xyz ruby 50 45 47
Expected output: (CSV file)
name  subject  internal_1_marks  internal_2_marks  final_marks
abc   python                 45                50           47
pqr   java                   45                46           46
      python                 40                33           37
xyz   java                   45                43           49
      node                   40                30           35
      ruby                   50                45           47
I've tried this:
df = pd.read_csv("student_info.csv")
df.groupby(['name', 'subject']).sum().to_csv("output.csv")
but it gives duplicates in the first column, as shown below.
name subject internal_1_marks internal_2_marks final_marks
abc python 45 50 47
pqr java 45 46 46
pqr python 40 33 37
xyz java 45 43 49
xyz node 40 30 35
xyz ruby 50 45 47
I need to remove duplicate in first column as shown in expected output.
Thanks.
A similar approach: mask the duplicated names and blank them out:
mask = df['name'].duplicated()
df.loc[mask.values,['name']] = ''
  name subject  internal_1_marks  internal_2_marks  final_marks
0  abc  python                45                50           47
1  pqr    java                45                46           46
2       python                40                33           37
3  xyz    java                45                43           49
4         node                40                30           35
5         ruby                50                45           47
You can blank out the duplicates after the groupby:
(df.groupby(['name', 'subject'])
   .sum()
   .reset_index()
   .assign(name=lambda x: x['name'].where(~x['name'].duplicated(), ''))
   .to_csv('filename.csv', index=False))
Alternatively, when reading the file back in, you can pass index_col to move the duplicated column into the index:
df = pd.read_csv('test.csv', index_col=[0])
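The masking idea in a runnable form (sample data trimmed to three columns for brevity; Series.mask is equivalent to the .loc assignment above):

```python
import pandas as pd

df = pd.DataFrame({
    'name':        ['abc', 'pqr', 'pqr', 'xyz', 'xyz', 'xyz'],
    'subject':     ['python', 'java', 'python', 'java', 'node', 'ruby'],
    'final_marks': [47, 46, 37, 49, 35, 47],
})

# blank out every repeated name, keeping the first occurrence of each
df['name'] = df['name'].mask(df['name'].duplicated(), '')
print(df)
```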

How to create a column that contains the penultimate value of each row?

I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original DataFrame.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy is faster.
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
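Note that the two answers can disagree when a row contains duplicates: np.sort(...)[:, -2] returns the second largest value counting repeats, while .unique()[-2] returns the second largest distinct value. A small sketch of both on hand-picked data:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 1], 'b': [5, 2], 'c': [3, 9]})

# second largest counting repeats: for the row [5, 5, 3] this is 5
counting_repeats = df.apply(lambda r: r.nlargest(2).iloc[-1], axis=1)

# second largest *distinct* value: for the row [5, 5, 3] this is 3
distinct = df.apply(lambda r: r.sort_values().unique()[-2], axis=1)
```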

Reshape Dataframe from horizontal column to vertical

Is there any efficient way to reshape a dataframe from:
(A1, A2, A3, B1, B2, B3, C1, C2, C3, TT, YY and ZZ are columns)
A1 A2 A3 B1 B2 B3 C1 C2 C3 TT YY ZZ
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
11 22 33 44 55 66 77 88 99 23 24 25
TO:
HH JJ KK TT YY ZZ
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
11 22 33 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
44 55 66 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
77 88 99 23 24 25
HH, JJ and KK are new columns: the A, B and C column groups are stacked vertically under them, while TT, YY and ZZ stay alongside each block:
A1 A2 A3 TT YY ZZ
B1 B2 B3 TT YY ZZ
C1 C2 C3 TT YY ZZ
Thanks for your help
You can use column splitting and concatenation:
df = pd.read_clipboard()
ColSets = [df.columns[i:i+3] for i in np.arange(0, len(df.columns) - 3, 3)]
LCols = df.columns[-3:]
NewDf = pd.concat([df[ColSet].join(df[LCols]).T.reset_index(drop=True).T for ColSet in ColSets])
NewDf.columns = ['HH', 'JJ', 'KK', 'TT', 'YY', 'ZZ']
Out:
HH JJ KK TT YY ZZ
0 11 22 33 23 24 25
1 11 22 33 23 24 25
2 11 22 33 23 24 25
3 11 22 33 23 24 25
4 11 22 33 23 24 25
5 11 22 33 23 24 25
0 44 55 66 23 24 25
1 44 55 66 23 24 25
2 44 55 66 23 24 25
3 44 55 66 23 24 25
4 44 55 66 23 24 25
5 44 55 66 23 24 25
0 77 88 99 23 24 25
1 77 88 99 23 24 25
2 77 88 99 23 24 25
3 77 88 99 23 24 25
4 77 88 99 23 24 25
5 77 88 99 23 24 25
A bit longer than the previous solution:
# extract columns ending with digits
abc = df.filter(regex=r'\d$')

# sort the columns into separate lists by their first letter
from itertools import groupby
from operator import itemgetter

cols = sorted(abc.columns, key=itemgetter(0))
filtered_columns = [list(g) for k, g in groupby(cols, key=itemgetter(0))]

# iterate through the column groups and stack them
abc_stack = pd.concat([abc.filter(col).set_axis(['HH', 'JJ', 'KK'], axis='columns')
                       for col in filtered_columns],
                      ignore_index=True)

# filter for columns ending with letters
tyz = df.filter(regex='[A-Z]$')

# repeat so this dataframe is the same length as abc_stack
tyz_stack = pd.concat([tyz] * len(filtered_columns), ignore_index=True)

# combine both dataframes
res = pd.concat([abc_stack, tyz_stack], axis=1)
res
HH JJ KK TT YY ZZ
0 11 22 33 23 24 25
1 11 22 33 23 24 25
2 11 22 33 23 24 25
3 11 22 33 23 24 25
4 11 22 33 23 24 25
5 11 22 33 23 24 25
6 44 55 66 23 24 25
7 44 55 66 23 24 25
8 44 55 66 23 24 25
9 44 55 66 23 24 25
10 44 55 66 23 24 25
11 44 55 66 23 24 25
12 77 88 99 23 24 25
13 77 88 99 23 24 25
14 77 88 99 23 24 25
15 77 88 99 23 24 25
16 77 88 99 23 24 25
17 77 88 99 23 24 25
UPDATE : 2021-01-08
The reshaping process could be abstracted by using the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from github:
The data you shared has patterns (some columns end with 1, others with 2, and the rest with 3); we can use these patterns to reshape the data:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=("HH", "JJ", "KK"),
                 names_pattern=("1$", "2$", "3$"),
                 index=("TT", "YY", "ZZ"))
   .sort_index(axis="columns"))
Basically, it looks for columns that end with 1 and collects them into one column ("HH"), then does the same for 2 ("JJ") and 3 ("KK").
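The same reshape can also be written as a plain loop over the letter groups plus one pd.concat, with no extra dependency (a minimal sketch on a two-row version of the sample data; 'HH', 'JJ', 'KK' are the target names from the question):

```python
import pandas as pd

df = pd.DataFrame({'A1': [11, 11], 'A2': [22, 22], 'A3': [33, 33],
                   'B1': [44, 44], 'B2': [55, 55], 'B3': [66, 66],
                   'C1': [77, 77], 'C2': [88, 88], 'C3': [99, 99],
                   'TT': [23, 23], 'YY': [24, 24], 'ZZ': [25, 25]})

parts = []
for prefix in ['A', 'B', 'C']:
    # take one letter group and rename its columns to the target names
    sub = df[[f'{prefix}{i}' for i in (1, 2, 3)]].copy()
    sub.columns = ['HH', 'JJ', 'KK']
    # keep TT/YY/ZZ alongside each block (joined on the row index)
    parts.append(sub.join(df[['TT', 'YY', 'ZZ']]))

out = pd.concat(parts, ignore_index=True)
```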

Filter pandas DataFrame through list of dicts

I have a DataFrame of arbitrary length, with X columns (let's say 10):
>>> names = ['var_' + str(x) for x in range(1, 11)]
>>> names
['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7', 'var_8', 'var_9', 'var_10']
>>> df = pd.DataFrame(np.random.randint(100, size=(10,10)), columns = names)
>>> df
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
0 39 49 6 39 16 41 8 86 23 52
1 6 16 21 20 81 97 83 25 56 73
2 72 97 43 50 10 46 22 75 7 18
3 20 35 69 59 14 24 57 31 47 20
4 39 93 45 80 74 87 83 50 52 67
5 93 75 83 67 40 46 79 11 31 95
6 75 76 57 82 69 98 74 75 93 13
7 35 19 28 67 39 23 72 16 63 67
8 93 87 52 25 63 29 46 64 78 12
9 81 43 4 90 88 64 1 83 26 22
Now I want to filter this DataFrame row-wise using a list of dicts:
>>> test_dict_1 = {'var_1': 89, 'var_2': 12, 'var_3': 34}
>>> test_dict_2 = {'var_7': 3, 'var_2': 11, 'var_4': 19, 'var_1': 9}
>>> test_dict_3 = {'var_3': 31}
>>> filter = [test_dict_1, test_dict_2, test_dict_3]
As a result (a dict? a DataFrame? several DataFrames?) I want only those rows where at least one of the filters passes (i.e. every variable in the filter equals the row's value). Besides that, I of course need to know which filters passed.
I'm quite new to pandas, so I'm a bit unsure whether I can do this without for loops. Any solutions, please?
I know about chained conditions like df[(df.A == 1) & (df.D == 6)], but is it somehow possible to have several different filters?
The final goal is to have every row flagged with the filters it passes, without loops.
I'm not sure if I get it right, but if you want to filter your dataframe by several criteria from a dictionary, you could do something like this:
In [107]: df
Out[107]:
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
0 45 36 84 24 86 26 44 6 44 15
1 72 16 67 75 87 89 8 68 32 49
2 9 49 0 4 77 75 65 9 45 70
test_dict_1 = {'var_1': 72, 'var_2': 16, 'var_3': 67}
cond = True
for var in test_dict_1.keys():
    cond = cond & (df[var] == test_dict_1[var])
df = df.loc[cond]
then you'll get:
In [109]: df
Out[109]:
var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
1 72 16 67 75 87 89 8 68 32 49
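To flag every row against several dicts without looping over rows, one option is to build one boolean column per filter and combine them. A sketch with made-up sample values (the filter_1/filter_2 column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'var_1': [72, 9], 'var_2': [16, 49], 'var_3': [67, 0]})
filters = [{'var_1': 72, 'var_2': 16, 'var_3': 67}, {'var_3': 0}]

# one boolean column per filter: True where every key/value pair matches;
# DataFrame.eq aligns the Series' index with the frame's columns
flags = pd.DataFrame({
    f'filter_{i}': df[list(d)].eq(pd.Series(d)).all(axis=1)
    for i, d in enumerate(filters, 1)})

matched = df[flags.any(axis=1)]   # rows passing at least one filter
```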
