How to groupby multiple columns in pandas based on name? - python

I want to create a new dataframe with columns calculated as means of columns with similar names from this dataframe:
B6_i B6_ii B6_iii ... BXD80_i BXD80_ii BXD81_i
data ...
Cd38 0.598864 -0.225322 0.306926 ... -0.312190 0.281429 0.424752
Trim21 1.947399 2.920681 2.805861 ... 1.469634 2.103585 0.827487
Kpnb1 -0.458240 -0.417507 -0.441522 ... -0.314313 -0.153509 -0.095863
Six1 1.055255 0.868148 1.012298 ... 0.142565 0.264753 0.807692
The new dataframe should look like this:
B6 BXD80 ... BXD81
data
Cd38 -0.041416 -0.087859 ... 0.424752
Trim21 15.958981 3.091500 ... 0.827487
Kpnb1 -0.084471 0.048250 ... -0.095863
Six1 0.927383 0.037745 ... 0.807692
(e.g. B6 = (B6_i + B6_ii + B6_iii) / 3, grouping on all characters before the underscore "_")
Some groups contain several similarly named columns while others contain only one (like 'BXD81_i'), so I need a method that can handle a varying number of columns for each mean calculation.

You can aggregate the mean per group of columns, keyed by the values before _:
df.columns = df.columns.str.split('_', expand=True)
df1 = df.groupby(level=0, axis=1).mean()
Or:
df1 = df.groupby(lambda x: x.split('_')[0], axis=1).mean()
print (df1)
B6 BXD80 BXD81
data
Cd38 0.226823 -0.015381 0.424752
Trim21 2.557980 1.786609 0.827487
Kpnb1 -0.439090 -0.233911 -0.095863
Six1 0.978567 0.203659 0.807692
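A self-contained sketch of the same idea on a small invented frame; note that `groupby(..., axis=1)` is deprecated in recent pandas, so transposing first is a forward-compatible spelling:

```python
import pandas as pd

df = pd.DataFrame(
    {'B6_i': [0.5, 1.9], 'B6_ii': [-0.2, 2.9], 'B6_iii': [0.3, 2.8],
     'BXD81_i': [0.4, 0.8]},
    index=['Cd38', 'Trim21'])

# Group column labels by the text before the first underscore and average.
# Transposing avoids the deprecated axis=1 argument in pandas 2.x.
out = df.T.groupby(lambda c: c.split('_')[0]).mean().T
print(out)
```

Groups with a single column (like BXD81_i) simply come through unchanged, so the varying group sizes are handled automatically.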

Related

How to separate characters of a column based on its intersection with another column?

There are two columns in my df; the second column contains the data of the first column plus extra characters (letters and/or numbers):
values = {
'number': [2830, 8457, 9234],
'nums': ['2830S', '8457M', '923442']
}
df = pd.DataFrame(values, columns=['number', 'nums'])
The extra characters always come after the common characters! How can I separate the characters that are not common between the two columns? I am looking for a simple solution, not a loop that checks every character.
Replace common characters by empty string:
f_diff = lambda x: x['nums'].replace(x['number'], '')
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
Update
If the number values are always the first characters of the nums column, you can use a simpler function:
f_diff2 = lambda x: x['nums'][len(x['number']):]
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff2, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
I would delete the prefix of the string. For this you can use the method apply() to run the following function on each row:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text
df['nums'] = df.apply(lambda x: remove_prefix(x['nums'], str(x['number'])), axis=1)
df
Output:
number nums
0 2830 S
1 8457 M
2 9234 42
If you have Python >= 3.9 you only need str.removeprefix (cast number to str first, since that column holds integers):
df['nums'] = df.apply(lambda x: x['nums'].removeprefix(str(x['number'])), axis=1)
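Putting the 3.9+ variant together as a runnable sketch on the sample data (with `number` cast to `str`, since the column holds integers):

```python
import pandas as pd

df = pd.DataFrame({'number': [2830, 8457, 9234],
                   'nums': ['2830S', '8457M', '923442']})

# str.removeprefix (Python 3.9+) strips the leading match if present,
# and returns the string unchanged otherwise.
df['extra'] = df.apply(lambda x: x['nums'].removeprefix(str(x['number'])), axis=1)
print(df['extra'].tolist())  # ['S', 'M', '42']
```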

Summing certain columns by similar part of its name

How to sum columns using an already-fetched list of the unique parts of the column names?
list = ['13-14', '15-16']
DataFrame:
X.13-14 Y.13-14 Z.13-14 X.15-16 ...
id
182761 10274.00 6097173.00 5758902.00 3345841.00
I.e. I want to create '13-14' and '15-16' columns with corresponding sum of (X.13-14,Y.13-14,Z.13-14), then (X.15-16,Y.15-16,Z.15-16)
If you want to sum columns by the part of the name after ., use a lambda function in DataFrame.groupby with axis=1:
df1 = df.groupby(lambda x: x.split('.')[1], axis=1).sum()
print (df1)
13-14 15-16
id
182761 11866349.0 3345841.0
Or if you need only the columns from the list:
L = ['13-14', '15-16']
df.columns = df.columns.str.extract(f'({"|".join(L)})', expand=False)
df1 = df.sum(level=0, axis=1)[L]
print (df1)
13-14 15-16
id
182761 11866349.0 3345841.0
If you need to add them to the original:
df = df.join(df1)
print (df)
X.13-14 Y.13-14 Z.13-14 X.15-16 13-14 15-16
id
182761 10274.0 6097173.0 5758902.0 3345841.0 11866349.0 3345841.0
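A runnable sketch of the grouped sum on the sample row; since `axis=1` groupby and `sum(level=...)` have been deprecated/removed in recent pandas, this version transposes instead:

```python
import pandas as pd

df = pd.DataFrame({'X.13-14': [10274.0], 'Y.13-14': [6097173.0],
                   'Z.13-14': [5758902.0], 'X.15-16': [3345841.0]},
                  index=pd.Index([182761], name='id'))

# Group column labels by the text after the dot and sum each group.
df1 = df.T.groupby(lambda c: c.split('.')[1]).sum().T
print(df1)
```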

Create a dataframe from a dictionary with multiple keys and values

So I have a dictionary with 20 keys, all structured like so (same length):
{'head': X Y Z
0 -0.203363 1.554352 1.102800
1 -0.203410 1.554336 1.103019
2 -0.203449 1.554318 1.103236
3 -0.203475 1.554299 1.103446
4 -0.203484 1.554278 1.103648
... ... ... ...
7441 -0.223008 1.542740 0.598634
7442 -0.222734 1.542608 0.599076
7443 -0.222466 1.542475 0.599520
7444 -0.222207 1.542346 0.599956
7445 -0.221962 1.542225 0.600375
I'm trying to convert this dictionary to a dataframe, but I'm having trouble getting the output I want. What I want is a dataframe structured like so: columns = [headX, headY, headZ, etc.] with rows 0-7445.
Is that possible? I've tried:
df = pd.DataFrame.from_dict(mydict, orient="columns")
And different variations of that, but can't get the desired output.
Any help will be great!
EDIT: The output I want has 60 columns in total, i.e. an X, Y and Z column from each of the 20 keys. So the columns would be [key1X, key1Y, key1Z, key2X, key2Y, key2Z, ...], and the dataframe would be 60 columns x 7446 rows.
Use concat with axis=1 and then flatten the MultiIndex with f-strings:
df = pd.concat(mydict, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
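A minimal runnable sketch with two invented keys ('hand' and 'head') standing in for the 20 keys from the question:

```python
import pandas as pd

# Two small stand-ins for the question's dict of 20 same-shaped DataFrames.
d = {'hand': pd.DataFrame({'X': [7.0, 8.0], 'Y': [9.0, 0.0], 'Z': [1.0, 2.0]}),
     'head': pd.DataFrame({'X': [1.0, 2.0], 'Y': [3.0, 4.0], 'Z': [5.0, 6.0]})}

df = pd.concat(d, axis=1)                                 # MultiIndex columns: (key, axis)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')   # flatten to 'hand_X', ...
print(df.columns.tolist())
```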

How to split a column into many columns where the name of this columns change

I build a dataframe inside a function where the name of each column changes continuously, so I can't refer to a column by name (e.g. df['name']) and then split it into many columns. The number of columns and rows of these dataframes is not constant. I need to split any column that contains more than one item into multiple columns.
For example:
This is one of the dataframes which I have:
name/one name/three
(192.26949,) (435.54,436.65,87.3,5432)
(189.4033245,) (45.51,56.612, 54253.543, 54.321)
(184.4593252,) (45.58,56.6412,654.876,765.66543)
I want to convert it to:
name/one name/three1 name/three2 name/three3 name/three4
192.26949 435.54 436.65 87.3 5432
189.4033245 45.51 56.612 54253.543 54.321
184.4593252 45.58 56.6412 654.876 765.66543
If the data are tuples in all rows and all columns, use concat with the DataFrame constructor and DataFrame.add_prefix:
df = pd.concat([pd.DataFrame(df[c].tolist()).add_prefix(c) for c in df.columns], axis=1)
print (df)
name/one0 name/three0 name/three1 name/three2 name/three3
0 192.269490 435.54 436.6500 87.300 5432.00000
1 189.403324 45.51 56.6120 54253.543 54.32100
2 184.459325 45.58 56.6412 654.876 765.66543
If the values may be string representations of tuples:
import ast
L = [pd.DataFrame([ast.literal_eval(y) for y in df[c]]).add_prefix(c) for c in df.columns]
df = pd.concat(L, axis=1)
print (df)
name/one0 name/three0 name/three1 name/three2 name/three3
0 192.269490 435.54 436.6500 87.300 5432.00000
1 189.403324 45.51 56.6120 54253.543 54.32100
2 184.459325 45.58 56.6412 654.876 765.66543
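A runnable sketch of the string-repr case, using the first two rows of the sample data:

```python
import ast
import pandas as pd

df = pd.DataFrame({'name/one': ['(192.26949,)', '(189.4033245,)'],
                   'name/three': ['(435.54,436.65)', '(45.51,56.612)']})

# Parse each string into a real tuple, expand the tuples into columns,
# and prefix the new columns with the original column name.
L = [pd.DataFrame([ast.literal_eval(y) for y in df[c]]).add_prefix(c)
     for c in df.columns]
out = pd.concat(L, axis=1)
print(out.columns.tolist())
```

Columns with a single-element tuple produce one column ('name/one0'); longer tuples fan out into as many columns as their length.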

Diff between two dataframes in pandas

I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer') but I was not sure what column to pass in as the 'key' as there really isn't one and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will get a dataframe of all rows that don't exist in both df1 and df2.
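A minimal runnable sketch of this indicator-based diff, on a pair of invented frames (merging with no `on=` joins on all common columns):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [1, 2, 4], 'b': ['x', 'y', 'w']})

# The indicator column tags each row as 'left_only', 'right_only', or 'both';
# dropping 'both' leaves exactly the rows outside the intersection.
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
print(diff_df)
```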
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
You can use this function; the output is an ordered dict of 6 dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default=('df1','df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see these rows; they are placed one on top of the other and labelled with the dataframe name so you know which dataframe each row belongs to.
'drop' can take a list of columns to be excluded when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
    Fruits Quantity
1  coconut        3
In [11]: dict1['df2_only']
Out[11]:
   Fruits Quantity
3  durian        4
In [12]: dict1['Diff']
Out[12]:
   Fruits Quantity df1 or df2
0  banana        2        df1
1  banana        3        df2
In [13]: dict1['Merge']
Out[13]:
  Fruits Quantity
0  apple        1
Here is the code:
import pandas as pd
from collections import OrderedDict as od
def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()
    # There could be columns known to be different, hence allow the user to pass them as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]
    # Step 1 - Check if the no. of columns is the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)
    # Step 2a - Check if the sets of column headers are the same
    # (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))
    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right DataFrame...')
        df2 = df2[col1]
    # Step 3 - Check the datatypes are the same [order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes,
                                  'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')
    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()
    # Step 5 - Check for duplicate uids
    if isinstance(uid, (str, list)):
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))
    # Checks complete, begin merge. Remember to dedupe; provide labels for common_no_match.
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if isinstance(uid, str):
            uid = [uid]
        if isinstance(uid, list):
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]
            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]
            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]
            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)
    return dict_result
Set df2.columns = df1.columns
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are your distinct rows.
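A runnable sketch of this recipe on a pair of small invented frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [1, 2, 4], 'b': ['x', 'y', 'w']})
df2.columns = df1.columns

# Using every column as the index turns row-set operations into index operations.
i1 = df1.set_index(df1.columns.tolist()).index
i2 = df2.set_index(df2.columns.tolist()).index
only_in_df1 = i1.difference(i2)
only_in_df2 = i2.difference(i1)
print(list(only_in_df1), list(only_in_df2))
```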
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer-join result.
Similarly, you can get the inner-join result, then take the difference between the two; that would be what you want.
