Join columns of several dataframes into a new dataframe - python

I have a dictionary of dataframes. Each of these dataframes has a column 'defrost_temperature'. What I want to do is make one new dataframe that collects all those columns, keeping them as separate columns.
This is what I am doing right now:
merged_defrosts = pd.DataFrame()
for key in df_dict.keys():
    merged_defrosts[key] = df_dict[key]["defrost_temperature"]
But unfortunately, only the first column is filled correctly. The other columns are filled with NaN as shown in the screenshot
The different defrosts are not necessarily the same length. (the fourth dataframe is 108 rows, the others are 109 rows)

You can try merging on the index, always keeping the larger dataframe on the left.
df_result = pd.DataFrame()
for i, df in enumerate(df_dict.values()):
    # suffixes disambiguate columns that share a name
    s1, s2 = f'_{i}', f'_{i+1}'
    m1, m2 = df_result.shape[0], df.shape[0]
    if m1 == 0:
        df_result = df
    elif m1 >= m2:
        df_result = df_result.merge(df, how='left', left_index=True, right_index=True, suffixes=(s1, s2))
    else:
        df_result = df.merge(df_result, how='left', left_index=True, right_index=True, suffixes=(s2, s1))
This creates undesired column names, though, which you can rename manually afterwards.

You could try to concat the dataframes horizontally after making the common column the index:
merged_defrosts = pd.concat([df.set_index("defrost_temperature") for df in df_dict.values()]
).reset_index()
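Another option, as a minimal sketch: if the NaN columns come from the dataframes having different indexes (an assumption, since the screenshot is not available here), you can reset each column's index and concatenate the Series side by side. The sample df_dict below is made up for illustration; the dictionary keys become the column names.

```python
import pandas as pd

# Hypothetical sample data: two dataframes whose indexes do not overlap
df_dict = {
    "a": pd.DataFrame({"defrost_temperature": [1.0, 2.0, 3.0]}, index=[0, 1, 2]),
    "b": pd.DataFrame({"defrost_temperature": [4.0, 5.0]}, index=[10, 11]),
}

# Drop each original index so rows line up by position, then concatenate
# column-wise; passing a dict makes the keys the new column names
merged_defrosts = pd.concat(
    {key: df["defrost_temperature"].reset_index(drop=True) for key, df in df_dict.items()},
    axis=1,
)
print(merged_defrosts)
```

Shorter columns are padded with NaN at the end, which should cover the case where one dataframe has 108 rows and the others 109.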


How to extract the data from a column with priority order given in another file?

I have two dataframes
df1
IMPACT Rank
HIGH 1
MODERATE 2
LOW 3
MODIFIER 4
df2['Annotation']
Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
Each row contains multiple annotations separated by a comma (,). I want to keep only one annotation per row, chosen by its Rank in df1.
My expected output is:
df['RANKED']
RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
I tried the following code to generate the output, but it did not give me the expected result:
d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max()+1
def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)
df2['RANKED'] = df2['Annotation'].apply(f)
Any help appreciated..
TL;DR
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
Step-by-step
First you define your dataframes
df1 = pd.DataFrame({'IMPACT':['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank':[1,2,3,4]})
df2 = pd.DataFrame({
'Annotation':[
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
})
Now here is the tricky part. Split each string in the original Annotation column on commas, which gives a column of lists. Then explode that column so that each annotation gets its own row, with the original Annotation string repeated alongside it.
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
Next, you extract the IMPACT word from each value in the RANKED column.
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
Then, you merge df1 and df2 to get the rank of each RANKED.
df_merge = df2.merge(df1, how='left', on='IMPACT')
Finally, this is the easy part where you discard everything you do not want in the final dataframe. This can be done via groupby.
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
RANKED Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187... 2
A|missense_variant|HIGH|PERM1|ENSG00000187642|... 1
A|missense_variant|LOW|PERM1|ENSG00000187642|T... 3
OR by dropping duplicates
df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])
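If you would rather keep the apply approach from the question, here is a minimal sketch of a corrected function. It assumes the IMPACT value is always the third pipe-separated field of each annotation (true for the sample rows, but an assumption in general). The name best_annotation and the shortened sample string are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"IMPACT": ["HIGH", "MODERATE", "LOW", "MODIFIER"], "Rank": [1, 2, 3, 4]})
rank = df1.set_index("IMPACT")["Rank"].to_dict()
worst = df1["Rank"].max() + 1  # rank used for unknown IMPACT values


def best_annotation(annotations: str) -> str:
    """Return the comma-separated entry whose IMPACT has the lowest Rank."""

    def impact_rank(entry: str) -> int:
        fields = entry.split("|")
        # Assumption: IMPACT is the third field of every entry
        impact = fields[2] if len(fields) > 2 else None
        return rank.get(impact, worst)

    # min returns the first entry on ties, so the original order breaks ties
    return min(annotations.split(","), key=impact_rank)


# Shortened, hypothetical sample row (the real rows have many more fields)
df2 = pd.DataFrame({
    "Annotation": [
        "A|intron_variant|MODIFIER|PERM1,"
        "A|missense_variant|MODERATE|PERM1,"
        "A|upstream_gene_variant|MODIFIER|PERM1"
    ]
})
df2["RANKED"] = df2["Annotation"].apply(best_annotation)
print(df2["RANKED"].iloc[0])
```

Compared with the explode and merge route, this keeps one row per original annotation string and does not need a groupby, at the cost of a Python-level loop per row.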

Joining/merging multiple dataframes

I have 4 dataframe objects, each with 1 row and X columns, that I want to join. Here's a screenshot of them:
I want them to become one big row.
Thanks to anybody who helps!
You could use pd.concat with axis=1, as below:
df1 = pd.DataFrame(columns=list('ABC'))
df1.loc[0] = [1,1.23,'Hello']
df2 = pd.DataFrame(columns=list('DEF'))
df2.loc[0] = [2,2.23,'Hello1']
df3 = pd.DataFrame(columns=list('GHI'))
df3.loc[0] = [3,3.23,'Hello3']
df4 = pd.DataFrame(columns=list('JKL'))
df4.loc[0] = [4,4.23,'Hello4']
pd.concat([df1,df2,df3,df4],axis=1)

Get the missing columns from one dataframe and append it to another dataframe

I have a DataFrame df1 with the columns below. I need to compare the column headers of df1 with the list of headers from df2.
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare df1 with df2 and, if any columns are missing, add them to df1 with blank values.
I tried concat and also append, and neither worked. With concat, I'm not able to add the column e, and with append, it appends all the columns from df1 and df2. How can I add only the missing column to df1, in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
    df1.to_csv(path + file_name, sep='|')
else:
    print("something is missing, continuing")
    #pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
    all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join
By specifying axis=1 we focus on columns
This returns a tuple of both an aligned df1 and df2 with the calling dataframe being the first element. So I grab the first element with [0]
pandas.DataFrame.reindex
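df1.reindex(columns=df1.columns.union(df2.columns))
pandas.Index objects support set-like operations through methods. df1.columns.union(df2.columns) is the union of those two index objects. (Older pandas versions also accepted df1.columns | df2.columns for this, but in recent versions that operator no longer performs a set union.) I then reindex using the result.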
Let's first create the two dataframes:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add the columns of df2 that are not in df1 to df1 (filled with NaN):
for i in list(df2):
    if i not in list(df1):
        df1[i] = np.nan
Now display the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]
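If the goal is to get exactly the header order from the question (a|b|c|d|e|f) rather than an alphabetical sort, one possible sketch is to reindex df1 directly against the target list of headers. The one-row df1 and the target_cols name below are made up to mirror the question's example.

```python
import pandas as pd

# Hypothetical one-row df1 matching the values in the question
df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=list("abcdf"))
target_cols = list("abcdef")  # the header list taken from df2

# reindex keeps existing columns, adds missing ones as NaN,
# and returns the columns in the order of target_cols
aligned = df1.reindex(columns=target_cols)

# NaN is written as an empty field by default
csv_text = aligned.to_csv(sep="|", index=False)
print(csv_text)
```

Note that any column of df1 that is not in target_cols would be dropped by this call, so it only fits when the target list is a superset of df1's columns.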

Pandas: Add row depending on index

I want to build a dataframe (df3) that keeps updating itself by adding rows from other dataframes (df1, df2) based on an index ("ID").
When a new dataframe is added and an overlapping index is found, the data should be updated. If the index is not found, the data should be added along with the new index.
df1 = pd.DataFrame({"Proj. Num" :["A"],'ID':[000],'DATA':["NO_DATA"]})
df1 = df1.set_index(["ID"])
df2 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["OK"], })
df2 = df2.set_index(["ID"])
df3 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["NO_OK"], })
df3 = df3.set_index(["ID"])
#df3 = pd.concat([df1,df2, df3]) #Concat,merge,join???
df3
I have tried concatenating with verify_integrity=False, but it just gives an error, and I think there is a simpler, nicer way to do it.
Solution with concat + Index.duplicated for boolean mask and filter by boolean indexing:
df3 = pd.concat([df1, df2, df3])
df3 = df3[~df3.index.duplicated()]
print (df3)
DATA Proj. Num
ID
0 NO_DATA A
100 OK B
Another solution by comment, thank you:
df3 = pd.concat([df3,df1])
df3 = df3[~df3.index.duplicated(keep='last')]
print (df3)
DATA Proj. Num
ID
100 NO_OK B
0 NO_DATA A
You can concatenate all the dataframes along the index, group by index, and decide which element to keep from each group sharing the same index.
From your question it looks like you want to keep the last (most recently updated) element with the same index. The order in which you pass the dataframes to pd.concat therefore matters.
For a list of other methods, see here.
res = pd.concat([df1, df2, df3], axis = 0)
res.groupby(res.index).last()
Which gives:
DATA Proj. Num
ID
0 NO_DATA A
100 NO_OK B
#update existing rows
df3.update(df1)
#append new rows
df3 = pd.concat([df3,df1[~df1.index.isin(df3.index)]])
#update existing rows
df3.update(df2)
#append new rows
df3 = pd.concat([df3,df2[~df2.index.isin(df3.index)]])
Out[2438]:
DATA Proj. Num
ID
100 OK B
0 NO_DATA A
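As a further sketch, DataFrame.combine_first can do the update and the append in one step: values from the calling dataframe win, and rows that only exist in the other dataframe are added. The names base and incoming and their contents are made up for illustration, loosely following the question's data.

```python
import pandas as pd

# Existing data (hypothetical): one row with ID 100
base = pd.DataFrame(
    {"Proj. Num": ["B"], "DATA": ["NO_OK"]},
    index=pd.Index([100], name="ID"),
)

# Newer data (hypothetical): a new ID 0 and an updated row for ID 100
incoming = pd.DataFrame(
    {"Proj. Num": ["A", "B"], "DATA": ["NO_DATA", "OK"]},
    index=pd.Index([0, 100], name="ID"),
)

# Values from `incoming` take priority; rows found only in `base` are kept
result = incoming.combine_first(base)
print(result)
```

One caveat: combine_first only fills positions where the calling dataframe is null, so a NaN in the newer data will not overwrite an existing value.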

How to sum columns from three different dataframes with a common key

I am reading in an excel spreadsheet about schools with three sheets as follows.
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print(xl.sheet_names)
df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)
df1.columns = [chr(65+i) for i in range(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns
The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.
1. Filter rows for which unique column 'D' ids actually exist in all three dataframes. If the school doesn't appear in all three sheets then I discard it.
2. For each remaining row (school), add up the values in column 'G' in the three dataframes.
3. Do the same for column 'K'.
I am new to pandas but how should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.
Attempted solution
df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] + df2['K'] + df3['K']
df1['SumG']= df1['G'] + df2['G'] + df3['G']
After concatenating the dataframes, you can use groupby and size to find the values of "D" that exist in all three dataframes, since each value appears only once per dataframe. You can then use this to filter the concatenated dataframe and sum whichever columns you need, e.g.:
df = pd.concat([df1, df2, df3])
counts = df.groupby('D').size()
criteria = df['D'].isin(counts[counts == 3].index)
df[criteria].groupby('D')[['G', 'K']].sum()
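Here is a small self-contained sketch of the same idea on made-up data, assuming 'D' is still a regular column (not the index) and each school id appears at most once per sheet. The toy values and the names counts, in_all and sums are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the three sheets; school ids are in column 'D'
df1 = pd.DataFrame({"D": [1, 2, 3], "G": [10, 20, 30], "K": [1, 2, 3]})
df2 = pd.DataFrame({"D": [1, 2], "G": [1, 2], "K": [10, 20]})
df3 = pd.DataFrame({"D": [2, 1, 4], "G": [100, 200, 400], "K": [5, 6, 7]})

df = pd.concat([df1, df2, df3])

# Number of sheets each id appears in (valid because ids are unique per sheet)
counts = df.groupby("D").size()
in_all = counts[counts == 3].index

# Keep only schools present in all three sheets, then sum G and K per school
sums = df[df["D"].isin(in_all)].groupby("D")[["G", "K"]].sum()
print(sums)
```

Because the grouping is done on the id itself, the rows do not need to be in the same order across sheets.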
