I have two dataframes with missing values that I want to merge so that each fills in the other's gaps.
A simple visualisation:
df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3
df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2
The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
Basically, merge the dataframes without duplicates in column "A", and fill any missing values in a row by comparing values for the same "A" across the dataframes.
I tried many things I saw in the pandas documentation plus solutions on Stack Exchange, but failed every time.
These are all the different things I tried:
pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))
(You can see I was quite desperate at the end)
Any idea how to do that?
Did you try combine_first with A as the index?
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
Or you can use groupby with first (empty strings must first be replaced with NaN, which needs numpy):
import numpy as np
pd.concat([df1, df2]).replace('', np.nan).groupby('A', as_index=False).first()
Out[53]:
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
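For reference, a self-contained sketch of both approaches above, assuming the empty cells are represented as NaN from the start (which sidesteps the replace('', np.nan) step):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', np.nan, 'C3']})
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': [np.nan, 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})

# combine_first fills df1's gaps with df2's values, aligned on 'A'
out1 = df1.set_index('A').combine_first(df2.set_index('A')).reset_index()

# groupby('A').first() keeps the first non-null value per column
out2 = pd.concat([df1, df2]).groupby('A', as_index=False).first()
```

Both produce the same four-row result with no gaps.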
Setup
Since you wrote them as csvs, I'm going to assume they were csvs.
df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)
Solution
Use fillna after having used align:
pd.DataFrame.fillna(*df1.align(df2))
B C
A
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
You can use reset_index if you insist but I think it's prettier to leave it as it is.
You can use the pandas categorical data type to set an ordered list of categories, sort on these ordered categories, and drop rows with null values to get your desired result:
from pandas.api.types import CategoricalDtype
# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', '', 'C3']})
# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
'B': ['', 'B4', 'B2'],
'C': ['C1', 'C4', 'C2']})
# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])
# specify the letters in your dataset
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}
# iterate over the letters
for let in letters:
    # create the ordered categories - set the range for the max # of values
    cats = ['{}{}'.format(let, num) for num in range(1000)]
    # create the ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into the placeholder
    cat_dict[let] = cat_type
# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')
# which outputs desired results
A B C
0 A1 B1 C1
2 A2 B2 C2
2 A3 B3 C3
1 A4 B4 C4
While this is a bit longer, one nice thing about doing it this way is that your data can be in any order upon input. This inserts an inherent rank into the values within each column, so A1 < A2 < A3 and so on. It also enables you to sort the columns.
Related
I have a large dataframe a that I would like to split, or explode, to become dataframe b (the real dataframe a contains 90 columns).
I tried to look up solutions to a similar problem but did not find any, since this one relates to the column names rather than the values in the cells.
Any pointer to a solution, or to an existing function in the pandas library, would be appreciated.
Thank you in advance.
from pandas import DataFrame
import numpy as np
# current df
a = DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1':'b1','C-1':'c1', 'A-2': 'a2', 'B-2':'b2','C-2':'c2'}])
# desired df
b = DataFrame([{'ID': 'ID_1', 'A': 'a1', 'B':'b1', 'C':'c1'},
{'ID': 'ID_1','A': 'a2', 'B':'b2','C':'c2'}])
One idea I have is to split this dataframe into two dataframes (dataframe 1 containing columns A-1 to C-1, dataframe 2 containing columns A-2 to C-2), rename the columns to A/B/C, and then concatenate both. But I am not sure about efficiency, since I have 90 columns and they will grow over time.
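The split-rename-concat idea described in the question can be sketched like this (a minimal two-group version; the suffix list would need to grow with the real data):

```python
import pandas as pd

a = pd.DataFrame([{'ID': 'ID_1',
                   'A-1': 'a1', 'B-1': 'b1', 'C-1': 'c1',
                   'A-2': 'a2', 'B-2': 'b2', 'C-2': 'c2'}])

parts = []
# one pass per numeric suffix; extend this list for more column groups
for num in ['1', '2']:
    cols = ['A-{}'.format(num), 'B-{}'.format(num), 'C-{}'.format(num)]
    part = a[['ID'] + cols].copy()
    part.columns = ['ID', 'A', 'B', 'C']  # strip the numeric suffix
    parts.append(part)

b = pd.concat(parts, ignore_index=True)
```

It works, but the answers below avoid maintaining the suffix list by hand.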
This approach will generate some intermediate columns which will be removed later on.
First bring down those labels (A-1,...) from the header into a column
df = pd.melt(a, id_vars=['ID'], var_name='label')
Then split the label into character and number
df[['char', 'num']] = df['label'].str.split('-', expand=True)
Finally drop the label, set_index before unstack, and take care of the final table formats.
df.drop('label', axis=1)\
.set_index(['ID', 'num', 'char'])\
.unstack()\
.droplevel(0, axis=1)\
.reset_index()\
.drop('num', axis=1)
pd.wide_to_long works well here assuming a small number of known stubnames:
b = (
pd.wide_to_long(a, stubnames=['A', 'B', 'C'], sep='-', i='ID', j='to_drop')
.droplevel(level='to_drop')
.reset_index()
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Alternatively set_index, split the columns on '-' with str.split and stack:
b = a.set_index('ID')
b.columns = b.columns.str.split('-', expand=True)
b = b.stack().droplevel(-1).reset_index()
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
One option is with the pivot_longer function from pyjanitor, which abstracts the reshaping process and is also efficient:
# pip install pyjanitor
import janitor
import pandas as pd
a.pivot_longer(index="ID", names_to=".value", names_pattern="(.).+")
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
The .value tells the function which part of the column names to retain. It takes its cue from names_pattern, which should be a regular expression with groups; the captured groups are what stay as headers. In this case, the first letter of each column is what we are interested in, represented by (.).
Another option, with pivot_longer, is to use the names_sep parameter:
(a.pivot_longer(index="ID", names_to=(".value", "num"), names_sep="-")
.drop(columns="num")
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Again, only values in the columns associated with .value remain as headers.
Like this?
import pandas as pd

df = pd.DataFrame(data={k: [i*k for i in range(1, 5)] for k in range(1, 9)})
assert df.shape[1] % 2 == 0
df_1 = df.iloc[:, 0:df.shape[1] // 2]
df_2 = df.iloc[:, df.shape[1] // 2:]
df_2.columns = df_1.columns
df_sum = pd.concat((df_1, df_2), axis=0)
display(df_sum)  # display() works in Jupyter; use print(df_sum) elsewhere
I have a dataframe that looks like this:
np.random.seed(21)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B1', 'B2', 'B3'])
df['current_State'] = [df['B1'][0], df['B1'][1], df['B2'][2], df['B2'][3], df['B3'][4], df['B3'][5], df['B1'][6], df['B2'][7]]
df
I need to create a new column that contains the name of the column where the value of 'current_State' is the same, this is the desired output:
I tried many combinations of apply and lambda functions but without success. Any help is very welcome!
You can compare the current_State column with all the remaining columns to create a boolean mask, then use idxmax along axis=1 on this mask to get the name of the column where the value in the given row equal to corresponding value in current_State:
c = 'current_State'
df['new_column'] = df.drop(columns=c).eq(df[c], axis=0).idxmax(axis=1)
In case there is a possibility that a row has no matching value, we can instead use:
c = 'current_State'
m = df.drop(columns=c).eq(df[c], axis=0)
df['new_column'] = m.idxmax(axis=1).mask(~m.any(axis=1))
>>> df
A B1 B2 B3 current_State new_column
0 -0.051964 -0.111196 1.041797 -1.256739 -0.111196 B1
1 0.745388 -1.711054 -0.205864 -0.234571 -1.711054 B1
2 1.128144 -0.012626 -0.613200 1.373688 -0.613200 B2
3 1.610992 -0.689228 0.691924 -0.448116 0.691924 B2
4 0.162342 0.257229 -1.275456 0.064004 0.064004 B3
5 -1.061857 -0.989368 -0.457723 -1.984182 -1.984182 B3
6 -1.476442 0.231803 0.644159 0.852123 0.231803 B1
7 -0.464019 0.697177 1.567882 1.178556 1.567882 B2
I have these two dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A3'],'B': ['B0', 'B1', 'B2']})
I would like to merge these two dataframes now by the entries in column 'A'. But I do not want to keep the rows that match; I want the ones that do not match each other.
That means, I would like to get a new dataframe which looks like this one:
df_new = pd.DataFrame({'A':['A3'], 'B':['B2']})
How could I do this?
Thanks a lot!
Merge with an outer join gives you a similar result:
df1.merge(df2, how = 'outer', on = 'A', indicator = True)
A B_x B_y _merge
0 A0 B0 B0 both
1 A1 B1 B1 both
2 A2 B2 NaN left_only
3 A3 NaN B2 right_only
which can be filtered by query:
df1.merge(df2, how = 'outer', on = 'A', indicator = True).query("_merge != 'both'")
A B_x B_y _merge
2 A2 B2 NaN left_only
3 A3 NaN B2 right_only
Note that indicator=True created the _merge column, which is suitable for filtering rows.
Try this, using isin:
df2.loc[~df2.A.isin(df1.A)]
A B
2 A3 B2
I have two DataFrames with different values and numbers of rows. All index values in the second DataFrame are in the first, so I need a DataFrame with the same number of rows as the second DataFrame and the data of both DataFrames.
Example
df1 = pd.DataFrame(columns=["A", "B", "C"], data=[['a1' , 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']], index=[1,2,3])
df2 = pd.DataFrame(columns=["D", "E"], data=[['d2' , 'e2'], ['d3' , 'e3']], index=[2,3])
print (df1)
print (df2)
Out:
A B C
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
D E
2 d2 e2
3 d3 e3
I want:
A B C D E
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d3 e3
I tried this:
merge = df1.merge(df2, how='inner', left_on=df1.index, right_on=df2.index)
But I receive an error:
IndexError: indices are out-of-bounds
Note: Not all indices in df1 are in df2, but all indices in df2 are in df1.
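No answer appears for this one here, but since both frames share index labels, an inner join on the index is one likely fix (a sketch against the example data above, not the real DataFrames):

```python
import pandas as pd

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']],
                   index=[1, 2, 3])
df2 = pd.DataFrame(columns=["D", "E"],
                   data=[['d2', 'e2'], ['d3', 'e3']],
                   index=[2, 3])

# join aligns on the index by default; 'inner' keeps only shared labels
merged = df1.join(df2, how='inner')
# equivalent: df1.merge(df2, left_index=True, right_index=True, how='inner')
```

This avoids passing the index objects through left_on/right_on, which is what triggered the IndexError.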
This question already has answers here:
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
(8 answers)
Closed 5 years ago.
I have below data frame
A B C
1 A1 B1 [C1, C2]
2 A2 B2 [C3, C4]
I wish to transform it to
A B C
1 A1 B1 C1
2 A1 B1 C2
3 A2 B2 C3
4 A2 B2 C4
What should I do? Thanks
One really simple way of doing it is as follows:
import pandas as pd
df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],['A2', 'B2', ['C3', 'C4']]], columns = ['A', 'B', 'C'])
# note: this assumes every list in 'C' has exactly two elements
df1 = df.copy()
df1['C'] = df['C'].apply(lambda x: x[0])  # take the first element
df2 = df.copy()
df2['C'] = df['C'].apply(lambda x: x[1])  # take the second element
pd.concat([df1, df2]).sort_values('A')
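For lists of arbitrary length, DataFrame.explode (available since pandas 0.25) does the same job in one call:

```python
import pandas as pd

df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],
                   ['A2', 'B2', ['C3', 'C4']]], columns=['A', 'B', 'C'])

# explode turns each list element into its own row, repeating A and B
out = df.explode('C').reset_index(drop=True)
```

Unlike the two-copy approach, this needs no assumption about how many elements each list holds.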