Merge unaligned DataFrames while filling with empty string - python

I have multiple DataFrames that I want to merge, and I would like the fill value to be an empty string rather than NaN. Some of the DataFrames already contain NaN values of their own. concat sort of does what I want, but it fills the unaligned positions with NaN. How do I avoid that, or specify the fill_value, to achieve something like this:
>>> df1
   Value1
0       1
1     NaN
2       3
>>> df2
   Value2
1       5
2     NaN
3       7
>>> merge_multiple_without_nan([df1, df2])
   Value1 Value2
0       1
1     NaN      5
2       3    NaN
3              7
This is what concat does:
>>> concat([df1, df2], axis=1)
   Value1 Value2
0       1    NaN
1     NaN      5
2       3    NaN
3     NaN      7

Well, I couldn't find anything in concat or merge that handles this by itself, but the code below works without much hassle:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Value1': [1, np.nan, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'Value2': [5, np.nan, 7]}, index=[1, 2, 3])
# Step 1: replace the real NaN values with temporary placeholders.
df = pd.concat([df1.fillna('X'), df2.fillna('Y')], axis=1)
df
  Value1 Value2
0      1    NaN
1      X      5
2      3      Y
3    NaN      7
Step 2:
df.fillna('', inplace=True)
df
  Value1 Value2
0      1
1      X      5
2      3      Y
3             7
Step 3:
df.replace(to_replace=['X', 'Y'], value=np.nan, inplace=True)
df
  Value1 Value2
0      1
1    NaN      5
2      3    NaN
3             7
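For convenience, the three steps can be folded into a small helper. This is just a sketch: it assumes the placeholder string never occurs in the real data (a single placeholder works as well as the separate 'X'/'Y' placeholders used above):
import numpy as np
import pandas as pd

def merge_fill_empty(dfs, placeholder='__NA__'):
    # Step 1: protect the real NaNs with a placeholder before concatenating.
    protected = [df.fillna(placeholder) for df in dfs]
    out = pd.concat(protected, axis=1)
    # Step 2: the remaining NaNs come from index misalignment; blank them out.
    out = out.fillna('')
    # Step 3: restore the original NaNs.
    return out.replace(placeholder, np.nan)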

After using concat, you can iterate over the DataFrames you merged, find the indices that are missing, and fill them in with an empty string. This should work for concatenating an arbitrary number of DataFrames, as long as your column names are unique.
# Concatenate all of the DataFrames.
merge_dfs = [df1, df2]
full_df = pd.concat(merge_dfs, axis=1)
# Find missing indices for each merged frame, fill with an empty string.
for partial_df in merge_dfs:
    missing_idx = full_df.index.difference(partial_df.index)
    full_df.loc[missing_idx, partial_df.columns] = ''
The resulting output using your sample data:
   Value1 Value2
0       1
1     NaN      5
2       3    NaN
3              7
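Wrapped up as the function named in the question, a sketch of the same approach (it assumes column names are unique across the inputs):
import pandas as pd

def merge_multiple_without_nan(dfs):
    full_df = pd.concat(dfs, axis=1)
    for partial_df in dfs:
        # Blank out the cells that this input frame never covered.
        missing_idx = full_df.index.difference(partial_df.index)
        full_df.loc[missing_idx, partial_df.columns] = ''
    return full_df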

Related

Pandas Dataframe - (Column re structure)

I have a dataframe with n columns. These contain letters; the number of letters a column contains varies, and a letter can appear in several columns. I need pandas code to reshape the sheet into one column per letter, where the rows contain the numbers of the columns that the letter appeared in.
Link to example problem
[Image example, roughly: the input columns hold strings of letters (ABDE, BBCC, EFB, ...), and the desired result has one column per letter A-F containing the numbers 1-4 of the columns that letter appears in.]
The image describes my problem better. Thank you in advance for any help.
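Since the linked image isn't reproduced here, the answer below appears to assume an input along these lines (a reconstruction inferred from the printed output, not the asker's actual data):
import pandas as pd

# Hypothetical input: columns 1-4 each hold a few letters, one per row.
df = pd.DataFrame({1: ['a', 'b', 'c', 'd'],
                   2: ['c', 'f', None, None],
                   3: ['a', 'b', 'e', None],
                   4: ['b', 'd', 'e', None]})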
Use DataFrame.stack with DataFrame.reset_index to reshape, then DataFrame.sort_values, aggregate lists, and finally create a DataFrame with the constructor and transpose:
s=df.stack().reset_index(name='a').sort_values('level_1').groupby('a')['level_1'].agg(list)
df1 = pd.DataFrame(s.tolist(), index=s.index).T
print (df1)
a     a  b     c     d     e     f
0     1  1     1     1     3     2
1     3  3     2     4     4  None
2  None  4  None  None  None  None
Or use GroupBy.cumcount as a counter and reshape with DataFrame.pivot:
df2 = df.stack().reset_index(name='a').sort_values('level_1')
df2['g'] = df2.groupby('a').cumcount()
df2 = df2.pivot(index='g', columns='a', values='level_1')
print (df2)
a    a  b    c    d    e    f
g
0    1  1    1    1    3    2
1    3  3    2    4    4  NaN
2  NaN  4  NaN  NaN  NaN  NaN
Finally, if necessary, remove the index and column names:
df1 = df1.rename_axis(index=None)
df2 = df2.rename_axis(index=None, columns=None)

How to duplicate and modify rows in a pandas dataframe?

I am attempting to construct dataframes using a large amount of data stored in txt files. I did not construct the data, however, so I am having to work with the frustrating formatting contained within. I couldn't get my code to work on the large data (and almost crashed my computer trying), so I set up a smaller dataframe like so:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311;21233332 ROB21288 99421
4 12412421 POW94277 12231;33221
5 54221721 IRS21231;YOU28137 13123
My frustration lies in the use of semicolons in the data. The data is meant to represent IDs, but in several cells multiple IDs have been packed into a single field. I want to repeat these rows so that I can search the data for individual IDs and end up with a table that looks like this:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311 ROB21288 99421
4 21233332 ROB21288 99421
5 12412421 POW94277 12231
6 12412421 POW94277 33221
7 54221721 YOU28137 13123
8 54221721 IRS21231 13123
Reindexing is not a problem, so long as the different IDs stay linked to each other and to their correct values.
Unfortunately, all my attempts to split the data have so far ended in abject failure. I have managed to set up a function that repeats rows containing a semicolon and to run each column through it, but I then fail to split the data afterwards.
def delete_dup(df, column):
    for a in column:
        location = df.loc[df.duplicated(subset=column, keep=False)]
        for x in location:
            semicolon = df.loc[df[column].str.contains(';', regex=True)]
            duplicate = semicolon.duplicated(subset=column, keep='first')
            tiny_df = semicolon.loc[duplicate]
            split_up = tiny_df[column].str.split(';')
    return pd.concat([df, split_up])
'Value' ID_1 ID_2 0
11122222 ABC42123 33333 NaN
21219299 YOF21233 88821 NaN
00022011 ERE00091 23124 NaN
75643311;21233332 ROB21288 99421 NaN
12412421 POW94277 12231;33221 NaN
54221721 IRS21231;YOU28137 13123 NaN
75643311;21233332 ROB21288 99421 NaN
54221721 IRS21231;YOU28137 13123 NaN
12412421 POW94277 12231;33221 NaN
NaN NaN NaN [75643311, 21233332]
I feel like this is the closest I've come, and it's still nowhere near what I want. Any "if" statements I try on DataFrames are met with "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().", which is so frustrating to read. Any ideas on how to make pandas do what I want?
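As a side note on the error quoted above: an if statement needs a single boolean, and a DataFrame (or boolean Series) doesn't reduce to one automatically, so it has to be collapsed first with .empty, .any() or .all(). A minimal illustration with made-up data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

# if df: ...             # raises the "truth value ... is ambiguous" ValueError
if not df.empty:         # True when the frame has at least one row
    print('df has rows')
if (df['a'] > 1).any():  # True when any value in the column satisfies the test
    print('some value is > 1')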
There are two parts to the solution. The first is to identify which rows have the semicolon, and the second is to create additional rows and concatenate them. The first part is done in contains_sc, and the second part is done by iterating over the rows and running the function create_additional_rows when a row with a semicolon is detected.
Hope this helps.
In[6]: import pandas as pd
In[7]: df = pd.DataFrame(
[['1', '2;3', '4', '5'],
['A', 'B', 'C', 'D;E'],
['T', 'U', 'V;W', 'X']],
index=['Val', 'ID1', 'ID2']
).T
In[8]: df
Out[8]:
Val ID1 ID2
0 1 A T
1 2;3 B U
2 4 C V;W
3 5 D;E X
In[9]: contains_sc = df.apply(lambda x: x.str.contains(';'))
In[10]: contains_sc
Out[10]:
Val ID1 ID2
0 False False False
1 True False False
2 False False True
3 False True False
In[11]:
def create_additional_rows(data_row, csc_row, split_char=';'):
    """Given a duplicated row return additional de-duplicated rows."""
    if len(csc_row[csc_row].dropna()) > 1:
        raise ValueError('Expect only a single column with a semicolon')
    col_with_sc = csc_row[csc_row].dropna().index[0]
    retval = []
    for item in data_row.loc[col_with_sc].split(split_char):
        copied = data_row.copy()
        copied.loc[col_with_sc] = item
        retval.append(copied)
    return retval
In[11]:
new_rows = []
for (idx, data_row), (_, csc_row) in zip(df.iterrows(), contains_sc.iterrows()):
    if True not in csc_row.values:
        new_rows.append(data_row)
        continue
    new_rows.extend(create_additional_rows(data_row, csc_row))
final = pd.concat(new_rows, axis='columns').T.reset_index(drop=True)
In[13]: final
Out[13]:
Val ID1 ID2
0 1 A T
1 2 B U
2 3 B U
3 4 C V
4 4 C W
5 5 D X
6 5 E X
Perhaps not the most elegant way, but this just solves the problem:
Step 1
Data we have:
df
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311;21233332 ROB21288 99421
4 12412421 POW94277 12231;33221
5 54221721 IRS21231;YOU28137 13123
Step 2
Let's split the misbehaving columns:
import numpy as np

df["'Value'_Dupe"] = df["'Value'"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["'Value'"] = df["'Value'"].apply(lambda x: x.split(";")[0])
df["ID_1_Dupe"] = df["ID_1"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_1"] = df["ID_1"].apply(lambda x: x.split(";")[0])
df["ID_2_Dupe"] = df["ID_2"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_2"] = df["ID_2"].apply(lambda x: x.split(";")[0])
df
    'Value'      ID_1   ID_2 'Value'_Dupe ID_1_Dupe ID_2_Dupe
0  11122222  ABC42123  33333          NaN       NaN       NaN
1  21219299  YOF21233  88821          NaN       NaN       NaN
2  00022011  ERE00091  23124          NaN       NaN       NaN
3  75643311  ROB21288  99421     21233332       NaN       NaN
4  12412421  POW94277  12231          NaN       NaN     33221
5  54221721  IRS21231  13123          NaN  YOU28137       NaN
Step 3
Let's combine the dupes with the original data into a single dataframe:
df2 = df[pd.notna(df["'Value'_Dupe"])][["'Value'_Dupe","ID_1","ID_2"]]
df2.columns = ["'Value'","ID_1","ID_2"]
df3 = df[pd.notna(df["ID_1_Dupe"])][["'Value'","ID_1_Dupe","ID_2"]]
df3.columns = ["'Value'","ID_1","ID_2"]
df4 = df[pd.notna(df["ID_2_Dupe"])][["'Value'","ID_1","ID_2_Dupe"]]
df4.columns = ["'Value'","ID_1","ID_2"]
df5 = df[["'Value'","ID_1","ID_2"]]
df_result = pd.concat([df5,df2,df3,df4])
df_result
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311 ROB21288 99421
4 12412421 POW94277 12231
5 54221721 IRS21231 13123
3 21233332 ROB21288 99421
5 54221721 YOU28137 13123
4 12412421 POW94277 33221
Please let me know if this solves your problem.
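For completeness: newer pandas versions (0.25 and later) have DataFrame.explode, which shortens this kind of split-and-duplicate step considerably. A sketch using a few of the sample rows (not one of the answers above):
import pandas as pd

df = pd.DataFrame({"'Value'": ['11122222', '75643311;21233332', '12412421'],
                   'ID_1': ['ABC42123', 'ROB21288', 'POW94277'],
                   'ID_2': ['33333', '99421', '12231;33221']})

# Split every cell on ';' and explode one column at a time, so each
# multi-valued cell becomes several rows while the other columns repeat.
for col in df.columns:
    df[col] = df[col].str.split(';')
    df = df.explode(col)

print(df.reset_index(drop=True))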

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not exactly an integer from that set.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for values that can't be parsed, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
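If both issues come up at once (stray strings like 1abc and values outside the allowed sets), the two ideas can be combined: coerce to numbers first, then filter with isin. A sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'a': ['1abc', '1', '3', '4'],
                   'b': ['6', '6', '7', '7']})

# Coerce to numbers first: junk like '1abc' becomes NaN and fails isin.
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
print(df[a.isin([1, 3]) & b.isin([6, 7])])
#    a  b
# 1  1  6
# 2  3  7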

Remove columns that have NA values for rows - Python

Suppose I have a dataframe as follows,
import pandas as pd
columns=['A','B','C','D', 'E', 'F']
index=['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns,index=index)
df['D']['1'] = 1
df['E'] = 1
df['F']['1'] = 1
df['A']['2'] = 1
df['B']['3'] = 1
df['C']['4'] = 1
df['A']['5'] = 1
df['B']['5'] = 1
df['C']['5'] = 1
df['D']['6'] = 1
df['F']['6'] = 1
df
     A    B    C    D  E    F
1  NaN  NaN  NaN    1  1    1
2    1  NaN  NaN  NaN  1  NaN
3  NaN    1  NaN  NaN  1  NaN
4  NaN  NaN    1  NaN  1  NaN
5    1    1    1  NaN  1  NaN
6  NaN  NaN  NaN    1  1    1
My condition is: I want to remove the columns that have values only when A, B, C (together) don't have a value, i.e. to find the columns that are mutually exclusive with the A, B, C columns. Put another way, I want to keep only the columns that have values when A or B or C has values. The output here would be to remove the D and F columns. But my dataframe has 400 columns, and I want a way to check this for A, B, C versus the rest of the columns.
One way I can think of is:
Remove NA rows from A,B,C
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
and get the NA count of all columns and compare it with the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better, more efficient way to do this?
Thanks
Rather than deleting rows one column at a time, just select the rows where A, B and C are not all NaN at the same time.
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
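If the columns themselves (D and F in the example) should be dropped as well, one possible follow-up builds on the same mask idea, applied to the original, unfiltered df (a sketch, assuming that is the intent):
# On the original (unfiltered) frame: rows where A, B and C are all missing.
abc_missing = df[['A', 'B', 'C']].isnull().all(axis=1)
# A column is mutually exclusive with A/B/C if it is empty in every other row
# but has values in those rows; drop such columns (D and F in the example).
exclusive = df.loc[~abc_missing].isnull().all() & df.loc[abc_missing].notnull().any()
df = df.drop(columns=exclusive[exclusive].index)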

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, producing NaN values where a column is absent from the respective df.
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
I had this problem today with concat, append and merge alike, and I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
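The same helper column can be built without the explicit loops; a sketch of an equivalent shortcut, assuming df1 and df2 are the two frames being combined as in the answer above:
import pandas as pd

df1 = df1.assign(helper=range(1, len(df1) + 1))
df2 = df2.assign(helper=range(len(df1) + 1, len(df1) + len(df2) + 1))
merged = df1.merge(df2, on='helper', how='outer')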
