I have two pandas DataFrames with identical index and column names (though in general they need not be identical).
>>> df_L = pd.DataFrame({'X': [1, 3],
...                      'Y': [5, 7]})
>>> df_R = pd.DataFrame({'X': [2, 4],
...                      'Y': [6, 8]})
I can join them together and assign suffixes.
>>> df_L.join(df_R, lsuffix='_L', rsuffix='_R')
X_L Y_L X_R Y_R
0 1 5 2 6
1 3 7 4 8
But what I want is to make 'L' and 'R' sub-columns under both 'X' and 'Y'.
The desired DataFrame looks like this:
>>> pd.DataFrame(columns=pd.MultiIndex.from_product([['X', 'Y'], ['L', 'R']]),
...              data=[[1, 2, 5, 6],
...                    [3, 4, 7, 8]])
   X     Y
   L  R  L  R
0  1  2  5  6
1  3  4  7  8
Is there a way I can combine the two original DataFrames to get this desired DataFrame?
You can use pd.concat with the keys argument along the columns axis (axis=1), then swap and sort the column levels:
df = (pd.concat([df_L, df_R], keys=['L', 'R'], axis=1)
        .swaplevel(0, 1, axis=1)
        .sort_index(level=0, axis=1))
>>> df
X Y
L R L R
0 1 2 5 6
1 3 4 7 8
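Broken into steps (a sketch using the question's data), the chain first stacks the keys on the outer column level, then moves them underneath and regroups:

```python
import pandas as pd

df_L = pd.DataFrame({'X': [1, 3], 'Y': [5, 7]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8]})

# Step 1: concat with keys puts 'L'/'R' on the *outer* level: (L, X), (L, Y), (R, X), (R, Y)
step1 = pd.concat([df_L, df_R], keys=['L', 'R'], axis=1)
# Step 2: swaplevel moves 'L'/'R' under 'X'/'Y': (X, L), (Y, L), (X, R), (Y, R)
step2 = step1.swaplevel(0, 1, axis=1)
# Step 3: sort_index groups the sub-columns: (X, L), (X, R), (Y, L), (Y, R)
df = step2.sort_index(level=0, axis=1)
print(df[('X', 'R')].tolist())  # [2, 4]
```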
For those looking for an answer to the more general problem of joining two data frames with different indices or columns into a multi-index table:
# Prepend a key-level to the column index
# https://stackoverflow.com/questions/14744068
df_L = pd.concat([df_L], keys=["L"], axis=1)
df_R = pd.concat([df_R], keys=["R"], axis=1)
# Join the two dataframes
df = df_L.join(df_R)
# Reorder levels if needed:
df = df.reorder_levels([1,0], axis=1).sort_index(axis=1)
Example:
# Data:
df_L = pd.DataFrame({'X': [1, 3, 5], 'Y': [7, 9, 11]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8], 'Z': [10, 12]})
# Result:
# X Y Z
# L R L R R
# 0 1 2.0 7 6.0 10.0
# 1 3 4.0 9 8.0 12.0
# 2 5 NaN 11 NaN NaN
This also solves the OP's special case of equal indices and columns. Note that prepending the key level can equivalently be done by assigning a MultiIndex directly:
df_L.columns = pd.MultiIndex.from_product([["L"], df_L.columns])
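Putting the recipe together as one runnable script with the example data above (df_L has an extra row, df_R an extra column Z):

```python
import pandas as pd

df_L = pd.DataFrame({'X': [1, 3, 5], 'Y': [7, 9, 11]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8], 'Z': [10, 12]})

# Prepend a key level to each column index
df_L2 = pd.concat([df_L], keys=["L"], axis=1)
df_R2 = pd.concat([df_R], keys=["R"], axis=1)

# join aligns on the (different) indices; df_R's missing row 2 becomes NaN
df = df_L2.join(df_R2)

# Put the original column names on top and the keys underneath
df = df.reorder_levels([1, 0], axis=1).sort_index(axis=1)
print(df)
```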
I would like to combine all row values into a list, whenever a non-null string is found in another column.
For example if I have this pandas dataframe:
df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Z': [np.nan, np.nan, "A", np.nan, "A", "B", np.nan, np.nan]})
X Y Z
0 1 10 NaN
1 2 20 NaN
2 3 30 A
3 4 40 NaN
4 5 50 A
5 6 60 B
6 7 70 NaN
7 8 80 NaN
I would like to combine all previous row values from columns X and Y into lists, whenever column Z has a non-null string, like this:
df = pd.DataFrame({'X': [[1, 2, 3], [4, 5], [6]],
                   'Y': [[10, 20, 30], [40, 50], [60]],
                   'Z': ["A", "A", "B"]})
X Y Z
0 [1, 2, 3] [10, 20, 30] A
1 [4, 5] [40, 50] A
2 [6] [60] B
So what I managed to do is "solve" it by using for loops. I would hope there is a better way to do it with pandas but I can't seem to find it.
My for loop solution:
Get "Z" ids without NaNs:
z_idx_withoutNaN = df[df["Z"].notna()].index.tolist()
[2, 4, 5]
Loop over ids and create lists with "X" and "Y" values:
x_list = []
y_list = []
for i, index in enumerate(z_idx_withoutNaN):
    if i == 0:
        x_list = [df.iloc[:index + 1]["X"].values.tolist()]
        y_list = [df.iloc[:index + 1]["Y"].values.tolist()]
    else:
        x_list.append(df.iloc[previous_index:index + 1]["X"].values.tolist())
        y_list.append(df.iloc[previous_index:index + 1]["Y"].values.tolist())
    previous_index = index + 1
Finally, create df:
pd.DataFrame({"X": x_list,
              "Y": y_list,
              "Z": df[df["Z"].notna()]["Z"].values.tolist()})
X Y Z
0 [1, 2, 3] [10, 20, 30] A
1 [4, 5] [40, 50] A
2 [6] [60] B
Let us do
out = (df.groupby(df['Z'].iloc[::-1].notna().cumsum())
         .agg({'X': list, 'Y': list, 'Z': 'first'})
         .dropna()
         .sort_index(ascending=False))
Out[23]:
X Y Z
Z
3 [1, 2, 3] [10, 20, 30] A
2 [4, 5] [40, 50] A
1 [6] [60] B
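The trick is that reversing Z before the cumulative sum makes each non-null marker close its own group, so every row gets the group number of the next non-null Z at or below it (a sketch of the grouping key):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Z': [np.nan, np.nan, "A", np.nan, "A", "B", np.nan, np.nan]})

# Reverse, mark non-null Z, take the running count, then restore row order
key = df['Z'].iloc[::-1].notna().cumsum()
print(key.sort_index().tolist())  # [3, 3, 3, 2, 2, 1, 0, 0]
```

Rows sharing a key belong to one group; the trailing rows with key 0 have no Z label and are removed by dropna().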
Here is one option:
(df.groupby(df.Z.shift().notnull().cumsum())
   .agg(list)
   .assign(Z=lambda x: x.Z.str[-1])
   [lambda x: x.Z.notnull()])
X Y Z
Z
0 [1, 2, 3] [10, 20, 30] A
1 [4, 5] [40, 50] A
2 [6] [60] B
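Here shift() delays the non-null markers by one row, so each non-null Z starts a new group on the row after it; .str[-1] then pulls the group's label out of the end of the aggregated list. A sketch of the intermediate key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Z': [np.nan, np.nan, "A", np.nan, "A", "B", np.nan, np.nan]})

# A non-null Z opens a new group on the *next* row
key = df.Z.shift().notnull().cumsum()
print(key.tolist())  # [0, 0, 0, 1, 1, 2, 3, 3]
```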
I have a dataframe with 5 columns. I want to loop through 3 of them and store each row's values from those columns in a dict or list (whichever is more efficient).
Example:
   A   B   C  D  E
1  10  20   9  5  4
2   4  55  14  5  2
3   3   3   9  7  7
I would like to create three lists as such
index_1 = [10,20,4]
index_2 = [4,55,2]
index_3 = [3,3,7]
I have no idea how to go forward after looping through the columns
cols = ['A', 'B', 'E']
for col in cols:
    df[col]
Try:
index_1, index_2, index_3 = [list(row) for row in df[["A", "B", "E"]].values]
Use locals() to create 3 python variables:
cols = ['A', 'B', 'E']
for idx, col in enumerate(cols, 1):
    locals()[f"index_{idx}"] = df[col].tolist()
>>> index_1
[10, 4, 3]
>>> index_2
[20, 55, 3]
>>> index_3
[4, 2, 7]
We can try
d = df[['A','B','E']].T.to_dict('list')
Out[227]: {1: [10, 20, 4], 2: [4, 55, 2], 3: [3, 3, 7]}
d[1]
Out[231]: [10, 20, 4]
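Reconstructing the question's frame (the exact values are an assumption inferred from the desired lists), the transpose-then-to_dict step works because transposing turns the index labels 1, 2, 3 into columns, and to_dict('list') keys the result by column:

```python
import pandas as pd

# Hypothetical reconstruction of the question's data
df = pd.DataFrame({'A': [10, 4, 3], 'B': [20, 55, 3], 'C': [9, 14, 9],
                   'D': [5, 5, 7], 'E': [4, 2, 7]}, index=[1, 2, 3])

d = df[['A', 'B', 'E']].T.to_dict('list')
print(d)  # {1: [10, 20, 4], 2: [4, 55, 2], 3: [3, 3, 7]}
```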
I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
Column 'A' is the key in both data frames.
How do I replace the values of df1's column 'B' with the values of df2's column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One of possible solutions:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
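As a runnable sketch with the question's data, the update approach works because set_index('A') aligns the two B series by key before overwriting:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})

wrk = df1.set_index('A').B          # B values keyed by A
wrk.update(df2.set_index('A').B)    # overwrite wherever the A keys match
df1 = wrk.reset_index()
print(df1['B'].tolist())  # [8, 10, 12]
```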
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
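For completeness, a Series.map lookup (not from the answers above, just a common idiom) performs the same replacement and, unlike the .values assignment, aligns by key rather than by position:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})

# Look up each A in df2's A -> B mapping; unmatched keys would become NaN
df1['B'] = df1['A'].map(df2.set_index('A')['B'])
print(df1['B'].tolist())  # [8, 10, 12]
```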
I wanted to use column values in one csv file to mask rows in another csv,
as in:
df6 = pd.read_csv('py_all1a.csv') # file with multiple columns
df7 = pd.read_csv('artexclude1.csv') # file with multiple columns
#
# csv df6 col 1 has the same header and data type as col 8 in df7.
# I want to mask rows in df6 that have a matching col value to any
# in df7. The data in each column is a text value (single word).
#
mask = df6.iloc[:,1].isin(df7.iloc[:,8])
df6[~mask].to_csv('py_all1b.csv', index=False)
#
On that last line, I tried [mask] both with the tilde (which left py_all1b.csv identical to df6) and without it (which produced a file containing only the column headers).
The answer below demonstrates the approach on a specific data set; at first it did not work for me because of inconsistencies between the text values (one entry had a space while another did not).
The answer is correct, and a paragraph has been added to it showing how the whitespace issue can also be resolved.
Try converting to a set first:
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
This ensures the comparison is made against the values themselves, and membership tests against a set are fast.
Example
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# 0 1 2
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
# 3 10 11 12
df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
# 0 1 2
# 0 1 2 3
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 1 2 3
With strings
It still works:
df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 a 2 3
When you are dealing with string data, there may be problems with whitespace that can cause matches to be missed. As described in this answer, you may need to instead use:
df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
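If skipinitialspace is not enough (for example, trailing rather than leading spaces), stripping whitespace on both sides before the comparison is another option. A sketch with hypothetical stand-in frames for the two CSVs:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSVs; note the stray leading space in ' apple'
df6 = pd.DataFrame({'id': [1, 2, 3], 'fruit': ['apple', 'pear', 'fig']})
df7 = pd.DataFrame({'excl': [' apple', 'fig']})

# Strip whitespace on both sides so 'apple' and ' apple' compare equal
mask = df6['fruit'].str.strip().isin(set(df7['excl'].str.strip()))
print(df6[~mask]['fruit'].tolist())  # ['pear']
```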
I have a data frame say df1 with MULTILEVEL INDEX:
A B C D
0 0 0 1 2 3
4 5 6 7
1 2 8 9 10 11
3 2 3 4 5
and I have another data frame with 2 common columns in df2 also with MULTILEVEL INDEX
X B C Y
0 0 0 0 7 3
1 4 5 6 7
1 2 8 2 3 11
3 2 3 4 5
I need to remove the rows from df1 where the values of column B and C are the same as in df2, so I should be getting something like this:
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
I have tried to do this by getting the index of the common elements and then remove them via a list, but they are all messed up and are in multi-level form.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or (I find it to be the simplest way). Note that the element-wise comparison assumes df1 and df2 share the same index:
df1 = df1.iloc[np.where(np.logical_or(df1['B'] != df2['B'], df1['C'] != df2['C']))]
of course don't forget to:
import numpy as np
output:
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
A B C D
0 0 0 1 2 3
1 2 8 9 10 11