I have a DataFrame defined like this:
df1 = pd.DataFrame({"col1":[1,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
"col2":[np.nan,3,np.nan,4,np.nan,np.nan,np.nan,5,6],
"col3":[np.nan,np.nan,7,np.nan,np.nan,8,9,np.nan, np.nan]})
I want to transform it into a DataFrame like:
df2 = pd.DataFrame({"col_name":['col1','col2','col3','col2','col1',
'col3','col3','col2','col2'],
"value":[1,3,7,4,2,8,9,5,6]})
If possible, can we reverse this process too? By that I mean convert df2 into df1.
I don't want to go through the DataFrame iteratively as it becomes too computationally expensive.
You can stack it:
out = (df1.stack().astype(int).droplevel(0)
          .rename_axis('col_name').reset_index(name='value'))
Output:
col_name value
0 col1 1
1 col2 3
2 col3 7
3 col2 4
4 col1 2
5 col3 8
6 col3 9
7 col2 5
8 col2 6
To go from out back to df1, you could pivot:
out1 = pd.pivot(out.reset_index(), index='index', columns='col_name', values='value')
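As a quick sanity check of the round trip (a minimal sketch): the astype(int) in the forward step was only needed because NaN forces the columns to float, and the pivot reintroduces NaN, so the restored frame comes back with df1's original float dtype and should compare equal:
# Strip the axis names pivot adds (cosmetic; equals ignores them,
# but this makes the two frames print identically), then compare.
restored = out1.rename_axis(index=None, columns=None)
print(restored.equals(df1))  # expected: True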
I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1 to df2. I don't have a fixed merge column in df1, though. I would like to merge on col1 if the cell value of col1 exists in df2.col3, and on col2 if the cell value of col2 exists in df2.col3. So in the above example, merge on col1, col2, and then col1. (This is just an example; I actually have more than two columns.)
I could do this but I'm not sure if it's ok.
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
Perform the merges in the preferred order, and use combine_first to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left')))
For a generic method with many columns:
from functools import reduce

cols = ['col1', 'col2']

out = reduce(
    lambda a, b: a.combine_first(b),
    [df1.merge(df2, left_on=col, right_on='col3', how='left')
     for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
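Note that the order of cols sets the merge priority: combine_first keeps the non-null values of its caller, so columns earlier in the list win.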
Better example:
Adding another column to df2 to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
I think your solution can be modified to build the merge key as a Series by comparing all the columns from the list, and then merge on this Series:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = df1.assign(merge_col=df1[cols].where(df1[cols].isin(df2.col3))
                                    .bfill(axis=1).iloc[:, 0])
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s: compare all columns with DataFrame.isin, replace non-matching values with missing values via DataFrame.where, and then, because the merge priority follows the column order, back fill missing values along the rows and select the first column by position:
print (df1[cols].isin(df2.col3))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3)))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
I am looking to compare two CSVs. Both CSVs will have nearly identical data; however, the second CSV will have two identical rows that CSV 1 does not have. I would like the program to output both of those rows so I can see which row is present in CSV 2 but not in CSV 1, and how many times that row is present.
Here is my current logic:
import csv
import pandas as pd
import numpy as np
data1 = {"Col1": [0,1,1,2],
"Col2": [1,2,2,3],
"Col3": [5,2,1,1],
"Col4": [1,2,2,3]}
data2 = {"Col1": [0,1,1,2,4,4],
"Col2": [1,2,2,3,4,4],
"Col3": [5,2,1,1,4,4],
"Col4": [1,2,2,3,4,4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)
print(df)
Here is my current outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
Here is my desired outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
As of right now, it only outputs the row once even though CSV 2 has the row twice. What can I do so that it not only shows the missing row, but shows it once for each time it appears in the second CSV? Thanks in advance!
There is almost always a built-in pandas function meant to do what you want that will be better than trying to re-invent the wheel.
df = df2[~df2.isin(df1).all(axis=1)]
# OR df = df2[df2.ne(df1).all(axis=1)]
print(df)
Output:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
You can use:
df2[~df2.eq(df1).all(axis=1)]
Result:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
Or (if you want the index to be 0 and 1):
df2[~df2.eq(df1).all(axis=1)].reset_index(drop=True)
Result:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
N.B.
You can also use df2[df2.ne(df1).all(axis=1)] instead of df2[~df2.eq(df1).all(axis=1)].
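Note that the isin/eq/ne approaches above compare rows positionally through index alignment, so they assume the shared rows sit at the same positions in both frames. If that is not guaranteed, a left merge with indicator=True is an order-independent alternative (a sketch; it still reports a missing row once per occurrence in df2):
# Flag the rows of df2 that have no exact match in df1.
merged = df2.merge(df1.drop_duplicates(), how='left', indicator=True)
missing = (merged[merged['_merge'] == 'left_only']
           .drop(columns='_merge')
           .reset_index(drop=True))
print(missing)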
I am trying to merge two Excel files in Python, similar to using Excel's VLOOKUP function.
Based on my code, the result would be:
col1_x  col2_x  col3_x  col4_y  col5_y  col6_y
1       2       3       4       5       6
7       8       9       10      11      12
My code :
df1 = pd.read_excel("dropped_file.xlsx")
df2 = pd.read_excel("original.xlsx")
result = pd.merge(df1, df2, on = ['col1', 'col3', 'col4'], how='left')
result.to_excel("result.xlsx", index=False)
Does anyone have an idea how to drop the _x and _y suffixes from the column names?
The reason for _x and _y is that the merge produces duplicated column names. To avoid ambiguous col1, col1, col2, col2 columns, pandas appends _x and _y, so the output is col1_x, col1_y, col2_x, col2_y.
If you need to remove _x and _y (note that the output may then contain duplicated column names), use Series.str.replace, anchored with $ so only trailing suffixes are stripped:
df.columns = df.columns.str.replace(r'_x$|_y$', '', regex=True)
print (df)
col1 col2 col3 col4 col5 col6
0 1 2 3 4 5 6
1 7 8 9 10 11 12
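Alternatively (a sketch, assuming you only want to keep the left-hand copy of each duplicated column), you can control the suffixes at merge time so that only the right-hand duplicates are tagged, then drop them:
result = pd.merge(df1, df2, on=['col1', 'col3', 'col4'], how='left',
                  suffixes=('', '_y'))
# Discard df2's copies of the duplicated columns.
result = result.drop(columns=[c for c in result.columns if c.endswith('_y')])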
Given the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
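Note that update modifies the Series in place and only uses non-NA values from the passed Series, which is why row 1 keeps 'B': COL1 is NaN there, so COL2 is left untouched.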
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we modify your df slightly, you will see that this works, and in fact it will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the series.
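One caveat: first_valid_index() returns None for a row that is entirely NaN, so x[x.first_valid_index()] would raise a KeyError on such a row. A minimal guard (a sketch, assuming NaN is the desired result for all-NaN rows):
import numpy as np

df['COL3'] = df.apply(
    lambda x: x[x.first_valid_index()] if x.first_valid_index() is not None else np.nan,
    axis=1)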
You can also use mask, which replaces the values where COL1 is NaN with the corresponding values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
I am just expecting this as output. I don't need col4 and col5 in the output, and I also don't need any sum, count, mean, etc. I tried using pandas to achieve this but had no luck.
My code:
input_df = pd.read_csv("input.csv")
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code is returning 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need dataframe like above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
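A minimal end-to-end sketch (assuming input.csv is the file from the question):
import pandas as pd

# Keep only the three key columns, then drop duplicate rows.
input_df = pd.read_csv("input.csv")
print(input_df[['col1', 'col2', 'col3']].drop_duplicates())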
First you can use .drop() to delete col4 and col5 as you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then, you can use .drop_duplicates() to delete the duplicate rows in col1, col2 and col3.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
You will notice that in the output the index is 0, 2 instead of 0, 1. To fix that you can do this:
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
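Equivalently, starting from the original df, reset_index(drop=True) renumbers the rows in one chained step:
df = (df.drop(['col4', 'col5'], axis=1)
        .drop_duplicates()
        .reset_index(drop=True))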