Python pandas merge with condition and no duplicates

I have 2 dataframes derived from 2 Excel files. The first is a sort of template where there is a column with a condition, and the other has the same format but includes inputs for different time periods. I would like to create an output dataframe that is essentially a copy of the template, populated with the inputs wherever the condition is met.
When I use something like df1.merge(df2.assign(Condition='yes'), on=['Condition'], how='left') I get something roughly in line with what I'm after, but it contains duplicates. What could I do instead?
Thanks
Example below
Code
import pandas as pd

df1 = {'reference': [1, 2], 'condition': ['yes', 'no'], '31/12/2021': [0, 0], '31/01/2022': [0, 0]}
df1 = pd.DataFrame.from_dict(df1)
df2 = {'reference': [1, 2], 'condition': ['', ''], '31/12/2021': [101, 231], '31/01/2022': [3423, 3242]}
df2 = pd.DataFrame.from_dict(df2)
df1.merge(df2.assign(condition='yes'), on=['condition'], how='left')
Visual example

You could use df.update for this:
# only `update` from column index `2` onwards: ['31/12/2021', '31/01/2022']
df2.update(df1.loc[df1.condition=='no', list(df1.columns)[2:]])
print(df2)
   reference condition  31/12/2021  31/01/2022
0          1                 101.0      3423.0
1          2                   0.0         0.0
Alternative solution using df.where:
df2.iloc[:, 2:] = df2.iloc[:, 2:].where(df1.condition == 'yes', df1.iloc[:, 2:])
print(df2)
   reference condition  31/12/2021  31/01/2022
0          1                   101        3423
1          2                     0           0
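If you would rather keep df2 untouched and fill a copy of the template (df1) instead, here is a minimal sketch of the same conditional-fill idea using numpy.where; the variable name out is just illustrative:
import numpy as np

out = df1.copy()                                     # copy of the template
value_cols = out.columns[2:]                         # ['31/12/2021', '31/01/2022']
mask = out['condition'].eq('yes').to_numpy()[:, None]
# take the inputs from df2 where the condition is met, otherwise keep the template values
out[value_cols] = np.where(mask, df2[value_cols], out[value_cols])
print(out)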

Related

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C.
What is the most efficient way to achieve this in Python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create the DataFrames (Map1 and Map2 are the paths to the two CSV files):
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column, mapping through a Series built from df2 with its first column set as the index, and finally replace the missing values with 0 for keys that were not matched:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
   0    1
0  1   25
1  2    0
2  3  300
EDIT: to map multiple columns, use a left join, remove the all-missing columns with DataFrame.dropna, drop the columns b and c used for the join, and finally replace the missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
   a    d   e
0  1   25  30
1  2    0   0
2  3  300   0
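Since the question asks for a generic approach that can be reused across files and columns, the single-column version above could be wrapped in a small helper. This is only a sketch; the function name and defaults are illustrative assumptions, and the positional column numbers reflect the headerless CSV layout used above:
import pandas as pd

def compose_maps(path1, path2, key_col=1, value_col=1, fill=0):
    # read both headerless map files
    m1 = pd.read_csv(path1, header=None)
    m2 = pd.read_csv(path2, header=None)
    # map m1's key column through m2 (first column is the lookup key),
    # filling unmatched keys with `fill`; assumes integer values
    m1[key_col] = m1[key_col].map(m2.set_index(0)[value_col]).fillna(fill).astype(int)
    return m1

# usage: map3 = compose_maps("Map1.csv", "Map2.csv")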

Create log with change in values in pandas dataframe

I have been working with a dataframe in the following format (the actual table has many more rows (ids) and columns (value_3, value_4, etc.)):
For each id, the status column has the value 'new' if this is the first entry for that id, and the value 'modified' if any of the value_1, value_2 columns have changed compared to their previous value. I would like to create a log of any changes made in the table; in particular, I would like the resulting format for the data above to be something like this:
Ideally, I would like to avoid using loops, so could you please suggest a more efficient, pythonic way to achieve the format above?
I have seen the answers posted for the question here: Determining when a column value changes in pandas dataframe
which partly do the job I want (using shift or diff) to identify cases where there was a change, and I was wondering if this is the best way to build on for my case, or if there is a more efficient way to do it and speed up the process. Ideally, I would like something that can work for both numeric and non-numeric values in the value_1, value_2, etc. columns.
Code for creating the sample data of the first picture:
import pandas as pd

data = [[1, 2, 5, 'new'], [1, 1, 5, 'modified'], [1, 0, 5, 'modified'],
        [2, 5, 2, 'new'], [2, 5, 3, 'modified'], [2, 5, 4, 'modified']]
df = pd.DataFrame(data, columns=['id', 'value_1', 'value_2', 'status'])
df
Many thanks in advance for any suggestion/help!
We need melt first, then groupby after drop_duplicates:
s = df.melt(['id','status']).drop_duplicates(['id','variable','value'])
s['new'] = s.groupby(['id','variable'])['value'].shift()
s #s.sort_values('id')
    id    status variable  value  new
0    1       new  value_1      2  NaN
1    1  modified  value_1      1  2.0
2    1  modified  value_1      0  1.0
3    2       new  value_1      5  NaN
6    1       new  value_2      5  NaN
9    2       new  value_2      2  NaN
10   2  modified  value_2      3  2.0
11   2  modified  value_2      4  3.0
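To turn that reshaped frame into an explicit change log (one row per modification, with the old and new value side by side), a possible follow-up step reusing the column names from the answer above:
# rows where a previous value exists are the modifications
log = s[s['new'].notna()].rename(columns={'new': 'old_value', 'value': 'new_value'})
print(log[['id', 'variable', 'old_value', 'new_value']])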

Pandas dataframes with multi-level columns: rename a specific level of a column so that it's the same as another level

Sorry for the seemingly confusing title. I was reading Excel data using Pandas. However, the original Excel data has multiple header rows and some of the cells are merged. It sort of looks like this:
It shows in my Jupyter Notebook like this
My plan is to just use the 2nd level as my column names and drop level 0. But the original data has about 15 columns that show up as "Unnamed...", so I wonder if I can rename those before dropping the level-0 column names.
The desirable output looks like:
I may do this repeatedly so I didn't save it as CSV first and then read it in Pandas. Now I have spent longer than I care to admit on fixing the column names. I wonder if there is a way to do this with a function instead of renaming every individual column of interest.
Thanks.
I think the simplest here is to use a list comprehension - take the value of the second MultiIndex level only if it contains no 'Unnamed' text, otherwise take the first:
df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
    Purchase/sell_time  Quantity  Price Side
0  2020-04-09 15:22:00        20     43    B
1  2020-04-09 16:22:00        30     56    S
But if there are more levels in the real data, it is possible that some columns end up duplicated, so you cannot select them individually (selecting by a duplicated column name, e.g. df['dup_column_name'], returns all matching columns, not only one).
You can test it:
print (df.columns[df.columns.duplicated(keep=False)])
Then I suggest joining all the named levels (skipping the Unnamed ones) to prevent this:
df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
    Purchase/sell_time  Purchase/sell_time_Quantity  Purchase/sell_time_Price  \
0  2020-04-09 15:22:00                           20                        43
1  2020-04-09 16:22:00                           30                        56

  Side
0    B
1    S
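Since the question mentions doing this repeatedly, the same list comprehension can be wrapped in a small helper function (the name flatten_unnamed is just illustrative, and it assumes a two-level column MultiIndex):
def flatten_unnamed(df):
    # keep the second header level, falling back to the first where it is 'Unnamed'
    df = df.copy()
    df.columns = [first if 'Unnamed' in second else second
                  for first, second in df.columns]
    return df

# usage, assuming the Excel file has two header rows:
# df = flatten_unnamed(pd.read_excel("data.xlsx", header=[0, 1]))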
Your columns are a MultiIndex, and indexes are immutable, meaning you can't change only part of them. This is why I suggest retrieving both levels of the MultiIndex, then creating an array with your desired column names and replacing the DataFrame columns with it, as follows:
import numpy as np
import pandas as pd

# First I reproduce your dataframe
df1 = pd.DataFrame({("Purchase/sell_time", "Unnamed:"): pd.date_range("2020-04-09 15:22:00",
                                                                      freq="H", periods=2),
                    ("Purchase/sell_time", "Quantity"): [20, 30],
                    ("Purchase/sell_time", "Price"): [43, 56],
                    ("Side", "Unnamed:"): ["B", "S"]})
df1 = df1.sort_index()
It looks like this:
    Purchase/sell_time                      Side
              Unnamed: Quantity Price  Unnamed:
0  2020-04-09 15:22:00       20    43         B
1  2020-04-09 16:22:00       30    56         S
The column is a multiindex as you can see:
MultiIndex([('Purchase/sell_time', 'Unnamed:'),
            ('Purchase/sell_time', 'Quantity'),
            ('Purchase/sell_time',    'Price'),
            (              'Side', 'Unnamed:')],
           )
# I retrieve the first and second level of the multiindex then create an array conditionally
# on the second level not starting with "Unnamed"
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
                        first_header, second_header)
df1.columns = merge_header
Here is the result:
    Purchase/sell_time  Quantity  Price Side
0  2020-04-09 15:22:00        20     43    B
1  2020-04-09 16:22:00        30     56    S
Hope it helps

How to use Python Pandas to sort data frame to match files and invoice amounts

I have a data frame with 4 columns. I need columns 1 and 2 (new_df_1 and bill_df_1) to stay unchanged. I want to sort columns 3 and 4 (new_File_Number_Data, new_invoice_total) so that they match columns 1 and 2, and if there is no match, fill them with missing.
             new_df_1  bill_df_1 new_File_Number_Data new_invoice_total
0  1-08912-000218-033       25.0   1-08915-000041-054            134.50
1  1-08915-000041-054      163.0   001-0464-01589-061            148.50
2  001-0464-01589-061      166.7   004-3001-00080-532             54.00
3  004-3001-00080-532       74.0              missing           missing
easier to look at Python Data Frame pic
You can't sort only some columns of a dataframe and not others. It sounds like you need to separate the columns into two different dataframes and then merge them so that they are matched as you want. You can then fill the missing values with the string 'missing'. For example:
df1 = df[['new_df_1', 'bill_df_1']]
df2 = df[['new_File_Number_Data', 'new_invoice_total']]
new_df = pd.merge(df1, df2, how='left', left_on='new_df_1', right_on='new_File_Number_Data').fillna('missing')

Python Pandas - Appending data from multiple data frames onto same row by matching primary identifier, leave blank if no results from that data frame

Very new to python and using pandas, I only use it every once in a while when I'm trying to learn and automate otherwise a tedious Excel task. I've come upon a problem where I haven't exactly been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into data frames. However, whenever I try to append them together, they are simply added on as new rows in the final output Excel file. Instead, I'm trying to append similar data values onto the same row, not the same column, so that I can see whether or not each unique value shows up in these data sets. A shortened example is as follows:
[df1]
0  Col1   Col2
1  XYZ   41235
2  OAIS  15123
3  ABC   48938

[df2]
0  Col1   Col2
1  KFJ   21493
2  XYZ   43782
3  SHIZ  31299
4  ABC   33347

[Expected Output]
0  Col1  [df1]  [df2]
1  XYZ   41235  43782
2  OAIS  15123
3  ABC   48938  33347
4  KFJ          21493
5  SHIZ         31299
I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns that I would want to append after the 23 columns from the [XYZ] row in [df1].
How should I go about that? There are approximately 200 rows in each Excel sheet, and I would essentially only need to loop through until a matching unique identifier was found in [df2] with [df1], then [df3] with [df1], and so on until [df6], and append those columns onto a new dataframe which would eventually be output as a new Excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
That is currently how I am reading the Excel files into data frames. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each data frame initialization.
You need merge with the parameter how='outer':
new_df = df1.merge(df2, on = 'Col1',how = 'outer', suffixes=('_df1', '_df2'))
You get
   Col1  Col2_df1  Col2_df2
0   XYZ   41235.0   43782.0
1  OAIS   15123.0       NaN
2   ABC   48938.0   33347.0
3   KFJ       NaN   21493.0
4  SHIZ       NaN   31299.0
For iterative merging, consider storing the data frames in a list and then running the chained merge with reduce(). Below, a list comprehension over the Excel files creates the list of dataframes, where enumerate() is used to rename Col2 successively as df1, df2, etc.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df = reduce(lambda x, y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
#    Col1      df1      df2
# 0   XYZ  41235.0  43782.0
# 1  OAIS  15123.0      NaN
# 2   ABC  48938.0  33347.0
# 3   KFJ      NaN  21493.0
# 4  SHIZ      NaN  31299.0
Alternatively, use pd.concat and outer join the dataframes horizontally where you need to set Col1 as index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False)\
        .reset_index().rename(columns={'index': 'Col1'})
#    Col1      df1      df2
# 0   ABC  48938.0  33347.0
# 1   KFJ      NaN  21493.0
# 2  OAIS  15123.0      NaN
# 3  SHIZ      NaN  31299.0
# 4   XYZ  41235.0  43782.0
You can use the merge function.
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding them to the on list.
You can read more about the merge function here.
If you need only certain columns, you can select them before the merge:
df1.merge(df2[['Col1', 'Col2']], on=['Col1'])
EDIT:
In case you are looping through several df's, you can loop through all of them except the first and merge them one by one:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])
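Note that when several frames all carry a Col2 column, repeated merges can leave confusing _x/_y suffixes, and merge's default inner join drops identifiers that are missing from any one file. A possible variant of the loop above (the df1, df2, ... column names mirror the expected output and are otherwise an assumption) renames each Col2 and keeps unmatched rows:
df1 = df1.rename(columns={'Col2': 'df1'})
df_list = [df2, df3, df4, df5, df6]
for i, df in enumerate(df_list, 2):
    # give each file its own column name and keep identifiers that appear in only one file
    df1 = df1.merge(df[['Col1', 'Col2']].rename(columns={'Col2': 'df' + str(i)}),
                    on=['Col1'], how='outer')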
