How to concat based on a condition in Python

I would like to concat 2 Dataframes based on the same date to identify when a product is bought in a linear fashion.
Here's my code:
import pandas as pd
s = pd.Series(['01-2020', '02-2020', '03-2020', '04-2020', '05-2020', '06-2020', '07-2020', '08-2020', '09-2020', '10-2020', '11-2020', '12-2020'], name='Date')
data = [['01-2020', 5], ['02-2020', 3], ['03-2020', 1], ['05-2020', 4], ['06-2020', 8], ['08-2020', 3], ['09-2020', 11], ['10-2020', 5], ['12-2020', 3]]
df = pd.DataFrame(data, columns = ['Date Bought', 'Amount_Bought'])
result = pd.concat([df, s], axis=1, join="outer")
When I try to concat these dataframes, the rows do not line up by date.
I would like the output to look like this:
Date Date_Bought Amount_Bought
01-2020 01-2020 5
02-2020 02-2020 3
03-2020 03-2020 1
04-2020 NaN 0
05-2020 05-2020 4
06-2020 06-2020 8
07-2020 NaN 0
08-2020 08-2020 3
09-2020 09-2020 11
10-2020 10-2020 5
11-2020 NaN 0
12-2020 12-2020 3

Use merge instead of concat. With axis=1, concat aligns rows on the index rather than on the date values, so the dates do not line up, and this is not what you would like to have. A left merge on the date columns keeps every month from the series. Then replace the resulting NaN with 0 using fillna on the column 'Amount_Bought'.
results = pd.merge(left = s, right = df, left_on = 'Date', right_on = 'Date Bought', how = 'left')
results[['Amount_Bought']] = results[['Amount_Bought']].fillna(value=0)
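For reference, here is a minimal self-contained sketch of the same left merge. It assumes the purchase column is named 'Date_Bought' (as in the desired output) and builds the month list with a comprehension; the names months and bought are only illustrative.

```python
import pandas as pd

# Every month of 2020, formatted like the dates in the question
months = pd.DataFrame({'Date': [f'{m:02d}-2020' for m in range(1, 13)]})

# Purchases: some months are missing on purpose
bought = pd.DataFrame(
    [['01-2020', 5], ['02-2020', 3], ['03-2020', 1], ['05-2020', 4],
     ['06-2020', 8], ['08-2020', 3], ['09-2020', 11], ['10-2020', 5],
     ['12-2020', 3]],
    columns=['Date_Bought', 'Amount_Bought'])

# how='left' keeps all 12 months; months without a purchase get NaN
result = months.merge(bought, left_on='Date', right_on='Date_Bought', how='left')

# Replace the missing amounts with 0 and restore the integer dtype
result['Amount_Bought'] = result['Amount_Bought'].fillna(0).astype(int)
print(result)
```

The 'Date_Bought' column stays NaN for the months without a purchase, which matches the layout asked for in the question.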

Related

How to merge dataframes together with matching columns side by side?

I have two dataframes with matching keys. I would like to merge them together based on their keys and have the corresponding columns line up side by side. I am not sure how to achieve this as the pd.merge displays all columns for the first dataframe and then all columns for the second data frame:
df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})
print(pd.merge(df1, df2, on=['key']))
key col1_x col2_x col1_y col2_y
0 a 1 3 5 7
1 b 2 4 6 8
I am looking for a way to do the same merge and have the columns displayed side by side as such:
key col1_x col1_y col2_x col2_y
0 a 1 5 3 7
1 b 2 6 4 8
Any help achieving this would be greatly appreciated!
If you're OK with a bit of a shuffle, you can sort the columns:
df = pd.merge(df1, df2, on=['key'])
df = df.reindex(columns = sorted(df.columns))
Or you could do this to keep the key in front:
df = pd.merge(df1, df2, on=['key'])
cols = list(df.columns)
cols.remove('key')
df = df.reindex(columns = ['key']+sorted(cols))
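If alphabetical sorting would scramble your real column names, a possible alternative is to build the interleaved order explicitly from df1's own column order. This is only a sketch: it assumes both frames share the same value columns and that the default '_x'/'_y' suffixes are used.

```python
import pandas as pd

df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})

merged = pd.merge(df1, df2, on='key')

# Value columns in the order they appear in df1 (everything except the key)
value_cols = [c for c in df1.columns if c != 'key']

# Pair each column's _x and _y versions next to each other
ordered = ['key'] + [f'{c}{suffix}' for c in value_cols for suffix in ('_x', '_y')]
merged = merged[ordered]
print(merged)
```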

Change column values based on other dataframe columns

I have two dataframes that look like this
df1 ==
IDLocation x-coord y-coord
1 -1.546 7.845
2 3.256 1.965
.
.
35 5.723 -2.724
df2 ==
PIDLocation DIDLocation
14 5
3 2
7 26
I want to replace the columns PIDLocation and DIDLocation with Px-coord, Py-coord, Dx-coord and Dy-coord. Both columns hold IDLocation values, and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd
df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
'x-coord': [-1.546, 3.256, 5.723],
'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
'DIDLocation': [2, 1, 35]})
df1.set_index('IDLocation', inplace=True)
df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)
Px-coord Py-coord Dx-coord Dy-coord
0 5.723 -2.724 3.256 1.965
1 -1.546 7.845 -1.546 7.845
2 3.256 1.965 5.723 -2.724
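The same lookup can probably be written without the list comprehensions by using Series.map, which looks each ID up in the index of the coordinate column. A sketch under the same assumptions as above (the df2 values are changed so that every ID exists in df1; the names coords and out are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})

# Index the coordinates by ID so each column can serve as a lookup table
coords = df1.set_index('IDLocation')

# map() replaces each ID with the value found under that ID in the lookup Series
out = pd.DataFrame({
    'Px-coord': df2['PIDLocation'].map(coords['x-coord']),
    'Py-coord': df2['PIDLocation'].map(coords['y-coord']),
    'Dx-coord': df2['DIDLocation'].map(coords['x-coord']),
    'Dy-coord': df2['DIDLocation'].map(coords['y-coord']),
})
print(out)
```

Unlike .loc, map returns NaN for an ID that is missing from df1 instead of raising a KeyError, which may or may not be what you want.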

Pandas merging on multi columns while ignoring NaN

A similar question, "Pandas merge on multiple columns ignoring NaN", was asked before but never answered, so I'll ask again in case someone can help.
I need to merge values from df2 into df1, but the key used in the merge differs between rows in df2, as the rows in df2 have NaNs in different columns, and in that case I want to ignore those columns, and use for each row only the columns that have values.
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df1
level1 level2 level3
0 0 1 3
1 0 2 4
2 1 2 5
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2
level1 level2 level3 value
0 0.0 NaN NaN 10
1 0.0 1.0 NaN 12
2 NaN 2.0 5.0 13
When I do df1.merge(df2, how='left'), I get df1 with NaN in the value column, since there is no match on all the level columns as pandas is trying to match the NaN values as well.
What I do want is to get a match for any rows in df2 without trying to match the NaNs:
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
2 0 2 4 10
3 1 2 5 13
Explanation:
Row 0 in df1 has a match on the non-NaN columns of rows 0 and 1 in df2, so it gets values 10 and 12 from there. Row 1 in df1 has a match on the non-NaN columns of row 0 in df2, so it gets value 10 from there. Row 2 in df1 has a match on the non-NaN columns of row 2 in df2, so it gets value 13 from there.
In the real data I actually have 6 level columns and the non-NaN columns for each row in df2 can be any combination or a single column from there.
What I do now is iterate over the rows of df2 with iterrows, create for each one a mini-dataframe of only the non-NaN columns, and merge df1 with it. But as we know, that's not really efficient, and I wonder if there is something better that can be done.
I think I figured out a vectorized solution.
Fundamentally, the idea is that you merge df1 with df2 three separate times for the number of levels and then concat the dataframes together into one.
From there, you count how many columns are null and subtract from the number of levels. This tells you how many duplicates (or matches) are required in order to prevent the data from being dropped later.
Then, you calculate how many matches or duplicates there actually are. If actual is the same as required, then that means the row is a match, and it gets kept in the dataframe.
It's not pretty, but to improve my answer you could create a merging function to cut some of the code. Most importantly, it should be highly performant compared to looping through every row. As a final note, for the duplicates_required helper column, you will need to change the 3 to a 6 since you have 6 columns in your actual dataset and you will obviously need to repeat some of my merging code:
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2 = df2.assign(duplicates_required = 3 - df2.isnull().sum(axis=1))
df = pd.concat([
    df1.merge(df2[['level1','value', 'duplicates_required']], on='level1'),
    df1.merge(df2[['level2','value', 'duplicates_required']], on='level2'),
    df1.merge(df2[['level3','value', 'duplicates_required']], on='level3')
])
cols = ['level1', 'level2', 'level3', 'value']
df['actual_duplicates'] = df.groupby(cols)['value'].transform('size')
df = (df[df['duplicates_required'].eq(df['actual_duplicates'])]
      .drop_duplicates(subset=cols)
      .drop(['duplicates_required', 'actual_duplicates'], axis=1)
      .reset_index(drop=True))
df
Out[1]:
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
2 0 2 4 10
3 1 2 5 13
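The merging function mentioned above could look roughly like the sketch below, so that the per-level merges do not have to be repeated by hand for 6 levels. It is a variation on the idea rather than the exact code above: instead of counting duplicates of the value column, it tags each row with a temporary id (the hypothetical helper columns '_l' and '_r') and counts how many levels each pair of rows matched on. The function name merge_ignoring_nan is also just an illustration.

```python
import pandas as pd

def merge_ignoring_nan(left, right, levels):
    """Match rows of `right` to rows of `left` on the non-NaN level columns only."""
    left = left.assign(_l=range(len(left)))
    right = right.assign(_r=range(len(right)))
    # How many levels each right-hand row actually specifies
    required = right.set_index('_r')[levels].notna().sum(axis=1)
    # One inner merge per level, keeping only the ids of each matching pair
    hits = pd.concat([
        left[['_l', lvl]].merge(right[['_r', lvl]].dropna(), on=lvl)[['_l', '_r']]
        for lvl in levels
    ])
    # A pair is a real match when it was hit once for every specified level
    counts = hits.groupby(['_l', '_r']).size().rename('n').reset_index()
    keep = counts[counts['n'] == counts['_r'].map(required)]
    out = (keep.merge(left, on='_l')
               .merge(right.drop(columns=levels), on='_r')
               .sort_values(['_l', '_r']))
    return out.drop(columns=['_l', '_r', 'n']).reset_index(drop=True)

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=['level1', 'level2', 'level3', 'value'])

result = merge_ignoring_nan(df1, df2, ['level1', 'level2', 'level3'])
print(result)
```

Note that a df2 row with NaN in every level column never produces a hit in this sketch, so it is dropped rather than matched to everything.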
I think this works better than my previous answer using regex. Similar process, but a bit simpler to understand.
Do a full merge of the two dataframes
Compare across levels and count number of mismatches
Filter to rows where mismatch count == 0
import pandas as pd
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
levels_to_match = ['level1','level2','level3']
levels_to_match_df2 = [level + '_df2' for level in levels_to_match]
for df in [df1,df2]:
    df['temp'] = 1
df1 = df1.merge(df2, on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])
df1['mismatch'] = df1.apply(lambda x:
    sum([(1 - (y == z or pd.isna(z))) for y, z in zip(list(x[levels_to_match]), list(x[levels_to_match_df2]))]),
    axis=1)
df1 = df1.loc[df1['mismatch'] == 0, :].drop(columns=['mismatch'] + levels_to_match_df2)
print(df1)
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
3 0 2 4 10
8 1 2 5 13
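The row-wise apply in this answer can likely be replaced by a vectorized comparison on NumPy arrays. The sketch below follows the same three steps (full merge, compare across levels, filter) but assumes pandas 1.2 or newer for how='cross' and assumes the level columns are numeric; the names pairs, left, right and ok are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=['level1', 'level2', 'level3', 'value'])
levels = ['level1', 'level2', 'level3']

# Step 1: every row of df1 paired with every row of df2
pairs = df1.merge(df2, how='cross', suffixes=(None, '_df2'))

# Step 2: compare level by level; a NaN on the df2 side counts as a wildcard
left = pairs[levels].to_numpy()
right = pairs[[c + '_df2' for c in levels]].to_numpy(dtype=float)
ok = ((left == right) | np.isnan(right)).all(axis=1)

# Step 3: keep only the pairs with no mismatch
result = pairs.loc[ok, levels + ['value']].reset_index(drop=True)
print(result)
```

The cross merge still materializes len(df1) * len(df2) rows, so this only helps while that product fits in memory.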
Old answer with regex
Probably not ideal, but maybe try converting your levels into strings and regex expressions, then do a full merge of all possible combinations, and finally filter using a regex search/match across two helper columns (one from df1 and the other from df2).
Assuming the data you're matching on are either int or NaN then this seems to work okay. If you have other data types in your real data then the string/regex transformations will need to be adjusted accordingly.
import pandas as pd
import re
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
levels_to_match = ['level1','level2','level3']
for df in [df1,df2]:
    df['helper'] = df[levels_to_match].apply(list, axis=1)
    df['helper'] = df['helper'].apply(lambda x: ','.join([str(int(item)) if pd.notna(item) else '.*' for item in x]))
    df['temp'] = 1
df1 = df1.merge(df2.drop(columns=levels_to_match), on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])
df1['match'] = df1.apply(lambda x: re.search(x['helper_df2'], x['helper']) is not None, axis=1)
df1 = df1.loc[df1['match'], :].drop(columns=['helper','helper_df2','match'])
print(df1)
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
3 0 2 4 10
8 1 2 5 13

Efficient way to merge multiple large DataFrames

Suppose I have 4 small DataFrames
df1, df2, df3 and df4
import pandas as pd
from functools import reduce
import numpy as np
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']
df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})
The four DataFrames are created above; the result I want is produced by the code below.
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
So I have achieved this for 4 DataFrames that don't have many rows and columns.
Basically, I want to extend the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 X 3:
So I came up with this solution by building from another StackOverflow answer that used a lambda reduce:
from functools import reduce
import pandas as pd
import numpy as np
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')
This is causing a MemoryError.
I do not know what to do to stop the kernel from dying.. I've been stuck on this for two days.. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError or something that gives you the same result, would be really appreciated.
Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64 and I'd prefer them to stay that way because of the integer and float that it represents.
EDIT:
Instead of iteratively trying to run the merge operations or using the reduce lambda functions, I have done it in groups of 2! Also, I've changed the datatype of some columns, some did not need to be float64. So I brought it down to float16. It gets very far but still ends up throwing a MemoryError.
intermediatedfList = dfList
tempdfList = []
#Until I merge all the 48 frames two at a time, till it becomes size 2
while(len(intermediatedfList) != 2):
    #If there are even number of DataFrames
    if len(intermediatedfList)%2 == 0:
        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):
            #Merge DataFrame in index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            #Append it to this list
            tempdfList.append(df1)
        #After merging the DataFrames in intermediatedfList two at a time using an auxiliary list tempdfList,
        #Set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
        intermediatedfList = tempdfList
    else:
        #If there are odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]
        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):
            #Merge DataFrame in index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)
        #After merging the DataFrames in intermediatedfList two at a time using an auxiliary list tempdfList,
        #Set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
        intermediatedfList = tempdfList
Is there any way I can optimize my code to avoid MemoryError, I've even used AWS 192GB RAM (I now owe them 7$ which I could've given one of yall), that gets farther than what I've gotten, and it still throws MemoryError after reducing a list of 28 DataFrames to 4..
You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.
df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)
df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat (second step) by an iterative join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge.
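To make the index-aligned concatenation concrete, here is a small runnable sketch on the four example frames from the question. It assumes each (name, id) pair is unique within a frame, which pd.concat with axis=1 needs in order to align the rows; the names frames and wide are illustrative.

```python
import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]],
                   columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]],
                   columns=keys + ['pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]],
                   columns=keys + ['pricepart3'])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]],
                   columns=keys + ['pricepart4'])

# Move the join keys into the index (without mutating the originals)
frames = [df.set_index(keys) for df in (df1, df2, df3, df4)]

# axis=1 aligns on the (name, id) index; the default outer join keeps every key
wide = pd.concat(frames, axis=1).reset_index()
print(wide)
```

Keys that are missing from a frame show up as NaN in that frame's price column, the same as with the chained outer merges.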
This seems like part of what dask dataframes were designed to do (out-of-memory operations on dataframes). See "Best way to join two large datasets in Pandas" for example code. I'm not copying and pasting it here because I don't want to seem to take credit from the answerer in the linked entry.
You can try a simple for loop. The only memory optimization I have applied is downcasting to most optimal int type via pd.to_numeric.
I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.
import pandas as pd
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df = dfs[1].copy()
for i in range(2, max(dfs)+1):
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
    df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
print(df)
0 1 2 3 4 5
0 a 1 10 15 -1 -1
1 a 2 20 20 -1 -1
2 b 1 4 -1 -1 -1
3 c 1 2 2 -1 -1
4 e 2 10 -1 20 20
5 d 1 -1 -1 10 10
6 f 1 -1 -1 1 15
You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into object type series. Here we use -1, but you may wish to use NaN with float dtype instead.
So, you have 48 dfs with 3 columns each: name, id, and a different column for every df.
You don't have to use merge.
Instead, if you concat all the dfs
df = pd.concat([df1,df2,df3,df4])
you will receive:
Out[3]:
id name pricepart1 pricepart2 pricepart3 pricepart4
0 1 a 10.0 NaN NaN NaN
1 2 a 20.0 NaN NaN NaN
2 1 b 4.0 NaN NaN NaN
3 1 c 2.0 NaN NaN NaN
4 2 e 10.0 NaN NaN NaN
0 1 a NaN 15.0 NaN NaN
1 2 a NaN 20.0 NaN NaN
2 1 c NaN 2.0 NaN NaN
0 1 d NaN NaN 10.0 NaN
1 2 e NaN NaN 20.0 NaN
2 1 f NaN NaN 1.0 NaN
0 1 d NaN NaN NaN 10.0
1 2 e NaN NaN NaN 20.0
2 1 f NaN NaN NaN 15.0
Now you can group by name and id and take the sum:
# min_count=1 keeps all-NaN groups as NaN (the default would turn them into 0)
df.groupby(['name','id']).sum(min_count=1).fillna('missing').reset_index()
If you try it with the 48 dfs, you will see it solves the MemoryError:
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
df = pd.concat(dfList)
# min_count=1 keeps all-NaN groups as NaN (the default would turn them into 0)
df.groupby(['name','id']).sum(min_count=1).fillna('missing').reset_index()
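Here is a small end-to-end sketch of the concat-then-groupby idea on the four example frames, so the result can be checked against the chained merge. It keeps missing prices as NaN rather than the string 'missing' so the columns stay numeric, and it assumes each (name, id) pair appears at most once per frame (otherwise the sum would add up several prices). The names stacked and wide are illustrative.

```python
import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]],
                   columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]],
                   columns=keys + ['pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]],
                   columns=keys + ['pricepart3'])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]],
                   columns=keys + ['pricepart4'])

# Stack the frames vertically: each row has exactly one non-NaN price column
stacked = pd.concat([df1, df2, df3, df4])

# Collapse the rows that share a key; min_count=1 leaves all-NaN groups as NaN
wide = stacked.groupby(keys).sum(min_count=1).reset_index()
print(wide)
```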

Pandas - Sorting By Column

I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_1, df_2]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
Use .loc to specify the order you want, passing the original index:
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you sort the solution using
solution.sort_values(by='x', inplace=False)
the sorted frame is returned and solution itself is left unchanged. Assign the result back to solution or pass inplace=True. That would take care of it.
Based on these assumptions on df:
Columns x and y are not necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8
