Left Outer join with condition - python

I would like to merge two data frames (how='left'), not only on an index column but also on a condition.
E.g., assume two data frames:
       C1  C2
A =    I   3
       K   2
       L   5

       C1  C2  C3
B =    I   5   T
       I   0   U
       K   1   X
       L   7   Z
Now I would like to left outer join table A with B on C1, under the condition that A.C2 > B.C2. That is, the final result should look like:
        A.C1  A.C2  B.C2  B.C3
A<-B =  I     3     0     U
        K     2     1     X
        L     5     Null  Null
P.S.: If you want to test it yourself:
import pandas as pd
df_A = pd.DataFrame([], columns={'C 1', 'C2'})
df_A['C 1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']

The quick and dirty solution would be to simply join on the C1 column and then put NULL or NaN into C3 for all the rows where the condition A.C2 > B.C2 does not hold.
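For reference, a minimal pandas-native sketch of that quick-and-dirty approach, assuming the df_A / df_B frames from the P.S. above but with a plain 'C1' column name; the suffixes and the _keep helper column are my own choices, not from the original post:
import numpy as np
import pandas as pd
df_A = pd.DataFrame({'C1': ['I', 'K', 'L'], 'C2': [3, 2, 5]})
df_B = pd.DataFrame({'C1': ['I', 'I', 'K', 'L'],
                     'C2': [5, 0, 2, 7],
                     'C3': ['T', 'U', 'X', 'Z']})
# Plain left join on C1, then blank out B's columns wherever A.C2 > B.C2 does not hold
merged = df_A.merge(df_B, on='C1', how='left', suffixes=('_A', '_B'))
cond = merged['C2_A'] > merged['C2_B']
merged.loc[~cond, ['C2_B', 'C3']] = np.nan
# Keep one row per A row, preferring a row where the condition held
result = (merged.assign(_keep=cond)
                .sort_values('_keep', ascending=False, kind='stable')
                .drop_duplicates('C1')
                .drop(columns='_keep')
                .sort_values('C1'))
print(result)
Note that with the P.S. data K's only candidate has B.C2 == 2, so the strict inequality fails and K ends up with NaNs, unlike the hand-written table above.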

Method: run a SQL query directly against the pandas DataFrames using the pandasql library (see the references below).
import pandas as pd
df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K']
df_A['C2'] = [3, 2]
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K']
df_B['C2'] = [5, 0, 2]
df_B['C3'] = ['T', 'U', 'X']
It appears to me that the join condition you specified (A.C1 = B.C1) does not produce the expected result. I needed to GROUP BY A.C1 in order to drop duplicate rows having the same A.C1 value after the join.
import pandasql as ps
q = """
SELECT A.C1 as 'A.C1',
A.C2 as 'A.C2',
B.C2 as 'B.C2',
B.C3 as 'B.C3'
FROM df_A AS A
LEFT OUTER JOIN df_B AS B
--ON A.C1 = B.C1 AND A.C2 = B.C2
WHERE A.C2 > B.C2
GROUP BY A.C1
"""
print(ps.sqldf(q, locals()))
Output:
  A.C1  A.C2  B.C2 B.C3
0    I     3     2    X
1    K     2     0    U
Other References
https://www.zentut.com/sql-tutorial/sql-outer-join/
Executing an SQL query over a pandas dataset
How to do a conditional join in python Pandas?
https://github.com/pandas-dev/pandas/issues/7480
https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e

I found one non-pandas-native solution:
import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
cols = df_A.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df_A.columns = cols
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']
# df_merge = pd.merge(left=df_A, right=df_B, how='left', on='C1')
df_sql = pysqldf("""
select *
from df_A t_1
left join df_B t_2 on t_1.C1 = t_2.C1 and t_1.C2 >= t_2.C2
;
""")
However, for big tables, pandasql turns out to be less performant than native pandas operations.
Output:
   C2 C1    C3   C2    C1
0   3  I     U  0.0     I
1   2  K     X  2.0     K
2   5  L  None  NaN  None

Related

How do I multiply values of a dataframe column by values from a column of another dataframe based on a common category?

I have two dataframes:
data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
df1
data2 = {'Category': ['X', 'Y', 'Z'], 'Value retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df2
How do I multiply 'Value retained' by 'Price' following their respective Category and add the result as a new column in df1?
I've searched a lot for a solution and tried several different things, among them:
df3 = df1
for cat, VR in df2['Category', 'Value retained']:
    if cat in df1.columns:
        df3[cat] = df1['Price'] * VR
and
df3 = df1['Price'] * df2.set_index('Category')['Value retained']
df3
In my real dataframe I have 250k+ items and 32 categories with different values of 'value retained'.
I really appreciate any help for a newbie in Python coding.
Your second approach would work if both dataframes had Category as the index, but since you can't set_index on Category in df1 (it has duplicated entries), you need to do a left merge of the two DataFrames on the Category column and then multiply.
df3 = df1.merge(df2, on='Category', how='left')
df3['result'] = df3['Price'] * df3['Value retained']
print(df3)
  Item  Price Category County  Value retained  result
0    A      1        X      K             0.1     0.1
1    B      2        Y      L             0.2     0.4
2    C      3        X      L             0.1     0.3
3    N     10        Z      K             0.8     8.0
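If you prefer not to add extra columns via merge, a short map-based sketch (my own variant, assuming the df1/df2 defined above) does the same lookup:
rate = df2.set_index('Category')['Value retained']
df1['result'] = df1['Price'] * df1['Category'].map(rate)
This is essentially the corrected form of your second attempt: each row's Category is mapped to its rate, so the multiplication is aligned row by row instead of across two differently indexed Series.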
You can use this,
import pandas as pd
data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
data2 = {'Category': ['X', 'Y', 'Z'], 'Value_retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df = df1.merge(df2, how='left')
df['Values'] = df.Price * df.Value_retained
print(df)
The output is,
  Item  Price Category County  Value_retained  Values
0    A      1        X      K             0.1     0.1
1    B      2        Y      L             0.2     0.4
2    C      3        X      L             0.1     0.3
3    N     10        Z      K             0.8     8.0

Apply function for multiple levels of tables/data

I have a problem at work. I have these tables:
import pandas as pd
import numpy as np
level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])
name value
0 a 3
1 b x
2 c x
I want to sum the value column, but it contains “x”s, so I have to use the second table to calculate the “x”s:
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'], ['b1', 'b2', 'c1', 'c2', 'c3'], [5, 7, 2, 'x', 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 b b1 5
1 b b2 7
2 c c1 2
3 c c2 x
4 c c3 9
I should sum b1 and b2 to fill in the “x” for b in the level1 table (x = 12). But for c there is another “x”, so there is a third-level table:
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'], ['c1', 'c2', 'c3'], [2, 4, 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 c c1 2
1 c c2 4
2 c c3 9
Now we can get the sum of the value column in the level1 table.
My question is: can we use a function to calculate this easily? If there are more levels, how can we loop through them until there are no more “x”s?
It is OK to combine level2 and level3.
Here's a way using combine_first and replace:
from functools import reduce
l1 = level1.assign(sub=level1['name']+'1').replace('x', np.nan).set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan).set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan).set_index(['name', 'sub'])
reduce(lambda x, y: x.combine_first(y), [l3,l2,l1]).groupby(level=0).sum()
Output:
value
name
a 3.0
b 12.0
c 15.0
Complete example:
import pandas as pd
import numpy as np
level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'],
                               ['b1', 'b2', 'c1', 'c2', 'c3'],
                               [5, 7, 2, 'x', 9])),
                      columns=['name', 'sub', 'value'])
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'],
                               ['c1', 'c2', 'c3'],
                               [2, 4, 9])),
                      columns=['name', 'sub', 'value'])
from functools import reduce
l1 = level1.assign(sub=level1['name'] + '1')\
           .replace('x', np.nan)\
           .set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan)\
           .set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan)\
           .set_index(['name', 'sub'])
out = reduce(lambda x, y: x.combine_first(y),
             [l3, l2, l1]).groupby(level=0).sum()
print(out)
One option is a combination of merge (multiple merges, actually) and a groupby:
(level2
 .merge(level3, on=['name', 'sub'], how='left', suffixes=(None, '_y'))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_y, df.value))
 .groupby('name', as_index=False)
 .value
 .sum()
 .merge(level1, on='name', how='right', suffixes=('_x', None))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_x, df.value))
 .loc[:, ['name', 'value']]
)
name value
0 a 3
1 b 12.0
2 c 15.0
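To address the "more levels" part of the question: a hedged sketch (my own generalization of the combine_first idea above, assuming every deeper table has name/sub/value columns and the tables are passed deepest-first):
import numpy as np
import pandas as pd
def resolve_levels(levels):
    # levels: list of DataFrames ordered from the deepest table up to level1;
    # level1 must already carry a 'sub' column (e.g. via assign, as above).
    frames = [lvl.replace('x', np.nan).set_index(['name', 'sub']) for lvl in levels]
    out = frames[0]
    for frame in frames[1:]:
        out = out.combine_first(frame)  # known deep values fill the x's above them
    return out.groupby(level=0)['value'].sum()
print(resolve_levels([level3, level2, level1.assign(sub=level1['name'] + '1')]))
Adding a fourth or fifth level only means appending another table to the front of the list.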

Python dataframe unexpected display error using loc

I'm creating an additional column "Total_Count" to store the cumulative count based on the Site and Count_Record columns. My code is almost done, but the Total_Count column is shifted for a specific card, as shown below. Could someone help with a code modification? Thank you!
Expected Output:
Current Output:
My Code:
import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])
df_append_temp=[]
total = 0
preCard = ''
preSite = ''
preCount = 0
for pc in df1['card'].unique():
    df2 = df1[df1['card'] == pc].sort_values(['date'])
    total = 0
    for i in range(0, len(df2)):
        site = df2.iloc[i]['site']
        count = df2.iloc[i]['count_record']
        if site == preSite:
            total += (count - preCount)
        else:
            total += count
        preCount = count
        preSite = site
        df2.loc[i, 'Total_Count'] = total  # something wrong using loc here
    df_append_temp.append(df2)
df3 = pd.DataFrame(pd.concat(df_append_temp), columns=df2.columns)
df3
The root problem is that df2 keeps df1's original index, so df2.loc[i, 'Total_Count'] writes by index label rather than by position, which misaligns the new column for the second card. To fix the current implementation we can use groupby to create df2, which lets us apply a function to each grouped DataFrame to build the new column. This should offer performance similar to the current implementation but produce a correctly aligned Series:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    total = 0
    pre_count = 0
    pre_site = ''
    lst = []
    for c, s in zip(df2['count_record'], df2['site']):
        if s == pre_site:
            total += (c - pre_count)
        else:
            total += c
        pre_count = c
        pre_site = s
        lst.append(total)
    return pd.Series(lst, index=df2.index, name='Total_Count')

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
Alternatively, we can use groupby, then within each group use Series.shift to get the previous site and count_record, np.where to conditionally determine each row's value, and cumsum to compute the cumulative total of the resulting values:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.where(df2['site'] == df2['site'].shift(),
                 df2['count_record'] - df2['count_record'].shift(fill_value=0),
                 df2['count_record']).cumsum(),
        index=df2.index,
        name='Total_Count'
    )

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
Either approach produces df3:
site card date count_record Total_Count
0 A C1 12-Oct 5 5
1 A C1 13-Oct 10 10
2 A C1 14-Oct 18 18
3 A C1 15-Oct 21 21
4 A C1 16-Oct 29 29
5 B C2 12-Oct 11 11
6 A C2 13-Oct 2 13
7 A C2 14-Oct 7 18
8 A C2 15-Oct 13 24
9 B C2 16-Oct 4 28
Setup and imports:
import numpy as np # only needed if using np.where
import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])
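If groupby.apply ever becomes a bottleneck on larger frames, the same logic can be expressed without apply at all; a hedged, fully vectorized sketch (my own variant, assuming df1 as above):
ordered = df1.sort_values('date')
prev_site = ordered.groupby('card')['site'].shift()
prev_count = ordered.groupby('card')['count_record'].shift(fill_value=0)
# Per row: add only the increment when the site repeats, otherwise add the full count
step = np.where(ordered['site'].eq(prev_site),
                ordered['count_record'] - prev_count,
                ordered['count_record'])
df1['Total_Count'] = pd.Series(step, index=ordered.index).groupby(ordered['card']).cumsum()
Because the result keeps the original index, the assignment back to df1 aligns automatically regardless of the sort order.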

Stack different column values into one column in a pandas dataframe

I have the following dataframe -
df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 3, 3, 4],
    'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
    'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
             '7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
                      'g1'],
    'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
                   '6/5/2019', '7/9/2019', '', '16/11/2019'],
    'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
                 '6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
    pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
      .merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
You can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index = ['ID',0], columns = 'level_2', values = 'Date', aggfunc = ''.join, fill_value= '').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explanation:
After stack / reset_index df will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now, we can use ID and column 0 as index / level_2 as column / Date column as value.
Finally, we need to rename the columns to get the desired result.
My approach is to build the target df step by step. The first step is an extension of your code using melt() and merge(). The merges are done on the 'Current' and 'Prior' columns to get the start and end dates.
df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 3, 3, 4],
    'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
    'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
             '7/9/2019', '16/11/2019']
})
df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable').drop_duplicates().sort_values('ID'))
df2 = df2.merge(df[['Current', 'Date']], how='left', left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left', left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
An alternative way is to define a custom function to look up the date, and apply it with a lambda:
def get_date(x, col):
    try:
        return df['Date'][df[col] == x].values[0]
    except:
        return ''
df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable').drop_duplicates().sort_values('ID').reset_index(drop=True))
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output (matching the desired_df above):
    ID Prior_Current  Start_Date    End_Date
0    1             a                1/1/2019
1    1            a1    1/1/2019
2    2             b                5/1/2019
3    2             c    5/1/2019   10/2/2019
4    2            c1   10/2/2019
5    3             d               15/3/2019
6    3             e   15/3/2019    6/5/2019
7    3             f    6/5/2019    7/9/2019
8    3            f1    7/9/2019
9    4             g              16/11/2019
10   4            g1  16/11/2019

Count values of column A observed only with a single distinct value in columns B and/or C

I'd like to summarise a large dataframe in terms of the distinct values of one column, with respect to whether they are restricted to occurring with a single OR multiple distinct values of other column(s). My current approach is really convoluted, and I'm looking for a pandas pattern for solving these kinds of problems.
Given the following example dataframe:
import pandas as pd
df = pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z'],
                   's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2'],
                   't': [1, 1, 1, 2, 1, 1]})
How may I obtain (and count) the distinct values of column c:
1) that are observed only in conjunction with a single value of columns s and t.
Desired output: set('x') and/or its length: 1
2) that are observed only in conjunction with a single value of column s but >1 values of column t.
Desired output: set('y') and/or its length: 1
3) that are observed in conjunction with >1 values of column s and any number of distinct column t values.
Desired output: set('z') and/or its length: 1
Edit:
One more question, using the following revised df:
df = pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z', 'z1', 'z1', 'z2'],
                   's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a3', 'a3', 'a4'],
                   't': [1, 1, 1, 2, 1, 1, 3, 3, 1],
                   'cat': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'a']})
4) observed twice or more, and only in conjunction with a single value of columns s and t, and also restricted to cat 'a'
Desired output: set('x') and/or its length: 1
Thanks!
The idea is to use DataFrame.duplicated on multiple columns with keep=False to flag all duplicates, then filter with boolean indexing:
m1 = df.duplicated(['c','s','t'], keep=False)
m2 = df.duplicated(['c','s'], keep=False) & ~m1
m3 = df.duplicated(['c','t'], keep=False) & ~m1
a = df.loc[m1, 'c']
print (a)
0 x
1 x
Name: c, dtype: object
b = df.loc[m2, 'c']
print (b)
2 y
3 y
Name: c, dtype: object
c = df.loc[m3, 'c']
print (c)
4 z
5 z
Name: c, dtype: object
Then convert the filtered values to sets:
out1, out2, out3 = set(a), set(b), set(c)
print (out1)
{'x'}
print (out2)
{'y'}
print (out3)
{'z'}
And for lengths:
out11, out21, out31 = len(out1), len(out2), len(out3)
print (out11)
print (out21)
print (out31)
1
1
1
Another idea is to create a new column using concat and DataFrame.dot:
df1 = pd.concat([m1, m2, m3], axis=1, keys=('s&t','s','t'))
print (df1)
s&t s t
0 True False False
1 True False False
2 False True False
3 False True False
4 False False True
5 False False True
df['new'] = df1.dot(df1.columns)
Then aggregate into sets and count with nunique:
df2 = (df.groupby('new')['c']
         .agg([('set', lambda x: set(x)), ('count', 'nunique')])
         .reset_index())
print (df2)
new set count
0 s {y} 1
1 s&t {x} 1
2 t {z} 1
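The part 4) added in the Edit is not covered above; a hedged sketch for it (my own combination of duplicated and per-group nunique checks, assuming the revised df from the Edit):
m4 = (df.duplicated('c', keep=False)                                   # c observed twice or more
      & df.groupby('c')['s'].transform('nunique').eq(1)                # single distinct s per c
      & df.groupby('c')['t'].transform('nunique').eq(1)                # single distinct t per c
      & df.groupby('c')['cat'].transform(lambda x: set(x) == {'a'}))   # restricted to cat 'a'
out4 = set(df.loc[m4, 'c'])
print(out4, len(out4))  # {'x'} 1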
