pandas: tricking outerjoin - python

I'm new to Python.
Please, how should I get the result below? If the (cod, date) pair of df_1 exists in df_2, then the row should be added, as explained in my code below.
import pandas as pd

data1 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07'], 'cod': ['12', '12', '14', '15', '15', '18'], 'Zone': ['LA', 'NY', 'LA', 'NY', 'PARIS', 'PARIS'], 'Revenue_Radio': [10, 20, 30, 50, 40, 10]}
df_1 = pd.DataFrame(data1)
data2 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-08'], 'cod': ['12', '14', '15', '15', '18'], 'Zone': ['PARIS', 'NY', 'LA', 'NY', 'NY'], 'Revenue_Str': [10, 20, 30, 50, 5]}
df_2 = pd.DataFrame(data2)
My code is:
dfx = df_2[df_2['cod'].isin(df_1['cod']) &
           df_2['date'].isin(df_1['date'])]
df = (df_1.merge(dfx, on=['date','cod','Zone'], how='outer')
          .fillna(0)
          .sort_values(['date','cod'], ignore_index=True))
Expected output
data_result = {'date': ['2021-06', '2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07', '2021-07','2021-07'], 'cod': ['12', '12', '12', '14', '14', '15', '15', '15', '18'], 'Zone': ['LA', 'NY', 'PARIS','LA', 'NY', 'NY', 'PARIS', 'LA', 'PARIS'], 'Revenue_Radio': [10, 20, 0, 30, 0, 50, 40, 0, 10], 'Revenue_Str': [0, 0, 10,0, 20, 50, 0, 30, 0]}
df_result = pd.DataFrame(data_result)
With my code above, I'm getting something wrong: a 2021-06 / 14 / NY row appears that should not exist in the final df.

IIUC, try:
output = df_1.merge(df_2, on=["date", "cod", "Zone"], how="outer")
# keep only rows whose (date, cod) pair occurs in df_1
output = output[output.set_index(["date", "cod"]).index.isin(df_1.set_index(["date", "cod"]).index)]
output = output.fillna(0).sort_values(['date','cod'], ignore_index=True)
      date cod   Zone  Revenue_Radio  Revenue_Str
0  2021-06  12     LA           10.0          0.0
1  2021-06  12     NY           20.0          0.0
2  2021-06  12  PARIS            0.0         10.0
3  2021-07  14     LA           30.0          0.0
4  2021-07  15     NY           50.0         50.0
5  2021-07  15  PARIS           40.0          0.0
6  2021-07  15     LA            0.0         30.0
7  2021-07  18  PARIS           10.0          0.0
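Equivalently, the filter can be expressed as a semi-join on the key columns; a minimal sketch of the same idea (the keys name is just illustrative):

import pandas as pd

# Keep only (date, cod) pairs that appear in df_1 (a semi-join).
keys = df_1[['date', 'cod']].drop_duplicates()
output = (df_1.merge(df_2, on=['date', 'cod', 'Zone'], how='outer')
              .merge(keys, on=['date', 'cod'])   # inner join drops pairs absent from df_1
              .fillna(0)
              .sort_values(['date', 'cod'], ignore_index=True))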

Related

Apply a function on two pandas tables

I have the following two tables:
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
...                          '2': ['mike', '30', 'ana'],
...                          '3': ['ana', '20', 'mike'],
...                          '4': ['eve', 'eve', 'eve'],
...                          '5': ['10', np.NaN, '10'],
...                          '6': [np.NaN, np.NaN, '20']},
...                    index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df1
1 2 3 4 5 6
index
ind1 john mike ana eve 10 NaN
ind2 10 30 20 eve NaN NaN
ind3 john ana mike eve 10 20
>>> df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
...                    index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df2
first_n
index
ind1 4
ind2 4
ind3 3
I also have the following function that reverses a list and gets the first n non-NA elements:
def get_rev_first_n(row, top_n):
    rev_row = [x for x in row[::-1] if x == x]  # x == x is False for NaN
    return rev_row[:top_n]
>>> get_rev_first_n(['john', 'mike', 'ana', 'eve', '10', np.NaN], 4)
['10', 'eve', 'ana', 'mike']
How would I apply this function to the two tables so that it takes in both df1 and df2 and outputs either a list or columns?
df = pd.concat([df1, df2], axis=1)
df.apply(get_rev_first_n, args=[4], axis=1)  # send args as top_n
axis=1 applies the function to each row; axis=0 (the default) would run it on each column instead.
args=[4] is passed as the second argument (top_n) of get_rev_first_n. Note that after the concat, each row also contains its first_n value.
You can apply a lambda to each row of the DataFrame: concatenate the two DataFrames with concat, then apply your method to each row of the result.
Full Code:
import pandas as pd
import numpy as np
def get_rev_first_n(row, top_n):
    rev_row = [x for x in row[::-1] if x == x]
    # rev_row[0] is the row's own first_n value (appended by the concat below), so skip it
    return rev_row[1:top_n + 1]
df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
                         '2': ['mike', '30', 'ana'],
                         '3': ['ana', '20', 'mike'],
                         '4': ['eve', 'eve', 'eve'],
                         '5': ['10', np.NaN, '10'],
                         '6': [np.NaN, np.NaN, '20']},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df3 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
df = df3.apply(lambda row: get_rev_first_n(row, row['first_n']), axis=1)
print(df)
Output:
index
ind1    [10, eve, ana, mike]
ind2       [eve, 20, 30, 10]
ind3           [20, 10, eve]
dtype: object
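If you want to avoid apply on larger frames, the same logic can be written as a plain comprehension with itertuples; a rough sketch, using the question's original get_rev_first_n (the [:top_n] version, since first_n is not part of the row here) and assuming df1 and df2 share the same row order:

import pandas as pd

# One list per row: pair each df1 row with its first_n value from df2.
result = pd.Series(
    [get_rev_first_n(row, n)
     for row, n in zip(df1.itertuples(index=False, name=None), df2['first_n'])],
    index=df1.index,
)
print(result)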

pandas: tricking left join in python

Please, how should I get the result below? If the cod of df_1 exists in df_2, then the row should be added, as explained below.
import pandas as pd

data1 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07'], 'cod': ['12', '12', '14', '15', '15', '18'], 'Zone': ['LA', 'NY', 'LA', 'NY', 'PARIS', 'PARIS'], 'Revenue_Radio': [10, 20, 30, 50, 40, 10]}
df_1 = pd.DataFrame(data1)
data2 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-08'], 'cod': ['12', '14', '15', '15', '18'], 'Zone': ['PARIS', 'NY', 'LA', 'NY', 'NY'], 'Revenue_Str': [10, 20, 30, 50, 5]}
df_2 = pd.DataFrame(data2)
the expected output is
data_result = {'date': ['2021-06', '2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07', '2021-07','2021-07'], 'cod': ['12', '12', '12', '14', '14', '15', '15', '15', '18'], 'Zone': ['LA', 'NY', 'PARIS','LA', 'NY', 'NY', 'PARIS', 'LA', 'PARIS'], 'Revenue_Radio': [10, 20, 0, 30, 0, 50, 40, 0, 10], 'Revenue_Str': [0, 0, 10,0, 20, 50, 0, 30, 0]}
df_result = pd.DataFrame(data_result)
Use an inner join on date and cod first, and then an outer join, replacing missing values:
df22 = df_2.merge(df_1[['date','cod']].drop_duplicates(), on=['date','cod'])
df = (df_1.merge(df22, on=['date','cod','Zone'], how='outer')
          .fillna(0)
          .sort_values(['date','cod'], ignore_index=True))
print(df)
      date cod   Zone  Revenue_Radio  Revenue_Str
0  2021-06  12     LA           10.0          0.0
1  2021-06  12     NY           20.0          0.0
2  2021-06  12  PARIS            0.0         10.0
3  2021-07  14     LA           30.0          0.0
4  2021-07  15     NY           50.0         50.0
5  2021-07  15  PARIS           40.0          0.0
6  2021-07  15     LA            0.0         30.0
7  2021-07  18  PARIS           10.0          0.0
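For reference, the intermediate inner join keeps only the df_2 rows whose (date, cod) pair occurs in df_1; printing it shows what survives (values follow from the sample data above, spacing approximate):

print(df22)
#       date cod   Zone  Revenue_Str
# 0  2021-06  12  PARIS           10
# 1  2021-07  15     LA           30
# 2  2021-07  15     NY           50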

PANDAS create column by iterating row by row checking values in 2nd dataframe until all values are true

import pandas as pd

df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005'],
    'L' : ['5', '65.0', '445.0', '3200', '65000'],
    'H' : ['2', '15.5', '150.5', '1500', '54000'],
    'W' : ['5', '85.0', '640.0', '1650', '45000']
})
df2 = pd.DataFrame({
    'Rank' : ['1', '2', '3', '4', '5'],
    'Length': ['10', '100', '1000', '10000', '100000'],
    'Width' : ['10', '100', '1000', '10000', '100000'],
    'Height': ['10', '100', '1000', '10000', '100000'],
    'Code' : ['aa', 'bb', 'cc', 'dd', 'ee']
})
So here I have two example dataframes. The first shows unique item numbers with given dimensions. df2 shows the maximum allowable dimensions for a given rank and code, meaning no element (length, width, height) may exceed the given maximums. I would like to check the dimensions in df1 against df2 until all dimension criteria are True, in order to retrieve the item's 'rank' and 'code'. So, in essence, iterate down row by row of df2 until all the criteria are True.
Make a new df3 as follows:
ItemNo Rank Code
001 1 aa
002 2 bb
003 3 cc
004 4 dd
005 5 ee
Using numpy:
- changed the sample data so that the results are not just incrementing
- get the index of the row in df2 that matches the required logic
- build df3 using the index from step 2
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005'],
    'L' : ['5', '65.0', '445.0', '5', '65000'],
    'H' : ['2', '15.5', '150.5', '5', '54000'],
    'W' : ['5', '85.0', '640.0', '5', '45000']
})
df2 = pd.DataFrame({
    'Rank' : ['1', '2', '3', '4', '5'],
    'Length': ['10', '100', '1000', '10000', '100000'],
    'Width' : ['10', '100', '1000', '10000', '100000'],
    'Height': ['10', '100', '1000', '10000', '100000'],
    'Code' : ['aa', 'bb', 'cc', 'dd', 'ee']
})
# fix up datatypes for comparisons
df1.loc[:, ["L", "H", "W"]] = df1.loc[:, ["L", "H", "W"]].astype(float)
df2.loc[:, ["Length", "Height", "Width"]] = df2.loc[:, ["Length", "Height", "Width"]].astype(float)
# row by row comparison, argmax to get first True
idx = [np.argmax((df1.loc[r, ["L", "H", "W"]].values
                  < df2.loc[:, ["Length", "Height", "Width"]].values).all(axis=1))
       for r in df1.index]
# finally the result
pd.concat([df1.ItemNo, df2.loc[idx, ["Rank", "Code"]].reset_index(drop=True)], axis=1)
  ItemNo Rank Code
0    001    1   aa
1    002    2   bb
2    003    3   cc
3    004    1   aa
4    005    5   ee
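One detail: the comparison above uses strict <, while the requirement says dimensions "must not exceed" the limits. If a dimension exactly equal to a limit should still qualify, a <= variant of the same step would be:

# Same row-by-row comparison, but <= so values equal to a limit still qualify.
idx = [np.argmax((df1.loc[r, ["L", "H", "W"]].values
                  <= df2.loc[:, ["Length", "Height", "Width"]].values).all(axis=1))
       for r in df1.index]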
I think you can try:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005', '006'],
    'L' : ['5', '65.0', '445.0', '3200', '65000', '10'],
    'H' : ['2', '15.5', '150.5', '1500', '54000', '1000'],
    'W' : ['5', '85.0', '640.0', '1650', '45000', '10']
})
df2 = pd.DataFrame({
    'Rank' : ['1', '2', '3', '4', '5'],
    'Length': ['10', '100', '1000', '10000', '100000'],
    'Width' : ['10', '100', '1000', '10000', '100000'],
    'Height': ['10', '100', '1000', '10000', '100000'],
    'Code' : ['aa', 'bb', 'cc', 'dd', 'ee']
})
df_sort = pd.DataFrame({'W': np.searchsorted(df2['Width'].astype(float), df1['W'].astype(float)),
                        'H': np.searchsorted(df2['Height'].astype(float), df1['H'].astype(float)),
                        'L': np.searchsorted(df2['Length'].astype(float), df1['L'].astype(float))})
df1['Rank'] = df_sort.max(axis=1).map(df2['Rank'])
df1['Code'] = df1['Rank'].map(df2.set_index('Rank')['Code'])
print(df1)
Output:
  ItemNo      L      H      W Rank Code
0    001      5      2      5    1   aa
1    002   65.0   15.5   85.0    2   bb
2    003  445.0  150.5  640.0    3   cc
3    004   3200   1500   1650    4   dd
4    005  65000  54000  45000    5   ee
5    006     10   1000     10    3   cc
The core of the code is np.searchsorted, which finds the index at which, for example, a value of L would be inserted into Length while keeping it sorted, per the conditions listed in the documentation. So I use np.searchsorted for each of the three dimensions, take the largest index with max(axis=1), and assign the rank and code based on that largest value using map.
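To make the searchsorted behaviour concrete, here is a tiny self-contained check (limit values taken from the sample df2):

import numpy as np

# With the default side='left', a value equal to a boundary maps to that
# boundary's index, so a dimension exactly at a limit still gets that rank.
print(np.searchsorted([10.0, 100.0, 1000.0], 10.0))   # 0 -> first rank
print(np.searchsorted([10.0, 100.0, 1000.0], 10.5))   # 1 -> second rank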

Adding 1 values together from different rows where value in column matches

I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
The second dataset has a list of teams that are playing each other today.
What I would like to do is add to dataset2 the mean of the two playing teams' values, so the outcome would be as in the Desired example below.
I have looked through Stack Overflow and not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
}
with the desired output being
Desired = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
}
Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so that is handled here as well.
Sample:
data1 = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
})
Desired = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
})
Code:
import pandas as pd
cols = [x for x in data2 if 'Home' in x or 'Away' in x]
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
     DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2  Home0-0  Home1-0  Away0-1
0  17/06/2020   11:30    Everton       Hull     80.0     86.0     82.0
1  17/06/2020   15:30   Man City      Leeds     72.0     85.0     56.0
Thanks for adding the data. Here is a simple way using loops: loop through df2 (the upcoming matches), find the rows for the two participating teams in df1 (the team statistics), average the desired columns, and write the result into df2.
Assuming a structure similar to your data set, here is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
                    'home0-0': [86, 78, 65, 67, 100],
                    'home1-0': [76, 86, 67, 100, 0],
                    'home0-1': [91, 88, 75, 100, 67],
                    'home1-1': [75, 67, 67, 100, 100],
                    'away0-0': [57, 86, 71, 91, 50],
                    'away1-0': [73, 50, 71, 100, 100],
                    'away0-1': [78, 62, 40, 80, 0],
                    'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
                    'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
                    'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
                    'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team'] == row['team1']]
    team2 = df1[df1['team'] == row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col] = np.mean([team1[col].values[0], team2[col].values[0]])
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team'] == row['TXTECHIPA1']]
    team2 = data2[data2['Team'] == row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col] = np.mean([int(team1[col].values[0]), int(team2[col].values[0])])
print(data1)
Result:
     DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2  Home0-0  Home1-0  Away0-1
0  17/06/2020   11:30    Everton       Hull     80.0     85.5     81.5
1  17/06/2020   15:30   Man City      Leeds     72.0     85.0     55.5
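As a loop-free variant of the same idea, you can look up each team's statistics row and average the two aligned frames; a sketch against the sample data1/data2 DataFrames above (assuming each team appears exactly once in data2):

import pandas as pd

stats = data2.set_index('Team').astype(float)           # one stats row per team
home = stats.reindex(data1['TXTECHIPA1']).reset_index(drop=True)
away = stats.reindex(data1['TXTECHIPA2']).reset_index(drop=True)
result = pd.concat([data1, (home + away) / 2], axis=1)  # column-wise mean
print(result)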

Plot groupby of groupby pandas

The data is a time series, with many member ids associated with many categories:
data_df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
                                 '2018-09-14 00:01:46',
                                 '2018-09-14 00:01:56',
                                 '2018-09-14 00:01:57',
                                 '2018-09-14 00:01:58',
                                 '2018-09-14 00:02:05'],
                        'category': [1, 1, 1, 2, 2, 2],
                        'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
                        'data': ['23', '20', '20', '11', '16', '62']})
There are about 50 categories with 30 members, each with around 1000 datapoints.
I am trying to make one plot per category.
By subsetting each category then plotting via:
fig, ax = plt.subplots(figsize=(8,6))
for i, g in category.groupby(['member']):
    g.plot(y='data', ax=ax, label=str(i))
plt.show()
This works fine for a single category; however, when I try to use a for loop to repeat this for each category, it does not work:
tests = pd.DataFrame()
for category in categories:
    tests = df.loc[df['category'] == category]
    for test in tests:
        fig, ax = plt.subplots(figsize=(8,6))
        for i, g in category.groupby(['member']):
            g.plot(y='data', ax=ax, label=str(i))
        plt.show()
yields an "AttributeError: 'str' object has no attribute 'groupby'" error.
What I would like is a loop that spits out one graph per category, with all the members' data plotted on each graph.
Creating your dataframe
import pandas as pd
data_df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
                                 '2018-09-14 00:01:46',
                                 '2018-09-14 00:01:56',
                                 '2018-09-14 00:01:57',
                                 '2018-09-14 00:01:58',
                                 '2018-09-14 00:02:05'],
                        'category': [1, 1, 1, 2, 2, 2],
                        'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
                        'data': ['23', '20', '20', '11', '16', '62']})
then [EDIT after comments]
import matplotlib.pyplot as plt
import numpy as np

subplots_n = np.unique(data_df['category']).size
subplots_x = np.round(np.sqrt(subplots_n)).astype(int)
subplots_y = np.ceil(np.sqrt(subplots_n)).astype(int)
for i, category in enumerate(data_df.groupby('category')):
    category_df = pd.DataFrame(category[1])
    x = [str(x) for x in category_df['member']]
    y = [float(x) for x in category_df['data']]
    plt.subplot(subplots_x, subplots_y, i+1)
    plt.plot(x, y)
    plt.title("Category {}".format(category_df['category'].values[0]))
plt.tight_layout()
plt.show()
yields a grid with one subplot per category (figure not shown).
Please note that this also nicely handles bigger groups, like
data_df2 = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5],
                         'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe', 'ric', 'mat', 'pip', 'zoe', 'qui', 'quo', 'qua'],
                         'data': ['23', '20', '20', '11', '16', '62', '34', '27', '12', '7', '9', '13', '7']})
I'm far from an expert with pandas, but if you execute the following simple enough snippet
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
                            '2018-09-14 00:01:46',
                            '2018-09-14 00:01:56',
                            '2018-09-14 00:01:57',
                            '2018-09-14 00:01:58',
                            '2018-09-14 00:02:05'],
                   'category': [1, 1, 1, 2, 2, 2],
                   'Id': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
                   'data': ['23', '20', '20', '11', '16', '62']})
fig, ax = plt.subplots()
for item in df.groupby('category'):
    ax.plot([float(x) for x in item[1]['category']],
            [float(x) for x in item[1]['data'].values],
            linestyle='none', marker='D')
plt.show()
you produce a figure with one marker series per category (figure not shown).
But there is probably a better way.
EDIT: Based on the changes made to your question, I changed my snippet to
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
                            '2018-09-14 00:01:46',
                            '2018-09-14 00:01:56',
                            '2018-09-14 00:01:57',
                            '2018-09-14 00:01:58',
                            '2018-09-14 00:02:05'],
                   'category': [1, 1, 1, 2, 2, 2],
                   'Id': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
                   'data': ['23', '20', '20', '11', '16', '62']})
fig, ax = plt.subplots(nrows=np.unique(df['category']).size)
for i, item in enumerate(df.groupby('category')):
    ax[i].plot([str(x) for x in item[1]['Id']],
               [float(x) for x in item[1]['data'].values],
               linestyle='none', marker='D')
    ax[i].set_title('Category {}'.format(item[1]['category'].values[0]))
fig.tight_layout()
plt.show()
which now displays one titled subplot per category (figure not shown).
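If you want one standalone figure per category instead of stacked subplots, with one line per member as the question asks, a minimal sketch along the same lines (assuming data_df from above):

import matplotlib.pyplot as plt
import pandas as pd

# One figure per category; within each, one line per member.
for cat, cat_df in data_df.groupby('category'):
    fig, ax = plt.subplots(figsize=(8, 6))
    for member, g in cat_df.groupby('member'):
        ax.plot(pd.to_datetime(g['Date']), g['data'].astype(float),
                marker='o', label=member)
    ax.set_title('Category {}'.format(cat))
    ax.legend()
    plt.show()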
