Adding values together from different rows where a value in a column matches - python

I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
The second contains a list of teams that are playing today.
What I would like to do is add to the second dataset the mean of the two teams' values for each match, so the outcome would be as below.
I have looked through Stack Overflow and not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
}
with the desired output being
Desired = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
}

Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so I have included that as well.
Sample:
data1 = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
})
Desired = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
})
Code:
import pandas as pd

# Columns to average after the merges
cols = [x for x in data2 if 'Home' in x or 'Away' in x]
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0

Thanks for adding the data. Here is a simple way using loops: loop through df2 (the upcoming matches), find the corresponding rows for the two participating teams in df1 (the team statistics), average the desired columns, and add the result to df2.
Considering similar structure as your data set, here is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
                    'home0-0': [86, 78, 65, 67, 100],
                    'home1-0': [76, 86, 67, 100, 0],
                    'home0-1': [91, 88, 75, 100, 67],
                    'home1-1': [75, 67, 67, 100, 100],
                    'away0-0': [57, 86, 71, 91, 50],
                    'away1-0': [73, 50, 71, 100, 100],
                    'away0-1': [78, 62, 40, 80, 0],
                    'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17'] * 7,
                    'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
                    'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
                    'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team'] == row['team1']]
    team2 = df1[df1['team'] == row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col] = np.mean([team1[col].values[0], team2[col].values[0]])
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team'] == row['TXTECHIPA1']]
    team2 = data2[data2['Team'] == row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col] = np.mean([int(team1[col].values[0]), int(team2[col].values[0])])
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5
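If you would rather avoid iterrows altogether, the same averages can be computed by indexing the stats table by team and reindexing each column against the two team columns. A minimal sketch using the sample data from the question:

```python
import pandas as pd

data1 = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
})

# Index the stats by team so each stat column can be looked up by team name
stats = data2.set_index('Team').astype(int)
for col in stats.columns:
    home = stats[col].reindex(data1['TXTECHIPA1']).to_numpy()
    away = stats[col].reindex(data1['TXTECHIPA2']).to_numpy()
    data1[col] = (home + away) / 2

print(data1)
```

This gives the same unrounded means as the loop above (80.0/72.0, 85.5/85.0, 81.5/55.5); apply `.round(0)` if you want the rounded version.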


Looping through two lists of dataframes

Below is the script I am working with. For practice, I've created two sets of dataframes: one set of df1, df2, and df3, and another set of dv1, dv2, and dv3. I then created two lists, test and test2, which I combined into zipped_list. Now I am trying to create a loop that will do the following: 1. Set the index and create keys = 2022 and 2021. 2. Swap levels so the columns are next to each other. The loop works, but it only gets applied to the first dataframe. Without calling each dataframe one by one, how can I apply it to all the dataframes found in zipped_list?
import pandas as pd

# Creating a set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
        'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
        'price': [1200, 150, 300, 450, 200]}
df1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
         'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
         'price': [2200, 200, 300, 450, 200]}
df2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
         'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
         'price': [1500, 100, 200, 350, 400]}
df3 = pd.DataFrame(data3)
# Creating another set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
        'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
        'price': [10, 20, 30, 40, 50]}
dv1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
         'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
         'price': [10, 20, 30, 50, 50]}
dv2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],
         'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
         'price': [1, 2, 3, 4, 5]}
dv3 = pd.DataFrame(data3)
# Creating lists of dataframes
test = [df1, df2, df3]
test2 = [dv1, dv2, dv3]
# Combining the two lists
zipped = zip(test, test2)
zipped_list = list(zipped)
# Looping through the zipped_list
for x, y in zipped_list:
    z = pd.concat([zipped_list[0][0].set_index(['product_name', 'item_name']),
                   zipped_list[0][1].set_index(['product_name', 'item_name'])],
                  axis='columns', keys=['2022', '2021'])
    z = z.swaplevel(axis='columns')[zipped_list[0][0].columns[2:]]
    print(z)
In addition to this dataframe, there should be two more.
The reason is that you always access the first element of zipped_list and never use the loop variables (x and y). You can create a new list and append each modified dataframe to it:
new_list = []
for x in zipped_list:
    z = pd.concat([x[0].set_index(['product_name', 'item_name']),
                   x[1].set_index(['product_name', 'item_name'])],
                  axis='columns', keys=['2022', '2021'])
    z = z.swaplevel(axis='columns')[x[0].columns[2:]]
    new_list.append(z)
new_list
Output:
[ price
2022 2021
product_name item_name
laptop hp 1200 10
printer logitech 150 20
tablet samsung 300 30
desk lg 450 40
chair lenovo 200 50,
price
2022 2021
product_name item_name
laptop hp 2200 10
printer mac 200 20
tablet fujitsu 300 30
desk lg 450 50
chair asus 200 50,
price
2022 2021
product_name item_name
laptop microsoft 1500 1
printer logitech 100 2
tablet samsung 200 3
desk lg 350 4
chair asus 400 5]
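The same fix can also be written with tuple unpacking in the loop header, so the pairs never need to be indexed. A minimal sketch with a single tiny (df, dv) pair standing in for the question's data (the values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for one (df, dv) pair from the question
df1 = pd.DataFrame({'product_name': ['laptop'], 'item_name': ['hp'], 'price': [1200]})
dv1 = pd.DataFrame({'product_name': ['laptop'], 'item_name': ['hp'], 'price': [10]})

# Unpack each pair as (x, y) instead of always indexing zipped_list[0]
new_list = [
    pd.concat([x.set_index(['product_name', 'item_name']),
               y.set_index(['product_name', 'item_name'])],
              axis='columns', keys=['2022', '2021'])
      .swaplevel(axis='columns')[x.columns[2:]]
    for x, y in zip([df1], [dv1])
]
print(new_list[0])
```

With the question's full lists this would be `for x, y in zip(test, test2)`, producing one reshaped frame per pair.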

Complete a pandas data frame with values from other data frames

I have 3 data frames. I need to enrich the data in df with the columns from df2 and df3 so that df ends up with the columns 'Code', 'Quantity', 'Payment', 'Date', 'Name', 'Size', 'Product', 'product_id', 'Sector'.
The codes that are in df but not in df2 OR df3 need to receive "unknown" for the string columns and 0 for the numeric columns.
import pandas as pd
data = {'Code': [356, 177, 395, 879, 952, 999],
        'Quantity': [20, 21, 19, 18, 15, 10],
        'Payment': [173.78, 253.79, 158.99, 400, 500, 500],
        'Date': ['2022-06-01', '2022-09-01', '2022-08-01',
                 '2022-07-03', '2022-06-09', '2022-06-09']
        }
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
data2 = {'Code': [356, 177, 395, 893, 697, 689, 687],
         'Name': ['John', 'Mary', 'Ann', 'Mike', 'Bill', 'Joana', 'Linda'],
         'Product': ['RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'POU'],
         'product_id': [189, 188, 16, 36, 59, 75, 55],
         'Size': [1, 1, 3, 4, 5, 4, 7],
         }
df2 = pd.DataFrame(data2)
data3 = {'Code': [879, 356, 389, 395, 893, 697, 689, 978],
         'Name': ['Mark', 'John', 'Marry', 'Ann', 'Mike', 'Bill', 'Joana', 'James'],
         'Product': ['TTT', 'RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'DTS'],
         'product_id': [988, 189, 188, 16, 36, 59, 75, 66],
         'Sector': ['rt', 'dx', 'sx', 'da', 'sa', 'sd', 'ld', 'pc'],
         }
df3 = pd.DataFrame(data3)
I was using the following code to obtain the unknown codes by comparing with df2, but now I have to compare with df3 as well, and also add the data from the columns ['Name', 'Size', 'Product', 'product_id', 'Sector'].
common = df2.merge(df,on=['Code'])
new_codes = df[(~df['Code'].isin(common['Code']))]
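One way to approach this would be to outer-merge df2 and df3 on their shared columns first, left-join the result onto df, and then fill the gaps per dtype. This is a sketch under the assumption that rows for the same code agree on 'Name', 'Product', and 'product_id' across df2 and df3 (as they do in the sample):

```python
import pandas as pd

df = pd.DataFrame({'Code': [356, 177, 395, 879, 952, 999],
                   'Quantity': [20, 21, 19, 18, 15, 10],
                   'Payment': [173.78, 253.79, 158.99, 400, 500, 500],
                   'Date': pd.to_datetime(['2022-06-01', '2022-09-01', '2022-08-01',
                                           '2022-07-03', '2022-06-09', '2022-06-09'])})
df2 = pd.DataFrame({'Code': [356, 177, 395, 893, 697, 689, 687],
                    'Name': ['John', 'Mary', 'Ann', 'Mike', 'Bill', 'Joana', 'Linda'],
                    'Product': ['RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'POU'],
                    'product_id': [189, 188, 16, 36, 59, 75, 55],
                    'Size': [1, 1, 3, 4, 5, 4, 7]})
df3 = pd.DataFrame({'Code': [879, 356, 389, 395, 893, 697, 689, 978],
                    'Name': ['Mark', 'John', 'Marry', 'Ann', 'Mike', 'Bill', 'Joana', 'James'],
                    'Product': ['TTT', 'RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'DTS'],
                    'product_id': [988, 189, 188, 16, 36, 59, 75, 66],
                    'Sector': ['rt', 'dx', 'sx', 'da', 'sa', 'sd', 'ld', 'pc']})

# Combine the two lookup tables on their shared columns, then left-join onto df
lookup = df2.merge(df3, on=['Code', 'Name', 'Product', 'product_id'], how='outer')
out = df.merge(lookup, on='Code', how='left')

# Fill the gaps: 0 for numeric columns, "unknown" for string columns
num_cols = out.select_dtypes('number').columns
out[num_cols] = out[num_cols].fillna(0)
str_cols = out.select_dtypes('object').columns
out[str_cols] = out[str_cols].fillna('unknown')
```

Codes found in neither lookup table (952 and 999 here) end up with 'unknown' in the string columns and 0 in 'Size' and 'product_id'; a code found in only one table gets the available columns filled and the rest defaulted.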

Filter Pandas Dataframe Under Multiple Conditions

My current progress
I currently have a pandas DataFrame with 5 rows:
df = pd.DataFrame({
    'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
    'ID': [110, 111, 112, 113, 114],
    'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
    'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
    'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
    'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
    'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
    'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
    'Location': ['US', 'US', 'JP', 'CN', 'JA'],
})
lst = ['River', 'Pat', 'Brad', 'William', 'Clogah']
I need to filter out and drop all rows in the dataframe that contain zero values from my list, and also those that contain more than one value from my list.
Row 1, i.e. (1: 'John', 110, 'xox', 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'), would be dropped because both 'River' and 'Pat' are in the list.
Row 2, i.e. (2: 'Mark', 111, 'xoy', 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'), would be dropped because both 'Trevor' and 'Jared' are in the list.
Row 5, i.e. (5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'), would be dropped because the row does not contain any values from my list.
The two other rows would be kept.
Original printed DF:
0: 'Name', 'ID', 'Job title', 'Manager', 'M2', 'M3', 'M4', 'M5', 'Location'
1: 'John', 110, 'xox', 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'
2: 'Mark', 111, 'xoy', 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'
3: 'Kevin', 112, 'xoz', 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP'
4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'
5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'
Filtered printed DF:
3: 'Kevin', 112, 'xoz', 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP'
4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'
The current process only filters out rows that don't contain any value from my managers list. I want to keep rows with exactly one manager from the list, but not rows without any managers from the list.
Not the prettiest way to achieve this, but this will work:
import pandas as pd

d = {
    "Name": ["John", "Mark", "Kevin", "Ron", "Amira"],
    "ID": [110, 111, 112, 113, 114],
    "Job title": ["xox", "xoy", "xoz", "yow", "uyt"],
    "M1": ["River", "Trevor", "John", "Lydia", "Connor"],
    "M2": ["Shaun", "Mary", "Ronald", "Cary", "Miranda"],
    "M3": ["Clavis", "Sharon", "Randall", "Mark", "Doug"],
    "M4": ["Pat", "Karen", "Brad", "Chad", "Anita"],
    "M5": ["Ty", "Jared", "Bill", "William", "Bob"],
    "Location": ["US", "US", "JP", "CN", "JA"],
}
df = pd.DataFrame(d)

# Isolate managers in their own DataFrame
managers = ["River", "Pat", "Trevor", "Jared", "Connor"]
df_managers = df[["M1", "M2", "M3", "M4", "M5"]]

# Flag each employee that has fewer than two managers from the list
less_than_two_managers = []
for i in range(df_managers.shape[0]):
    if len(set(df_managers.iloc[i]).intersection(set(managers))) < 2:
        less_than_two_managers.append(True)
    else:
        less_than_two_managers.append(False)
df["LT two managers"] = less_than_two_managers
df[df["LT two managers"] == True]
here you go:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
                   'ID': [110, 111, 112, 113, 114],
                   'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
                   'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
                   'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
                   'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
                   'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
                   'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
                   'Location': ['US', 'US', 'JP', 'CN', 'JA']})
managers = ['River', 'Pat', 'Trevor', 'Jared', 'Connor']
mask = df.applymap(lambda x: x in managers)
filtered_df = df[mask.values.sum(axis=1) < 2]
print(filtered_df)
to filter also the 0 (so only 1 manager will stay):
filtered_df = df[mask.values.sum(axis=1) == 1]
Vectorial solution using a mask:
m = (df.filter(regex=r'^M')
       .apply(lambda s: s.isin(lst))
       .sum(axis=1).eq(1)
     )
out = df.loc[m]
Output:
Name ID Job title Manager M2 M3 M4 M5 Location
2 Kevin 112 xoz John Ronald Randall Brad Bill JP
3 Ron 113 yow Lydia Cary Mark Chad William CN

Reshaping pandas dataframe with unstack()

I am trying to reshape a pandas DataFrame so that one of the columns is unstacked into a wider format. Once I proceed with unstack(), new column levels occur, but I seem to be unable to rearrange the headers the way I want.
Firstly, I have following df:
import pandas as pd

fList = [['Packs', 'Brablik', 'Holesovice', '2017', 100],
         ['Decorations', 'Drapp-design', 'Holesovice', '2017', 150],
         ['Decorations', 'Klapetkovi', 'Holesovice', '2017', 200],
         ['Decorations', 'Lezecké dárky', 'Fler', '2017', 100],
         ['Decorations', 'PP', 'Other', '2017', 350],
         ['Decorations', 'Pavlimila', 'Akce', '2017', 20],
         ['Decorations', 'Pavlimila', 'Holesovice', '2017', 50],
         ['Decorations', 'Wiccare', 'Holesovice', '2017', 70],
         ['Toys', 'Klára Vágnerová', 'Holesovice', '2017', 100],
         ['Toys', 'Lucie Polonyiová', 'Holesovice', '2017', 80],
         ['Dresses', 'PP', 'Other', '2018', 200]]
df = pd.DataFrame(fList, columns=['Section', 'Seller', 'Store',
                                  'Selected_period', 'Total_pieces'])
This produces:
Consequently I reshape it like:
df = df.set_index(['Section', 'Seller', 'Store', 'Selected_period']).unstack(level=-1)
df = df.fillna(0)
df.columns = df.columns.droplevel(0)
That outputs:
However, I would like to have just the following columns in the final dataframe: Section, Seller, Store, 2017, 2018. I still fail to rearrange it to get the output I want, although I have tried to adopt the solutions posted here and here and here. Any suggestions?
If I understand correctly, you seem to just be missing a reset_index() call. Try this:
df = df.set_index(['Section', 'Seller', 'Store', 'Selected_period']).unstack(level = -1).fillna(0)
df.columns = df.columns.droplevel(0).rename('')
df = df.reset_index()
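An alternative that skips the unstack/droplevel steps entirely is pivot_table. A sketch on a two-row subset of the sample data (it assumes each Section/Seller/Store/year combination appears once, so the default mean aggregation leaves the values unchanged):

```python
import pandas as pd

fList = [['Packs', 'Brablik', 'Holesovice', '2017', 100],
         ['Dresses', 'PP', 'Other', '2018', 200]]
df = pd.DataFrame(fList, columns=['Section', 'Seller', 'Store',
                                  'Selected_period', 'Total_pieces'])

out = (df.pivot_table(index=['Section', 'Seller', 'Store'],
                      columns='Selected_period', values='Total_pieces',
                      fill_value=0)
         .reset_index()
         .rename_axis(columns=None))  # drop the leftover 'Selected_period' axis name
print(out)
```

This yields exactly the columns Section, Seller, Store, 2017, 2018, with missing year/store combinations filled with 0.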

Unable to merge multiIndexed pandas dataframes

I believe I am ultimately looking for a way to change the dtype of data frame indices. Please allow me to explain:
Each df is multi-indexed on (the same) four levels. One level consists of mixed labels: integers, integers plus letters (like D8), and just letters.
However, for df1 the integers within the index labels are surrounded by quotation marks, while for df2 the same integer labels are free of any quotes; i.e.,
df1.index.levels[1]
Index(['Z5', '02', '1C', '26', '2G', '2S', '30', '46', '48', '5M', 'CSA', etc...'], dtype='object', name='BMDIV')
df2.index.levels[1]
Index([ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', '8F',
'8J', 'AN', 'AS', 'C3', 'CA', etc.
dtype='object', name='BMDIV')
When I try to merge these tables
df_merge = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
I get:
TypeError: type object argument after * must be a sequence, not map
Is there a way to change, for example, the type of label in df2 so that the numbers are in quotes and therefore presumably match the corresponding labels in df1?
One way to change the level values is to build a new MultiIndex and re-assign it to df.index:
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])
level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
level_values[0] = level_values[0].astype(str)
df.index = pd.MultiIndex.from_arrays(level_values)
which makes the level values strings:
In [53]: df.index.levels[0]
Out[56]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
'AS', 'C3', 'CA'],
dtype='object', name='index')
Alternatively, you could avoid the somewhat low-level messiness by using reset_index and set_index:
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])
df = df.reset_index('index')
df['index'] = df['index'].astype(str)
df = df.set_index('index', append=True)
df = df.swaplevel(0, 1, axis=0)
which again produces string-valued index level values:
In [67]: df.index.levels[0]
Out[67]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
'AS', 'C3', 'CA'],
dtype='object', name='index')
Of these two options, using_MultiIndex is faster:
import numpy as np
import pandas as pd

def make_df(N):
    df = pd.DataFrame(
        {'index': np.random.choice(np.array(
            [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
             '8F', '8J', 'AN', 'AS', 'C3', 'CA'], dtype='O'), size=N),
         'foo': 1, 'bar': 2})
    df = df.set_index(['index', 'foo'])
    return df

def using_MultiIndex(df):
    level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
    level_values[0] = level_values[0].astype(str)
    df.index = pd.MultiIndex.from_arrays(level_values)
    return df

def using_reset_index(df):
    df = df.reset_index('index')
    df['index'] = df['index'].astype(str)
    df = df.set_index('index', append=True)
    df = df.swaplevel(0, 1, axis=0)
    return df
In [81]: %%timeit df = make_df(1000)
....: using_MultiIndex(df)
....:
1000 loops, best of 3: 693 µs per loop
In [82]: %%timeit df = make_df(1000)
....: using_reset_index(df)
....:
100 loops, best of 3: 2.09 ms per loop
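For completeness, the conversion can also be done with MultiIndex.set_levels, which replaces one level's values while leaving the codes (row positions) untouched. A sketch on a small mixed-type index:

```python
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, '1C', '5M'], 'foo': 1, 'bar': 2}
).set_index(['index', 'foo'])

# Stringify only level 0; row order and codes are unchanged
df.index = df.index.set_levels(df.index.levels[0].astype(str), level=0)
print(df.index.levels[0])
```

After this, the level values are all strings, so merging against a frame whose labels are quoted integers should line up.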
