Pandas: How to avoid nested for loop - python

I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the target in another. I need to look up the target, bring it into the df with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations all with unique targets.
I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real life data is larger on all dimensions, and it takes up an inordinate amount of time to loop through everything.
I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location']==l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']
print(emp_df)
end = time.time()
print(end - start)

If the target dataframe is guaranteed to have unique locations, you can use a join to make this process really quick.
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
With the setup done, we can now use our join.
product_tgt_cols = [product+'_tgt' for product in product_list]
print(product_tgt_cols) #['Product1_tgt', 'Product2_tgt', 'Product3_tgt', 'Product4_tgt', 'Product5_tgt']
product_pct_cols = [product+'_pct' for product in product_list]
print(product_pct_cols) #['Product1_pct', 'Product2_pct', 'Product3_pct', 'Product4_pct', 'Product5_pct']
start = time.time()
#join on location to get _tgt columns
emp_df = emp_df.join(tgt_df.set_index('Location'), on='Location', rsuffix='_tgt')
#divide the entire product arrays using numpy, store in temp
temp = emp_df[product_list].values/emp_df[product_tgt_cols].values
#create a new temp df for the _pct results, and assign back to emp_df
emp_df = emp_df.assign(**pd.DataFrame(temp, columns = product_pct_cols))
print(emp_df)
end = time.time()
print("with join: ",end - start)
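The uniqueness assumption above can also be checked by pandas itself. The following is a minimal sketch, not the answer's original code: it uses `DataFrame.merge` with `validate='many_to_one'`, which raises an error if a location appears more than once in the target frame. The small frames and the names `products` and `merged` are made up for illustration.

```python
import pandas as pd

# Small illustrative frames (assumed data, not the question's random values)
products = ['Product1', 'Product2']
tgt_df = pd.DataFrame({'Location': ['Denver', 'Boulder'],
                       'Product1': [600, 200],
                       'Product2': [300, 100]})
emp_df = pd.DataFrame({'Employee': ['Joe', 'Bernie', 'Kamala'],
                       'Location': ['Boulder', 'Denver', 'Denver'],
                       'Product1': [100, 300, 600],
                       'Product2': [50, 150, 300]})

# Left merge keeps every employee row; target columns get the '_tgt' suffix.
# validate='many_to_one' raises MergeError if tgt_df has duplicate locations.
merged = emp_df.merge(tgt_df, on='Location', how='left',
                      suffixes=('', '_tgt'), validate='many_to_one')

# One vectorized division per product, no loop over locations
for p in products:
    merged[p + '_pct'] = merged[p] / merged[p + '_tgt']

print(merged)
```

The remaining loop runs once per product column, so its cost does not grow with the number of rows or locations.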

Your dataframes are in "wide" format. I find "long" format easier to manipulate.
# turn emp_df into long
# indexed by "Employee", "Location", and "Product"
emp_df = (emp_df.set_index(['Employee', 'Location'])
.stack().to_frame())
emp_df.head()
                              0
Employee Location
Joe      Boulder  Product1  238
                  Product2  135
                  Product3  873
                  Product4  153
                  Product5  373
# turn tgt_df into a long series
# indexed by "Location" and "Product"
tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()
# set target for employees by locations:
emp_df['target'] = (emp_df.groupby('Employee')[0]
.apply(lambda x: tgt_df))
# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']
# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a dataframe with
# multi-level index and multi-level column
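As an alternative way to reach the same long format, here is a hedged sketch that swaps `stack` and `groupby().apply` for `melt` and an explicit `merge` on location and product. It is a different technique from the answer's code, and the column names `actual`, `target`, and `pct` and the small frames are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative frames (assumed data)
tgt_df = pd.DataFrame({'Location': ['Denver', 'Boulder'],
                       'Product1': [600, 200],
                       'Product2': [300, 100]})
emp_df = pd.DataFrame({'Employee': ['Joe', 'Bernie'],
                       'Location': ['Boulder', 'Denver'],
                       'Product1': [100, 300],
                       'Product2': [50, 150]})

# Wide -> long: one row per (Employee, Location, Product)
emp_long = emp_df.melt(id_vars=['Employee', 'Location'],
                       var_name='Product', value_name='actual')
# Wide -> long: one row per (Location, Product)
tgt_long = tgt_df.melt(id_vars='Location',
                       var_name='Product', value_name='target')

# Attach the matching target to every actual value, then divide
long_df = emp_long.merge(tgt_long, on=['Location', 'Product'], how='left')
long_df['pct'] = long_df['actual'] / long_df['target']

# Optional: back to wide, one column of percentages per product
wide_pct = long_df.pivot(index='Employee', columns='Product', values='pct')
print(wide_pct)
```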

Related

Looping through two lists of dataframe

Below is the script I am working with. For practice, I've created two sets of dataframes: one set of df1, df2, and df3, and another set of dv1, dv2, and dv3. I then created two lists, test and test2, and combined them into zipped_list. Now I am trying to write a loop that does the following: 1. Set the index and create the keys 2022 and 2021. 2. Swap levels so the columns are next to each other. The loop runs, but it is only applied to the first pair of dataframes. Without calling each dataframe one by one, how can I apply it to all the dataframes in zipped_list?
import pandas as pd
#Creating a set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [1200, 150, 300, 450, 200]}
df1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [2200, 200, 300, 450, 200]}
df2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1500, 100, 200, 350, 400]}
df3 = pd.DataFrame(data3)
#Creating another set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [10, 20, 30, 40, 50]}
dv1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [10, 20, 30, 50, 50]}
dv2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1, 2, 3, 4, 5]}
dv3 = pd.DataFrame(data3)
#creating a list for dataframe
test=[df1,df2,df3]
test2=[dv1,dv2,dv3]
#combining two lists
zipped = zip(test, test2)
zipped_list = list(zipped)
#Looping through the zipped_list
for x,y in zipped_list:
    z=pd.concat([zipped_list[0][0].set_index(['product_name','item_name']), zipped_list[0][1].set_index(['product_name','item_name'])],
                axis='columns', keys=['2022', '2021'])
    z=z.swaplevel(axis='columns')[zipped_list[0][0].columns[2:]]
    print(z)
In addition to this dataframe, there should be two more.
The reason is that the loop body always accesses the first element of zipped_list (zipped_list[0]) and never uses the loop variables (x and y). Use the loop variable instead, and append each modified dataframe to a new list:
new_list = []
for x in zipped_list:
    z=pd.concat([x[0].set_index(['product_name','item_name']), x[1].set_index(['product_name','item_name'])],
                axis='columns', keys=['2022', '2021'])
    z=z.swaplevel(axis='columns')[x[0].columns[2:]]
    new_list.append(z)
new_list
Output:
[                        price
                         2022 2021
 product_name item_name
 laptop       hp         1200   10
 printer      logitech    150   20
 tablet       samsung     300   30
 desk         lg          450   40
 chair        lenovo      200   50,
                         price
                         2022 2021
 product_name item_name
 laptop       hp         2200   10
 printer      mac         200   20
 tablet       fujitsu     300   30
 desk         lg          450   50
 chair        asus        200   50,
                         price
                         2022 2021
 product_name item_name
 laptop       microsoft  1500    1
 printer      logitech    100    2
 tablet       samsung     200    3
 desk         lg          350    4
 chair        asus        400    5]
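The same fix can be written by unpacking each pair directly in the loop. This is a small sketch under assumptions: the helper name `compare` and the shortened two-row frames are hypothetical, while the `concat`/`swaplevel` steps are the ones from the question.

```python
import pandas as pd

keys = ['product_name', 'item_name']

# Shortened illustrative frames (assumed data)
df1 = pd.DataFrame({'product_name': ['laptop', 'printer'],
                    'item_name': ['hp', 'logitech'],
                    'price': [1200, 150]})
dv1 = pd.DataFrame({'product_name': ['laptop', 'printer'],
                    'item_name': ['hp', 'logitech'],
                    'price': [10, 20]})
df2 = pd.DataFrame({'product_name': ['laptop', 'printer'],
                    'item_name': ['hp', 'mac'],
                    'price': [2200, 200]})
dv2 = pd.DataFrame({'product_name': ['laptop', 'printer'],
                    'item_name': ['hp', 'mac'],
                    'price': [10, 20]})

def compare(current, previous):
    """Put the two frames side by side under the keys '2022' and '2021'."""
    z = pd.concat([current.set_index(keys), previous.set_index(keys)],
                  axis='columns', keys=['2022', '2021'])
    # Move the value name ('price') to the top level of the columns
    return z.swaplevel(axis='columns')[current.columns[2:]]

# x and y are unpacked from each pair, so every pair is processed
results = [compare(x, y) for x, y in zip([df1, df2], [dv1, dv2])]
for frame in results:
    print(frame)
```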

Complete a pandas data frame with values from other data frames

I have 3 data frames. I need to enrich the data from df with the data columns from df2 and df3 so that df ends up with the columns 'Code', 'Quantity', 'Payment', 'Date', 'Name', 'Size', 'Product','product_id', 'Sector'.
The codes that are in df and not in df2 OR df3, need to receive "unknown" for the string columns and "0" for the numeric dtype columns
import pandas as pd
data = {'Code': [356, 177, 395, 879, 952, 999],
'Quantity': [20, 21, 19, 18, 15, 10],
'Payment': [173.78, 253.79, 158.99, 400, 500, 500],
'Date': ['2022-06-01', '2022-09-01','2022-08-01','2022-07-03', '2022-06-09', '2022-06-09']
}
df = pd.DataFrame(data)
df['Date']= pd.to_datetime(df['Date'])
data2 = {'Code': [356, 177, 395, 893, 697, 689, 687],
'Name': ['John', 'Mary', 'Ann', 'Mike', 'Bill', 'Joana', 'Linda'],
'Product': ['RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'POU'],
'product_id': [189, 188, 16, 36, 59, 75, 55],
'Size': [1, 1, 3, 4, 5, 4, 7],
}
df2 = pd.DataFrame(data2)
data3 = {'Code': [879, 356, 389, 395, 893, 697, 689, 978],
'Name': ['Mark', 'John', 'Marry', 'Ann', 'Mike', 'Bill', 'Joana', 'James'],
'Product': ['TTT', 'RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'DTS'],
'product_id': [988, 189, 188, 16, 36, 59, 75, 66],
'Sector': ['rt' , 'dx', 'sx', 'da', 'sa','sd','ld', 'pc'],
}
df3 = pd.DataFrame(data3)
I was using the following code to obtain the unknown codes by comparing with df2, but now I also have to compare with df3 and add the data from the columns ['Name', 'Size', 'Product', 'product_id', 'Sector'].
common = df2.merge(df,on=['Code'])
new_codes = df[(~df['Code'].isin(common['Code']))]
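One possible approach, offered only as a sketch: build a single lookup table from df2 and df3 with `combine_first`, join it onto df by 'Code', and fill whatever is still missing. The shortened frames and the names `lookup`, `text_cols`, and `num_cols` are assumptions. Note that with this approach a code found only in df3 also gets Size 0, because df3 has no 'Size' column.

```python
import pandas as pd

# Shortened illustrative frames (assumed data)
df = pd.DataFrame({'Code': [356, 879, 999], 'Quantity': [20, 18, 10]})
df2 = pd.DataFrame({'Code': [356, 893],
                    'Name': ['John', 'Mike'],
                    'Product': ['RRR', 'TRA'],
                    'product_id': [189, 36],
                    'Size': [1, 4]})
df3 = pd.DataFrame({'Code': [879, 356],
                    'Name': ['Mark', 'John'],
                    'Product': ['TTT', 'RRR'],
                    'product_id': [988, 189],
                    'Sector': ['rt', 'dx']})

# Values from df2 take priority; gaps are filled from df3
lookup = df2.set_index('Code').combine_first(df3.set_index('Code'))

# Left join keeps every row of df, matched on the 'Code' column
out = df.join(lookup, on='Code')

# Fill the gaps: 'unknown' for text columns, 0 for numeric columns
text_cols = ['Name', 'Product', 'Sector']
num_cols = ['product_id', 'Size']
out[text_cols] = out[text_cols].fillna('unknown')
out[num_cols] = out[num_cols].fillna(0).astype(int)
print(out)
```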

Trying to print with specific format from DataFrame

I'm new to Python and trying to print values from a DataFrame in a specific format.
customers = {'NAME': ['Breadpot', 'Hoviz', 'Hovis', 'Grenns', 'Magnolia', 'Dozen', 'Sun'],
'CITY': ['Sydney', 'Manchester', 'London', 'London', 'Chicago', 'San Francisco', 'San Francisco'],
'COUNTRY': ['Australia', 'UK', 'UK', 'UK', 'USA', 'USA', 'USA'],
'CPERSON': ['Sam.Keng#info.com', 'harry.ham#hoviz.com', 'hamlet.host#hoviz.com', 'grenns#grenns.com', 'man#info.com', 'dozen#dozen.com', 'sunny#sun.com'],
'EMPLCNT': [250, 150, 1500, 200, 1024, 1000, 2000],
'CONTRCNT': [48, 7, 12800, 12800, 25600, 5, 2],
'CONTRCOST': [1024.00, 900.00, 10510.50, 128.30, 512000.00, 1000.20, 10000.01]
}
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'EMPLCNT', 'CONTRCOST'])
new_df = df.loc[df['CONTRCNT'].idxmax()]
print('City with the largest number of signed contracts:')
print(new_df['CITY'],'(', new_df['CONTRCNT'], 'contracts)')
I'm trying to get the code to print "City with the largest number of signed contracts:" followed by the city and the number of contracts in parentheses,
but instead I keep getting this:
City with the largest number of signed contracts:
4 Chicago
4 Chicago
Name: CITY, dtype: object ( CONTRCNT CONTRCNT
4 25600 25600
4 25600 25600 contracts)
This should work:
customers = {'NAME': ['Breadpot', 'Hoviz', 'Hovis', 'Grenns', 'Magnolia', 'Dozen', 'Sun'],
'CITY': ['Sydney', 'Manchester', 'London', 'London', 'Chicago', 'San Francisco', 'San Francisco'],
'COUNTRY': ['Australia', 'UK', 'UK', 'UK', 'USA', 'USA', 'USA'],
'CPERSON': ['Sam.Keng#info.com', 'harry.ham#hoviz.com', 'hamlet.host#hoviz.com', 'grenns#grenns.com', 'man#info.com', 'dozen#dozen.com', 'sunny#sun.com'],
'EMPLCNT': [250, 150, 1500, 200, 1024, 1000, 2000],
'CONTRCNT': [48, 7, 12800, 12800, 25600, 5, 2],
'CONTRCOST': [1024.00, 900.00, 10510.50, 128.30, 512000.00, 1000.20, 10000.01]
}
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'CONTRCOST'])
# Sum contracts per city, then sort so the largest total comes first
new_df = df.groupby('CITY')['CONTRCNT'].sum().sort_values(ascending=False)
print('City with the largest number of signed contracts:')
print(new_df.index[0], '(', new_df.iloc[0], 'contracts)')
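If the goal is the single row with the most contracts rather than a per-city total, a minimal sketch is to select that row with `idxmax` and format it with an f-string. The shortened frame and the names `row` and `message` are assumptions for illustration.

```python
import pandas as pd

# Shortened illustrative frame with unique column names (assumed data)
df = pd.DataFrame({
    'CITY': ['Sydney', 'London', 'Chicago'],
    'CONTRCNT': [48, 12800, 25600],
})

# idxmax returns the index label of the first maximum; .loc fetches that row
row = df.loc[df['CONTRCNT'].idxmax()]

# Selecting by label from a single row yields scalars, so they format cleanly
message = f"{row['CITY']} ({row['CONTRCNT']} contracts)"
print('City with the largest number of signed contracts:')
print(message)
```

Duplicate column names in the frame make `row['CONTRCNT']` return several values instead of one scalar, which is one way to end up with output like that in the question.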

Adding 1 values together from different rows where value in column matches

I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
The second dataset has a list of teams that are playing today.
What I would like to do is add to dataset2 the mean of the values for the two teams that are playing each other.
I have looked through Stack Overflow and have not found anything that helps. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
}
with the desired output being
Desired = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
}
Another option, instead of iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so I have included that as well.
Sample:
data1 = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
})
Desired = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
})
Code:
import pandas as pd
cols = [ x for x in data2 if 'Home' in x or 'Away' in x ]
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0
Thanks for adding the data. Here is a simple way using loops. Loop through the df2 (upcoming matches). Find the equivalent rows of the participating teams from df1 (team statistics). Now you will have 2 rows from df1. Average the desired columns and add it to df2.
Considering similar structure as your data set, here is an example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
'home0-0': [86, 78, 65, 67, 100],
'home1-0': [76, 86, 67, 100, 0],
'home0-1': [91, 88, 75, 100, 67],
'home1-1': [75, 67, 67, 100, 100],
'away0-0': [57, 86, 71, 91, 50],
'away1-0': [73, 50, 71, 100, 100],
'away0-1': [78, 62, 40, 80, 0],
'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team']==row['team1']]
    team2 = df1[df1['team']==row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col]=(np.mean([team1[col].values[0], team2[col].values[0]]))
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team']==row['TXTECHIPA1']]
    team2 = data2[data2['Team']==row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col]=(np.mean([int(team1[col].values[0]), int(team2[col].values[0])]))
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5
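The averaging can also be done without merges or row iteration. As a sketch under assumptions, the team statistics are indexed by team name and looked up once per side of the match. The names `matches`, `stats`, `lookup`, and `means` are hypothetical; the values are the sample data from the question.

```python
import pandas as pd

matches = pd.DataFrame({'TXTECHIPA1': ['Everton', 'Man City'],
                        'TXTECHIPA2': ['Hull', 'Leeds']})
stats = pd.DataFrame({'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
                      'Home0-0': ['80', '78', '80', '66'],
                      'Home1-0': ['81', '100', '90', '70'],
                      'Away0-1': ['88', '42', '75', '69']})

# Index by team and convert the string values to integers
lookup = stats.set_index('Team').astype(int)

# One row of statistics per match for each side, in match order
# (raises KeyError if a team is missing from the statistics)
side1 = lookup.loc[matches['TXTECHIPA1']].to_numpy()
side2 = lookup.loc[matches['TXTECHIPA2']].to_numpy()

# Element-wise mean of the two sides, attached back to the matches
means = pd.DataFrame((side1 + side2) / 2,
                     columns=lookup.columns, index=matches.index)
result = matches.join(means)
print(result)
```

Call `.round(0)` on `means` if whole numbers are wanted, as in the merge-based answer.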

How to manage a python dictionary with three items and add the last one?

I have a dictionary in python like this:
x = {country:{city:population}:......}
and I want to create a new dictionary like y = {country:cities_population}, where cities_population adds all the population in every city in every country and I really don't know how to do it.
I tried this:
for country in x:
    for city, population in x[country].iteritems():
        if not country in y:
            y[country] = {}
        y[country] += population
I check for a dictionary with only one key and one value but I don't understand how to manage a three items dictionary... help me please!!!! :)
Well, how about:
y = {}
# .items() works on both Python 2 and 3; .iteritems() exists only on Python 2
for country, cities in x.items():
    y[country] = sum(cities.values())
I'm going to assume you want to sum up more than just the city populations for each country:
>>> attributes = ['population', 'gdp', 'murders']
>>> x = {'usa': {'chicago': dict(zip(attributes, [10, 100, 1000])), 'nyc':dict(zip(attributes, [20, 200, 2000]))}, 'china': {'shanghai': dict(zip(attributes, [9, 90, 900])), 'nagasaki': dict(zip(attributes, [2, 20, 200]))}}
>>> x
{'china': {'shanghai': {'gdp': 90, 'murders': 900, 'population': 9}, 'nagasaki': {'gdp': 20, 'murders': 200, 'population': 2}}, 'usa': {'nyc': {'gdp': 200, 'murders': 2000, 'population': 20}, 'chicago': {'gdp': 100, 'murders': 1000, 'population': 10}}}
>>> y = {}
>>> for country, cities in x.items():
...     y[country] = {attr: 0 for attr in attributes}
...     for city, city_attrs in cities.items():
...         for attribute, value in city_attrs.items():
...             y[country][attribute] += value
...
>>> y
{'china': {'gdp': 110, 'murders': 1100, 'population': 11}, 'usa': {'gdp': 300, 'murders': 3000, 'population': 30}}
How about something like:
y = {}
for country in x:
    total_population = 0
    for city, population in x[country].items():
        total_population += population
    y[country] = total_population
What you want is a new dictionary whose key is the original key and whose value is the sum of the values in the current value-dictionary.
y = dict((c[0], sum(c[1].values())) for c in x.items())
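On Python 3 the same idea reads naturally as a dict comprehension. A minimal sketch, with made-up sample populations standing in for the question's elided data:

```python
# Nested mapping: country -> {city: population} (assumed sample values)
x = {
    'usa': {'chicago': 10, 'nyc': 20},
    'china': {'shanghai': 9, 'beijing': 2},
}

# For each country, sum the populations of all its cities
y = {country: sum(cities.values()) for country, cities in x.items()}
print(y)
```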
