Below is the script I am working with. For practice, I've created two sets of dataframes: df1, df2, and df3, and dv1, dv2, and dv3. I then created two lists, test and test2, and combined them as zipped_list. Now I am trying to write a loop that does the following: 1. Set the index and create the keys 2022 and 2021. 2. Swap the column levels so the columns are next to each other. The loop works, but it is only applied to the first pair of dataframes. Without calling each dataframe one by one, how can I apply it to all the dataframes in zipped_list?
import pandas as pd
#Creating a set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [1200, 150, 300, 450, 200]}
df1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [2200, 200, 300, 450, 200]}
df2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1500, 100, 200, 350, 400]}
df3 = pd.DataFrame(data3)
#Creating another set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [10, 20, 30, 40, 50]}
dv1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [10, 20, 30, 50, 50]}
dv2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1, 2, 3, 4, 5]}
dv3 = pd.DataFrame(data3)
#creating a list for dataframe
test=[df1,df2,df3]
test2=[dv1,dv2,dv3]
#combining two lists
zipped = zip(test, test2)
zipped_list = list(zipped)
#Looping through the zipped_list
for x,y in zipped_list:
    z=pd.concat([zipped_list[0][0].set_index(['product_name','item_name']), zipped_list[0][1].set_index(['product_name','item_name'])],
                axis='columns', keys=['2022', '2021'])
    z=z.swaplevel(axis='columns')[zipped_list[0][0].columns[2:]]
    print(z)
In addition to this dataframe, there should be two more.
The reason is that you always access the first element of zipped_list (zipped_list[0]) and never use the loop variables (x and y). You can create a new list and append each modified dataframe to it:
new_list = []
for x in zipped_list:
    z=pd.concat([x[0].set_index(['product_name','item_name']), x[1].set_index(['product_name','item_name'])],
                axis='columns', keys=['2022', '2021'])
    z=z.swaplevel(axis='columns')[x[0].columns[2:]]
    new_list.append(z)
new_list
Output:
[                        price
                         2022  2021
 product_name item_name
 laptop       hp         1200    10
 printer      logitech    150    20
 tablet       samsung     300    30
 desk         lg          450    40
 chair        lenovo      200    50,
                         price
                         2022  2021
 product_name item_name
 laptop       hp         2200    10
 printer      mac         200    20
 tablet       fujitsu     300    30
 desk         lg          450    50
 chair        asus        200    50,
                         price
                         2022  2021
 product_name item_name
 laptop       microsoft  1500     1
 printer      logitech    100     2
 tablet       samsung     200     3
 desk         lg          350     4
 chair        asus        400     5]
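The same loop can also be wrapped in a small helper and run over every pair with a list comprehension. This is only a sketch: the function name combine_years, its parameters, and the shortened sample frames are made up for illustration, and it assumes each pair shares the same index columns.

```python
import pandas as pd

def combine_years(current, previous, keys=("2022", "2021"),
                  index_cols=("product_name", "item_name")):
    """Put two frames side by side under year keys (hypothetical helper)."""
    idx = list(index_cols)
    combined = pd.concat(
        [current.set_index(idx), previous.set_index(idx)],
        axis="columns", keys=list(keys),
    )
    # Every non-index column becomes a top-level column after the swap
    value_cols = [c for c in current.columns if c not in idx]
    return combined.swaplevel(axis="columns")[value_cols]

# Shortened sample data (two rows per frame instead of five)
df_a = pd.DataFrame({"product_name": ["laptop", "desk"],
                     "item_name": ["hp", "lg"], "price": [1200, 450]})
dv_a = pd.DataFrame({"product_name": ["laptop", "desk"],
                     "item_name": ["hp", "lg"], "price": [10, 40]})
df_b = pd.DataFrame({"product_name": ["laptop", "desk"],
                     "item_name": ["mac", "lg"], "price": [2200, 450]})
dv_b = pd.DataFrame({"product_name": ["laptop", "desk"],
                     "item_name": ["mac", "lg"], "price": [20, 50]})

pairs = list(zip([df_a, df_b], [dv_a, dv_b]))
# Unpack each pair into the loop variables and actually use them
results = [combine_years(cur, prev) for cur, prev in pairs]
for frame in results:
    print(frame)
```

Unpacking each tuple into cur and prev makes it harder to fall back on zipped_list[0] by accident.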
I am trying to update df2 with columns and data in ref_df1 such that my output data frame has all columns ['Code', 'Place', 'Product', 'Name', 'Value'] and has pulled data from the reference data frame using Code column values as key. I am not sure how to get to the output.
import pandas as pd
data1 = {
'Code': [1, 2, 3, 4, 5, 6],
'Name': ['Company1', 'Company2', 'Company3', 'Company4', 'Company5', 'Company6'],
'Value': [200, 300, 400, 500, 600, 700],
}
ref_df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])
data2 = {
'Code': [1, 2, 1, 3, 4, 1, 6],
'Place': ['A', 'B', 'E', 'G', 'I', 'K', 'L'],
'Product': ['P11', 'P22', 'P12', 'P33', 'P44', 'P13', 'P61'],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Place', 'Product'])
Output:
You can merge both the data frames.
df2.merge(ref_df1)
#output:
Code Place Product Name Value
0 1 A P11 Company1 200
1 1 E P12 Company1 200
2 1 K P13 Company1 200
3 2 B P22 Company2 300
4 3 G P33 Company3 400
5 4 I P44 Company4 500
6 6 L P61 Company6 700
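If you prefer to spell out the join, a sketch like the following names the key column and keeps every row of df2 in its original order. The how='left' and validate='many_to_one' arguments are my additions, not part of the answer above; validate makes pandas raise an error if a Code appears more than once in the reference frame.

```python
import pandas as pd

ref_df1 = pd.DataFrame({
    "Code": [1, 2, 3, 4, 5, 6],
    "Name": ["Company1", "Company2", "Company3",
             "Company4", "Company5", "Company6"],
    "Value": [200, 300, 400, 500, 600, 700],
})
df2 = pd.DataFrame({
    "Code": [1, 2, 1, 3, 4, 1, 6],
    "Place": ["A", "B", "E", "G", "I", "K", "L"],
    "Product": ["P11", "P22", "P12", "P33", "P44", "P13", "P61"],
})

# Left join on Code: each df2 row picks up Name and Value from ref_df1
out = df2.merge(ref_df1, on="Code", how="left", validate="many_to_one")
print(out)
```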
I am new to Python and trying to print a value from a data frame.
customers = {'NAME': ['Breadpot', 'Hoviz', 'Hovis', 'Grenns', 'Magnolia', 'Dozen', 'Sun'],
'CITY': ['Sydney', 'Manchester', 'London', 'London', 'Chicago', 'San Francisco', 'San Francisco'],
'COUNTRY': ['Australia', 'UK', 'UK', 'UK', 'USA', 'USA', 'USA'],
'CPERSON': ['Sam.Keng#info.com', 'harry.ham#hoviz.com', 'hamlet.host#hoviz.com', 'grenns#grenns.com', 'man#info.com', 'dozen#dozen.com', 'sunny#sun.com'],
'EMPLCNT': [250, 150, 1500, 200, 1024, 1000, 2000],
'CONTRCNT': [48, 7, 12800, 12800, 25600, 5, 2],
'CONTRCOST': [1024.00, 900.00, 10510.50, 128.30, 512000.00, 1000.20, 10000.01]
}
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'EMPLCNT', 'CONTRCOST'])
new_df = df.loc[df['CONTRCNT'].idxmax()]
print('City with the largest number of signed contracts:')
print(new_df['CITY'],'(', new_df['CONTRCNT'], 'contracts)')
I am trying to get the code to print "City with the largest number of signed contracts:" followed by the city and the number of contracts in parentheses, but instead I keep getting this:
City with the largest number of signed contracts:
4 Chicago
4 Chicago
Name: CITY, dtype: object ( CONTRCNT CONTRCNT
4 25600 25600
4 25600 25600 contracts)
This should work:
customers = {'NAME': ['Breadpot', 'Hoviz', 'Hovis', 'Grenns', 'Magnolia', 'Dozen', 'Sun'],
'CITY': ['Sydney', 'Manchester', 'London', 'London', 'Chicago', 'San Francisco', 'San Francisco'],
'COUNTRY': ['Australia', 'UK', 'UK', 'UK', 'USA', 'USA', 'USA'],
'CPERSON': ['Sam.Keng#info.com', 'harry.ham#hoviz.com', 'hamlet.host#hoviz.com', 'grenns#grenns.com', 'man#info.com', 'dozen#dozen.com', 'sunny#sun.com'],
'EMPLCNT': [250, 150, 1500, 200, 1024, 1000, 2000],
'CONTRCNT': [48, 7, 12800, 12800, 25600, 5, 2],
'CONTRCOST': [1024.00, 900.00, 10510.50, 128.30, 512000.00, 1000.20, 10000.01]
}
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'CONTRCOST'])
# Sum only the contract counts per city, then sort so the largest total comes first
new_df = df.groupby('CITY')['CONTRCNT'].sum().sort_values(ascending=False)
print('City with the largest number of signed contracts:')
# new_df is a Series indexed by city, so no positional column lookup is needed
print(new_df.index[0], '(', new_df.iloc[0], 'contracts)')
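If you want the single row with the most contracts rather than a total per city, idxmax can be used directly. This is a sketch that assumes the index and the column labels are unique, which the duplicated rows in the printed output suggest was not the case in the original frame. The variable names customers_df and top are made up, and only the columns needed for the lookup are kept.

```python
import pandas as pd

customers_df = pd.DataFrame({
    "NAME": ["Breadpot", "Hoviz", "Hovis", "Grenns",
             "Magnolia", "Dozen", "Sun"],
    "CITY": ["Sydney", "Manchester", "London", "London",
             "Chicago", "San Francisco", "San Francisco"],
    "CONTRCNT": [48, 7, 12800, 12800, 25600, 5, 2],
})

# idxmax returns the index label of the largest value;
# with unique labels, .loc then yields a single row (a Series)
top = customers_df.loc[customers_df["CONTRCNT"].idxmax()]
print("City with the largest number of signed contracts:")
print(f"{top['CITY']} ({top['CONTRCNT']} contracts)")
```

Note that this answers a slightly different question from the groupby version: it looks at individual customers, not at the sum per city.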
I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
The second dataset has a list of teams that are playing today.
What I would like to do is add to dataset2 the mean of the values of the two teams that are playing each other.
I have looked through Stack Overflow and have not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
}
with the desired output being
Desired = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
}
Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so I round the means as well.
Sample:
data1 = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
})
Desired = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
})
Code:
import pandas as pd
cols = [ x for x in data2 if 'Home' in x or 'Away' in x ]
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0
Thanks for adding the data. Here is a simple way using loops. Loop through the df2 (upcoming matches). Find the equivalent rows of the participating teams from df1 (team statistics). Now you will have 2 rows from df1. Average the desired columns and add it to df2.
Considering similar structure as your data set, here is an example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
'home0-0': [86, 78, 65, 67, 100],
'home1-0': [76, 86, 67, 100, 0],
'home0-1': [91, 88, 75, 100, 67],
'home1-1': [75, 67, 67, 100, 100],
'away0-0': [57, 86, 71, 91, 50],
'away1-0': [73, 50, 71, 100, 100],
'away0-1': [78, 62, 40, 80, 0],
'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team']==row['team1']]
    team2 = df1[df1['team']==row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col]=(np.mean([team1[col].values[0], team2[col].values[0]]))
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team']==row['TXTECHIPA1']]
    team2 = data2[data2['Team']==row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col]=(np.mean([int(team1[col].values[0]), int(team2[col].values[0])]))
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5
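The row loop can also be replaced by a label lookup on the statistics table. The sketch below is one way to do it, assuming every team in the fixture list appears exactly once in the statistics (a missing team raises a KeyError). The names fixtures, stats, and result are mine.

```python
import pandas as pd

fixtures = pd.DataFrame({
    "DATAMECI": ["17/06/2020", "17/06/2020"],
    "ORAMECI": ["11:30", "15:30"],
    "TXTECHIPA1": ["Everton", "Man City"],
    "TXTECHIPA2": ["Hull", "Leeds"],
})
stats = pd.DataFrame({
    "Team": ["Hull", "Leeds", "Everton", "Man City"],
    "Home0-0": ["80", "78", "80", "66"],
    "Home1-0": ["81", "100", "90", "70"],
    "Away0-1": ["88", "42", "75", "69"],
}).set_index("Team").astype(int)  # the sample values are strings

# Look up both teams of every fixture at once, then average element-wise
first = stats.loc[fixtures["TXTECHIPA1"]].to_numpy()
second = stats.loc[fixtures["TXTECHIPA2"]].to_numpy()
means = pd.DataFrame((first + second) / 2,
                     columns=stats.columns, index=fixtures.index)

result = pd.concat([fixtures, means], axis=1)
print(result)
```

The means are left unrounded here; call .round(0) on means if you want whole numbers as in your desired output.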
I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the target in another. I need to look up the target, bring it into the df with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations all with unique targets.
I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real life data is larger on all dimensions, and it takes up an inordinate amount of time to loop through everything.
I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location']==l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']
print(emp_df)
end = time.time()
print(end - start)
If the target dataframe is guaranteed to have unique locations, you can use a join to make this process really quick.
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
With the setup done, we can now use our join.
product_tgt_cols = [product+'_tgt' for product in product_list]
print(product_tgt_cols) #['Product1_tgt', 'Product2_tgt', 'Product3_tgt', 'Product4_tgt', 'Product5_tgt']
product_pct_cols = [product+'_pct' for product in product_list]
print(product_pct_cols) #['Product1_pct', 'Product2_pct', 'Product3_pct', 'Product4_pct', 'Product5_pct']
start = time.time()
#join on location to get _tgt columns
emp_df = emp_df.join(tgt_df.set_index('Location'), on='Location', rsuffix='_tgt')
#divide the entire product arrays using numpy, store in temp
temp = emp_df[product_list].values/emp_df[product_tgt_cols].values
#create a new temp df for the _pct results, and assign back to emp_df
emp_df = emp_df.assign(**pd.DataFrame(temp, columns = product_pct_cols))
print(emp_df)
end = time.time()
print("with join: ",end - start)
Your dataframes are in "wide format". I find "long format" easier to manipulate.
# turn emp_df into long
# indexed by "Employee", "Location", and "Product"
emp_df = (emp_df.set_index(['Employee', 'Location'])
.stack().to_frame())
emp_df.head()
0
Employee Location
Joe Boulder Product1 238
Product2 135
Product3 873
Product4 153
Product5 373
# turn tgt_df into a long series
# indexed by "Location" and "Product"
tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()
# set target for employees by locations:
emp_df['target'] = (emp_df.groupby('Employee')[0]
.apply(lambda x: tgt_df))
# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']
# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a dataframe with
# multi-level index and multi-level column
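The same long-format idea can be sketched with melt and merge, which avoids relying on index alignment. This is an alternative to the stack-based code above, not a copy of it. The data is cut down to two locations and two products, and the column names actual and target are made up for the example.

```python
import pandas as pd

tgt_df = pd.DataFrame({"Location": ["Denver", "Boulder"],
                       "Product1": [600, 200],
                       "Product2": [300, 100]})
emp_df = pd.DataFrame({"Employee": ["Joe", "Bernie", "Tim"],
                       "Location": ["Boulder", "Denver", "Boulder"],
                       "Product1": [100, 300, 50],
                       "Product2": [50, 150, 100]})

# One row per (employee, product) and one row per (location, product)
emp_long = emp_df.melt(id_vars=["Employee", "Location"],
                       var_name="Product", value_name="actual")
tgt_long = tgt_df.melt(id_vars="Location",
                       var_name="Product", value_name="target")

# Bring each target next to the matching actual value, then divide
merged = emp_long.merge(tgt_long, on=["Location", "Product"], how="left")
merged["pct"] = merged["actual"] / merged["target"]
print(merged)
```

To return to the wide layout, merged.pivot(index=['Employee', 'Location'], columns='Product') should work on reasonably recent pandas versions.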
I would like to create an empty longitudinal country-week-dataset in which every country is represented 52 times (weeks of the year) and all other variables are first filled with 0s. It should then look like this:
countries = ['Albania', 'Belgium', ... 'Zimbabwe']
df_weekly = [{'country': 'Albania', 'week': 1},
{'country': 'Albania', 'week': 2},
...
{'country': 'Albania', 'week': 52},
...
{'country': 'Zimbabwe', 'week': 52}]
My question is therefore: how do I get from a list of countries to such a longitudinal country-week dataset?
Turned out to be quite simple:
import pandas as pd
country_list = ['Albania', 'Belgium', 'China', 'Denmark']
country = country_list * 52 # multiply by the number of weeks in the year
country.sort()
# weeks 1..52, repeated once per country so both lists have the same length
week = list(range(1, 53)) * len(country_list)
weekly = pd.DataFrame(
{'country': country,
'week': week
})
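Another way to build the same grid, including the zero-filled variables the question asks for, is pd.MultiIndex.from_product. This is a minimal sketch; the variable names in value_cols are placeholders I made up, since the question does not name the other variables.

```python
import pandas as pd

country_list = ["Albania", "Belgium", "China", "Denmark"]
value_cols = ["var_a", "var_b"]  # hypothetical placeholder variables

# Every (country, week) combination, with weeks running from 1 to 52
index = pd.MultiIndex.from_product(
    [country_list, range(1, 53)], names=["country", "week"]
)

# A scalar 0 fills every cell; reset_index turns the index into columns
weekly = pd.DataFrame(0, index=index, columns=value_cols).reset_index()
print(weekly.head())
```

This avoids keeping the two lists in sync by hand when the number of countries changes.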