I have a set of restaurant sales data whose structure comprises two dimensions, a location dimension and a food-type dimension, plus a fact table that contains some measures. I am having trouble manipulating the table to perform analysis. In the end, it will likely be displayed in Excel.
Here is a toy dataset:
import pandas as pd

tuples = [('California', 'SF'), ('California', 'SD'),
          ('New York', 'Roch'), ('New York', 'NY'),
          ('Texas', 'Houst'), ('Texas', 'SA')]
measure1 = [5, 10, 30, 60, 10, 30]
measure2 = [50, 10, 30, 6, 1, 30]
tier1 = ['Burger', 'Burger', 'Burger', 'Pizza', 'Pizza', 'Burger']
tier2 = ['Beef', 'Chicken', 'Beef', 'Pep', 'Cheese', 'Beef']

index = pd.MultiIndex.from_tuples(tuples, names=['State', 'City'])
revenue = pd.Series(measure1, index=index)
rev_df = pd.DataFrame({'Category': tier1,
                       'Subcategory': tier2,
                       'Revenue': revenue,
                       'NumOfOrders': [3, 5, 1, 3, 10, 20]})
rev_df
This code produces the following dataframe:

                 Category Subcategory  Revenue  NumOfOrders
State      City
California SF      Burger        Beef        5            3
           SD      Burger     Chicken       10            5
New York   Roch    Burger        Beef       30            1
           NY       Pizza         Pep       60            3
Texas      Houst    Pizza      Cheese       10           10
           SA      Burger        Beef       30           20
I want to do two things:
Place Category and Subcategory as MultiIndex column headers, and calculate NumOfOrders-weighted revenue by food subcategory and category, with subtotals and grand totals.
Place the cities dimension on the y-axis and move the Category/Subcategory-by-measure combination to the x-axis.
For example:
(1)

                   Burger           Total Burger   Pizza ...
                   Beef    Chicken
California  SF     4       5        9
            SD     5       5        10
Total California   9       10       19

(2)

                          California       California Total
                          SF        SD
Total Burger     WgtRev   9         10     19
  Beef           WgtRev   4         5      9
  Chicken        WgtRev   5         5      10
Total Pizza ...
To start, my first attempt was to use a pivot_table
pivoted = rev_df.pivot_table(index=['State', 'City'],
                             columns=['Category', 'Subcategory'],
                             aggfunc='sum',     # How to make it a weighted average?
                             margins=True,      # How to subtotal and grand-total?
                             margins_name='All',
                             fill_value=0)
KeyError: "['State' 'City'] not in index"
As you can see, I get an error. What is the most Pythonic way to manipulate this snowflake-esque data model?
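A sketch of one way past both hurdles, assuming "weighted revenue" means sum(Revenue * NumOfOrders) / sum(NumOfOrders): the KeyError appears because State and City live in the row index of rev_df rather than in its columns, so call reset_index() first; and since pivot_table has no weighted-average aggfunc, precompute the weight product (the WgtRev helper name is made up here), aggregate both parts, and divide. Note that margins=True only yields the grand-total 'All' row/column; per-state subtotal rows like "Total California" still need a separate groupby pass concatenated back in.

flat = rev_df.reset_index()                              # State/City become ordinary columns
flat['WgtRev'] = flat['Revenue'] * flat['NumOfOrders']   # helper: revenue times its weight

parts = flat.pivot_table(index=['State', 'City'],
                         columns=['Category', 'Subcategory'],
                         values=['WgtRev', 'NumOfOrders'],
                         aggfunc='sum',
                         margins=True,                   # adds the 'All' grand totals
                         margins_name='All')

weighted = parts['WgtRev'] / parts['NumOfOrders']        # weighted revenue per cell
layout2 = weighted.T                                     # layout (2): swap the two axes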
I have 2 DataFrames from 2 different csv files, and both files have, let's say, 5 columns each. I need to look up 1 column from the second DataFrame into the first DataFrame, so that the first DataFrame will have 6 columns, doing the lookup by ID.
An example is below:
import pandas as pd

data = [[6661, 'Lily', 21, 5000, 'USA'],
        [6672, 'Mark', 32, 32500, 'Canada'],
        [6631, 'Rose', 20, 1500, 'London'],
        [6600, 'Jenifer', 42, 50000, 'London'],
        [6643, 'Sue', 27, 8000, 'Turkey']]
ds_main = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'Income', 'Country'])

data2 = [[6672, 'Mark', 'Shirt', 8.5, 2],
         [6643, 'Sue', 'Scraft', 2.0, 5],
         [6661, 'Lily', 'Blouse', 11.9, 2],
         [6600, 'Jenifer', 'Shirt', 9.8, 1],
         [6631, 'Rose', 'Pants', 4.5, 2]]
ds_rate = pd.DataFrame(data2, columns=['ID', 'Name', 'Product', 'Rate', 'Quantity'])
I want to look up the 'Rate' from ds_rate into ds_main, with the rate placed in the middle of the ds_main DataFrame.
The result should be as below:

     ID     Name  Age  Rate  Income Country
0  6661     Lily   21  11.9    5000     USA
1  6672     Mark   32   8.5   32500  Canada
2  6631     Rose   20   4.5    1500  London
3  6600  Jenifer   42   9.8   50000  London
4  6643      Sue   27   2.0    8000  Turkey

I have tried using merge and insert but am still unable to get the result I want. Is there an easy way to do it?
You could use set_index + loc to get "Rate" sorted according to its "ID" in ds_main; then insert:
ds_main.insert(3, 'Rate', ds_rate.set_index('ID')['Rate'].loc[ds_main['ID']].reset_index(drop=True))
Output:
ID Name Age Rate Income Country
0 6661 Lily 21 11.9 5000 USA
1 6672 Mark 32 8.5 32500 Canada
2 6631 Rose 20 4.5 1500 London
3 6600 Jenifer 42 9.8 50000 London
4 6643 Sue 27 2.0 8000 Turkey
Assuming 'ID' is unique
ds_main.iloc[:, :3].merge(ds_rate[['ID', 'Rate']]).join(ds_main.iloc[:, 3:])
ID Name Age Rate Income Country
0 6661 Lily 21 11.9 5000 USA
1 6672 Mark 32 8.5 32500 Canada
2 6631 Rose 20 4.5 1500 London
3 6600 Jenifer 42 9.8 50000 London
4 6643 Sue 27 2.0 8000 Turkey
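For completeness, Series.map is an equivalent sketch of the same idea, starting from the original ds_main: map each ID in ds_main to its Rate, then insert at position 3. Like the loc version, this assumes 'ID' is unique in ds_rate.

# Map ds_main['ID'] through a Series indexed by ID, then insert the result
ds_main.insert(3, 'Rate', ds_main['ID'].map(ds_rate.set_index('ID')['Rate']))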
I am working on a dataframe that displays information on property rentals in Brazil. This is a sample of the dataset:
import pandas as pd

data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
print(df)
This is how the sample looks:
city area(m2) Rooms Bathrooms animal rent($)
0 São Paulo 90 3 2 accept 2000
1 Rio 120 2 3 do not accept 3000
2 Recife 60 4 3 accept 800
I want to filter the dataset in order to select only the apartments that have at most 2 rooms and 2 bathrooms.
Do you know how I can do this?
Try with
out = df.loc[(df.Rooms <= 2) & (df.Bathrooms <= 2)]
You can use the query() method:
out = df.query('Bathrooms <= 2 and Rooms <= 2')
You can filter the values on the dataframe
import pandas as pd
data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
df_filtered = df[(df['Rooms'] <= 2) & (df['Bathrooms'] <= 2)]
print(df_filtered)
Returns

Empty DataFrame
Columns: [city, area(m2), Rooms, Bathrooms, animal, rent($)]
Index: []

(With this particular sample no row satisfies both conditions: São Paulo and Recife have more than 2 rooms, and Rio has 3 bathrooms.)
Given three data frames containing the number of gold, silver, and bronze Olympic medals won by some countries, determine the total number of medals won by each country.
Note: the three data frames don't all have the same countries. Also, sort the final dataframe according to the total medal count in descending order.
This is my code below, but I am not getting the desired output. Can someone please suggest what is wrong?
import numpy as np
import pandas as pd

# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})

#gold.set_index('Country', inplace=True)
#silver.set_index('Country', inplace=True)
#bronze.set_index('Country', inplace=True)

Total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
Total.sort_values('Medals', ascending=True)
You can try:
(pd.concat([gold, silver, bronze])
   .groupby('Country').sum()
   .sort_values('Medals', ascending=False)
   .reset_index())

This concatenates the three dataframes into one, groups by country and sums the medals for each, then sorts the result in descending order and resets the index.
Output:
Country Medals
0 USA 72
1 France 53
2 UK 27
3 Russia 25
4 Germany 20
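As for why the original attempt fails: the commented-out set_index calls are exactly what is missing. Without 'Country' as the index, DataFrame.add aligns rows purely by position, so the medals of unrelated countries are summed and the object-dtype Country strings are concatenated, producing something like

            Country  Medals
0      USAUSAFrance      84
1  FranceGermanyUSA      61
2    RussiaRussiaUK      52

On top of that, sort_values('Medals', ascending=True) sorts the wrong way for this task, and its result is never assigned back.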
You can also do it this way:
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
#print(gold)
#print(silver)
#print(bronze)
Total = gold.add(silver, fill_value=0).add(bronze, fill_value=0).sort_values('Medals', ascending=False)
Output:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
import numpy as np
import pandas as pd

# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})

# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)

# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)

# Sort the resultant dataframe in a descending order
total = total.sort_values(by='Medals', ascending=False).astype("int64")

# Print the sorted dataframe
print(total)
astype("int64") converts the float values (introduced by the fill_value alignment) back into integers; the 64 indicates a 64-bit integer type.
import numpy as np
import pandas as pd

# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})

gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)

total = gold.add(silver, fill_value=0).add(bronze, fill_value=0)
total = total.sort_values(by='Medals', ascending=False)
total = total.astype(int)
print(total)
I would like to filter a DataFrame. The resulting dataframe should contain all the rows where any of a number of columns contains any of the words in a list.
I started out with for loops, but there should be a better Pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd
# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
                   'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
                   'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
                   'Number': [3, 4, 7, 11, 5],
                   'Age': [33, 25, 34, 35, 28],
                   'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
                   'Weight': [89, 79, 113, 78, 84],
                   'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
                   'Salary': [99999, 99994, 89999, 78889, 87779]},
                  index=['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter individual columns that contain a particular word,
and it is also clear how to filter rows per column that contain any of the strings in a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do

df[df['Name'].str.contains('|'.join(search_values))
   | df['Position'].str.contains('|'.join(search_values))
   | df['Team'].str.contains('|'.join(search_values))]

but if I had 20 columns, that would be a mess of a line of code.
Any suggestions?
EDIT Bonus:
When looking in a list of columns, i.e. 'Name', 'Position', 'Team', how can I also include the index? Passing ['index', 'Name', 'Position', 'Team'] does not work.
Thanks.
I had a look at these:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack with any on level=0:
cols_list = ['Name','Team'] #add column names
df[df[cols_list].stack().str.contains('|'.join(search_values),case=False,na=False)
.any(level=0)]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
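Note that Series.any(level=0) was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent is to group the stacked boolean mask by the first index level (a sketch of the same answer):

mask = (df[cols_list].stack()
          .str.contains('|'.join(search_values), case=False, na=False)
          .groupby(level=0).any())   # one boolean per original row label
df[mask]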
Use apply with any:

df[df[[c1, c2, ...]].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)]
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
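As for the bonus: the index is not a column, so it cannot be listed next to 'Name', 'Position', 'Team'. A sketch of one workaround is to materialize the index as a column first (the idx name is made up here):

tmp = df.reset_index().rename(columns={'index': 'idx'})  # index becomes a real column
cols = ['idx', 'Name', 'Position', 'Team']
mask = tmp[cols].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask.to_numpy()]  # mask has a fresh RangeIndex, so align positionally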
I'd like to randomly generate more than 100 rows and keep the link between the observations.
Below is my example:
There are 4 variables: Country, Category, Product and Price.
Category and Product need to be linked together.
import random as rd
import pandas as pd

Country = []
Category = []
Product = []
Price = []
for i in range(1000):
    Country.append(rd.choice(['England', 'Germany', 'France', 'USA', 'China', 'Japan']))
    Category.append(rd.choice(['Electronics', 'home appliances', 'Computer', 'Food', 'Bedding']))
    Product.append(rd.choice(['Iphone 6S', 'Samsung Fridge', 'PC ASUS', 'Cheese', 'Bed']))
    Price.append(rd.randint(10, 10000))

data = pd.DataFrame(data={'Country': Country, 'Category': Category,
                          'Product': Product, 'Price': Price})
When I execute the code above, the Category observations don't line up with their corresponding Product observations. For example, you could get a row with Electronics (Category) and Cheese (Product), which obviously makes no sense.
Any ideas would be appreciated.
Thank you in advance.
You can create the new column with Series.map and a dictionary zipped from the two lists, after generating the DataFrame without the Product column.
Also, appending to lists is not necessary; it is faster to use numpy.random.choice and numpy.random.randint:
import numpy as np

N = 10000
L0 = ['England', 'Germany', 'France', 'USA', 'China', 'Japan']
L1 = ['Electronics', 'home appliances', 'Computer', 'Food', 'Bedding']
L2 = ['Iphone 6S', 'Samsung Fridge', 'PC ASUS', 'Cheese', 'Bed']

d = dict(zip(L1, L2))
print(d)
{'Electronics': 'Iphone 6S', 'home appliances': 'Samsung Fridge',
 'Computer': 'PC ASUS', 'Food': 'Cheese', 'Bedding': 'Bed'}

data = pd.DataFrame(data={'Country': np.random.choice(L0, size=N),
                          'Category': np.random.choice(L1, size=N),
                          'Price': np.random.randint(10, size=N)})
data['Product'] = data['Category'].map(d)
print(data)
print (data)
Country Category Price Product
0 Germany Food 1 Cheese
1 England Food 6 Cheese
2 Japan Bedding 3 Bed
3 France Electronics 1 Iphone 6S
4 Japan home appliances 8 Samsung Fridge
...       ...              ...    ...             ...
9995 England Electronics 3 Iphone 6S
9996 China Electronics 1 Iphone 6S
9997 Germany Bedding 0 Bed
9998 USA Electronics 3 Iphone 6S
9999 Germany home appliances 6 Samsung Fridge
[10000 rows x 4 columns]
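If a category should map to several valid products rather than exactly one, the same idea extends to a dict of lists plus a per-row choice (a sketch; the extra product names are invented for illustration):

d_multi = {'Electronics': ['Iphone 6S', 'Samsung TV'],
           'home appliances': ['Samsung Fridge', 'Washer'],
           'Computer': ['PC ASUS', 'MacBook'],
           'Food': ['Cheese', 'Bread'],
           'Bedding': ['Bed', 'Pillow']}
# Pick one valid product per row, so Category and Product stay linked
data['Product'] = [np.random.choice(d_multi[c]) for c in data['Category']]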