Plot multiple attributes from rows/columns in Pandas - python

See my data:
df = pd.DataFrame({'house_number':['House 1']*6+['House 2']*6
,'room_type':['Master Bedroom', 'Bedroom 1', 'Bedroom 2', 'Kitchen',
'Bathroom 1', 'Bathroom 2']*2
,'square_feet':[250,180,150,200,25,30,300,170,175,210,30,20]})
house_number room_type square_feet
0 House 1 Master Bedroom 250
1 House 1 Bedroom 1 180
2 House 1 Bedroom 2 150
3 House 1 Kitchen 200
4 House 1 Bathroom 1 25
5 House 1 Bathroom 2 30
6 House 2 Master Bedroom 300
7 House 2 Bedroom 1 170
8 House 2 Bedroom 2 175
9 House 2 Kitchen 210
10 House 2 Bathroom 1 30
11 House 2 Bathroom 2 20
I'm very new to programming. I'm using Jupyter Notebook and Pandas/matplotlib to plot some data. How can I make a bar chart from this table where the x axis is room_type and the y axis is square_feet? I only want to plot the data for House 1. I haven't been able to find anything online showing how to select only the data in one column that matches a particular value in another column. Does that make sense?
Thanks for any help you can provide!

IIUC, you can do it by filtering the dataframe first then calling plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'house_number':['House 1']*6+['House 2']*6
,'room_type':['Master Bedroom', 'Bedroom 1', 'Bedroom 2', 'Kitchen',
'Bathroom 1', 'Bathroom 2']*2
,'square_feet':[250,180,150,200,25,30,300,170,175,210,30,20]})
ax = df.query('house_number == "House 1"').plot.bar(x='room_type', y='square_feet')
ax.set_title('House 1')
ax.set_ylabel('square ft')
Output: (bar chart image not preserved)
Or, you can filter the dataframe using boolean indexing:
df[df['house_number'] == 'House 1'].plot.bar(x='room_type', y='square_feet')
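As a quick check that the two filters above select the same rows, here is a minimal sketch using the data from the question. The variable names by_query, by_mask and house1 are illustrative only and do not come from the original answer.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'house_number': ['House 1'] * 6 + ['House 2'] * 6,
    'room_type': ['Master Bedroom', 'Bedroom 1', 'Bedroom 2', 'Kitchen',
                  'Bathroom 1', 'Bathroom 2'] * 2,
    'square_feet': [250, 180, 150, 200, 25, 30, 300, 170, 175, 210, 30, 20],
})

# Two equivalent ways to keep only the rows for House 1
by_query = df.query('house_number == "House 1"')
by_mask = df[df['house_number'] == 'House 1']

# Both approaches return the same six rows
print(by_query.equals(by_mask))

# The values that would end up on the bar chart, indexed by room type
house1 = by_mask.set_index('room_type')['square_feet']
print(house1)
```

Either filtered frame can then be passed to .plot.bar(x='room_type', y='square_feet') as in the answer.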

Related

How to create a 100% stacked bar plot from a categorical dataframe

I have a dataframe structured like this:
User   Food 1    Food 2    Food 3    Food 4
Steph  Onions    Tomatoes  Cabbages  Potatoes
Tom    Potatoes  Tomatoes  Potatoes  Potatoes
Fred   Carrots   Cabbages            Eggplant
Phil   Onions    Eggplant  Eggplant
I want to use the distinct values from across the food columns as categories. I then want to create a Seaborn plot so the % of each category for each column is plotted as a 100% horizontal stacked bar.
My attempt to do this:
data = {
'User' : ['Steph', 'Tom', 'Fred', 'Phil'],
'Food 1' : ["Onions", "Potatoes", "Carrots", "Onions"],
'Food 2' : ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
'Food 3' : ["Cabbages", "Potatoes", "", "Eggplant"],
'Food 4' : ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
df = pd.DataFrame(data)
x_ax = ["Onions", "Potatoes", "Carrots", "Onions", "", 'Eggplant', "Cabbages"]
df.plot(kind="barh", x=x_ax, y=["Food 1", "Food 2", "Food 3", "Food 4"], stacked=True, ax=axes[1])
plt.show()
1. Replace '' with np.nan because empty strings will be counted as values.
2. Use pandas.DataFrame.melt to convert the dataframe to a long form.
3. Use pandas.crosstab with the normalize parameter to calculate the percent for each 'Food'.
4. Plot the dataframe with pandas.DataFrame.plot and kind='barh'.
   Putting the food names on the x-axis is not the correct way to create a 100% stacked bar plot. One axis must be numeric. The bars will be colored by food type.
5. Annotate the bars based on this answer.
6. Move the legend outside the plot based on this answer.
seaborn is a high-level API for matplotlib, and pandas uses matplotlib as the default backend, and it's easier to produce a stacked bar plot with pandas.
seaborn doesn't support stacked barplots, unless histplot is used in a hacked way, as shown in this answer, and would require an extra step of melting percent.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1
Assignment expressions (:=) require python >= 3.8. Otherwise, use [f'{v.get_width():.2f}%' if v.get_width() > 0 else '' for v in c ].
import pandas as pd
import numpy as np
# using the dataframe in the OP
# 1.
df = df.replace('', np.nan)
# 2.
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')
# 3.
percent = pd.crosstab(dfm.Food, dfm.Type, normalize='index').mul(100).round(2)
# 4.
ax = percent.plot(kind='barh', stacked=True, figsize=(8, 6))
# 5.
for c in ax.containers:
    # customize the label to account for cases when there might not be a bar section
    labels = [f'{w:.2f}%' if (w := v.get_width()) > 0 else '' for v in c]
    # set the bar label
    ax.bar_label(c, labels=labels, label_type='center')
# 6.
ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
DataFrame Views
dfm
User Food Type
0 Steph Food 1 Onions
1 Tom Food 1 Potatoes
2 Fred Food 1 Carrots
3 Phil Food 1 Onions
4 Steph Food 2 Tomatoes
5 Tom Food 2 Tomatoes
6 Fred Food 2 Cabbages
7 Phil Food 2 Eggplant
8 Steph Food 3 Cabbages
9 Tom Food 3 Potatoes
10 Fred Food 3 NaN
11 Phil Food 3 Eggplant
12 Steph Food 4 Potatoes
13 Tom Food 4 Potatoes
14 Fred Food 4 Eggplant
15 Phil Food 4 NaN
ct
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0 1 0 2 1 0
Food 2 1 0 1 0 0 2
Food 3 1 0 1 0 1 0
Food 4 0 0 1 0 2 0
total
Food
Food 1 4
Food 2 4
Food 3 3
Food 4 3
dtype: int64
percent
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0.00 25.0 0.00 50.0 25.00 0.0
Food 2 25.00 0.0 25.00 0.0 0.00 50.0
Food 3 33.33 0.0 33.33 0.0 33.33 0.0
Food 4 0.00 0.0 33.33 0.0 66.67 0.0
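The views above show ct and total, which the answer's code does not define. A minimal sketch of how they could be computed, assuming ct is the raw count crosstab and total is its row sum (this is an assumption, not the author's confirmed code), follows. It reproduces percent without the normalize parameter.

```python
import numpy as np
import pandas as pd

# Data copied from the question
data = {
    'User': ['Steph', 'Tom', 'Fred', 'Phil'],
    'Food 1': ['Onions', 'Potatoes', 'Carrots', 'Onions'],
    'Food 2': ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
    'Food 3': ['Cabbages', 'Potatoes', '', 'Eggplant'],
    'Food 4': ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
df = pd.DataFrame(data).replace('', np.nan)
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')

# Assumed definitions for the 'ct' and 'total' views
ct = pd.crosstab(dfm['Food'], dfm['Type'])   # raw counts; NaN rows are dropped
total = ct.sum(axis=1)                       # number of non-missing entries per Food

# Same result as pd.crosstab(..., normalize='index').mul(100).round(2)
percent = ct.div(total, axis=0).mul(100).round(2)
print(percent)
```

Dividing by total makes the denominator explicit: Food 3 and Food 4 are divided by 3 rather than 4 because the missing entries are excluded.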

classifying excel data row by row in n level columns

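I have a problem classifying the data in an Excel file that has merged cells across several columns and rows. I need each merged cell to be paired, one row at a time, with the cells beside it in the next column, and then each of those cells paired with the cells beside it in the following column, as in these pictures: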
Input:
Output for Dairy:
Summary:
First we take the Dairy row, then go to the second column and collect the entries next to Dairy. Then, for each of those entries (for example Milk to Mr. 1), we go to the next column and collect the entries next to it (Butter to Mrs. 1 and Butter to Mrs. 2), and so on.
After that we want to export the result to an Excel file, as in the Output picture.
I have written code that takes the first-column data and finds all the data next to it, but I need to change it so that it collects the data row by row, as in the Output picture:
import pandas
import openpyxl
import xlwt
from xlwt import Workbook

df = pandas.read_excel('excel.xlsx')
result_first_level = []
for i, item in enumerate(df[df.columns[0]].values, 2):
    if pandas.isna(item):
        result_first_level[-1]['index'] = i
    else:
        result_first_level.append(dict(name=item, index=i, levels_name=[]))
for level in df.columns[1:]:
    move_index = 0
    for i, obj in enumerate(result_first_level):
        if i == 0:
            for item in df[level].values[0:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1
        else:
            for item in df[level].values[move_index:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1
# Workbook is created
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
style = xlwt.easyxf('font: bold 1')
move_index = 0
for item in result_first_level:
    for member in item['levels_name']:
        sheet1.write(move_index, 0, item['name'], style)
        sheet1.write(move_index, 1, member)
        move_index += 1
wb.save('test.xls')
download Input File excel from here
Thanks for helping!
First, forward-fill your data so that blank cells take the last valid value, then create an ordered collection using pd.CategoricalDtype to sort the product column. Next, iterate over the columns pairwise and rename each pair so the pieces can be concatenated. The last step is to sort the rows by product value.
import pandas as pd
# Prepare your dataframe
df = pd.read_excel('input.xlsx').dropna(how='all')
df.update(df.iloc[:, :-1].ffill())
df = df.drop_duplicates()
# Get keys to sort data in the final output
cats = pd.CategoricalDtype(df.T.melt()['value'].dropna().unique(), ordered=True)
# Group pairwise values
data = []
for cols in zip(df.columns, df.columns[1:]):
    col_mapping = dict(zip(cols, ['product', 'subproduct']))
    data.append(df[list(cols)].rename(columns=col_mapping))
# Merge all data
out = pd.concat(data).drop_duplicates().dropna() \
.astype(cats).sort_values('product').reset_index(drop=True)
Output:
>>> cats
CategoricalDtype(categories=['Dairy', 'Milk to Mr.1', 'Butter to Mrs.1',
'Butter to Mrs.2', 'Cheese to Miss 2 ', 'Cheese to Mr.2',
'Milk to Miss.1', 'Milk to Mr.5', 'yoghurt to Mr.3',
'Milk to Mr.6', 'Fruits', 'Apples to Mr.6',
'Limes to Miss 5', 'Oranges to Mr.7', 'Plumbs to Miss 5',
'apple for mr 2', 'Foods & Drinks', 'Chips to Mr1',
'Jam to Mr 2.', 'Coca to Mr 5', 'Cookies to Mr1.',
'Coca to Mr 7', 'Coca to Mr 6', 'Juice to Miss 1',
'Jam to Mr 3.', 'Ice cream to Miss 3.', 'Honey to Mr 5',
'Cake to Mrs. 2', 'Honey to Miss 2',
'Chewing gum to Miss 7.'], ordered=True)
>>> out
product subproduct
0 Dairy Milk to Mr.1
1 Dairy Cheese to Mr.2
2 Milk to Mr.1 Butter to Mrs.1
3 Milk to Mr.1 Butter to Mrs.2
4 Butter to Mrs.2 Cheese to Miss 2
5 Cheese to Mr.2 Milk to Miss.1
6 Cheese to Mr.2 yoghurt to Mr.3
7 Milk to Miss.1 Milk to Mr.5
8 yoghurt to Mr.3 Milk to Mr.6
9 Fruits Apples to Mr.6
10 Fruits Oranges to Mr.7
11 Apples to Mr.6 Limes to Miss 5
12 Oranges to Mr.7 Plumbs to Miss 5
13 Plumbs to Miss 5 apple for mr 2
14 Foods & Drinks Chips to Mr1
15 Foods & Drinks Juice to Miss 1
16 Foods & Drinks Cake to Mrs. 2
17 Chips to Mr1 Jam to Mr 2.
18 Chips to Mr1 Cookies to Mr1.
19 Jam to Mr 2. Coca to Mr 5
20 Cookies to Mr1. Coca to Mr 6
21 Cookies to Mr1. Coca to Mr 7
22 Juice to Miss 1 Honey to Mr 5
23 Juice to Miss 1 Jam to Mr 3.
24 Jam to Mr 3. Ice cream to Miss 3.
25 Cake to Mrs. 2 Chewing gum to Miss 7.
26 Cake to Mrs. 2 Honey to Miss 2
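Since the input workbook is not reproduced here, the forward-fill and pairwise steps can be illustrated on a tiny in-memory frame. This is only a sketch: the column names level1 to level3 and the cell values are made up for illustration, and the categorical sorting step from the answer is left out.

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet; None marks cells left blank by merging
df = pd.DataFrame({
    'level1': ['Dairy', None, None, 'Fruits'],
    'level2': ['Milk', None, 'Cheese', 'Apples'],
    'level3': ['Butter', 'Cream', 'Yoghurt', 'Limes'],
})

# Forward-fill every column except the last so each row carries its parents
filled = df.copy()
parent_cols = filled.columns[:-1]
filled[parent_cols] = filled[parent_cols].ffill()

# Walk over adjacent column pairs and stack them as (product, subproduct)
pairs = []
for parent, child in zip(filled.columns, filled.columns[1:]):
    pair = filled[[parent, child]].dropna().drop_duplicates()
    pair.columns = ['product', 'subproduct']
    pairs.append(pair)

out = pd.concat(pairs, ignore_index=True)
print(out)
```

Each parent appears once per distinct child, which is the same (product, subproduct) shape as the out frame shown above.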

Disaggregate pandas data frame using ratios from another data frame

I have a pandas data frame 'High' as
segment sales
Milk 10
Chocolate 30
and another data frame 'Low' as
segment sku sales
Milk m2341 2
Milk m235 3
Chocolate c132 2
Chocolate c241 5
Chocolate c891 3
I want to use the ratios from Low to disaggregate High. So my resulting data here would be
segment sku sales
Milk m2341 4
Milk m235 6
Chocolate c132 6
Chocolate c241 15
Chocolate c891 9
First, I would find the scale by which each product's sales need to be multiplied.
df_agg = df_low[["segment", "sales"]].groupby(by=["segment"]).sum().merge(df_high, on="segment")
df_agg["scale"] = df_agg["sales_y"] / df_agg["sales_x"]
Then, apply the scale
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]])
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
If needed, you can exclude extra columns.
Try:
df_low["sales"] = df_low.sales.mul(
df_low.merge(
df_high.set_index("segment")["sales"].div(
df_low.groupby("segment")["sales"].sum()
),
on="segment",
)["sales_y"]
).astype(int)
print(df_low)
Prints:
segment sku sales
0 Milk m2341 4
1 Milk m235 6
2 Chocolate c132 6
3 Chocolate c241 15
4 Chocolate c891 9
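For comparison, here is a sketch of the same disaggregation without a merge, using groupby(...).transform('sum') and Series.map. This is a different technique from the two answers above. The frames are rebuilt from the tables in the question, and the names seg_total and seg_target are illustrative.

```python
import pandas as pd

# Frames rebuilt from the question
df_high = pd.DataFrame({'segment': ['Milk', 'Chocolate'], 'sales': [10, 30]})
df_low = pd.DataFrame({
    'segment': ['Milk', 'Milk', 'Chocolate', 'Chocolate', 'Chocolate'],
    'sku': ['m2341', 'm235', 'c132', 'c241', 'c891'],
    'sales': [2, 3, 2, 5, 3],
})

# Total low-level sales per segment, aligned row by row with df_low
seg_total = df_low.groupby('segment')['sales'].transform('sum')

# High-level sales for each row's segment
seg_target = df_low['segment'].map(df_high.set_index('segment')['sales'])

# Each SKU keeps its share of the segment, rescaled to the high-level total
result = df_low.assign(sales=df_low['sales'] * seg_target / seg_total)
print(result)
```

Note that the resulting sales column is float; cast it with astype(int) only if the scaled values are known to be whole numbers.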

pandas - how to extract top three rows from the dataframe provided

From my pandas DataFrame df, I produce the following grouped result:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'season': np.random.randint(1, 7, size=100),
'actor': np.random.choice(names, size=100),
'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at 4)
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
A simpler option is to chain value_counts with a second groupby (using company_code, the grouping column from the question):
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
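To tie this back to the question's own data, here is a sketch that rebuilds the grouped Series from the output shown in the question and applies the sort-then-head(3) pattern. The Series is typed in by hand from that output; the name top3 is illustrative.

```python
import pandas as pd

# Counts copied from the grouped output in the question
rows = [
    ('TDS', 'Meta', 404), ('TDS', 'Electrical', 333),
    ('TDS', 'Mechanical', 533), ('TDS', 'Agri', 453),
    ('XYZ', 'Sports', 331), ('XYZ', 'Electrical', 354),
    ('XYZ', 'Movies', 375), ('XYZ', 'Manufacturing', 355),
    ('UVW', 'Sports', 505), ('UVW', 'Robotics', 345),
    ('UVW', 'Movies', 56), ('UVW', 'Health', 3263),
    ('UVW', 'Manufacturing', 456), ('UVW', 'Others', 524),
]
index = pd.MultiIndex.from_tuples(
    [(company, sector) for company, sector, _ in rows],
    names=['company_code', 'sector'],
)
grouped = pd.Series([count for _, _, count in rows], index=index, name='X_sector')

# Sort all counts descending, then keep the first three rows of each company
top3 = grouped.sort_values(ascending=False).groupby(level='company_code').head(3)
print(top3)
```

Because the sort happens before the groupby, head(3) picks the three largest counts within each company_code.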

Plotting graph from csv file

I'm new to python. I've to plot graphs from a csv file that I've created.
a) Monthly sales vs Product Price
b) Geographic Region vs No of customer
The code that I've implemented was
import pandas as pd
import matplotlib.pyplot as plot
import csv
data = pd.read_csv('dataset_books.csv')
data.hist(bins=90)
plot.xlim([0,115054])
plot.title("Data")
x = plot.xlabel("Monthly Sales")
y = plot.ylabel("Product Price")
plot.show()
The output that I'm getting is not what I expected and was not approved.
I need a horizontal histogram with a line plot.
Book ID Product Name Product Price Monthly Sales Shipping Type Geographic Region No Of Customer Who Bought the Product Customer Type
1 The Path to Power 486 2566.08 Free Gatton 4 Old
2 Touching Darkness (Midnighters, #2) 479 1264.56 Paid Hooker Creek 2 New
3 Star Wars: Lost Stars 456 1203.84 Paid Gladstone 2 New
4 Winter in Madrid 454 599.28 Paid Warruwi 1 New
5 Hairy Maclary from Donaldson's Dairy 442 2333.76 Free Mount Gambier 4 Old
6 Stand on Zanzibar 413 3816.12 Free Cessnock 7 Old
7 Marlfox 411 3797.64 Free Edinburgh 7 Old
8 The Matlock Paper 373 3446.52 Free Gladstone 7 Old
9 Tears of a Tiger 361 1906.08 Free Melbourne 4 Old
10 Star Wars: Vision of the Future 355 937.2 Paid Wagga Wagga 2 New
11 Nefes Nefese 344 454.08 Paid Gatton 1 New
This is my CSV file.
Can anyone help me?
First, check your column names:
df.columns
>> Index(['Book ID', 'Product Name', 'Product Price', 'Monthly Sales',
'Shipping Type', 'Geographic Region',
'No Of Customer Who Bought the Product', 'Customer Type'],
dtype='object')
Next, to draw a horizontal bar plot, use kind='barh'.
df['Product Price'].plot(kind='barh')
Another option is to choose the column by position with 'iloc':
df.iloc[:, 2].plot(kind='barh')
It will generate the same output.
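As a fuller sketch of the 'barh' approach, the following plots product price per book and number of customers per region side by side. It uses four rows copied from the CSV in the question instead of reading dataset_books.csv. The Agg backend line is only there so the sketch runs without a display and should be dropped in a notebook. The names region_counts, ax_price and ax_region are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Four rows copied from the question's CSV
df = pd.DataFrame({
    'Product Name': ['The Path to Power', 'Star Wars: Lost Stars',
                     'The Matlock Paper', 'Nefes Nefese'],
    'Product Price': [486, 456, 373, 344],
    'Geographic Region': ['Gatton', 'Gladstone', 'Gladstone', 'Gatton'],
    'No Of Customer Who Bought the Product': [4, 2, 7, 1],
})

# Total customers per region
region_counts = (
    df.groupby('Geographic Region')['No Of Customer Who Bought the Product'].sum()
)

fig, (ax_price, ax_region) = plt.subplots(1, 2, figsize=(10, 4))

# Horizontal bars: one bar per book, bar length is the price
df.plot(kind='barh', x='Product Name', y='Product Price', legend=False, ax=ax_price)
ax_price.set_xlabel('Product Price')

# Horizontal bars: one bar per region, bar length is the customer count
region_counts.plot(kind='barh', ax=ax_region)
ax_region.set_xlabel('Number of customers')

fig.tight_layout()
```

In a notebook, plt.show() (or simply ending the cell with the figure) displays the result.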
