Classifying Excel data row by row in n-level columns - Python

I have a problem classifying the data in some columns and rows of an Excel file. I need to arrange the merged cells so that each one becomes a row in the next column, with the next column's values placed beside them, like in these pictures:
Input:
Output for Dairy:
Summary:
First we take the Dairy row, then we go to the second column and get the data in front of Dairy; next, in front of Milk to Mr. 1, we get Butter to Mrs. 1 and Butter to Mrs. 2, and so on...
After that we want to export it into an Excel file like in the Output picture.
I have written code which gets the first-column data and finds all the data in front of it, but I need to change it to get the data row by row like in the Output picture:
import pandas
import openpyxl
import xlwt
from xlwt import Workbook

df = pandas.read_excel('excel.xlsx')

result_first_level = []
for i, item in enumerate(df[df.columns[0]].values, 2):
    if pandas.isna(item):
        result_first_level[-1]['index'] = i
    else:
        result_first_level.append(dict(name=item, index=i, levels_name=[]))

for level in df.columns[1:]:
    move_index = 0
    for i, obj in enumerate(result_first_level):
        if i == 0:
            for item in df[level].values[0:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1
        else:
            for item in df[level].values[move_index:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1

# Workbook is created
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
style = xlwt.easyxf('font: bold 1')

move_index = 0
for item in result_first_level:
    for member in item['levels_name']:
        sheet1.write(move_index, 0, item['name'], style)
        sheet1.write(move_index, 1, member)
        move_index += 1
wb.save('test.xls')
Download the input Excel file from here.
Thanks for helping!

First, forward-fill your data to fill blank cells with the last valid value, then create an ordered collection using pd.CategoricalDtype to sort the product column. Then you just have to iterate over the columns pairwise and rename them so they can be concatenated. The last step is to sort your rows by product value.
import pandas as pd

# Prepare your dataframe
df = pd.read_excel('input.xlsx').dropna(how='all')
df.update(df.iloc[:, :-1].ffill())
df = df.drop_duplicates()

# Get keys to sort data in the final output
cats = pd.CategoricalDtype(df.T.melt()['value'].dropna().unique(), ordered=True)

# Group pairwise values
data = []
for cols in zip(df.columns, df.columns[1:]):
    col_mapping = dict(zip(cols, ['product', 'subproduct']))
    data.append(df[list(cols)].rename(columns=col_mapping))

# Merge all data
out = pd.concat(data).drop_duplicates().dropna() \
        .astype(cats).sort_values('product').reset_index(drop=True)
Output:
>>> cats
CategoricalDtype(categories=['Dairy', 'Milk to Mr.1', 'Butter to Mrs.1',
'Butter to Mrs.2', 'Cheese to Miss 2 ', 'Cheese to Mr.2',
'Milk to Miss.1', 'Milk to Mr.5', 'yoghurt to Mr.3',
'Milk to Mr.6', 'Fruits', 'Apples to Mr.6',
'Limes to Miss 5', 'Oranges to Mr.7', 'Plumbs to Miss 5',
'apple for mr 2', 'Foods & Drinks', 'Chips to Mr1',
'Jam to Mr 2.', 'Coca to Mr 5', 'Cookies to Mr1.',
'Coca to Mr 7', 'Coca to Mr 6', 'Juice to Miss 1',
'Jam to Mr 3.', 'Ice cream to Miss 3.', 'Honey to Mr 5',
'Cake to Mrs. 2', 'Honey to Miss 2',
'Chewing gum to Miss 7.'], ordered=True)
>>> out
product subproduct
0 Dairy Milk to Mr.1
1 Dairy Cheese to Mr.2
2 Milk to Mr.1 Butter to Mrs.1
3 Milk to Mr.1 Butter to Mrs.2
4 Butter to Mrs.2 Cheese to Miss 2
5 Cheese to Mr.2 Milk to Miss.1
6 Cheese to Mr.2 yoghurt to Mr.3
7 Milk to Miss.1 Milk to Mr.5
8 yoghurt to Mr.3 Milk to Mr.6
9 Fruits Apples to Mr.6
10 Fruits Oranges to Mr.7
11 Apples to Mr.6 Limes to Miss 5
12 Oranges to Mr.7 Plumbs to Miss 5
13 Plumbs to Miss 5 apple for mr 2
14 Foods & Drinks Chips to Mr1
15 Foods & Drinks Juice to Miss 1
16 Foods & Drinks Cake to Mrs. 2
17 Chips to Mr1 Jam to Mr 2.
18 Chips to Mr1 Cookies to Mr1.
19 Jam to Mr 2. Coca to Mr 5
20 Cookies to Mr1. Coca to Mr 6
21 Cookies to Mr1. Coca to Mr 7
22 Juice to Miss 1 Honey to Mr 5
23 Juice to Miss 1 Jam to Mr 3.
24 Jam to Mr 3. Ice cream to Miss 3.
25 Cake to Mrs. 2 Chewing gum to Miss 7.
26 Cake to Mrs. 2 Honey to Miss 2
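The question also asks for an Excel file as the final output. A minimal sketch of that last step, assuming an output file name of output.xlsx:
# Write the reshaped frame back to Excel ('output.xlsx' is an assumed name)
out.to_excel('output.xlsx', index=False)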

Related

Python: Counting values for columns with multiple values per entry in dataframe

I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].
Let's say it's something like this:
RestaurantName City Restaurant ID Cuisines
Restaurant A Milan 31333 French, Spanish, Italian
Restaurant B Shanghai 63551 Pizza, Burgers
Restaurant C Dubai 7991 Burgers, Ice Cream
Here's a copy-able code as a sample:
rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})
I used string split to expand them into 8 different columns and added them to the dataframe.
csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]
Which leaves me with this:
https://i.stack.imgur.com/AUSDY.png
Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns; let's say I want to see the top cuisine by city.
I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This gives me duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge like ones and keep only those with a count of 1, but I have no idea how to do that.
dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())
Which leaves me with all of these columns:
https://i.stack.imgur.com/84spI.png
A third thing I tried was to get the unique values from all 8 columns, which gives me a deduplicated list of each type of cuisine. I can probably add all these columns to the dataframe, but I wouldn't know how to fill the rows with a count for each one based on the column name.
AllCsn = np.concatenate((rst.Cuisine1.unique(),
                         rst.Cuisine2.unique(),
                         rst.Cuisine3.unique(),
                         rst.Cuisine4.unique(),
                         rst.Cuisine5.unique(),
                         rst.Cuisine6.unique(),
                         rst.Cuisine7.unique(),
                         rst.Cuisine8.unique()))
AllCsn = np.unique(AllCsn.astype(str))
AllCsn
Which leaves me with this:
https://i.stack.imgur.com/O9OpW.png
I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.
I am new to this, so please bear with me and let me know if I need to provide any more info.
It sounds like you're looking for str.split without expanding, then explode:
rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')
Creates a frame like:
RestaurantName City RestaurantID Cuisines
0 Rest A Milan 31333 French
0 Rest A Milan 31333 Spanish
0 Rest A Milan 31333 Italian
1 Rest B Shanghai 63551 Pizza
1 Rest B Shanghai 63551 Burgers
2 Rest C Dubai 7991 Burgers
2 Rest C Dubai 7991 Ice Cream
Then it sounds like either crosstab:
pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines Burgers French Ice Cream Italian Pizza Spanish
City
Dubai 1 0 1 0 0 0
Milan 0 1 0 1 0 1
Shanghai 1 0 0 0 1 0
Or value_counts:
rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
City Cuisines counts
0 Dubai Burgers 1
1 Dubai Ice Cream 1
2 Milan French 1
3 Milan Italian 1
4 Milan Spanish 1
5 Shanghai Burgers 1
6 Shanghai Pizza 1
Max value_count per City via groupby head:
max_counts = (
    rst[['City', 'Cuisines']].value_counts()
       .groupby(level=0).head(1)
       .reset_index(name='counts')
)
max_counts:
City Cuisines counts
0 Dubai Burgers 1
1 Milan French 1
2 Shanghai Burgers 1
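An equivalent sketch without the groupby, relying on the fact that value_counts already sorts counts in descending order, so the first row per city is its maximum:
max_counts = (
    rst[['City', 'Cuisines']].value_counts()
       .reset_index(name='counts')
       .drop_duplicates('City', keep='first')
)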

How to compare two data rows before concatenating them?

I have 2 datasets (in CSV format) with different size such as follow:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that, within each category, the rows of df_new go on top of the rows of df_old.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that some rows with the same index, category, and text (all matching in the same row, like [0, spam, you win much money]) end up duplicated, and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this, but those remove either the category or the text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
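For instance, a minimal sketch of how subset behaves, using hypothetical rows in the shape of the question's data:
import pandas as pd

df = pd.DataFrame({'index': [0, 0, 1],
                   'category': ['spam', 'spam', 'spam'],
                   'text': ['you win much money',
                            'you win much money',
                            'you are the winner of the game']})
# The first two rows agree on all three columns, so only one survives
print(df.drop_duplicates(subset=['index', 'category', 'text'], keep='first'))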
Try concat + sort_values:
res = pd.concat((new_df, old_df)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
Your code seems right; try to add this to the concat result, it will remove your duplicates:
# These first lines create a new column 'index' and help the rest of the code work correctly
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
df_concat.drop_duplicates()
If you want to reindex it you can, of course (not changing the 'index' column):
df_concat.drop_duplicates(ignore_index=True)
You can always do combine_first:
out = df_new.combine_first(df_old)
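For context, combine_first aligns both frames on their index, keeps df_new's values where they are present, and fills the gaps from df_old. A minimal sketch with a subset of the rows above, assuming the index column is the actual index:
import pandas as pd

df_old = pd.DataFrame({'category': ['spam', 'spam', 'not_spam'],
                       'text': ['you win much money',
                                'you are the winner of the game',
                                'the weather in Chicago is nice']},
                      index=[0, 1, 2])
df_new = pd.DataFrame({'category': ['spam', 'spam'],
                       'text': ['you win much money',
                                'London is the capital of Canada']},
                      index=[0, 14])

# Row 0 resolves to df_new's copy; rows 1 and 2 survive from df_old,
# row 14 from df_new, so duplicates by index disappear automatically.
out = df_new.combine_first(df_old)
print(out)
Note that the result is sorted by the union of the indexes rather than grouped by category, so a sort_values as in the other answers may still be needed.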

Plot multiple attributes from rows/columns in Pandas

See my data:
df = pd.DataFrame({'house_number': ['House 1']*6 + ['House 2']*6,
                   'room_type': ['Master Bedroom', 'Bedroom 1', 'Bedroom 2', 'Kitchen',
                                 'Bathroom 1', 'Bathroom 2']*2,
                   'square_feet': [250, 180, 150, 200, 25, 30, 300, 170, 175, 210, 30, 20]})
house_number room_type square_feet
0 House 1 Master Bedroom 250
1 House 1 Bedroom 1 180
2 House 1 Bedroom 2 150
3 House 1 Kitchen 200
4 House 1 Bathroom 1 25
5 House 1 Bathroom 2 30
6 House 2 Master Bedroom 300
7 House 2 Bedroom 1 170
8 House 2 Bedroom 2 175
9 House 2 Kitchen 210
10 House 2 Bathroom 1 30
11 House 2 Bathroom 2 20
I'm very new to programming. I'm using Jupyter Notebook and pandas/matplotlib to plot some data. How can I make a bar chart from this table where the x-axis is room_type and the y-axis is square_feet? I only want to plot the data for House 1. I haven't been able to find anything online about selecting only the data in one column that matches a particular value in another column. Does that make sense?
Thanks for any help you can provide!
IIUC, you can do it by filtering the dataframe first then calling plot:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'house_number': ['House 1']*6 + ['House 2']*6,
                   'room_type': ['Master Bedroom', 'Bedroom 1', 'Bedroom 2', 'Kitchen',
                                 'Bathroom 1', 'Bathroom 2']*2,
                   'square_feet': [250, 180, 150, 200, 25, 30, 300, 170, 175, 210, 30, 20]})

ax = df.query('house_number == "House 1"').plot.bar(x='room_type', y='square_feet')
ax.set_title('House 1')
ax.set_ylabel('square ft')
Output: a bar chart of room_type against square_feet for House 1.
Or, you can filter the dataframe using boolean indexing:
df[df['house_number'] == 'House 1'].plot.bar(x='room_type', y='square_feet')

Club all the rows of a dataframe as others except one

I have a dataset
Item Type market_share
Office Supplies 10
Baby Food 20
Vegetables 10
Meat 30
Personal Care 10
Household 20
I want to club together all the rows except the Baby Food row, so that my dataset will look like:
Item Type market_share
Others 80
Baby Food 20
How can I do that? Basically, club all the other rows together, sum them, and label them as Others.
You can use:
df.groupby(df['Item Type'].eq('Baby Food').map({True:'Baby Food',False:'Others'})).sum()
market_share
Item Type
Baby Food 20
Others 80
Create an array by condition with np.where, or a Series via Series.map (which turns non-matches into NaN, then fill them with Others), and aggregate the sum:
s = np.where(df['Item Type'] == 'Baby Food', 'Baby Food', 'Others')
print (s)
['Others' 'Baby Food' 'Others' 'Others' 'Others' 'Others']
s = df['Item Type'].map({'Baby Food':'Baby Food'}).fillna('Others')
print (s)
0 Others
1 Baby Food
2 Others
3 Others
4 Others
5 Others
Name: Item Type, dtype: object
df = df.groupby(s)['market_share'].sum().rename_axis('Item Type').reset_index()
print (df)
Item Type market_share
0 Baby Food 20
1 Others 80
Use np.where -
df['market_share_2'] = np.where(df['Item Type'].values=='Baby Food', 'Baby Food', 'Others')
Output
Item Type market_share market_share_2
0 Office Supplies 10 Others
1 Baby Food 20 Baby Food
2 Vegetables 10 Others
3 Meat 30 Others
4 Personal Care 10 Others
5 Household 20 Others
Then use value_counts() -
df['market_share_2'].value_counts()
Others 5
Baby Food 1
Name: market_share_2, dtype: int64
TLDR;
pd.Series(np.where(df['Item Type'].values=='Baby Food', 'Baby Food', 'Others')).value_counts()
You can use the not-equal operator != and the equality operator ==. Note the filter has to be on the Item Type column, with the sum taken over market_share:
df[df['Item Type'] != 'Baby Food']['market_share'].sum()
df[df['Item Type'] == 'Baby Food']['market_share'].sum()
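If you then want the two-row frame from the question, the two sums can be combined by hand; a minimal sketch:
# Sum market_share for the Baby Food row and for everything else,
# then assemble the two-row result frame
others = df.loc[df['Item Type'] != 'Baby Food', 'market_share'].sum()
baby = df.loc[df['Item Type'] == 'Baby Food', 'market_share'].sum()
result = pd.DataFrame({'Item Type': ['Others', 'Baby Food'],
                       'market_share': [others, baby]})
print(result)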

Creating a new column with last 2 values after a str.split operation

I came across this extremely well explained similar question (Get last "column" after .str.split() operation on column in pandas DataFrame) and used some of the code found there. However, it's not the output that I would like.
import pandas as pd

raw_data = {
    'category': ['sweet beverage, cola,sugared', 'healthy,salty snacks', 'juice,beverage,sweet',
                 'fruit juice,beverage', 'appetizer,salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df = pd.DataFrame(raw_data)
The objective is to extract the various categories from each row and use only the last two categories to create a new column. I have this code, which works and gives me the categories of interest as a new column:
df['my_col'] = df.category.apply(lambda s: s.split(',')[-2:])
Output:
my_col
[cola,sugared]
[healthy,salty snacks]
[beverage,sweet]
...
However, it appears as a list. How can I have it not appear as a list? Can this be achieved? Thanks all!
I believe you need str.split, then select the last two items of each list, and finally str.join:
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
EDIT:
In my opinion, the pandas str text functions are preferable to apply with pure Python string functions, because they also work with NaN and None:
import numpy as np

raw_data = {
    'category': [np.nan, 'healthy,salty snacks'],
    'product_name': ['coca-cola', 'salted pistachios']}
df = pd.DataFrame(raw_data)
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 NaN coca-cola NaN
1 healthy,salty snacks salted pistachios healthy,salty snacks
whereas the apply solution fails on the NaN:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
AttributeError: 'float' object has no attribute 'split'
You can also use join in the lambda on the result of split:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
df
Result:
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
