I can only use np and bpd.
Question 1.3
1 point
The 'Segment_Category' describes the food and service of each chain. What are the most popular segment categories in chains?
Create an array called ordered_segment_categories containing all the segment categories, ordered from the most popular segment category to the least popular segment category in chains.
This is the dataframe.
Rank Restaurant Sales YOY_Sales Segment_Category
0 1 McDonald's 40517 0.30% Quick Service & Burger
1 2 Starbucks 18485 -13.50% Quick Service & Coffee Cafe
2 3 Chick-fil-A 13745 13.00% Quick Service & Chicken
3 4 Taco Bell 11294 0.00% Quick Service & Mexican
4 5 Wendy's 10231 4.80% Quick Service & Burger
... ... ... ... ... ...
245 246 American Deli 98 2.00% Quick Service & Sandwich
246 247 Bonchon 98 -19.50% Casual Dining & Asian
247 248 Chopt 98 -12.40% Fast Casual & All Other
248 249 Chicken Express 96 -6.50% Quick Service & Chicken
249 250 Sizzler 96 -63.00% Quick Service & All Other
I tried it all. I just can't seem to figure it out.
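In case it helps, here is one way to do it with just np and bpd (a sketch, assuming bpd refers to the babypandas library and the dataframe is stored in a variable called chains): group by the category, count, sort by the count in descending order, and take the labels out of the index.

import numpy as np
import babypandas as bpd  # assumption: bpd is the babypandas library

# chains is assumed to hold the dataframe shown above.
# After groupby/count, every remaining column contains the group sizes,
# so any of them (here 'Rank') works as the popularity count.
counts = chains.groupby('Segment_Category').count()
ordered = counts.sort_values(by='Rank', ascending=False)
ordered_segment_categories = np.array(ordered.index)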
Related
I have a dataframe that looks something like this:
Group  UPC      Description
246    1234568  Chips BBQ
158    7532168  Cereal Honey
246    9876532  Chips Ketchup
665    8523687  Strawberry Jam
246    1234568  Chips BBQ
158    5553215  Cereal Chocolate
I want to replace each item's description with the most frequent description for its group #, or with the first instance if there is a tie.
So in the example above, Chips Ketchup (1 instance) is replaced with Chips BBQ (2 instances), and Cereal Chocolate is replaced with Cereal Honey (first instance).
Desired output would be:
Group  UPC      Description
246    1234568  Chips BBQ
158    7532168  Cereal Honey
246    9876532  Chips BBQ
665    8523687  Strawberry Jam
246    1234568  Chips BBQ
158    5553215  Cereal Honey
If this is too complicated, I can settle for simply replacing with the first instance, without taking frequency into consideration at all.
Thanks in advance
You can use
df['Description'] = df.groupby('Group')['Description'].transform(lambda s: s.value_counts().index[0])
It seems like Series.value_counts (unlike Series.mode, which I also tried) orders elements that occur the same number of times by their first occurrence. This behavior is not documented, so I'm not sure you can rely on it.
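If you'd rather not rely on that undocumented tie-breaking, here is a sketch that makes the first-occurrence rule explicit (most_frequent_first is a made-up helper name):

def most_frequent_first(s):
    # s.drop_duplicates() preserves first-appearance order, and Python's
    # max() returns the first maximal element, so ties resolve to the
    # description that appeared first in the group.
    return max(s.drop_duplicates(), key=lambda v: (s == v).sum())

df['Description'] = df.groupby('Group')['Description'].transform(most_frequent_first)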
I'm preparing for a new job where I'll be receiving data submissions of varying quality; oftentimes dates/chars/etc. are combined nonsensically and must be separated before analysis. I'm thinking ahead about how this might be solved.
In the fictitious example below, I combined region, rep, and product:
file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 EastShirlenePencil
1 3 South Anderson Folder 17 69 SouthAndersonFolder
2 3 West Shelli Folder 17 185 WestShelliFolder
3 3 South Damion Binder 30 159 SouthDamionBinder
4 3 West Shirlene Stapler 25 41 WestShirleneStapler
Assuming no other data, the question is, how can the 'combine' column be split up?
Many thanks in advance!
If you want spaces between the strings, you can do:
df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 East Shirlene Pencil
1 3 South Anderson Folder 17 69 South Anderson Folder
2 3 West Shelli Folder 17 185 West Shelli Folder
3 3 South Damion Binder 30 159 South Damion Binder
4 3 West Shirlene Stapler 25 41 West Shirlene Stapler
Or, if you want to split the already combined string:
import re

# Each match starts at a capital letter and runs until the next one.
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine separated
0 3 East Shirlene Pencil 5 71 EastShirlenePencil [East, Shirlene, Pencil]
1 3 South Anderson Folder 17 69 SouthAndersonFolder [South, Anderson, Folder]
2 3 West Shelli Folder 17 185 WestShelliFolder [West, Shelli, Folder]
3 3 South Damion Binder 30 159 SouthDamionBinder [South, Damion, Binder]
4 3 West Shirlene Stapler 25 41 WestShirleneStapler [West, Shirlene, Stapler]
I have a file that I am trying to read into a pandas dataframe. However, some of the cells are coming up as NaN even though there are values in them. The cells that show up correctly hold float values; the ones that come up as NaN were copy-pasted into the file. Not sure why that would make a difference. Can anyone help? I have included the file as a link at this location: https://www.dropbox.com/s/30rxw07eaza29df/manhattan_hs_gps.csv?dl=0
I tried this and it worked fine; both encoding='unicode-escape' and encoding='latin-1' work:
import pandas as pd

df = pd.read_csv('manhattan_hs_gps.csv', encoding='unicode-escape', header=None)
print(df)
0 1 2 3
0 0 A. Philip Randolph Campus High School 40.818500 -73.950000
1 1 Aaron School 40.744800 -73.983700
2 2 Abraham Joshua Heschel School 40.772300 -73.989700
3 3 Academy of Environmental Science Secondary Hig... 40.785200 -73.942200
4 4 Academy for Social Action: A College Board School 40.815400 -73.955300
.. ... ... ... ...
162 164 Xavier High School 40.737900 -73.994600
163 165 Yeshiva University High School for Boys 40.851749 -73.928695
164 166 York Preparatory School 40.774100 -73.979400
165 167 Young Women's Leadership School 40.792900 -73.947200
166 168 Washington Heights Expeditionary Learning School 40.774100 -73.979400
I have two dataframes. The first holds the nutritional information of certain ingredients, with ingredients as rows and nutritional categories as columns.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe with meals as rows and nutritional categories as columns, representing the total nutritional value of each meal based on its ingredient quantities.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
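What you describe is essentially a matrix multiplication: the meal-by-ingredient quantity table times the ingredient-by-nutrient table. Here is a sketch filling in the template, assuming the ingredient column names in meals_df match the Item values in ingredients_df exactly, and that the columns from 'Calories' onward are numeric:

import pandas as pd

def create_meal_category_dataframe(ingredients_df, meals_df):
    # One row per meal, one column per ingredient quantity.
    quantities = meals_df.drop(columns=['Meal'])
    # One row per ingredient, numeric nutrient columns only, reordered
    # so the rows line up with the ingredient columns of meals_df.
    nutrients = ingredients_df.set_index('Item').loc[quantities.columns, 'Calories':]
    # Meal totals = quantities (meals x ingredients) dot
    # nutrients (ingredients x nutrient categories).
    meal_cat_df = quantities.dot(nutrients)
    meal_cat_df.index = meals_df['Meal']
    return meal_cat_df

Each meal row then holds the quantity-weighted sum of its ingredients' nutrients, matching the 2*150 + 100 + 50 arithmetic above.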
I have a list of words that were inputted by my users. After some cleaning up (to correct spelling mistakes), I have the following list; each row shows a string and the number of times it was inputted:
Pepsi 500
Coke 358
Dr. pepper 254
Sprite 204
Coca cola 159
7 up 140
Mountain dew 137
Diet coke 58
Mtn. dew 50
Now I would like to have a script that will go over this list and group similar words.
For example, merging Coke, Coca cola and Diet coke into one group (because they are synonyms of Coca cola).
I saw that NLTK's WordNet has some similarity functions. Can I use them, or is there a "better" way of approaching this problem?
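WordNet is unlikely to help much here, since brand names like Pepsi and Mtn. dew are not in it. One simple starting point is to cluster by string similarity; here is a sketch using the standard library's difflib, where the 0.6 threshold is an arbitrary cutoff you would tune:

from difflib import SequenceMatcher

words = ['Pepsi', 'Coke', 'Dr. pepper', 'Sprite', 'Coca cola',
         '7 up', 'Mountain dew', 'Diet coke', 'Mtn. dew']

def similar(a, b, threshold=0.6):
    # Ratio of matching characters between the two lowercased strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy one-pass clustering: each word joins the first group whose
# representative (first member) it resembles, else it starts a new group.
groups = []
for word in words:
    for group in groups:
        if similar(word, group[0]):
            group.append(word)
            break
    else:
        groups.append([word])

print(groups)

This should catch near-duplicates such as Coke / Diet coke and Mountain dew / Mtn. dew, but synonyms that share few characters, like Coke vs. Coca cola, generally need a hand-curated synonym map or word embeddings.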