I have a large pandas dataframe in which I am attempting to randomly chunk objects into groups of a certain size. For example, I am attempting to chunk the objects below into groups of 3, with the constraint that every object in a group must have the same type. Here's a toy dataset:
type object index
ball soccer 1
ball soccer 2
ball basket 1
ball bouncy 1
ball tennis 1
ball tennis 2
chair office 1
chair office 2
chair office 3
chair lounge 1
chair dining 1
chair dining 2
... ... ...
Desired output:
type object index group
ball soccer 1 ball_1
ball soccer 2 ball_1
ball basket 1 ball_1
ball bouncy 1 ball_1
ball tennis 1 ball_2
ball tennis 2 ball_2
chair office 1 chair_1
chair office 2 chair_1
chair office 3 chair_1
chair lounge 1 chair_1
chair dining 1 chair_1
chair dining 2 chair_1
... ... ... ...
So here, the group ball_1 contains 3 unique objects from the same type: soccer, basket, and bouncy. The remainder object goes into group ball_2 which only has 1 object. Since the dataframe is so large, I'm hoping for a long list of groups that contain 3 objects and one group that contains the remainder objects (anything less than 3).
Again, while my example only contains a few objects, I'm hoping for the objects to be randomly sorted into groups of 3. (My real dataset will contain many more balls and chairs.)
This seemed helpful, but I haven't figured out how to apply it yet: How do you split a list into evenly sized chunks?
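(For reference, the chunking idiom from that linked question is the range-step generator below; it splits a flat list but doesn't know anything about types:)
def chunks(lst, n):
    # yield successive n-sized chunks from lst
    for i in range(0, len(lst), n):
        yield lst[i:i + n]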
If you need to split every N values per group by type, you can use factorize with GroupBy.transform, integer-divide by N and add 1, and finally prepend the type with Series.str.cat:
N = 3
# number the distinct objects within each type (0, 1, 2, ...), then
# integer-divide by N so every N distinct objects share one group number
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), '_')
print(df)
type object index group
0 ball soccer 1 ball_1
1 ball soccer 2 ball_1
2 ball basket 1 ball_1
3 ball bouncy 1 ball_1
4 ball tennis 1 ball_2
5 ball tennis 2 ball_2
6 chair office 1 chair_1
7 chair office 2 chair_1
8 chair office 3 chair_1
9 chair lounge 1 chair_1
10 chair dining 1 chair_1
11 chair dining 2 chair_1
If you also need the grouping randomized, shuffle the rows first with DataFrame.sample:
N = 3
df = df.sample(frac=1)
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), '_')
print(df)
type object index group
10 chair dining 1 chair_1
8 chair office 3 chair_1
2 ball basket 1 ball_1
1 ball soccer 2 ball_1
7 chair office 2 chair_1
0 ball soccer 1 ball_1
9 chair lounge 1 chair_1
4 ball tennis 1 ball_1
6 chair office 1 chair_1
3 ball bouncy 1 ball_2
5 ball tennis 2 ball_1
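If the shuffle should also be reproducible, pass a seed through random_state (the value 42 here is an arbitrary choice):
df = df.sample(frac=1, random_state=42)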
I have a pandas data frame 'High' as
segment sales
Milk 10
Chocolate 30
and another data frame 'Low' as
segment sku sales
Milk m2341 2
Milk m235 3
Chocolate c132 2
Chocolate c241 5
Chocolate c891 3
I want to use the ratios from Low to disaggregate High. So my resulting data here would be
segment sku sales
Milk m2341 4
Milk m235 6
Chocolate c132 6
Chocolate c241 15
Chocolate c891 9
First, find the scale by which we need to multiply each product's sales.
df_agg = df_low[["segment", "sales"]].groupby(by=["segment"]).sum().merge(df_high, on="segment")
df_agg["scale"] = df_agg["sales_y"] / df_agg["sales_x"]
Then, apply the scale:
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]])
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
If needed, you can exclude extra columns.
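For example, to keep only the original columns, with the adjusted figure taking the place of sales (names as defined above):
df_disagg_high = df_disagg_high.drop(columns=["sales", "scale"]).rename(columns={"adjusted_sale": "sales"})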
Try:
df_low["sales"] = df_low.sales.mul(
df_low.merge(
df_high.set_index("segment")["sales"].div(
df_low.groupby("segment")["sales"].sum()
),
on="segment",
)["sales_y"]
).astype(int)
print(df_low)
Prints:
segment sku sales
0 Milk m2341 4
1 Milk m235 6
2 Chocolate c132 6
3 Chocolate c241 15
4 Chocolate c891 9
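One caveat: astype(int) truncates toward zero, which is only safe because the ratios here come out exact. A slightly more defensive spelling of the same computation, using transform for the per-segment totals and rounding before the integer cast (a variant sketch, not the original answer):
ratio = df_low["segment"].map(df_high.set_index("segment")["sales"]) / df_low.groupby("segment")["sales"].transform("sum")
df_low["sales"] = (df_low["sales"] * ratio).round().astype(int)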
I have 2 datasets (in CSV format) of different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that, within each category, the rows of df_new sit on top of the rows of df_old.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that rows whose index, category, and text all match (like [0, spam, you win much money]) end up duplicated, and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this, but those approaches remove either the category or the text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
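Applied to the concat from the question, that looks like (frame names as in the question):
df_concat = (pd.concat([df_new, df_old], sort=False)
               .drop_duplicates(subset=['index', 'category', 'text'], keep='first')
               .sort_values('category', ascending=False, kind='mergesort'))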
Try concat + sort_values:
res = pd.concat((df_new, df_old)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
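Note that the key function maps any category outside {'spam', 'not_spam', 'neutral'} to NaN, which sorts last. If the set of categories can grow, an ordered Categorical is a more robust sort key (a sketch, not part of the original answer):
res['category'] = pd.Categorical(res['category'], ['spam', 'not_spam', 'neutral'], ordered=True)
res = res.sort_values('category')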
Your code seems right; add this to the concat result and it will remove your duplicates:
# these first lines create a new column 'index' and keep the rest of the code correct
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort')
df_concat = df_concat.drop_duplicates()
If you want to reindex the result (without, of course, changing the 'index' column):
df_concat = df_concat.drop_duplicates(ignore_index=True)
You can always use combine_first:
out = df_new.combine_first(df_old)
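Keep in mind that combine_first aligns on the DataFrame index, not on row content, so it only does what you want here if 'index' is the actual index. A minimal sketch of that setup, assuming 'index' starts out as an ordinary column:
out = df_new.set_index('index').combine_first(df_old.set_index('index')).reset_index()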
I have two dataframes. The first represents the nutritional information of certain ingredients with ingredients as rows and the columns as the nutritional categories.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are at the top, representing the entire nutritional value of the meal based on the number of ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
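The computation described here is a matrix product: the meals-by-ingredients quantity matrix times the ingredients-by-nutrients matrix. A minimal sketch, under the assumptions that the ingredient column names in meals_df match the Item values in ingredients_df and that the nutrient columns begin at Calories (both assumptions based on the samples above):
def create_meal_category_dataframe(ingredients_df, meals_df):
    # nutrient values per ingredient, keyed by ingredient name
    nutrients = ingredients_df.set_index('Item').loc[:, 'Calories':].apply(pd.to_numeric, errors='coerce')
    # quantity matrix: meals as rows, ingredients as columns
    quantities = meals_df.set_index('Meal')
    # (meals x ingredients) @ (ingredients x nutrients) -> meals x nutrients
    meal_cat_df = quantities @ nutrients.reindex(quantities.columns)
    return meal_cat_df.reset_index()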
I have a table that has primary sector as one column with different entries. I need to add one more column, major sector, which is to be picked up from a mapping table. How can this task be achieved?
Sample Data
Primary Sector Major Sector
Skating
Painting
Engineer
Running
Gardening
Administrator
tennis
Reading
Cricket
Accountant
Mapping Table
Job Hobby Sports
Skating 0 0 1
Painting 0 1 0
Engineer 1 0 0
Running 0 0 1
Gardening 0 1 0
Administrator 1 0 0
tennis 0 0 1
Reading 0 1 0
Cricket 0 0 1
Accountant 1 0 0
Use map with idxmax, passing axis=1 so that, for each row, the name of the column holding the maximum is returned:
df1['Major Sector'] = df1['Primary Sector'].map(df2.idxmax(axis=1))
print(df1)
Primary Sector Major Sector
0 Skating Sports
1 Painting Hobby
2 Engineer Job
3 Running Sports
4 Gardening Hobby
5 Administrator Job
6 tennis Sports
7 Reading Hobby
8 Cricket Sports
9 Accountant Job
print(df2.idxmax(axis=1))
Skating Sports
Painting Hobby
Engineer Job
Running Sports
Gardening Hobby
Administrator Job
tennis Sports
Reading Hobby
Cricket Sports
Accountant Job
dtype: object
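Note that df2.idxmax(axis=1) only works as a mapper here because the sector names (Skating, Painting, ...) are the index of the mapping table. If you load the table from CSV, make the first column the index (the file name below is a placeholder):
df2 = pd.read_csv('mapping.csv', index_col=0)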
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first:
# rank each word within its topic by descending score (0 = highest)
df['Rank'] = df.groupby('topic')['score'].rank(method='first', ascending=False).astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', columns='topic', values='New_str'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
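On modern pandas, an equivalent route is to number the rows within each topic with cumcount after sorting by score, which sidesteps rank entirely (a sketch that produces the same table, given the data above):
df['Rank'] = df.sort_values('score', ascending=False).groupby('topic').cumcount()
df['New_str'] = df['word'] + df['score'].map(' ({0:.2f})'.format)
print(df.pivot(index='Rank', columns='topic', values='New_str'))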