I have a DataFrame:
   date       descriptions            Code
1. 1/1/2020   this is aPple           6546
2. 21/8/2019  this is fan for him     4478
3. 15/3/2020  this is ball of hockey  5577
4. 12/2/2018  this is Green apple     7899
5. 13/3/2002  this is iron fan        7788
6. 14/5/2020  this ball is soft       9991
I want to create a new column 'category': if the descriptions column contains the word apple, fan, or ball (in capital or small letters), then the value A001, F009, or B099 respectively should be entered in the category column. The required DataFrame would be:
   date       descriptions            Code  category
1. 1/1/2020   this is aPple           6546  A001
2. 21/8/2019  this is fan for him     4478  F009
3. 15/3/2020  this is ball of hockey  5577  B099
4. 12/2/2018  this is Green apple     7899  A001
5. 13/3/2002  this is iron fan        7788  F009
6. 14/5/2020  this ball is soft       9991  B099
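For reference, a minimal construction of the sample frame (a sketch; column names taken from the tables above):
import pandas as pd

df = pd.DataFrame({
    'date': ['1/1/2020', '21/8/2019', '15/3/2020', '12/2/2018', '13/3/2002', '14/5/2020'],
    'descriptions': ['this is aPple', 'this is fan for him', 'this is ball of hockey',
                     'this is Green apple', 'this is iron fan', 'this ball is soft'],
    'Code': [6546, 4478, 5577, 7899, 7788, 9991],
})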
Use str.extract to pull the matching keyword out of the string column, then map it to a code:
d = {'apple': 'A001', 'ball': 'B099', 'fan': 'F009'}
df['category'] = (
    df.descriptions
    .str.lower()                                  # make matching case-insensitive
    .str.extract('(' + '|'.join(d.keys()) + ')')  # extract the first keyword found
    .squeeze()                                    # one-column frame -> Series
    .map(d)                                       # keyword -> category code
)
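On the sample frame this reproduces the expected category column. Note that a description containing none of the keywords yields NaN, since str.extract returns NaN for non-matches.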
You can use numpy.select, which allows selecting among multiple conditions.
import numpy as np

content = ["apple", "fan", "ball"]
condlist = [df.descriptions.str.lower().str.contains(word) for word in content]
choicelist = ["A001", "F009", "B099"]
df["category"] = np.select(condlist, choicelist)
df
        date            descriptions  Code category
0   1/1/2020           this is aPple  6546     A001
1  21/8/2019     this is fan for him  4478     F009
2  15/3/2020  this is ball of hockey  5577     B099
3  12/2/2018     this is Green apple  7899     A001
4  13/3/2002        this is iron fan  7788     F009
5  14/5/2020       this ball is soft  9991     B099
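One caveat: np.select fills rows that match no condition with 0 by default. Passing default keeps the column string-typed; a sketch:
# Rows matching no keyword get '' instead of numpy's default 0.
df["category"] = np.select(condlist, choicelist, default="")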
I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].
Let's say it's something like this:
RestaurantName  City      Restaurant ID  Cuisines
Restaurant A    Milan     31333          French, Spanish, Italian
Restaurant B    Shanghai  63551          Pizza, Burgers
Restaurant C    Dubai     7991           Burgers, Ice Cream
Here's copyable code as a sample:
import pandas as pd

rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})
I used string split to expand them into 8 different columns and added it to the dataframe.
csnsplit = rst.Cuisines.str.split(", ", expand=True)
rst["Cuisine1"] = csnsplit.loc[:, 0]
rst["Cuisine2"] = csnsplit.loc[:, 1]
rst["Cuisine3"] = csnsplit.loc[:, 2]
rst["Cuisine4"] = csnsplit.loc[:, 3]
rst["Cuisine5"] = csnsplit.loc[:, 4]
rst["Cuisine6"] = csnsplit.loc[:, 5]
rst["Cuisine7"] = csnsplit.loc[:, 6]
rst["Cuisine8"] = csnsplit.loc[:, 7]
Which leaves me with this:
https://i.stack.imgur.com/AUSDY.png
Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns; say I want to see the top cuisine by city.
I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This leaves me with duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge like columns and keep only the ones with a count of 1, but I have no idea how to do that.
dummies = pd.get_dummies(data=rst, columns=["Cuisine1", "Cuisine2", "Cuisine3", "Cuisine4",
                                            "Cuisine5", "Cuisine6", "Cuisine7", "Cuisine8"])
print(dummies.columns.tolist())
Which leaves me with all of these columns:
https://i.stack.imgur.com/84spI.png
A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.
import numpy as np

AllCsn = np.concatenate((rst.Cuisine1.unique(),
                         rst.Cuisine2.unique(),
                         rst.Cuisine3.unique(),
                         rst.Cuisine4.unique(),
                         rst.Cuisine5.unique(),
                         rst.Cuisine6.unique(),
                         rst.Cuisine7.unique(),
                         rst.Cuisine8.unique()))
AllCsn = np.unique(AllCsn.astype(str))
AllCsn
Which leaves me with this:
https://i.stack.imgur.com/O9OpW.png
I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.
I am new to this, so please bear with me and let me know if I need to provide any more info.
It sounds like you're looking for str.split without expanding, then explode:
rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')
Creates a frame like:
RestaurantName City RestaurantID Cuisines
0 Rest A Milan 31333 French
0 Rest A Milan 31333 Spanish
0 Rest A Milan 31333 Italian
1 Rest B Shanghai 63551 Pizza
1 Rest B Shanghai 63551 Burgers
2 Rest C Dubai 7991 Burgers
2 Rest C Dubai 7991 Ice Cream
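(Note that DataFrame.explode requires pandas 0.25 or later.)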
Then it sounds like you want either crosstab:
pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines Burgers French Ice Cream Italian Pizza Spanish
City
Dubai 1 0 1 0 0 0
Milan 0 1 0 1 0 1
Shanghai 1 0 0 0 1 0
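Since the stated goal is the top cuisine by city, idxmax across the crosstab's columns gives one winner per city (a sketch; ties resolve to the first column in alphabetical order):
pd.crosstab(rst['City'], rst['Cuisines']).idxmax(axis=1)

City
Dubai       Burgers
Milan        French
Shanghai    Burgers
dtype: object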
Or value_counts:
rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
City Cuisines counts
0 Dubai Burgers 1
1 Dubai Ice Cream 1
2 Milan French 1
3 Milan Italian 1
4 Milan Spanish 1
5 Shanghai Burgers 1
6 Shanghai Pizza 1
For the max value count per City, use groupby + head on the (already sorted) counts:
max_counts = (
    rst[['City', 'Cuisines']].value_counts()
    .groupby(level=0).head(1)
    .reset_index(name='counts')
)
max_counts:
City Cuisines counts
0 Dubai Burgers 1
1 Milan French 1
2 Shanghai Burgers 1
I have a pandas data frame 'High' as
segment sales
Milk 10
Chocolate 30
and another data frame 'Low' as
segment sku sales
Milk m2341 2
Milk m235 3
Chocolate c132 2
Chocolate c241 5
Chocolate c891 3
I want to use the ratios from Low to disaggregate High. So my resulting data here would be
segment sku sales
Milk m2341 4
Milk m235 6
Chocolate c132 6
Chocolate c241 15
Chocolate c891 9
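For reference, a minimal construction of the two frames (a sketch, using the names df_high and df_low that the answers below assume):
import pandas as pd

df_high = pd.DataFrame({'segment': ['Milk', 'Chocolate'], 'sales': [10, 30]})
df_low = pd.DataFrame({'segment': ['Milk', 'Milk', 'Chocolate', 'Chocolate', 'Chocolate'],
                       'sku': ['m2341', 'm235', 'c132', 'c241', 'c891'],
                       'sales': [2, 3, 2, 5, 3]})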
First, find the scale by which we need to multiply each product's sales:
df_agg = df_low.groupby("segment", as_index=False)["sales"].sum().merge(df_high, on="segment")
df_agg["scale"] = df_agg["sales_y"] / df_agg["sales_x"]
Then, apply the scale:
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]], on="segment")
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
If needed, you can exclude extra columns.
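For example, keeping only the disaggregated result:
df_disagg_high = df_disagg_high[['segment', 'sku', 'adjusted_sale']]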
Try:
df_low["sales"] = df_low.sales.mul(
df_low.merge(
df_high.set_index("segment")["sales"].div(
df_low.groupby("segment")["sales"].sum()
),
on="segment",
)["sales_y"]
).astype(int)
print(df_low)
Prints:
segment sku sales
0 Milk m2341 4
1 Milk m235 6
2 Chocolate c132 6
3 Chocolate c241 15
4 Chocolate c891 9
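Note that astype(int) truncates toward zero; the ratios here happen to produce whole numbers, but if they didn't, rounding before the cast would be safer.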
I have 2 datasets (in CSV format) with different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new with df_old so that the df_new rows go on top of df_old's rows within each category.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that rows whose index, category, and text all match (like [0, spam, you win much money]) end up duplicated, and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this, but those remove either the category or the text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
Try concat + sort_values:
res = pd.concat((df_new, df_old)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam': 0, 'not_spam': 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
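Note that the key parameter of sort_values requires pandas 1.1 or later. On older versions, an equivalent (using a hypothetical helper column named order) is:
res = (
    res.assign(order=res['category'].map({'spam': 0, 'not_spam': 1, 'neutral': 2}))
       .sort_values('order', kind='mergesort')
       .drop(columns='order')
)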
Your code seems right; try adding this to the concat result and it will remove your duplicates:
# these first lines create a new column 'index' and help the rest of the code work correctly
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
df_concat.drop_duplicates()
If you want to reindex it as well (without, of course, changing the 'index' column):
df_concat.drop_duplicates(ignore_index=True)
You can always do combine_first:
out = df_new.combine_first(df_old)
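Keep in mind that combine_first aligns on the index: it patches missing values in df_new with values from df_old, so it only matches the goal here if the duplicated rows share the same index labels.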
I have two dataframes. The first represents the nutritional information of certain ingredients with ingredients as rows and the columns as the nutritional categories.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are at the top, representing the entire nutritional value of the meal based on the number of ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
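This is essentially a matrix product: the (meals x ingredients) quantity table times the (ingredients x nutrients) table. A minimal sketch filling in the template, assuming the ingredient columns in meals_df are numeric and named exactly like the Item values in ingredients_df:
def create_meal_category_dataframe(ingredients_df, meals_df):
    # Ingredient quantities per meal: rows = meals, columns = ingredients.
    quantities = meals_df.set_index('Meal')
    # Nutrients per ingredient: rows = ingredients, columns = nutrient categories.
    # 'Brand and style' and 'Quantity' are descriptive, so drop them here.
    nutrients = ingredients_df.set_index('Item').drop(columns=['Brand and style', 'Quantity'])
    # Align ingredient columns to rows, then take the matrix product:
    # (meals x ingredients) @ (ingredients x nutrients) -> meals x nutrients.
    meal_cat_df = quantities.reindex(columns=nutrients.index).fillna(0).dot(nutrients)
    return meal_cat_df.reset_index()
With the example numbers, the Standard lunch row would come out to Calories 2*150 + 100 + 50 = 450.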
I have a dataset where each observation has a Date. Then I have a list of events. I want to filter the dataset and keep observations only if the date is within +/- 30 days of an event. Also, I want to know which event it is closest to.
For example, the main dataset looks like:
Product Date
Chicken 2008-09-08
Pork 2008-08-22
Beef 2008-08-15
Rice 2008-07-22
Coke 2008-04-05
Cereal 2008-04-03
Apple 2008-04-02
Banana 2008-04-01
It is generated by
d = {'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
     'Date': ['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
              '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'])
Then I have a column of events:
Date
2008-05-03
2008-07-20
2008-09-01
generated by
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})
GOAL (EDITED)
I want to keep the rows in df only if df['Date'] is within a month of event['Date']. For example, the first event occurred on 2008-05-03, so I want to keep observations between 2008-04-03 and 2008-06-03, and also create a new column to tell this observation is closest to the event on 2008-05-03.
Product Date Event
Chicken 2008-09-08 2008-09-01
Pork 2008-08-22 2008-09-01
Beef 2008-08-15 2008-07-20
Rice 2008-07-22 2008-07-20
Coke 2008-04-05 2008-05-03
Cereal 2008-04-03 2008-05-03
Use numpy broadcasting, assuming "within a month" means within 30 days:
df[np.any(np.abs(df.Date.values[:, None] - event.Date.values) / np.timedelta64(1, 'D') < 31, 1)]
Out[90]:
Product Date
0 Chicken 2008-09-08
1 Pork 2008-08-22
2 Beef 2008-08-15
3 Rice 2008-07-22
4 Coke 2008-04-05
5 Cereal 2008-04-03
Another option is pd.merge_asof with direction='nearest', which attaches the closest event to each row; then filter to the 30-day window:
event['eDate'] = event.Date
df = pd.merge_asof(df.sort_values('Date'), event.sort_values('Date'), on='Date', direction='nearest')
df[(df.Date - df.eDate).abs() <= '30 days']
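Here eDate carries the nearest event for each row, which doubles as the requested Event column.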
I would use a list comprehension with IntervalIndex:
ms = pd.DateOffset(months=1)
e1 = event.Date - ms
e2 = event.Date + ms
iix = pd.IntervalIndex.from_arrays(e1, e2, closed='both')
df.loc[[any(d in i for i in iix) for d in df.Date]]
Out[93]:
Product Date
2 Cereal 2008-04-03
3 Coke 2008-04-05
4 Rice 2008-07-22
5 Beef 2008-08-15
6 Pork 2008-08-22
7 Chicken 2008-09-08
If it's just months irrespective of the exact dates, this may be useful:
rng = []
for a, b in zip(event['Date'].dt.month - 1, event['Date'].dt.month + 1):
    rng = rng + list(range(a - 1, b + 1, 1))
df[df['Date'].dt.month.isin(set(rng))]