I have an Excel file with data like this
Fruits Description
oranges This is an orange
apples This is an apple
oranges This is also oranges
plum this is a plum
plum this is also a plum
grape I can make some wine
grape make it red
I'm turning this into a dictionary using the code below:
import pandas as pd
import xlrd
file = 'example.xlsx'
x1 = pd.ExcelFile(file)
print(x1.sheet_names)
df1 = x1.parse('Sheet1')
#print(df1)
print(df1.set_index('Fruits').T.to_dict('list'))
When I execute the above, I get this warning:
UserWarning: DataFrame columns are not unique, some columns will be omitted.
I want to have a dictionary that looks like this:
{'oranges': ['this is an orange', 'this is also oranges'], 'apples': ['this is an apple'],
'plum': ['This is a plum', 'this is also a plum'], 'grape': ['i can make some wine', 'make it red']}
How about this?
df1.groupby(['Fruits'])['Description'].apply(list).to_dict()
{'apples': ['This is an apple'],
'grape': ['make it red', 'I can make some wine'],
'oranges': ['This is an orange', 'This is also oranges'],
'plum': ['this is a plum', 'this is also a plum']}
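A fully self-contained version of the same idea (building the frame inline instead of reading Excel, so the snippet runs anywhere):

```python
import pandas as pd

# Same rows as the spreadsheet in the question.
df1 = pd.DataFrame({
    'Fruits': ['oranges', 'apples', 'oranges', 'plum', 'plum', 'grape', 'grape'],
    'Description': ['This is an orange', 'This is an apple', 'This is also oranges',
                    'this is a plum', 'this is also a plum',
                    'I can make some wine', 'make it red'],
})

# groupby keeps every row, so duplicate fruits collect into one list
# instead of overwriting each other the way set_index().T.to_dict() does.
result = df1.groupby('Fruits')['Description'].apply(list).to_dict()
print(result)
```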
I have a data frame with a column named text and want to assign values to a new column based on whether the text contains one or more substrings from a dictionary. If the text column contains a substring, I want the corresponding dictionary key to be assigned to the new column category.
This is what my code looks like:
import pandas as pd
some_strings = ['Apples and pears and cherries and bananas',
'VW and Ford and Lamborghini and Chrysler and Hyundai',
'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']
test_df = pd.DataFrame(some_strings, columns = ['text'])
cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}
The dictionary cat_map contains sets of strings as values. If the text column in test_df contains any of those words, then I want the key of the dictionary to be assigned as value to the new category column. The output dataframe should look like this:
output_frame = pd.DataFrame({'text': some_strings,
'category': categories})
Any help on this would be appreciated.
You can try:
d = {v: k for k, s in cat_map.items() for v in s}
test_df['category'] = (test_df['text'].str.extractall('(' + '|'.join(d) + ')')[0]
                       .map(d)
                       .groupby(level=0).agg(set))
print(d)
{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}
print(test_df)
text category
0 Apples and pears and cherries and bananas {fruits}
1 VW and Ford and Lamborghini and Chrysler and Hyundai {cars}
2 Berlin and Paris and Athens and London {capitals}
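If the vocabulary could ever contain regex metacharacters, a slightly more defensive variant escapes each word before joining the alternation (the `re.escape` call and the `\b` word boundaries are my additions; they are not needed for the sample data):

```python
import re
import pandas as pd

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}
test_df = pd.DataFrame({'text': ['Apples and pears and cherries and bananas',
                                 'VW and Ford and Lamborghini and Chrysler and Hyundai',
                                 'Berlin and Paris and Athens and London']})

# word -> category lookup, with each word escaped before it enters the pattern
d = {v: k for k, s in cat_map.items() for v in s}
pattern = r'\b(' + '|'.join(map(re.escape, d)) + r')\b'
test_df['category'] = (test_df['text'].str.extractall(pattern)[0]
                       .map(d)
                       .groupby(level=0).agg(set))
```

Matching is still case-sensitive, so 'Apples' in the first row does not match 'apples'; the row is classified via its other words.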
Not exactly sure what you're trying to achieve, but if I understood properly
you could check whether any word in the string is present in your cat_map:
import pandas as pd
results = {"text": [], "category": []}
for element in some_strings:
    for key, value in cat_map.items():  # .items() is needed to unpack (key, value) pairs
        # Check if any word of the current string is in the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)
df = pd.DataFrame.from_dict(results)
One approach:
lookup = { word : label for label, words in cat_map.items() for word in words }
pattern = fr"\b({'|'.join(lookup)})\b"
test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)
print(test_df)
Output
text category
0 Apples and pears and cherries and bananas fruits
1 VW and Ford and Lamborghini and Chrysler and H... cars
2 Berlin and Paris and Athens and London capitals
You can try this one:
results = {"text": [], "category": []}
for text in some_strings:
    for key in cat_map.keys():
        for word in set(text.split(" ")):
            if word in cat_map[key]:
                results["text"].append(text)
                results["category"].append(key)
df = pd.DataFrame.from_dict(results)
df = df.drop_duplicates()  # drop_duplicates returns a new frame, so reassign it
I have a table with multiple columns and repeating data in all of the columns except one (Address).
Last Name First Name Food Address
Brown James Apple 1
Brown Duke Apple 2
William Sam Apple 3
Miller Karen Apple 4
William Barry Orange 5
William Sam Orange 6
Brown James Orange 7
Miller Karen Banana 8
Brown Terry Banana 9
I want to merge all first names sharing the same last name and food into one entry, and keep the first address found when that condition is met.
The result will look like this:
Last Name First Name Food Address
Brown James Duke Apple 1
William Sam Apple 3
Miller Karen Apple 4
William Barry Sam Orange 5
Brown James Orange 7
Miller Karen Banana 8
Brown Terry Banana 9
Does anyone know of any functions in pandas (Python) that allow me to merge multiple cells into one? Also, what would be the best approach to solve this?
Thanks!
This should do the trick. There may be a faster way to put it all together, but in the end I pulled out the rows with repeated first names, transformed them, and put them back into the non-repeated dataframe.
I added another repeating row to be sure it worked with more than just two repeating names.
import pandas as pd

d = {'Last Name': ['Brown', 'Brown', 'Brown', 'William', 'Miller', 'William', 'William', 'Brown', 'Miller', 'Brown'],
     'First Name': ['Bill', 'James', 'Duke', 'Sam', 'Karen', 'Barry', 'Sam', 'James', 'Karen', 'Terry'],
     'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana'],
     'Address': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(d)

grp_df = df.groupby(['Last Name', 'Food'])
df_nonrepeats = df[grp_df['First Name'].transform('count') == 1]
df_repeats = df[grp_df['First Name'].transform('count') > 1]

def concat_repeats(x):
    # Join all first names in the group, then keep only the first row (first address)
    dff = x.copy()
    dff['First Name'] = ' '.join(dff['First Name'].tolist())
    return dff.head(1)

grp_df = df_repeats.groupby(['Last Name', 'Food'])
df_concats = grp_df.apply(concat_repeats)
df_final = (pd.concat([df_nonrepeats, df_concats[['Last Name', 'First Name', 'Food', 'Address']]])
              .sort_values('Address')
              .reset_index(drop=True))
print(df_final)
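The pull-apart/recombine steps can also be collapsed into a single groupby aggregation. A sketch of that alternative (the column selection at the end just restores the original layout):

```python
import pandas as pd

d = {'Last Name': ['Brown', 'Brown', 'Brown', 'William', 'Miller', 'William', 'William', 'Brown', 'Miller', 'Brown'],
     'First Name': ['Bill', 'James', 'Duke', 'Sam', 'Karen', 'Barry', 'Sam', 'James', 'Karen', 'Terry'],
     'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana'],
     'Address': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(d)

# One pass: join the first names and keep the first address per (Last Name, Food) group.
df_final = (df.groupby(['Last Name', 'Food'], as_index=False)
              .agg({'First Name': ' '.join, 'Address': 'first'})
              [['Last Name', 'First Name', 'Food', 'Address']]
              .sort_values('Address')
              .reset_index(drop=True))
print(df_final)
```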
I have a pandas dataframe such as:
basket_id = [1, 2, 3, 4, 5]
continents = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
df = pd.DataFrame({'basket_id': basket_id, 'continents': continents})
The baskets are equal, say 18 kg each, and each basket has an equal amount of each of its fruits: basket 2 has 9 kg of apples and 9 kg of oranges.
I want to know how much I have of each fruit. If each basket had only one type of fruit, I could simply apply value_counts and multiply by 18. But how can I get my answer now?
I expect the following:
fruits = ['apple', 'orange', 'pear']
amounts = [42, 15, 33]
df1 = pd.DataFrame({'fruits' : fruits , 'amounts(kg)' : amounts })
df1
The apples total 42 kg: 18 kg from basket 1, 9 kg from basket 2, 6 kg from basket 3, and 9 kg from basket 4.
You can use Series.str.split then Series.explode, count how many fruits are in each basket using GroupBy.transform, use Series.rdiv to get the relative weights within each basket, and finally group by fruit and take the sum.
out = df['continents'].str.split().explode()
amt = out.groupby(level=0).transform('count').rdiv(18).groupby(out).sum()
apple 42.0
orange 15.0
pear 33.0
Name: continents, dtype: float64
To get the exact output mentioned in the question, use Series.reset_index and then DataFrame.rename:
amt.reset_index(name='amounts(kg)').rename(columns={'index':'fruit'})
fruit amounts(kg)
0 apple 42.0
1 orange 15.0
2 pear 33.0
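Another route to the same totals, assuming no fruit repeats within a basket: build a 0/1 indicator matrix with Series.str.get_dummies and scale each row by the basket weight:

```python
import pandas as pd

df = pd.DataFrame({'basket_id': [1, 2, 3, 4, 5],
                   'continents': ['apple', 'apple orange', 'apple orange pear',
                                  'pear apple', 'pear']})

dummies = df['continents'].str.get_dummies(sep=' ')        # one 0/1 column per fruit
# Each row sums to the number of fruits in that basket, so dividing by the
# row sum and multiplying by 18 gives the per-fruit kilograms per basket.
amt = dummies.div(dummies.sum(axis=1), axis=0).mul(18).sum()
print(amt)
```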
So for each of the N items in a basket you want to add 18/N kg of that item? You can use a defaultdict(int), which generates a default value for unknown keys by calling int() (giving 0), and add the amounts to that.
baskets = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
from collections import defaultdict
amounts = defaultdict(int)
for basket in baskets:
    items = basket.split()
    for item in items:
        amounts[item] += 18 // len(items)  # integer division; exact here since 18 divides evenly by 1, 2 and 3
print(amounts)
# defaultdict(<class 'int'>, {'apple': 42, 'orange': 15, 'pear': 33})
# if you need a pandas output
import pandas as pd
print(pd.Series(amounts))
# apple 42
# orange 15
# pear 33
# dtype: int64
I came across this extremely well-explained similar question (Get last "column" after .str.split() operation on column in pandas DataFrame) and used some of the code found there. However, it's not quite the output that I would like.
raw_data = {
'category': ['sweet beverage, cola,sugared', 'healthy,salty snacks', 'juice,beverage,sweet', 'fruit juice,beverage', 'appetizer,salty crackers'],
'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df = pd.DataFrame(raw_data)
The objective is to extract the various categories from each row and use only the last two categories to create a new column. I have this code, which works and gives me the categories of interest as a new column.
df['my_col'] = df.category.apply(lambda s: s.split(',')[-2:])
output
my_col
[cola,sugared]
[healthy,salty snacks]
[beverage,sweet]
...
However, it appears as a list. How can I have it not appear as a list? Can this be achieved? Thanks all!
I believe you need str.split, then select the last two items of each list with str[-2:], and finally str.join:
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
EDIT:
In my opinion the pandas str text functions are preferable to apply with pure Python string functions, because they also work with NaNs and None:
import numpy as np
import pandas as pd

raw_data = {
    'category': [np.nan, 'healthy,salty snacks'],
    'product_name': ['coca-cola', 'salted pistachios']}
df = pd.DataFrame(raw_data)
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 NaN coca-cola NaN
1 healthy,salty snacks salted pistachios healthy,salty snacks
The apply-based version raises on the NaN row instead:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
AttributeError: 'float' object has no attribute 'split'
You can also use join in the lambda to the result of split:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
df
Result:
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user': ['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom',
                            'Nick', 'Nick', 'Nick', 'Nick', 'Nick'],
                   'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange',
                             'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately, the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix:
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count']
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.
You can take advantage of combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])

df.groupby('user').apply(func).reset_index(level=1, drop=True)
fruit_x fruit_y
user
Matt Plum Apple
Matt Plum Orange
Matt Plum Pear
Matt Apple Orange
Matt Apple Pear
Matt Orange Pear
Nick Orange Grape
Nick Orange Apple
Nick Orange Banana
Nick Orange Tomato
Nick Grape Apple
Nick Grape Banana
Nick Grape Tomato
Nick Apple Banana
Nick Apple Tomato
Nick Banana Tomato
Tom Grape Apple
Tom Grape Orange
Tom Grape Banana
Tom Apple Orange
Tom Apple Banana
Tom Orange Banana
You can then calculate the probability according to your program logic.
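One sketch of that last step: sort each user's fruits first so the same pair always comes out in the same order (the combinations above follow per-user row order, so e.g. Apple/Orange appears reversed for Nick), then the probability that a given user owns the bag is 1 over the number of users sharing that pair:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'user': ['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom',
                            'Nick', 'Nick', 'Nick', 'Nick', 'Nick'],
                   'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange',
                             'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})

rows = []
for user, grp in df.groupby('user'):
    # sorted() canonicalises the pair order, so Apple/Orange == Orange/Apple
    for a, b in combinations(sorted(grp['fruit']), 2):
        rows.append({'user': user, 'fruit_x': a, 'fruit_y': b})
pairs = pd.DataFrame(rows)

# Each (user, pair) row is unique (no user carries two of the same fruit),
# so the owner probability is 1 / number of users sharing that pair.
pairs['probability'] = 1 / pairs.groupby(['fruit_x', 'fruit_y'])['user'].transform('count')
```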