How to create a structured DataFrame from multiple arrays and lists? - python

I'm using Python 3.
I have three lists: one with the names of the distributors, one with the products, and one with the classifications of the products. I also have two arrays.
Every distributor offers all 15 products.
distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana', 'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries', 'strawberries', 'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']
actual_prix = np.random.rand(15, 5)
prix_prox_year = np.random.rand(15,5)
The structure of the arrays is the following: the rows are the products in order and the columns are the distributors in order.
And the output that I need is the following:
   Products Distributor      Actual   Next_year    Type
0     apple          d1  0.16147847  0.28173206   fruit
1       ...         ...         ...         ...   fruit
2     apple          d5         ...         ...   fruit
...     ...         ...         ...         ...     ...
15    boxes          d5         ...         ...  random
This is just an example; my real arrays have shape (1010, 33).
Any idea?

You can use product from itertools to create all of the combinations; the order of the arguments matters because it has to match the layout of your data. For the arrays, tile them so the values are repeated for each element in tips, then ravel them into a single long array so that the lengths match.
I changed one of your arrays to an increasing count so that it's obvious what is going on.
Sample Data
import numpy as np
distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana',
            'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries', 'strawberries',
            'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']
actual_prix = np.arange(15*5).reshape(15,5)
prix_prox_year = np.random.rand(15,5)
from itertools import product
import pandas as pd
df = (pd.DataFrame([*product(products, tips, distributors)],
                   columns=['Products', 'Type', 'Distributor'])
        .assign(Actual=np.tile(actual_prix, len(tips)).ravel(),
                Next_year=np.tile(prix_prox_year, len(tips)).ravel()))
print(df)
    Products        Type Distributor  Actual  Next_year
0      apple       fruit          d1       0   0.391903
1      apple       fruit          d2       1   0.378865
2      apple       fruit          d3       2   0.056134
3      apple       fruit          d4       3   0.623146
4      apple       fruit          d5       4   0.879184
5      apple  vegetables          d1       0   0.391903
6      apple  vegetables          d2       1   0.378865
...
219    boxes  vegetables          d5      74   0.804884
220    boxes      random          d1      70   0.900764
221    boxes      random          d2      71   0.455267
222    boxes      random          d3      72   0.489814
223    boxes      random          d4      73   0.054597
224    boxes      random          d5      74   0.804884
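The output above pairs every product with every type. The question's expected output instead suggests that each product has exactly one type. If so, a variant without the cross product might look like the sketch below. The product_type mapping is hypothetical (the question does not say which product belongs to which type), and the arrays are shortened to three products so the result is easy to check.

```python
import numpy as np
import pandas as pd

distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato']

# Hypothetical product -> type mapping; the question does not provide one.
product_type = {'apple': 'fruit', 'carrot': 'vegetables', 'potato': 'vegetables'}

# Rows are products and columns are distributors, as described in the question.
shape = (len(products), len(distributors))
actual_prix = np.arange(len(products) * len(distributors)).reshape(shape)
prix_prox_year = actual_prix + 100

df = pd.DataFrame({
    # Each product is repeated once per distributor ...
    'Products': np.repeat(products, len(distributors)),
    # ... and the distributor list is cycled once per product.
    'Distributor': np.tile(distributors, len(products)),
    # ravel() walks the arrays row by row, which matches that ordering.
    'Actual': actual_prix.ravel(),
    'Next_year': prix_prox_year.ravel(),
})
df['Type'] = df['Products'].map(product_type)
print(df)
```

Because ravel() reads row by row, the same pattern should carry over to the (1010, 33) arrays as long as the rows are still products and the columns are still distributors.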

You can use itertools and set functions to get the unique combinations and then put them into a DataFrame.
import itertools
store = ["a", "b", "c"]
prods = ['apple', 'banana']
all_combinations = [list(zip(each_permutation, prods)) for each_permutation in itertools.permutations(store, len(prods))]
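The snippet above stops at a nested list. One possible way to finish the step the answer describes (de-duplicate, then build a DataFrame) is sketched here. The column names store and product are chosen only for illustration.

```python
import itertools
import pandas as pd

store = ["a", "b", "c"]
prods = ['apple', 'banana']

all_combinations = [list(zip(each_permutation, prods))
                    for each_permutation in itertools.permutations(store, len(prods))]

# Flatten the nested lists and keep each (store, product) pair only once.
unique_pairs = sorted({pair for combo in all_combinations for pair in combo})

# Illustrative column names; they are not taken from the original answer.
pairs_df = pd.DataFrame(unique_pairs, columns=['store', 'product'])
print(pairs_df)
```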

Related

Add categories present in a dictionary but not in the data to the counts output - python

I am creating counts of the categories in each column of a df. Not all possible categories are present in a column, but they are all stored in a dictionary. Is there a way to add the categories that are missing from the data back into the value_counts output? See below for some code and the expected output. There are many of these columns, so appending the missing categories manually at the end is not practical. Thank you so much!
Dictionary with all possible responses:
df_dic = {'veggie': ['cucumber', 'broccoli', 'spinach', 'kale', 'potatoe', 'pepper', 'tomatoe'],
          'fruit': ['banana', 'orange', 'grapes', 'pear', 'melon', 'apple']}
Data:
df = pd.DataFrame([('cucumnber', 'apple'),
                   ('broccoli', 'pear'),
                   ('spinach', 'orange'),
                   ('spinach', 'orange'),
                   ('kale', 'apple'),
                   ('kale', 'banana'),
                   ('potatoe', 'pear')],
                  columns=['veggie', 'fruit'])
value_counts command:
dat = []
for col in df:
    out_num = pd.DataFrame(df[col].value_counts()).sort_index().add_suffix('_num')
    out_per = pd.DataFrame(df[col].value_counts(normalize=True)*100).sort_index().add_suffix('_per')
    out = pd.concat([out_num, out_per], axis=1)
    dat.append(out)
Output, e.g. for dat[0]:
           veggie_num  veggie_per
broccoli            1   14.285714
cucumnber           1   14.285714
kale                2   28.571429
potatoe             1   14.285714
spinach             2   28.571429
Expected output:
          veggie_num  veggie_per
brocoli            1   14.285714
cucumber           1   14.285714
kale               2   28.571429
potatoe            1   14.285714
spinach            2   28.571429
pepper             0       00.00
tomatoe            0       00.00
Reindex using the values in df_dic before appending to dat:
dat = []
for col in df:
    out_num = pd.DataFrame(df[col].value_counts()).sort_index().add_suffix('_num')
    out_per = pd.DataFrame(df[col].value_counts(normalize=True)*100).sort_index().add_suffix('_per')
    out = pd.concat([out_num, out_per], axis=1).reindex(df_dic[col], fill_value=0)
    dat.append(out)
dat[0]:
          veggie_num  veggie_per
cucumber           1   14.285714
broccoli           1   14.285714
spinach            2   28.571429
kale               2   28.571429
potato             1   14.285714
pepper             0    0.000000
tomato             0    0.000000
*Note: the values in the data must be spelled exactly as in the dictionary for this to work correctly.
With some simplifications:
dat = []
for col in df:
    out = df[col].value_counts().to_frame().add_suffix('_num')
    out[f'{col}_per'] = (df[col].value_counts(normalize=True) * 100)
    out = out.reindex(df_dic[col], fill_value=0)
    dat.append(out)
dat[0]:
          veggie_num  veggie_per
cucumber           1   14.285714
broccoli           1   14.285714
spinach            2   28.571429
kale               2   28.571429
potato             1   14.285714
pepper             0    0.000000
tomato             0    0.000000
DataFrame and dict used:
df_dic = {
    'veggie': ['cucumber', 'broccoli', 'spinach', 'kale', 'potato', 'pepper',
               'tomato'],
    'fruit': ['banana', 'orange', 'grapes', 'pear', 'melon', 'apple']
}
df = pd.DataFrame([('cucumber', 'apple'),
                   ('broccoli', 'pear'),
                   ('spinach', 'orange'),
                   ('spinach', 'orange'),
                   ('kale', 'apple'),
                   ('kale', 'banana'),
                   ('potato', 'pear')],
                  columns=['veggie', 'fruit'])
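The column names produced by add_suffix depend on how value_counts names its result, and that naming has changed between pandas versions. A variant that sets the column names explicitly might look like the sketch below. It assumes every value in the data also appears in df_dic. Values missing from the dictionary are dropped by reindex, which also changes the base of the percentages.

```python
import pandas as pd

df_dic = {
    'veggie': ['cucumber', 'broccoli', 'spinach', 'kale', 'potato', 'pepper', 'tomato'],
    'fruit': ['banana', 'orange', 'grapes', 'pear', 'melon', 'apple'],
}
df = pd.DataFrame([('cucumber', 'apple'), ('broccoli', 'pear'), ('spinach', 'orange'),
                   ('spinach', 'orange'), ('kale', 'apple'), ('kale', 'banana'),
                   ('potato', 'pear')],
                  columns=['veggie', 'fruit'])

dat = []
for col in df:
    # Reindex first so categories absent from the data get a count of 0.
    counts = df[col].value_counts().reindex(df_dic[col], fill_value=0)
    # Name the columns explicitly instead of relying on add_suffix.
    out = pd.DataFrame({
        f'{col}_num': counts,
        f'{col}_per': counts / counts.sum() * 100,
    })
    dat.append(out)

print(dat[0])
```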

Use pandas (Python) to filter and combine multiple cells into one cell in Excel

I have a table with multiple columns, and the data repeats in every column except one (Address).
Last Name  First Name  Food    Address
Brown      James       Apple   1
Brown      Duke        Apple   2
William    Sam         Apple   3
Miller     Karen       Apple   4
William    Barry       Orange  5
William    Sam         Orange  6
Brown      James       Orange  7
Miller     Karen       Banana  8
Brown      Terry       Banana  9
I want to merge all first names sharing the same last name and food into one entry, and keep the first address found when that condition is met.
The result will look like this:
Does anyone know of any pandas functions that let me combine multiple cells into one? Also, what would be the best approach to solve this?
Thanks!
This should do the trick. There may be a faster way to put it all together, but in the end I pulled out the rows whose Last Name and Food combination occurs more than once, transformed them, and put them back together with the non-repeated rows.
I added another repeating row to be sure it works with more than just two repeating names.
import pandas as pd
d = {'Last Name': ['Brown', 'Brown', 'Brown', 'William', 'Miller', 'William', 'William', 'Brown', 'Miller', 'Brown'],
     'First Name': ['Bill', 'James', 'Duke', 'Sam', 'Karen', 'Barry', 'Sam', 'James', 'Karen', 'Terry'],
     'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana'],
     'Address': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(d)
# Split the rows by whether their (Last Name, Food) group has more than one row
grp_df = df.groupby(['Last Name', 'Food'])
df_nonrepeats = df[grp_df['First Name'].transform('count') == 1]
df_repeats = df[grp_df['First Name'].transform('count') > 1]
def concat_repeats(x):
    # Join all first names in the group and keep only the group's first row
    dff = x.copy()
    temp_list = ' '.join(dff['First Name'].tolist())
    dff['First Name'] = temp_list
    dff = dff.head(1)
    return dff
grp_df = df_repeats.groupby(['Last Name', 'Food'])
df_concats = grp_df.apply(lambda x: concat_repeats(x))
df_final = pd.concat([df_nonrepeats, df_concats[['Last Name', 'First Name', 'Food', 'Address']]]).sort_values('Address').reset_index(drop=True)
print(df_final)
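As a possibly shorter alternative to splitting and recombining the frame, the same merge can be sketched with a single groupby and agg, joining the first names and taking the first address per group. This is not the answer's method, only a swapped-in technique, and it is shown on the question's original nine rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Last Name': ['Brown', 'Brown', 'William', 'Miller', 'William',
                  'William', 'Brown', 'Miller', 'Brown'],
    'First Name': ['James', 'Duke', 'Sam', 'Karen', 'Barry',
                   'Sam', 'James', 'Karen', 'Terry'],
    'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Orange',
             'Orange', 'Orange', 'Banana', 'Banana'],
    'Address': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

merged = (
    # sort=False keeps the groups in order of first appearance.
    df.groupby(['Last Name', 'Food'], as_index=False, sort=False)
      # Join the first names with a space and keep the first address found.
      .agg({'First Name': ' '.join, 'Address': 'first'})
)
print(merged)
```

The result can then be written back to Excel with DataFrame.to_excel if needed.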

value_counts in a weighted manner with pandas

I have a pandas DataFrame such as:
basket_id = [1,2,3,4,5]
continents = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
df = pd.DataFrame({'basket_id' : basket_id , 'continents ' : continents })
The baskets all weigh the same, say 18 kg, and each basket holds an equal amount of each of its fruits: basket 2 has 9 kg of apple and 9 kg of orange.
I want to know how much I have of each fruit. If each basket held only one type of fruit, I could simply apply value_counts and multiply by 18. But how do I get my answer now?
I expect the following:
fruits = ['apple', 'orange', 'pear']
amounts = [42, 15, 33]
df1 = pd.DataFrame({'fruits' : fruits , 'amounts(kg)' : amounts })
df1
Apples total 42 kg: 18 kg from basket 1, 9 kg from basket 2, 6 kg from basket 3, and 9 kg from basket 4.
You can use Series.str.split followed by Series.explode. Then count how many fruits are in each basket using GroupBy.transform, use Series.rdiv to get the weight of each fruit within its basket, and finally group by fruit and take the sum.
out = df['continents '].str.split().explode()  # the column name in the question has a trailing space
amt = out.groupby(level=0).transform('count').rdiv(18).groupby(out).sum()
apple 42.0
orange 15.0
pear 33.0
Name: continents , dtype: float64
To get output in the shape shown in the question, name the index with Series.rename_axis and then call Series.reset_index:
amt.rename_axis('fruit').reset_index(name='amounts(kg)')
    fruit  amounts(kg)
0   apple         42.0
1  orange         15.0
2    pear         33.0
So for each N items in a basket you want to add 18/N kg of each item? You can use a defaultdict(int), which generates default values for unknown entries by calling int() (which is 0) and add the amounts to that.
baskets = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
from collections import defaultdict
amounts = defaultdict(int)
for basket in baskets:
    items = basket.split()
    for item in items:
        amounts[item] += 18 // len(items)
print(amounts)
# defaultdict(<class 'int'>, {'apple': 42, 'orange': 15, 'pear': 33})
# if you need a pandas output
import pandas as pd
print(pd.Series(amounts))
# apple 42
# orange 15
# pear 33
# dtype: int64
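The integer division above is exact here only because 18 is divisible by 1, 2 and 3. For other basket weights or basket sizes, a variant with true division might be safer. The helper name fruit_weights and its basket_weight parameter are hypothetical, introduced only for this sketch.

```python
from collections import defaultdict

def fruit_weights(baskets, basket_weight=18.0):
    """Split each basket's weight equally among the fruits it contains."""
    amounts = defaultdict(float)
    for basket in baskets:
        items = basket.split()
        for item in items:
            # True division avoids the truncation that // would introduce.
            amounts[item] += basket_weight / len(items)
    return dict(amounts)

baskets = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
weights = fruit_weights(baskets)
print(weights)
```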

Python Pandas finding column value based on multiple column values in same data frame

df:
no  fruit   price  city
1   apple   10     Pune
2   apple   20     Mumbai
3   orange  5      Nagpur
4   orange  7      Delhi
5   Mango   20     Bangalore
6   Mango   15     Chennai
Now I want to get the city name where fruit is 'orange' and price is 5.
df.loc[(df['fruit'] == 'orange') & (df['price'] == 5) , 'city'].iloc[0]
This does not work and raises the following error:
IndexError: single positional indexer is out-of-bounds
Version used: Python 3.5
You could create the masks step by step and see what they look like:
import pandas as pd
df = pd.DataFrame([{'city': 'Pune', 'fruit': 'apple', 'no': 1, 'price': 10},
                   {'city': 'Mumbai', 'fruit': 'apple', 'no': 2, 'price': 20},
                   {'city': 'Nagpur', 'fruit': 'orange', 'no': 3, 'price': 5},
                   {'city': 'Delhi', 'fruit': 'orange', 'no': 4, 'price': 7},
                   {'city': 'Bangalore', 'fruit': 'Mango', 'no': 5, 'price': 20},
                   {'city': 'Chennai', 'fruit': 'Mango', 'no': 6, 'price': 15}])
m1 = df['fruit'] == 'orange'
m2 = df['price'] == 5
df[m1 & m2]['city'].values[0]  # 'Nagpur'
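The IndexError in the question means the combined mask matched no rows, so .iloc[0] had nothing to return. One possible cause, and only an assumption since the question does not show the dtypes, is that the price column was loaded as strings, so comparing it with the integer 5 never matches. A sketch of checking for that and guarding the lookup:

```python
import pandas as pd

df = pd.DataFrame({
    'no': [1, 2, 3, 4, 5, 6],
    'fruit': ['apple', 'apple', 'orange', 'orange', 'Mango', 'Mango'],
    # Assumption for illustration: prices were read in as strings.
    'price': ['10', '20', '5', '7', '20', '15'],
    'city': ['Pune', 'Mumbai', 'Nagpur', 'Delhi', 'Bangalore', 'Chennai'],
})

# Strings never equal the integer 5, so this mask selects nothing.
mask = (df['fruit'] == 'orange') & (df['price'] == 5)
print(df.dtypes)
print(mask.sum())

# Convert the column to numbers, then guard against an empty result.
df['price'] = pd.to_numeric(df['price'])
matches = df.loc[(df['fruit'] == 'orange') & (df['price'] == 5), 'city']
city = matches.iloc[0] if not matches.empty else None
print(city)
```

Stray whitespace in the fruit values would have the same effect and can be checked the same way, by looking at each mask on its own.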
A scalable and programmable solution that uses a MultiIndex (see "Advanced indexing with hierarchical index"):
Variables:
search_columns = ['fruit', 'price']
search_values = ['orange', '5']  # values must match the column dtype; prices are strings in the snippet below
target_column = 'city'
Make the search columns the index of the df:
df_temp = df.set_index(search_columns)
Use the loc method to get the value:
value = df_temp.loc[tuple(search_values), target_column]
The result is either a scalar for <=2 search columns or a pd.Series for >2 search columns.
Snippet:
import pandas as pd
columns = "fruit price city".split()
# Materialize the zip so the rows are a plain list of tuples
data = list(zip(
    'apple apple orange orange Mango Mango'.split(),
    '10 20 5 7 20 15'.split(),
    'Pune Mumbai Nagpur Delhi Bangalore Chennai'.split()
))
df = pd.DataFrame(data=data, columns=columns)
search_columns = ['fruit', 'price']
search_values = ['orange', '5']
target_column = 'city'
df_temp = df.set_index(search_columns)
value = df_temp.loc[tuple(search_values), target_column]
print(value)
Result: Nagpur

Using python and pandas to create combinations instead of permutations

I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user': ['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom',
                            'Nick', 'Nick', 'Nick', 'Nick', 'Nick'],
                   'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana',
                             'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix.
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()  # copy so the new columns are not set on a view
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count'] * 1.0
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.
You can take advantage of combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])
df.groupby('user').apply(func).reset_index(level=1, drop=True)
     fruit_x fruit_y
user
Matt    Plum   Apple
Matt    Plum  Orange
Matt    Plum    Pear
Matt   Apple  Orange
Matt   Apple    Pear
Matt  Orange    Pear
Nick  Orange   Grape
Nick  Orange   Apple
Nick  Orange  Banana
Nick  Orange  Tomato
Nick   Grape   Apple
Nick   Grape  Banana
Nick   Grape  Tomato
Nick   Apple  Banana
Nick   Apple  Tomato
Nick  Banana  Tomato
Tom    Grape   Apple
Tom    Grape  Orange
Tom    Grape  Banana
Tom    Apple  Orange
Tom    Apple  Banana
Tom   Orange  Banana
You can then calculate the probability according to your program logic.
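One way that last step could look is sketched below. In the output above the order within a pair depends on each user's row order (Matt has Apple, Orange while Nick has Orange, Apple), so the sketch sorts each user's fruit first to give every pair one canonical order. It relies on the question's assumption that no user brings the same fruit twice, which means each user contributes a pair at most once and the probability reduces to 1 / pair_count.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    'user': ['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom',
             'Nick', 'Nick', 'Nick', 'Nick', 'Nick'],
    'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana',
              'Orange', 'Grape', 'Apple', 'Banana', 'Tomato'],
})

# Sorting each user's fruit makes Apple+Orange and Orange+Apple the same pair.
rows = [
    (user, fruit_x, fruit_y)
    for user, fruits in df.groupby('user')['fruit']
    for fruit_x, fruit_y in combinations(sorted(fruits), 2)
]
pairs = pd.DataFrame(rows, columns=['user', 'fruit_x', 'fruit_y'])

# Number of users whose bag contains the pair.
pairs['pair_count'] = pairs.groupby(['fruit_x', 'fruit_y'])['user'].transform('count')
# With no repeated fruit per user, each user holds a given pair exactly once.
pairs['probability'] = 1 / pairs['pair_count']
print(pairs)
```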
