I am trying to write a solution that returns a dataframe of all the raw_components by item, given the following two dataframes (items & components).
These two dataframes are linked by using item_code as the key when going from items to components. When going from components to items, sub_component in the components dataframe is used instead to relate to item in the items dataframe.
items: This dataframe holds all items that contain sub-components. An item could also be a sub-component of another item. Raw components would not exist here.
items = pd.DataFrame({'item': ['a', 'b'],
                      'item_code': [1, 2]})
items.set_index('item', inplace=True)
item_code
item
a 1
b 2
components: This dataframe lists the sub-components of each item in the items dataframe. If a sub_component doesn't exist in the items dataframe (as an item), then it is a raw_component in expected_result.
components = pd.DataFrame({'item_code': [1, 1, 1, 1, 2],
                           'sub_component': ['a1', 'a2', 'a3', 'b', 'b1'],
                           'quantity': [1, 2, 1, 1, 4]})
components.set_index('item_code', inplace=True)
sub_component quantity
item_code
1 a1 1
1 a2 2
1 a3 1
1 b 1
2 b1 4
The dataframe created below is the expected output based on the current dataset.
expected_result = pd.DataFrame({'item': ['a', 'a', 'a', 'a', 'b'],
                                'raw_component': ['a1', 'a2', 'a3', 'b1', 'b1'],
                                'quantity': [1, 2, 1, 4, 4]})
expected_result.set_index('item', inplace=True)
raw_component quantity
item
a a1 1
a a2 2
a a3 1
a b1 4
b b1 4
I tried using recursion and loops but I've been unable to figure out a solution. The challenge is that an item might have a sub-component that also has sub-components, and the number of layers is not provided.
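For reference, here is a minimal recursive sketch (my own, not from the original post) that produces expected_result from the two frames above; it multiplies quantities down the tree and assumes the item/component graph has no cycles:

def explode(item_code, multiplier=1):
    """Yield (raw_component, quantity) pairs for one item_code."""
    for _, row in components.loc[[item_code]].iterrows():
        sub, qty = row['sub_component'], row['quantity'] * multiplier
        if sub in items.index:
            # the sub-component is itself an item: recurse into its components
            yield from explode(items.loc[sub, 'item_code'], qty)
        else:
            # the sub-component is raw: emit it
            yield sub, qty

rows = [(item, sub, qty)
        for item, code in items['item_code'].items()
        for sub, qty in explode(code)]
result = pd.DataFrame(rows, columns=['item', 'raw_component', 'quantity']).set_index('item')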
I'm trying to union several pd.DataFrames by stacking them along the row axis, using the index to remove duplicates (A and B are from the same source "table", filtered by different predicates, and I'm trying to recombine them).
A = pd.DataFrame({"values": [1, 2]}, pd.MultiIndex.from_tuples([(1,1),(1,2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]}, pd.MultiIndex.from_tuples([(1,2),(2,1),(2,2)], names=('l1', 'l2')))
pd.concat([A,B]).drop_duplicates() fails since it ignores the index and de-duplicates on the values, so it removes index item (2,2).
pd.concat([A.reset_index(),B.reset_index()]).drop_duplicates(subset=('l1', 'l2')).set_index(['l1', 'l2']) does what I want, but I feel like there should be a better way.
You can do a simple concat and filter out duplicates using index.duplicated:
df1 = pd.concat([A,B])
df1[~df1.index.duplicated()]
Out[123]:
values
l1 l2
1 1 1
2 2
2 1 3
2 2
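If you would rather keep the later occurrence of each duplicated index entry (here, B's value for (1, 2)), index.duplicated also accepts keep='last':

df1 = pd.concat([A, B])
df1[~df1.index.duplicated(keep='last')]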
I have a dictionary and a dataframe. The dictionary contains a mapping of one letter to one number and the dataframe has a row containing these specific letters and another row containing these specific numbers, adjacent to each other (not that it necessarily matters).
I want to update the row containing the numbers by matching each letter in the row of the dataframe with the letter in the dictionary and then replacing the corresponding number (number in the same column as the letter) with the value of that letter from the dictionary.
df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
dict = {'a':2, 'b':3, 'c':5}
Let's say dict is the dictionary and df is the dataframe; I want the result to be df2.
df2 = pd.DataFrame(np.array([[2, 3, 5], ['a', 'b', 'c'], [7, 8, 9]]))
df
0 1 2
0 4 5 6
1 a b c
2 7 8 9
dict
{'a': 2, 'b': 3, 'c': 5}
df2
0 1 2
0 2 3 5
1 a b c
2 7 8 9
I do not know how to use merge or join to fix this. My initial thought is to make the dictionary a dataframe object, but I am not sure where to go from there.
It's a little weird, but:
df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
d = {'a': 2, 'b': 3, 'c': 5}
df.iloc[0] = df.iloc[1].map(lambda x: d[x] if x in d.keys() else x)
df
# 0 1 2
# 0 2 3 5
# 1 a b c
# 2 7 8 9
I couldn't bring myself to redefine dict to be a particular dictionary. :D
After receiving a much-deserved smackdown regarding the speed of apply, I present to you the theoretically faster approach below:
df.iloc[0] = df.iloc[1].map(d).where(df.iloc[1].isin(d.keys()), df.iloc[0])
This gives you the dictionary value from d (df.iloc[1].map(d)) if the value in row 1 is among the keys of d (.where(df.iloc[1].isin(d.keys()), ...)); otherwise it gives you the value already in row 0 (df.iloc[0]).
Hope this helps!
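A similar one-liner (a sketch in the same spirit, relying on map returning NaN for values missing from d) uses fillna instead of where:

df.iloc[0] = df.iloc[1].map(d).fillna(df.iloc[0])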
I need to combine all iterations of subgroups, apply a function to each combination, and return a single-value output along with a concatenated string identifying which subgroups were combined.
I understand how to use pd.groupby and can set level=0 or level=1 and then call .agg({'LOOPED_AVG': 'mean'}). However, I need to group (or subset) rows by subgroup, combine all rows from an iteration, and then apply the function to it.
Input data table:
MAIN_GROUP SUB_GROUP CONCAT_GRP_NAME X_1
A 1 A1 9
A 1 A1 6
A 1 A1 3
A 2 A2 7
A 3 A3 9
B 1 B1 7
B 1 B1 3
B 2 B2 7
B 2 B2 8
C 1 C1 9
Desired result:
LOOP_ITEMS LOOPED_AVG
A1 B1 C1 6.166666667
A1 B2 C1 7
A2 B1 C1 6.5
A2 B2 C1 7.75
A3 B1 C1 7
A3 B2 C1 8.25
Assuming that you have three main groups (and hence three items per combination), you can apply the following; for more groups, adjust the script accordingly. I wanted to give you a way to solve the problem; this may not be the most efficient way, but it gives a starting point.
import pandas as pd
import numpy as np
ls = [
    ['A', 1, 'A1', 9],
    ['A', 1, 'A1', 6],
    ['A', 1, 'A1', 3],
    ['A', 2, 'A2', 7],
    ['A', 3, 'A3', 9],
    ['B', 1, 'B1', 7],
    ['B', 1, 'B1', 3],
    ['B', 2, 'B2', 7],
    ['B', 2, 'B2', 8],
    ['C', 1, 'C1', 9],
]
#convert to dataframe
df = pd.DataFrame(ls, columns=['Main_Group', 'Sub_Group', 'Concat_GRP_Name', 'X_1'])
#get count and sum of concatenated groups
df_sum = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum','count']).reset_index()
#use itertools.permutations to enumerate candidate orderings of the group names
import itertools as it

def compute_combinations(df, colname, main_group_series):
    l = []
    perms = it.permutations(df[colname])
    # provides a sorted list of unique values in the Series, e.g. ['A', 'B', 'C']
    unique_groups = np.unique(main_group_series)
    for perm_pairs in perms:
        #keep only the first three entries of each permutation (one per main group),
        #and make sure the first starts with A, the second with B, and the third with C
        if all([main_group in perm_pairs[ind] for ind, main_group in enumerate(unique_groups)]):
            l.append([perm_pairs[ind] for ind in range(unique_groups.shape[0])])
    return l

t = compute_combinations(df_sum, 'Concat_GRP_Name', df['Main_Group'])
#convert to dataframe and drop duplicate pairs
df2 = pd.DataFrame(t, columns=['Item1', 'Item2', 'Item3']).drop_duplicates()
#join the sums and counts for each Concat_GRP_Name onto df2;
#since there are three item columns, we must apply this three times
merged = df2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']],
                   left_on=['Item1'], right_on=['Concat_GRP_Name'], how='inner')\
            .drop(['Concat_GRP_Name'], axis=1)\
            .rename({'sum': 'item1_sum', 'count': 'item1_count'}, axis=1)
merged2 = merged.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']],
                       left_on=['Item2'], right_on=['Concat_GRP_Name'], how='inner')\
                .drop(['Concat_GRP_Name'], axis=1)\
                .rename({'sum': 'item2_sum', 'count': 'item2_count'}, axis=1)
merged3 = merged2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']],
                        left_on=['Item3'], right_on=['Concat_GRP_Name'], how='inner')\
                 .drop(['Concat_GRP_Name'], axis=1)\
                 .rename({'sum': 'item3_sum', 'count': 'item3_count'}, axis=1)
#get the sum of all of the item_sum cols
merged3['sums']= merged3[['item3_sum', 'item2_sum', 'item1_sum']].sum(axis = 1)
#get sum of all the item_count cols
merged3['counts']= merged3[['item3_count', 'item2_count', 'item1_count']].sum(axis = 1)
#find the average
merged3['LOOPED_AVG'] = merged3['sums'] / merged3['counts']
#remove irrelevant fields
merged3 = merged3.drop(['item3_count', 'item2_count', 'item1_count', 'item3_sum', 'item2_sum', 'item1_sum', 'counts', 'sums' ], axis = 1)
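As a shorter alternative (my own sketch, not part of the answer above): since each combination takes exactly one subgroup label per main group, you can take the Cartesian product of the unique labels per main group with itertools.product and average the matching rows directly:

from itertools import product

#one array of unique subgroup labels per main group, e.g. ['A1' 'A2' 'A3'], ['B1' 'B2'], ['C1']
groups = df.groupby('Main_Group')['Concat_GRP_Name'].unique()

rows = []
for combo in product(*groups):
    #all rows belonging to any label in this combination
    subset = df[df['Concat_GRP_Name'].isin(combo)]
    rows.append({'LOOP_ITEMS': ' '.join(combo), 'LOOPED_AVG': subset['X_1'].mean()})

result = pd.DataFrame(rows)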
Base scenario
For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient to create this mapping, like so:
ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
           {'user_id': 1, 'item_id': 3, 'rating': 1.0},
           {'user_id': 3, 'item_id': 1, 'rating': 1.0},
           {'user_id': 3, 'item_id': 3, 'rating': 1.0}]
df = pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])
df = df.set_index(['user_id', 'item_id'])
df
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
This then allows me to get the continuous mappings like so:
df.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1], dtype='int8')
df.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1], dtype='int8')
Afterwards, I can map them back using the df.index.levels[0].get_loc method. Great!
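For example (using the frame above), the level values and get_loc give the two directions of the mapping:

df.index.levels[0][0]          # integer position 0 -> user_id 1
df.index.levels[0].get_loc(3)  # user_id 3 -> integer position 1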
Extension
But, now I'm trying to streamline my model training process, ideally by training it incrementally on new data, preserving the old ID mappings. Something like:
new_ratings = [{'user_id': 2, 'item_id': 1, 'rating': 1.0},
               {'user_id': 2, 'item_id': 2, 'rating': 1.0}]
df2 = pd.DataFrame(new_ratings, columns=['user_id', 'item_id', 'rating'])
df2 = df2.set_index(['user_id', 'item_id'])
df2
Out:
rating
user_id item_id
2 1 1.0
2 2 1.0
Then, simply appending the new ratings to the old DataFrame
df3 = df.append(df2)
df3
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
2 1 1.0
2 2 1.0
Looks good, but
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 2, 2, 1, 1], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 2, 0, 2, 0, 1], dtype='int8')
I added user_id=2 and item_id=2 in the later DataFrame on purpose, to illustrate where it goes wrong for me. In df3, label 3 (for both user and item) has moved from integer position 1 to 2, so the mapping is no longer the same. What I'm looking for is [0, 0, 1, 1, 2, 2] and [0, 1, 0, 1, 0, 2] for the user and item mappings respectively.
This is probably because of ordering in pandas Index objects, and I'm unsure whether what I want is at all possible using a MultiIndex strategy. Looking for help on how to tackle this problem most effectively :)
Some notes:
I find using DataFrames convenient for several reasons, but I use the MultiIndex purely for the ID mappings. Alternatives without MultiIndex are completely acceptable.
I cannot guarantee that new user_id and item_id entries in new ratings are larger than any values in the old dataset, hence my example of adding id 2 when [1, 3] were present.
For my incremental training approach, I will need to store my ID maps somewhere. If I only load new ratings partially, I will have to store the old DataFrame and ID maps somewhere. Would be great if it could all be in one place, like it would be with an index, but columns work too.
EDIT: An additional requirement is to allow for row re-ordering of the original DataFrame, as might happen when duplicate ratings exist, and I want to keep the most recent one.
Solution (credits to #jpp for original)
I've made a modification to #jpp's answer to satisfy the additional requirement I've added later (tagged with EDIT). This also truly satisfies the original question as posed in the title, since it preserves the old index integer positions, regardless of rows being reordered for whatever reason. I've also wrapped things into functions:
from itertools import chain
from toolz import unique
def expand_index(source, target, index_cols=['user_id', 'item_id']):
    # Elevate index to series, keeping source with index
    temp = source.reset_index()
    target = target.reset_index()

    # Convert columns to categorical, using the source index and target columns
    for col in index_cols:
        i = source.index.names.index(col)
        col_cats = list(unique(chain(source.index.levels[i], target[col])))
        temp[col] = pd.Categorical(temp[col], categories=col_cats)
        target[col] = pd.Categorical(target[col], categories=col_cats)

    # Convert series back to index
    source = temp.set_index(index_cols)
    target = target.set_index(index_cols)
    return source, target

def concat_expand_index(old, new):
    old, new = expand_index(old, new)
    return pd.concat([old, new])
df3 = concat_expand_index(df, df2)
The result:
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1, 2, 2], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1, 0, 2], dtype='int8')
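(Note: in recent pandas versions, MultiIndex.labels has been renamed to MultiIndex.codes, so the equivalent call here is df3.index.codes[0].)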
I think the use of MultiIndex overcomplicates this objective:
I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
This solution falls into the category below:
Alternatives without MultiIndex are completely acceptable.
def add_mapping(df, df2, df3, column_name='user_id'):
    # note: assumes user_id/item_id are ordinary columns here
    # (reset_index() the frames first if they are index levels)
    initial = df.loc[:, column_name].unique()
    new = df2.loc[~df2.loc[:, column_name].isin(initial), column_name].unique()

    # old IDs keep their integer codes; new IDs are appended after the old maximum
    maps = np.arange(len(initial))
    maps = np.append(maps, np.arange(np.max(maps)+1, np.max(maps)+1+len(new)))
    total = np.append(initial, new)
    mapping = dict(zip(total, maps))

    df3[column_name+'_map'] = df3.loc[:, column_name].map(mapping)
    return df3
add_mapping(df, df2, df3, column_name='item_id')
add_mapping(df, df2, df3, column_name='user_id')
user_id item_id rating item_id_map user_id_map
0 1 1 1.0 0 0
1 1 3 1.0 1 0
2 3 1 1.0 0 1
3 3 3 1.0 1 1
0 2 1 1.0 0 2
1 2 2 1.0 2 2
Explanation
This is how to maintain a mapping for the user_id values. Same holds for the item_id values as well.
These are the initial user_id values (unique):
initial_users = df['user_id'].unique()
# initial_users = array([1, 3])
user_map maintains a mapping for user_id values, as per your requirement:
user_id_maps = np.arange(len(initial_users))
# user_id_maps = array([0, 1])
user_map = dict(zip(initial_users, user_id_maps))
# user_map = {1: 0, 3: 1}
These are the new user_id values you got from df2 - ones that you didn't see in df:
new_users = df2[~df2['user_id'].isin(initial_users)]['user_id'].unique()
# new_users = array([2])
Now we update user_map for the total user base with the new users:
user_id_maps = np.append(user_id_maps, np.arange(np.max(user_id_maps)+1, np.max(user_id_maps)+1+len(new_users)))
# array([0, 1, 2])
total_users = np.append(initial_users, new_users)
# array([1, 3, 2])
user_map = dict(zip(total_users, user_id_maps))
# user_map = {1: 0, 2: 2, 3: 1}
Then, just map the values from user_map to df['user_id']:
df3['user_map'] = df3['user_id'].map(user_map)
user_id item_id rating user_map
0 1 1 1.0 0
1 1 3 1.0 0
2 3 1 1.0 1
3 3 3 1.0 1
0 2 1 1.0 2
1 2 2 1.0 2
Forcing alignment of index labels after concatenation does not appear straightforward and, if there is a solution, it is poorly documented.
One option which may appeal to you is Categorical Data. With some careful manipulation, this can achieve the same purpose: each unique index value within a level has a one-to-one mapping to an integer, and this mapping persists even after concatenation with other dataframes.
from itertools import chain
from toolz import unique
# elevate index to series
df = df.reset_index()
df2 = df2.reset_index()
# define columns for reindexing
index_cols = ['user_id', 'item_id']
# convert to categorical with merged categories
for col in index_cols:
    col_cats = list(unique(chain(df[col], df2[col])))
    df[col] = pd.Categorical(df[col], categories=col_cats)
    df2[col] = pd.Categorical(df2[col], categories=col_cats)
# convert series back to index
df = df.set_index(index_cols)
df2 = df2.set_index(index_cols)
I use toolz.unique to return an ordered unique list, but if you don't have access to this library, you can use the identical unique_everseen recipe from the itertools docs.
Now let's have a look at the category codes underlying the 0th index level:
for data in [df, df2]:
    print(data.index.get_level_values(0).codes.tolist())
[0, 0, 1, 1]
[2, 2]
Then perform our concatenation:
df3 = pd.concat([df, df2])
Finally, check that categorical codes are aligned:
print(df3.index.get_level_values(0).codes.tolist())
[0, 0, 1, 1, 2, 2]
For each index level, note that we must take the union of all index values across both dataframes to form col_cats; otherwise the concatenation will fail.
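As an aside (my own sketch, not part of the answer), the merged categories are also easy to persist and re-apply to a later batch; new_batch below is a hypothetical later DataFrame of ratings with user_id as an ordinary column:

# recover the merged user categories from the concatenated frame
user_cats = list(df3.index.get_level_values(0).categories)
# append any genuinely new IDs after the existing ones, preserving old codes
user_cats += [u for u in new_batch['user_id'].unique() if u not in set(user_cats)]
new_batch['user_id'] = pd.Categorical(new_batch['user_id'], categories=user_cats)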
Basically I am trying to do the opposite of How to generate a list from a pandas DataFrame with the column name and column values?
To borrow that example, I want to go from the form:
data = [['Name','Rank','Complete'],
['one', 1, 1],
['two', 2, 1],
['three', 3, 1],
['four', 4, 1],
['five', 5, 1]]
which should output:
Name Rank Complete
one 1 1
two 2 1
three 3 1
four 4 1
five 5 1
However when I do something like:
pd.DataFrame(data)
I get a dataframe where the first list, which should be my column names, ends up as the first row of data, and the first element of each remaining list should be the row name.
EDIT:
To clarify, I want the first element of each list to be the row name. I am scraping data, so it is formatted this way...
One way to do this is to take the first list as the column names and pass only the remaining rows (from index 1 onward) to pd.DataFrame -
In [8]: data = [['Name','Rank','Complete'],
...: ['one', 1, 1],
...: ['two', 2, 1],
...: ['three', 3, 1],
...: ['four', 4, 1],
...: ['five', 5, 1]]
In [10]: df = pd.DataFrame(data[1:],columns=data[0])
In [11]: df
Out[11]:
Name Rank Complete
0 one 1 1
1 two 2 1
2 three 3 1
3 four 4 1
4 five 5 1
If you want to set the first column Name column as index, use the .set_index() method and send in the column to use for index. Example -
In [16]: df = pd.DataFrame(data[1:],columns=data[0]).set_index('Name')
In [17]: df
Out[17]:
Rank Complete
Name
one 1 1
two 2 1
three 3 1
four 4 1
five 5 1