Python: iterate over each subgroup of rows and apply a function

I need to combine every combination of subgroups, apply a function to the combined rows, and return a single value per combination along with a concatenated string identifying which subgroups were used.
I understand how to use pd.groupby, set level=0 or level=1, and then call agg({'LOOPED_AVG': 'mean'}). However, I need to subset rows by subgroup, combine all rows belonging to one combination, and then apply the function to that combined set.
Input data table:
MAIN_GROUP SUB_GROUP CONCAT_GRP_NAME X_1
A 1 A1 9
A 1 A1 6
A 1 A1 3
A 2 A2 7
A 3 A3 9
B 1 B1 7
B 1 B1 3
B 2 B2 7
B 2 B2 8
C 1 C1 9
Desired result:
LOOP_ITEMS LOOPED_AVG
A1 B1 C1 6.166666667
A1 B2 C1 7
A2 B1 C1 6.5
A2 B2 C1 7.75
A3 B1 C1 7
A3 B2 C1 8.25

Assuming you have three main groups (and therefore three items per combination), you can apply the following; for more groups, adjust the script accordingly. This may not be the most efficient way to solve the problem, but it gives you a starting point.
import pandas as pd
import numpy as np
ls = [
['A', 1, 'A1', 9],
['A', 1, 'A1', 6],
['A', 1, 'A1', 3],
['A', 2, 'A2', 7],
['A', 3, 'A3', 9],
['B', 1, 'B1', 7],
['B', 1, 'B1', 3],
['B', 2, 'B2', 7],
['B', 2, 'B2', 8],
['C', 1, 'C1', 9],
]
#convert to dataframe
df = pd.DataFrame(ls, columns = ["Main_Group", "Sub_Group", "Concat_GRP_Name", "X_1"])
#get count and sum of concatenated groups
df_sum = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum','count']).reset_index()
#use itertools.permutations to generate the different orderings of the concatenated group names
import itertools as it
def compute_combinations(df, colname, main_group_series):
    l = []
    perms = it.permutations(df[colname])
    # sorted array of unique main-group values (e.g. ['A', 'B', 'C'])
    unique_groups = np.unique(main_group_series)
    for perm_pairs in perms:
        # take only the first three items of each permutation and make sure the
        # first item belongs to main group A, the second to B, and the third to C
        if all([main_group in perm_pairs[ind] for ind, main_group in enumerate(unique_groups)]):
            l.append([perm_pairs[ind] for ind in range(unique_groups.shape[0])])
    return l
t = compute_combinations(df_sum, 'Concat_GRP_Name', df['Main_Group'])
#convert to dataframe and drop duplicate combinations
df2 = pd.DataFrame(t, columns=['Item1', 'Item2', 'Item3']).drop_duplicates()
#join df2 against the dataframe that holds the sums and counts per Concat_GRP_Name to bring in the sum and count for each item;
#since there are three item columns, the join is applied three times
merged = df2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item1'], right_on=['Concat_GRP_Name'], how='inner')\
.drop(['Concat_GRP_Name'], axis = 1)\
.rename({'sum':'item1_sum'}, axis=1)\
.rename({'count':'item1_count'}, axis=1)
merged2 = merged.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item2'], right_on=['Concat_GRP_Name'], how='inner')\
.drop(['Concat_GRP_Name'], axis = 1)\
.rename({'sum':'item2_sum'}, axis=1)\
.rename({'count':'item2_count'}, axis=1)
merged3 = merged2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item3'], right_on=['Concat_GRP_Name'], how='inner')\
.drop(['Concat_GRP_Name'], axis = 1)\
.rename({'sum':'item3_sum'}, axis=1)\
.rename({'count':'item3_count'}, axis=1)
#get the sum of all of the item_sum cols
merged3['sums']= merged3[['item3_sum', 'item2_sum', 'item1_sum']].sum(axis = 1)
#get sum of all the item_count cols
merged3['counts']= merged3[['item3_count', 'item2_count', 'item1_count']].sum(axis = 1)
#find the average
merged3['LOOPED_AVG'] = merged3['sums'] / merged3['counts']
#remove the intermediate fields
merged3 = merged3.drop(['item3_count', 'item2_count', 'item1_count', 'item3_sum', 'item2_sum', 'item1_sum', 'counts', 'sums'], axis=1)
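To match the desired output, the three item columns still need to be concatenated into a single LOOP_ITEMS label. A minimal follow-up sketch (not part of the original answer) that builds this column and keeps only the two requested fields:
#build the LOOP_ITEMS label from the three item columns and keep only the requested fields
merged3['LOOP_ITEMS'] = merged3['Item1'] + ' ' + merged3['Item2'] + ' ' + merged3['Item3']
result = merged3[['LOOP_ITEMS', 'LOOPED_AVG']].sort_values('LOOP_ITEMS').reset_index(drop=True)
print(result)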

Related

How to compare and replace individual cell values in a DataFrame according to a list? (pandas)

I have a dataframe containing numerical values. I want to replace all values in the dataframe by comparing individual cell values to the respective elements of a list. The length of the list and the number of columns are the same. Here's an example:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
Output
a b c
0 101 2 3
1 4 500 6
2 712 8 9
list_numbers = [100,100,100]
I want to compare individual cell values to the respective elements of the list.
So, column 'a' will be compared to 100. If a value is greater than a hundred, I want to replace it with another number.
Here is my code so far:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
df_columns = df.columns
df_index = df.index
#Creating a new dataframe to store the values.
df1 = pd.DataFrame(index=df_index, columns=df_columns)
df1 = df1.fillna(0)
for index, value in enumerate(df.columns):
    #df.where replaces values where the condition is false
    df1[[value]] = df[[value]].where(df[[value]] > list_numbers[index], -1)
    df1[[value]] = df[[value]].where(df[[value]] < list_numbers[index], 1)
#I am getting something like: nan for column a and error for other columns.
#The output should look something like:
Output
a b c
0 1 -1 -1
1 -1 1 -1
2 1 -1 -1
Iterating over a DataFrame iterates over its column names. So you could simply do:
df1 = pd.DataFrame()
for i, c in enumerate(df):
    df1[c] = np.where(df[c] >= list_numbers[i], 1, -1)
You can avoid iterating over the columns, and use numpy broadcasting (which is more efficient):
df1 = pd.DataFrame(
    np.where(df.values > np.array(list_numbers), 1, -1),
    columns=df.columns)
df1
Output:
a b c
0 1 -1 -1
1 -1 1 -1
2 1 -1 -1

How to retrieve all raw components of an item using Python Dataframes

I am trying to write a solution that would return a dataframe of all the raw_components by item when given the following 2 dataframes (items & components).
These two dataframes are linked by using item_code as the key when going from items to components. When going from components to items, sub_component in the components dataframe is used instead to relate to item in the items dataframe.
items: This dataframe holds all items that contain sub-components. An item could also be a sub-component of another item. Raw components would not exist here.
items = pd.DataFrame({'item': ['a', 'b'],
                      'item_code': [1, 2]
                      })
items.set_index('item', inplace=True)
item_code
item
a 1
b 2
components : This dataframe lists the sub-components of each item in the items dataframe. If sub_component doesn't exist in the items dataframe (as item), then it is then a raw_component in expected_result.
components = pd.DataFrame({'item_code': [1, 1, 1, 1, 2],
                           'sub_component': ['a1', 'a2', 'a3', 'b', 'b1'],
                           'quantity': [1, 2, 1, 1, 4]
                           })
components.set_index('item_code', inplace=True)
sub_component quantity
item_code
1 a1 1
1 a2 2
1 a3 1
1 b 1
2 b1 4
The dataframe created below is the expected output based on the current dataset.
expected_result = pd.DataFrame({'item': ['a', 'a', 'a', 'a', 'b'],
                                'raw_component': ['a1', 'a2', 'a3', 'b1', 'b1'],
                                'quantity': [1, 2, 1, 4, 4]
                                })
expected_result.set_index('item', inplace=True)
raw_component quantity
item
a a1 1
a a2 2
a a3 1
a b1 4
b b1 4
I tried using recursion and loops but I've been unable to figure out a solution. The challenge is that an item might have a sub-component that also has sub-components, and the number of layers is not provided.
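One possible approach (a sketch, not taken from the original thread): start from each item's direct sub-components and repeatedly expand any sub-component that is itself an item, multiplying quantities along the chain, until only raw components remain. It assumes the items and components frames defined above:
import pandas as pd

#work on flat copies of the indexed frames defined above
items_flat = items.reset_index()
components_flat = components.reset_index()

#start from each item's direct sub-components
result = items_flat.merge(components_flat, on='item_code')[['item', 'sub_component', 'quantity']]

while True:
    #sub-components that are themselves items still need to be expanded
    is_item = result['sub_component'].isin(items_flat['item'])
    if not is_item.any():
        break
    raw = result[~is_item]
    #map each intermediate item back to its item_code, then pull in its own sub-components
    expanded = (result[is_item]
                .merge(items_flat, left_on='sub_component', right_on='item')
                .merge(components_flat, on='item_code', suffixes=('', '_child')))
    #quantities multiply along the chain (e.g. a -> b (1) -> b1 (4) gives 4)
    expanded['quantity'] = expanded['quantity'] * expanded['quantity_child']
    expanded = expanded[['item_x', 'sub_component_child', 'quantity']]
    expanded.columns = ['item', 'sub_component', 'quantity']
    result = pd.concat([raw, expanded], ignore_index=True)

result = (result.rename(columns={'sub_component': 'raw_component'})
                .set_index('item'))
print(result)
Because intermediate items are expanded one layer per pass, the loop handles an arbitrary number of nesting levels without knowing the depth in advance.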

Python: Replace data from one dataframe using other dataframe

How to replace data from df1 using dataframe df2 based on column A
df1 = pd.DataFrame({'A': [0, 1, 2, 0, 4],'B': [5, 6, 7, 5, 9],'C': ['a', 'b', 'c', 'a', 'e'],'E': ['a1', '1b', '1c', '1a', '1e']})
df2 = pd.DataFrame({'A': [0, 1],'B': ['new', 'new1'],'C': ['t', 't1']})
Use DataFrame.merge with a left join, fill the resulting missing values from the original DataFrame with DataFrame.fillna, and finally filter the columns by df1.columns:
df = df1.merge(df2, on='A', how='left', suffixes=('_','')).fillna(df1)[df1.columns]
print(df)
A B C E
0 0 new t a1
1 1 new1 t1 1b
2 2 7 c 1c
3 0 new t 1a
4 4 9 e 1e
Here is an option.
##set index to be the same
df1 = df1.set_index('A')
df2 = df2.set_index('A')
##update df1
df1.loc[df2.index,df2.columns] = df2
##reset the index to get it back to a column
df1 = df1.reset_index()

Selecting by MultiIndex

I have two DataFrames
df_a = pd.DataFrame(data=[['A', 'B', 'C'], ['A1', 'B1', 'C1']], columns=['first', 'secound', 'third'])
df_a.set_index(['first', 'secound'], inplace=True)
df_b = pd.DataFrame(data=[['A', 'B', 12], ['A', 'B', 143], ['C1', 'C1', 11]], columns=['first', 'secound', 'data'])
df_b.set_index(['first', 'secound'], inplace=True)
third
first secound
A B C
A1 B1 C1
data
first secound
A B 12
B 143
C1 C1 11
How can I select only the shared index elements in df_b:
data
first secound
A B 12
B 143
Thanks for the help.
You could take the intersection of the indexes, and use that as an indexer for df_b.loc:
In [28]: df_b.loc[df_b.index.intersection(df_a.index)]
Out[28]:
data
first secound
A B 12
B 143
or, alternatively, use isin to generate a boolean mask for df_b.loc:
In [32]: df_b.loc[df_b.index.isin(df_a.index)]
Out[32]:
data
first secound
A B 12
B 143
Using isin seems to be the fastest option, based on a perfplot benchmark. The following setup was used to generate it:
import numpy as np
import pandas as pd
import perfplot

def isin(x):
    df_a, df_b = x
    return df_b.loc[df_b.index.isin(df_a.index)]

def intersection(x):
    df_a, df_b = x
    return df_b.loc[df_b.index.intersection(df_a.index)]

def join(x):
    df_a, df_b = x
    return df_a.drop(df_a.columns, axis=1).join(df_b).dropna()

def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=(n, 3)))
    df = df.set_index([0, 1])
    return df

perfplot.show(
    setup=lambda n: [make_df(n) for i in range(2)],
    kernels=[isin, intersection, join],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')
You could join the index of df_a with df_b, and then drop NaNs:
>>> df_a.drop(df_a.columns, axis=1).join(df_b).dropna()
data
first secound
A B 12.0
B 143.0

Manipulate specific columns (sample features) conditional on another column's entries (feature value) using pandas/numpy dataframe

my input dataframe (shortened) looks like this:
>>> import numpy as np
>>> import pandas as pd
>>> df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
... columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_in
c1 c2 col c3 c4
0 1 2 a 3 4
1 6 7 b 8 9
It is supposed to be manipulated as follows: if a row (sample) has a specific value in column 'col' (feature) (e.g., 'b' here), then convert the entries in columns 'c1' and 'c2' of the same row to np.nan.
Result wanted:
>>> df_out = pd.DataFrame([[1, 2, 'a', 3, 4], [np.nan, np.nan, 'b', 8, 9]],
...                       columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_out
c1 c2 col c3 c4
0 1 2 a 3 4
1 NaN NaN b 8 9
So far, I managed to obtain the desired result with the following code:
>>> dic = {'col' : ['c1', 'c2']} # auxiliary
>>> b_w = df_in[df_in['col'] == 'b'] # Subset with 'b' in 'col'
>>> b_w = b_w.drop(dic['col'], axis=1) # ...inject np.nan in 'c1', 'c2'
>>> b_wo = df_in[df_in['col'] != 'b'] # Subset without 'b' in 'col'
>>> df_out = pd.concat([b_w, b_wo]) # Both Subsets together again
>>> df_out
c1 c2 c3 c4 col
1 NaN NaN 8 9 b
0 1.0 2.0 3 4 a
Although I get what I want (the original data consists entirely of floats, so don't mind
the mutation from int to float here), it is a rather inelegant
snippet of code. My educated guess is that this could be done faster
using the built-in functions from pandas and numpy, but I haven't managed to do it.
Any suggestions on how to code this in a fast and efficient way for daily use? Any help is highly appreciated. :)
You can condition on both the row and column positions to assign values using loc, which supports both boolean indexing and label-based indexing:
df_in.loc[df_in.col == 'b', ['c1', 'c2']] = np.nan
df_in
# c1 c2 col c3 c4
# 0 1.0 2.0 a 3 4
# 1 NaN NaN b 8 9
When using pandas I would go for the solution provided by Psidom.
However, for larger datasets it is faster to do the whole pandas -> numpy -> pandas round trip, i.e. dataframe -> numpy array -> dataframe (about 10% less processing time in my setup). Without converting back to a dataframe, numpy is almost twice as fast for my dataset.
Solution for the question asked:
cols, df_out = df_in.columns, df_in.values
for i in [0, 1]:
    df_out[df_out[:, 2] == 'b', i] = np.nan
df_out = pd.DataFrame(df_out, columns=cols)
