Combine rows of df based on sequence in 1 column - python

I have a df like this:
import pandas as pd

d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2]}
df = pd.DataFrame(data=d)
I'd like to iterate over col3 while the next row keeps increasing, until the same value reoccurs, and then combine the rows/columns like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2], 'col4': ['B/Done;C/Open;K/Open', 'C/Open;K/Open', 'None', 'None', 'M/Open', 'None']}
df = pd.DataFrame(data=d)
I have thousands of rows, so I am trying to avoid using a for loop if possible.

I believe you can't perform this in a vectorized way.
Here is a working approach, but using a loop in a custom function:
def combine(series):
    out = []
    for s in series.iloc[1:]:
        out.append(out[-1] + ';' + s if out else s)
    out = out[::-1]
    out.append(None)
    return pd.Series(out, index=series.index)
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]

df['col4'] = (df.assign(col=df['col1'] + '/' + df['col2'])
                .groupby(group, sort=False)['col']
                .apply(combine)
              )
output:
col1 col2 col3 col4
0 A Open 1 B/Done;C/Open;K/Open
1 B Done 2 B/Done;C/Open
2 C Open 3 B/Done
3 K Open 3 None
4 L Done 1 M/Open
5 M Open 2 None
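For reference, here is how the grouping line behaves on the sample df (a small illustration, not part of the original answer):

df['col3'].diff().eq(0)                       # True where col3 repeats its predecessor: [False, False, False, True, False, False]
df['col3'].diff().eq(0)[::-1].cumsum()[::-1]  # group labels: [1, 1, 1, 1, 0, 0]

Counting the repeats from the end means every run of increasing values, up to and including the repeat, gets the same label. So rows 0-3 form one group and rows 4-5 another, and combine is applied once per group.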

Related

How do I multiply values of a dataframe column by values from a column of another dataframe based on a common category?

I have two dataframes:
import pandas as pd

data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
df1
data2 = {'Category': ['X', 'Y', 'Z'], 'Value retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df2
How do I multiply 'Value retained' by 'Price' following their respective Category and add the result as a new column in df1?
I've searched a lot for a solution and tried several different things, among them:
df3 = df1
for cat, VR in df2['Category', 'Value retained']:
    if cat in df1.columns:
        df3[cat] = df1['Price'] * VR
and
df3 = df1['Price'] * df2.set_index('Category')['Value retained']
df3
In my real dataframe I have 250k+ items and 32 categories with different values of 'value retained'.
I'd really appreciate any help; I'm a newbie at Python coding.
Your 2nd approach would work if both dataframes had Category as the index, but since you can't set_index on Category in df1 (you have duplicated entries), you need to do a left merge of the two dataframes on the Category column and then multiply.
df3 = df1.merge(df2, on='Category', how='left')
df3['result'] = df3['Price'] * df3['Value retained']
print(df3)
Item Price Category County Value retained result
0 A 1 X K 0.1 0.1
1 B 2 Y L 0.2 0.4
2 C 3 X L 0.1 0.3
3 N 10 Z K 0.8 8.0
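As a follow-up for the 250k+ row case: a map-based lookup is a common alternative to the merge that avoids carrying df2's columns into df1. A minimal sketch, assuming df1 and df2 as defined above:

rates = df2.set_index('Category')['Value retained']   # Series: Category -> rate
df1['result'] = df1['Price'] * df1['Category'].map(rates)

This is essentially your 2nd approach made to work: the set_index happens on df2, where Category is unique, and map aligns each row's Category to its rate.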
You can use this:
import pandas as pd
data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
data2 = {'Category': ['X', 'Y', 'Z'], 'Value_retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df = df1.merge(df2, how='left')
df['Values'] = df.Price * df.Value_retained
print(df)
The output is,
Item Price Category County Value_retained Values
0 A 1 X K 0.1 0.1
1 B 2 Y L 0.2 0.4
2 C 3 X L 0.1 0.3
3 N 10 Z K 0.8 8.0

Apply function for multiple levels of tables/data

I've run into a problem at work. I have these tables:
import pandas as pd
import numpy as np
level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])
name value
0 a 3
1 b x
2 c x
I want to sum the value column, but it contains “x”s, so I have to use the second table to calculate the “x”s:
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'], ['b1', 'b2', 'c1', 'c2', 'c3'], [5, 7, 2, 'x', 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 b b1 5
1 b b2 7
2 c c1 2
3 c c2 x
4 c c3 9
I should sum b1 and b2 to get the “x” for b in the level1 table (x = 12). But for c there is another “x”, so there is a third-level table:
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'], ['c1', 'c2', 'c3'], [2, 4, 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 c c1 2
1 c c2 4
2 c c3 9
Now we can get the sum for the value column in the level1 table.
My question is: can we use a function to calculate this easily? If there are more levels, how can we loop through them until there are no more “x”s?
It is OK to combine level2 and level3.
Here's a way using combine_first and replace:
from functools import reduce

l1 = level1.assign(sub=level1['name'] + '1').replace('x', np.nan).set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan).set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan).set_index(['name', 'sub'])

reduce(lambda x, y: x.combine_first(y), [l3, l2, l1]).groupby(level=0).sum()
Output:
value
name
a 3.0
b 12.0
c 15.0
Complete example:
import pandas as pd
import numpy as np
from functools import reduce

level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'],
                               ['b1', 'b2', 'c1', 'c2', 'c3'],
                               [5, 7, 2, 'x', 9])),
                      columns=['name', 'sub', 'value'])
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'],
                               ['c1', 'c2', 'c3'],
                               [2, 4, 9])),
                      columns=['name', 'sub', 'value'])

l1 = level1.assign(sub=level1['name'] + '1') \
           .replace('x', np.nan) \
           .set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan) \
           .set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan) \
           .set_index(['name', 'sub'])

out = reduce(lambda x, y: x.combine_first(y),
             [l3, l2, l1]).groupby(level=0).sum()
print(out)
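To address the "more levels" part of the question directly: because combine_first simply overlays a finer table onto a coarser one, the same reduce pattern extends to any number of levels. A minimal sketch under the same assumptions as the complete example above (the prepare and total helpers are mine, not from the original answer):

def prepare(tbl):
    # give level1 a 'sub' key too, so every table shares the ['name', 'sub'] index
    if 'sub' not in tbl.columns:
        tbl = tbl.assign(sub=tbl['name'] + '1')
    return tbl.replace('x', np.nan).set_index(['name', 'sub'])

def total(levels):
    # levels ordered coarsest to finest, e.g. [level1, level2, level3];
    # reversing makes the finest table win wherever values overlap
    prepared = [prepare(t) for t in reversed(levels)]
    return reduce(lambda x, y: x.combine_first(y), prepared).groupby(level=0).sum()

print(total([level1, level2, level3]))  # a 3.0, b 12.0, c 15.0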
One option is a combination of merge (multiple merges, actually) and a groupby:
(level2
 .merge(level3, on=['name', 'sub'], how='left', suffixes=(None, '_y'))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_y, df.value))
 .groupby('name', as_index=False)
 .value
 .sum()
 .merge(level1, on='name', how='right', suffixes=('_x', None))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_x, df.value))
 .loc[:, ['name', 'value']]
)
name value
0 a 3
1 b 12.0
2 c 15.0

Stack different column values into one column in a pandas dataframe

I have the following dataframe -
df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 3, 3, 4],
    'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
    'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
             '7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
                      'g1'],
    'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
                   '6/5/2019', '7/9/2019', '', '16/11/2019'],
    'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
                 '6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
    pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
      .merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
You can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index=['ID', 0], columns='level_2', values='Date',
                   aggfunc=''.join, fill_value='').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
    ID prior-current  start-date    end-date
0    1             a                1/1/2019
1    1            a1    1/1/2019
2    2             b                5/1/2019
3    2             c    5/1/2019   10/2/2019
4    2            c1   10/2/2019
5    3             d               15/3/2019
6    3             e   15/3/2019    6/5/2019
7    3             f    6/5/2019    7/9/2019
8    3            f1    7/9/2019
9    4             g              16/11/2019
10   4            g1  16/11/2019
Explanation:
After stack / reset_index, k will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now we can use ID and column 0 as the index, level_2 as the columns, and the Date column as the values.
Finally, we rename the columns to get the desired result (after the pivot the value columns come out in alphabetical order, Current then Prior, which is why they map to start-date and end-date).
My approach is to build the target df step by step. The first step is an extension of your code using melt() and merge(). The merges are done on the 'Current' and 'Prior' columns to get the start and end dates.
df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 3, 3, 4],
    'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
    'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
             '7/9/2019', '16/11/2019']
})

df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable')
         .drop_duplicates()
         .sort_values('ID'))
df2 = df2.merge(df[['Current', 'Date']], how='left',
                left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left',
                left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
An alternative way is to define a custom function to get the date, then apply it with a lambda:
def get_date(x, col):
    try:
        return df['Date'][df[col] == x].values[0]
    except IndexError:
        return ''

df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable')
         .drop_duplicates()
         .sort_values('ID')
         .reset_index(drop=True))
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output: both versions reproduce the desired_df shown in the question.

How to aggregate columns in pandas

I have a data frame with a number of columns. I want to group them under two main groups, A and B, while preserving the old column names as dictionary keys, like this:
index userid col1 col2 col3 col4 col5 col6 col7
0 1 6 3 Nora 100 11 22 44
The desired data frame is as follows:
index userid A B
0 1 {"col1":6, "col2":3, "col3":"Nora","col4":100} {"col5":11, "col6":22, "col7":44}
To match your desired dataframe exactly:
>>> import pandas as pd
# recreating your data
>>> df = pd.DataFrame.from_dict({'index': [0], 'userid': [1], 'col1': [6], 'col2': [3], 'col3': ['Nora'], 'col4': [100], 'col5': [11], 'col6': [22], 'col7': [44]})
# copy of unchanged columns
>>> df_new = df[['index', 'userid']].copy()
# grouping columns together
>>> df_new['A'] = df[['col1', 'col2', 'col3', 'col4']].copy().to_dict(orient='records')
>>> df_new['B'] = df[['col5', 'col6', 'col7']].copy().to_dict(orient='records')
>>> df_new
index userid A B
0 0 1 {'col1': 6, 'col2': 3, 'col3': 'Nora', 'col4': 100} {'col5': 11, 'col6': 22, 'col7': 44}
You can try something like this:
d = {'col1': 'A',
     'col2': 'A',
     'col3': 'A',
     'col4': 'A',
     'col5': 'B',
     'col6': 'B',
     'col7': 'B'}

df.groupby(d, axis=1).apply(pd.DataFrame.to_dict, orient='series').to_frame().T
Output:
A B
0 {'col1': [6], 'col2': [3], 'col3': ['Nora'], '... {'col5': [11], 'col6': [22], 'col7': [44]}
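Note that groupby(..., axis=1) is deprecated in pandas 2.x, so the above may warn or stop working on newer versions. A minimal sketch of the same grouping without it, reusing the mapping d (this rewrite is mine, not part of the original answer; it produces one dict per row like the first answer, rather than per-column Series):

cols = {}
for c, g in d.items():          # invert d into group -> list of columns
    cols.setdefault(g, []).append(c)

df_new = df[['index', 'userid']].copy()
for g, cs in cols.items():
    df_new[g] = df[cs].to_dict(orient='records')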
Working with the original dataframe:
import pandas as pd

df1 = pd.DataFrame({'index': [0], 'userid': [1],
                    'col1': [6], 'col2': [3], 'col3': ['Nora'], 'col4': [100],
                    'col5': [11], 'col6': [22], 'col7': [44]})

df1['A'] = df1[['col1', 'col2', 'col3', 'col4']].to_dict(orient='records')
df1['B'] = df1[['col5', 'col6', 'col7']].to_dict(orient='records')
df1.drop(df1.columns[range(2, 9)], axis=1, inplace=True)
print(df1)

Pandas dropping column issue

I'm trying to drop a column from my dataframe, but the problem is that whenever I drop the column (which works) my columns always get rearranged in different orders. Does anyone know why this might be? This is my code right now:
df=df.drop('column_name', axis=1)
One way to do it is:
df = df[['col1', 'col2', 'col4']]  # if 'col3' is what you want to drop
This is practical when you only have a few columns to keep.
I cannot reproduce your problem, but I believe the following code keeps the order. It's the same idea as runzhi xiao's answer, but without typing out all the remaining columns.
df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': ['a', 'e', 'i', 'o'],
    'col3': ['a', 'e', 'i', 'o'],
    'col4': [0.1, 0.2, 1, 2],
})

cols_to_drop = ['col2']
new_columns = df.columns.drop(cols_to_drop).to_list()
df[new_columns]
col1 col3 col4
0 1 a 0.1
1 2 e 0.2
2 3 i 1.0
3 4 o 2.0
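As a quick sanity check (a sketch, not from the original answers) that drop alone preserves column order:

import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3], 'col4': [4]})
print(df.drop('col3', axis=1).columns.tolist())  # ['col1', 'col2', 'col4']

If the columns still come back rearranged, the reordering is happening somewhere else in your pipeline, not in drop itself.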
