Converting row values into column names in pandas in a peculiar condition - python

In Pandas I have a dataframe like below
data = [['A','B',3],['A','C',4],['A','D',5],['B','A',4],['B','C',4],['C','D',1]]
df = pd.DataFrame(data, columns=['Col1','Col2','Value'])
df
  Col1 Col2  Value
0    A    B      3
1    A    C      4
2    A    D      5
3    B    A      4
4    B    C      4
5    C    D      1
I want to convert it as below
   A:B  A:C  A:D  B:C  C:D
     7    4    5    4    1
Note: the first column A:B has value 7 because both combinations exist: (A:B) = 3 and (B:A) = 4, and 3 + 4 = 7.
Please suggest a quick method.

Sort and join both columns to build a group key, aggregate with sum, then transpose:
df1 = (df.groupby(df[['Col1','Col2']]
                  .agg(lambda x: ':'.join(sorted(x)), axis=1))['Value']
         .sum()
         .to_frame()
         .T
         .reset_index(drop=True))
print(df1)
A:B A:C A:D B:C C:D
0 7 4 5 4 1
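As a quick check, the approach runs end-to-end on the question's data (a self-contained sketch; variable names follow the question):

```python
import pandas as pd

data = [['A','B',3], ['A','C',4], ['A','D',5],
        ['B','A',4], ['B','C',4], ['C','D',1]]
df = pd.DataFrame(data, columns=['Col1', 'Col2', 'Value'])

# Build an order-independent key per row: both (A,B) and (B,A) become 'A:B'
key = df[['Col1', 'Col2']].agg(lambda x: ':'.join(sorted(x)), axis=1)

# Sum the values per key and reshape to a single row
df1 = df.groupby(key)['Value'].sum().to_frame().T.reset_index(drop=True)
print(df1)
```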

You can do:
df.groupby([':'.join(sorted(t)) for t in zip(df['Col1'], df['Col2'])])['Value'].sum().to_frame().T
output:
A:B A:C A:D B:C C:D
Value 7 4 5 4 1

You can use set to make order irrelevant, or sorted to guarantee a consistent ordering.
result = (df['Value']
          .groupby(df[['Col1','Col2']]
                   .apply(set, axis=1)
                   .apply(':'.join))
          .sum())
print(result['B:A'])
print(result)
But that gets you a Series:
B:A 7
C:A 4
C:B 4
D:A 5
D:C 1
Name: Value, dtype: int64
If you want it the other way round (a one-row DataFrame), you need to make two small changes:
result = (df[['Value']]
          .groupby(df[['Col1','Col2']]
                   .apply(set, axis=1)
                   .apply(':'.join))
          .sum()
          .T)
print(result)
print(result['B:A']['Value'])
Note that set does not preserve order, so a key might come out as either A:B or B:A.
If you need the keys actually sorted so that it is always 'A:B' (you should have specified that as part of your requirements), use sorted instead of set, as shown below:
result = (df['Value']
          .groupby(df[['Col1','Col2']]
                   .apply(sorted, axis=1)
                   .apply(':'.join))
          .sum())
print(result['A:B']) # always sorted...
print(result)
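If the per-row sorted calls feel slow on a large frame, a NumPy-based variant (my own sketch, not from the answers above) sorts both columns in one vectorised call and builds the keys afterwards:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['A','B',3], ['A','C',4], ['A','D',5],
                   ['B','A',4], ['B','C',4], ['C','D',1]],
                  columns=['Col1', 'Col2', 'Value'])

# Sort each (Col1, Col2) pair at once, then join into 'A:B'-style keys
pairs = np.sort(df[['Col1', 'Col2']].to_numpy(), axis=1)
key = [a + ':' + b for a, b in pairs]

result = df['Value'].groupby(key).sum()
print(result)
```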

Related

Keep only the final or the latest rev of a file name

I have a dataframe with columns as below:
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
I want to keep only the rows with the latest rev or rows with "Final" if the other is "Initial".
In the case above, my output will be as below:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
How can I do this in python in my pandas dataframe? Thanks.
You can extract the name before "Final" and drop_duplicates with keep='last':
keep = (df['Name']
.str.extract('^(.*)_Final', expand=False)
.drop_duplicates(keep='last')
.dropna()
)
out = df.loc[keep.index]
NB. Assuming the data is sorted by revision.
Output:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
If you want to keep all duplicates of the last revision:
out = df[df['Name'].isin(df.loc[keep.index, 'Name'])]
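For reference, a self-contained run of the first snippet on the question's data (assuming, as noted above, that the rows are already sorted by revision):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Blue_Water_Final_Rev_0', 'Blue_Water_Final_Rev_1',
             'Blue_Water_Final_Rev_2', 'Red_Water_Final_Rev_0',
             'Red_Water_Initial_Rev_0'],
    'Measurement': [3, 4, 5, 7, 6],
})

keep = (df['Name']
        .str.extract('^(.*)_Final', expand=False)  # prefix before "_Final"; NaN for Initial rows
        .drop_duplicates(keep='last')              # keep the last (highest) revision per prefix
        .dropna())
out = df.loc[keep.index]
print(out)
```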
If rows with Initial but no matching Final can exist and need to be kept, use Series.str.extract to get three columns (the group, the Final/Initial flag, and the revision number), convert the last column to integers, sort by all columns with DataFrame.sort_values, and keep the last row per group with DataFrame.duplicated:
print (df)
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
5 Green_Water_Initial_Rev_0 6
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
.assign(c=lambda x: x.c.astype(int)))
df = df[~df1.sort_values(['a','c','b'], ascending=[True, True, False])
.duplicated('a', keep='last')]
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
5 Green_Water_Initial_Rev_0 6
But if you need to remove all Initial rows and process only the Final ones, use the same first part as above, then filter out the Initial rows and select the last revisions with DataFrame.loc and DataFrameGroupBy.idxmax:
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
.assign(c=lambda x: x.c.astype(int)))
df = df.loc[df1[df1.b.ne('Initial')].groupby('a')['c'].idxmax()]
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
You can also use df.iloc[2:4, :], but only if you already know the positions of the rows you want to keep.

How to add value of dataframe to another dataframe?

I want to add a row of dataframe to every row of another dataframe.
df1 = pd.DataFrame({"a": [1, 2],
                    "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})
I want to add df2 value to every row of df1.
I use df1 + df2 and get the following result:
a b
0 5.0 8.0
1 NaN NaN
But I want to get the following result
a b
0 5 7
1 7 9
Any help would be dearly appreciated!
If you really need to add the values down the rows (this requires the number of columns in df2 to match the number of rows in df1), use:
df = df1.add(df2.loc[0].to_numpy(), axis=0)
print (df)
a b
0 5 7
1 7 9
If you need to add by columns instead (each value of df2 is added to the matching column of df1, i.e. ordinary broadcasting), the output is different:
df = df1.add(df2.loc[0], axis=1)
print (df)
a b
0 5 8
1 6 9
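Both variants side by side on the question's data, to make the difference concrete (a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [4], 'b': [5]})

# Down the rows: df2's values [4, 5] are added to row 0 and row 1 respectively
per_row = df1.add(df2.loc[0].to_numpy(), axis=0)

# Across the columns: df2's 'a' is added to column 'a', 'b' to column 'b'
per_col = df1.add(df2.loc[0], axis=1)
print(per_row)
print(per_col)
```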

How to reduce conditionality of a categorical feature using a lookup table

I have a dataframe (df1) with one categorical column:
df1 = pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
I have another dataframe (df2) which has two columns
df2 = pd.DataFrame({'Category':['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'],'general_mapping':['A','A','A','B','B','B','C','C','C','C','C','C']})
I need to modify df1 using df2 so that it finally looks like:
df1->> ({'COL1': ['A','A','B','A','B','B','B','C','C','C','C']})
You can use pd.Series.map after setting Category as index using df.set_index.
df1['COL1'] = df1['COL1'].map(df2.set_index('Category')['general_mapping'])
df1
COL1
0 A
1 A
2 B
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
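If some values in df1 might be missing from df2, map returns NaN for them; a fillna keeps the original value instead. A sketch of that variation (my own extension, not required by the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
df2 = pd.DataFrame({'Category': ['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'],
                    'general_mapping': ['A','A','A','B','B','B','C','C','C','C','C','C']})

mapping = df2.set_index('Category')['general_mapping']
# Unmatched values would become NaN; fall back to the original value instead
df1['COL1'] = df1['COL1'].map(mapping).fillna(df1['COL1'])
print(df1)
```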

Code multiple columns based on lists and dictionaries in Python

I have the following dataframe in Pandas
OfferPreference_A OfferPreference_B OfferPreference_C
A B A
B C C
C S G
I have the following dictionary of the unique values across all the columns:
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
I also have a list of the column names:
columnlist=['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']
I am trying to get the following table as the output:
OfferPreference_A OfferPreference_B OfferPreference_C
1 2 1
2 3 3
3 4 5
How do I do this?
Use:
#if value not match get NaN
df = df[columnlist].applymap(dict1.get)
Or:
#if value not match get original value
df = df[columnlist].replace(dict1)
Or:
#if value not match get NaN
df = df[columnlist].stack().map(dict1).unstack()
print (df)
OfferPreference_A OfferPreference_B OfferPreference_C
0 1 2 1
1 2 3 3
2 3 4 5
You can use map for this as shown below, assuming the values always match:
for col in columnlist:
    df[col] = df[col].map(dict1)
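A self-contained run of the map loop on the question's data (note the dictionary keys are quoted here, which the question omitted):

```python
import pandas as pd

df = pd.DataFrame({'OfferPreference_A': ['A', 'B', 'C'],
                   'OfferPreference_B': ['B', 'C', 'S'],
                   'OfferPreference_C': ['A', 'C', 'G']})
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
columnlist = ['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']

# Replace each value by its dictionary code, column by column
for col in columnlist:
    df[col] = df[col].map(dict1)
print(df)
```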

Including the group name in the apply function pandas python

Is there away to specify to the groupby() call to use the group name in the apply() lambda function?
Similar to if I iterate through groups I can get the group key via the following tuple decomposition:
for group_name, subdf in temp_dataframe.groupby(level=0, axis=0):
    print(group_name)
...is there a way to also get the group name in the apply function, such as:
temp_dataframe.groupby(level=0, axis=0).apply(lambda group_name, subdf: foo(group_name, subdf))
How can I get the group name as an argument for the apply lambda function?
I think you should be able to use the name attribute:
temp_dataframe.groupby(level=0,axis=0).apply(lambda x: foo(x.name, x))
should work, example:
In [132]:
df = pd.DataFrame({'a':list('aabccc'), 'b':np.arange(6)})
df
Out[132]:
a b
0 a 0
1 a 1
2 b 2
3 c 3
4 c 4
5 c 5
In [134]:
df.groupby('a').apply(lambda x: print('name:', x.name, '\nsubdf:',x))
name: a
subdf: a b
0 a 0
1 a 1
name: b
subdf: a b
2 b 2
name: c
subdf: a b
3 c 3
4 c 4
5 c 5
Out[134]:
Empty DataFrame
Columns: []
Index: []
For those who came looking for an answer to the question:
Including the group name in the transform function pandas python
and ended up in this thread, please read on.
Given the following input:
df = pd.DataFrame(data={'col1': list('aabccc'),
'col2': np.arange(6),
'col3': np.arange(6)})
Data:
col1 col2 col3
0 a 0 0
1 a 1 1
2 b 2 2
3 c 3 3
4 c 4 4
5 c 5 5
We can access the group name (which is visible from the scope of the calling apply function) like this:
df.groupby('col1').apply(
    lambda frame: frame.transform(
        lambda col: col + 3 if frame.name == 'a' and col.name == 'col2' else col))
Output:
col1 col2 col3
0 a 3 0
1 a 4 1
2 b 2 2
3 c 3 3
4 c 4 4
5 c 5 5
Note that the call to apply is needed in order to obtain a reference to the sub pandas.core.frame.DataFrame (i.e. frame) which holds the name attribute of the corresponding sub group. The name attribute of the argument of transform (i.e. col) refers to the column/series name.
Alternatively, one could also loop over the groups and then, within each group, over the columns:
for grp_name, sub_df in df.groupby('col1'):
    for col in sub_df:
        if grp_name == 'a' and col == 'col2':
            df.loc[df.col1 == grp_name, col] = sub_df[col] + 3
My use case is quite rare and this was the only way to achieve my goal (as of pandas v0.24.2). However, I'd recommend exploring the pandas documentation thoroughly because there most likely is an easier vectorised solution to what you may need this construct for.
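In the spirit of that last remark, this particular update does have an easier vectorised form: the whole groupby/apply/transform construct reduces to a single boolean-mask assignment (a sketch of the equivalent update):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('aabccc'),
                   'col2': np.arange(6),
                   'col3': np.arange(6)})

# Add 3 to col2 only where col1 == 'a' -- no groupby or apply needed
df.loc[df['col1'] == 'a', 'col2'] += 3
print(df)
```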
