Pandas DataFrame: use column value to slice string in another column - python

I have a pandas DataFrame as follow:
   col1  col2      col3
0     1     3   ABCDEFG
1     1     5  HIJKLMNO
2     1     2   PQRSTUV
I want to add another column which should be a substring of col3, from the position indicated in col1 to the position indicated in col2. Something like col3[(col1-1):(col2-1)], which should result in:
   col1  col2      col3 new_col
0     1     3   ABCDEFG     ABC
1     1     5  HIJKLMNO    HIJK
2     1     2   PQRSTUV      PQ
I tried with the following:
my_df['new_col'] = my_df.col3.str.slice(my_df['col1']-1, my_df['col2']-1)
and
my_df['new_col'] = my_df['col3'].str[(my_df['col1']-1):(my_df['col2']-1)]
Both of them result in a column of NaN, while if I insert two numeric values (e.g. my_df['col3'].str[1:3]) it works fine. I checked and the types are correct (int64, int64 and object). Also, outside this context (e.g. using a for loop) I can get the job done, but I'd prefer a one-liner that exploits the DataFrame. What am I doing wrong?

Use apply, because each row has to be processed separately:
my_df['new_col'] = my_df.apply(lambda x: x['col3'][x['col1']-1:x['col2']], axis=1)
print (my_df)
   col1  col2      col3 new_col
0     1     3   ABCDEFG     ABC
1     1     5  HIJKLMNO   HIJKL
2     1     2   PQRSTUV      PQ
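If apply feels slow on a larger frame, a plain list comprehension over the three columns is a common alternative; this is only a sketch of that idea (not part of the original answer), reusing the same [col1-1:col2] slice:
# iterate the three columns in lockstep and slice each string in plain Python
my_df['new_col'] = [s[a-1:b] for s, a, b in zip(my_df['col3'], my_df['col1'], my_df['col2'])]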

Related

How to change values in a Pandas DataFrame based on values of another columns

I have the following DataFrame with some numbers, where the sum of the values in Col1, Col2, and Col3 equals the value in the Main column.
How can I replace the values in the Col columns if they are equal to the corresponding value in the Main column?
For example, the following DataFrame:
   Main  Col1  Col2  Col3
0   100    50    50     0
1   200     0   200     0
2    30    20     5     5
3   500     0     0   500
would be changed to this:
   Main  Col1   Col2   Col3
0   100    50     50      0
1   200     0  EQUAL      0
2    30    20      5      5
3   500     0      0  EQUAL
You can use filter to operate only on the "Col" columns (you could also use slicing with a list, see the alternative below), then mask to replace the matching values, and finally update to write the changes back into the DataFrame in place:
df.update(df.filter(like='Col').mask(df.eq(df['Main'], axis=0), 'EQUAL'))
Alternative:
cols = ['Col1', 'Col2', 'Col3']
df.update(df[cols].mask(df.eq(df['Main'], axis=0), 'EQUAL'))
Output:
   Main  Col1   Col2   Col3
0   100    50     50      0
1   200     0  EQUAL      0
2    30    20      5      5
3   500     0      0  EQUAL
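If you prefer to skip update, a variation of the same idea (my own sketch, not from the answer) assigns the masked subset straight back to the same columns:
# compare each Col column to Main row-wise, replace matches, assign back
cols = ['Col1', 'Col2', 'Col3']
df[cols] = df[cols].mask(df[cols].eq(df['Main'], axis=0), 'EQUAL')
Note that inserting the string 'EQUAL' into integer columns turns them into object dtype.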
There are several different ways of doing this; I suggest using the np.where() function.
import numpy as np
df['Col1'] = np.where(df['Col1'] == df['Main'], 'EQUAL', df['Col1'])
df['Col2'] = np.where(df['Col2'] == df['Main'], 'EQUAL', df['Col2'])
df['Col3'] = np.where(df['Col3'] == df['Main'], 'EQUAL', df['Col3'])
Read more about np.where() in the NumPy documentation.
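Since the three calls differ only in the column name, a compact variant (my own sketch) loops over the value columns:
# same np.where logic, applied column by column
for col in ['Col1', 'Col2', 'Col3']:
    df[col] = np.where(df[col] == df['Main'], 'EQUAL', df[col])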

count based on column and add as new column in numpy

It should be straightforward, but I cannot find the right command.
I want to add a new column (Col3) to my NumPy array which counts, for each row, the occurrences of the value in Col2. Take this example:
before:
Col1  Col2
   1     4
   2     4
   3  1500
   4    60
   5    60
   6    60
after:
Col1  Col2  Col3
   1     4     2
   2     4     2
   3  1500     1
   4    60     3
   5    60     3
   6    60     3
any idea?
Using numpy:
Create a frequency dictionary based on the values in Col2:
import numpy as np
from collections import Counter
freq = Counter(arr[:, 1])
Generate the values of Col3 by iterating over the elements of Col2:
new_col = np.array([freq[val] if val in freq else 0 for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
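A fully vectorized alternative (a sketch of my own, assuming arr is a 2-D integer array with Col2 in its second column) gets the counts straight from np.unique:
# return_inverse maps every element back to its unique value,
# so counts[inverse] gives the per-row occurrence count of Col2
uniques, inverse, counts = np.unique(arr[:, 1], return_inverse=True, return_counts=True)
new_arr = np.concatenate([arr, counts[inverse].reshape(-1, 1)], axis=1)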

subtract one column from multiple columns in the same dataframe using method chaining

I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns, if there are more) without writing the assign statement below for each column.
df = pd.DataFrame({'col1':[1,2,3,4], 'col2':[2,5,6,8], 'col3':[5,5,5,9]})
df = (df
      ...
      .assign(col2=lambda x: x.col2 - x.col1)
)
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit: (using **kwargs with method chaining)
As in your comment, if you want to chain methods on the intermediate (on-the-fly calculated) dataframe, you need to define a custom dictionary that computes each column and pass it to assign as follows (you can't use a lambda to construct the dictionary directly inside assign).
In this example I add 5 to the dataframe before chaining assign, to show how it works within a chain, as you want:
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
   col1  col2  col3
0     1     2     5
1     2     5     5
2     3     6     5
3     4     8     9

In [64]: df_final
Out[64]:
   col1  col2  col3
0     6     1     4
1     7     3     3
2     8     3     2
3     9     4     5
Note: df_final.col1 differs from df.col1 because of the add operation before assign. Don't forget cl=cl in the lambdas of the dictionary; it is there to avoid Python's late-binding issue.
Use df.sub
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
   col1  col2  col3  sub_col2  sub_col3
0     1     2     5         1         4
1     2     5     5         3         3
2     3     6     5         3         2
3     4     8     9         4         5
If you want to assign the values back to col2 and col3, additionally use update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
   col1  col2  col3
0     1     1     4
1     2     3     3
2     3     3     2
3     4     4     5
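If the whole computation should stay inside one method chain, a pipe-based variation (my own sketch, combining ideas from both answers) also works and produces the same frame as df_final from the first answer:
# pipe hands the intermediate frame to a function, where sub and assign can be combined
df_chained = (
    df
    .add(5)
    .pipe(lambda d: d.assign(**d[['col2', 'col3']].sub(d['col1'], axis=0)))
)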

How to add rows in pandas with conditions

Maybe someone knows how to add two rows of a data frame by grouping with a specific condition.
dfa.groupby(['Col1','Col2'])[['Quantity']].sum()
Say we have this df:
   Col1  Col2  Quantity
0     1     1        10
1     1     1        10
2     2     1         3
3     1     2         3
4     1     2         3
And I'm trying to get this:
Condition to sum:
the Col1 element of one row is equal to the Col1 element of the other row, AND
the Col2 element of that row is equal to the Col2 element of the other row
   Col1  Col2  Quantity
0     1     1        20
2     2     1         3
3     1     2         6
This seems like what you are looking for:
dfa[dfa.Col1 == dfa.Col2].groupby(['Col1','Col2'])[['Quantity']].sum()
I would think that Groupby would do the trick.
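For reference, a minimal sketch of my own (not taken from either answer) that reproduces the expected output is a plain groupby, with as_index=False so Col1 and Col2 stay as regular columns and sort=False to keep the original row order:
result = dfa.groupby(['Col1', 'Col2'], as_index=False, sort=False)['Quantity'].sum()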
