I have a pandas DataFrame as follows:
col1 col2 col3
0 1 3 ABCDEFG
1 1 5 HIJKLMNO
2 1 2 PQRSTUV
I want to add another column which should be a substring of col3, from the position indicated in col1 to the position indicated in col2. Something like col3[(col1-1):(col2-1)], which should result in:
col1 col2 col3 new_col
0 1 3 ABCDEFG ABC
1 1 5 HIJKLMNO HIJK
2 1 2 PQRSTUV PQ
I tried with the following:
my_df['new_col'] = my_df.col3.str.slice(my_df['col1']-1, my_df['col2']-1)
and
my_df['new_col'] = data['col3'].str[(my_df['col1']-1):(my_df['col2']-1)]
Both of them result in a column of NaN, while if I insert two numerical values (e.g. data['col3'].str[1:3]) it works fine. I checked and the types are correct (int64, int64 and object). Also, outside this context (e.g. using a for loop) I can get the job done, but I'd prefer a one-liner that exploits the DataFrame. What am I doing wrong?
Use apply, because each row has to be processed separately:
my_df['new_col'] = my_df.apply(lambda x: x['col3'][x['col1']-1:x['col2']], axis=1)
print (my_df)
col1 col2 col3 new_col
0 1 3 ABCDEFG ABC
1 1 5 HIJKLMNO HIJKL
2 1 2 PQRSTUV PQ
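If performance matters, a plain list comprehension avoids the per-row overhead of apply. A minimal sketch that mirrors the same slice, assuming my_df is the frame above:
# zip the three columns and slice each string directly;
# usually faster than DataFrame.apply for row-wise work
my_df['new_col'] = [s[a-1:b] for s, a, b in zip(my_df['col3'], my_df['col1'], my_df['col2'])]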
It should be straightforward, but I cannot find the right command.
I want to add a new column (Col3) to my NumPy array which, for each row, counts the occurrences of that row's Col2 value. Take this example:
before:
Col1  Col2
1     4
2     4
3     1500
4     60
5     60
6     60
after:
Col1  Col2  Col3
1     4     2
2     4     2
3     1500  1
4     60    3
5     60    3
6     60    3
Any idea?
Using numpy:
Create a frequency dictionary based on the values of Col2:
import numpy as np
from collections import Counter
freq = Counter(arr[:, 1])
Generate the values of Col3 by iterating over the elements of Col2 (a Counter returns 0 for missing keys, so no fallback is needed):
new_col = np.array([freq[val] for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
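A pure-NumPy alternative, sketched under the assumption that arr is the 2-D array above: np.unique can return both the inverse mapping and the per-value counts, so no Python-level loop is needed.
import numpy as np
# unique values, the index of each element's value in the unique list,
# and how often each unique value occurs
_, inverse, counts = np.unique(arr[:, 1], return_inverse=True, return_counts=True)
new_arr = np.concatenate([arr, counts[inverse].reshape(-1, 1)], axis=1)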
I have a data frame like this:
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test
The dataframe looks like:
col1 col2 col3
0 10 5 6
1 20 10 12
2 30 15 18
3 40 20 24
I want to replace the values greater than 10 in col2 or col3 with zero, and I want to use loc for this purpose.
My desired output is:
col1 col2 col3
0 10 5 6
1 20 10 0
2 30 0 0
3 40 0 0
I have tried the following solution:
cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test
It shows the following error:
KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"
When I use a single column to test the condition, it doesn't show a KeyError, but it also replaces values in the other two columns.
test.loc[test['col2']>10]=0
test
Output is:
col1 col2 col3
0 10 5 6
1 0 0 0
2 0 0 0
3 0 0 0
Can we use loc for this purpose?
Why is loc behaving like this?
What is the efficient solution?
I would use numpy.where to conditionally replace values of multiple columns:
import numpy as np
cols_to_update = ['col2', 'col3']
test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])
The expression test[cols_to_update] > 10 gives you a boolean mask:
col2 col3
0 False False
1 False True
2 True True
3 True True
Then np.where picks the value 0 wherever this mask is True, and the corresponding original value from test[cols_to_update] wherever it is False.
Your solution test.loc[test[cols_to_update]>10]=0 doesn't work because loc in this case would require a 1-D boolean Series, while test[cols_to_update]>10 is still a DataFrame with two columns. This is also why you cannot use loc for this problem without looping over the columns (see the sketch below): the rows where col2 exceeds 10 are different from the rows where col3 does.
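If you do want to stay with loc, the column loop looks like this; each iteration indexes with a 1-D mask, which is exactly what loc expects:
# one 1-D boolean mask per column, so loc is happy
for col in cols_to_update:
    test.loc[test[col] > 10, col] = 0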
When would loc be appropriate in this case? For example if you wanted to set both columns 2 and 3 to zero when any of the two is greater than 10:
test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
test
# out:
col1 col2 col3
0 10 5 6
1 20 0 0
2 30 0 0
3 40 0 0
In this case you index with a 1D Series ((test[cols_to_update] > 10).any(axis=1)), which is an appropriate use case for loc.
You can use where:
import pandas as pd
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test[['col2', 'col3']] = test[['col2', 'col3']].where(test[['col2', 'col3']] <= 10, 0)
output:
   col1  col2  col3
0    10     5     6
1    20    10     0
2    30     0     0
3    40     0     0
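DataFrame.mask is the logical inverse of where (it replaces values where the condition is True), so an equivalent one-liner, sketched on the same frame, is:
# replace entries where the condition holds, keep the rest
test[['col2', 'col3']] = test[['col2', 'col3']].mask(test[['col2', 'col3']] > 10, 0)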
I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns, if there are more) without writing the assign statement below for each column.
df = pd.DataFrame({'col1':[1,2,3,4], 'col2':[2,5,6,8], 'col3':[5,5,5,9]})
df = (df
...
.assign(col2 = lambda x: x.col2 - x.col1)
)
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit: (using **kwargs with method chaining)
As in your comment, if you want to chain methods on the intermediate (on-the-fly calculated) dataframe, you need to define a custom dictionary of per-column functions to use with assign as follows (you can't use a lambda to directly construct the dictionary inside assign).
In this example I add 5 to the dataframe before chaining assign, to show that it works on the intermediate result of the chain, as you want:
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
col1 col2 col3
0 1 2 5
1 2 5 5
2 3 6 5
3 4 8 9
In [64]: df_final
Out[64]:
col1 col2 col3
0 6 1 4
1 7 3 3
2 8 3 2
3 9 4 5
Note: df_final.col1 is different from df.col1 because of the add operation before assign. Don't forget cl=cl in the lambdas of the dictionary; it is there to avoid Python's late-binding issue.
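To see why the cl=cl default matters, here is a minimal sketch of the late-binding pitfall: without it, every lambda closes over the same loop variable and only reads it when called, at which point it holds the last value.
# WRONG: both lambdas see cl == 'col3' when they are finally called,
# so both entries would compute x['col3'] - x['col1']
bad = {cl: lambda x: x[cl] - x['col1'] for cl in ['col2', 'col3']}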
Use df.sub
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
col1 col2 col3 sub_col2 sub_col3
0 1 2 5 1 4
1 2 5 5 3 3
2 3 6 5 3 2
3 4 8 9 4 5
If you want to assign the values back to col2 and col3, additionally use update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
col1 col2 col3
0 1 1 4
1 2 3 3
2 3 3 2
3 4 4 5
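Since the target columns are known, a direct assignment is an alternative to update; a sketch on the same frame:
# subtract col1 from both columns, aligning on the index, and write back
df[['col2', 'col3']] = df[['col2', 'col3']].sub(df['col1'], axis=0)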
Maybe someone knows how to sum rows of a data frame by grouping with a specific condition.
dfa.groupby(['Col1','Col2'])[['Quantity']].sum()
Say we have this df:
Col1 Col2 Quantity
0 1 1 10
1 1 1 10
2 2 1 3
3 1 2 3
4 1 2 3
And I'm trying to get this:
Condition to sum:
the Col1 element of one row is equal to the Col1 element of another row, AND
the Col2 element of that row is equal to the Col2 element of the other row.
Col1 Col2 Quantity
0 1 1 20
2 2 1 3
3 1 2 6
The condition you describe (matching Col1 AND matching Col2) is exactly what grouping on both columns does, so the groupby you already have is the right tool; keep the keys as columns with as_index=False and preserve the row order with sort=False:
dfa.groupby(['Col1','Col2'], as_index=False, sort=False)[['Quantity']].sum()
   Col1  Col2  Quantity
0     1     1        20
1     2     1         3
2     1     2         6
Groupby does the trick.