Pandas DataFrame: change column values based on given intervals - Python

I am trying to convert my column values according to intervals, like:
if (x < 5)
    x = 2
else if (x >= 5 && x < 10)
    x = 3
and I am trying to do it in Python in a single line of code. I tried the mask and cut methods but could not make them work. This is my attempt (it does not work):
dataset['CURRENT_RATIO'] = dataset['CURRENT_RATIO'].mask((dataset['CURRENT_RATIO'] < 0.02, -7.0) | (dataset['CURRENT_RATIO'] >= 0.02 & dataset['CURRENT_RATIO'] < 0.37), -5))
What I actually need: if x < 0.02 then -7, else if x >= 0.02 and x < 0.37 then -5, and so on.
input    output
0.015    -7
0.02     -5
0.37     -3
0.75      1

TL;DR
A list comprehension will do:
dataset.CURRENT_RATIO = [-7 if i<.02 else -5 if i<.37 else -3 if i<.75 else 1 for i in dataset.CURRENT_RATIO]
Timing it with a random dataset of 1,000,000 rows gave the following result:
334 ms ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
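Since the question also mentions cut, a vectorized alternative is to bin the values directly with pd.cut; a sketch on the sample values from the question, using right=False so each bin is half-open and matches the x < threshold conditions:

```python
import pandas as pd

dataset = pd.DataFrame({'CURRENT_RATIO': [0.015, 0.02, 0.37, 0.75]})

# right=False makes each bin [left, right), so 0.02 falls in the -5 bin
bins = [-float('inf'), 0.02, 0.37, 0.75, float('inf')]
labels = [-7, -5, -3, 1]
dataset['CURRENT_RATIO'] = pd.cut(dataset['CURRENT_RATIO'], bins=bins,
                                  labels=labels, right=False).astype(int)
print(dataset['CURRENT_RATIO'].tolist())  # [-7, -5, -3, 1]
```

pd.cut returns a Categorical, hence the .astype(int) at the end to get plain integers back.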
Multiple Columns
If you were to do this with multiple columns with different thresholds and update values, I would do the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'RATIO1': np.random.rand(10000000),
                   'RATIO2': np.random.rand(10000000)})
update_per_col = {'RATIO1': [(.02, -7), (.37, -5), (.75, -3), 1],
                  'RATIO2': [(.12, 5), (.47, 6), (.85, 7), 8]}
cols_to_be_updated = ['RATIO1', 'RATIO2']

for col in cols_to_be_updated:
    df[col] = [update_per_col[col][0][1] if i < update_per_col[col][0][0] else
               update_per_col[col][1][1] if i < update_per_col[col][1][0] else
               update_per_col[col][2][1] if i < update_per_col[col][2][0] else
               update_per_col[col][3]
               for i in df[col]]
When we time the for loop with 10,000,000 rows and two columns we get:
9.37 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To address Joe Ferndz's comment, let's try to speed it up. I came up with two alternatives: a named function with apply() and a lambda. With apply() we run the following code (only the for loop was timed):
def update_ratio(x, updates):
    if x < updates[0][0]:
        return updates[0][1]
    elif x < updates[1][0]:
        return updates[1][1]
    elif x < updates[2][0]:
        return updates[2][1]
    else:
        return updates[3]

for col in cols_to_be_updated:
    df[col] = df[col].apply(update_ratio, updates=update_per_col[col])
This gives us:
11.8 s ± 285 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, the lambda gives:
for col in cols_to_be_updated:
    df[col] = df[col].apply(lambda x: update_per_col[col][0][1] if x < update_per_col[col][0][0] else
                                      update_per_col[col][1][1] if x < update_per_col[col][1][0] else
                                      update_per_col[col][2][1] if x < update_per_col[col][2][0] else
                                      update_per_col[col][3])
8.91 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that the lambda is the fastest method here, but the list comprehension is not far off.
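Another vectorized route worth trying (not benchmarked here) is numpy.select, which evaluates all conditions at once instead of looping per element; a sketch reusing the update_per_col layout from above, shown on small made-up data:

```python
import numpy as np
import pandas as pd

update_per_col = {'RATIO1': [(.02, -7), (.37, -5), (.75, -3), 1]}
df = pd.DataFrame({'RATIO1': [0.01, 0.2, 0.5, 0.9]})

for col, updates in update_per_col.items():
    *pairs, default = updates
    # np.select picks, per row, the value of the first condition that holds
    conditions = [df[col] < threshold for threshold, _ in pairs]
    choices = [value for _, value in pairs]
    df[col] = np.select(conditions, choices, default=default)

print(df['RATIO1'].tolist())  # [-7, -5, -3, 1]
```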

use
dataset['CURRENT_RATIO'][dataset['CURRENT_RATIO'] < 10] = 3
dataset['CURRENT_RATIO'][dataset['CURRENT_RATIO'] < 5] = 2
The first line sets all values less than 10 to 3, then the second sets all values less than 5 to 2; values between 5 and 10 therefore remain 3.
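Note that chained indexing like dataset['CURRENT_RATIO'][mask] = ... can trigger SettingWithCopyWarning and silently fails under copy-on-write (the default in pandas 3.0). The same overwrite-broad-then-narrow idea written with .loc, as a sketch on made-up data:

```python
import pandas as pd

dataset = pd.DataFrame({'CURRENT_RATIO': [3, 7, 12]})

# apply the broader condition first, then overwrite with the narrower one
dataset.loc[dataset['CURRENT_RATIO'] < 10, 'CURRENT_RATIO'] = 3
dataset.loc[dataset['CURRENT_RATIO'] < 5, 'CURRENT_RATIO'] = 2
print(dataset['CURRENT_RATIO'].tolist())  # [2, 2, 12]
```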

It would be smart to write your own function and use pandas.Series.apply():
def my_lambda(x):
    if x < 5:
        x = 2
    elif x < 10:
        x = 3
    return x

dataset['CURRENT_RATIO'] = dataset['CURRENT_RATIO'].apply(my_lambda)


pandas split string and extract up to n index position

I have input data like the below stored in a dataframe column:
active_days_revenue
active_days_rate
total_revenue
gap_days_rate
I would like to do the below:
a) split the string on the _ delimiter
b) extract the first n elements after the split
So I tried the below:
df['text'].str.split('_')[:1]     # this doesn't work
df['text'].str.split('_').str[0]  # this works but returns only the 1st element
I expect my output like below. Instead of just getting the item at index position 0, I would like to get the tokens from index 0 up to index 1:
active_days
active_days
total_revenue
gap_days
You can use str.extract with a dynamic regex (fastest):
N = 2
df['out'] = df['text'].str.extract(fr'([^_]+(?:_[^_]+){{,{N-1}}})', expand=False)
Or slicing and agg:
df['out'] = df['text'].str.split('_').str[:2].agg('_'.join)
Or str.extractall and groupby.agg:
df['out'] = df['text'].str.extractall('([^_]+)')[0].groupby(level=0).agg(lambda x: '_'.join(x.head(2)))
Output:
text out
0 active_days_revenue active_days
1 active_days_rate active_days
2 total_revenue total_revenue
3 gap_days_rate gap_days
timings
On 4k rows:
# extract
2.17 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# split/slice/agg
3.56 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# extractall
361 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using the regex as a grouper for columns:
import re
N = 2
df1.groupby(lambda x: m.group() if (m:=re.match(fr'([^_]+(?:_[^_]+){{,{N-1}}})', x)) else x, axis=1, sort=False)
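The grouper above maps each column name to its truncated prefix; the mapping itself can be checked standalone (a sketch using the sample names from the question):

```python
import re

N = 2
# same pattern as above: up to N underscore-separated tokens
pat = re.compile(fr'[^_]+(?:_[^_]+){{,{N-1}}}')
cols = ['active_days_revenue', 'active_days_rate', 'total_revenue', 'gap_days_rate']
groups = [m.group() if (m := pat.match(c)) else c for c in cols]
print(groups)  # ['active_days', 'active_days', 'total_revenue', 'gap_days']
```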

Modify single value in pandas DataFrame with integer row and label column

I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.
Assume this DataFrame:
dateindex                    apples  oranges  bananas
2021-01-01 14:00:01.384624        1        X        3
2021-01-05 13:43:26.203773        4        5        6
2021-01-31 08:23:29.837238        7        8        9
2021-02-08 10:23:09.095632        0        1        2
import pandas as pd

data = [{'apples': 1, 'oranges': 'X', 'bananas': 3},
        {'apples': 4, 'oranges': 5, 'bananas': 6},
        {'apples': 7, 'oranges': 8, 'bananas': 9},
        {'apples': 0, 'oranges': 1, 'bananas': 2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
           pd.to_datetime('2021-01-05 13:43:26.203773'),
           pd.to_datetime('2021-01-31 08:23:29.837238'),
           pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.
I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Regarding:
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Yes, it is. And I agree, it is awkward. The old .ix indexer used to support these mixed indexing cases better, but its behaviour depended on the dtype of the axis, which made it inconsistent, and it has since been removed. In the meantime:
The other options, which have been used in the other answers, can all issue the SettingWithCopy warning. It's not guaranteed to raise the issue but it might, based on what the indexing criteria are and how values are assigned.
Referencing Combining positional and label-based indexing and starting with this df, which has dateindex as the index:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
Using both options:
with .loc or .at:
df.at[df.index[0], 'oranges'] = -50
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -50 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
with .iloc or .iat:
df.iat[0, df.columns.get_loc('oranges')] = -20
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -20 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].
Solution with Series.iat
If it doesn't seem too awkward to you, you can use the iat method of the pandas Series:
df["oranges"].iat[0] = 2
(Note that this is chained assignment: under copy-on-write, the default in pandas 3.0, it no longer updates the original DataFrame.)
Time performance comparison with other methods
As this method doesn't raise any warning, it can be interesting to compare its time performance with the other proposed solutions.
%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).
Let's try again with huge DataFrames
With a DatetimeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 2070-01-01 to 2243-10-16, a bit unrealistic
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a RangeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Since this is simple indexing with O(1) complexity, the size of the array barely changes the results, except for the "at + index" method, which, strangely enough, shows its worst performance on small dataframes. Thanks to the author wfaulk for spotting that using a RangeIndex decreases the access time of the "at + index" method. With a DatetimeIndex, pd.Series.iat remains the fastest, with constant access time.
You were actually quite close with your initial guess.
You would do it like this:
import pandas as pd
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
print(df)

# change the value of column 'a', row 2
df['a'][2] = 100
# print column 'a', row 2
print(df['a'][2])
There are lots of different variants such as loc and iloc; note, though, that df['a'][2] = 100 is chained assignment, which can raise SettingWithCopyWarning and does not work under pandas copy-on-write, so loc is generally the safer choice.
In the example we discovered that loc works where chained df[][] indexing throws an error:
import pandas as pd

data = [{'apples': 1, 'oranges': 'X', 'bananas': 3},
        {'apples': 4, 'oranges': 5, 'bananas': 6},
        {'apples': 7, 'oranges': 8, 'bananas': 9},
        {'apples': 0, 'oranges': 1, 'bananas': 2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
           pd.to_datetime('2021-01-05 13:43:26.203773'),
           pd.to_datetime('2021-01-31 08:23:29.837238'),
           pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
print(df)

df.loc['2021-01-01 14:00:01.384624', 'oranges'] = 10
# df['oranges'][0] = 10  # throws an error with this DatetimeIndex
print(df)
This works.
You can use the loc method, which takes the row label and column label you want to change.
Changing X to 2: df.loc[0, 'oranges'] = 2
Note that this only works when the index actually contains the label 0 (e.g. a default RangeIndex); with the DatetimeIndex above it raises a KeyError.
See: pandas.DataFrame.loc

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with a lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby(['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get the flag changes made on rdtRowsSampleGrouped to apply to rdtRows?
How do I apply the "DuplicateSample" changes back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
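A quick check of the duplicated approach on toy data (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Sample': [1, 2, 2, 3, 4, 4]})
# keep=False marks every member of a duplicate group, not just the repeats
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
print(df['DuplicateSample'].tolist())  # [False, True, True, False, True, True]
```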
Performance on some sample data (real-world results will differ, depending on the number of rows and of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Stef's solution is unfortunately ~2700x slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True

Pandas - check if dataframe has negative value in any column

I wonder how to check whether a pandas dataframe has a negative value in one or more columns and return a single boolean (True or False). Can you please help?
In[1]: df = pd.DataFrame(np.random.randn(10, 3))
In[2]: df
Out[2]:
0 1 2
0 -1.783811 0.736010 0.865427
1 -1.243160 0.255592 1.670268
2 0.820835 0.246249 0.288464
3 -0.923907 -0.199402 0.090250
4 -1.575614 -1.141441 0.689282
5 -1.051722 0.513397 1.471071
6 2.549089 0.977407 0.686614
7 -1.417064 0.181957 0.351824
8 0.643760 0.867286 1.166715
9 -0.316672 -0.647559 1.331545
Expected output:
Out[3]: True
Actually, if speed is important, I did a few tests:
df = pd.DataFrame(np.random.randn(10000, 30000))
Test 1, slowest: pure pandas
(df < 0).any().any()
# 303 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 2, faster: switching over to numpy with .values to test for the presence of a True entry
(df < 0).values.any()
# 269 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 3, maybe even faster, though not significantly: switching over to numpy for the whole thing
(df.values < 0).any()
# 267 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
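On recent pandas, df.to_numpy() is the documented replacement for the .values attribute, so the numpy variants can also be written as follows (a sketch on small fixed data; timings not re-measured):

```python
import pandas as pd

df = pd.DataFrame({'a': [0.5, -0.1], 'b': [1.2, 0.3]})
has_negative = (df.to_numpy() < 0).any()
print(bool(has_negative))  # True
```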
You can chain two any calls:
df.lt(0).any().any()
Out[96]: True
This does the trick:
(df < 0).any().any()
To break it down, (df < 0) gives a dataframe with boolean entries. Then the first .any() returns a series of booleans, testing within each column for the presence of a True value. And then, the second .any() asks whether this returned series itself contains any True value.
This returns a simple:
True

Find indices of where Pandas Series contains element containing character

Example:
import pandas as pd
arr = pd.Series(['a',['a','b'],'c'])
I would like to get the indices of where the series contains elements containing 'a'. So I would like to get back indices 0 and 1.
I've tried writing
arr.str.contains('a')
but this returns
0 True
1 NaN
2 False
dtype: object
while I'd like it to return
0 True
1 True
2 False
dtype: object
Use Series.str.join() to concatenate the lists/strings in each cell into a single string, then use .str.contains('a'):
In [78]: arr.str.join(sep='~').str.contains('a')
Out[78]:
0 True
1 True
2 False
dtype: bool
Use Series.apply and Python's in operator, which works on both lists and strings:
arr.apply(lambda x: 'a' in x)
This will work fine if you don't have any NaN values in your Series; if you do, you can use (assuming numpy is imported as np):
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
This is much faster than using Series.str.
Benchmarks:
%%timeit
arr.str.join(sep='~').str.contains('a')
Takes: 249 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x)
Takes: 70.1 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
Takes: 69 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
