I'm trying to work out how to apply a dictionary to an existing DataFrame column at certain indexes, i.e.:
Primary DataFrame:
   country  year  Corruption Index
0        X  2010               6.5
1        X  2015              78.0
So I have a dictionary (78 divided by 10, done separately):
x = {1: 7.8}
I want to change the value at index 1 in the 'Corruption Index' column.
I know I can use a for loop, but are there any faster options? Or maybe I can directly divide the values in the existing DataFrame that are higher than 10 by ten? (The problem is that the official statistics switch from a 1-10 scale to a 1-100 scale after 2012.)
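As an aside, the dictionary route itself doesn't need a loop: pandas can align a Series built from the dict against the DataFrame index. A minimal sketch, using the dict and column name from the question:
import pandas as pd
x = {1: 7.8}  # {row index: new value}
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption Index': [6.5, 78.0]})
# assign all dict values at their matching row labels in one call
df.loc[list(x), 'Corruption Index'] = pd.Series(x)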
The fastest solution is the answer from Mykola Zotko
For a vectorized solution, use pandas.DataFrame.where, which will replace values where the condition is False.
.where will be much faster than using .apply, e.g. df['Corruption_Index'].apply(lambda x: x/10 if x > 10 else x).
numpy.where is slightly different, in that it works on the True condition, and is faster than pandas.DataFrame.where, but requires importing numpy.
Since the issue is that the official statistics switch from a 1-10 scale to a 1-100 scale after 2012, m = df.year <= 2012 should be used as the condition.
import pandas as pd
import numpy as np
# test dataframe
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption_Index': [6.5, 78.0]})
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 78.0
# create a Boolean mask for the condition
m = df.year <= 2012
# use pandas.DataFrame.where to calculate on the False condition
df['Corruption_Index'].where(m, df['Corruption_Index'] / 10, inplace=True)
# alternatively, use np.where to calculate on the True condition
df['Corruption_Index'] = np.where(df.year > 2012, df['Corruption_Index'] / 10, df['Corruption_Index'])
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 7.8
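Note: on newer pandas versions (with copy-on-write), the inplace .where call on a selected column may not write back to the DataFrame. A plain assignment is a safe equivalent sketch using the same mask m:
df['Corruption_Index'] = df['Corruption_Index'].where(m, df['Corruption_Index'] / 10)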
Timing comparison
# test data with 1M rows
np.random.seed(365)
df = pd.DataFrame({'v': [np.random.randint(20) for _ in range(1000000)]})
%%timeit
df.loc[df['v'] > 10, 'v'] /= 10
[out]:
2.66 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
np.where(df['v'] > 10, df['v'] / 10, df['v'])
[out]:
8.11 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].where(df['v'] <= 10, df['v'] / 10)
[out]:
17.9 ms ± 615 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].apply(lambda x: x/10 if x > 10 else x)
[out]:
319 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
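Applied to the test DataFrame from above, this gives the same result (a quick check, not part of the original answer):
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption_Index': [6.5, 78.0]})
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
print(df)
#   country  year  Corruption_Index
# 0       X  2010               6.5
# 1       X  2015               7.8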
Related
I have a DataFrame with many columns containing float values. I want to delete a row if any of the columns has a value below 20.
code:
xdf = pd.DataFrame({'A': np.random.uniform(low=-50, high=53.3, size=5),
                    'B': np.random.uniform(low=10, high=130, size=5),
                    'C': np.random.uniform(low=-50, high=130, size=5),
                    'D': np.random.uniform(low=-100, high=200, size=5)})
xdf =
A B C D
0 -9.270533 42.098425 91.125009 148.350655
1 17.771411 55.564825 106.396381 -89.082831
2 -22.602563 99.330643 17.590466 73.985202
3 15.890920 76.011631 52.366311 194.023063
4 35.202379 41.973846 32.576890 100.523902
# my code
cols = ['A', 'B', 'C', 'D']
xdf[xdf[cols].ge(20).all(axis=1)]
Out[17]:
A B C D
4 35.202379 41.973846 32.57689 100.523902
Expected output: every row that has any value below 20 is dropped
xdf =
A B C D
4 35.202379 41.973846 32.576890 100.523902
Is this the best way of doing it?
To do it in numpy:
xdf = pd.DataFrame({'A': np.random.uniform(low=-50, high=53.3, size=5),
                    'B': np.random.uniform(low=10, high=130, size=5),
                    'C': np.random.uniform(low=-50, high=130, size=5),
                    'D': np.random.uniform(low=-100, high=200, size=5)})
%timeit xdf[xdf[['A','B','C','D']].ge(20).all(axis=1)]
%timeit xdf[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
705 µs ± 277 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
460 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you do not want to keep result in DataFrame this can even be faster:
xdf.values[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
As numpy is lighter and therefore faster for purely numerical work, try this (note the transpose, so that A, B, C and D end up as columns rather than rows):
a = np.array([np.random.uniform(low=-50, high=53.3, size=5),
              np.random.uniform(low=10, high=130, size=5),
              np.random.uniform(low=-50, high=130, size=5),
              np.random.uniform(low=-100, high=200, size=5)]).T  # shape (5, 4)
print(a[np.all(a > 20, axis=1)])  # keep rows where every column is above 20
If you want to stick with pandas, another idea would be:
xdfFiltered = xdf.loc[(xdf["A"] > 20) & (xdf["B"] > 20) & (xdf["C"] > 20) & (xdf["D"] > 20)]
You can use the numpy equivalent of .ge instead:
xdf.loc[np.greater(xdf,20).all(axis=1)]
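Since np.greater(xdf, 20) is just the element-wise form of xdf > 20, the same filter can also be written without the explicit numpy call (a sketch):
xdf.loc[(xdf > 20).all(axis=1)]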
I'm working in a pandas DataFrame, trying to clean up some data, and I want to apply multiple rules to a certain column. If the value is greater than 500, I want to drop the row. If the value is between 101 and 500, I want to replace it with 100. If the value is less than 101, I want to keep it as it is.
I'm able to do it in two lines of code, but I was curious whether there's a cleaner, more efficient way to do this. I tried an if/elif/else and also a lambda function, but I couldn't get either to run.
# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
You can use .loc with a boolean mask instead of .drop() with an index, together with the fast numpy function numpy.where(), to get better performance, as follows:
import numpy as np
df2 = df.loc[df['Percent'] <= 500].copy()   # .copy() avoids SettingWithCopyWarning on the next line
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
Performance Comparison:
Part 1: Original size dataframe
Old Codes:
%%timeit
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
New Codes:
%%timeit
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmarking result:
The new code takes 784 µs while the old code takes 1.58 ms:
Around 2x faster
Part 2: Large size dataframe
Let's use a dataframe 10000 times the original size:
df9 = pd.concat([df] * 10000, ignore_index=True)
Old Codes:
%%timeit
df9.drop(df9[df9.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df9['Percent'] = df9['Percent'].clip(upper = 100)
3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New Codes:
%%timeit
df2 = df9.loc[df9['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Benchmarking result:
The new code takes 1.96 ms while the old code takes 3.87 ms:
Also around 2x faster
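For readability, the same two steps can also be kept close to the original .clip idea (a sketch, not benchmarked here):
df2 = df.loc[df['Percent'] <= 500].copy()        # keep rows with Percent <= 500
df2['Percent'] = df2['Percent'].clip(upper=100)  # cap the remaining values at 100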
I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.
Assume this DataFrame:
                            apples oranges  bananas
dateindex
2021-01-01 14:00:01.384624       1       X        3
2021-01-05 13:43:26.203773       4       5        6
2021-01-31 08:23:29.837238       7       8        9
2021-02-08 10:23:09.095632       0       1        2
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.
I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Regarding:
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Yes, it is. And I agree, it is awkward. The old .ix used to support these mixed indexing cases better, but its behaviour depended on the dtype of the axis, which made it inconsistent. In the meantime:
The other options, which have been used in the other answers, can all issue the SettingWithCopy warning. It's not guaranteed to raise the issue but it might, based on what the indexing criteria are and how values are assigned.
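For illustration, the classic pattern that triggers it looks like this (a sketch, not one of the other answers' exact lines): selecting with a boolean mask returns a copy, so assigning into that copy warns and leaves df unchanged.
df[df['oranges'] == 'X']['oranges'] = 2  # SettingWithCopyWarning; df is not modified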
Referencing "Combining positional and label-based indexing" from the pandas indexing docs, and starting with this df, which has dateindex as the index:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
Using both options:
with .loc or .at:
df.at[df.index[0], 'oranges'] = -50
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -50 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
with .iloc or .iat:
df.iat[0, df.columns.get_loc('oranges')] = -20
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -20 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].
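Demonstrated on the example frame, both multi-row forms from the previous paragraph look like this (a sketch):
# approach 1: label-based, row labels looked up positionally via df.index
df.loc[df.index[[0, 2]], 'oranges'] = 0
# approach 2: fully positional, column positions looked up via get_indexer
df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])] = 0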
Solution with Series.iat
If it doesn't seem more awkward to you, you can use the iat method of pandas Series:
df["oranges"].iat[0] = 2
Time performance comparison with other methods
As this method doesn't raise any warning, it can be interesting to compare its time performance with other proposed solutions.
%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).
Let's try again with huge DataFrames
With a DatetimeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 1970-01-01 to 2243-10-16, a bit unrealistic
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a RangeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Since this is simple element access with O(1) complexity, the size of the DataFrame barely changes the results. The exception is the "at + index" approach: strangely enough, it shows its worst timings with the DatetimeIndex. Thanks to wfaulk for spotting that switching to a RangeIndex brings the access time of the "at + index" method down, while with a DatetimeIndex it stays higher and constant; pd.Series.iat is fast and constant regardless of the index type.
You were actually quite close with your initial guess.
You would do it like this:
import pandas as pd
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pd.DataFrame(mydict)
print(df)
# change the value of column a, row 2
df['a'][2] = 100
# print column a, row 2
print(df['a'][2])
There are lots of different variants such as loc and iloc, but this is one good method.
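Bear in mind that df['a'][2] is chained indexing; a single .loc call does the same update in one step and avoids any SettingWithCopy issues (a sketch on the same toy frame):
df.loc[2, 'a'] = 100   # row label 2, column 'a'
print(df.loc[2, 'a'])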
In the example we found that loc is the way to go, as the chained df[...][...] pattern throws an error:
import pandas as pd
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
print(df)
df.loc['2021-01-01 14:00:01.384624','oranges'] = 10
# df['oranges'][0] = 10
print(df)
This works.
You can use the loc method. It receives the row and column you want to change.
Changing X to 2: df.loc[0, 'oranges'] = 2 (note that .loc is label-based, so this only works when the row label really is 0; with the DatetimeIndex in the question you would pass the timestamp, or df.index[0] as in the accepted answer)
See: pandas.DataFrame.loc
I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I apply the "DuplicateSample" flag back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if you need a faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
Performance on some sample data (real results will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Stef's transform(lambda) solution is unfortunately about 2734x slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
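For comparison, the faster Series.duplicated approach from the earlier answer produces the same flags on this sample (a quick check):
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
print(df)
#    Sample  DuplicateSample
# 0       1            False
# 1       2             True
# 2       2             True
# 3       3            False
# 4       4             True
# 5       4             True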
I wonder how to check whether a pandas DataFrame has a negative value in one or more columns and return only a boolean (True or False). Can you please help?
In[1]: df = pd.DataFrame(np.random.randn(10, 3))
In[2]: df
Out[2]:
0 1 2
0 -1.783811 0.736010 0.865427
1 -1.243160 0.255592 1.670268
2 0.820835 0.246249 0.288464
3 -0.923907 -0.199402 0.090250
4 -1.575614 -1.141441 0.689282
5 -1.051722 0.513397 1.471071
6 2.549089 0.977407 0.686614
7 -1.417064 0.181957 0.351824
8 0.643760 0.867286 1.166715
9 -0.316672 -0.647559 1.331545
Expected output:
Out[3]: True
Actually, if speed is important, I did a few tests:
df = pd.DataFrame(np.random.randn(10000, 30000))
Test 1, slowest: pure pandas
(df < 0).any().any()
# 303 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 2, faster: switching over to numpy with .values for testing the presence of a True entry
(df < 0).values.any()
# 269 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 3, maybe even faster, though not significant: switching over to numpy for the whole thing
(df.values < 0).any()
# 267 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can chain two any calls:
df.lt(0).any().any()
Out[96]: True
This does the trick:
(df < 0).any().any()
To break it down, (df < 0) gives a dataframe with boolean entries. Then the first .any() returns a series of booleans, testing within each column for the presence of a True value. And then, the second .any() asks whether this returned series itself contains any True value.
This returns a simple:
True
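To make the two steps concrete, here is the intermediate per-column result on a tiny hand-made frame (a sketch, separate from the random data above):
small = pd.DataFrame({'a': [1.0, -2.0], 'b': [3.0, 4.0]})
(small < 0).any()        # per column:  a     True
                         #              b    False
(small < 0).any().any()  # collapses the per-column results into a single bool: True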