I have a pandas DataFrame like this:
a b
2011-01-01 00:00:00 1.883381 -0.416629
2011-01-01 01:00:00 0.149948 -1.782170
2011-01-01 02:00:00 -0.407604 0.314168
2011-01-01 03:00:00 1.452354 NaN
2011-01-01 04:00:00 -1.224869 -0.947457
2011-01-01 05:00:00 0.498326 0.070416
2011-01-01 06:00:00 0.401665 NaN
2011-01-01 07:00:00 -0.019766 0.533641
2011-01-01 08:00:00 -1.101303 -1.408561
2011-01-01 09:00:00 1.671795 -0.764629
Is there an efficient way to find the "integer" index of rows with NaNs? In this case the desired output should be [3, 6].
Here is a simpler solution:
inds = pd.isnull(df).any(axis=1).nonzero()[0]
In [9]: df
Out[9]:
0 1
0 0.450319 0.062595
1 -0.673058 0.156073
2 -0.871179 -0.118575
3 0.594188 NaN
4 -1.017903 -0.484744
5 0.860375 0.239265
6 -0.640070 NaN
7 -0.535802 1.632932
8 0.876523 -0.153634
9 -0.686914 0.131185
In [10]: pd.isnull(df).any(axis=1).nonzero()[0]
Out[10]: array([3, 6])
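Note that in recent pandas versions Series.nonzero() has been removed (and any() wants the explicit axis keyword), so here is a minimal sketch of an equivalent call using np.flatnonzero, reusing the df above:
import numpy as np
import pandas as pd

# np.flatnonzero works on the boolean mask directly, so it does not rely
# on the removed Series.nonzero() method.
inds = np.flatnonzero(pd.isnull(df).any(axis=1))
# array([3, 6])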
For DataFrame df:
import numpy as np
index = df['b'].index[df['b'].apply(np.isnan)]
will give you back the DatetimeIndex that you can use to index back into df, e.g.:
df['a'].loc[index[0]]
>>> 1.452354
For the integer index:
df_index = df.index.values.tolist()
[df_index.index(i) for i in index]
>>> [3, 6]
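If you are on a reasonably recent pandas, Index.get_indexer gives a shorter route from labels back to integer positions; a small sketch reusing the index variable from above:
# positions of the NaN labels within df's full index
positions = df.index.get_indexer(index)
# array([3, 6])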
A one-line solution. However, it works for one column only.
df.loc[pd.isna(df["b"]), :].index
And just in case you want to find the coordinates of NaN for all the columns instead (assuming they are all numerical), here you go:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
df
0 1 2 3 4 5
0 0 1 3 4.0 NaN 2
1 3 5 6 NaN 3.0 3
np.where(np.asanyarray(np.isnan(df)))
(array([0, 1]), array([4, 3]))
Don't know if this is too late, but you can use np.where to find the indices of NaN values like so:
indices = list(np.where(df['b'].isna())[0])
In case you have a datetime index and you want the index values:
df.loc[pd.isnull(df).any(axis=1), :].index.values
Here are tests for a few methods:
%timeit np.where(np.isnan(df['b']))[0]
%timeit pd.isnull(df['b']).nonzero()[0]
%timeit np.where(df['b'].isna())[0]
%timeit df.loc[pd.isna(df['b']), :].index
And their corresponding timings:
333 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
280 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
313 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.84 ms ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It would appear that pd.isnull(df['b']).nonzero()[0] wins the day in terms of timing, but any of the top three methods have comparable performance.
Another simple solution is list(np.where(df['b'].isnull())[0])
This will give you the index labels of the rows with NaN in any column:
df.loc[pd.isna(df).any(axis=1), :].index
Here is another simpler take:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
inds = np.asarray(df.isnull()).nonzero()
(array([0, 1], dtype=int64), array([4, 3], dtype=int64))
I was looking for all indexes of rows with NaN values.
My working solution:
def get_nan_indexes(data_frame):
    indexes = []
    for column in data_frame:
        # index labels of the NaN entries in this column
        index = data_frame[column].index[data_frame[column].apply(np.isnan)]
        indexes.extend(index)
    df_index = data_frame.index.values.tolist()
    # map the collected labels back to integer positions
    return [df_index.index(i) for i in set(indexes)]
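For large frames, a vectorized sketch of the same idea (my variant, assuming numeric columns as above) avoids the per-column Python loop:
import numpy as np

def get_nan_indexes_vectorized(data_frame):
    # integer positions of all rows that contain at least one NaN
    return np.flatnonzero(data_frame.isna().any(axis=1)).tolist()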
Let the DataFrame be named df and the column of interest (i.e. the column in which we are trying to find nulls) be 'b'. Then the following snippet prints the desired indices of nulls in the DataFrame:
for i in range(df.shape[0]):
    if df['b'].isnull().iloc[i]:
        print(i)
index_nan = []
for index, bool_v in df["b"].isna().items():
    if bool_v:
        index_nan.append(index)
print(index_nan)
I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.
Assume this DataFrame:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.
I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
With regard to:
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Yes, it is. And I agree, it is awkward. The old .ix indexer used to support these mixed indexing cases better, but its behaviour depended on the dtype of the axis, making it inconsistent. In the meantime:
The other options, which have been used in the other answers, can all issue the SettingWithCopy warning. It's not guaranteed to raise the warning, but it might, depending on the indexing criteria and how the values are assigned.
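As a minimal illustration (my own toy example, not taken from the original answer), this is the kind of chained pattern that can trigger the warning, next to the unambiguous single-step assignment:
import pandas as pd

df_demo = pd.DataFrame({'apples': [1, 4], 'oranges': ['X', 5]})

# Chained pattern: the intermediate object may be a view or a copy, so the
# second step can raise SettingWithCopyWarning and may not update df_demo.
sub = df_demo[df_demo['apples'] > 0]
sub['oranges'] = 0

# Single indexing step: unambiguous, no warning.
df_demo.at[df_demo.index[0], 'oranges'] = 2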
Referencing Combining positional and label-based indexing and starting with this df, which has dateindex as the index:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
Using both options:
with .loc or .at:
df.at[df.index[0], 'oranges'] = -50
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -50 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
with .iloc or .iat:
df.iat[0, df.columns.get_loc('oranges')] = -20
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -20 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].
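A runnable sketch of those two multi-row variants, reusing the DataFrame built in the question:
import pandas as pd

data = [{'apples': 1, 'oranges': 'X', 'bananas': 3},
        {'apples': 4, 'oranges': 5, 'bananas': 6},
        {'apples': 7, 'oranges': 8, 'bananas': 9},
        {'apples': 0, 'oranges': 1, 'bananas': 2}]
idx = pd.Index(pd.to_datetime(['2021-01-01 14:00:01.384624',
                               '2021-01-05 13:43:26.203773',
                               '2021-01-31 08:23:29.837238',
                               '2021-02-08 10:23:09.095632']), name='dateindex')
df = pd.DataFrame(data, index=idx)

# Approach #1: label-based .loc, with rows picked by position via df.index
df.loc[df.index[[0, 2]], 'oranges'] = -50

# Approach #2: position-based .iloc, with columns translated by get_indexer
df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])] = -20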
Solution with Series.iat
If it doesn't seem more awkward to you, you can use the iat method of pandas Series:
df["oranges"].iat[0] = 2
Time performance comparison with other methods
As this method doesn't raise any warning, it can be interesting to compare its time performance with other proposed solutions.
%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).
Let's try again with huge DataFrames
With a DatetimeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 1970-01-01 to 2243-10-16, a bit unrealistic
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a RangeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Since this is simple indexing with O(1) complexity, the size of the array doesn't change the results much, except for the "at + index" method; strangely enough, it shows worse performance with small DataFrames. Thanks to the author wfaulk for spotting that using a RangeIndex decreases the access time of the "at + index" method. pd.Series.iat remains the fastest, and its timing stays constant even with a DatetimeIndex.
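One caveat worth adding (my note, not part of the original benchmark): with copy-on-write enabled, which is opt-in in pandas 2.x and slated to become the default, chained assignment such as df["oranges"].iat[0] = 2 no longer updates the parent DataFrame, so the df.at / df.iat forms above are the safer long-term choice. A small sketch:
import pandas as pd

pd.options.mode.copy_on_write = True  # opt-in on pandas 2.x

df_cow = pd.DataFrame({'oranges': ['X', 5, 8, 1]})

# Under copy-on-write this chained assignment does not modify df_cow
# (pandas emits a chained-assignment warning instead).
df_cow['oranges'].iat[0] = 2

# The single-step forms still update the DataFrame as expected.
df_cow.at[df_cow.index[0], 'oranges'] = 2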
You were actually quite close with your initial guess.
You would do it like this:
import pandas as pd
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pd.DataFrame(mydict)
print(df)
# change the value of column a, row 2
df['a'][2] = 100
# print column a, row 2
print(df['a'][2])
There are lots of different variants such as loc and iloc, but this is one good method.
In the example we discovered that loc was optimal as df[][] throws an error:
import pandas as pd
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
print(df)
df.loc['2021-01-01 14:00:01.384624','oranges'] = 10
# df['oranges'][0] = 10
print(df)
This works.
You can use the loc method. It takes the row label and the column label you want to change.
Changing X to 2: df.loc[0, 'oranges'] = 2 (note that loc is label-based, so 0 only works with a default RangeIndex; with the question's DatetimeIndex you would pass the actual timestamp label).
See: pandas.DataFrame.loc
I have a df which contains categorical and numerical data
df = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Address':['Oxford', 'Cambridge', 'Xianjiang', 'Wuhan'],
'Age':[20, 21, 19, 18],
'Weight':[50, 61, 69, 78]}
df = pd.DataFrame(df)
I need to randomly replace 50% of the values in each column with NaN, so that roughly half of each column ends up missing.
How can I do that with the most efficient technique? I have a large number of rows and columns, and I'll do many repetitions.
Use apply with sample
df_final = df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
Out[175]:
Name Address Age Weight
0 Tom NaN NaN 50.0
1 NaN NaN NaN 61.0
2 krish Xianjiang 19.0 NaN
3 NaN Wuhan 18.0 NaN
Improving on the performance of the previous answers by roughly a factor of three, and mostly inspired by @jezrael, I suggest using argpartition instead of argsort, since the full sort is unnecessary:
df1 = df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
print(df1)
Name Address Age Weight
0 NaN Oxford NaN 50.0
1 nick Cambridge 21.0 61.0
2 NaN NaN NaN NaN
3 jack NaN 18.0 NaN
Performance comparison
# Reusing the same comparison dataset
df = pd.concat([df] * 50000, ignore_index=True)
df = pd.concat([df] * 50, ignore_index=True, axis=1)
# #Andy's answer, using apply and sample
%timeit df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
9.72 s ± 532 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# #jezrael's answer, based on mask, np random and argsort
%timeit df.mask(np.random.rand(*df.shape).argsort(axis=0) >= df.shape[0] // 2)
8.23 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, based on mask, np random and argpartition
%timeit df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
2.54 s ± 98.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can also be done by drawing random numbers in the range of your row positions, looping over them, and using each as an index to replace with NaN.
For example, if you have 10 rows, generate random integers in the range 0 to 9 and use each result as a row index to set to NaN; a rough sketch of this idea follows.
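A rough sketch of that loop-based idea (restricted to numeric columns to keep dtype handling simple; the vectorized answers above are preferable for large frames):
import numpy as np
import pandas as pd

df_num = pd.DataFrame({'Age': [20.0, 21.0, 19.0, 18.0],
                       'Weight': [50.0, 61.0, 69.0, 78.0]})

rng = np.random.default_rng(0)
for col in df_num.columns:
    # pick half of the row positions at random, without replacement
    rows = rng.choice(len(df_num), size=len(df_num) // 2, replace=False)
    df_num.iloc[rows, df_num.columns.get_loc(col)] = np.nan
print(df_num)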
I have two series;
energy_dict['QLD'] =
Timestamp
2017-04-27 00:00:00 523.720765
2017-04-27 01:00:00 512.180608
2017-04-27 02:00:00 519.076642
2017-04-27 03:00:00 516.329201
2017-04-27 04:00:00 525.150158
... ...
Freq: H, Name: QLD Total Energy (MWh), Length: 8760, dtype: float64
and
Incoming_Flow =
Timestamp
2017-04-27 00:00:00 -8.961111
2017-04-27 01:00:00 9.503472
2017-04-27 02:00:00 -10.776389
2017-04-27 03:00:00 1.451389
2017-04-27 04:00:00 -10.388195
... ...
Freq: H, Name: METEREDMWFLOW N-Q-MNSP1, Length: 8760, dtype: float64
I would like to add them together, but only when the second one is larger than zero. What is the best way to do this?
I am aware that I could do something like this;
Incoming_Flow[Incoming_Flow < 0 ] = 0
but I would like to be able to do it all in one line
Use Series.add with Series.mask:
s = energy_dict['QLD'].add(Incoming_Flow.mask(Incoming_Flow < 0, 0), fill_value=0)
print (s)
0 523.720765
1 521.684080
2 519.076642
3 517.780590
4 525.150158
dtype: float64
print (Incoming_Flow.mask(Incoming_Flow < 0, 0))
0 0.000000
1 9.503472
2 0.000000
3 1.451389
4 0.000000
Name: METEREDMWFLOW N-Q-MNSP1, dtype: float64
Or filter Series and use parameter fill_value=0:
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
s = energy_dict['QLD'].add(Incoming_Flow[Incoming_Flow > 0], fill_value=0)
print (s)
0 523.720765
1 521.684080
2 519.076642
3 517.780590
4 525.150158
dtype: float64
Detail:
print (Incoming_Flow[Incoming_Flow > 0])
1 9.503472
3 1.451389
Name: METEREDMWFLOW N-Q-MNSP1, dtype: float64
EDIT:
If performance is important, use numpy.where:
s = pd.Series(np.where(Incoming_Flow < 0, 0, Incoming_Flow ), index=Incoming_Flow.index)
#if DatetimeIndex values are same in both Series
s = np.where(Incoming_Flow < 0, 0, Incoming_Flow )
energy_dict['QLD'].add(s, fill_value=0)
You could also use Series.add and Series.where:
s = energy_dict['QLD'].add(Incoming_Flow.where(Incoming_Flow.gt(0), 0))
This is also ~18% faster than the mask solution if performance is important:
Proof:
s1 = pd.Series(np.arange(50000))
s2 = pd.Series(np.random.randint(-4, 10,50000))
%timeit s1.add(s2.mask(s2 < 0, 0), fill_value=0)
1.17 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s1.add(s2[s2 > 0], fill_value=0)
4.68 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s1.add(s2.where(s2.gt(0), 0))
988 µs ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Faster using numpy add and where
import numpy as np
qld = [523.720765, 512.180608, 519.076642, 516.329201, 525.150158]
flow = [ -8.961111, 9.503472, -10.776389, 1.451389, -10.388195]
df1 = pd.DataFrame(qld, columns=['QLD'])
df2 = pd.DataFrame(flow, columns=['Incoming_Flow'])
s = np.add(df1['QLD'], np.where(df2['Incoming_Flow'] > 0, df2['Incoming_Flow'], 0))
print(s)
0 523.720765
1 521.684080
2 519.076642
3 517.780590
4 525.150158
Timings:
s1 = pd.Series(np.arange(50000))
s2 = pd.Series(np.random.randint(-4, 10,50000))
%timeit s1.add(s2.where(s2.gt(0), 0))
890 µs ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add(s1, np.where(s2 > 0, s2, 0))
367 µs ± 6.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am trying to use the values within the current DataFrame as the "Index" and the DataFrame's index as the "Labels". For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop to check and apply the necessary labels/values, but with millions of labels to mark, this process becomes extremely time consuming. Is there a smarter and quicker way to do this? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
I came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion of the filtered result's index yields integer labels:
out = s[s.index.notnull()].sort_index()
out.index = out.index.astype(int)
A similar solution (slightly faster, depending on your data), which also results in an integer index, is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have two columns with strings. I would like to combine them and ignore NaN values, such that:
ColA, ColB, ColA+ColB
str str strstr
str nan str
nan str str
I tried df['ColA+ColB'] = df['ColA'] + df['ColB'] but that creates a nan value if either column is nan. I've also thought about using concat.
I suppose I could just go with that, and then use some df.ColA+ColB[df[ColA] = nan] = df[ColA] but that seems like quite the workaround.
Call fillna and pass an empty str as the fill value and then sum with param axis=1:
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN]})
df
Out[3]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [7]:
df['a+b'] = df.fillna('').sum(axis=1)
df
Out[7]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
You could fill the NaN with an empty string:
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
Using apply and str.cat, you can do:
In [723]: df
Out[723]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [724]: df['a+b'] = df.apply(lambda x: x.str.cat(sep=''), axis=1)
In [725]: df
Out[725]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
In my case, I wanted to join more than 2 columns together with a separator (a+b+c)
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN], 'c':['as',np.NaN ,'ds']})
In [4]: df
Out[4]:
a b c
0 asd asdas as
1 NaN asdas NaN
2 asdsa NaN ds
The following syntax worked for me:
In [5]: df['d'] = df[['a', 'b', 'c']].fillna('').agg('|'.join, axis=1)
In [6]: df
Out[6]:
a b c d
0 asd asdas as asd|asdas|as
1 NaN asdas NaN |asdas|
2 asdsa NaN ds asdsa||ds
Prefer adding the columns over using the apply method, because it is much faster than apply.
Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use apply
%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using sum (which concatenates strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
see here for more tests
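For anyone who wants to re-run a comparison like this, here is a self-contained sketch on synthetic data (the original benchmark used the answerer's own bio / procedure_codes / eventcol columns, which are not shown, so the column names below are only placeholders):
import numpy as np
import pandas as pd

n = 100_000
df_bench = pd.DataFrame({'bio': np.random.choice(['asd', 'qwe', 'zxc'], n),
                         'procedure_codes': np.random.choice(['A1', 'B2', 'C3'], n)})
eventcol = ['bio', 'procedure_codes']

direct = df_bench.bio + df_bench.procedure_codes                          # plain + on two columns
applied = df_bench[eventcol].apply(''.join, axis=1)                       # row-wise apply
catted = df_bench[eventcol[0]].str.cat(df_bench[eventcol[1:]], sep=',')   # vectorized str.cat
summed = df_bench[eventcol].sum(axis=1)                                   # sum concatenates strings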