pandas combine two strings ignore nan values - python

I have two columns with strings. I would like to combine them and ignore NaN values, such that:
ColA   ColB   ColA+ColB
str    str    strstr
str    NaN    str
NaN    str    str
I tried df['ColA+ColB'] = df['ColA'] + df['ColB'], but that creates a NaN value if either column is NaN. I've also thought about using concat.
I suppose I could just go with that and then patch up the NaN rows afterwards with something like df.loc[df['ColA'].isna(), 'ColA+ColB'] = df['ColB'], but that seems like quite the workaround.

Call fillna and pass an empty string as the fill value, then sum with param axis=1:
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN]})
df
Out[3]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [7]:
df['a+b'] = df.fillna('').sum(axis=1)
df
Out[7]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa

You could fill the NaN with an empty string:
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
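As a quick, self-contained sketch of this approach using the column names from the question (the data below is just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': ['foo', 'bar', np.nan],
                   'ColB': ['baz', np.nan, 'qux']})
# NaNs become empty strings, so plain + concatenation keeps the other value.
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
print(df)
#   ColA ColB ColA+ColB
# 0  foo  baz    foobaz
# 1  bar  NaN       bar
# 2  NaN  qux       qux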

Using apply and str.cat, you can do this row-wise:
In [723]: df
Out[723]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [724]: df['a+b'] = df.apply(lambda x: x.str.cat(sep=''), axis=1)
In [725]: df
Out[725]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa

In my case, I wanted to join more than 2 columns together with a separator (a+b+c)
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN], 'c':['as',np.NaN ,'ds']})
In [4]: df
Out[4]:
a b c
0 asd asdas as
1 NaN asdas NaN
2 asdsa NaN ds
The following syntax worked for me:
In [5]: df['d'] = df[['a', 'b', 'c']].fillna('').agg('|'.join, axis=1)
In [6]: df
Out[6]:
a b c d
0 asd asdas as asd|asdas|as
1 NaN asdas NaN |asdas|
2 asdsa NaN ds asdsa||ds

Prefer adding the columns directly over the apply method, because it is much faster.
Just add the two columns (if you know they are strings):
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use apply
%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using sum (which concatenate strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
see here for more tests
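The columns timed above (bio, procedure_codes, eventcol) come from the answerer's own dataset. A minimal, self-contained sketch of the same comparison on a toy frame (column names a and b are illustrative), with fillna added so NaNs don't poison the direct addition, might look like this; absolute timings will differ:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['asd', np.nan, 'asdsa'] * 100_000,
                   'b': ['asdas', 'asdas', np.nan] * 100_000})

# Vectorised addition (fastest); fillna('') keeps NaN from propagating.
%timeit df['a'].fillna('') + df['b'].fillna('')

# Pandas string method with na_rep to handle missing values.
%timeit df['a'].str.cat(df['b'], sep=',', na_rep='')

# Row-wise apply (slowest of the three).
%timeit df.fillna('').apply(lambda row: ''.join(row), axis=1)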

Related

Fast way to get index of non-blank values in row/column

Let's say we have the following pandas dataframe:
df = pd.DataFrame({'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: None, 2: 8.0}, 'c': {0: 4.0, 1: 2.0, 2: 6.0}})
a b c
0 3.0 10.0 4.0
1 2.0 NaN 2.0
2 NaN 8.0 6.0
I need to get a dataframe with, for each row, the column names of all non-NaN values.
I know I can do the following, which produces the expected output:
df2 = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)
0 1 2
0 a b c
1 a c NaN
2 b c NaN
Unfortunately, this is quite slow with large datasets. Is there a faster way?
Getting the row indices of non-Null values of each column could work too, as I would just need to transpose the input dataframe. Thanks.
Use numpy:
# Boolean mask: True where the cell is not NaN.
m = df.notna()
# Non-NaN cells get their column name; NaN cells become NaN via where(m).
a = m.mul(df.columns).where(m).to_numpy()
# Per row, reorder so the non-NaN column names come first (argsort on ~m
# pushes the NaN positions to the end of each row).
out = pd.DataFrame(a[np.arange(len(a))[:, None], np.argsort(~m, axis=1)],
                   index=df.index)
Output:
0 1 2
0 a b c
1 a c NaN
2 b c NaN
timings
On 30k rows x 3 columns:
# numpy approach
6.82 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas apply
7.32 s ± 553 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to put NaN in Pandas Dataframe efficiently?

I have a df which contains of categorical and numerical data
df = {'Name': ['Tom', 'nick', 'krish', 'jack'],
      'Address': ['Oxford', 'Cambridge', 'Xianjiang', 'Wuhan'],
      'Age': [20, 21, 19, 18],
      'Weight': [50, 61, 69, 78]}
df = pd.DataFrame(df)
I need to randomly replace 50% of the values in each column with NaN, so the result might look like the example output in the answers below.
How can I do that with the most efficient technique? I have a large number of rows and columns, and I'll be doing many repetitions.
Use apply with sample
df_final = df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
Out[175]:
Name Address Age Weight
0 Tom NaN NaN 50.0
1 NaN NaN NaN 61.0
2 krish Xianjiang 19.0 NaN
3 NaN Wuhan 18.0 NaN
Improving the performance of the previous answers by roughly a factor of three, and mostly inspired by #jezrael's approach, I suggest using argpartition instead of argsort, since the full sort performed there is unnecessary:
df1 = df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
print(df1)
Name Address Age Weight
0 NaN Oxford NaN 50.0
1 nick Cambridge 21.0 61.0
2 NaN NaN NaN NaN
3 jack NaN 18.0 NaN
Performance comparison
# Reusing the same comparison dataset
df = pd.concat([df] * 50000, ignore_index=True)
df = pd.concat([df] * 50, ignore_index=True, axis=1)
# #Andy's answer, using apply and sample
%timeit df.apply(lambda x: x.sample(frac=0.5)).reindex(df.index)
9.72 s ± 532 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# #jezrael's answer, based on mask, np random and argsort
%timeit df.mask(np.random.rand(*df.shape).argsort(axis=0) >= df.shape[0] // 2)
8.23 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, based on mask, np random and argpartition
%timeit df.mask(np.random.rand(*df.shape).argpartition(0, axis=0) >= df.shape[0] // 2)
2.54 s ± 98.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can also be done by drawing random numbers in the range of your row count, looping over them, and using each number as a row index at which to place NaN.
For example: if you have 10 rows, generate random integers in the range 0 to 9 and use each result as an index to replace with NaN, as in the sketch below.
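As a hedged sketch of this loop-based idea (assuming, as in the question, that roughly half of each column should become NaN; the helper name nan_out_half is just illustrative):
import numpy as np
import pandas as pd

def nan_out_half(df, seed=None):
    """Return a copy of df with roughly half of each column set to NaN."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows = len(out)
    for col in out.columns:
        # Draw half of the row positions at random, without replacement.
        rows = rng.choice(n_rows, size=n_rows // 2, replace=False)
        mask = np.zeros(n_rows, dtype=bool)
        mask[rows] = True
        # where() keeps values where the condition is True and puts NaN
        # elsewhere, upcasting integer columns to float as needed.
        out[col] = out[col].where(~mask)
    return out

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Age': [20, 21, 19, 18]})
print(nan_out_half(df, seed=0))
The vectorised mask/argsort/argpartition answers above will be considerably faster on large frames; the explicit loop mainly trades speed for readability.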

Pandas Convert Dataframe Values to Labels

I am trying to use the values within the current dataframe as the "Index" and the dataframe's index as the "Labels". For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I managed to do this using a loop to check and apply the necessary labels/values but with millions of labels to mark this process becomes extremely time consuming. Is there a way to do this in a smarter and quicker way? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion results in an integer index:
s.index = s.index.astype(int)
A similar (slightly faster depending on your data) solution which also results in an integer index is performing the filtering before converting to Series -
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Get first letter of a string from column

I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame containing the first digit of the values in column 'First':
a) change the numbers in column 'First' to strings
b) extract the first character from the newly created strings
c) save the results from b as a new column in the data frame
I don't know how to apply this to the pandas data frame object. I would be grateful for help with that.
Cast the dtype of the column to str and you can perform vectorised slicing using the str accessor:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
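A minimal sketch of that round trip, continuing from the frame above:
# Cast the extracted character back to an integer dtype.
df['new_col'] = df['new_col'].astype(int)
df.dtypes
# First      int64
# Second     int64
# new_col    int64
# dtype: object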
.str.get
This is the simplest of the string accessor methods for grabbing a single character:
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read:object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
For numeric columns, an .astype(str) conversion is required beforehand, as shown in #Ed Chum's answer.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
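A quick, self-contained demonstration of the caveat noted above (the frame name demo is just illustrative): astype(str) turns NaN into the literal string 'nan', so .str[0] returns 'n' rather than NaN.
import numpy as np
import pandas as pd

demo = pd.DataFrame({'B': [123, np.nan]})
demo['B'].astype(str).str[0]
# 0    1
# 1    n
# Name: B, dtype: object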
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are roughly 3-4x faster here.

Find integer index of rows with NaN in pandas dataframe

I have a pandas DataFrame like this:
a b
2011-01-01 00:00:00 1.883381 -0.416629
2011-01-01 01:00:00 0.149948 -1.782170
2011-01-01 02:00:00 -0.407604 0.314168
2011-01-01 03:00:00 1.452354 NaN
2011-01-01 04:00:00 -1.224869 -0.947457
2011-01-01 05:00:00 0.498326 0.070416
2011-01-01 06:00:00 0.401665 NaN
2011-01-01 07:00:00 -0.019766 0.533641
2011-01-01 08:00:00 -1.101303 -1.408561
2011-01-01 09:00:00 1.671795 -0.764629
Is there an efficient way to find the "integer" index of rows with NaNs? In this case the desired output should be [3, 6].
Here is a simpler solution:
inds = pd.isnull(df).any(1).nonzero()[0]
In [9]: df
Out[9]:
0 1
0 0.450319 0.062595
1 -0.673058 0.156073
2 -0.871179 -0.118575
3 0.594188 NaN
4 -1.017903 -0.484744
5 0.860375 0.239265
6 -0.640070 NaN
7 -0.535802 1.632932
8 0.876523 -0.153634
9 -0.686914 0.131185
In [10]: pd.isnull(df).any(1).nonzero()[0]
Out[10]: array([3, 6])
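Note that Series.nonzero() is no longer available in recent pandas releases; a hedged equivalent based on numpy (not part of the original answer) would be:
import numpy as np
# Integer positions of rows that contain at least one NaN.
inds = np.flatnonzero(df.isna().any(axis=1))
# array([3, 6])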
For DataFrame df:
import numpy as np
index = df['b'].index[df['b'].apply(np.isnan)]
will give you back the index labels that you can use to index back into df, e.g.:
df['a'].ix[index[0]]
>>> 1.452354
For the integer index:
df_index = df.index.values.tolist()
[df_index.index(i) for i in index]
>>> [3, 6]
A one-line solution; however, it works for one column only:
df.loc[pd.isna(df["b"]), :].index
And just in case you want to find the coordinates of NaN for all the columns instead (supposing they are all numerical), here you go:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
df
0 1 2 3 4 5
0 0 1 3 4.0 NaN 2
1 3 5 6 NaN 3.0 3
np.where(np.asanyarray(np.isnan(df)))
(array([0, 1]), array([4, 3]))
Don't know if this is too late, but you can use np.where to find the indices of NaN values as such:
indices = list(np.where(df['b'].isna())[0])
In case you have a datetime index and you want the index values:
df.loc[pd.isnull(df).any(1), :].index.values
Here are tests for a few methods:
%timeit np.where(np.isnan(df['b']))[0]
%timeit pd.isnull(df['b']).nonzero()[0]
%timeit np.where(df['b'].isna())[0]
%timeit df.loc[pd.isna(df['b']), :].index
And their corresponding timings:
333 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
280 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
313 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.84 ms ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It would appear that pd.isnull(df['b']).nonzero()[0] wins the day in terms of timing, but any of the top three methods have comparable performance.
Another simple solution is list(np.where(df['b'].isnull())[0])
This will give you the index labels of every row that has a NaN in any column:
df.loc[pd.isna(df).any(1), :].index
Here is another simpler take:
df = pd.DataFrame([[0,1,3,4,np.nan,2],[3,5,6,np.nan,3,3]])
inds = np.asarray(df.isnull()).nonzero()
(array([0, 1], dtype=int64), array([4, 3], dtype=int64))
I was looking for all indexes of rows with NaN values.
My working solution:
def get_nan_indexes(data_frame):
    indexes = []
    for column in data_frame:
        index = data_frame[column].index[data_frame[column].apply(np.isnan)]
        if len(index):
            indexes.append(index[0])
    df_index = data_frame.index.values.tolist()
    return [df_index.index(i) for i in set(indexes)]
Let the dataframe be named df and the column of interest (i.e. the column in which we are trying to find nulls) be 'b'. Then the following snippet prints the desired integer index of each null in the dataframe:
for i in range(df.shape[0]):
    if df['b'].isnull().iloc[i]:
        print(i)
index_nan = []
for index, bool_v in df["b"].isna().items():
    if bool_v:
        index_nan.append(index)
print(index_nan)
