I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame containing the first digit of the values in column 'First':
a) change the numbers from column 'First' to strings
b) extract the first character from each newly created string
c) save the results from b) as a new column in the data frame
I don't know how to apply this to a pandas data frame object. I would be grateful for help with that.
Cast the dtype of the column to str and you can then perform vectorised slicing by calling .str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
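For example, a minimal round trip on the frame above, so that new_col ends up numeric again:
df['new_col'] = df['First'].astype(str).str[0].astype(int)
df['new_col'].dtype
dtype('int64')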
.str.get
This is the simplest of the string methods for this task.
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read: object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
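For instance, a quick sketch of that NaN passthrough:
import numpy as np
s = pd.Series(['xyz', np.nan, 'foobar'])
s.str[0]
0 x
1 NaN
2 f
dtype: object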
For numeric columns, an .astype(str) conversion is required beforehand, as shown in @EdChum's answer above.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are roughly 3-4x faster here.
I am trying to use the values of the current dataframe as the index and the dataframe's index as the labels. For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop to check and apply the necessary labels/values, but with millions of labels to mark this process becomes extremely time-consuming. Is there a smarter and quicker way to do this? Thanks in advance.
Use stack with the DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
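In other words, the outer level of the stacked index supplies the labels and the stacked values supply the new index:
s.index.get_level_values(0).values
array([0, 0, 1, 1, 2])
s.values.astype(int)
array([0, 1, 2, 4, 3])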
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
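To see what is being flattened, the two intermediates on the example data look like this:
df.index.repeat(2)
Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
df.values.ravel()
array([ 0., 1., 2., 4., nan, 3.])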
A subsequent conversion gives an integer index (assuming the filtered result from above is assigned back to s):
s.index = s.index.astype(int)
A similar (and slightly faster, depending on your data) solution, which also results in an integer index, is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I want to add two dataframes, which I can achieve with the add function.
Now I want to divide each value of the resultant dataframe by the number of initial dataframes (here df1 and df2) in which the respective column was present. For example:
df1 = pd.DataFrame([[1,2],[3,4]], index =['A','B'], columns = ['C','D'])
df2 = pd.DataFrame([[11,12], [13,14]], index = ['A','B'], columns = ['D','E'])
df3 = df1.add(df2, fill_value=0)
This would result in a df like
C D E
A 1.0 13 12.0
B 3.0 17 14.0
I require a df like:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Because column D is found in both dataframes, those values are divided by 2.
Can anyone please provide a generic solution, assuming I need to add more than 2 dataframes (so the division factor also changes) and have more than 100 columns in each dataframe.
We can concatenate all DFs horizontally in one step:
In [13]: df = pd.concat([df1,df2], axis=1).fillna(0)
this yields:
In [15]: df
Out[15]:
C D D E
A 1 2 11 12
B 3 4 13 14
now we can group by columns, calculating average (mean):
In [14]: df.groupby(df.columns, axis=1).mean()
Out[14]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
or we can do it in one step (thanks @jezrael):
In [60]: pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean()
Out[60]:
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Timing:
In [38]: df1 = pd.concat([df1] * 10**5, ignore_index=True)
In [39]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [40]: %%timeit
...: df = pd.concat([df1,df2], axis=1).fillna(0)
...: df.groupby(df.columns, axis=1).mean()
...:
63.4 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %%timeit
...: s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
...: df1.add(df2, fill_value=0).div(s)
...:
28.7 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [42]: %%timeit
...: pd.concat([df1,df2]).mean(level = 0)
...:
65.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: df1.shape
Out[43]: (200000, 2)
In [44]: df2.shape
Out[44]: (200000, 2)
Current winner: @jezrael (28.7 ms ± 712 µs) - congratulations!
It looks like you are trying to compute a mean. Avoid doing lots of separate operations with dataframe methods and individual columns if you can help it, as that's slow.
df = pd.concat([df1,df2]) # concatenate all your dataframes together
df.mean(level = 0)
The second line computes the mean along the vertical axis (axis = 0 by default), and level = 0 tells pandas to get the mean of each unique index.
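On newer pandas versions, where mean(level=0) has been removed, the same computation is spelled with a groupby on the index level:
df = pd.concat([df1, df2])   # stack the frames vertically
df.groupby(level=0).mean()   # mean of each unique index label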
A faster solution is to divide by the number of times each column occurs:
s = pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()
print (s)
C 1
D 2
E 1
dtype: int64
df3 = df1.add(df2, fill_value=0).div(s)
print (df3)
C D E
A 1.0 6.5 12.0
B 3.0 8.5 14.0
Timings (with 100 columns like OP mentioned):
np.random.seed(123)
N = 100000
df1 = pd.DataFrame(np.random.randint(10, size=(N, 100)))
df1.columns = 'col' + df1.columns.astype(str)
df2 = df1.mul(10)
#MaxU solution
In [127]: %timeit (pd.concat([df1,df2], axis=1).fillna(0).groupby(level=0, axis=1).mean())
1 loop, best of 3: 952 ms per loop
#Ken Wei solution
In [128]: %timeit (pd.concat([df1,df2]).mean(level = 0))
1 loop, best of 3: 895 ms per loop
#jez solution
In [129]: %timeit (df1.add(df2, fill_value=0).div(pd.Series(np.concatenate([df1.columns, df2.columns])).value_counts()))
10 loops, best of 3: 161 ms per loop
More general solution:
If you have a list of DataFrames, it is possible to chain them like:
df = df1.add(df2, fill_value=0).add(df3, fill_value=0)
but it is better to use reduce:
from functools import reduce
dfs = [df1,df2, df3]
s = pd.Series(np.concatenate([x.columns for x in dfs])).value_counts()
df5 = reduce(lambda x, y: x.add(y, fill_value=0), dfs).div(s)
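A quick end-to-end sketch on the question's two frames plus a made-up third one (df3 below is hypothetical, purely for illustration):
import numpy as np
import pandas as pd
from functools import reduce
df1 = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['C', 'D'])
df2 = pd.DataFrame([[11, 12], [13, 14]], index=['A', 'B'], columns=['D', 'E'])
df3 = pd.DataFrame([[5, 6], [7, 8]], index=['A', 'B'], columns=['D', 'F'])  # hypothetical third frame
dfs = [df1, df2, df3]
# count how many frames each column appears in
s = pd.Series(np.concatenate([x.columns for x in dfs])).value_counts()
# element-wise sum across all frames, then divide by the per-column counts
df5 = reduce(lambda x, y: x.add(y, fill_value=0), dfs).div(s)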
I am facing a problem here: in my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code :
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
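Assuming the final.sort() call in the question relied on the old default of sorting by the index, a minimal fix for that snippet is:
final = final.sort_index()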
Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
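For example:
df.sort_values(by='A').reset_index(drop=True)
A B
0 a 7
1 a 5
2 b 2
3 c 9
4 c 3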
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
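To sort the two columns in opposite directions, pass a matching list to ascending:
df.sort_values(by=['A', 'B'], ascending=[True, False])
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3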
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
I have a dataframe which contains info about movies. It has a column called genre, which holds a list of genres each movie belongs to. For example:
df['genre']
## returns
0 ['comedy', 'sci-fi']
1 ['action', 'romance', 'comedy']
2 ['documentary']
3 ['crime','horror']
...
I want to know how I can query the dataframe so that it returns the movies belonging to a certain genre.
For example, something like df['genre'].contains('comedy') that returns rows 0 and 1.
I know for a list, I can do things like:
'comedy' in ['comedy', 'sci-fi']
However, in pandas I didn't find anything similar; the only thing I know of is df['genre'].str.contains(), but it didn't work for the list type.
You can use apply to create a mask and then use boolean indexing:
mask = df.genre.apply(lambda x: 'comedy' in x)
df1 = df[mask]
print (df1)
genre
0 [comedy, sci-fi]
1 [action, romance, comedy]
using sets
df.genre.map(set(['comedy']).issubset)
0 True
1 True
2 False
3 False
dtype: bool
df.genre[df.genre.map(set(['comedy']).issubset)]
0 [comedy, sci-fi]
1 [action, romance, comedy]
dtype: object
presented in a way I like better
comedy = set(['comedy'])
iscomedy = comedy.issubset
df[df.genre.map(iscomedy)]
more efficient
comedy = set(['comedy'])
iscomedy = comedy.issubset
df[[iscomedy(l) for l in df.genre.values.tolist()]]
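As an aside (not in the original answer), the same set pattern extends to requiring several genres at once; the mask keeps only rows containing every genre in the set:
both = set(['comedy', 'action'])
df[df.genre.map(both.issubset)]
genre
1 [action, romance, comedy]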
using str in two passes
slow! and not perfectly accurate! (after joining, 'comedy' would also match as a substring of a longer genre name)
df[df.genre.str.join(' ').str.contains('comedy')]
According to the source code, you can use .str.contains(..., regex=False): with regex=False, pandas simply tests pat in x for each element, and the in operator on a list checks for exact membership.
So setting regex=False makes .str.contains work for list values as you would expect:
In : df['genre'].str.contains('comedy', regex=False)
Out:
0 True
1 True
2 False
3 False
Name: genre, dtype: bool
A complete example:
import pandas as pd
data = pd.DataFrame([[['foo', 'bar']],
[['bar', 'baz']]], columns=['list_column'])
print(data)
list_column
0 [foo, bar]
1 [bar, baz]
filtered_data = data.loc[
lambda df: df.list_column.apply(
lambda l: 'foo' in l
)
]
print(filtered_data)
list_column
0 [foo, bar]
This can be done in all three ways suggested above, using str.contains, set, or apply with in, although using set is the most efficient.
Here's a performance comparison of the three methods on an extrapolated dataframe with 10,000 rows:
set
%%timeit -n 500 -r 35
df[df.genre.map(set(['comedy']).issubset)]
2.23 ms ± 154 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
apply & in
%%timeit -n 500 -r 35
df[df.genre.apply(lambda x: 'comedy' in x)]
2.36 ms ± 359 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
str.contains
%%timeit -n 500 -r 35
df[df['genre'].str.contains('comedy', regex=False)]
2.83 ms ± 299 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
This can be done using the isin method, which returns a boolean mask marking the rows whose values appear in the given list:
df1[df1.name.isin(['Rohit','Rahul'])]
Here df1 is a dataframe object and name is a string column:
>>> df1[df1.name.isin(['Rohit','Rahul'])]
sample1 name Marks Class
0 1 Rohit 34 10
1 2 Rahul 56 12
>>> type (df1)
<class 'pandas.core.frame.DataFrame'>
>>> df1.head()
sample1 name Marks Class
0 1 Rohit 34 10
1 2 Rahul 56 12
2 3 ankit 78 11
3 4 sajan 98 10
4 5 chintu 76 9
I have two columns with strings. I would like to combine them, ignoring NaN values, such that:
ColA, ColB, ColA+ColB
str   str   strstr
str   nan   str
nan   str   str
I tried df['ColA+ColB'] = df['ColA'] + df['ColB'], but that creates a NaN value if either column is NaN. I've also thought about using concat.
I suppose I could just go with that and then patch it up with something like df['ColA+ColB'][df['ColA'].isnull()] = df['ColB'], but that seems like quite the workaround.
Call fillna and pass an empty str as the fill value and then sum with param axis=1:
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN]})
df
Out[3]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [7]:
df['a+b'] = df.fillna('').sum(axis=1)
df
Out[7]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
You could fill the NaN with an empty string:
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
Using apply and str.cat, you can do:
In [723]: df
Out[723]:
a b
0 asd asdas
1 NaN asdas
2 asdsa NaN
In [724]: df['a+b'] = df.apply(lambda x: x.str.cat(sep=''), axis=1)
In [725]: df
Out[725]:
a b a+b
0 asd asdas asdasdas
1 NaN asdas asdas
2 asdsa NaN asdsa
In my case, I wanted to join more than 2 columns together with a separator (a+b+c)
In [3]:
df = pd.DataFrame({'a':['asd',np.NaN,'asdsa'], 'b':['asdas','asdas',np.NaN], 'c':['as',np.NaN ,'ds']})
In [4]: df
Out[4]:
a b c
0 asd asdas as
1 NaN asdas NaN
2 asdsa NaN ds
The following syntax worked for me:
In [5]: df['d'] = df[['a', 'b', 'c']].fillna('').agg('|'.join, axis=1)
In [6]: df
Out[6]:
a b c d
0 asd asdas as asd|asdas|as
1 NaN asdas NaN |asdas|
2 asdsa NaN ds asdsa||ds
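If the stray separators around missing values are unwanted, a slower per-row variant (a sketch using apply) drops the NaNs first:
df['d'] = df[['a', 'b', 'c']].apply(lambda r: '|'.join(r.dropna()), axis=1)
df
a b c d
0 asd asdas as asd|asdas|as
1 NaN asdas NaN asdas
2 asdsa NaN ds asdsa|ds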
Prefer adding the columns directly over using the apply method, because it's much faster.
Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use apply
%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using sum (which concatenate strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)