Concatenate multiple pandas series efficiently - python

I understand that I can use combine_first to merge two series:
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([1,2,3,4,5],index=['f','g','h','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['k','l','m','n','o'])
Combine1 = series1.combine_first(series2)
print(Combine1)
Output:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
f 1.0
g 2.0
h 3.0
i 4.0
j 5.0
dtype: float64
What if I need to merge 3 or more series?
I understand that using the following code: print(series1 + series2 + series3) yields:
a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
...
dtype: float64
Can I merge multiple series efficiently without using combine_first multiple times?
Thanks

Combine Series with Non-Overlapping Indexes
To combine series vertically, use pd.concat.
# Setup
import pandas as pd

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno'))
]

pd.concat(series_list)
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5
dtype: int64
Combine with Overlapping Indexes
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf'))
]
If the Series have overlapping indices, you can either add up the values for each index label,
pd.concat(series_list, axis=1, sort=False).sum(axis=1)
a 2.0
b 6.0
c 6.0
d 12.0
e 10.0
k 1.0
m 3.0
f 5.0
dtype: float64
Alternatively, just drop duplicate values on the index if you only want to keep the first/last value where there are duplicates.
res = pd.concat(series_list, axis=0)
# keep first value
res[~res.index.duplicated(keep='first')]
# keep last value
res[~res.index.duplicated(keep='last')]
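For instance, keeping the first value per index label on the overlapping series_list above should give roughly:
res[~res.index.duplicated(keep='first')]
a 1
b 2
c 3
d 4
e 5
k 1
m 3
f 5
dtype: int64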

Assuming you were relying on combine_first's intended behaviour of prioritising the series' values in order, you can succinctly chain the calls with functools.reduce and a lambda expression.
from functools import reduce
l_series = [series1, series2, series3]
reduce(lambda s1, s2: s1.combine_first(s2), l_series)
Of course if the indices are unique as in your current example, you can simply use pd.concat instead.
Demo
series1 = pd.Series(list(range(5)),index=['a','b','c','d','e'])
series2 = pd.Series(list(range(5, 10)),index=['a','g','h','i','j'])
series3 = pd.Series(list(range(10, 15)),index=['k','b','m','c','o'])
from functools import reduce
l_series = [series1, series2, series3]
print(reduce(lambda s1, s2: s1.combine_first(s2), l_series))
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# g 6.0
# h 7.0
# i 8.0
# j 9.0
# k 10.0
# m 12.0
# o 14.0
# dtype: float64
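For contrast, a quick sketch of what plain pd.concat would do with these overlapping demo series (duplicate index labels are simply kept, not resolved):
print(pd.concat(l_series))
# a 0
# b 1
# c 2
# d 3
# e 4
# a 5
# g 6
# h 7
# i 8
# j 9
# k 10
# b 11
# m 12
# c 13
# o 14
# dtype: int64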

I agree with what @codespeed has pointed out in his answer.
It will depend on your needs. If the series indexes are guaranteed not to overlap, concat is the better option (as in the original question, where there is no index overlap, concat is the way to go).
If the indexes do overlap, you need to decide how to handle the overlap, i.e. which value gets overwritten (as codespeed's example shows, if the same index label maps to different values, be careful with combine_first).
For example (note that series3 is the same as series1, and series4 is the same as series2):
import pandas as pd
import numpy as np
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series4 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
print(series1.combine_first(series2))
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64
print(series4.combine_first(series3))
a 2.0
b 3.0
c 4.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64

You would use combine_first if you want one series's values prioritized over the other. It's usually used to fill the missing values in the first series. I am not sure what the expected output is in your example, but it looks like you can use concat:
pd.concat([series1, series2, series3])
You get
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5
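For completeness, a minimal sketch of the fill-missing behaviour mentioned above; the two series here are made up for illustration and are not from the question:
import pandas as pd
import numpy as np

s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# values from s1 take priority; only the NaN at 'b' is filled from s2
print(s1.combine_first(s2))
# a 1.0
# b 20.0
# c 3.0
# dtype: float64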

Related

Pandas: combine two columns if one is empty

I have a table:
A   B    C
x   1    NA
y   NA   4
z   2    NA
p   NA   5
t   6    7
I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):
A   B    C    D
x   1    NA   1
y   NA   4    4
z   2    NA   2
p   NA   5    5
t   6    7    error
In case both columns contain a value, it should return the text 'error' inside the cell.
You could first compute a mask of the rows where both values are present, then fill the NA values of, say, column B with the values from column C. Finally, using the mask from the first step, assign NA where both values were present.
error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df
   A     B     C     D
0  x     1  <NA>     1
1  y  <NA>     4     4
2  z     2  <NA>     2
3  p  <NA>     5     5
4  t     6     7  <NA>
OR
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'
I would advise against assigning the string 'error' where both values are present, since that would make the whole D column an object dtype.
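A rough sketch of that dtype effect, with made-up values:
import pandas as pd

d = pd.Series([1.0, 4.0, 2.0, 5.0, 7.0])
both_present = pd.Series([False, False, False, False, True])

print(d.dtype)                     # float64
d_err = d.mask(both_present, 'error')
print(d_err.dtype)                 # object; the column is no longer numeric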
There are several ways to achieve this.
Using fillna and mask
df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna()&df['C'].notna(), 'error')
Or numpy.select (the conditions are checked in order, so the case where both values are present must come first):
m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1&m2, m1], ['error', df['B']], df['C'])
Output:
A B C D
0 x 1.0 NaN 1.0
1 y NaN 4.0 4.0
2 z 2.0 NaN 2.0
3 p NaN 5.0 5.0
4 t 6.0 7.0 error
Adding to the previous answer, you can address this with a series of .apply() methods paired with lambda functions.
Consider the dataframe that you presented, with np.nan as the NA values:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [1, np.nan, 2, np.nan, 6],
    'C': [np.nan, 4, np.nan, 5, 7]})
First generate a list of the elements from the series in question:
df['D'] = df.apply(lambda x: list(x), axis=1)
This will net you a pd.Series with a list of values as elements, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements by exploiting the fact that np.nan != np.nan (see also this answer: How can I remove Nan from list Python/NumPy).
df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
Finally, create the error by filtering based on length.
df['F'] = df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')
The resulting dataframe works like this:
B C D E F
0 1.0 NaN [1.0, nan] [1.0] 1.0
1 NaN 4.0 [nan, 4.0] [4.0] 4.0
2 2.0 NaN [2.0, nan] [2.0] 2.0
3 NaN 5.0 [nan, 5.0] [5.0] 5.0
4 6.0 7.0 [6.0, 7.0] [6.0, 7.0] error
Of course you could chain all this together in a not-so-pythonic, yet single-line answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')
Have a look at the function combine_first:
df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')
Output:
0 1.0
1 4.0
2 2.0
3 5.0
4 error
Name: C, dtype: object

Keep only the 1st non-null value in each row (and replace others with NaN)

I have the following dataframe:
a b
0 3.0 10.0
1 2.0 9.0
2 NaN 8.0
For each row, I need to drop (and replace with NaN) all values, excluding the first non-null one.
This is the expected output:
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
I know that using the justify function I can identify the first n non-null values, but I need to keep the same structure of the original dataframe.
One way to go would be:
import pandas as pd
data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}
df = pd.DataFrame(data)
def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)

df = df.apply(lambda x: keep_first_valid(x), axis=1)
df
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
So, the first x passed to the function would consist of pd.Series([3.0, 10.0],index=['a','b']).
Inside the function first_valid = x.first_valid_index() will store 'a'; see df.first_valid_index.
Finally, we apply s.mask to get pd.Series([3.0, None],index=['a','b']), which we assign back to the df.
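To make that per-row step concrete, here is roughly what happens for the first row (a small sketch, not part of the original answer):
import pandas as pd

row = pd.Series([3.0, 10.0], index=['a', 'b'])
first_valid = row.first_valid_index()   # 'a'
print(row.mask(row.index != first_valid))
# a 3.0
# b NaN
# dtype: float64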
try this:
f = df.copy()
f[:] = f.columns  # every cell now holds its own column label
# first valid (non-null) column label of each row, as a column vector
fv_idx = df.apply(pd.Series.first_valid_index, axis=1).values[:, None]
# keep a cell only where its column label matches the row's first valid label
res = df.where(f == fv_idx)
print(res)
>>>
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0

Group rows in list and transpose pandas

grouping rows in list in pandas groupby
I have found the question above and need to go a step further. The output required by that question was:
A [1,2]
B [5,5,4]
C [6]
What I'm trying to achieve is:
A B C
1 5 6
2 5
4
I have tried using
grouped=dataSet.groupby('Column1')
df = grouped.aggregate(lambda x: list(x))
The output I'm stuck with is:
df.T
Column1 A B C
[1,2] [5,5,4] [6]
I think there is no need here to use columns of lists.
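(For reference, this assumes a df built from the question's data; the answer further down spells out the same constructor.)
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})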
You can achieve your result using a simple dictionary comprehension over the groups generated by groupby:
out = pd.concat({key: group['b'].reset_index(drop=True)
                 for key, group in df.groupby('a')}, axis=1)
which gives the desired output:
out
Out[59]:
A B C
0 1.0 5 6.0
1 2.0 5 NaN
2 NaN 4 NaN
I believe you need to create a DataFrame with the constructor:
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'], 'b': [1, 2, 5, 5, 4, 6]})
s = df.groupby('a')['b'].apply(list)                      # one list of values per group
df = pd.DataFrame(s.values.tolist(), index=s.index).T    # expand lists to columns, then transpose
print (df)
a A B C
0 1.0 5.0 6.0
1 2.0 5.0 NaN
2 NaN 4.0 NaN

Getting Pandas.groupby.shift() results with groupbyvars as cols / index?

Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
'two': ['c', 'c', 'c', 'c', 'd', 'd'],
'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
output:
one two
a c 3
b c 4
d 6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
output:
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!
This is the difference between max and shift: max aggregates values (it returns a reduced Series), while shift does not (it returns a Series of the same size as the original).
So you can append the output as a new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
Theoretically it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need the max together with the shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
one three two max shifted
0 a 1 c 3 NaN
1 a 2 c 3 1.0
2 a 3 c 3 2.0
3 b 4 c 4 NaN
4 b 5 d 6 NaN
5 b 6 d 6 5.0
As Jez explained, shift returns a Series that keeps the same length as the dataframe; if you try to aggregate with it the way max() does, you will get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
three shifted
one two
a c 1 NaN
c 2 1.0
c 3 2.0
b c 4 NaN
d 5 NaN
d 6 5.0
Using the max as a key, you can also slice the shifted values at the rows holding each group's maximum:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one two
a c 2 2.0
b c 3 NaN
d 5 5.0
Name: three, dtype: float64

Stop Pandas from rotating results from groupby-apply when there is one group

I have some code that first selects data based on a certain criteria then it does a groupby-apply on a Pandas dataframe. Occasionally, the data only has 1 group that matches the criteria. In this case, Pandas will return a row vector rather than a column vector. Example below:
In [50]: x = pd.DataFrame([(round(i/2, 0), i, i) for i in range(0, 10)],
    ...:                  columns=['a', 'b', 'c'])
In [51]: x
Out[51]:
a b c
0 0.0 0 0
1 0.0 1 1
2 1.0 2 2
3 2.0 3 3
4 2.0 4 4
5 2.0 5 5
6 3.0 6 6
7 4.0 7 7
8 4.0 8 8
9 4.0 9 9
In [52]: y = x.loc[x.a == 0.0].groupby('a').apply(lambda x: x.b / x.c)
In [53]: y
Out[53]:
0 1
a
0.0 NaN 1.0
y in the above example is a row vector of type pandas.DataFrame. If the .loc selection contains two or more groups, it produces a column vector instead:
In [54]: y = x.loc[x.a <= 1.0].groupby('a').apply(lambda x: x.b / x.c)
In [55]: y
Out[55]:
a
0.0 0 NaN
1 1.0
1.0 2 1.0
dtype: float64
Any idea how I can make the two behaviour consistent? Ultimately, the column vector is what I want.
Thanks
There's no way to do this in one step, unfortunately. You can, however, do this in two steps, by querying ngroups and reshaping your result accordingly.
g = x.loc[...].groupby('a')
y = g.apply(lambda x: x.b / x.c)
if g.ngroups == 1:
    # only one group: the result came back as a row vector, so transpose it
    y = y.T
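As a quick sanity check of ngroups on the sample frame (a sketch):
g1 = x.loc[x.a == 0.0].groupby('a')
g2 = x.loc[x.a <= 1.0].groupby('a')
print(g1.ngroups, g2.ngroups)  # 1 2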
