I have the following dataframe:
     a     b
0  3.0  10.0
1  2.0   9.0
2  NaN   8.0
For each row, I need to drop (replace with NaN) every value except the first non-null one.
This is the expected output:
     a    b
0  3.0  NaN
1  2.0  NaN
2  NaN  8.0
I know that with the justify function I can identify the first n non-null values, but I need to keep the structure of the original dataframe.
One way to go would be:
import pandas as pd
data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}
df = pd.DataFrame(data)
def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)

df = df.apply(keep_first_valid, axis=1)
df
     a    b
0  3.0  NaN
1  2.0  NaN
2  NaN  8.0
So, the first x passed to the function is pd.Series([3.0, 10.0], index=['a','b']).
Inside the function, first_valid = x.first_valid_index() stores 'a'; see Series.first_valid_index.
Finally, we apply x.mask to get pd.Series([3.0, None], index=['a','b']), and apply assembles the masked rows back into the dataframe.
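To see the masking step in isolation, here is a minimal sketch of what happens to the first row:
row = pd.Series([3.0, 10.0], index=['a', 'b'])
row.mask(row.index != row.first_valid_index())
# a    3.0
# b    NaN
# dtype: float64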
Try this:
# build a frame of the same shape as df whose cells hold their column label
f = df.copy()
f[:] = f.columns
# first valid (non-null) column label of each row, shaped as a column vector
fv_idx = df.apply(pd.Series.first_valid_index, axis=1).values[:, None]
# keep a cell only where its column label equals the row's first valid label
res = df.where(f == fv_idx)
print(res)
>>>
     a    b
0  3.0  NaN
1  2.0  NaN
2  NaN  8.0
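For larger frames, a fully vectorized sketch (not part of the answers above) avoids apply entirely: a cell holds the row's first non-null value exactly when the cumulative count of non-nulls at that position equals 1.
res = df.where(df.notna().cumsum(axis=1).eq(1))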
I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x    y    xy
1         1
2         2
     4    4
     8    8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # if the blanks are NaN you will need this line first
df['xy'] = df['x'] + df['y']
Note that your columns may currently be strings rather than numeric; in that case convert them first:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['xy'] = df.sum(axis=1)
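Assuming the blanks really are NaN and y should take precedence, a one-line alternative sketch uses combine_first, which fills the caller's missing values from the argument:
df['xy'] = df['y'].combine_first(df['x'])  # y wins where both are present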
More:
df['xy'] = df[['x','y']].astype(str).apply(''.join, axis=1)
#df[['x','y']].astype(str).apply(''.join, axis=1)
Out[655]:
0    1.0
1    2.0
2
3    4.0
4    8.0
dtype: object
You can also use NumPy:
import pandas as pd, numpy as np

df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
                   'y': [np.nan, np.nan, 4, 8]})
arr = df.values
# row-major flattening yields exactly one non-NaN value per row, in row order
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
     x    y  xy
0  1.0  NaN   1
1  2.0  NaN   2
2  NaN  4.0   4
3  NaN  8.0   8
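Note that this flat-mask trick assumes exactly one non-NaN value per row; otherwise the flattened array will not line up with the rows. A slightly more defensive NumPy sketch, letting y take precedence as the question asks:
# prefer y where it is present, otherwise fall back to x
df['xy'] = np.where(df['y'].notna(), df['y'], df['x']).astype(int)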
I am trying to get a list of column names from a DataFrame corresponding to columns that aren't of type float. Right now I have
categorical = (df.dtypes.values != np.dtype('float64'))
which gives me a boolean array of whether each column is non-float, but this is not exactly what I'm looking for. Specifically, I would like a list of the column names that correspond to the True values in my boolean array.
Use boolean indexing with df.columns:
categorical = df.columns[(df.dtypes.values != np.dtype('float64'))]
Or get difference of columns selected by select_dtypes:
categorical = df.columns.difference(df.select_dtypes('float64').columns)
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4,5,4,5,5,4],
                   'C': [7.,8,9,4,2,3],
                   'D': [1,3,5.,7,1,0],
                   'E': [5,3,6,9,2,4],
                   'F': list('aaabbb')})
print (df)
   A  B    C    D  E  F
0  a  4  7.0  1.0  5  a
1  b  5  8.0  3.0  3  a
2  c  4  9.0  5.0  6  a
3  d  5  4.0  7.0  9  b
4  e  5  2.0  1.0  2  b
5  f  4  3.0  0.0  4  b
categorical = df.columns.difference(df.select_dtypes('float64').columns)
print (categorical)
Index(['A', 'B', 'E', 'F'], dtype='object')
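If you only need the names, select_dtypes can also exclude dtypes directly; a minimal sketch:
categorical = df.select_dtypes(exclude='float64').columns.tolist()
# ['A', 'B', 'E', 'F']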
Grouping rows in a list in pandas groupby
I have found this question and need to go a step further.
The output required by that question was:
A [1,2]
B [5,5,4]
C [6]
What I'm trying to achieve is:
A  B  C
1  5  6
2  5
   4
I have tried using:
grouped=dataSet.groupby('Column1')
df = grouped.aggregate(lambda x: list(x))
The output I'm stuck with is:
df.T
Column1 A B C
[1,2] [5,5,4] [6]
I think there is no need to use columns of lists here.
You can achieve your result using a simple dictionary comprehension over the groups generated by groupby:
out = pd.concat({key: group['b'].reset_index(drop=True)
                 for key, group in df.groupby('a')}, axis=1)
which gives the desired output:
out
Out[59]:
     A  B    C
0  1.0  5  6.0
1  2.0  5  NaN
2  NaN  4  NaN
I believe you need to create the DataFrame with the constructor:
df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6]})
s = df.groupby('a')['b'].apply(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).T
print (df)
a    A    B    C
0  1.0  5.0  6.0
1  2.0  5.0  NaN
2  NaN  4.0  NaN
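A further option (a sketch, not taken from the answers above) avoids building lists at all: number the rows within each group with GroupBy.cumcount, then pivot that row number against the group key.
df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6]})
# idx numbers the rows within each group: A -> 0,1; B -> 0,1,2; C -> 0
out = df.assign(idx=df.groupby('a').cumcount()).pivot(index='idx', columns='a', values='b')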
I understand that I can use combine_first to merge two series:
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([1,2,3,4,5],index=['f','g','h','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['k','l','m','n','o'])
Combine1 = series1.combine_first(series2)
print(Combine1)
Output:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
f 1.0
g 2.0
h 3.0
i 4.0
j 5.0
dtype: float64
What if I need to merge 3 or more series?
I understand that using the following code: print(series1 + series2 + series3) yields:
a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
...
dtype: float64
Can I merge multiple series efficiently without using combine_first multiple times?
Thanks
Combine Series with Non-Overlapping Indexes
To combine series vertically, use pd.concat.
# Setup
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno'))
]
pd.concat(series_list)
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5
dtype: int64
Combine with Overlapping Indexes
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf'))
]
If the Series have overlapping indices, you can either combine (add) the values,
pd.concat(series_list, axis=1, sort=False).sum(axis=1)
a 2.0
b 6.0
c 6.0
d 12.0
e 10.0
k 1.0
m 3.0
f 5.0
dtype: float64
Alternatively, just drop duplicate values on the index if you want to keep only the first/last value where duplicates exist.
res = pd.concat(series_list, axis=0)
# keep first value
res[~res.index.duplicated(keep='first')]
# keep last value
res[~res.index.duplicated(keep='last')]
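Equivalently (a sketch; note the result comes back sorted by index label), a groupby on the index level picks the first or last value per label:
res = pd.concat(series_list)
res.groupby(level=0).first()  # first value per index label
res.groupby(level=0).last()   # last value per index label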
Presuming you want combine_first's documented behavior of prioritizing the series' values in order, you can succinctly chain multiple calls to it with reduce and a lambda expression.
from functools import reduce
l_series = [series1, series2, series3]
reduce(lambda s1, s2: s1.combine_first(s2), l_series)
Of course if the indices are unique as in your current example, you can simply use pd.concat instead.
Demo
series1 = pd.Series(list(range(5)),index=['a','b','c','d','e'])
series2 = pd.Series(list(range(5, 10)),index=['a','g','h','i','j'])
series3 = pd.Series(list(range(10, 15)),index=['k','b','m','c','o'])
from functools import reduce
l_series = [series1, series2, series3]
print(reduce(lambda s1, s2: s1.combine_first(s2), l_series))
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# g 6.0
# h 7.0
# i 8.0
# j 9.0
# k 10.0
# m 12.0
# o 14.0
# dtype: float64
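For exactly three series, the reduce call above is just shorthand for chaining directly:
series1.combine_first(series2).combine_first(series3)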
I agree with what @codespeed has pointed out in his answer.
I think it depends on the user's needs. If the series indexes are confirmed to have no overlap, concat is the better option (as in the original question, where there is no index overlap).
If the indexes do overlap, you need to consider how to handle the overlap, i.e. which value gets overwritten (as in codespeed's example: when matching indexes hold different values, be careful with combine_first).
For example (note that series3 is the same as series1, and series4 is the same as series2):
import pandas as pd
import numpy as np
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series4 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
print(series1.combine_first(series2))
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64
print(series4.combine_first(series3))
a 2.0
b 3.0
c 4.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64
You would use combine_first if you want one series's values prioritized over the other's; it's usually used to fill the missing values in the first series. I am not sure what the expected output in your example is, but it looks like you can use concat:
pd.concat([series1, series2, series3])
You get
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5