Grouping rows into lists with pandas groupby
I have found this question and need to go a step further.
The output required by that question was:
A [1,2]
B [5,5,4]
C [6]
What I'm trying to achieve is:
A B C
1 5 6
2 5
4
I have tried using:
grouped = dataSet.groupby('Column1')
df = grouped.aggregate(lambda x: list(x))
The output I'm stuck with is:
df.T
Column1 A B C
[1,2] [5,5,4] [6]
I think there is no need to use columns of lists here.
You can achieve your result with a simple dictionary comprehension over the groups generated by groupby:
out = pd.concat({key: group['b'].reset_index(drop=True)
                 for key, group in df.groupby('a')}, axis=1)
which gives the desired output:
out
Out[59]:
A B C
0 1.0 5 6.0
1 2.0 5 NaN
2 NaN 4 NaN
I believe you need to create a DataFrame with the constructor:
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
s = df.groupby('a')['b'].apply(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).T
print (df)
a A B C
0 1.0 5.0 6.0
1 2.0 5.0 NaN
2 NaN 4.0 NaN
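Another way to get the same shape (my own alternative, not part of either answer above) is to number the rows within each group with cumcount and pivot on that counter:
out = (df.assign(idx=df.groupby('a').cumcount())
         .pivot(index='idx', columns='a', values='b'))
# shorter groups are padded with NaN, as in the outputs above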
Related
I have a table:
A
B
C
x
1
NA
y
NA
4
z
2
NA
p
NA
5
t
6
7
I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):
A   B   C   D
x   1   NA  1
y   NA  4   4
z   2   NA  2
p   NA  5   5
t   6   7   error
In case both columns contain a value, it should return the text 'error' inside the cell.
You could first compute a mask of the rows where both values are present, then fill the NA values of, say, column B with the values from column C. Using the mask from the first step, simply assign NA where needed.
error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df
A B C D
0 x 1 <NA> 1
1 y <NA> 4 4
2 z 2 <NA> 2
3 p 3 5 <NA>
OR
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'
I would suggest against assigning the string 'error' where both values are present, since that would make the whole D column an object dtype.
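As a quick illustration of that point (a minimal sketch of my own, reusing the df and error_mask defined above; not part of the original answer):
d_numeric = df['B'].fillna(df['C'])             # stays numeric
d_object = d_numeric.mask(error_mask, 'error')  # mixing in a string forces object dtype
print(d_numeric.dtype, d_object.dtype)          # e.g. Int64 object (or float64 object)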
There are several ways to achieve this.
Using fillna and mask:
df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna()&df['C'].notna(), 'error')
Or numpy.select:
m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1&m2, m1], ['error', df['B']], df['C'])
Output:
A B C D
0 x 1.0 NaN 1.0
1 y NaN 4.0 4.0
2 z 2.0 NaN 2.0
3 p NaN 5.0 5.0
4 t 6.0 7.0 error
Adding to the previous answer, you can address this with a series of .apply() methods paired with lambda functions.
Consider the dataframe that you presented, with np.nan as the NA values:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [1, np.nan, 2, np.nan, 6],
    'C': [np.nan, 4, np.nan, 5, 7]})
First, generate a list of the elements in each row:
df['D'] = df.apply(lambda x: list(x), axis=1)
This will net you a pd.Series whose elements are lists of values, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements by using the fact that np.nan != np.nan in numpy (see also an answer here: How can I remove Nan from list Python/NumPy).
df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
Finally, create the error by filtering based on length.
df['F'] = df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')
The resulting dataframe looks like this:
B C D E F
0 1.0 NaN [1.0, nan] [1.0] 1.0
1 NaN 4.0 [nan, 4.0] [4.0] 4.0
2 2.0 NaN [2.0, nan] [2.0] 2.0
3 NaN 5.0 [nan, 5.0] [5.0] 5.0
4 6.0 7.0 [6.0, 7.0] [6.0, 7.0] error
Of course, you could chain all of this together into a not-so-Pythonic, single-line answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')
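If the goal is to store this as the D column (my assumption about the intended use), you can assign the chained result directly:
df['D'] = a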
Have a look at the function combine_first:
df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')
Output:
0 1.0
1 4.0
2 2.0
3 5.0
4 error
Name: C, dtype: object
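To put this back into the frame as the new column (my own follow-up, assuming that is the goal):
df['D'] = df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')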
I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want to make a dataframe by multiplying the values of df1 and df2 where their indices (or the first column, hash) match.
Since df2 only has one column (v), all of df1's columns except the first one (the index) should be affected.
Is there a neat, Pythonic, pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar things don't seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
a b c
hash
ABC 1.0 2.0 3.0
Xyz 9.0 6.0 -3.0
def 25.0 15.0 20.0
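If you also want to restore df1's original row order and get hash back as a regular column (my own follow-up, not part of the original answer):
res = res.reindex(df1.index).reset_index()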
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df
df = df1.merge(df2, 'left').fillna(1)
# We'll make the v column integers again since it's been filled.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
         .fillna(df1)
         .astype(int)
         .reset_index())
# Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>>df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
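For example (my addition):
df = df.drop(columns='v')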
I am trying to get a list of column names from a DataFrame corresponding to columns that aren't of type float. Right now I have
categorical = (df.dtypes.values != np.dtype('float64'))
which gives me a boolean array indicating which columns are not of type float, but this is not exactly what I'm looking for. Specifically, I would like a list of the column names that correspond to the True values in my boolean array.
Use boolean indexing with df.columns:
categorical = df.columns[(df.dtypes.values != np.dtype('float64'))]
Or get difference of columns selected by select_dtypes:
categorical = df.columns.difference(df.select_dtypes('float64').columns)
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7., 8, 9, 4, 2, 3],
                   'D': [1, 3, 5., 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7.0 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 7.0 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
categorical = df.columns.difference(df.select_dtypes('float64').columns)
print (categorical)
Index(['A', 'B', 'E', 'F'], dtype='object')
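If you want a plain Python list rather than an Index object (a small variation I'm adding, not part of the original answer):
categorical = df.select_dtypes(exclude='float64').columns.tolist()
print(categorical)   # ['A', 'B', 'E', 'F']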
In a pandas dataframe, matrix, I would like to find the rows (indices) containing NaN.
For finding NaN in columns I would do:
idx_nan = matrix.columns[np.isnan(matrix).any(axis=1)]
but it doesn't work with matrix.rows
What is the equivalent for finding items in rows?
I think you need DataFrame.isnull with any and boolean indexing:
print (df[df.isnull().any(axis=1)].index)
Sample:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [np.nan, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print (df)
A B C D E F
0 1 4 NaN 1 5 7
1 2 5 8.0 3 3 4
2 3 6 9.0 5 6 3
print (df[df.isnull().any(axis=1)].index)
Int64Index([0], dtype='int64')
Other solutions:
idx_nan = df[np.isnan(df).any(axis=1)].index
print (idx_nan)
Int64Index([0], dtype='int64')
idx_nan = df.index[np.isnan(df).any(axis=1)]
print (idx_nan)
Int64Index([0], dtype='int64')
I have the following data frame:
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
Which looks like this:
In [17]: df
Out[17]:
probe gene cellA.1 cellA.2 cellB.1 cellB.2
0 a foo 5 12 15 5
1 b bar 0 90 3 7
2 c qux 1 13 11 11
3 d woz 0 0 2 1
Note that the values are contained in columns that share the same substring (e.g. cellA and cellB). In the real case there can be more than these two cell IDs, and the numerical index can also go higher (e.g. CellFoo.5).
What I want to do is to get the average, so that it looks like this:
probe gene cellA cellB
a foo 8.5 10
b bar 45 5
c qux 7 11
d woz 0 1.5
How can I achieve that with Pandas?
One way would be to make a function which takes a column name and turns it into the group you want to put it in:
>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
cellA cellB
probe gene
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
probe gene cellA cellB
0 a foo 8.5 10.0
1 b bar 45.0 5.0
2 c qux 7.0 11.0
3 d woz 0.0 1.5
Note that we set the index (and reset it afterwards) so we didn't have to special-case the groups we didn't want to touch; also note we had to specify axis=1 because we want to group columnwise, not rowwise.
You can use groupby():
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]
df2 = df.loc[:, mask]
pd.concat([df1, df2.groupby(lambda name: name.split(".")[0], axis=1).mean()], axis=1)
You could use a list comprehension.
In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out [4]:
probe gene cellA cellB
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5