Pandas Series with column names for each value above a minimum - python

I try to get a new series from a DataFrame. This series should contain the column names of the DataFrame's values that are above some value for each row of the DataFrame. But beginning from the left of the DataFrame, like this:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 6)), columns=list('ABCDEF'))
>>> df
A B C D E F
0 2 4 6 8 8 4
1 2 0 9 7 7 1
2 1 7 7 7 3 0
3 5 4 4 0 1 7
4 9 6 1 5 1 5
min = 3
Expected Output:
0 B
1 C
2 B
3 A
4 A
dtype: object
Here the output's row 0 is "B" because in the DataFrame row index 0 column "B" is the most left column that has a value that is equal or bigger than min = 3.
I know that I an use df.idxmin(axis = 1) to get the column names of the minimum for each row but I have now clue at all how to tackle this more complex problem.
Thanks for help or hints!

UPDATE - index of the first element in each row, satisfying condition:
more elegant and more efficient version from #DSM:
In [156]: (df>=3).idxmax(1)
Out[156]:
0 B
1 C
2 B
3 A
4 A
dtype: object
my version:
In [149]: df[df>=3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0 B
1 C
2 B
3 A
4 A
dtype: object
Old answer - index of the minimum element for each row:
In [27]: df[df>=3].idxmin(1)
Out[27]:
0 E
1 A
2 C
3 C
4 F
dtype: object

Related

How to select values from a dataframe by a series of column names?

I have a dataframe df:
A B
0 1 4
1 2 5
2 3 6
And a series s:
0 A
1 B
2 A
Now I want to pick values from df with column names specified in s. The expected result is:
0 1 <- from column A
1 5 <- from column B
2 3 <- from column A
How can I get this done efficiently?
Use Index.get_indexer for indices by Series and select values by numpy indexing in 2d array:
a = df.to_numpy()
b = a[np.arange(len(df)), df.columns.get_indexer(s)]
print (b)
[1 5 3]
s1 = pd.Series(b, s.index)
print (s1)
0 1
1 5
2 3
dtype: int64

How to combine two Series with the same index in python?

I have two Series (df1 and df2) of equal length, which need to be combined into one DataFrame column as follows. Each index has only one value or no values but never two values, so there are no duplicates (e.g. if df1 has a value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1 = c1 df2 = c2
0 A 0
1 B 1
2 2 C
3 D 3
4 E 4
5 5 F
6 6
7 G 7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try so:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G

pandas - number of unique rows occurrences in dataframe

How can I count number of occurrences of each unique row in a DataFrame?
data = {'x1': ['A','B','A','A','B','A','A','A'], 'x2': [1,3,2,2,3,1,2,3]}
df = pd.DataFrame(data)
df
x1 x2
0 A 1
1 B 3
2 A 2
3 A 2
4 B 3
5 A 1
6 A 2
7 A 3
And I would like to obtain
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
IIUC you can pass param as_index=False as an arg to groupby:
In [100]:
df.groupby(['x1','x2'], as_index=False).count()
Out[100]:
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
You could also drop duplicated rows:
In [4]: df.shape[0]
Out[4]: 8
In [5]: df.drop_duplicates().shape[0]
Out[5]: 4
There are two ways you can find unique occurence in your dataframe.
1st: Using drop_duplicates
df.drop_duplicates().sort_values('x1',ignore_index=True)
2nd: Using groupby.nunique
df.groupby(['x1','x2'], as_index=False).nunique()
For finding the number of occurrences, the answer from #EdChum will work precisely.

How to return a dataframe value from row and column reference?

I know this is probably a basic question, but somehow I can't find the answer. I was wondering how it's possible to return a value from a dataframe if I know the row and column to look for? E.g. If I have a dataframe with columns 1-4 and rows A-D, how would I return the value for B4?
You can use ix for this:
In [236]:
df = pd.DataFrame(np.random.randn(4,4), index=list('ABCD'), columns=[1,2,3,4])
df
Out[236]:
1 2 3 4
A 1.682851 0.889752 -0.406603 -0.627984
B 0.948240 -1.959154 -0.866491 -1.212045
C -0.970505 0.510938 -0.261347 -1.575971
D -0.847320 -0.050969 -0.388632 -1.033542
In [237]:
df.ix['B',4]
Out[237]:
-1.2120448782618383
Use at, if rows are A-D and columns 1-4:
print (df.at['B', 4])
If rows are 1-4 and columns A-D:
print (df.at[4, 'B'])
Fast scalar value getting and setting.
Sample:
df = pd.DataFrame(np.arange(16).reshape(4,4),index=list('ABCD'), columns=[1,2,3,4])
print (df)
1 2 3 4
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
D 12 13 14 15
print (df.at['B', 4])
7
df = pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4], columns=list('ABCD'))
print (df)
A B C D
1 0 1 2 3
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
print (df.at[4, 'B'])
13

Getting all rows with NaN value

I have a table with a column that has some NaN values in it:
A B C D
2 3 2 Nan
3 4 5 5
2 3 1 Nan
I'd like to get all rows where D = NaN. How can I do this?
Creating a df for illustration (containing Nan)
In [86]: df =pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[np.nan, 4,5]})
In [87]: df
Out[87]:
a b c
0 1 3 NaN
1 2 4 4
2 3 5 5
Checking which indices have null for column c
In [88]: pd.isnull(df['c'])
Out[88]:
0 True
1 False
2 False
Name: c, dtype: bool
Checking which indices dont have null for column c
In [90]: pd.notnull(df['c'])
Out[90]:
0 False
1 True
2 True
Name: c, dtype: bool
Selecting rows of df where c is not null
In [91]: df[pd.notnull(df['c'])]
Out[91]:
a b c
1 2 4 4
2 3 5 5
Selecting rows of df where c is null
In [93]: df[pd.isnull(df['c'])]
Out[93]:
a b c
0 1 3 NaN
Selecting rows of column c of df where c is not null
In [94]: df['c'][pd.notnull(df['c'])]
Out[94]:
1 4
2 5
Name: c, dtype: float64
For a solution that doesn't involve pandas, you can do something like:
goodind=np.where(np.sum(np.isnan(y),axis=1)==0)[0] #indices of rows non containing nans
(or the negation if you want rows with nan) and use the indices to slice data.
I am not sure sum is the best way to combine booleans, but np.any and np.all don't seem to have a axis parameter, so this is the best way I found.

Categories