I need to select values from a column (A) if they are not NaN, otherwise from another column (B). How can I do this in Python?
The use case is that I'm inserting a new column C that will contain the result of an operation (func) on values from column A. Sometimes the values in A are NaN, and in these cases I want to calculate the value in C from B instead. An example of the result would be this:
| A   | B    | C          |
|-----|------|------------|
| bla | bla2 | func(bla)  |  # read from A
| nan | bla3 | func(bla3) |  # read from B
What you want is combine_first:
df['C'] = df['A'].combine_first(df['B']).apply(func)
No explicit iteration required here...
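For a concrete, runnable illustration (the sample data and the placeholder func below are just assumptions for the sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['bla', np.nan], 'B': ['bla2', 'bla3']})

def func(value):
    # stand-in for whatever operation you actually need
    return value.upper()

df['C'] = df['A'].combine_first(df['B']).apply(func)
print(df)
#      A     B     C
# 0  bla  bla2   BLA
# 1  NaN  bla3  BLA3
combine_first fills the NaN positions of A with the values from B, and apply then runs func on the combined Series.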
This code should solve your problem:
df['C'] = [row['B'] if pd.isna(row['A']) else row['A'] for _, row in df.iterrows()]
Basically you create a new list by iterating over all rows and picking the correct value from each, then you assign it to the df. Note that iterrows() yields (index, row) pairs, and that missing values have to be detected with pd.isna() rather than compared against the string 'nan'.
By applying a function to the DataFrame that checks for NaN before it returns, you can switch to another value in the same row. It's important to pass axis=1 if you're using apply on a DataFrame (row-wise, across all columns) rather than on a Series (one column).
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -2], 'B': [1, 2, 2]})

def fn(x):
    c = np.sqrt(x['A'])       # sqrt of a negative value yields NaN
    if np.isnan(c):
        c = np.sqrt(x['B'])   # fall back to column B
    return c

df['C'] = df.apply(fn, axis=1)
I'm looking for a more automated approach to subset this dataframe by rank and put the subsets in a list, because if there happen to be 150 ranks I can't write individual subsets by hand.
ID | GROUP | RANK
1 | A | 1
2 | B | 2
3 | C | 3
2 | A | 1
2 | E | 2
2 | G | 3
How can I subset the dataframe by Rank and then put every subset in a list? (Not using group by)
I know how to subset them individually, but I'm not sure how to do this when there are many more ranks.
Output:
ranks = [df1,df2,df3....and so on]
Just use groupby directly in a list comprehension
>>> [df for rank, df in df.groupby('RANK')]
This will generate a list of dataframes, each a sub-dataframe related to the corresponding rank.
You can also do a dict comprehension:
>>> dic = {rank: df for rank, df in df.groupby('RANK')}
such that you can access your df via dic[1] for rank == 1.
In more detail, pd.DataFrame.groupby is a method that returns a DataFrameGroupBy object. A DataFrameGroupBy object is an iterable, which means you can iterate over it with a for loop. This iterable yields tuples with two values, where the first is whatever you grouped by (in this case, an integer rank) and the second is the corresponding sub-dataframe.
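A small runnable demo of that iteration (the toy data below is just an assumption for illustration):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'GROUP': ['A', 'B', 'A'], 'RANK': [1, 2, 1]})

ranks = [g for rank, g in df.groupby('RANK')]      # list of sub-dataframes
dic = {rank: g for rank, g in df.groupby('RANK')}  # dict keyed by rank

print(dic[1])
#    ID GROUP  RANK
# 0   1     A     1
# 2   3     A     1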
Unlike DataFrames, Series do not have a set_index method - which raises the question: if we want to change the index, how do we do it?
Assigning directly to series.index works, but I seem to remember that wasn't recommended?
I could convert to a dataframe, use the set_index method, then convert back to series, but it seems convoluted.
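To illustrate, the roundtrip I mean would look something like this (just a sketch of the convoluted route, not what I want to keep):
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
# Series -> DataFrame -> set_index -> back to Series
s = s.to_frame(name='value').set_index(pd.Index([4, 5, 6]))['value']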
EDIT: I do not want to reindex, but to set a new index. As explained in the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
By default values in the new index that do not have corresponding
records in the dataframe are assigned NaN.
That is NOT what I need. What I need is:
+---------------+-----------+
| Current index | New Index |
+---------------+-----------+
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |
+---------------+-----------+
A toy example:
import pandas as pd
s = pd.Series(data =['a','b','c'])
#this works
s.index = [4,5,6]
# this doesn't work
s = s.set_index([4,5,6])
If you just want to replace the index, assigning the new list directly to s.index works fine.
import pandas as pd

s = pd.Series(data=['a', 'b', 'c'], index=range(1, 4))
print(s)

# Series.append is deprecated (removed in pandas 2.0), so use pd.concat
s = pd.concat([s, pd.Series(list('defgh'))])
s.index = range(11, 19)
print(s)
Here's the output:
Initial Series:
1 a
2 b
3 c
dtype: object
After the append:
1 a
2 b
3 c
0 d
1 e
2 f
3 g
4 h
dtype: object
As you can see, the appended values start their index from 0 again because we didn't give them an index.
Now, to set the index to new values, I just assign to s.index. After s.index = range(11, 19) the Series looks like this:
11 a
12 b
13 c
14 d
15 e
16 f
17 g
18 h
dtype: object
Let's try .set_axis as an alternative, because .set_index is a DataFrame method and not a Series method:
s.set_axis([4,5,6], axis=0)
You can also use np.arange if you don't want to hard-code the values:
s.set_axis(np.arange(len(s))+4, axis=0)
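Note that set_axis returns a new Series by default instead of modifying s in place, so assign the result back. A minimal sketch (numpy imported as np is assumed):
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
s = s.set_axis(np.arange(len(s)) + 4)  # index becomes 4, 5, 6
print(list(s.index))                   # [4, 5, 6]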
How can I use loc to slice everything between the first and last rows of the following pandas DataFrame?
Input:
id text
0 A
1 B
2 C
3 D
Output:
| id | text |
|----|------|
| 1 | B |
| 2 | C |
Selecting rows
Use
df.iloc[1:-1] # similar to df.iloc[1:3]
id text
1 1 B
2 2 C
This slices all rows by position, excluding position 0 (the first row) and position -1 (the last row).
Assigning to an existing column
Since iloc expects positional values, if you need to assign back, pass the column position as the second argument in this manner:
df.iloc[1:-1, df.columns.get_loc('text')] = 'Test'
df
id text
0 0 A
1 1 Test
2 2 Test
3 3 D
Assigning to a new column
Since your index values are numeric, you can simplify the options above (as a special case):
df.loc[1:2, 'text'] = 'Test' # note the slice is `1:2`, not `1:3` or `1:-1`
In the case of loc, the ending slice is inclusive. If you don't know at runtime what your index labels will be, you can generalise using:
# Generalized solution
df.loc[df.index[1]:df.index[-2], 'text'] = 'Test'
The solutions above leave NaN in the rows that weren't assigned (when creating a new column). To give every row a value, one option is to follow up with fillna; but we can do it in one step using np.where or a more specific expression, depending on your problem.
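For example, a minimal np.where sketch (the 'Default' filler value is just an assumption for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3]})

mask = np.zeros(len(df), dtype=bool)
mask[1:-1] = True                               # everything but the first and last row
df['text'] = np.where(mask, 'Test', 'Default')  # no NaN left behind
print(df)
#    id     text
# 0   0  Default
# 1   1     Test
# 2   2     Test
# 3   3  Default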
I have a dataframe which looks like this
      a     b  ...     z
1  NULL  NULL  ...     1
2  NULL     1  ...  NULL
3     1  NULL  ...  NULL
The first column is always populated, and there are many others to the right of it. Of the columns a through z, exactly one is populated in each row; the rest are not.
I would like to transform this dataframe into a two-column data frame with the headers of columns a through z in the second column. The example above would be transformed to this.
The_Column
1 z
2 b
3 a
The pandas.melt() function is close to what I need, but it doesn't handle the NULL values. I only care about the one cell in columns a through z that is populated.
Is there an elegant way to handle this problem?
You need melt and then dropna() - that's it.
This should work (ignore_index=False keeps the 'a' index through the melt, so you don't lose track of which row each value came from):
df.set_index('a').melt(ignore_index=False).dropna().reset_index()
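A runnable sketch with a toy frame (the column names and values below are just assumptions; 'a' plays the role of the always-populated first column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [np.nan, 1, np.nan],
                   'c': [np.nan, np.nan, 1],
                   'z': [1, np.nan, np.nan]})

out = df.set_index('a').melt(ignore_index=False).dropna().reset_index()
print(out)
#    a variable  value
# 0  2        b    1.0
# 1  3        c    1.0
# 2  1        z    1.0
Renaming 'variable' to 'The_Column', dropping 'value' and sorting by 'a' then gives exactly the two-column shape asked for.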
Using stack (which drops NA's by default):
x = (df.set_index('a')
       .stack()
       .reset_index()
       .drop(columns=0)
       .rename(columns={'level_1': 'The_Column'}))
print(x)
Output:
a The_Column
0 1 z
1 2 b
2 3 c
I'm working on a small project using Python pandas and I'm stuck on the following problem:
I have a table where column A contains multiple, possibly non-unique values, and a second column B with values that might be zero. Now I want to group all rows in the DataFrame by their value in column A and then only "keep" or "select" the groups which contain one or more zeros in column B.
For example from a DataFrame that looks like this:
Column A Column B
-------- --------
b 12
c 56
f 0
b 456
b 334
f 10
I am only interested in all rows (the group) where column A = f:
Column A Column B
-------- --------
f 0
f 10
I know how I could achieve this using loops and iterating over the groups, but I'm looking for simple and reasonably fast code, as the DataFrames I work with can get very large.
My current approach is something like this:
df.groupby("A").filter(lambda x: 0 in x["B"].values)
Obviously I'm new to pandas and am hoping for your help!
Thank you in advance!
One way is to get all the values of Column A where Column B is zero, and then filter the original frame on that set:
groups = df[df['Column B'] == 0]['Column A'].unique()
>>> df[df['Column A'].isin(groups)]
Column A Column B
2 f 0
5 f 10
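Here is the same approach end to end with the sample data from the question (building the frame inline is just for the demo):
import pandas as pd

df = pd.DataFrame({'Column A': ['b', 'c', 'f', 'b', 'b', 'f'],
                   'Column B': [12, 56, 0, 456, 334, 10]})

groups = df[df['Column B'] == 0]['Column A'].unique()
result = df[df['Column A'].isin(groups)]
print(result)
#   Column A  Column B
# 2        f         0
# 5        f        10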