I imported an Excel file and now I need to multiply certain values from the list, but if the value from the first column is NaN, Python should take another column for the calculation. I have the following code:
if pd['Column1'] == 'NaN':
    pd['Column2'] * pd['Column3']
else:
    pd['Column1'] * pd['Column3']
Thank you for your help.
You can use isna() together with any() or all(). Here is an example:
import pandas as pd
import numpy as np
# generate test data assuming all the values in Col1 are NaN
df = pd.DataFrame({'Col1':[np.nan,np.nan,np.nan,np.nan], 'Col2':[1,2,3,4], 'Col3':[2,3,4,5]})
if df['Col1'].isna().all(): # you can also use any() instead of all()
    df['Col4'] = df['Col2']*df['Col3']
else:
    df['Col4'] = df['Col1']*df['Col3']
print(df)
Output:
Col1 Col2 Col3 Col4
0 NaN 1 2 2
1 NaN 2 3 6
2 NaN 3 4 12
3 NaN 4 5 20
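Note that all()/any() decide once for the whole column. If NaNs appear only in some rows and you want the fallback applied row by row, a minimal sketch (assuming the same column layout, with fillna supplying the per-row fallback) could be:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1':[np.nan, 2, np.nan, 4], 'Col2':[1, 2, 3, 4], 'Col3':[2, 3, 4, 5]})

# use Col1 where it is present, otherwise fall back to Col2, then multiply by Col3
df['Col4'] = df['Col1'].fillna(df['Col2']) * df['Col3']
print(df)
Here fillna substitutes Col2 only in the rows where Col1 is NaN, so each row uses the appropriate column.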
I have a pandas dataframe with a lot of columns. The dtype of all columns is object because some columns contain strings. Is there a way to filter out rows into a different dataframe where the value in any column is a string, and then convert the cleaned dataframe to integer dtype?
I figured out the second part, but I'm not able to achieve the first part: filtering out rows if a value contains string characters like 'a', 'b', etc. For example, if df is:
df = pd.DataFrame({
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
})
This should become two dataframes:
df = pd.DataFrame({
    'col1':[1,2,3],
    'col2':[1,2,5],
    'col3':[1,2,4]
})
dfError = pd.DataFrame({
    'col1':['a',0],
    'col2':[3,4],
    'col3':[3,'45a5']
})
I believe this to be an efficient way to do it.
import pandas as pd
df = pd.DataFrame({ # main dataframe
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
})
mask = df.apply(pd.to_numeric, errors='coerce').isna() # True where a value couldn't be converted to numeric
mask = mask.any(axis=1) # True for rows containing at least one non-numeric value
df1 = df[~mask] # rows that are fully numeric
df2 = df[mask] # rows containing a non-numeric value
Breaking it down:
df.apply(pd.to_numeric) # converts the dataframe into numeric, but this would give us an error for the string elements (like 'a')
df.apply(pd.to_numeric, errors='coerce') # 'coerce' sets any non-valid element to NaN (converts the string elements to NaN).
>>>
col1 col2 col3
0 1.0 1 1.0
1 2.0 2 2.0
2 NaN 3 3.0
3 0.0 4 NaN
4 3.0 5 4.0
df.apply(pd.to_numeric, errors='coerce').isna() # detect the missing values created by the coercion
>>>
col1 col2 col3
0 False False False
1 False False False
2 True False False
3 False False True
4 False False False
mask.any(axis=1) # returns whether any element is True along each row
>>>
0 False
1 False
2 True
3 True
4 False
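For the second part of the question (converting the cleaned dataframe to an integer dtype), a minimal sketch, assuming df1 from above, could be:
df1 = df1.apply(pd.to_numeric).astype(int) # df1 still has object dtype, so convert each column
print(df1.dtypes)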
I don't know if there's a more performant way to check this, but a quick (possibly slow) way could be:
str_cond = df.applymap(lambda x: isinstance(x, str)).any(axis=1)
df[~str_cond]
col1 col2 col3
0 1 1 1
1 2 2 2
4 3 5 4
df[str_cond]
col1 col2 col3
2 a 3 3
3 0 4 45a5
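Note that on recent pandas versions (2.1 and later) DataFrame.applymap is deprecated in favour of DataFrame.map, so an equivalent sketch would be:
str_cond = df.map(lambda x: isinstance(x, str)).any(axis=1) # same per-element check with DataFrame.map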
I have a dataframe df1 where column 1 (col1) contains customer ids. Col2 contains sales figures, and some of the values are missing.
My problem is that I want to drop duplicate customer ids in col1 only where the value of sales is missing.
I tried writing a function saying:
def drop(i):
    if i[col2] == np.nan:
        i.drop_duplicates(subset = 'col1')
    else:
        return i['col1']
I am getting an error saying "The truth value of a Series is ambiguous".
Thank you for reading. I would appreciate a solution.
The following should work, using groupby, apply, dropna, and reset_index, assuming your data is something like this:
input:
col1 col2
0 1001 2.0
1 1001 NaN
2 1002 4.0
3 1002 NaN
code:
import pandas as pd
import numpy as np
#Dummy data
data = {
    'col1':[1001,1001,1002,1002],
    'col2':[2,np.nan,4,np.nan],
}
df = pd.DataFrame(data)
#Solution
df.groupby('col1').apply(lambda group: group.dropna(subset=['col2'])).reset_index(drop=True)
output:
col1 col2
0 1001 2.0
1 1002 4.0
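If you only want to drop a row with missing sales when its customer id actually appears more than once (so a customer whose only row has missing sales is kept), a minimal sketch using duplicated could look like this:
# drop a row only if its sales value is missing AND its customer id is duplicated
to_drop = df['col2'].isna() & df['col1'].duplicated(keep=False)
print(df[~to_drop].reset_index(drop=True))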
I have a dictionary:
d = {"A":1, "B":2, "C":3}
I also have a pandas dataframe:
col1
A
G
E
B
C
I'd like to create a new column by mapping the dictionary onto col1. Simultaneously I'd like to set the values in another column to indicate whether the value in that row has been mapped. The desired output would look like this:
col1 col2 col3
A 1 1
G NaN 0
E NaN 0
B 2 1
C 3 1
I know that col2 can be created using df.col1.map(d), but how can I simultaneously create col3?
You can create both columns with one call to assign: the first by map and the second by isin for a boolean mask, cast to integers:
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print (df)
col1 col2 col3
0 A 1.0 1
1 G NaN 0
2 E NaN 0
3 B 2.0 1
4 C 3.0 1
Another, two-step solution with a different boolean mask, checking for non-missing values:
df['col2'] = df.col1.map(d)
df['col3'] = df['col2'].notnull().astype(int)
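For completeness, a self-contained version of the setup (a sketch reconstructing df and d from the question) that can be run directly:
import pandas as pd

d = {"A": 1, "B": 2, "C": 3}
df = pd.DataFrame({'col1': ['A', 'G', 'E', 'B', 'C']})

# map the dictionary and flag which rows were actually mapped
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print(df)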
In my code the df.fillna() method is not working, while the df.dropna() method is. I don't want to drop the column, though. What can I do so that the fillna() method works?
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct_change "normalizes" the different currencies (each crypto coin has vastly different values; we're really more interested in the other coin's movements)
            # df.dropna(inplace=True)  # remove the NaNs created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values)  # scale between 0 and 1.
It should work, unless it's not within the loop as mentioned.
You should consider filling the values before you construct the loop, or during DataFrame construction.
The example below clearly shows it working:
>>> df
col1
0 one
1 NaN
2 two
3 NaN
Works as expected:
>>> df['col1'].fillna( method ='ffill') # This is showing column specific to `col1`
0 one
1 one
2 two
3 two
Name: col1, dtype: object
Secondly, if you wish to change only a few selected columns, you can use the method below.
Let's suppose you have 3 columns and want to fill NaNs with ffill for only 2 of them.
>>> df
col1 col2 col3
0 one test new
1 NaN NaN NaN
2 two rest NaN
3 NaN NaN NaN
Define the columns to be changed:
cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method ='ffill')
>>> df
col1 col2 col3
0 one test new
1 one test NaN
2 two rest NaN
3 two rest NaN
If you want it to happen across the entire DataFrame, then use it as follows:
>>> df
col1 col2
0 one test
1 NaN NaN
2 two rest
3 NaN NaN
>>> df.fillna(method ='ffill') # pass inplace=True if you want the change to be permanent
col1 col2
0 one test
1 one test
2 two rest
3 two rest
The first value was a NaN, so I had to use the bfill method instead. Thanks, everyone.
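For reference, on newer pandas versions fillna(method=...) is deprecated in favour of the dedicated ffill()/bfill() methods; a minimal sketch (assuming a generic df) that forward-fills and then back-fills, so a leading NaN is also covered, could be:
# forward-fill first, then back-fill whatever is left (e.g. a NaN in the very first row)
df = df.ffill().bfill()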
Given this dataframe, how can I select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
Col1 Col2 Col3
0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0
4 0 1.0 2.0
The result should be this one:
Col1 Col2 Col3
1 0 NaN 0.0
Try the following:
df[df['Col2'].isnull()]
#qbzenker provided the most idiomatic method IMO
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you're looking to filter the rows where there is no NaN in some column using query, you could do so with the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like #MaxU - stop WAR against UA did above:
df.query('Col2==Col2')