Can't figure out why .dropna() is not dropping rows with NaN values?
Help please. I've gone through the pandas documentation and I don't know what I'm doing wrong.
import pandas as pd
import quandl
df = quandl.get("GOOG/NYSE_SPY")
df2 = quandl.get("YAHOO/AAPL")
date = pd.date_range('2010-01-01', periods = 365)
df3 = pd.DataFrame(index = date)
df3 = df3.join(df['Open'], how = 'inner')
df3.rename(columns = {'Open': 'SPY'}, inplace = True)
df3 = df3.join(df2['Open'], how = 'inner')
df3.rename(columns = {'Open': 'AAPL'}, inplace = True)
df3['Spread'] = df3['SPY'] / df3['AAPL']
df3 = df3 / df3.iloc[0]
df3.dropna(how = 'any')
df3.plot()
print(df3)
Change df3.dropna(how = 'any') to df3 = df3.dropna(how = 'any').
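dropna returns a new DataFrame rather than modifying df3 in place, so the result must be assigned back (or you can pass inplace=True). A minimal sketch:
df3 = df3.dropna(how='any')        # keep the returned DataFrame
# or, equivalently:
df3.dropna(how='any', inplace=True)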
I tried to replicate your problem with a simple csv file:
In [6]: df
Out[6]:
a b
0 1.0 3.0
1 2.0 NaN
2 NaN 6.0
3 5.0 3.0
Both df.dropna(how='any') and df1 = df.dropna(how='any') work. Even plain df.dropna() works. I am wondering whether your issue is the division you perform on the previous line:
df3 = df3 / df3.iloc[0]
df3.dropna(how = 'any')
For instance, if I divide by df.iloc[1], then, since one of its elements is NaN, every element of that column in the result becomes NaN, and removing the NaNs with dropna then drops all rows:
In [17]: df.iloc[1]
Out[17]:
a 2.0
b NaN
Name: 1, dtype: float64
In [18]: df2 = df / df.iloc[1]
In [19]: df2
Out[19]:
a b
0 0.5 NaN
1 1.0 NaN
2 NaN NaN
3 2.5 NaN
In [20]: df2.dropna()
Out[20]:
Empty DataFrame
Columns: [a, b]
Index: []
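Applied to your script, one way to avoid this propagation (a sketch, assuming you want to discard incomplete rows before normalizing) is:
df3 = df3.dropna(how='any')   # remove rows with NaN first
df3 = df3 / df3.iloc[0]       # then normalize by the first, now complete, row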
I am using openpyxl to edit three dataframes, df1, df2, and df3 (if necessary, we can regard them as three independent Excel files):
import pandas as pd
data1 = [[1, 1],[1,1]]
df1 = pd.DataFrame(data1, index = ['I1a','I1b'], columns=['v1a', 'v1b'])
df1.index.name='I1'
data2 = [[2, 2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2]]
df2 = pd.DataFrame(data2, index = ['I2a','I2b','I2c','I2d'], columns=['v2a','v2b','v2c','v2d'])
df2.index.name='I2'
data3 = [['a', 'b',3,3],['a','c',3,3],['b','c',3,3],['c','d',3,3]]
df3 = pd.DataFrame(data3, columns=['v3a','v3b','v3c','v3d'])
df3 = df3.groupby(['v3a','v3b']).first()
Here df3 has a MultiIndex. How can I concat them into one Excel sheet horizontally (with each dataframe starting at the same row), as follows:
Here we regard the index as a column, and for the MultiIndex we keep the first level hidden.
Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
I1 v1a v1b I2 v2a v2b v2c v2d v3a v3b v3c v3d
0 I1a 1.0 1.0 I2a 2 2 2 2 a b 3 3
1 I1b 1.0 1.0 I2b 2 2 2 2 a c 3 3
2 NaN NaN NaN I2c 2 2 2 2 b c 3 3
3 NaN NaN NaN I2d 2 2 2 2 c d 3 3
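To actually write the combined result to a single sheet, a minimal sketch (the file name combined.xlsx is just an example; this assumes an Excel writer such as openpyxl is installed):
out = pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
out.to_excel('combined.xlsx', index=False)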
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)
My dataframe consists of 3 columns. The third column is based on the first two. The default is column 2, but if column 2 is NaN, then I want column 3 to be filled with column 1. I added the third line to conditions, but it does not seem to work.
This is the DataFrame:
df = pd.DataFrame(np.array([[np.nan, 1717], [1749, 1750], [1704, np.nan]]),
columns=['a', 'b'])
This is my code:
import numpy as np
import pandas as pd
conditions = [
(df["b"] <= df["a"]),
df["b"] > df["a"],
df["b"] == df["b"].isna()]
choices = [df["b"], df["a"], df["a"]]
df['c'] = np.select(conditions, choices, default=df["b"])
print(df)
This is my output:
a b c
0 NaN 1717.0 1717.0
1 1749.0 1750.0 1749.0
2 1704.0 NaN NaN
But I want c to be filled if a or b is filled. So this is the output I want:
a b c
0 NaN 1717.0 1717.0
1 1749.0 1750.0 1749.0
2 1704.0 NaN 1704.0
You just need to make a small change to your third condition. df["b"].isna() already returns True or False, so df["b"] == df["b"].isna() actually checks whether each value of df["b"] equals that boolean (it doesn't).
Just remove the first half of the third condition.
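You can see this directly; the comparison is False for every row, so that condition never fires:
print(df["b"] == df["b"].isna())
# 0    False
# 1    False
# 2    False
# Name: b, dtype: bool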
import numpy as np
import pandas as pd
conditions = [
(df["b"] <= df["a"]),
df["b"] > df["a"],
df["b"].isna()]
choices = [df["b"], df["a"], df["a"]]
df['c'] = np.select(conditions, choices, default=df["b"])
print(df)
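With the sample DataFrame above, this prints:
a b c
0 NaN 1717.0 1717.0
1 1749.0 1750.0 1749.0
2 1704.0 NaN 1704.0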
This seems to work:
df = pd.DataFrame(np.array([[np.nan, 1717], [1749, 1750], [1704, np.nan]]),
columns=['a', 'b'])
df['c'] = df.a
for i in range(len(df)):
    # np.nan == np.nan is False, so test with pd.isna instead of ==
    if pd.isna(df.a.iloc[i]):
        # assign via .loc to avoid chained-assignment issues
        df.loc[df.index[i], 'c'] = df.b.iloc[i]
This solution gives the output that you want:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[np.nan, 1717], [1749, 1750], [1704, np.nan]]),
columns=['a', 'b'])
def fill_row(row):
if pd.isnull(row['a']):
return row['b']
else:
return row['a']
df['c'] = df.apply(fill_row, axis=1)
print(df)
The output:
a b c
0 NaN 1717.0 1717.0
1 1749.0 1750.0 1749.0
2 1704.0 NaN 1704.0
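For what it's worth, the same result can be obtained without apply, using fillna to take 'a' where it exists and fall back to 'b':
df['c'] = df['a'].fillna(df['b'])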
I have a dataframe containing two columns, A_ID and R_ID.
I want to update R_ID to contain only values that also appear in A_ID; the rest should be deleted (i.e. set to NaN). The values should stay at the same position/index. I get that this is an inner join, but my attempted solutions ran into several problems.
Example:
import pandas as pd
import numpy as np
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7', [np.nan], [np.nan], '1E4']}
df = pd.DataFrame(data)
print(df)
I tried
df_A_ID = df[["A_ID"]]
df_R_ID = df[["R_ID"]]
new_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID', right_index=True)
and
new_df = pd.concat([df_A_ID, df_R_ID], join="inner")
But with the first option I get a "You are trying to merge on object and int64 columns" error, even though both columns have dtype object, and with the second one I get an empty DataFrame.
My expected output would be the same dataframe as before, but with R_ID only containing values that are also in column A_ID, at the same index/position:
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': [[np.nan],[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)
Set NaN with Series.where for values that have no match, comparing the columns with Series.isin:
# solution working with scalar NaNs
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',np.nan,np.nan,"1E4",]}
df = pd.DataFrame(data)
print(df)
A_ID R_ID
0 1E2 1E7
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
df['R_ID'] = df['R_ID'].where(df["R_ID"].isin(df["A_ID"]))
print(df)
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
Or:
df.loc[~df["R_ID"].isin(df["A_ID"]), 'R_ID'] = np.nan
Use isin; assigning the filtered Series back aligns on the index, so rows without a match become NaN:
df['R_ID'] = df['R_ID'].loc[df['R_ID'].isin(df['A_ID'])]
>>> df
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
This should work:
df_A_ID = df[["A_ID"]].astype(dtype=pd.StringDtype())
df_R_ID = df[["R_ID"]].astype(dtype=pd.StringDtype()).reset_index()
temp_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID').set_index('index')
df.loc[~(df_R_ID.isin(temp_df[['R_ID']])['R_ID']).fillna(False), 'R_ID'] = np.nan
Output
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
I have a df that contains NaN:
A
nan
nan
nan
nan
2017
2018
I tried to remove all the NaN rows in df:
df = df.loc[df['A'].notnull()]
but df still contains those NaN values in column 'A' after running the above code. The dtype of 'A' is object.
I am wondering how to fix it. The thing is, I need to define multiple conditions to filter the df, and df['A'].notnull() is one of them. I don't know why it doesn't work.
Please provide a reproducible example. As such, this works. Note that since your column has dtype object, the 'nan' entries are likely the string 'nan' rather than real NaN values, which notnull() will not catch; replace them first, as in the df2 case below:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan], [np.nan], [2017], [2018]], columns=['A'])
df = df[df['A'].notnull()]
df2 = pd.DataFrame([['nan'], ['nan'], [2017], [2018]], columns=['A'])
df2 = df2.replace('nan', np.nan)
df2 = df2[df2['A'].notnull()]
# output (for df; df2 prints 2017 and 2018 without decimals)
#         A
# 2  2017.0
# 3  2018.0
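If the column really does hold the string 'nan' (a guess, consistent with the object dtype), pd.to_numeric with errors='coerce' is another way to turn such entries into real NaN before filtering:
df2['A'] = pd.to_numeric(df2['A'], errors='coerce')  # non-numeric strings become NaN
df2 = df2[df2['A'].notnull()]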
I want to append a Series to a DataFrame, where the Series's index matches the DataFrame's columns, using pd.concat, but it gives me a surprise:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index/columns aware. So is it true that if I want to concat a Series as a new row to a DataFrame, I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and a transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will then be:
a b
1 1 2
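By contrast, the to_frame().T approach above keeps the extra label, adding 'c' as a new column instead of dropping it (a small sketch reusing the same df and sr):
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})
out = pd.concat([df, sr.to_frame(1).T])  # 'c' becomes a new column
print(out)
#    a  b  c
# 1  1  2  3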