I have a df that contains nan values in column A:
A
nan
nan
nan
nan
2017
2018
I tried to remove all the nan rows in df,
df = df.loc[df['A'].notnull()]
but df still contains those nan values for column 'A' after the above code. The dtype of 'A' is object.
I am wondering how to fix it. The thing is I need to define multiple conditions to filter the df, and df['A'].notnull() is one of them. Don't know why it doesn't work.
Please provide a reproducible example. As it stands, both of these work:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan], [np.nan], [2017], [2018]], columns=['A'])
df = df[df['A'].notnull()]
df2 = pd.DataFrame([['nan'], ['nan'], [2017], [2018]], columns=['A'])
df2 = df2.replace('nan', np.nan)
df2 = df2[df2['A'].notnull()]
# output (identical for df and df2)
#         A
# 2  2017.0
# 3  2018.0
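If notnull() filters nothing, the object column most likely holds the literal string 'nan' rather than a real missing value. A minimal sketch of another fix, assuming that is the case: pd.to_numeric converts the strings into genuine NaN (errors='coerce' also catches anything else unparseable), after which notnull() behaves as expected:
import pandas as pd
df = pd.DataFrame({'A': ['nan', 'nan', 'nan', 'nan', 2017, 2018]})
# 'nan' strings become real NaN; any other unparseable value would too
df['A'] = pd.to_numeric(df['A'], errors='coerce')
df = df[df['A'].notnull()]
print(df)
#         A
# 4  2017.0
# 5  2018.0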
Given the following DataFrame:
A B
0 -10.0 NaN
1 NaN 20.0
2 -30.0 NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
A
0 -10.0
1 20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns=['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print (df)
A
0 -10.0
1 20.0
2 -30.0
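fillna and combine_first give the same result here; the practical difference is that combine_first returns the union of both indexes, while fillna keeps the index of A, which only matters when the two Series are indexed differently.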
If there are more columns and you need the first non-missing value per row, back-fill the missing values along the columns and then select the first column; using a one-element list keeps a one-column DataFrame:
df = df.bfill(axis=1).iloc[:, [0]]
print (df)
A
0 -10.0
1 20.0
2 -30.0
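For instance, a sketch with a hypothetical third column C: bfill(axis=1) walks left to right, so the first column of the result holds each row's first non-missing value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [-10.0, np.nan, np.nan],
                   'B': [np.nan, 20.0, np.nan],
                   'C': [1.0, 2.0, -30.0]})
# back fill across columns, then keep only the first column;
# the one-element list [0] returns a DataFrame instead of a Series
df = df.bfill(axis=1).iloc[:, [0]]
print(df)
#       A
# 0 -10.0
# 1  20.0
# 2 -30.0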
The first and second DataFrames are as below:
import pandas as pd
d = {'0': [2154,799,1023,4724], '1': [27, 2981, 952,797],'2':[4905,569,4767,569]}
df1 = pd.DataFrame(data=d)
and
d = {'PART_NO': ['J661-03982', '661-08913', '922-8972', '661-00352', '661-06291'], 'PART_NO_ENCODED': [2154, 799, 1023, 27, 569]}
df2 = pd.DataFrame(data=d)
I want to get the corresponding PART_NO for each value in df1; for the first row of df1, the result should look like this:
d={'PART_NO': ['J661-03982','661-00352',''], 'PART_NO_ENCODED': [2154,27,4905]}
df3 = pd.DataFrame(data=d)
I can achieve this as follows:
df2.set_index('PART_NO_ENCODED').reindex(df1.iloc[0,:]).reset_index().rename(columns={0:'PART_NO_ENCODED'})
But instead of passing one row at a time to reindex (df1.iloc[0,:], then df1.iloc[1,:], and so on), I want to get the corresponding PART_NO for all the rows in df1 at once. Please help?
You can use the second dataframe as a dictionary of replacements:
df3 = df1.replace(df2.set_index('PART_NO_ENCODED').to_dict()['PART_NO'])
The values that are not in df2 will not be replaced; they have to be identified and discarded:
df3 = df3[df1.isin(df2['PART_NO_ENCODED'].tolist())]
# 0 1 2
#0 J661-03982 661-00352 NaN
#1 661-08913 NaN 661-06291
#2 922-8972 NaN NaN
#3 NaN NaN 661-06291
You can later replace the missing values with '' or any other value of your choice with fillna.
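Putting the pieces together, a minimal runnable sketch (same data as above, with fillna('') as the final step):
import pandas as pd
df1 = pd.DataFrame({'0': [2154, 799, 1023, 4724],
                    '1': [27, 2981, 952, 797],
                    '2': [4905, 569, 4767, 569]})
df2 = pd.DataFrame({'PART_NO': ['J661-03982', '661-08913', '922-8972', '661-00352', '661-06291'],
                    'PART_NO_ENCODED': [2154, 799, 1023, 27, 569]})
# build a code -> part-number mapping and apply it to every cell
mapping = df2.set_index('PART_NO_ENCODED')['PART_NO'].to_dict()
df3 = df1.replace(mapping)
# keep only cells whose original code exists in df2, blank out the rest
df3 = df3[df1.isin(df2['PART_NO_ENCODED'].tolist())].fillna('')
print(df3)
# row 0 is J661-03982, 661-00352, '' -- matching the desired df3 above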
I am using pandas (version 0.20.3) and I want to apply the diff() method with groupby(), but instead of a DataFrame, the result is an error.
Here is the code:
import numpy as np
import pandas as pd
# creating the DataFrame
data = np.random.random(18).reshape(6,3)
indexes = ['B']*3 + ['A']*3
columns = ['x', 'y', 'z']
df = pd.DataFrame(data, index=indexes, columns=columns)
df.index.name = 'chain_id'
# Now I want to apply the diff method per chain_id
df.groupby('chain_id').diff()
And the result is an error!
Note that df.loc['A'].diff() and df.loc['B'].diff() do return the expected results, so I don't understand why it wouldn't work with groupby().
IIUC, your error is: cannot reindex from a duplicate axis. The duplicated index labels are the problem, so reset to a unique index, apply the groupby diff, then restore the original index:
df.reset_index().groupby('chain_id').diff().set_index(df.index)
Out[859]:
                 x         y         z
chain_id
B              NaN       NaN       NaN
B        -0.468771  0.192558 -0.443570
B         0.323697  0.288441  0.441060
A              NaN       NaN       NaN
A        -0.198785  0.056766  0.081513
A         0.138780  0.563841  0.635097
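To confirm the diagnosis before applying the workaround, check whether the index is unique (reusing the df built in the question):
# the index is ['B', 'B', 'B', 'A', 'A', 'A'] -- duplicated labels
print(df.index.is_unique)  # False, which is what triggers the reindex error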
Can't figure out why .dropna() is not dropping rows with NaN values?
Help please. I've gone through the pandas documentation and don't know what I'm doing wrong.
import pandas as pd
import quandl
df = quandl.get("GOOG/NYSE_SPY")
df2 = quandl.get("YAHOO/AAPL")
date = pd.date_range('2010-01-01', periods = 365)
df3 = pd.DataFrame(index = date)
df3 = df3.join(df['Open'], how = 'inner')
df3.rename(columns = {'Open': 'SPY'}, inplace = True)
df3 = df3.join(df2['Open'], how = 'inner')
df3.rename(columns = {'Open': 'AAPL'}, inplace = True)
df3['Spread'] = df3['SPY'] / df3['AAPL']
df3 = df3 / df3.iloc[0]
df3.dropna(how = 'any')
df3.plot()
print(df3)
Change df3.dropna(how = 'any') to df3 = df3.dropna(how = 'any'). dropna returns a new DataFrame instead of modifying df3 in place (unless you pass inplace=True), so you have to assign the result back.
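A small sketch of the difference:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
df.dropna(how='any')        # result is discarded; df still has 2 rows
df = df.dropna(how='any')   # assign the result back; df now has 1 row
# or, equivalently, modify in place:
# df.dropna(how='any', inplace=True)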
I tried to replicate your problem with a simple csv file:
In [6]: df
Out[6]:
a b
0 1.0 3.0
1 2.0 NaN
2 NaN 6.0
3 5.0 3.0
Both df.dropna(how='any') and df1 = df.dropna(how='any') work. Even just df.dropna() works. I am wondering whether your issue arises because you perform a division in the previous line:
df3 = df3 / df3.iloc[0]
df3.dropna(how = 'any')
For instance, if I divide by df.iloc[1], since one of its elements is NaN, every element of that column in the result becomes NaN, and removing NaNs with dropna then removes all rows:
In [17]: df.iloc[1]
Out[17]:
a 2.0
b NaN
Name: 1, dtype: float64
In [18]: df2 = df / df.iloc[1]
In [19]: df2
Out[19]:
a b
0 0.5 NaN
1 1.0 NaN
2 NaN NaN
3 2.5 NaN
In [20]: df2.dropna()
Out[20]:
Empty DataFrame
Columns: [a, b]
Index: []
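If that is the cause, one way out (a sketch, not your exact data) is to drop the incomplete rows before normalizing, so the divisor row is guaranteed to be complete:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 5.0], 'b': [3.0, np.nan, 6.0, 3.0]})
# drop rows with NaN first, then divide by the first remaining row
df = df.dropna(how='any')
df = df / df.iloc[0]
print(df)
#      a    b
# 0  1.0  1.0
# 3  5.0  1.0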
UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using pd.concat. The problem is that when one of the series is empty, it doesn't get included in the resulting dataframe, which gives the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0 a
1 b
2 c
dtype: object
But I want it to produce something like this:
>>> df2
0 1
0 NaN a
1 NaN b
2 NaN c
It does this if I put a single nan value anywhere in sers1, but it seems like this should happen automatically even when some of my series are completely empty.
Passing labels to the keys argument will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
0
0 1
1 2
2 3
But if we pass some labels to the keys argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, keys=labels, axis=1)
Which produces the desired dataframe:
>>> df
0 1 2
0 NaN 1 NaN
1 NaN 2 NaN
2 NaN 3 NaN
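For what it's worth, per the update at the top, pandas 0.18.1+ no longer drops empty series from concat, so on a recent version keys mainly controls the column labels; a quick sketch:
import pandas as pd
ser1 = pd.Series(dtype=float)  # explicit dtype keeps newer pandas quiet
ser2 = pd.Series([1, 2, 3])
# the empty inputs are kept either way; keys just names the columns
print(pd.concat([ser1, ser2, ser1], axis=1).shape)  # (3, 3)
print(pd.concat([ser1, ser2, ser1], axis=1, keys=['a', 'b', 'c']).columns.tolist())
# ['a', 'b', 'c']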