What's the most pythonic way to drop the columns of a dataframe whose header is NaN? Preferably in place.
A column may or may not contain data.
df = pd.DataFrame({'col1': [1,2,np.nan], 'col2': [4,5,6], np.nan: [7,np.nan,9]})
df.dropna(axis='columns', inplace=True)
This doesn't do it, as it looks at the data in the columns, not the headers.
Wanted output
df = pd.DataFrame({'col1': [1,2,np.NaN], 'col2': [4,5,6]})
Thanks in advance for the replies.
Simply try this:
df.drop(np.nan, axis=1, inplace=True)
However, if 'no header' can also mean None, then jpp's answer will handle everything in one shot.
And if there is more than one np.nan header, I don't know how to make df.drop work.
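For the multiple-NaN-headers case, a boolean mask over the columns sidesteps drop entirely. A minimal sketch (the two-NaN-column frame below is a made-up example, built with an explicit columns list since a dict literal can't hold duplicate keys):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two NaN headers
df = pd.DataFrame([[1, 7, 8], [2, 9, 10]], columns=['col1', np.nan, np.nan])

# Keep only the columns whose header is not NaN
df = df.loc[:, df.columns.notna()]
print(df.columns.tolist())  # ['col1']
```

This works regardless of how many NaN headers there are, because it selects columns positionally by mask rather than by label.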
You can use pd.Index.dropna:
df = df[df.columns.dropna()]
print(df)
col1 col2
0 1.0 4
1 2.0 5
2 NaN 6
Related
I have a dataset, df, with some empty values in the second column, col2. So I create a new table with the same column names, whose length equals the number of missing values in col2 of df. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means that df[df['col2'].isna()] is now missing everywhere, not only in col2.
Why is that, and how can I fix it?
Assuming that by df2 you really meant a Series, so renaming as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it relies on knowing that the number of NaNs in df equals the length of s. It would be better to know how you create the missing values; with that information, we could probably propose a better and more robust solution.
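One way to make the alignment explicit (an assumption about how the replacement values are produced) is to give s the index of the rows that are actually missing, and then let fillna align on it:

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [nan, 0, nan]})

# Index the replacement values by the positions that are actually missing
s = pd.Series([10, 11], index=df.index[df['col2'].isna()])

# fillna aligns on the index, so each value lands on its intended row
df['col2'] = df['col2'].fillna(s)
print(df['col2'].tolist())  # [10.0, 0.0, 11.0]
```

This way a length mismatch simply leaves some NaNs in place instead of silently shifting values.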
I read another question that probably has some similar problem (this one) but I couldn't understand the answer.
Consider 2 dataframes defined like this
df1 = pd.DataFrame({
    'A': ['a','b','c','d'],
    'B': [1,np.nan,np.nan,4]
})
df2 = pd.DataFrame({
    'A': ['a','c','b','d'],
    'B': [np.nan, 8, 9, np.nan]
})
I want to merge them to fill blank cells. At first I used
df1.merge(df2, on='A')
but this caused my df1 to have 2 different columns named B_x and B_y. I also tried with different parameters for the .merge() method but still couldn't solve the issue.
The final dataframe should look like this one
df1 = pd.DataFrame({
'A' : ['a','b','c','d'],
'B': [1,9,8,4]
})
Do you know what's the most logic way to do that?
I think pd.concat() could be a useful tool for this job, but I have no idea how to apply it.
EDIT:
I modified values so that 'A' columns don't have the same order in both dataframes. This should avoid any ambiguity.
Maybe you can use fillna instead:
new_df = df1.fillna(df2)
Note, though, that fillna aligns on the index, not on column 'A', so this only gives the desired result if both frames list the keys in the same row order. With the edited sample data (where df2's 'A' is ordered differently), the actual output is:
>>> new_df
   A    B
0  a  1.0
1  b  8.0
2  c  9.0
3  d  4.0
Here's a different solution:
(
    df1.merge(df2, on='A', suffixes=(';', ';'))
       .pipe(lambda x: x.set_axis(x.columns.str.strip(';'), axis=1))
       .groupby(level=0, axis=1)
       .first()
)
The semicolons (;) are arbitrary; you can use any character as long as it doesn't appear in any column names. Note that groupby(..., axis=1) is deprecated in recent pandas versions.
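The same merge-then-coalesce idea can be written without the suffix trick, by letting merge produce its default B_x/B_y columns and coalescing them with combine_first; a sketch with the question's frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, np.nan, np.nan, 4]})
df2 = pd.DataFrame({'A': ['a', 'c', 'b', 'd'], 'B': [np.nan, 8, 9, np.nan]})

# Merge on 'A', then take df1's value where present, df2's otherwise
m = df1.merge(df2, on='A', suffixes=('_x', '_y'))
m['B'] = m['B_x'].combine_first(m['B_y'])
out = m[['A', 'B']]
print(out['B'].tolist())  # [1.0, 9.0, 8.0, 4.0]
```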
You can map the values from df2:
df1['B'] = df1['B'].fillna(df1['A'].map(df2.set_index('A')['B']))
output:
A B
0 a 1.0
1 b 9.0
2 c 8.0
3 d 4.0
Alternative
If the values in A are unique, you can merge df2 onto just the 'A' column of df1 and use combine_first:
df1 = df1.combine_first(df1[['A']].merge(df2))
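A quick check of that alternative with the sample frames from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, np.nan, np.nan, 4]})
df2 = pd.DataFrame({'A': ['a', 'c', 'b', 'd'], 'B': [np.nan, 8, 9, np.nan]})

# Merging df2 onto just the 'A' column reorders df2's rows to match df1;
# combine_first then keeps df1's values and fills its NaNs from the merge result
df1 = df1.combine_first(df1[['A']].merge(df2))
print(df1['B'].tolist())  # [1.0, 9.0, 8.0, 4.0]
```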
I am trying to remove a specific NA format with the .dropna() method from pandas; however, when I apply it, the method returns None.
import pandas as pd
# importing data #
df = pd.read_csv(path, sep=',', na_values='NA')
# this is what the df looks like
df = {'col1': [1, 2], 'col2': ['NA', 4]}
df=pd.DataFrame(df)
# trying to drop NA
d= df.dropna(how='any', inplace=True)
This code returns None; the expected output would look like this:
# col1 col2
#0 2 4
How could I adjust this method?
Is there any simpler method to accomplish this task?
import numpy as np
import pandas as pd
First, replace the 'NA' strings in your dataframe with an actual NaN value using the replace() method:
df = df.replace('NA', np.nan, regex=True)
Then drop the rows. Note that with inplace=True the method returns None, so call it without assigning the result:
df.dropna(how='any', inplace=True)
Now if you print df you will get your desired output:
col1 col2
1 2 4.0
If you want exact same output that you mentioned in question then just use reset_index() method:
df=df.reset_index(drop=True)
Now if you print df you will get:
col1 col2
0 2 4.0
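The same steps can also be chained without inplace, assigning the result back in one go:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['NA', 4]})

# Turn the 'NA' strings into real NaN, drop the rows, and renumber the index
df = df.replace('NA', np.nan).dropna(how='any').reset_index(drop=True)
print(df.values.tolist())  # [[2, 4]]
```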
Remove records with string 'NA'
df[~df.eq('NA').any(axis=1)]
col1 col2
1 2 4
I am trying to merge two dataframes together but when I do it, it seems to replace the existing values with blanks, which isn't really ideal.
df1
col1 col2
'house1' 300
'house2' 450
'house3' 750
df2
col1 col2
'house4' 600
the code I'm using to concat is:
df = pd.concat([df2, df1], sort=False, ignore_index=True)
which yields:
col1 col2
'house4' 600
'house1'
'house2'
'house3'
Is concat the wrong choice? I want to update all the values, but as said, it's overwriting the existing data. I'm sorry if I'm doing a terrible job at explaining this.
You should try append:
df = df1.append(df2, ignore_index=True)
Since you want to add a row, you'll probably want to use pandas' append function.
Try this:
df = df1.append(df2)
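Note that DataFrame.append was removed in pandas 2.0; on newer versions the equivalent is pd.concat, which stacks the rows the same way when the column names match exactly:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['house1', 'house2', 'house3'],
                    'col2': [300, 450, 750]})
df2 = pd.DataFrame({'col1': ['house4'], 'col2': [600]})

# Stack df2's rows under df1's and renumber the index
df = pd.concat([df1, df2], ignore_index=True)
print(df['col1'].tolist())  # ['house1', 'house2', 'house3', 'house4']
```

If concat is producing blanks, check that the column names (including stray whitespace) really are identical in both frames.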
import pandas as pd
df = pd.DataFrame({
'col1':[99,99,99],
'col2':[4,5,6],
'col3':[7,None,9]
})
col_list = ['col1','col2']
df[col_list].replace(99,0,inplace=True)
This generates a Warning and leaves the dataframe unchanged.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I want to be able to apply the replace method on a subset of the columns specified by the user. I also want to use inplace = True to avoid making a copy of the dataframe, since it is huge. Any ideas on how this can be accomplished would be appreciated.
When you select the columns for replacement with df[col_list], a slice (a copy) of your dataframe is created. The copy is updated, but never written back into the original dataframe.
You should either replace one column at a time or use nested dictionary mapping:
df.replace(to_replace={'col1': {99: 0}, 'col2': {99: 0}}, inplace=True)
The nested dictionary for to_replace can be generated automatically:
d = {col : {99:0} for col in col_list}
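Putting the two pieces together as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [99, 99, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, 9]
})
col_list = ['col1', 'col2']

# Build the nested mapping {column: {old: new}} and replace in place on the
# original dataframe (no intermediate column slice, so no SettingWithCopyWarning)
d = {col: {99: 0} for col in col_list}
df.replace(to_replace=d, inplace=True)
print(df['col1'].tolist())  # [0, 0, 0]
```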
You can use replace with loc. Here is a slightly modified version of your sample df:
d = {'col1':[99,99,9],'col2':[99,5,6],'col3':[7,None,99]}
df = pd.DataFrame(data=d)
col_list = ['col1','col2']
df.loc[:, col_list] = df.loc[:, col_list].replace(99,0)
You get
col1 col2 col3
0 0 0 7.0
1 0 5 NaN
2 9 6 99.0
Here is a nice explanation of a similar issue.