When reading a CSV file, my dataframe has these column names:
df.columns:
Index([nan, 'A', nan, 'B', 'C', nan], dtype='object')
For unknown reasons it does not automatically name them "Unnamed: 0" and so on, as it usually does.
Is it therefore possible to rename the multiple NaN columns to "Unnamed: 0", "Unnamed: 1" and so on, depending on how many NaN columns there are? The number of NaN columns varies.
First convert your columns to a Series, then apply a cumulative count (cumcount) grouped by a boolean condition that is True for the null labels. Then use the resulting values to fill the nulls:
s = pd.Series(df.columns)
print(s)
0    NaN
1      A
2    NaN
3      B
4      C
5    NaN
dtype: object

# number the null labels within their group (1, 2, 3, ...) and fill
s = s.fillna('unnamed:' + (s.groupby(s.isnull()).cumcount() + 1).astype(str))
print(s)
0    unnamed:1
1            A
2    unnamed:2
3            B
4            C
5    unnamed:3
dtype: object

df.columns = s
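The approach above numbers from 1; if you want zero-based labels like pandas' own "Unnamed: 0" convention, a minimal sketch (the hard-coded label list stands in for df.columns):

import numpy as np
import pandas as pd

# Sketch: number only the NaN labels, zero-based, to mimic
# pandas' own "Unnamed: N" convention.
cols = pd.Series([np.nan, 'A', np.nan, 'B', 'C', np.nan])
mask = cols.isna()
cols[mask] = ['Unnamed: ' + str(i) for i in range(mask.sum())]
print(cols.tolist())
# ['Unnamed: 0', 'A', 'Unnamed: 1', 'B', 'C', 'Unnamed: 2']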
How can I set all my values in df1 as missing if their positional equivalent in df2 is a missing value?
Data df1:
Index Data
1 3
2 8
3 9
Data df2:
Index Data
1 nan
2 2
3 nan
desired output:
Index Data
1 nan
2 8
3 nan
So I would like to keep the data of df1, but only for the positions for which df2 also has data entries. For all NaNs in df2 I would like to replace the value of df1 with NaN as well.
I tried the following, but this replaced all data points with nan.
df1 = df1.where(df2 == np.nan, np.nan)
Thank you very much for your help.
Use mask, which does exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
In your case, where replaces every element for which the condition is False. Because equality is not the correct way to check for NaN (np.nan == np.nan yields False), the condition was False everywhere, so every value was replaced with NaN.
Replace df2 == np.nan with df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
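To see why the equality check can never work, a quick demonstration of NaN comparison semantics:

import numpy as np

print(np.nan == np.nan)  # False: NaN never compares equal, not even to itself
print(np.isnan(np.nan))  # True: use isnan (or isna()/notna() in pandas) to detect missing values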
For my program I need a comparison between an already existing DataFrame and a new DataFrame that comes as an input. The comparison should compare each cell of each DataFrame.
The case I need to find is where both the old and the new DataFrame have a value at a specific position but the values differ; in that case the new value goes into a third, reference DataFrame. The existing DataFrame looks e.g. like this:
A B
0 1 nan
1 2 nan
2 nan 3
3 nan 4
Input DataFrame
A B
0 nan 2
1 3 5
2 4 4
3 nan nan
Reference DataFrame
A B
0 3 4
I figured the best way is to compare each column with np.where.
Since both the existing and the input DataFrame can be different, the challenge is that this method can only compare identically-labeled Series.
Therefore I excluded non-shared columns and sorted the rest into the same order, so all column names are the same and in the same order.
Also I used this loop to align the number of records of both DataFrames:
i = 0
dfshape = df.shape[0]
df1shape = df1.shape[0]
if dfshape < df1shape:
    while i < (df1shape - dfshape):
        df = df.append(pd.Series(0, index=df.columns), ignore_index=True)
        i += 1
else:
    while i < (dfshape - df1shape):
        df1 = df1.append(pd.Series(0, index=df1.columns), ignore_index=True)
        i += 1
With both DataFrames brought into the same shape, I tried the following operation:
for column in df1:
    for idx in df1.index:
        if df.loc[idx, column] is not None:
            dfReference = np.where(df1[column] != df[column], df1.loc[idx, column])
But this raises: ValueError: Can only compare identically-labeled Series objects
At this point I have run out of ideas to tackle this problem and also could not identify the cause of the thrown exception.
Maybe there is another way to achieve the desired result?
pandas.DataFrame.compare does not seem to do the trick for me with this problem:
df2 = df.compare(df1)
      A           B
   self other  self other
0     1   NaN   NaN     2
1     2     3   NaN     5
2   NaN     4     3     4
3   NaN   NaN     4   NaN
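One way to get close to the desired result is to align both frames and mask, rather than comparing column by column. The following is only a sketch under the stated assumptions (shared column names, comparable dtypes); note that it keeps each differing value at its original row position rather than compacting everything into row 0, and that align also removes the need for the manual zero-padding loop:

import numpy as np
import pandas as pd

old = pd.DataFrame({'A': [1, 2, np.nan, np.nan], 'B': [np.nan, np.nan, 3, 4]})
new = pd.DataFrame({'A': [np.nan, 3, 4, np.nan], 'B': [2, 5, 4, np.nan]})

# Give both frames identical row and column labels (pads with NaN).
old_al, new_al = old.align(new)

# A cell counts as changed only if both sides hold a value and the values differ.
changed = old_al.notna() & new_al.notna() & (old_al != new_al)

reference = new_al.where(changed)        # keep the new value, NaN elsewhere
reference = reference.dropna(how='all')  # drop rows without any change
print(reference)
#      A    B
# 1  3.0  NaN
# 2  NaN  4.0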
Select rows with columns ['A','B'] where rows with column 'C' contain NaN values in python (Pandas)
I have a pandas data frame with three columns 'A', 'B', 'C'.
In column 'C' some rows contain NaN values.
Now I want to select columns 'A' and 'B' of the data frame where column 'C' contains NaN values.
If all columns or only one column needs to be selected, I can do the following:
df['A'][df['C'].isnull()]
or
df[df['C'].isnull()]
but I do not see how to select multiple columns.
You can pass a list of column names in the first form:
df[['A','B']][df['C'].isnull()]
You can use loc, and select a list of columns:
df.loc[df['C'].isnull(), ['A','B']]
For example
>>> df = pd.DataFrame({'A':[1,2,3,4], 'B':[5,6,7,8], 'C':[np.nan,1,np.nan,2]})
>>> df
A B C
0 1 5 NaN
1 2 6 1.0
2 3 7 NaN
3 4 8 2.0
>>> df.loc[df['C'].isnull(), ['A','B']]
A B
0 1 5
2 3 7
I like dropna and drop, since we will not get a copy warning when we forget to add .copy(). Note that dropna(subset=['C']) keeps the rows where 'C' is not NaN, i.e. the complement of what the question asks for; combine drop with the isnull() mask above if you need the NaN rows themselves.
sub = df.dropna(subset=['C']).drop(columns='C')
sub
Out[26]:
   A  B
1  2  6
3  4  8
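To illustrate the copy-warning point, a small sketch (only relevant if you intend to write back into the selection):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [np.nan, 1, np.nan, 2]})

# Chained indexing may operate on a copy: this can raise
# SettingWithCopyWarning and leave df unchanged.
df[['A', 'B']][df['C'].isnull()]['A'] = 0

# .loc selects and assigns in one step, with no ambiguity.
df.loc[df['C'].isnull(), 'A'] = 0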
I am concatenating multiple months of CSVs where newer versions have additional columns. As a result, putting them all together fills certain rows of certain columns with NaN.
The issue with this behavior is that it mixes these NaNs with true null entries from the data set, which need to be easily distinguishable.
My only solution as of now is to replace the original NaNs with a unique string, concatenate the CSVs, replace the new NaNs with a second unique string, and then replace the first unique string with NaN again.
Given the amount of data I am processing, this is a very inefficient solution. I thought there was some way to determine how pandas fills these entries, but could not find anything on it.
Updated example:
A B
1 NaN
2 3
And append
A B C
1 2 3
Gives
A B C
1 NaN NaN
2 3 NaN
1 2 3
But I want
A B C
1 NaN 'predated'
2 3 'predated'
1 2 3
In case you have a core set of columns, represented here by df1, you could apply .fillna() to the .difference() between the core set and any new columns in more recent DataFrames:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
   A    B
0  1  NaN
1  2    3
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])
   A  B  C
0  1  2  3
df = pd.concat([df1, df2], ignore_index=True)
# columns that are new relative to the core set
new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')
   A    B         C
0  1  NaN  predated
1  2    3  predated
2  1    2         3
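If there are many monthly frames rather than a single core set, a hedged generalization of the same idea (a sketch; the inline frames stand in for the CSVs, read oldest first):

import numpy as np
import pandas as pd

frames = [
    pd.DataFrame({'A': [1, 2], 'B': [np.nan, 3]}),  # oldest month
    pd.DataFrame({'A': [1], 'B': [2], 'C': [3]}),   # newer month with an extra column
]
all_cols = pd.concat(frames).columns  # union of every column seen

marked = []
for frame in frames:
    missing = all_cols.difference(frame.columns)
    frame = frame.reindex(columns=all_cols)  # absent columns appear as NaN
    if len(missing):
        frame[list(missing)] = 'predated'    # tag them; true NaNs stay NaN
    marked.append(frame)

df = pd.concat(marked, ignore_index=True)
print(df)
#    A    B         C
# 0  1  NaN  predated
# 1  2  3.0  predated
# 2  1  2.0         3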
When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})
Out[]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
5 two
6 two
df.drop_duplicates(['col'])
Out[]:
col
0 one
1 two
2 NaN
Try:
df[(~df.duplicated()) | (df['col'].isnull())]
The result is:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Well, one workaround that is not really beautiful is to first save the NaN rows and then put them back in:
temp = df[df.isnull().any(axis=1)]  # rows containing NaN
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Use (note that DataFrame.append was removed in pandas 2.0; also drop the NaN rows before deduplicating, otherwise the NaN kept by drop_duplicates is counted twice):
pd.concat([df.dropna(subset=['col']).drop_duplicates('col'), df[df['col'].isna()]]).sort_index()
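For reference, on the example frame above this returns the five desired rows:
   col
0  one
1  two
2  NaN
3  NaN
4  NaN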