For my program I need to compare an already existing DataFrame with a new DataFrame that comes in as input. The comparison should look at each cell of both DataFrames.
The case I need to find is when both the old and the new DataFrame hold a value at the same position but the values differ; only then should that value go into a third reference DataFrame. For example:
Existing DataFrame
A B
0 1 nan
1 2 nan
2 nan 3
3 nan 4
Input DataFrame
A B
0 nan 2
1 3 5
2 4 4
3 nan nan
Reference DataFrame
A B
0 3 4
I figured the best way is to compare each column with np.where.
Since the existing and the input DataFrame can differ, the challenge is that this method can only compare identically-labeled objects. Therefore I excluded the non-shared columns and sorted the remaining ones, so all column names are the same and in the same order.
I also used this loop to align the number of records of both DataFrames:
i = 0
dfshape = df.shape[0]
df1shape = df1.shape[0]
if dfshape < df1shape:
    while i < (df1shape - dfshape):
        df = df.append(pd.Series(0, index=df.columns), ignore_index=True)
        i += 1
else:
    while i < (dfshape - df1shape):
        df1 = df1.append(pd.Series(0, index=df1.columns), ignore_index=True)
        i += 1
With both DataFrames brought into the same shape, I tried the following operation:
for column in df1:
    for idx in df1.index:
        if df.loc[idx, column] is not None:
            dfReference = np.where(df1[column] != df[column], df1.loc[idx, column])
But this raises ValueError: Can only compare identically-labeled Series objects.
At this point I have run out of ideas to tackle this problem and also could not identify the cause of the thrown exception.
Maybe there is another way to achieve the desired result?
pandas.DataFrame.compare seems not to do the trick for me with this problem:
df2 = df.compare(df1)
A B
self other self other
0 1 NaN NaN 2
1 2 3 NaN 5
2 NaN 4 3 4
3 NaN NaN 4 NaN
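One possible approach, as a minimal sketch built on the example frames above (the names old and new are placeholders, not from the original code): align the two frames on their shared labels, build a boolean mask of the positions where both sides hold a value and the values differ, and keep only those cells.
import numpy as np
import pandas as pd

old = pd.DataFrame({"A": [1, 2, np.nan, np.nan], "B": [np.nan, np.nan, 3, 4]})
new = pd.DataFrame({"A": [np.nan, 3, 4, np.nan], "B": [2, 5, 4, np.nan]})

# Align on the shared columns and index so element-wise comparison is allowed.
old_aligned, new_aligned = old.align(new, join="inner")

# A cell counts only if both frames have a value there and the values differ.
mask = old_aligned.notna() & new_aligned.notna() & (old_aligned != new_aligned)

# Keep the differing values from the new frame and drop all-NaN rows/columns.
reference = new_aligned.where(mask).dropna(how="all").dropna(axis=1, how="all")
print(reference)
#      A    B
# 1  3.0  NaN
# 2  NaN  4.0
If the differing values should be collapsed to the top of each column, as in the reference frame shown above, something like reference.apply(lambda c: c.dropna().reset_index(drop=True)) can be applied afterwards.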
Related
One common thing people seem to want to do in pandas is to replace None values with the next or previous non-None value. This is easily done with .fillna. I, however, want to do something similar but different.
I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries the first n columns hold the entries and the remaining columns are Nones).
What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.
I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
Thanks!
With help from NumPy, this is quite easy. By counting the number of Nones in each row, one can find the column holding the last non-None value of every row. Then, using NumPy, change that value to None:
import numpy as np
import pandas as pd

data = np.random.random((6, 10))
df = pd.DataFrame(data)
df.iloc[0, 7:] = None
df.iloc[1, 6:] = None
df.iloc[2, 5:] = None
df.iloc[3, 8:] = None
df.iloc[4, 5:] = None
df.iloc[5, 4:] = None
Original dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 0.521422 NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 0.010205 NaN
2 0.061320 0.565979 0.344755 NaN NaN NaN
3 0.521936 0.057917 0.359699 0.484009 NaN NaN
isnull = df.isnull()
# Last non-None column per row: total columns minus the number of nulls, minus one.
col = data.shape[1] - isnull.sum(axis=1) - 1
# Set that cell to None in every row at once.
df.values[range(len(df)), col] = None
Updated dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 NaN NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 NaN NaN
2 0.061320 0.565979 NaN NaN NaN NaN
3 0.521936 0.057917 0.359699 NaN NaN NaN
You can find the index of the element to replace in each row with np.argmax():
# The first NaN in a row marks the end of its entries, so the element just before
# it is the last non-NaN value (this assumes every row has at least one NaN).
indices = np.isnan(df.to_numpy()).argmax(axis=1) - 1
df.to_numpy()[range(len(df)), indices] = None
When reading csv file my dataframe has these column names:
df.columns:
Index([nan, 'A', nan, 'B', 'C', nan], dtype='object')
For unknown reasons it does not automatically name them as "Unnamed:0" and so on as it usually does.
Therefore, is it possible to rename the multiple nan columns to Unnamed:0, Unnamed:1 and so on, depending on how many nan columns there are? The number of nan columns varies.
First convert your columns to a Series, then apply a cumulative count (cumcount) to a boolean condition that is True where there is a null. Then use the resulting values to fill the nulls.
s = pd.Series(df.columns)
print(s)
0 NaN
1 A
2 NaN
3 B
4 C
5 NaN
s = s.fillna('unnamed:' + (s.groupby(s.isnull()).cumcount() + 1).astype(str))
print(s)
0 unnamed:1
1 A
2 unnamed:2
3 B
4 C
5 unnamed:3
dtype: object
df.columns = s
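If the 0-based Unnamed:0, Unnamed:1, ... numbering from the question is preferred, the same idea works without the + 1 (a small variation on the answer above, not part of the original):
s = pd.Series(df.columns)
df.columns = s.fillna('Unnamed:' + s.groupby(s.isnull()).cumcount().astype(str))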
I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which has the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1[df_diff.values]
I get all the rows where there was at least one True in df_diff, but I also keep lots of columns that originally contained only False.
As a second step, I would then like to be able to replace all the values (element-wise in the dataframe) that were equal (where df_diff == False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both the rows and the columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
I can't figure out why new null values are popping up after assigning a series, which originally has no nulls, to a dataframe column. Here's an example:
df.date_col.shape returns (100000,)
df.date_col.isnull().sum() returns 0
I then create a new series of the same size with:
new_series = pd.Series([int(d[:4]) for d in df.date_col])
new_series.shape returns (100000,)
new_series.isnull().sum() returns 0
But then if I try to assign this new series to the original column:
df.date_col = new_series
df.date_col.isnull().sum() returns 6328
Would someone please tell me what might be going on here?
IIUC, your index is not contiguous. When you create the pd.Series, it automatically assigns an index from 0 to len(s)-1. DataFrame assignment is based on the index, so an index mismatch creates the NaN:
df=pd.DataFrame({'col':[1,2,3]},index=[1,2,3])
s=pd.Series([d*2 for d in df.col])
df['New']=s
df
Out[170]:
col New
1 1 4.0
2 2 6.0
3 3 NaN
df['New2']=s.values
df
Out[172]:
col New New2
1 1 4.0 2
2 2 6.0 4
3 3 NaN 6
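A third option, as a small sketch not part of the original answer, is to give the Series the frame's index before assigning, so the labels line up:
df['New3'] = s.set_axis(df.index)  # reuse df's index so the values align by position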
I have a series and df
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried creating a new series with the correct indices, but adding it to df still causes NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying, unsuccessfully, to match the index of the data frame df to the index of the subset series s.iloc[1:4]. When it can't find index 0 in the series, it places a NaN value in df at that location. You can get around this either by keeping only the values, so it doesn't try to match on the index, or by resetting the index on the subset series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values since the original, unsubset series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the values call, you bypass the index matching which is producing the NaN. By resetting the index in the second option, you make the indices the same.
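Putting it together as a quick check (assuming a fresh df as in the question):
import pandas as pd

s = pd.Series([1, 2, 3, 5])
df = pd.DataFrame()
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4].values  # drop the index so the values align by position
print(df)
#    0-2  1-3
# 0    1    2
# 1    2    3
# 2    3    5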