Appending two pandas dataframes has an unexpected behavior when one of the dataframes has a column with all null values (NaN) and the other one has boolean values at the same column.
The corresponding column in the resulting dataframe is typed as float64, and the boolean values are turned into 1.0 and 0.0 according to their original boolean values.
Example:
df1 = pd.DataFrame(data=[[1, 2, True], [10, 20, False]], columns=['a', 'b', 'c'])
df1
a b c
0 1 2 True
1 10 20 False
df2 = pd.DataFrame(data = [[1,2], [10,20]], columns=['a', 'b'])
df2['c'] = np.nan
df2
a b c
0 1 2 NaN
1 10 20 NaN
Appending:
df1.append(df2)
a b c
0 1 2 1.0
1 10 20 0.0
0 1 2 NaN
1 10 20 NaN
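Checking the dtypes confirms the coercion of column 'c':
df1.append(df2).dtypes
a      int64
b      int64
c    float64
dtype: object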
My workaround is to cast the column back to bool, but this turns the NaN values into booleans:
appended_df = df1.append(df2)
appended_df
a b c
0 1 2 1.0
1 10 20 0.0
0 1 2 NaN
1 10 20 NaN
appended_df['c'] = appended_df.c.astype(bool)
appended_df
a b c
0 1 2 True
1 10 20 False
0 1 2 True
1 10 20 True
Unfortunately, the pandas append documentation doesn't mention this behavior. Any idea why pandas does this?
Mixed types of elements in a DataFrame column are not allowed; see this discussion: Mixed types of elements in DataFrame's column
The type of np.nan is float, so all the boolean values are cast to float when appending. To avoid this, you could change the type of the 'c' column to 'object' using .astype():
df1['c'] = df1['c'].astype(dtype='object')
df2['c'] = df2['c'].astype(dtype='object')
Then the append command has the desired result. However, as stated in the discussion mentioned above, having multiple types in the same column is not recommended. If instead of np.nan you use None, which is the NoneType object, you don't need to go through the type definition yourself. For the difference between NaN (Not a Number) and None types, see What is the difference between NaN and None?
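For instance, here is a minimal sketch of the None approach (same df1 and df2 as above, with df1['c'] still holding booleans; assigning None creates an object column):
df2 = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])
df2['c'] = None       # object column instead of float64
df1.append(df2)       # 'c' stays object: True, False, None, None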
You should think of what the 'c' column really represents, and choose the dtype accordingly.
If you are using pandas 1.0.0 or above, you can use convert_dtypes. Refer to the convert_dtypes documentation for a description and usage.
Solution code:
df1 = df1.convert_dtypes()
print(df1.append(df2))
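After convert_dtypes, df1's columns use pandas' nullable extension dtypes (a quick check, assuming pandas >= 1.0):
print(df1.dtypes)
a      Int64
b      Int64
c    boolean
dtype: object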
Related
What is the difference between groupby("x").count() and groupby("x").size() in pandas?
Does size just exclude NaN?
size includes NaN values, count does not:
In [46]:
df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.NaN,4], 'c':np.random.randn(6)})
df
Out[46]:
a b c
0 0 1 1.067627
1 0 2 0.554691
2 1 3 0.458084
3 2 4 0.426635
4 2 NaN -2.238091
5 2 4 1.256943
In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())
a
0 2
1 1
2 2
Name: b, dtype: int64
a
0 2
1 1
2 3
dtype: int64
What is the difference between size and count in pandas?
The other answers have pointed out the difference, however, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is actually a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.
So, to summarize, size returns the size of the Series/DataFrame¹,
df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
df
A
0 x
1 y
2 NaN
3 z
df.A.size
# 4
...while count counts the non-NaN values:
df.A.count()
# 3
Notice that size is an attribute (gives the same result as len(df) or len(df.A)). count is a function.
¹ DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows × columns).
Behaviour with GroupBy - Output Structure
Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().
df = pd.DataFrame({
'A': list('aaabbccc'),
'B': ['x', 'x', np.nan, np.nan,
np.nan, np.nan, 'x', 'x']
})
df
A B
0 a x
1 a x
2 a NaN
3 b NaN
4 b NaN
5 c NaN
6 c x
7 c x
Consider,
df.groupby('A').size()
A
a 3
b 2
c 3
dtype: int64
Versus,
df.groupby('A').count()
B
A
a 2
b 0
c 2
GroupBy.count returns a DataFrame when you call count on all the columns, while GroupBy.size returns a Series.
The reason is that size is the same for every column, so only a single result is returned. Meanwhile, count is called for each column separately, as the result depends on how many NaNs each column has.
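If you only need the count for a single column, selecting it first gives a Series with the same shape as size() (using the same df as above):
df.groupby('A')['B'].count()
A
a    2
b    0
c    2
Name: B, dtype: int64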
Behavior with pivot_table
Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of
df
A B
0 0 1
1 0 1
2 1 2
3 0 2
4 0 0
pd.crosstab(df.A, df.B) # Result we expect, but with `pivot_table`.
B 0 1 2
A
0 1 2 1
1 0 0 1
With pivot_table, you can pass size as the aggfunc:
df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)
B 0 1 2
A
0 1 2 1
1 0 0 1
But count does not work; an empty DataFrame is returned:
df.pivot_table(index='A', columns='B', aggfunc='count')
Empty DataFrame
Columns: []
Index: [0, 1]
I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
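One way around this (a sketch; the _ones helper column is made up purely for counting and is not part of the original data) is to give count an explicit values column:
df.assign(_ones=1).pivot_table(index='A', columns='B', values='_ones',
                               aggfunc='count', fill_value=0)
B  0  1  2
A
0  1  2  1
1  0  0  1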
Just to add a little bit to @EdChum's answer: even if the data has no NA values, the result of count() is more verbose. Using the example from before:
grouped = df.groupby('a')
grouped.count()
Out[197]:
b c
a
0 2 2
1 1 1
2 2 3
grouped.size()
Out[198]:
a
0 2
1 1
2 3
dtype: int64
When we are dealing with plain dataframes, the only difference is the handling of NaN values: count does not include NaN values when counting rows.
But when using these functions with groupby, count() has to be applied to a specific column (ideally one without missing values) to get the exact number of rows per group, whereas size() needs no such column.
In addition to all the above answers, I would like to point out one more difference which I find significant.
You can loosely compare a pandas DataFrame's size and count to a Java Vector's capacity and element count: a Vector reserves a number of slots, and only some of them actually hold elements. Similarly, a DataFrame has a fixed grid of cells, and only some of them hold actual (non-missing) data.
The size attribute gives the total number of cells in the DataFrame (rows × columns), whereas count gives the number of elements that are actually present (non-NaN). For example, as the sketch below shows, even though a DataFrame has only 3 rows, its size can be 6.
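A minimal sketch (a hypothetical 3-row, 2-column frame):
df = pd.DataFrame({'x': [1, 2, np.nan], 'y': [4, 5, 6]})
df.size      # 6 -> 3 rows × 2 columns, the NaN cell included
df.count()   # x    2
             # y    3
             # dtype: int64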
Note that this answer covers the size/count difference with respect to a DataFrame, not a pandas Series; I have not checked what happens with a Series.
I have a large dataframe and I want to search 144 of the columns to check if there are any negative values in them. If there is even one negative value in a column, I want to replace the whole column with np.nan. I then want to use the new version of the dataframe for later analysis.
I've tried a variety of methods but can't seem to find one that works. I think the code below is almost there, but I can't quite get it to do what I'm trying to do.
clean_data_df.loc[clean_data_df.cols < 0, cols] = np.nan #cols is a list of the column names I want to check
null_columns=clean_data_df.columns[clean_data_df.isnull().any(axis=1)]
clean_data_df[null_columns] = np.nan
When I run the above code I get the following error: AttributeError: 'DataFrame' object has no attribute 'cols'
Thanks in advance!
You could use a loop to iterate over the columns:
for i in cols:
    if (df[i] < 0).any():
        df[i] = np.nan
Minimal reproducible example:
df = pd.DataFrame({'a': [1, -2, 3], 'b': [-1, 1, 2], 'c': [1, 2, 3]})
for i in df:
    if (df[i] < 0).any():
        df[i] = np.nan
print(df)
Output:
a b c
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 3
The idea is to test only the columns in cols with DataFrame.lt and DataFrame.any, then reindex the resulting boolean mask over all columns (filling the missing ones with False) using Series.reindex, and finally set the values with DataFrame.loc, where the first : means all rows:
df = pd.DataFrame({'a':list('abc'), 'b':[-2,-1,-3],'c':[1,2,3]})
cols = ['b','c']
df.loc[:, df[cols].lt(0).any().reindex(df.columns, fill_value=False)] = np.nan
print(df)
a b c
0 a NaN 1
1 b NaN 2
2 c NaN 3
Detail:
print(df[cols].lt(0).any())
b True
c False
dtype: bool
print (df[cols].lt(0).any().reindex(df.columns, fill_value=False))
a False
b True
c False
dtype: bool
I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which has the same shape as df1/df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1 = df1[df_diff.any(axis=1)]
I get all the rows where there was at least one True in df_diff, but lots of columns originally had False only.
As a second step, I would like then to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to keep only the rows and columns with at least one difference:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with DataFrame.where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
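Note that the kept values come back as floats (2.0, 6.0) because NaN forces a float dtype. If you need integers with missing values, one option (assuming pandas >= 1.0) is the nullable Int64 dtype:
out.astype('Int64')
      1     2
0     2  <NA>
1  <NA>     6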
I had a problem and I found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way to do it.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double-post", but the first one is still valid, so I think it's better to make a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 where columns 'A' and 'A2' match.
The result would be:
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because the dataframes have different numbers of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then blank out the matching rows with Series.mask, fill the resulting NaNs from df2 with combine_first, and finally cast back to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
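For reference, these are the intermediate values of the chained expression (before df.B is overwritten):
df.B.mask(mask)                       # 0    NaN   <- matched row blanked out
                                      # 1    1.0
                                      # 2    1.0
df.B.mask(mask).combine_first(df2.B)  # 0    2.0   <- NaN filled from df2.B by index
                                      # 1    1.0
                                      # 2    1.0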
With a minor tweak in the way the boolean mask gets created, you can get it to work. Note that the mask only covers the first len(df2) rows, so it has to be padded back to the full length of df before it can be used with df.loc:
import numpy as np

cols = ['A', 'A2']
# Slice df to the length of df2 so the arrays can be compared elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
# Extend the mask with False so it covers every row of df
mask = np.concatenate([rows, np.full(len(df) - len(df2), False)])
df.loc[mask, 'B'] = df2.loc[rows, 'B'].values
df
I am concatenating multiple months of csv's where newer, more recent versions have additional columns. As a result, putting them all together fills certain rows of certain columns with NaN.
The issue with this behavior is that it mixes these NaNs with true null entries from the data set which need to be easily distinguishable.
My only solution as of now is to replace the original NaNs with a unique string, concatenate the csv's, replace the new NaNs with a second unique string, replace the first unique string with NaN.
Given the amount of data I am processing, this is a very inefficient solution. I thought there was some way to tell how pandas fills these entries (i.e. which NaNs were created by the concatenation), but I couldn't find anything on it.
Updated example:
A B
1 NaN
2 3
And append
A B C
1 2 3
Gives
A B C
1 NaN NaN
2 3 NaN
1 2 3
But I want
A B C
1 NaN 'predated'
2 3 'predated'
1 2 3
In case you have a core set of columns, as here represented by df1, you could apply .fillna() to the .difference() between the core set and any new columns in more recent DataFrames.
df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
A B
0 1 NaN
1 2 3
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])
A B C
0 1 2 3
df = pd.concat([df1, df2], ignore_index=True)
new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')
A B C
0 1 NaN predated
1 2 3 predated
2 1 2 3
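If there are more than two monthly files, the same idea can be applied as each one is concatenated. A sketch, assuming frames is a hypothetical list of DataFrames ordered oldest to newest:
import pandas as pd

def concat_marking_new_columns(frames, marker='predated'):
    # Concatenate frames in order, marking the cells of columns that did not
    # exist yet when the earlier rows were written, so they stay
    # distinguishable from genuine NaN entries.
    out = frames[0]
    for nxt in frames[1:]:
        new_cols = nxt.columns.difference(out.columns).tolist()
        n_old = len(out)
        out = pd.concat([out, nxt], ignore_index=True)
        if new_cols:
            # only the rows that predate the new columns get the marker
            out.loc[:n_old - 1, new_cols] = marker
    return out

With the df1 and df2 above, concat_marking_new_columns([df1, df2]) reproduces the frame shown above.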