Pandas: fill values from another column - python

I did a pandas merge and now have two columns - col_x and col_y. I'd like to fill values in col_x with col_y, but only for rows where col_y is not NaN. I'd like to keep the original values in col_x and only replace them with col_y where col_x is NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'c': [np.nan, {'a':'A'}, np.nan, {'b':'B'}],
    'd': [{'c':'C'}, np.nan, {'d':'D'}, np.nan]
})
Expected output:
i          c          d
0  {'c':'C'}  {'c':'C'}
1  {'a':'A'}        NaN
2  {'d':'D'}  {'d':'D'}
3  {'b':'B'}        NaN

Are you just trying to fillna?
df.c.fillna(df.d, inplace=True)
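On newer pandas (with copy-on-write), fillna with inplace=True on a column selection may not modify the frame at all; a minimal sketch of the same fill written as a plain assignment:
# Same fill without inplace; safe under copy-on-write semantics
df['c'] = df['c'].fillna(df['d'])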

You can use np.where()
So something like
df['c'] = np.where(df['c'].isna(), df['d'], df['c'])
should do the trick! The first parameter is the condition to check, the second is what to return if the condition is true, and the third is what to return if the condition is false.
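For completeness, Series.combine_first does the same NaN-filling in one call (a sketch; it keeps the calling Series' values and falls back to the argument where they are missing):
df['c'] = df['c'].combine_first(df['d'])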

Try:
df["c"] = [y if str(x) == "nan" else x for x,y in zip(df.c,df.d)]
There's probably a cleaner way, but this is one line.
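A sketch of the same one-liner using pandas' own missing-value check instead of comparing string representations:
# pd.isna handles np.nan directly; the dicts count as non-null scalars
df["c"] = [y if pd.isna(x) else x for x, y in zip(df.c, df.d)]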

Related

Pandas: Setting a value in a cell when multiple columns are empty

I've been looking for ways to do this natively for a little while now and can't find a solution.
I have a large dataframe where I would like to set the value in other_col to 'True' for all rows where one of a list of columns is empty.
This works for a single column page_title:
df.loc[df['page_title'].isna(), ['other_col']] = ''
But not when using a list
df.loc[df[['page_title','brand','name']].isna(), ['other_col']] = ''
Any ideas of how I could do this without using Numpy or looping through all rows?
Thanks
Maybe this is what you are looking for:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3', np.nan],
    'B': ['10', np.nan, np.nan, '40'],
    'C': ['test', 'test', 'test', 'test']})
df.loc[df[['A', 'B']].isna().any(axis=1), ['C']] = 'value'
print(df)
Result:
     A    B      C
0    1   10   test
1    2  NaN  value
2    3  NaN  value
3  NaN   40  value
This lets you choose which columns to check for np.nan and sets a True/False indicator accordingly:
import numpy as np
import pandas as pd

data = {
    'Column1': [1, 2, 3, np.nan],
    'Column2': [1, 2, 3, 4],
    'Column3': [1, 2, np.nan, 4]
}
df = pd.DataFrame(data)
df['other_col'] = np.where(df['Column1'].isna() | df['Column2'].isna() | df['Column3'].isna(), True, False)
df
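To avoid spelling out every column, the same indicator can come straight from a column list (a sketch using the column names from this example):
# True where any of the listed columns is NaN in that row
cols = ['Column1', 'Column2', 'Column3']
df['other_col'] = df[cols].isna().any(axis=1)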

How to perform operations over arrays in a pandas dataframe efficiently?

I've got a pandas DataFrame that contains NumPy arrays in some columns:
import numpy as np, pandas as pd
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
I need to store a large frame like this one in a CSV file, but the arrays have to be strings that look like this:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
What I'm currently doing to achieve this result is to iterate over each column and each row of the DataFrame, but my solution doesn't seem efficient.
This is my current solution:
pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]
for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace the brackets with parentheses
        df[column][index] = str(tuple(row[column]))
I tried using apply, although I've heard it's usually not an efficient alternative:
def array_to_str(array):
    return str(tuple(array))

df[array_columns] = df[array_columns].apply(array_to_str)
But my arrays become NaN:
col1 col2 col3
0 NaN NaN 9
1 NaN NaN 10
I tried other similar solutions, but the error:
ValueError: Must have equal len keys and value when setting with an iterable
appeared quite often.
Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.
Try this:
tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
You need to convert the arrays into tuples to get the desired representation. To do so, you can apply the tuple function to the columns with object dtype.
to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)
to_save.to_csv(index=False)
Output:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Note: this would be dangerous if there are other object columns, e.g. string columns, as they would be converted as well.
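A guard against that, reusing the array-column detection from the question itself (a sketch; it assumes each column is homogeneous, so checking the first row suffices):
# Only convert columns that actually hold NumPy arrays
array_cols = [c for c in df.columns if isinstance(df[c].iloc[0], np.ndarray)]
to_save = df.copy()
to_save[array_cols] = to_save[array_cols].apply(lambda col: col.map(tuple)).astype(str)
to_save.to_csv(index=False)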
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
# Note: applymap was deprecated in pandas 2.1 in favour of DataFrame.map
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: '"{}"'.format(x))
       col1      col2  col3
0  "(1, 2)"  "(5, 6)"     9
1  "(3, 4)"  "(7, 8)"    10

Merge columns into one while dropping nan values and duplicates

I am trying to merge multiple columns into a single column while dropping duplicates and dropping null values but keeping the rows.
What I have:
df = pd.DataFrame(np.array([['nan', 'nan', 'nan'], ['nan', 2, 2], ['nan', '1', 'nan']]), columns=['a', 'b', 'c'])
What I need:
df = pd.DataFrame(np.array([[''], [2], [1]]), columns=['a'])
I have tried this but I get 1,nan for the last row:
df['a'] = df[['a','b','c']].agg(', '.join, axis=1)
I have also tried
.stack().unstack()
and
.join
but I cannot get these to drop the duplicates in each row.
This will take the maximum value of each row and replace the resulting NaN with '':
new_df = pd.DataFrame(df.astype(float).max(axis=1).replace(np.nan, ''), columns=[df.columns[0]])
output:
     a
0
1  2.0
2  1.0
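The max trick only works because the values are numeric. A more general sketch that keeps the first unique non-null value per row (assuming, as in the question, the literal string 'nan' stands in for missing values):
# Convert the 'nan' strings to real NaN, then take the first unique
# non-null value in each row, or '' when the row is entirely null.
tmp = df.replace('nan', np.nan)
df_merged = tmp.apply(
    lambda row: row.dropna().drop_duplicates().iloc[0]
    if row.notna().any() else '',
    axis=1).to_frame('a')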

Pandas df.equals() returning False on identical dataframes?

Let df_1 and df_2 be:
In [1]: import pandas as pd
...: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
...: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [2]: df_1
Out[2]:
a b
0 1 4
1 2 5
2 3 6
We add a row r to df_1:
In [3]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
...: df_1 = df_1.append(r, ignore_index=True)
In [4]: df_1
Out[4]:
a b
0 1 4
1 2 5
2 3 6
3 x y
We now remove the added row from df_1 and get the original df_1 back again:
In [5]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [6]: df_1
Out[6]:
a b
0 1 4
1 2 5
2 3 6
In [7]: df_2
Out[7]:
a b
0 1 4
1 2 5
2 3 6
While df_1 and df_2 are identical, equals() returns False.
In [8]: df_1.equals(df_2)
Out[8]: False
Did research on SO but could not find a related question.
Am I doing something wrong? How to get the correct result in this case?
(df_1==df_2).all().all() returns True, but that is not suitable for the case where df_1 and df_2 have different lengths.
This again is a subtle one, well done for spotting it.
import pandas as pd
df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
r = pd.DataFrame({'a': ['x'], 'b': ['y']})
# Note: DataFrame.append was removed in pandas 2.0;
# pd.concat([df_1, r], ignore_index=True) is the modern equivalent.
df_1 = df_1.append(r, ignore_index=True)
df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
df_1.equals(df_2)
from pandas.testing import assert_frame_equal  # formerly pandas.util.testing
assert_frame_equal(df_1, df_2)
Now we can see the issue as the assert fails.
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
Because strings were added to integer columns, the integers were upcast to object. That is why equals fails as well.
Use pandas.testing.assert_frame_equal(df_1, df_2, check_dtype=True), which will also check if the dtypes are the same.
(It will pick up in this case that your dtypes changed from int to 'object' (string) when you appended, then deleted, a string row; pandas did not automatically coerce the dtype back down to a less expansive dtype.)
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
As per df.equals docs:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.
So, df.equals will return True only when the elements have the same values and the dtypes are also the same.
When you add and then delete the row from df_1, the dtype changes from int to object, hence it returns False.
Explanation with your example:
In [1028]: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1029]: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1031]: df_1.dtypes
Out[1031]:
a int64
b int64
dtype: object
In [1032]: df_2.dtypes
Out[1032]:
a int64
b int64
dtype: object
As you can see above, the dtypes of both dfs are the same, hence the condition below returns True:
In [1030]: df_1.equals(df_2)
Out[1030]: True
Now after you add and remove the row:
In [1033]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
In [1034]: df_1 = df_1.append(r, ignore_index=True)
In [1036]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [1038]: df_1.dtypes
Out[1038]:
a object
b object
dtype: object
The dtype has changed to object, hence the condition below returns False:
In [1039]: df_1.equals(df_2)
Out[1039]: False
If you still want it to return True, you need to change the dtypes back to int:
In [1042]: df_1 = df_1.astype(int)
In [1044]: df_1.equals(df_2)
Out[1044]: True
Based on the comments of the others, in this case one can do:
from pandas.testing import assert_frame_equal  # formerly pandas.util.testing

identical_df = True
try:
    assert_frame_equal(df_1, df_2, check_dtype=False)
except AssertionError:
    identical_df = False

How to fill one column's missing values conditioned on another column's value in Pandas?

I have a dataframe that looks like the one below:
import numpy as np
import pandas as pd
d = {'col1': [np.nan, 19, 32, np.nan, 54, 67], 'col2': [0, 1, 0, 1, 1, 1]}
df = pd.DataFrame(d)
I want to fill the missing values in "col1" based on the values of "col2". To be specific: I want to fill the missing values in "col1" with 0 if "col2" is 0, else leave the "col1" as it is. In this case, my output should look like:
d_updated = {'col1': [0, 19, 32, np.nan, 54, 67], 'col2': [0, 1, 0, 1, 1, 1]}
df_updated = pd.DataFrame(d_updated)
To get the above output, I tried to get the indices where "col2" equals 0 and use fillna():
ix = list(df[df["col2"] == 0].index)
df["col2"].loc[ix].fillna(0, inplace = True)
However, my approach doesn't work and I don't know why. Thanks in advance.
Try using loc with boolean indexing:
df.loc[(df['col1'].isna()) & (df['col2'] == 0), 'col1'] = df['col2']
Output:
   col1  col2
0   0.0     0
1  19.0     1
2  32.0     0
3   NaN     1
4  54.0     1
5  67.0     1
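Since the mask already requires col2 == 0, assigning the scalar 0 directly is equivalent (a sketch). For what it's worth, the original attempt failed because it filled col2 rather than col1, and because df["col2"].loc[ix] returns a copy, so the inplace fill never reached df:
df.loc[df['col1'].isna() & (df['col2'] == 0), 'col1'] = 0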
m = (df.col2 == 0) & (df.col1.isna())  # boolean mask: col2 is 0 and col1 is missing
Then either of the following will do (note that fillna(..., inplace=True) returns None, so its result must not be assigned):
df.loc[m, 'col1'] = df.loc[m, 'col1'].fillna(0)
or, since the mask already selects only the missing rows:
df.loc[m, 'col1'] = 0
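Series.mask expresses the same fill in a single line (a sketch; it replaces the values where the mask is True):
df['col1'] = df['col1'].mask(m, 0)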
