Merging two columns in a pandas DataFrame - python

Given the following DataFrame:
      A     B
0 -10.0   NaN
1   NaN  20.0
2 -30.0   NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
      A
0 -10.0
1  20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd

example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns=['A', 'B'])

for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']

example_df = example_df.drop(columns=['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?

Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()

# alternative
# df = df['A'].combine_first(df['B']).to_frame()

print(df)

      A
0 -10.0
1  20.0
2 -30.0
If there are more columns and you need the first non-missing value per row, back-fill the missing values along the columns and then select the first column (using a one-element list so the result stays a one-column DataFrame):
df = df.bfill(axis=1).iloc[:, [0]]
print(df)

      A
0 -10.0
1  20.0
2 -30.0
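For instance, a minimal sketch with a hypothetical third column C (not part of the original question) showing the back-filling variant:

import numpy as np
import pandas as pd

# Hypothetical three-column frame; we want the first non-missing value per row
df = pd.DataFrame({'A': [-10.0, np.nan, np.nan],
                   'B': [np.nan, 20.0, np.nan],
                   'C': [99.0, 7.0, -30.0]})

# bfill(axis=1) pulls each row's next non-missing value leftwards,
# so the first column ends up holding the first non-missing value of the row
df = df.bfill(axis=1).iloc[:, [0]]
print(df)

      A
0 -10.0
1  20.0
2 -30.0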

Related

How can I fill an empty dataframe with zero in pandas? Fillna not working

A dataframe is being created by a process and sometimes the process returns an empty dataframe with no values at all. In that case I want the dataframe to be filled with zeroes for all the columns. I've tried output_df.fillna(value=0, inplace=True) but it doesn't work. The dataframe just remains empty.
To replace all values with 0, you can use:
df.loc[:] = 0
If you really have no rows and want to add one:
df.loc[0] = 0
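A minimal sketch of both cases (assuming a small frame with columns a and b):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.loc[:] = 0   # overwrites every existing value with 0
print(df)
#    a  b
# 0  0  0
# 1  0  0

empty = pd.DataFrame(columns=['a', 'b'])
empty.loc[0] = 0   # no rows yet, so this appends a single all-zero row
print(empty)
#    a  b
# 0  0  0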
If you have an empty DataFrame with only column names, one possible way to add rows of zeros is DataFrame.reindex with a list of new indices:
df = pd.DataFrame(columns=['a', 'b'])
df = df.reindex([0, 1, 2], fill_value=0)
print(df)

   a  b
0  0  0
1  0  0
2  0  0

df = pd.DataFrame()
df = df.reindex(index=[0, 1, 2], columns=list('ab'), fill_value=0)
print(df)

     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0

How to deduct the values from pandas (same column)?

I am trying to manipulate Excel sheet data to automate a process (I'm not a developer). I need to subtract the value of the last row from the first row, then the value of the last-but-one row from the second row, and so on. My data is similar to the below:
  Code    Col1  Col2
0    A  1.7653  56.2
1    B       1   NaN
2    C     NaN     5
3    D    34.4     0
I have to subtract the last row from the first, then the last-but-one from the second, and so on, until they meet in the middle (assuming an even number of rows). I have already solved the issue of getting rid of columns containing strings, so my DataFrame looks like this:
     Col1  Col2
0  1.7653  56.2
1       1   NaN
2     NaN     5
3    34.4     0
Now I need to subtract the values, so the new DataFrame will look like this (the values below are the results of the subtractions):

       Col1  Col2
0  -32.6347  56.2
1         1    -5
I was able to do this for one value at a time, but not iteratively for any number of rows, producing a DataFrame with half the rows of the original and the same columns.
NaN should be treated as 0, and the actual dataset has hundreds of columns and rows, which can change.
code:
import pandas as pd
import datetime

# Create a dataframe
df = pd.read_excel(r'file.xls', sheet_name='sheet1')

for col, dt in df.dtypes.items():
    if dt == object:
        df = df.drop(col, axis=1)

i = 0
for col in df.dtypes.items():
    while i < len(df)/2:
        df[i] = df[i] - df[len(df) - i]
        i += 1
An approach could be the following:
import pandas as pd
import numpy as np

df = pd.DataFrame([["A", 1.7653, 56.2], ["B", 1, np.nan], ["C", np.nan, 5], ["D", 34.4, 0]],
                  columns=["Code", "Col1", "Col2"])
del df["Code"]
df.fillna(0, inplace=True)

s = df.shape[0] // 2
differences = pd.DataFrame([df.iloc[i] - df.iloc[df.shape[0]-i-1] for i in range(s)])
print(differences)

OUTPUT

       Col1  Col2
0  -32.6347  56.2
1    1.0000  -5.0
FOLLOW UP
Reading the comments, I understand that the subtraction logic you want to apply is the following:
Normal subtraction if both numbers are not nan
If one of the numbers is nan, then swap the nan with 0
If both numbers are nan, nan is returned
I don't know if there is a function which works like that out of the box, hence I have created a custom_sub.
For the avoidance of doubt, this is the file I am using:
grid.txt
,Code,Col1,Col2
0,A,1.7653,56.2
1,B,1,
2,C,,5
3,D,34.4,0
The code:
import pandas as pd
import numpy as np

df = pd.read_csv("grid.txt", sep=",", index_col=[0])
del df["Code"]

def custom_sub(x1, x2):
    if np.isnan(x1) or np.isnan(x2):
        if np.isnan(x1) and np.isnan(x2):
            # both operands missing -> keep NaN
            return np.nan
        else:
            # exactly one operand missing -> treat it as 0
            return -x2 if np.isnan(x1) else x1
    else:
        return x1 - x2

s = df.shape[0] // 2
differences = pd.DataFrame([df.iloc[i].combine(df.iloc[df.shape[0]-i-1], custom_sub) for i in range(s)])
print(differences)
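If the element-wise custom_sub turns out to be slow on hundreds of columns, a vectorized sketch of the same logic (assuming the grid.txt layout above) could be:

import pandas as pd
import numpy as np

df = pd.read_csv("grid.txt", sep=",", index_col=[0])
del df["Code"]

half = df.shape[0] // 2
top = df.iloc[:half].reset_index(drop=True)
bottom = df.iloc[::-1].iloc[:half].reset_index(drop=True)

# Remember where both operands are missing, subtract with NaN treated as 0,
# then restore NaN where both sides were missing
both_nan = top.isna() & bottom.isna()
differences = top.fillna(0) - bottom.fillna(0)
differences[both_nan] = np.nan
print(differences)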

pandas removes rows with nan in df

I have a df that contains nan,
A
nan
nan
nan
nan
2017
2018
I tried to remove all the nan rows in df,
df = df.loc[df['A'].notnull()]
but df still contains those nan values for column 'A' after the above code. The dtype of 'A' is object.
I am wondering how to fix it. The thing is I need to define multiple conditions to filter the df, and df['A'].notnull() is one of them. Don't know why it doesn't work.
Please provide a reproducible example. As such, this works, both when the values are real NaN and when they are the string 'nan' (which, given the object dtype, is likely your case):
import pandas as pd
import numpy as np

# Case 1: real NaN values
df = pd.DataFrame([[np.nan], [np.nan], [2017], [2018]], columns=['A'])
df = df[df['A'].notnull()]

# Case 2: the string 'nan' in an object column - convert to real NaN first
df2 = pd.DataFrame([['nan'], ['nan'], [2017], [2018]], columns=['A'])
df2 = df2.replace('nan', np.nan)
df2 = df2[df2['A'].notnull()]

# output for df (df2 is analogous)
#         A
# 2  2017.0
# 3  2018.0
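Since the column's dtype is object, the values that look missing may well be the string 'nan' rather than real NaN. A common sketch (assuming the column should be numeric) is to coerce it first:

import pandas as pd

# errors='coerce' turns anything non-numeric, including the string 'nan', into real NaN
df['A'] = pd.to_numeric(df['A'], errors='coerce')
df = df[df['A'].notnull()]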

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing nan. I'd like to drop those columns with certain number of nan. For example, in the following code, I'd like to drop any column with 2 or more nan. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
dff.iloc[3, 0] = np.nan
dff.iloc[6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
print(dff)
There is a thresh param for dropna: it is the minimum number of non-NaN values a column must have in order to be kept. To drop any column with 2 or more NaN, require at least len(dff) - 1 non-NaN values:
In [13]:
dff.dropna(thresh=len(dff) - 1, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that has fewer than len(dff) - 1 (number of rows minus one) non-NaN values, i.e. any column containing 2 or more NaN.
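A tiny sketch of the boundary case (a hypothetical column with exactly 2 NaN) makes the off-by-one visible:

import pandas as pd
import numpy as np

demo = pd.DataFrame({'X': [1.0, 2.0, 3.0, 4.0],
                     'Y': [1.0, np.nan, np.nan, 4.0]})  # Y has exactly 2 NaN

# thresh = len - 1 = 3 non-NaN required: Y (only 2 non-NaN) is dropped
print(demo.dropna(thresh=len(demo) - 1, axis=1).columns.tolist())  # ['X']

# thresh = len - 2 = 2 non-NaN required: Y would be kept
print(demo.dropna(thresh=len(demo) - 2, axis=1).columns.tolist())  # ['X', 'Y']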
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().sum()  # count the number of NaN in each column
print(s)

A    1
B    1
C    3
dtype: int64

for col in list(dff):  # iterate over a copy of the column labels so deletion is safe
    if s[col] >= 2:
        del dff[col]
Or
for c in list(dff):
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. An alternative is to select the offending columns with a boolean mask and drop them by label:
dff.drop(dff.loc[:, dff.isnull().sum() >= 2].columns, axis=1)
Say you have to drop columns having more than 70% null values:
data.drop(data.loc[:, (100 * data.isnull().sum() / len(data.index)) > 70].columns, axis=1)
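The same idea reads more directly with isnull().mean(), which gives the fraction of missing values per column:

# equivalent, keeping the threshold as a fraction instead of a percentage
data = data.drop(columns=data.columns[data.isnull().mean() > 0.70])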
Another approach, dropping columns having a certain number of NaN values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having a certain percentage of NaN values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])

Include empty series when creating a pandas dataframe with .concat

UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty, it doesn't get included in the resulting dataframe, and this gives the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd

ser1 = pd.Series()
ser2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([ser1, ser2], axis=1)
This produces the following dataframe:
>>> df1
   0
0  a
1  b
2  c
dtype: object
But I want it to produce something like this:
>>> df2
     0  1
0  NaN  a
1  NaN  b
2  NaN  c
It does this if I put a single NaN value anywhere in ser1, but it seems like this should happen automatically even when some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
   0
0  1
1  2
2  3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
    0  1   2
0 NaN  1 NaN
1 NaN  2 NaN
2 NaN  3 NaN
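On modern pandas (0.18.1 and later, per the update above), the workaround is unnecessary; a sketch of the current behavior (note that an empty Series now wants an explicit dtype to avoid a warning):

import pandas as pd

ser1 = pd.Series(dtype=float)   # explicitly typed empty Series
ser2 = pd.Series([1, 2, 3])
df = pd.concat([ser1, ser2, ser1], axis=1)
print(df)

    0  1   2
0 NaN  1 NaN
1 NaN  2 NaN
2 NaN  3 NaN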
