How to deduct the values from pandas (same column)? - python

I am trying to manipulate Excel sheet data to automate a process (I am not a developer). I need to subtract the value of the last row from the first row, then the value of the second-to-last row from the second row, and so on. My data is similar to the below:
  Code    Col1  Col2
0    A  1.7653  56.2
1    B       1   NaN
2    C     NaN     5
3    D    34.4     0
I have to subtract the last row from the first, then the second-to-last row from the second, and so on until the two meet in the middle (assuming the number of rows is always even). I have already solved the issue of getting rid of columns containing strings, so my DataFrame now looks like this:
     Col1  Col2
0  1.7653  56.2
1       1   NaN
2     NaN     5
3    34.4     0
Now I need to subtract the values so that the new DataFrame (the values below are the results of the subtractions) looks like this:
      Col1  Col2
0 -32.6347  56.2
1        1    -5
I was able to do the subtraction for a single value, but not iteratively for an arbitrary number of rows, and I need the output to be a new DataFrame with half the rows of the original and the same columns.
NaN should be treated as 0, and the actual dataset has hundreds of columns, with a number of rows that can change.
code:
import pandas as pd
import datetime

# Create a dataframe
df = pd.read_excel(r'file.xls', sheet_name='sheet1')
for col, dt in df.dtypes.items():
    if dt == object:
        df = df.drop(col, 1)
i = 0
for col in df.dtypes.items():
    while i < len(df)/2:
        df[i] = df[i] - df[len(df) - i]
        i++

An approach could be the following:
import pandas as pd
import numpy as np
df = pd.DataFrame([["A", 1.7653, 56.2], ["B", 1, np.nan], ["C", np.nan, 5], ["D", 34.4, 0]],
                  columns=["Code", "Col1", "Col2"])
del df["Code"]               # drop the string column
df.fillna(0, inplace=True)   # treat NaN as 0
s = df.shape[0] // 2         # number of row pairs (half the number of rows)
# subtract the mirrored row (last, second-to-last, ...) from each row in the top half
differences = pd.DataFrame([df.iloc[i] - df.iloc[df.shape[0]-i-1] for i in range(s)])
print(differences)
OUTPUT
      Col1  Col2
0 -32.6347  56.2
1   1.0000  -5.0
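If the frame is large, the same mirror subtraction can also be done without a Python-level loop. A minimal sketch of that idea (the top/bottom names are only for illustration; it assumes, as above, that NaN has already been replaced with 0 and that the number of rows is even):
import pandas as pd
import numpy as np

df = pd.DataFrame([["A", 1.7653, 56.2], ["B", 1, np.nan], ["C", np.nan, 5], ["D", 34.4, 0]],
                  columns=["Code", "Col1", "Col2"])
del df["Code"]
df = df.fillna(0)

half = len(df) // 2
top = df.iloc[:half].reset_index(drop=True)                # first half of the rows
bottom = df.iloc[::-1].iloc[:half].reset_index(drop=True)  # last half, in reverse order
differences = top - bottom                                 # row 0 - row n-1, row 1 - row n-2, ...
print(differences)
This should print the same result as above (-32.6347 and 56.2 in the first row, 1.0 and -5.0 in the second).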
FOLLOW UP
Reading the comments, I understand that the subtraction logic you want to apply is the following:
Normal subtraction if neither number is NaN
If exactly one of the numbers is NaN, treat the NaN as 0
If both numbers are NaN, NaN is returned
I don't know of a function that works like that out of the box, hence I have created a custom function, custom_sub.
For the avoidance of doubt, this is the file I am using:
grid.txt
,Code,Col1,Col2
0,A,1.7653,56.2
1,B,1,
2,C,,5
3,D,34.4,0
The code:
import pandas as pd
import numpy as np
df = pd.read_csv("grid.txt", sep=",",index_col=[0])
del df["Code"]
def custom_sub(x1, x2):
    if np.isnan(x1) or np.isnan(x2):
        if np.isnan(x1) and np.isnan(x2):
            return np.nan                        # both missing -> result stays missing
        else:
            return -x2 if np.isnan(x1) else x1   # exactly one missing -> treat it as 0
    else:
        return x1 - x2
s = df.shape[0] // 2
differences = pd.DataFrame([df.iloc[i].combine(df.iloc[df.shape[0]-i-1], custom_sub) for i in range(s)])
print(differences)
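A possible shortcut worth checking against your pandas version: Series.sub accepts a fill_value argument whose documented behaviour matches the rules above (a value missing on only one side is treated as 0 before subtracting, and positions missing on both sides stay NaN). A sketch reusing the same loop, with fill_value instead of custom_sub:
# same row pairing as above, relying on fill_value instead of the custom function
differences = pd.DataFrame(
    [df.iloc[i].sub(df.iloc[df.shape[0] - i - 1], fill_value=0) for i in range(s)]
)
print(differences)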

Related

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string; cells that contain the string more than once should be counted only once.
I can count the number of cells across a row that equal a given value, but when I expand this logic to use str.contains, I run into issues, as shown below:
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
#can correctly count across rows using equality
thisworks =( df =="a#" ).sum(axis=1)
#can count across a column using str.contains
thisworks1=df['col1'].str.contains('#').sum()
#but cannot use str.contains with a dataframe so what is the alternative
thisdoesnt =( df.str.contains('#') ).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to the whole DataFrame you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you don't want to use agg or apply, you may use np.char.find to work directly on the underlying NumPy array of df:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Passing it to a Series indexed like df:
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b", "c#"],
                   'col2': ["a", "b", "c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) + \
               df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
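To make that generalization concrete, here is a small sketch (using the example frame from the question) that sums one boolean str.contains mask per column, so each cell contributes at most 1 per row regardless of how many columns there are:
import pandas as pd

df = pd.DataFrame({'col1': ["a#", "b", "c#"], 'col2': ["a", "b", "c#"]})

# one boolean mask per column; summing them counts matching cells per row
df['totals'] = sum(df[c].str.contains('#', regex=False).astype(int) for c in df.columns)
print(df)
#   col1 col2  totals
# 0   a#    a       1
# 1    b    b       0
# 2   c#   c#       2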

Merging two columns in a pandas DataFrame

Given the following DataFrame:
A B
0 -10.0 NaN
1 NaN 20.0
2 -30.0 NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
A
0 -10.0
1 20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns=['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print (df)
A
0 -10.0
1 20.0
2 -30.0
If there are more columns and you need the first non-missing value per row, back-fill missing values along the rows and then select the first column with a one-element list, which keeps a one-column DataFrame:
df = df.bfill(axis=1).iloc[:, [0]]
print (df)
A
0 -10.0
1 20.0
2 -30.0

Remove rows where values appear in all columns in Pandas

Here is a very simple dataframe:
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 3, 3]})
I'm trying to remove rows where there are duplicate values across the columns (e.g., the third row, where both columns are 3).
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
What about:
In [43]: df = pd.DataFrame({'col1': [1, 2, 3],
    ...:                    'col2': [1, 3, 3]})
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values are the same in all columns; in other words, rows whose minimum equals their maximum. This method works on a DataFrame with any number of columns. Applying it to the example above removes rows 0 and 2.
Any row in which all the values are the same will have a standard deviation of zero. One way to filter such rows out is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
   col1  col2
1   2.0   3.0
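Another option, added here as a sketch rather than taken from the answers above: DataFrame.nunique(axis=1) counts the distinct values per row, so rows whose values are the same in every column have a count of 1 and can be filtered out:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 3, 3]})

# keep only rows that contain more than one distinct value
print(df[df.nunique(axis=1) > 1])
#    col1  col2
# 1     2     3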

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing NaN. I'd like to drop the columns with a certain number of NaN values. For example, in the following code, I'd like to drop any column with 2 or more NaN values. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna; you just need to pass the length of your df minus the number of NaN values you allow as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that does not have at least len(dff) - 2 non-NaN values; thresh is the minimum number of non-NaN values a column must contain in order to be kept.
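To make the thresh semantics concrete, here is a minimal sketch; the small frame below is made up purely for illustration:
import pandas as pd
import numpy as np

small = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [1.0, np.nan, 3.0],
                      'C': [np.nan, np.nan, 3.0]})

# keep only columns with at least 2 non-NaN values: A and B survive, C is dropped
print(small.dropna(axis=1, thresh=2))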
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0)  # count the number of NaN values in each column
print(s)
A 1
B 1
C 3
dtype: int64
for col in dff:
    if s[col] >= 2:
        del dff[col]
Or
for c in dff:
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values:
data.drop(data.loc[:, list(100 * data.isnull().sum() / len(data.index) > 70)].columns, axis=1)
Another approach for dropping columns having a certain number of NA values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having a certain percentage of NA values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])

Include empty series when creating a pandas dataframe with .concat

UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty it doesn't get included in the resulting dataframe, which gives the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0 a
1 b
2 c
dtype: object
But I want it to produce something like this:
>>> df2
0 1
0 NaN a
1 NaN b
2 NaN c
It does this if I put a single NaN value anywhere in sers1, but it seems like this should happen automatically even if some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
0
0 1
1 2
2 3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
0 1 2
0 NaN 1 NaN
1 NaN 2 NaN
2 NaN 3 NaN
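Given the UPDATE at the top (empty series are no longer dropped since at least pandas 0.18.1), a plain concat on a recent pandas should keep the empty series as all-NaN columns without any extra arguments. A sketch, with the dtype passed explicitly only to avoid the empty-Series default-dtype warning:
import pandas as pd

ser1 = pd.Series(dtype="float64")   # empty series
ser2 = pd.Series([1, 2, 3])
df = pd.concat([ser1, ser2, ser1], axis=1)
print(df)
# expected:
#     0  1   2
# 0 NaN  1 NaN
# 1 NaN  2 NaN
# 2 NaN  3 NaN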
