Adding values to new columns in data frame - python

I have a large data frame (data_df) and have created four new columns. Say the original columns are 'A', 'B', 'C', 'D' and the four new ones are 'E', 'F', 'G', 'H'. For every row (there are 3000), I need to fill the new columns: 'E' should be A/(A+B), 'F' should be B/(A+B), 'G' should be C/(C+D), and 'H' should be D/(C+D).

It is pretty simple and intuitive. Below is a minimal example for one column, followed by an answer to your specific question.
import pandas as pd

# Make up data for the example
A = [5, 8, 2, -1]
B = [1, 0, 1, 3]
df = pd.DataFrame(list(zip(A, B)), columns=['A', 'B'])
display(df)  # display() is available in Jupyter/IPython

# Calculate column E
df['E'] = df['A'] / (df['A'] + df['B'])
display(df)

Answer to your specific question:
import pandas
df = pandas.DataFrame({'A':[1,2], 'B':[3,4],'C':[5,6], 'D':[7,8]})
df['E'] = df.apply(lambda row: row.A/(row.A + row.B), axis=1)
df['F'] = df.apply(lambda row: row.B/(row.A + row.B), axis=1)
df['G'] = df.apply(lambda row: row.C/(row.C + row.D), axis=1)
df['H'] = df.apply(lambda row: row.D/(row.C + row.D), axis=1)
print(df)
Output:
   A  B  C  D         E         F         G         H
0  1  3  5  7  0.250000  0.750000  0.416667  0.583333
1  2  4  6  8  0.333333  0.666667  0.428571  0.571429
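Since every formula works on whole columns, the same result can be had without apply(); a sketch of the vectorized equivalent, which is much faster on a 3000-row frame than a per-row lambda:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# Compute the two denominators once, then divide whole columns at once
# instead of calling a Python lambda for every row.
ab = df['A'] + df['B']
cd = df['C'] + df['D']
df['E'] = df['A'] / ab
df['F'] = df['B'] / ab
df['G'] = df['C'] / cd
df['H'] = df['D'] / cd
```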

Related

How to use the pandas 'isin' function to give actual values of the df row instead of a boolean expression?

I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
Example:
(df1, df2, and the expected result of df1 were shown as images in the original post.)
I've tried the following. I know this only checks whether there's a match of column B in both dataframes - it returns a boolean True/False in the 'New' column. Is there a way to extract the value under column 'C' when there's a match and add it to the 'New' column in df1 instead of the booleans?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd
df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C':[1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C':[0, 9, 555, 15, 1]})
ind = df2[df2['B'].isin(df1['B'])].index
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2:
   B    C
0  k    0
1  l    9
2  f  555
3  j   15
4  h    1
Output df1:
   B    C    new
0  a    1    NaN
1  b    5    NaN
2  f  777  555.0
3  d   10    NaN
4  h    3    1.0
Here ind holds the indexes of the rows of df2 where there are matches. Then loc is used, with the row indices on the left of the comma and the column names on the right.
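Note that the index-based assignment above only labels the right rows when matching values happen to sit at the same positions in both frames. A value-based sketch using map (building a B-to-C lookup from df2) avoids that assumption:

```python
import pandas as pd

df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C': [1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C': [0, 9, 555, 15, 1]})

# Build a B -> C lookup from df2, then map df1's B values through it.
# Rows of df1 whose B value has no counterpart in df2 get NaN.
df1['New'] = df1['B'].map(df2.set_index('B')['C'])
```

This matches by value rather than by row position, so it keeps working if the frames are sorted differently (it assumes B values in df2 are unique).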

Removing columns selectively from multilevel index dataframe

Say we have a dataframe like this and want to remove columns when certain conditions are met.
df = pd.DataFrame(
    np.arange(2, 14).reshape(-1, 4),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2', 'data1', 'data2'],
        ['F', 'K', 'R', 'X'],
        ['C', 'D', 'E', 'E']
    ], names=['meter', 'Sleeper', 'sweeper'])
)
df
Then let's say we want to remove columns only when meter == 'data1' and sweeper == 'E'.
So I tried
df = df.drop(('data1', 'E'), axis=1)
KeyError: 'E'
Second try:
df.drop(('data1', 'E'), axis=1, level=2)
KeyError: "labels [('data1', 'E')] not found in level"
See also: Pandas: drop a level from a multi-level column index?
It seems drop doesn't support selection across non-adjacent levels (0 and 2 here). We can instead build a mask for the conditions using get_level_values:
# keep where not ((level 0 is 'data1') and (level 2 is 'E'))
col_mask = ~((df.columns.get_level_values(0) == 'data1')
             & (df.columns.get_level_values(2) == 'E'))
df = df.loc[:, col_mask]
We can also do this by integer location, excluding the locations that fall inside a particular index slice; however, this is less clear and less flexible overall:
idx = pd.IndexSlice['data1', :, 'E']
cols = [i for i in range(len(df.columns))
        if i not in df.columns.get_locs(idx)]
df = df.iloc[:, cols]
Either approach produces df:
meter   data1 data2
Sleeper     F     K     X
sweeper     C     D     E
A           2     3     5
B           6     7     9
C          10    11    13
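The boolean-mask idea can also be turned around: resolve the cross-level condition to column positions with get_locs, then hand the matching full column keys to drop(columns=...). A sketch of that variant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(2, 14).reshape(-1, 4),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2', 'data1', 'data2'],
        ['F', 'K', 'R', 'X'],
        ['C', 'D', 'E', 'E']
    ], names=['meter', 'Sleeper', 'sweeper'])
)

# get_locs resolves an IndexSlice over all three levels to integer
# positions; df.columns[locs] recovers the full (meter, Sleeper, sweeper)
# keys, which drop() accepts directly.
locs = df.columns.get_locs(pd.IndexSlice['data1', :, 'E'])
df = df.drop(columns=df.columns[locs])
```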
You have to do them individually, since they are on different levels. Note that this drops every 'data1' column and every 'E' column, so it removes more columns than the combined condition above:
df.drop('data1', axis=1, level='meter').drop('E', axis=1, level='sweeper')
Out[833]:
meter   data2
Sleeper     K
sweeper     D
A           3
B           7
C          11

Is there any way to add a column with a specific condition?

There is a data frame.
I would like to add a column 'e' after checking the conditions below:
if a row's 'c' value appears in column 'a' and its 'd' value appears in column 'b' at the same row (i.e. the pair (c, d) occurs as an (a, b) pair in some single row), then that row's 'e' is 'OK'
else ''
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
   a  b  c  d   e
0  0  4  1  1  OK
1  2  5  2  4
2  1  1  3  2
3  4  7  6  9
P.S. Before the merge, we need to convert the a and b columns to str (or c and d to numeric) so that c can be compared with a, and d with b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
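Since merge resets the index, relying on the index "surviving" can be fragile when several pairs match. A sketch that makes the pair-matching explicit, under the same assumption that a and b are compared as strings:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 2, 1, 4], 'b': [4, 5, 1, 7],
                   'c': ['1', '2', '3', '6'], 'd': ['1', '4', '2', '9']})

# Collect every (a, b) pair as strings, then mark a row 'OK' when its
# own (c, d) pair occurs among them.
pairs = set(zip(df['a'].astype(str), df['b'].astype(str)))
df['e'] = ['OK' if (c, d) in pairs else '' for c, d in zip(df['c'], df['d'])]
```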

Find a value in a dataframe and add the preceding column value in a new column in pandas

I have the data frame below with 5 columns. I need to check all columns for a specific string ("-") and, where it is found, add the value of the preceding column to a new column (F). For example, "-" is located in column B at rows zero and two; hence 'a' and 'c' (the preceding column's values) are added to column F in those rows, and so on.
Source Data Frame:
Desired Data Frame would be:
I have written the code below, but I get a value-length error when creating the new column (F). I'd appreciate your support.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
                   'B': {0: '-', 1: 'a', 2: '-', 3: 'b', 4: 'd'}})
df['C'] = np.where(df['B'].isin(df['A'].values), df['B'], np.nan)
df['C'] = df['C'].map(dict(zip(df.A.values, df.B.values)))
df['D'] = np.where(df['C'].isin(df['B'].values), df['C'], np.nan)
df['D'] = df['D'].map(dict(zip(df.B.values, df['C'].values)))
df['E'] = np.where(df['D'].isin(df['C'].values), df['D'], np.nan)
df['E'] = df['E'].map(dict(zip(df['C'].values, df['D'].values)))
a = np.array(df.iloc[:, :5])
g = []
for index, x in np.ndenumerate(a):
    temp = []
    if x == "-":
        temp.append(x - 1)
    g.append(temp)
df['F'] = g
print(df)
Replace with missing values, using DataFrame.where, every cell whose neighbour to the right (found by comparing against the DataFrame shifted by -1 along the columns) is not '-'; then back-fill the missing values across the columns and take the first column by position:
df['F'] = df.where(df.shift(-1, axis=1).eq('-')).bfill(axis=1).iloc[:, 0]
print(df)
   A  B    F
0  a  -    a
1  b  a  NaN
2  c  -    c
3  d  b  NaN
4  e  d  NaN
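The where/shift/bfill one-liner above can also be written as an explicit row scan; a sketch assuming "-" appears at most once per row and never in the first column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['-', 'a', '-', 'b', 'd']})

# For each row, find the first '-' and record the value of the column
# immediately to its left; rows without '-' get NaN.
def preceding_value(row):
    hits = np.flatnonzero(row.to_numpy() == '-')
    return row.iloc[hits[0] - 1] if len(hits) else np.nan

df['F'] = df.apply(preceding_value, axis=1)
```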
You can do:
df['F'] = [i[0][-1] if len(i) > 1 else np.nan
           for i in df.fillna('').sum(axis=1).str.split('-')]
output:
df['F']
Out[41]:
0 a
1 a
2 c
3 a
4 a
Name: F, dtype: object
List comprehension explanation:
fill the NaNs in df with '' and concatenate each row into a single string with sum(axis=1)
split each concatenated string on '-'
if the split has more than one element, a '-' was present: take the piece before it (i[0]); otherwise no '-' was present, so fill with np.nan
take the last character of that piece with [-1] - this is the value immediately preceding the '-'

How to create new dataframe with pandas based on columns of other dataframes

I have a dataframe with columns 'A', 'B', 'C' and I want to create a new dataframe that has columns named 'X', 'Y', 'Z' which will respectively be column 'A' divided by column 'B', 'B' by 'C' and 'A' by 'C'. Also, I want to keep the same index. I have tried the following
new_df = old_df['A'] / old_df['B']
However, I don't know how to add the other columns I want.
Just do this:
new_df = pd.DataFrame(columns=['X', 'Y', 'Z'])
new_df['X'] = df['A'] / df['B']
new_df['Y'] = df['B'] / df['C']
new_df['Z'] = df['A'] / df['C']
print(new_df)
          X         Y         Z
0  0.227273  0.647059  0.147059
1  1.600000  0.833333  1.333333
2  0.800000  1.666667  1.333333
You can also use pandas.DataFrame.div, which lets you supply a fill_value. For instance:
new_df = pd.DataFrame(columns=['X','Y','Z'])
new_df['X'] = old_df['A'].div(old_df['B'], fill_value=-9999)
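The three ratio columns can also be built in a single constructor call, which keeps old_df's index automatically (the old_df values below are made up for illustration):

```python
import pandas as pd

old_df = pd.DataFrame({'A': [1.0, 8.0, 4.0],
                       'B': [4.4, 5.0, 5.0],
                       'C': [6.8, 6.0, 3.0]})

# Each dict value is a Series carrying old_df's index, so new_df
# inherits that index without any extra steps.
new_df = pd.DataFrame({'X': old_df['A'] / old_df['B'],
                       'Y': old_df['B'] / old_df['C'],
                       'Z': old_df['A'] / old_df['C']})
```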
