Need to calculate columns from CSV using pandas - python

incidentcountlevel1 and examcount are two columns in my CSV file. I want to calculate two new columns based on them. I have written the script below, but it's failing:
import pandas as pd
import numpy as np
import time, os, fnmatch, shutil
df = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df1 = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df3 = pd.read_csv("/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',converters={"incidentcountlevel1":int})
inc_count_lvl_1 = df3.loc[:, ['incidentcountlevel1']]
exam_count=df3.loc[:, ['examcount']]
for exam_count in exam_count: #need to iterate this col to calculate for each row
    if exam_count < 1:
        print "IPTE Cannot be calculated"
    else:
        if inc_count_lvl_1 > 5:
            ipte1= (inc_count_lvl_1/exam_count)*1000
        else:
            dof = 2*(inc_count_lvl_1+ 1)
            chi_square=chi2.ppf(0.5,dof)
            ipte1=(chi_square/(2*exam_count))×1000

You can apply a lambda function to a pandas column.
Here is an example using numpy; you can adapt it to your case:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 50]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
or you can create your own function:
>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
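In your case, the solution might look like this. Define the function first, then apply it to the two columns:
from scipy.stats import chi2

def fx(exam_count, inc_count_lvl_1):
    if exam_count < 1:
        return -1  # whatever you want
    if inc_count_lvl_1 > 5:
        # 1000.0 forces float division, also on Python 2
        ipte1 = 1000.0 * inc_count_lvl_1 / exam_count
    else:
        dof = 2 * (inc_count_lvl_1 + 1)
        chi_square = chi2.ppf(0.5, dof)
        ipte1 = (chi_square / (2 * exam_count)) * 1000
    return ipte1

# otypes=[float] keeps np.vectorize from inferring an integer result type from the first row
df['new_column'] = np.vectorize(fx, otypes=[float])(df['examcount'], df['incidentcountlevel1'])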
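For larger files, the same branching can also be expressed on whole columns instead of row by row. The following is a minimal sketch, assuming scipy is installed and that rows with examcount below 1 should end up as NaN (a choice made here for illustration; the sample data is made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Made-up sample data using the column names from the question.
df = pd.DataFrame({"examcount": [0, 1000, 2000],
                   "incidentcountlevel1": [3, 10, 2]})

exams = df["examcount"].astype(float)
incidents = df["incidentcountlevel1"].astype(float)

# Mask exam counts below 1 to NaN so both branches yield NaN for those rows.
safe_exams = exams.where(exams >= 1)

# Branch 1: direct rate per 1000 exams.
direct = incidents / safe_exams * 1000

# Branch 2: chi-square based estimate for low incident counts.
dof = 2 * (incidents + 1)
chi_based = chi2.ppf(0.5, dof.to_numpy()) / (2 * safe_exams) * 1000

# Pick one branch per row.
df["ipte1"] = np.where(incidents > 5, direct, chi_based)
print(df)
```

np.where evaluates both branches for every row and then picks one per row, which is why the invalid exam counts are masked to NaN beforehand instead of being divided by.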
If you don't want to use lambda functions, you can use iterrows.
iterrows is a generator that yields both the index and the row.
for index, row in df.iterrows():
    print(row['examcount'], row['incidentcountlevel1'])
    #do your stuff.
I hope it helps.

Related

Pandas dataframe, in a row, to find the max in selected column, and find value of another column based on that

I have a dataframe like this:
>>> import pandas as pd
>>> df = pd.DataFrame({'x1':[20,25],'y1':[5,8],'x2':[22,27],'y2':[10,2]})
>>> df
   x1  y1  x2  y2
0  20   5  22  10
1  25   8  27   2
The x and y columns come in pairs. I need to compare y1 and y2, take the max in every row, and find the corresponding x.
Hence the max of row 0 is y2 (=10), and the corresponding x is x2 (=22). For the second row it is y1 (=8) and x1 (=25).
Expected result, new columns x and y:
   x1  y1  x2  y2   x   y
0  20   5  22  10  22  10
1  25   8  27   2  25   8
This is a simple dataframe I made to elaborate on the question. X and Y pairs, in my case, can be 30 pairs.
# get a hold on "y*" columns
y_cols = df.filter(like="y")
# get the maximal y-values' suffixes, and then add from front "x" to them
max_x_vals = y_cols.idxmax(axis=1).str.extract(r"(\d+)$", expand=False).radd("x")
# get the locations of those x* values
max_x_ids = df.columns.get_indexer(max_x_vals)
# now we have the indexes of x*'s in the columns; NumPy's indexing
# helps to get a cross section
df["max_xs"] = df.to_numpy()[np.arange(len(df)), max_x_ids]
# for y*'s, it's directly the maximum per row
df["max_ys"] = y_cols.max(axis=1)
to get
>>> df
x1 y1 x2 y2 max_xs max_ys
0 20 5 22 10 22 10
1 25 8 27 2 25 8
You can do it with the help of the .apply function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x1':[20,25],'y1':[5,8],'x2':[22,27],'y2':[10,2]})
y_cols = [col for col in df.columns if col[0] == 'y']
x_cols = [col for col in df.columns if col[0] == 'x']
def find_corresponding_x(row):
    max_y_index = np.argmax(row[y_cols])
    return row[x_cols[max_y_index]]
df['corresponding_x'] = df.apply(find_corresponding_x, axis = 1)
You can use the function below. Remember to import pandas and numpy as in this code, load your data set, and call Max_number on it.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x1':[20,25],'y1':[5,8],'x2':[22,27],'y2':[10,2]})
def Max_number(df):
    columns = list(df.columns)
    rows = df.shape[0]
    max_value = []
    column_name = []
    for i in range(rows):
        row_array = list(np.array(df[i:i+1])[0])
        maximum = max(row_array)
        max_value.append(maximum)
        index = row_array.index(maximum)
        column_name.append(columns[index])
    return pd.DataFrame({"column":column_name,"max_value":max_value})
Max_number(df) returns this:
  column  max_value
0     x2         22
1     x2         27
If the x1 column comes first, then y1, then x2, y2 and so on, you can just try the following (with y_cols = df.filter(like="y") as defined earlier):
a = df.columns.get_indexer(y_cols.idxmax(axis=1))
df[['y', 'x']] = df.to_numpy()[np.arange(len(df)), [a, a - 1]].T
This is one solution:
a = df[df['y1'] < df['y2']].drop(columns=['y1','x1']).rename(columns={'y2':'y', 'x2':'x'})
b = df[df['y1'] >= df['y2']].drop(columns=['y2','x2']).rename(columns={'y1':'y', 'x1':'x'})
result = pd.concat([a,b])
Boolean filtering keeps the original index, so if you need to keep the original row order, call result.sort_index() after the concatenation.
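This takes the row-wise maximum of the x columns and of the y columns independently, so it matches the expected output only when the largest x and the largest y fall in the same pair: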
import pandas as pd
df = pd.DataFrame({'x1':[20,25],'y1':[5,8],'x2':[22,27],'y2':[10,2]})
df['x_max'] = df[['x1', 'x2']].max(axis=1)
df['y_max'] = df[['y1', 'y2']].max(axis=1)
df
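For many pairs (the question mentions up to 30), one option is to line the x and y columns up by their numeric suffix and pick from both with the same positional index. A minimal sketch, assuming the columns are named x1..xN and y1..yN exactly as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [20, 25], 'y1': [5, 8], 'x2': [22, 27], 'y2': [10, 2]})

# Collect the numeric suffixes ("1", "2", ...) from the y columns, in numeric order.
suffixes = sorted([col[1:] for col in df.columns if col.startswith("y")], key=int)

# Build two arrays whose columns line up pair by pair.
x_vals = df[[f"x{s}" for s in suffixes]].to_numpy()
y_vals = df[[f"y{s}" for s in suffixes]].to_numpy()

# Position of the largest y in each row, reused to pick the matching x.
best = y_vals.argmax(axis=1)
rows = np.arange(len(df))
df["x"] = x_vals[rows, best]
df["y"] = y_vals[rows, best]
print(df)
```

Because the pairing is done by suffix rather than by column position, this does not depend on the order in which the columns appear in the dataframe.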

Use previous row value for calculating log

I have a DataFrame, as presented in this spreadsheet, with a column A.
https://docs.google.com/spreadsheets/d/1h3ED1FbkxQxyci0ETQio8V4cqaAOC7bIJ5NvVx41jA/edit?usp=sharing
I have been trying to create a new column, A_output, which uses the previous row's value and the current row's value to compute the natural log.
df.apply(custom_function, axis=1) #on a function
But I am not sure how to access the previous row's value.
The only thing I have tried is converting the values into a list, performing my operation, and adding the result back to the dataframe, something like this:
output = []
previous_value = 100
for value in df['A'].values:
    output.append(np.log(value/previous_value))
    previous_value = value
df['A_output'] = output
This is going to be an extremely expensive operation. What's the best way to approach this problem?
Another way with rolling():
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 1))
df = pd.DataFrame(columns=['A'], data=data)
# raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
df['output'] = df['A'].rolling(2).apply(lambda x: np.log(x[1]/x[0]), raw=True)
init_val = 3.
df.loc[0, 'output'] = np.log(df.loc[0, 'A'] / init_val) # <-- manually assign value for the first item
print(df)
# A output
# 0 7.257160 0.883376
# 1 4.579390 -0.460423
# 2 4.630148 0.011023
# 3 5.153198 0.107029
# 4 6.004917 0.152961
# 5 6.633857 0.099608
If you want to apply the same operation on multiple columns:
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 2))
df = pd.DataFrame(columns=['A', 'B'], data=data)
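# raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
df[['output_A', 'output_B']] = df.rolling(2).apply(lambda x: np.log(x[1]/x[0]), raw=True)
init_val = 3.
df.loc[0, 'output_A'] = np.log(df.loc[0, 'A'] / init_val)
df.loc[0, 'output_B'] = np.log(df.loc[0, 'B'] / init_val)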
print(df)
# A B output_A output_B
# 0 7.289657 4.986245 0.887844 0.508071
# 1 5.690721 5.010605 -0.247620 0.004874
# 2 5.773812 5.129814 0.014495 0.023513
# 3 4.417981 6.395500 -0.267650 0.220525
# 4 4.923170 5.363723 0.108270 -0.175936
# 5 5.279008 5.327365 0.069786 -0.006802
We can use Series.shift and then use .loc to compute the first value from the base value.
Let's assume we have the following dataframe:
df = pd.DataFrame({'A':np.random.randint(1, 10, 5)})
print(df)
A
0 8
1 3
2 3
3 1
4 5
df['A_output'] = np.log(df['A'] / df['A'].shift())
df.loc[0, 'A_output'] = np.log(df.loc[0, 'A'] / 100)
print(df)
A A_output
0 8 -2.525729
1 3 -0.980829
2 3 0.000000
3 1 -1.098612
4 5 1.609438
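The base value can also be supplied in the same step through the fill_value argument of Series.shift, which avoids the separate .loc assignment. A short sketch using the sample values printed above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [8, 3, 3, 1, 5]})
base_value = 100  # the starting "previous value" used in the question

# shift() moves every value down one row; fill_value fills the gap in row 0.
previous = df["A"].shift(fill_value=base_value)
df["A_output"] = np.log(df["A"] / previous)
print(df)
```

Since the whole column is computed in one vectorized expression, no Python-level loop over the rows is needed.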

How can I change the column values of a CSV file using pandas

My CSV file has 4 columns. Some rows have 0,0 as their x,y values, and I want to change those 0,0 values to my desired values without changing the other x,y values. Can you please help me change these values?
I tried the code below, but the other x,y values change as well, because it adds 3 to the whole X column. I only want to change the rows where x,y is 0,0 and leave the remaining values untouched. Can you please guide me? Thank you in advance.
import pandas as pd
df = pd.read_csv("Tunnel.csv",delimiter= ',')
df['X'] = df['X'] + 3
df['Y'] = df['Y'] + 4
print(df)
This is my csv_file
You can select the zero entries of each column and update only those:
df.loc[df['X'] == 0, 'X'] += 3
df.loc[df['Y'] == 0, 'Y'] += 4
To write your dataframe to a CSV file named file_name, use to_csv:
file_name = 'file.csv'
df.to_csv(file_name)
You could use the df.loc method as follows:
df.loc[(df['X'] == 0) & (df['Y'] == 0), ['X', 'Y']] = 3, 4
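To make the difference between "either column is zero" and "both columns are zero" concrete, here is a small sketch on made-up data (the depth column and all values are hypothetical; 3 and 4 stand in for the desired replacement values):

```python
import pandas as pd

# Made-up data: row 0 is a (0, 0) pair, rows 1 and 2 have only one zero each.
df = pd.DataFrame({"X": [0, 0, 5], "Y": [0, 7, 0], "depth": [1, 2, 3]})

# Per-column replacement: every 0 in X becomes 3, every 0 in Y becomes 4.
per_column = df.replace({"X": {0: 3}, "Y": {0: 4}})

# Pair-wise replacement: only rows where X and Y are both 0 change.
pairwise = df.copy()
both_zero = (pairwise["X"] == 0) & (pairwise["Y"] == 0)
pairwise.loc[both_zero, ["X", "Y"]] = [3, 4]

print(per_column)
print(pairwise)
```

In both variants the depth column and the original df are left untouched.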
Another way, iterating over the columns with df.items (called df.iteritems in pandas versions before 2.0):
>>> df = pd.DataFrame({'a': [0, 0, 2], 'b': [ 0, 2, 1]})
>>> df
   a  b
0  0  0
1  0  2
2  2  1
>>> for key, val in df.items():
...     df.loc[val == 0, key] = 3
...
>>> df
   a  b
0  3  3
1  3  2
2  2  1
You can use the apply function to set these values. Below is the code:
import pandas as pd
df=pd.read_csv('t.csv',header=None, names=['x','y','depth','color'])
dfc=df.copy()
dfc['x']=dfc['x'].apply(lambda t: 1 if t==0 else t)
dfc['y']=dfc['y'].apply(lambda t: 1 if t==0 else t)
Hope this helps.
Thanks,
Rohan Hodarkar

Unpack a function into a data frame

I have a function which returns two lists, so I can save those in two variables like:
list_a,list_b = my_function(input)
I want to save this directly into a dataframe, something like this:
df[['list_a','list_b']] = my_function(input)
I got the following error:
array is not broadcastable to correct shape
Use
df['B'], df['C'] = my_function()
to unpack the tuple of lists returned by my_function and assign the lists to df['B'] and df['C']:
import pandas as pd
N = 5
def my_function():
    return [10]*N, [20]*N
df = pd.DataFrame({'A':[1]*N})
df['B'], df['C'] = my_function()
yields
A B C
0 1 10 20
1 1 10 20
2 1 10 20
3 1 10 20
4 1 10 20
Note that the lengths of the lists returned by my_function must match the length of df.
Alternatively, build a new dataframe with one column per list:
import pandas as pd
list_a, list_b = my_function(input)
df = pd.DataFrame({'a': list_a, 'b': list_b})
or combined into one line:
df = pd.DataFrame(dict(zip(['a', 'b'], my_function(input))))
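If the column names are only known at run time, the same unpacking can be written with DataFrame.assign. This is a sketch with a stand-in my_function (hypothetical; the real one is whatever returns your two lists):

```python
import pandas as pd

def my_function(n):
    """Stand-in for the real function: returns two lists of length n."""
    return list(range(n)), [v * v for v in range(n)]

df = pd.DataFrame({"A": [1, 1, 1]})
names = ["list_a", "list_b"]

# zip pairs each name with one returned list; assign adds them as new columns.
df = df.assign(**dict(zip(names, my_function(len(df)))))
print(df)
```

As with the direct assignment, each returned list has to be as long as the dataframe.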

Applying function with multiple arguments to create a new pandas column

I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer I've been able to create a new column when I only need one column as an argument:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
def fx(x):
    return x * x
print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)
However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?
def fxy(x, y):
    return x * y
You can go with @greenAfrican's example if it's possible for you to rewrite your function. But if you don't want to rewrite your function, you can wrap it in an anonymous function inside apply, like this:
>>> def fxy(x, y):
...     return x * y
>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
A B newcolumn
0 10 20 200
1 20 30 600
2 30 10 300
Alternatively, you can use the underlying numpy function:
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 10 300
or vectorize an arbitrary function in the general case:
>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 10 300
This solves the problem:
df['newcolumn'] = df.A * df.B
You could also do:
def fab(row):
    return row['A'] * row['B']
df['newcolumn'] = df.apply(fab, axis=1)
If you need to create multiple columns at once:
Create the dataframe:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
Create the function:
def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']
Assign the new columns:
df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
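Another option for multiple outputs is result_type="expand", which makes apply return a dataframe with one column per returned value. A minimal sketch (the column names product and total are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fab(row):
    # Returns two values per row: the product and the sum.
    return row["A"] * row["B"], row["A"] + row["B"]

# expand turns each returned tuple into its own set of columns (0, 1, ...).
expanded = df.apply(fab, axis=1, result_type="expand")
expanded.columns = ["product", "total"]

# join aligns on the index, so each result lands next to its source row.
df = df.join(expanded)
print(df)
```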
One more clean, dict-style syntax:
df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)
or,
df["new_column"] = df["A"] * df["B"]
This gives you the desired result dynamically. It works even if you have more than two arguments:
df['anothercolumn'] = df[['A', 'B']].apply(lambda x: fxy(*x), axis=1)
print(df)
A B newcolumn anothercolumn
0 10 20 100 200
1 20 30 400 600
2 30 10 900 300
