Python function using Pandas DataFrame values as arguments - python

Not being very experienced with Python programming, I have to trigger a Python function in my ETL.
Goal: Run a RATE function on every row of a table containing financial data.
Here is the Function I have to use
import numpy as np
Solution = np.rate(nper, pmt, pv, fv,when=1,guess=0.01,tol=1e-06, maxiter=100)
and as argument, I'd like to use data coming from a table stored in a Pandas DataFrame
Here I am presenting only one row for example purposes, but my table will most likely contain many rows.
Table:
nper   pmt   pv       fv
56     281   -22057   9365
Does anyone know how to structure the function?
Thanks for reading

np.rate uses array-like inputs, so you can simply pass the columns as arguments and will receive an array of rates:
df = pd.DataFrame.from_dict({'nper': {0: 56}, 'pmt': {0: 281}, 'pv': {0: -22057}, 'fv': {0: 9365}})
np.rate(df.nper, df.pmt, df.pv, df.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
This uses the values nper, pmt, pv, fv of each row in the dataframe df in a vectorized manner. If df has n rows, the function will return an array of length n containing the rate for each row.
To store the resulting rates in the dataframe df, you can assign them to a new column:
df['rate'] = np.rate(df.nper, df.pmt, df.pv, df.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
Note that np.rate is deprecated, so you should consider using the corresponding function in the numpy-financial library, https://pypi.org/project/numpy-financial.
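A minimal sketch of the replacement, assuming the numpy-financial package is installed (pip install numpy-financial); its rate function accepts the same arguments, with when=1 also written as when='begin':
import numpy_financial as npf
# same computation as above, using the maintained library
df['rate'] = npf.rate(df.nper, df.pmt, df.pv, df.fv, when='begin', guess=0.01, tol=1e-06, maxiter=100)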

Something you can do is create a separate function that wraps the rate computation.
You can then use df.apply with axis=1, which will pass every row to this function. Inside it you can pull the values out of each row and feed them to np.rate.
In the end you get a new column called Solution containing what the rate function computed:
import pandas as pd
import numpy as np
data = {
    "nper": [1, 2, 3],
    "pmt": [4, 5, 6],
    "pv": [7, 8, 9],
    "fv": [0, 1, 2]
}
df = pd.DataFrame(data)
def RunRateFunction(row):
    x = np.rate(row.nper, row.pmt, row.pv, row.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
    return x
df["Solution"] = df.apply(RunRateFunction, axis = 1)
Output:
nper pmt pv fv Solution
0 1 4 7 0 -1.000000
1 2 5 8 1 NaN
2 3 6 9 2 -1.348887
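The NaN in the second row means the iterative solver did not converge to a rate for those inputs within maxiter iterations (with nper, pmt, pv, and fv all positive, no rate above -100% can balance the cash flows).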

Related

Splitting / Divide dataframe every 10 rows

I'm trying to split my data frame into chunks of 10 rows, find the aggregate statistics (mean, SD, etc.) for each chunk, and then merge the results into one data frame again. Previously I had grouped the data using the .groupby function, but I'm having trouble splitting the data into chunks of 10 rows.
This is what I have done :
def sorting(df):
    grouped = df.groupby(['Activity'])
    l_grouped = list(grouped)
I turned the grouped result into list (l_grouped), but I don't know if I could separate the rows from each tuple/list?
The result was identical to the original data frame, but the rows were separated by 'Activity'. For example, the rows that have 'Standing' as the targeted value ('Activity') would be accessible by calling l_grouped[1][0] (type list/tuple), while l_grouped[1][1] would return only the word 'Standing'.
I could access the grouped result using :
for i in range(len(df_sort)):
    print(df_sort[i][1])
where df_sort refers to the result of calling sorting(df).
Is there any way I could split the tuple/list into chunks of rows and then compute the aggregate statistics from that?
I would suggest using a window function + a small trick for the stride:
step = 10
window = 10
df = pd.DataFrame({'a': range(100)})
df['a'].rolling(window).sum()[::step]
If this is not the exact result you are looking for, check the documentation for more details regarding the bounds of the window.
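Note that rolling windows are trailing: the slice above takes windows ending at rows 0, 10, 20, ..., so the first value is NaN. If you want non-overlapping chunks covering rows 0-9, 10-19, ..., offset the slice by window - 1:
df['a'].rolling(window).sum()[window - 1::step]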
You can do:
import numpy as np
step = 10
df["id"] = np.arange(len(df))//step
gr = df.groupby("id")
...
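A sketch of how the aggregation might then be completed, assuming mean and standard deviation are the statistics of interest:
# one row of statistics per 10-row chunk
result = gr.agg(['mean', 'std'])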

how to calculate dominant frequency use numpy.fft in python

I have this data frame about vehicle vibration and I want to calculate the dominant frequency of the vibration. I know that we can calculate it using numpy.fft, but I have no idea how to apply numpy.fft to my data frame.
Please enlighten me how to do it in python.
Thank you.
A DataFrame column is effectively a NumPy array:
import random
import numpy as np
import pandas as pd
df = pd.DataFrame({"Vibration": [random.uniform(2, 10) for i in range(10)]})
df["fft"] = np.fft.fft(df["Vibration"].values)
print(df.to_string())
Output:
Vibration fft
0 8.212039 63.320213+0.000000j
1 5.590523 2.640720-2.231825j
2 8.945281 -2.977825-5.716229j
3 6.833036 4.657765+5.649944j
4 5.150939 -0.216720-0.445046j
5 3.174186 10.592292+0.000000j
6 9.054791 -0.216720+0.445046j
7 5.830278 4.657765-5.649944j
8 5.593203 -2.977825+5.716229j
9 4.935937 2.640720+2.231825j
Batching to chunks of 15 rows:
df = pd.DataFrame({"Vibration":[random.uniform(2,10) for i in range(800)]})
df.assign(
    fft=df.groupby(df.index // 15)["Vibration"].transform(lambda s: np.fft.fft(list(s)).astype("object")),
    grpfirst=df.groupby(df.index // 15)["Vibration"].transform(lambda s: list(s)[0])
)
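To get the dominant frequency itself, which is what the question actually asks for, here is a minimal sketch operating on the df above; the sampling rate fs is an assumption and must be replaced with the real rate of your sensor:
import numpy as np
fs = 100  # assumed sampling rate in Hz -- replace with your sensor's actual rate
signal = df["Vibration"].values
# real-input FFT and the matching frequency bins
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1/fs)
# skip the zero-frequency (DC) bin, then take the frequency with the peak magnitude
dominant_freq = freqs[1:][np.argmax(np.abs(spectrum[1:]))]
print(dominant_freq)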
Without knowing what the DataFrame looks like and which fields you need for your calculations: you can apply any function to a dataframe using .apply()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

How to set row order of a dataframe that has a long format? [duplicate]

I have following dataframe:
(Index) sample reads yeasts
9 CO ref 10
10 CO raai 20
11 CO tus 30
I want to change the order of the columns based on sample, expected output:
(Index) sample reads yeasts
9 CO ref 10
11 CO tus 30
10 CO raai 20
I'm not interested in the Index of the rows.
I've tried following code based on other stackoverflow/google posts:
df=df.reindex(["CO ref","CO tus","CO raai"])
This correctly changes the index, but all the other columns get the value NaN.
I've also tried:
df.index=["CO ref","CO tus","CO raai"]
This changes the index correctly but the other columns do not switch so it messes up the dataframe.
Also:
df["sample"].index=["CO ref","CO tus","CO raai"]
But this does nothing.
How can I get this to work?
For reindex it is necessary to create an index from the sample column:
df=df.set_index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()
Or use ordered categorical:
cats = ["CO ref","CO tus","CO raai"]
df['sample'] = pd.CategoricalIndex(df['sample'], ordered=True, categories=cats)
df = df.sort_values('sample')
The solution from jezrael is of course correct, and most likely the fastest. But since this is really just a question of restructuring your dataframe I'd like to show you how you can easily do that and at the same time let your procedure select which subset of your sorting column to use.
The following very simple function will let you specify both the subset and order of your dataframe:
# function to subset and order a pandas
# dataframe of a long format
def order_df(df_input, order_by, order):
    df_output = pd.DataFrame()
    for var in order:
        df_append = df_input[df_input[order_by] == var].copy()
        df_output = pd.concat([df_output, df_append])
    return df_output
Here's an example using the iris dataset from plotly express. df['species'].unique() will show you the order of that column:
Output:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
Now, running the following complete snippet with the function above will give you a new specified order. No need for categorical variables or tampering of the index.
Complete code with datasample:
# imports
import pandas as pd
import plotly.express as px
# data
df = px.data.iris()
# function to subset and order a pandas
# dataframe of a long format
def order_df(df_input, order_by, order):
    df_output = pd.DataFrame()
    for var in order:
        df_append = df_input[df_input[order_by] == var].copy()
        df_output = pd.concat([df_output, df_append])
    return df_output
# data subsets
df_new = order_df(df_input = df, order_by='species', order=['virginica', 'setosa', 'versicolor'])
df_new['species'].unique()
Output:
array(['virginica', 'setosa', 'versicolor'], dtype=object)

Multiply two columns in a groupby statement in pandas

I have a simplified data frame called df
import pandas as pd
df = pd.DataFrame({'num': [1, 1, 2, 2],
                   'price': [12, 11, 15, 13],
                   'y': [7, 7, 9, 9]})
I want to group by num, then multiply price and y, and take the sum divided by the sum of y.
I've been trying to get started with this and have been having trouble:
df.groupby('letter').agg(['price']*['quantity'])
Prior to the groupby operation, you can add a temporary column to the dataframe that calcs your intermediate result (price * y) and then use this column in your groupby operation (summing the values, and then using eval to calculate the sum of temp divided by the sum of y). Cast the result back to a dataframe and call the new column whatever you'd like.
>>> (df
     .assign(temp=df.eval('price * y'))
     .groupby('num')
     .sum()
     .eval('temp / y')
     .to_frame('result')
    )
result
num
1 11.5
2 14.0
Basically you want to compute a weighted mean
A way to do this is:
import numpy as np
# define custom function with 'y' column as weights
weights = lambda x: np.average(x, weights=df.loc[x.index, 'y'])
# aggregate using this new function
df.groupby('num').agg({'price': weights})
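For the sample df this yields the same result as the approach above: 11.5 for num 1 and 14.0 for num 2.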

how do you pass multiple variables to pandas dataframe to use them with .map to create a new column

To pass multiple variables to a normal python function you can just write something like:
def a_function(date, string, float):
    # do something....
    # convert string to int,
    # date = date + (float * int) days
    return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
In older pandas this was the only option; modern pandas also exposes the individual date components in vectorized form through the .dt accessor:
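>>> pandas.DataFrame({'Year': df.date.dt.year, 'Month': df.date.dt.month, 'Day': df.date.dt.day})
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
Vectorized operations cover many other cases as well: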
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a * row.b * row.c, axis=1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
