I am trying to write code that runs the following block for each column in a dataframe, i.e. four times for four different columns:
median_alcohol = df.alcohol.median()
for i, alcohol in enumerate(df.alcohol):
    if alcohol >= median_alcohol:
        df.loc[i, 'alcohol'] = 'high'
    else:
        df.loc[i, 'alcohol'] = 'low'
df.groupby('alcohol').quality.mean()
The columns in the dataframe are:
alcohol
pH
residual_sugar
citric_acid
I am trying to come up with a method to capture the four different results, one per column. Any ideas on how I should go about this?
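One possible sketch, not taken from the answers below and assuming the frame also has the quality column used in the code above, is to collect each column's grouped means in a dict keyed by column name:
results = {}
for col in ['alcohol', 'pH', 'residual_sugar', 'citric_acid']:
    median = df[col].median()
    # label each row without overwriting the numeric column
    labels = df[col].map(lambda v: 'high' if v >= median else 'low')
    results[col] = df.groupby(labels).quality.mean()
results['alcohol'] then holds the same series the original snippet printed, without mutating df.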
I'm not sure exactly what you're trying to do, but from what I understood, you could try something like this:
import pandas as pd
from statistics import mean
df = pd.DataFrame({'alcohol':[45, 88, 56, 15, 71], 'pH':[12, 83, 56, 25,71],'residual_sugar':[14, 25, 55, 8, 21]})
print(df)
# Output
   alcohol  pH  residual_sugar
0       45  12              14
1       88  83              25
2       56  56              55
3       15  25               8
4       71  71              21
def func(column):
    dftemp = df.copy()
    median_column = df[column].median()
    for i, item in enumerate(df[column]):
        dftemp.loc[i, column] = 'high' if item >= median_column else 'low'
    # group by the high/low bins and average each remaining column
    return dftemp.groupby(column).agg(list).applymap(mean)

different_arrays = [func(col) for col in df.columns]
for array in different_arrays:
    print(array)
Output:

           pH  residual_sugar
alcohol
high     70.0       33.666667
low      18.5       11.000000

        alcohol  residual_sugar
pH
high  71.666667       33.666667
low   30.000000       11.000000

                  alcohol    pH
residual_sugar
high            71.666667  70.0
low             30.000000  18.5
def numeric_to_buckets(df, column_name):
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low'

# assumes 'quality' is the last column
for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')
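As an aside, not part of either answer above, the same high/low binning can be done without a Python-level loop. This is a minimal sketch assuming, as in the original code, that the frame has a quality column and that it is the last column:
import numpy as np

for feature in df.columns[:-1]:  # assumes 'quality' is the last column
    binned = np.where(df[feature] >= df[feature].median(), 'high', 'low')
    print(df.groupby(binned).quality.mean(), '\n')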
I have an input dataframe for daily fruit spend which looks like this:
spend_df
Date Apples Pears Grapes
01/01/22 10 47 0
02/01/22 0 22 3
03/01/22 11 0 3
...
For each fruit, I need to apply a function using its respective parameters and input spends. The function uses the previous day's and the current day's spend, as follows:
y = beta * (1 - exp(-(theta*previous + current) / alpha))
parameters_df
Parameter Apples Pears Grapes
alpha 132 323 56
beta 424 31 33
theta 13 244 323
My output data frame should look like this (may contain errors):
profit_df
Date Apples Pears Grapes
01/01/22 30.93 4.19 0
02/01/22 265.63 31.00 1.72
03/01/22 33.90 30.99 32.99
...
This is what I attempted:
# First map parameters_df to spend_df
merged_df = input_df.merge(parameters_df, on=['Apples','Pears','Grapes'])
# Apply function to each row
profit_df = merged_df.apply(lambda x: beta(1 - exp(-(theta*x[-1] + x)/alpha))
It might be easier to read if you extract the necessary variables from parameters_df and spend_df first. Then a simple application of the formula will produce the expected output.
import numpy as np

# extract alpha, beta, theta from parameters_df (rows are in that order)
alpha, beta, theta = parameters_df.iloc[:, 1:].values
# select fruit columns
current = spend_df[['Apples', 'Pears', 'Grapes']]
# previous-day values of the fruit columns (0 for the first day)
previous = current.shift(fill_value=0)
# calculate profit using the formula
y = beta * (1 - np.exp(-(theta * previous + current) / alpha))
profit_df = spend_df[['Date']].join(y)
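Note that the tuple unpacking above assumes the rows of parameters_df come in the order alpha, beta, theta. A slightly more defensive sketch (my addition, not part of the answer) looks the rows up by name instead:
params = parameters_df.set_index('Parameter')
fruits = ['Apples', 'Pears', 'Grapes']
alpha = params.loc['alpha', fruits].values
beta = params.loc['beta', fruits].values
theta = params.loc['theta', fruits].values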
Another approach uses the pandas rolling function (this generalizes to as many fruits as necessary):
import pandas as pd
import numpy as np

sdf = pd.DataFrame({
    "Date": ['01/01/22', '02/01/22', '03/01/22'],
    "Apples": [10, 0, 11],
    "Pears": [47, 22, 0],
    "Grapes": [0, 3, 3],
}).set_index("Date")

pdf = pd.DataFrame({
    "Parameter": ['alpha', 'beta', 'theta'],
    "Apples": [132, 424, 13],
    "Pears": [323, 31, 244],
    "Grapes": [56, 33, 323],
}).set_index("Parameter")

def func(r):
    # look up (alpha, beta, theta) for this fruit column
    t = (pdf.loc['alpha', r.name], pdf.loc['beta', r.name], pdf.loc['theta', r.name])
    # rolling window of 2: x[0] is the previous day, x[1] is the current day
    return r.rolling(2).apply(lambda x: t[1] * (1 - np.exp(-(t[2] * x[0] + x[1]) / t[0])))

# handle the first row separately (its "previous" spend is treated as 0)
r1 = sdf.iloc[0:2, :].shift(fill_value=0).apply(lambda r: func(r), axis=0)
r = sdf.apply(lambda r: func(r), axis=0)
r.iloc[0] = r1.shift(-1).iloc[0]
print(r)
Result
Apples Pears Grapes
Date
01/01/22 30.934651 4.198004 0.000000
02/01/22 265.637775 31.000000 1.721338
03/01/22 33.901168 30.999998 32.999999
Here is my code. I would like to sum the FPKM and count values of the rows whose Ko_class contains each specific target, and print the targets with their summed values in a new DataFrame.
# coding=utf-8
import pandas as pd
import numpy as np

classes = [('Carbon;Pyruvate;vitamins', 16.7, 1),
           ('Pyruvate;Carbohydrate;Pentose and glucuronate', 30, 7),
           ('Lipid;Carbon;Galactose', 40.5, 9),
           ('Galactose;Pyruvate;Fatty acid', 57, 10),
           ('Fatty acid;Lipid', 22, 4)]
labels = ['Ko_class', 'FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)
target = [['Carbon'], ['Pyruvate'], ['Galactose']]
targetsum = pd.DataFrame.from_records(target, columns=['target'])
#######
targets = '|'.join(sum(target, []))
targetsum['total_FPKM'] = (alls['FPKM']
                           .groupby(alls['Ko_class'].str.contains(targets))
                           .sum())
targetsum['count'] = (alls['count']
                      .groupby(alls['Ko_class'].str.contains(targets))
                      .sum())
targetsum
Its results:
target total_FPKM count
0 Carbon NaN NaN
1 Pyruvate NaN NaN
2 Galactose NaN NaN
What I want is :
target total_FPKM count
0 Carbon 57.2 10
1 Pyruvate 103.7 18
2 Galactose 97.5 19
Hope I have described my question clearly:(
Try this:
def aggregation(dataframe, target):
    rows = []
    for val in target:
        # rows whose Ko_class contains the current target
        df_tempo = dataframe.loc[dataframe['Ko_class'].str.contains(val), :]
        rows.append({'target': val,
                     'sum': df_tempo['FPKM'].sum(),
                     'count': df_tempo['count'].sum()})
    # DataFrame.append was removed in pandas 2.0, so build the result from a list of row dicts
    return pd.DataFrame(rows, columns=['target', 'sum', 'count'])

df_result = aggregation(alls, ['Carbon', 'Pyruvate', 'Galactose'])
Result:
target sum count
0 Carbon 57.2 10
1 Pyruvate 103.7 18
2 Galactose 97.5 19
You can use str.findall to find the substances that appear in your 'Ko_class' column, and assign the result back to a new column. Exploding this new list-valued column into separate rows with explode will allow you to group by those values and perform your aggregation:
target_list = ['Carbon', 'Pyruvate', 'Galactose']
target_substances = '|'.join(target_list)
alls.assign(
    Ko_class_contains_target=alls['Ko_class'].str.findall(target_substances)
).explode('Ko_class_contains_target').groupby('Ko_class_contains_target').agg('sum')
prints back:
FPKM count
Ko_class_contains_target
Carbon 57.2 10
Galactose 97.5 19
Pyruvate 103.7 18
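If the summary should keep the rows in the same order as target_list rather than the alphabetical groupby order, a possible follow-up (my addition, not part of the answer above) is to select the numeric columns and reindex:
summary = (
    alls.assign(Ko_class_contains_target=alls['Ko_class'].str.findall(target_substances))
        .explode('Ko_class_contains_target')
        .groupby('Ko_class_contains_target')[['FPKM', 'count']].sum()
        .reindex(target_list)
)
print(summary)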
Background
I want to determine the global cumulative value of a variable for different decades from 1990 to 2014, i.e. the 1990s, 2000s, and 2010s (three decades separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
This question uses R: 1
The following questions look at date formatting issues: 2, 3
The answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []

# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding a large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []

for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]

# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names

# computing global values
global_values = data_binned.sum(axis=0)
This method is non-optimal because of my limited experience with Pandas. Kindly suggest a better method that uses Pandas features. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decade statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all possible decades derived from the column names in df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so my data frame df could be enriched by decades statistics from the second data frame df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
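A more compact alternative sketch (my addition, not part of the answer above, assuming the year columns are integers as in the example) groups the transposed frame by the decade of each year column:
# 1990, 1999 -> 1990; 2000, 2009 -> 2000; 2010, 2015 -> 2010
decade_of = (df.columns // 10) * 10
df_decades = df.T.groupby(decade_of).sum().T
df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]
print(df_decades)
This produces the same 1990-1999 / 2000-2009 / 2010-2019 totals without the explicit loop.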
I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign a score between 50 and 600 to all values, but the lowest value must get the highest score. Do you have an idea?
Not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd

df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})

# map the value range linearly onto [50, 600], with the lowest value scoring highest
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
This is my idea. But I think you have a scale on your scores that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange

# pin the min to 600 and the max to 50, and interpolate the rest
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax, 50,
                                       600 - ((df['value'] - dfmin) * (1 / score_value))))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
Not matching your output, because of the missing scale.
I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something to get the job done, but feel that there could be a more direct way of getting to the desired result.
My pandas data frame currently looks like this, with the desired output shown in the last column, 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np

def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column user wants a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # keep elements of the distance matrix that are within the time window
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(bool))  # lt = lower tri of distance matrix so only the current and earlier dates are kept
    lt[lt >= 0.0] = 1.0  # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]

data = [
    ['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
    ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
    ['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]

list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print(dates_df)
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
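For the "eventually add conditions, i.e. evaluating a 'type' field" part of the question, here is a hedged sketch (the 'type' column is hypothetical and not in the sample data) that zeroes out non-matching rows before taking the windowed sum, so only the chosen type contributes:
def get_rolling_amount_for_type(grp, freq, type_value):
    # hypothetical 'type' column: amounts from other types are zeroed out
    masked_amount = grp['amount'].where(grp['type'] == type_value, 0)
    return grp.assign(amount=masked_amount).rolling(freq, on='date')['amount'].sum()

# usage sketch:
# df['rolling_sales_180_online'] = df.groupby('name', group_keys=False) \
#     .apply(get_rolling_amount_for_type, '180D', 'online')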