Frequency mean calculation for an arbitrary distribution in pandas - Python

I have a large dataset with values ranging from 1 to 25 at a resolution of 0.1. The distribution is arbitrary, with a mode value of 1. A sample dataset looks like:
1,
1,
23.05,
19.57,
1,
1.56,
1,
23.53,
19.74,
7.07,
1,
22.85,
1,
1,
7.78,
16.89,
12.75,
15.32,
7.7,
14.26,
15.41,
1,
16.34,
8.57,
15,
14.97,
1.18,
14.15,
1.94,
14.61,
1,
15.49,
1,
9.18,
1.71,
1,
10.4,
How do I count the values in different ranges (0-0.5, 0.5-1, etc.) and find their frequency mean in pandas?
The expected output, for a small example with values 1, 2.2, 2.8, 3.7, 5.5, 5.8, 4.3, 2.7, 3.5, 1.8, 5.9, can be:

range (f)   occurrence (n)   f*n
1-2         2                3
2-3         3                7.5
3-4         2                7
4-5         1                4.5
5-6         3                16.5
sum         11               38.5

frequency mean = 38.5 / 11 = 3.5

You need cut for binning, then convert the resulting categories to an IntervalIndex to get the mid values, multiply the columns with mul, sum, and finally divide the scalars:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col':[1,2.2,2.8,3.7,5.5,5.8,4.3,2.7,3.5,1.8,5.9]})
print (df)
col
0 1.0
1 2.2
2 2.8
3 3.7
4 5.5
5 5.8
6 4.3
7 2.7
8 3.5
9 1.8
10 5.9
binned = pd.cut(df['col'], np.arange(1, 7), include_lowest=True)
df1 = df.groupby(binned).size().reset_index(name='val')
df1['mid'] = pd.IntervalIndex(df1['col']).mid
df1['mul'] = df1['val'].mul(df1['mid'])
print (df1)
col val mid mul
0 (0.999, 2.0] 2 1.4995 2.999
1 (2.0, 3.0] 3 2.5000 7.500
2 (3.0, 4.0] 2 3.5000 7.000
3 (4.0, 5.0] 1 4.5000 4.500
4 (5.0, 6.0] 3 5.5000 16.500
a = df1.sum()
print (a)
val 11.0000
mid 17.4995
mul 38.4990
dtype: float64
b = a['mul'] / a['val']
print (b)
3.49990909091
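The same recipe should carry over to the full 1-25 dataset with the 0.5-wide ranges from the question; a minimal sketch, assuming the values sit in a column named col as above:

import numpy as np
import pandas as pd

# bin edges 0, 0.5, 1.0, ..., 25.0 to match the 0-0.5, 0.5-1, ... ranges
binned = pd.cut(df['col'], np.arange(0, 25.5, 0.5), include_lowest=True)
df1 = df.groupby(binned).size().reset_index(name='val')   # counts per bin
df1['mid'] = pd.IntervalIndex(df1['col']).mid             # bin midpoints
freq_mean = df1['val'].mul(df1['mid']).sum() / df1['val'].sum()
print (freq_mean)

Empty bins count as zero, so the same multiply/sum/divide works no matter how many of the 0.5-wide ranges are actually populated.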

Is there a way to recalculate existing values in df based on conditions? - Python / Pandas

I have a DataFrame with employees and their hours for different categories.
I need to recalculate only specific categories (the OT, MILE and REST categories should NOT be updated; all others should be updated), and only if an OT category is present for the Empl_Id.
import pandas as pd

data = {'Empl_Id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
'Category': ["MILE", "REST", "OT", "TRVL", "REG", "ADMIN", "REST", "REG", "MILE", "OT", "TRVL", "REST", "MAT", "REG"],
'Value': [43, 0.7, 6.33, 2.67, 52, 22, 1.17, 16.5, 73.6, 4.75, 1.33, 2.5, 5.5, 52.25]}
df = pd.DataFrame(data=data)
df

Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67
1        REG       52
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33
3        REST      2.5
3        MAT       5.5
3        REG       52.25
The Logic is to:
1) Find % of OT Hours from Total Hours (OT, REST and MILE don't count):
1st Empl_Id: 6.33 (OT) / (2.67 (TRVL) + 52 (REG)) = 6.33 / 54.67 = 11.58 %
2nd Empl_Id: OT Hours Not present, nothing should be updated
3rd Empl_Id: 4.75 (OT) / (1.33 (TRVL) + 5.5 (MAT) + 52.25 (REG)) = 4.75 / 59.08 = 8.04 %
2) Subtract % of OT from each category (OT, REST and MILE don't count):
Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67 - 11.58 % (0.31) = 2.36
1        REG       52 - 11.58 % (6.02) = 45.98
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33 - 8.04 % (0.11) = 1.22
3        REST      2.5
3        MAT       5.5 - 8.04 % (0.44) = 5.06
3        REG       52.25 - 8.04 % (4.2) = 48.05
You can use:
keep = ['OT', 'MILE', 'REST']
# get factor
factor = (df.groupby(df['Empl_Id'])
            .apply(lambda g: g.loc[g['Category'].eq('OT'), 'Value'].sum()
                             / g.loc[~g['Category'].isin(keep), 'Value'].sum())
            .rsub(1)
         )
# update
df.loc[~df['Category'].isin(keep), 'Value'] *= df['Empl_Id'].map(factor)
output:
   Empl_Id Category      Value
0        1     MILE  43.000000
1        1     REST   0.700000
2        1       OT   6.330000
3        1     TRVL   2.360852
4        1      REG  45.979148
5        2    ADMIN  22.000000
6        2     REST   1.170000
7        2      REG  16.500000
8        3     MILE  73.600000
9        3       OT   4.750000
10       3     TRVL   1.223069
11       3     REST   2.500000
12       3      MAT   5.057803
13       3      REG  48.049128
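If you prefer to avoid groupby.apply, a roughly equivalent sketch using two transform('sum') calls (starting from the original df and the same keep list; this is an assumed alternative, not the answer's code):

mask = ~df['Category'].isin(keep)
# per-group OT hours and per-group hours of the categories that get updated
ot_sum = df['Value'].where(df['Category'].eq('OT'), 0).groupby(df['Empl_Id']).transform('sum')
base_sum = df['Value'].where(mask, 0).groupby(df['Empl_Id']).transform('sum')
# factor = 1 - OT share; groups without OT get factor 1 and stay unchanged
df.loc[mask, 'Value'] *= (1 - ot_sum / base_sum)[mask]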

Minimize total error squared column of table by changing a variable (Python)

Consider a table that is created using the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Reference Value' : [4.8, 2.4, 3.6, 0.6, 4.8, 5.4], 'True Result' : [8, 4, 6, 1, 8, 9]})
x = 1.5
df["Predicted Result"] = df['Reference Value'] * x
df["Error Squared"] = np.square(df["True Result"] - df["Predicted Result"])
Which, when printed, looks as follows:
Reference Value True Result Predicted Result Error Squared
0 4.8 8 7.2 0.64
1 2.4 4 3.6 0.16
2 3.6 6 5.4 0.36
3 0.6 1 0.9 0.01
4 4.8 8 7.2 0.64
5 5.4 9 8.1 0.81
The total squared error is:
print("Total Error Squared: " + str(np.sum(df["Error Squared"])))
>> Total Error Squared: 2.6199999999999997
I am trying to change x such that the total error squared in the table is minimized. Ideally, after minimization, the table should look something like this:
Reference Value True Result Predicted Result Error Squared
0 4.8 8 8.0 0.0
1 2.4 4 4.0 0.0
2 3.6 6 6.0 0.0
3 0.6 1 1.0 0.0
4 4.8 8 8.0 0.0
5 5.4 9 9.0 0.0
with x being set to 1.6666
How can I achieve this through scipy or similar? Thanks
You can use scipy.optimize.minimize:
from scipy.optimize import minimize
ref_vals = df["Reference Value"].values
true_vals = df["True Result"].values
def obj(x):
    return np.sum((true_vals - ref_vals * x)**2)
res = minimize(obj, x0=[1.0])
where res.x contains the solution 1.66666666.
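For this single-parameter fit with no intercept there is also a closed-form least-squares answer, x = sum(r*y) / sum(r²); a small sketch reusing ref_vals and true_vals from above (standard least squares, not part of the original answer):

# closed-form least-squares slope for y ≈ x * r
x_opt = np.sum(ref_vals * true_vals) / np.sum(ref_vals ** 2)
print(x_opt)  # about 1.6667, i.e. 5/3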

fillna with max value of each group in python

Dataframe
import pandas as pd
import numpy as np

df=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[np.nan, 2.3, np.nan , 2.8, 2.7, 3.7, 2.4, 4.9,np.nan]})
I want to fill the pm_h NaN values with the max id_h value of each "sym" group, i.e. (a, aa, ab).
Required output:
df1=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[3.8, 2.3, 3.1 , 2.8, 2.7, 3.7, 2.4, 4.9, 6})
Use Series.fillna with GroupBy.transform, which builds a Series of per-group maxima with the same index as the original:
df['pm_h'] = df['pm_h'].fillna(df.groupby('sym')['id_h'].transform('max'))
print (df)
sym id_h pm_h
0 a 2.1 3.8
1 a 2.2 2.3
2 aa 2.5 3.1
3 aa 3.1 2.8
4 aa 2.5 2.7
5 a 3.8 3.7
6 ab 2.5 2.4
7 ab 5.0 4.9
8 ab 6.0 6.0

Repeat row of dataframe if condition met, and change value of one value

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
    "Qty": [1, 2, 2, 4, 5, 4, 3],
    "Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-22'],
    "Item": ['22-A', 'R-22-A', '33-CDE', 'R-33-CDE', '55-A', '22-AB', '55-AB'],
    "Price": [1.1, 2.2, 2.2, 4.4, 5.5, 4.4, 3.3]
})
I'm trying to duplicate each row where the Item suffix has 2 or more characters, and then change the value of the Item. For example, the row containing '22-AB' will become two rows. In the first row the Item will be '22-A', and in the 2nd it will be '22-B'.
All this should be done only if the item number (without suffix) is in a 'clean' list.
Here is the pseudocode for what I'm trying to achieve:
Clean list of items = ['11', '22', '33']
For each row, check if substring of df["Item"] is in clean list.
   if no:
     skip row and leave it as it is
   if yes:
     check if len(suffix) >= 2
       if no:
        skip row and leave it as it is
       if yes:
         separate the item (11, 22, or 33) and the suffix
         for char in suffix:
            newitem = concat item + char
            duplicate the row, replacing the old item with newitem
            
            if number started with R-, prepend the R- again
The desired output:
df2 = pd.DataFrame({
    "Qty": [1, 2, 2, 2, 2, 4, 4, 4, 5, 4, 4, 3, 3],
    "Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-18', '2020-12-18', '2020-12-19', '2020-12-19', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-21', '2020-12-22', '2020-12-22'],
    "Item": ['22-A', 'R-22-A', '33-C', '33-D', '33-E', 'R-33-C', 'R-33-D', 'R-33-E', '55-A', '22-A', '22-B', '55-A', '55-B'],
    "Price": [1.1, 2.2, 2.2, 2.2, 2.2, 4.4, 4.4, 4.4, 5.5, 4.4, 4.4, 3.3, 3.3]
})
What I have come up with so far:
import re

mains = ['11', '22', '33']
for i in df["Item"]:
    iptrn = re.compile(r'\d{2}')
    optrn = re.compile('(?<=[0-9]-).*')
    item = iptrn.search(i).group(0)
    option = optrn.search(i).group(0)
    if item in mains:
        for o in option:
            combo = item + "-" + o
            print(combo)
I can't figure out the last step of actually duplicating the row. I've tried this: df = df.loc[df.index.repeat(1)].assign(Item=combo, num=len(option)-1).reset_index(drop=True), but it doesn't replace the Item correctly
You can use pandas operations to do the work here.
It seems like the first step is to separate the two parts of the item code with pandas string methods (here, use extract with expand=True):
>>> item_code = df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True)
>>> item_code
ic1 ic2
0 22 A
1 R-22 A
2 33 CDE
3 R-33 CDE
4 55 A
5 22 AB
6 55 AB
You can add these columns directly to df - I just included that snippet above to show you the output from the extract operation.
>>> df = df.join(df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True))
>>> df
Qty Date Item Price ic1 ic2
0 1 2020-12-16 22-A 1.1 22 A
1 2 2020-12-17 R-22-A 2.2 R-22 A
2 2 2020-12-18 33-CDE 2.2 33 CDE
3 4 2020-12-19 R-33-CDE 4.4 R-33 CDE
4 5 2020-12-20 55-A 5.5 55 A
5 4 2020-12-21 22-AB 4.4 22 AB
6 3 2020-12-22 55-AB 3.3 55 AB
Next, I would build up a python data structure and convert it to a dataframe at the end rather than trying to insert rows or change existing rows.
data = []
for row in df.itertuples(index=False):
    for character in row.ic2:
        data.append({
            'Date': row.Date,
            'Qty': row.Qty,
            'Price': row.Price,
            'Item': f'{row.ic1}-{character}'
        })
newdf = pd.DataFrame(data)
The new dataframe looks like this
>>> newdf
Date Qty Price Item
0 2020-12-16 1 1.1 22-A
1 2020-12-17 2 2.2 R-22-A
2 2020-12-18 2 2.2 33-C
3 2020-12-18 2 2.2 33-D
4 2020-12-18 2 2.2 33-E
5 2020-12-19 4 4.4 R-33-C
6 2020-12-19 4 4.4 R-33-D
7 2020-12-19 4 4.4 R-33-E
8 2020-12-20 5 5.5 55-A
9 2020-12-21 4 4.4 22-A
10 2020-12-21 4 4.4 22-B
11 2020-12-22 3 3.3 55-A
12 2020-12-22 3 3.3 55-B
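If you also want the clean-list and suffix-length checks from the question's pseudocode (rows whose number is not in mains, or whose suffix is shorter than two characters, stay as they are), here is a hedged variation of the same loop, assuming df already carries the ic1/ic2 columns from the join above (note the question's expected output splits 55-AB even though 55 is not in the clean list, so adjust mains as needed):

mains = ['11', '22', '33']
data = []
for row in df.itertuples(index=False):
    num = row.ic1.replace('R-', '')  # numeric part without the optional R- prefix
    if num in mains and len(row.ic2) >= 2:
        for character in row.ic2:
            data.append({'Date': row.Date, 'Qty': row.Qty, 'Price': row.Price,
                         'Item': f'{row.ic1}-{character}'})
    else:
        # keep the row unchanged
        data.append({'Date': row.Date, 'Qty': row.Qty, 'Price': row.Price, 'Item': row.Item})
newdf = pd.DataFrame(data)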

Standard error of values in array corresponding to values in another array

I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the standard error of all the data at a fixed value of the distance?
The standard error is the standard deviation/ the square-root of the number of observations.
e.g. distances (d):
[1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g. data corresponding to the entries of the distances
(therefore value=3.3 at d=1, value=2.1 at d=1, value=3.5 at d=14, etc.):
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
For example, at distance d=6 I should calculate the standard error of 2.5, 7.8, 9.2 and 4.3 which would be the standard deviation of these values divided by the square root of the total number of values (4 in this case).
I've used the following code, which works, but I don't know how to divide the result by the square root of the total number of values at each distance:
import numpy as np
result = []
for d in set(key):
    result.append(np.std([dist[i] for i in range(len(key)) if key[i] == d]))
Any help would be greatly appreciated. Thanks!
Does this help?
for d in set(key):
    result.append(np.std([dist[i] for i in range(len(key)) if key[i] == d]) / np.sqrt(list(key).count(d)))
I'm having a bit of a hard time telling exactly how you want things structured, but I would recommend a dictionary, so that you can know which result is associated with which key value. If your data is like this:
>>> key
array([ 1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3,
6, 5, 8])
>>> values
array([ 3.3 , 2.1 , 3.5 , 2.5 , 4.6 , 7.4 , 2.6 , 7.8 , 9.2 ,
10.11, 14.3 , 2.5 , 6.7 , 3.4 , 7.5 , 8.5 , 9.7 , 4.3 ,
2.8 , 4.1 ])
You can set up a dictionary along these lines with a dict comprehension:
result = {f'distance_{i}':np.std(values[key==i]) / np.sqrt(sum(key==i)) for i in set(key)}
>>> result
{'distance_1': 1.0045988005169029, 'distance_3': 1.818424226264781, 'distance_4': 0.0, 'distance_5': 0.0, 'distance_6': 1.3372079120316331, 'distance_7': 1.2056170619230633, 'distance_8': 0.0, 'distance_9': 0.0, 'distance_12': 0.0, 'distance_14': 0.3181980515339463}
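Since the rest of this page leans on pandas, here is a pandas-flavoured sketch of the same computation: GroupBy.sem computes the standard error directly, and passing ddof=0 makes it match the np.std (population) convention used above (assuming key and values are the arrays shown):

import pandas as pd

# standard error of the values at each distance; ddof=0 matches np.std above
se = pd.Series(values).groupby(key).sem(ddof=0)
print(se)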
