Standardize units of measurement in a dataframe - Python

Suppose I have the following dataframe:
d = {'quantity': [100, 910, 500, 50, 0.5, 22.5, 1300, 600, 20], 'uom': ['KG', 'GM', 'KG', 'KG', 'GM', 'MT', 'GM', 'GM', 'MT']}
df = pd.DataFrame(data=d)
df
My dataframe is like this:
quantity uom
0 100.0 KG
1 910.0 GM
2 500.0 KG
3 50.0 KG
4 0.5 GM
5 22.5 MT
6 1300.0 GM
7 600.0 GM
8 20.0 MT
Now I want to use a single UOM for all the data. For that I have the following code:
listy = []
listy.append(list(df['quantity']))
listy.append(list(df['uom']))
for index, x in enumerate(listy[0]):
    if listy[1][index] == 'MT':
        listy[0][index] = '{:1.4f}'.format(x * 1000)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'LBS':
        listy[0][index] = '{:1.4f}'.format(x * 0.453592)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'GM':
        listy[0][index] = '{:1.4f}'.format(x * 0.001)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'MG':
        listy[0][index] = '{:1.4f}'.format(x * 0.000001)
        listy[1][index] = 'KG'
    elif listy[1][index] == 'KG':
        listy[0][index] = '{:1.4f}'.format(x * 1)
        listy[1][index] = 'KG'
df['quantity'] = listy[0]
df['uom'] = listy[1]
df
quantity uom
0 100.0000 KG
1 0.9100 KG
2 500.0000 KG
3 50.0000 KG
4 0.0005 KG
5 22500.0000 KG
6 1.3000 KG
7 0.6000 KG
8 20000.0000 KG
But if we have a really large dataframe, I don't think looping through it like this is a good approach.
Is there a better way to do this?
I also tried a list comprehension but couldn't get it to work.

Map the units to conversion factors using a dict and multiply the values, i.e.
vals = {'MT':1000, 'LBS':0.453592, 'GM': 0.001, 'MG':0.000001, 'KG':1}
df['new'] = df['quantity']*df['uom'].map(vals)
quantity uom new
0 100.0 KG 100.0000
1 910.0 GM 0.9100
2 500.0 KG 500.0000
3 50.0 KG 50.0000
4 0.5 GM 0.0005
5 22.5 MT 22500.0000
6 1300.0 GM 1.3000
7 600.0 GM 0.6000
8 20.0 MT 20000.0000
If you want to add 'KG' as a column value, then use df['new_unit'] = 'KG'
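For completeness, a possible end-to-end sketch of this approach (using the same vals mapping as above; note it overwrites the original columns rather than adding new ones):
import pandas as pd

d = {'quantity': [100, 910, 500, 50, 0.5, 22.5, 1300, 600, 20],
     'uom': ['KG', 'GM', 'KG', 'KG', 'GM', 'MT', 'GM', 'GM', 'MT']}
df = pd.DataFrame(data=d)

vals = {'MT': 1000, 'LBS': 0.453592, 'GM': 0.001, 'MG': 0.000001, 'KG': 1}
# convert every quantity to kilograms in one vectorized step
df['quantity'] = df['quantity'] * df['uom'].map(vals)
df['uom'] = 'KG'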

You can use apply on rows by specifying the axis parameter. Like this:
uom_map = {
    'KG': 1,
    'GM': .001,
    'MT': 1000,
    'LBS': 0.453592,
    'MG': .000001,
}

def to_kg(row):
    quantity, uom = row.quantity, row.uom
    multiplier = uom_map[uom]
    return quantity * multiplier
df['quantity_kg'] = df.apply(to_kg, axis=1)
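If the data can contain units that are missing from uom_map, the row-wise version above raises a KeyError. A small hypothetical variant (not part of the original answer) that flags unknown units instead:
import math

def to_kg_safe(row):
    multiplier = uom_map.get(row.uom)  # None for units we don't know how to convert
    if multiplier is None:
        return math.nan  # unknown unit: leave a NaN so the problem rows are easy to find
    return row.quantity * multiplier

df['quantity_kg'] = df.apply(to_kg_safe, axis=1)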

Related

Is there a way to recalculate existing values in df based on conditions? - Python / Pandas

I have a DataFrame with employees and their hours for different categories.
I need to recalculate only specific categories (the OT, MILE and REST categories should NOT be updated, all others should be updated), and only if the OT category is present for that Empl_Id.
data = {'Empl_Id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        'Category': ["MILE", "REST", "OT", "TRVL", "REG", "ADMIN", "REST", "REG", "MILE", "OT", "TRVL", "REST", "MAT", "REG"],
        'Value': [43, 0.7, 6.33, 2.67, 52, 22, 1.17, 16.5, 73.6, 4.75, 1.33, 2.5, 5.5, 52.25]}
df = pd.DataFrame(data=data)
df
Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67
1        REG       52
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33
3        REST      2.5
3        MAT       5.5
3        REG       52.25
The logic is to:
1) Find the % of OT hours out of the total hours (OT, REST and MILE don't count):
1st Empl_Id: 6.33 (OT) / (2.67 (TRVL) + 52 (REG)) = 6.33 / 54.67 = 11.58 %
2nd Empl_Id: OT hours not present, nothing should be updated
3rd Empl_Id: 4.75 (OT) / (1.33 (TRVL) + 5.5 (MAT) + 52.25 (REG)) = 4.75 / 59.08 = 8.04 %
2) Subtract the % of OT from each category (OT, REST and MILE don't count):
Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67 - 11.58 % (0.31) = 2.36
1        REG       52 - 11.58 % (6.02) = 45.98
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33 - 8.04 % (0.11) = 1.22
3        REST      2.5
3        MAT       5.5 - 8.04 % (0.44) = 5.06
3        REG       52.25 - 8.04 % (4.2) = 48.05
You can use:
keep = ['OT', 'MILE', 'REST']

# get factor
factor = (df.groupby(df['Empl_Id'])
            .apply(lambda g: g.loc[g['Category'].eq('OT'), 'Value'].sum()
                             / g.loc[~g['Category'].isin(keep), 'Value'].sum()
                   )
            .rsub(1)
          )

# update
df.loc[~df['Category'].isin(keep), 'Value'] *= df['Empl_Id'].map(factor)
output:
Empl_Id Category Value
0 1 MILE 43.000000
1 1 REST 0.700000
2 1 OT 6.330000
3 1 TRVL 2.360852
4 1 REG 45.979148
5 2 ADMIN 22.000000
6 2 REST 1.170000
7 2 REG 16.500000
8 3 MILE 73.600000
9 3 OT 4.750000
10 3 TRVL 1.223069
11 3 REST 2.500000
12 3 MAT 5.057802
13 3 REG 48.049121
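A possible variant of the same idea that avoids groupby.apply, sketched here under the assumption that df and keep are defined as above (groupby.transform broadcasts the per-group sums back onto the original rows):
keep = ['OT', 'MILE', 'REST']
ot = df['Value'].where(df['Category'].eq('OT'), 0)
base = df['Value'].mask(df['Category'].isin(keep), 0)
# per-row factor: 1 - (group OT hours / group hours outside OT/MILE/REST)
factor = 1 - (ot.groupby(df['Empl_Id']).transform('sum')
              / base.groupby(df['Empl_Id']).transform('sum'))
df.loc[~df['Category'].isin(keep), 'Value'] *= factor
For employees without an OT row the factor comes out as 1, so their values are left unchanged, which matches the requirement.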

pandas scoring system; show ties as "3-4"

I have a basic sports scoring table with each player's total score and ranking at the end. What I want to do is: if two players' scores are equal, instead of one player being ranked 3 and the other 4, I need them both to be ranked 3-4. Does anyone have any good hints on where I could find a solution?
Name 100 m Long jump Shot put High jump 400 m 110 m hurdles Discus throw Pole vault Javelin throw 1500 m Total Score Ranking
1 Edan Daniele 12.61 5.00 9.22 1.50 60.39 16.43 21.60 2.6 35.81 00:05:25.720000 3847.0 1
2 Coos Kwesi 13.75 4.84 10.12 1.50 68.44 19.18 30.85 2.8 33.88 00:06:22.750000 3127.0 2
3 Severi Eileifr 13.43 4.35 8.64 1.50 66.06 19.05 24.89 2.2 33.48 00:06:51.010000 2953.0 3
4 Lehi Poghos 13.04 4.53 7.79 1.55 64.72 18.74 24.20 2.4 28.20 00:06:50.760000 2940.0 4
This is the current outcome, and this is the code:
import numpy as np
from os import sep
import pandas as pd
df = pd.read_csv("Decathlon.csv",sep=";",header=None)
df.reset_index(drop=False)
df.index = np.arange(1, len(df) + 1)
df.columns = ["Name","100 m","Long jump","Shot put","High jump","400 m","110 m hurdles","Discus throw","Pole vault","Javelin throw","1500 m"]
df['100m score'] = round(25.4347*((18-df["100 m"])**1.81))
df["Long jump score"] = round(0.14354*(((df["Long jump"]-220)*-1)**1.4))
df["shot put score"] = round( 51.39*((df["Shot put"]-1.5)**1.05))
df["high jump score"] = round( 0.8465*(((df["High jump"]-75)*-1)**1.42))
df["400m score"] = round( 1.53775*((82-df["400 m"])**1.81))
df['110m hurdles score'] = round( 5.74352*((28.5-df['110 m hurdles'])**1.92))
df['Discus throw score'] = round( 12.91*((df['Discus throw']-4)**1.1))
df['Pole vault score'] = round( 0.2797*(((df['Pole vault']-100)*-1)*1.35))
df['Javelin throw score'] = round( 10.14*(((df['Javelin throw']-7)**1.08)))
df['1500 m'] = pd.to_datetime(df['1500 m'].str.strip(), format='%H:%M:%S.%f')
df['Minute'] = pd.to_datetime(df['1500 m']).dt.minute
df['sekunde'] = pd.to_datetime(df['1500 m']).dt.second
df['milisekunde'] = pd.to_datetime(df['1500 m']).dt.microsecond
df.loc[df['milisekunde']>500000,['sekunde']] = df['sekunde']+1
df['Total seconds'] = (df["Minute"]*60) + df["sekunde"]
df['1500 m score'] = round(0.03768*((480-df["Total seconds"])**1.85))
df["Total Score"] = df['100m score']+df["Long jump score"]+df["shot put score"]+df["high jump score"]+df["400m score"]+df['110m hurdles score']+df['Discus throw score']+df['Pole vault score']+df['Javelin throw score']+df['1500 m score']
df["1500 m"] = pd.DatetimeIndex(df['1500 m']).time
#clean up
del df['100m score']
del df["Long jump score"]
del df["shot put score"]
del df["high jump score"]
del df["400m score"]
del df['110m hurdles score']
del df['Discus throw score']
del df['Pole vault score']
del df['Javelin throw score']
del df['Minute']
del df['sekunde']
del df['milisekunde']
del df["Total seconds"]
del df["1500 m score"]
df = df.sort_values(['Total Score'], ascending = False)
df= df.reset_index(drop = True)
df.index = np.arange(1, len(df) + 1)
df["Ranking"] = df.index
print(df)
df.to_json('Json file')
Assuming "Decathlon.csv" file looks something like this:
Edan Daniele;12.61;5.00;9.22;1.50;60.39;16.43;21.60;2.6;35.81;00:05:25.720000
Coos Kwesi;13.75;4.84;10.12;1.50;68.44;19.18;30.85;2.8;33.88;00:06:22.750000
Severi Eileifr;13.43;4.35;8.64;1.50;66.06;19.05;24.89;2.2;33.48;00:06:51.010000
Severi Eileifr;13.43;4.35;8.64;1.50;66.06;19.05;24.89;2.2;33.48;00:06:51.010000
Lehi Poghos;13.04;4.53;7.79;1.55;64.72;18.74;24.20;2.4;28.20;00:06:50.760000
Here's how you can generate the rankings:
df["Ranking"] = df["Total Score"].apply(lambda score: df.index[df["Total Score"] == score].astype(str)).str.join("-")
output:
Name 100 m ... Total Score Ranking
1 Edan Daniele 12.61 ... 6529.0 1
2 Coos Kwesi 13.75 ... 6088.0 2
3 Severi Eileifr 13.43 ... 5652.0 3-4
4 Severi Eileifr 13.43 ... 5652.0 3-4
5 Lehi Poghos 13.04 ... 5639.0 5
or just use .tolist() to get rankings as a list:
df["Ranking"] = df["Total Score"].apply(lambda score: df.index[df["Total Score"] == score].tolist())
Name 100 m ... Total Score Ranking
1 Edan Daniele 12.61 ... 6529.0 [1]
2 Coos Kwesi 13.75 ... 6088.0 [2]
3 Severi Eileifr 13.43 ... 5652.0 [3, 4]
4 Severi Eileifr 13.43 ... 5652.0 [3, 4]
5 Lehi Poghos 13.04 ... 5639.0 [5]
Might not be the best approach though
Note: I've made rows 3 and 4 identical in the initial csv to match the example you've provided
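An alternative sketch (my own variant, not benchmarked against the above) uses rank with the "min" and "max" methods to build the "3-4" style labels without scanning the whole score column once per row:
import numpy as np

lo = df["Total Score"].rank(method="min", ascending=False).astype(int).astype(str)
hi = df["Total Score"].rank(method="max", ascending=False).astype(int).astype(str)
# tied scores get "lowest-highest" (e.g. "3-4"); unique scores keep a single number
df["Ranking"] = np.where(lo == hi, lo, lo + "-" + hi)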

Pandas - Groupby and aggregate over multiple columns

I am trying to aggregate values in a groupby over multiple columns. I come from the R/dplyr world and what I want is usually achievable in a single line using group_by/summarize. I am trying to find an equivalently elegant way of achieving this using pandas.
Consider the input dataset below. I would like to aggregate by state and calculate column v1 as v1 = sum(n1)/sum(d1) (and v2 similarly).
The R code for this using dplyr is as follows:
input %>% group_by(state) %>%
  summarise(v1=sum(n1)/sum(d1),
            v2=sum(n2)/sum(d2))
Is there an elegant way of doing this in Python? I found a slightly verbose way of getting what I want in a Stack Overflow answer here.
Copying over the modified Python code from the link:
In [14]: s = mn.groupby('state', as_index=False).sum()
In [15]: s['v1'] = s['n1'] / s['d1']
In [16]: s['v2'] = s['n2'] / s['d2']
In [17]: s[['state', 'v1', 'v2']]
INPUT DATASET
state n1 n2 d1 d2
CA 100 1000 1 2
FL 200 2000 2 4
CA 300 3000 3 6
AL 400 4000 4 8
FL 500 5000 5 2
NY 600 6000 6 4
CA 700 7000 7 6
OUTPUT
state v1 v2
AL 100 500.000000
CA 100 785.714286
FL 100 1166.666667
NY 100 1500.000000
One possible solution with DataFrame.assign and DataFrame.reindex:
df = (mn.groupby('state', as_index=False)
        .sum()
        .assign(v1=lambda x: x['n1'] / x['d1'], v2=lambda x: x['n2'] / x['d2'])
        .reindex(['state', 'v1', 'v2'], axis=1))
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
And another with GroupBy.apply and a custom lambda function:
df = (mn.groupby('state')
        .apply(lambda x: x[['n1','n2']].sum() / x[['d1','d2']].sum().values)
        .reset_index()
        .rename(columns={'n1':'v1', 'n2':'v2'})
      )
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
Another solution:
def func(x):
    u = x.sum()
    return pd.Series({'v1': u['n1']/u['d1'],
                      'v2': u['n2']/u['d2']})

df.groupby('state').apply(func)
Output:
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Here is the equivalent of what you did in R:
>>> from datar.all import f, tribble, group_by, summarise, sum
>>>
>>> input = tribble(
... f.state, f.n1, f.n2, f.d1, f.d2,
... "CA", 100, 1000, 1, 2,
... "FL", 200, 2000, 2, 4,
... "CA", 300, 3000, 3, 6,
... "AL", 400, 4000, 4, 8,
... "FL", 500, 5000, 5, 2,
... "NY", 600, 6000, 6, 4,
... "CA", 700, 7000, 7, 6,
... )
>>>
>>> input >> group_by(f.state) >> \
... summarise(v1=sum(f.n1)/sum(f.d1),
... v2=sum(f.n2)/sum(f.d2))
state v1 v2
<object> <float64> <float64>
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
I am the author of the datar package.
Another option is with the pipe function, where the groupby object is reusable:
(df.groupby('state')
   .pipe(lambda df: pd.DataFrame({'v1': df.n1.sum() / df.d1.sum(),
                                  'v2': df.n2.sum() / df.d2.sum()})
        )
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Another option would be to convert the columns into a MultiIndex before grouping:
temp = df.set_index('state')
temp.columns = temp.columns.str.split(r'(\d)', expand=True).droplevel(-1)
(temp.groupby('state')
     .sum()
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Yet another way, still with the MultiIndex option, while avoiding a groupby:
# keep the index, necessary for unstacking later
temp = df.set_index('state', append=True)
# convert the columns to a MultiIndex
temp.columns = temp.columns.map(tuple)
# this works because the index is unique
(temp.unstack('state')
     .sum()
     .unstack([0, 1])
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000

How to groupby, cut, transpose then merge result of one pandas Dataframe using vectorisation

Here is an example of the data we want to process:
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this:
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat, cuts with enumerate, transposes, then merges the results into one pandas DataFrame:
start_time = time.time()
N = 10
col_names = map(lambda x: 'X'+str(x), range(N))
compil = pd.DataFrame(columns=col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My question is:
How can I do something every "x number of lines" and then merge the results?
Does a way exist to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
A final apply then creates the new line. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X'+str(x), range(len_of_chunks)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])  # this creates the new line we will put in 'compil'

# we group by ID (boat)
# we divide into chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info into a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()

print("Total %s seconds" % (time.time() - start_time))
print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
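A further idea, sketched here without benchmarking and assuming the same df_random as above: build the lagged X columns with groupby().shift(), which keeps everything vectorized. X0..X9 hold the ten X values preceding each kept row, mirroring the slicing in the original loop:
N = 10
grp = df_random.groupby('boat_id')['X']

# X_j is the X value (N - j) rows earlier within the same boat,
# so X0..X9 are the previous N values, oldest first
lags = {'X%d' % j: grp.shift(N - j) for j in range(N)}
wide = pd.concat(lags, axis=1).join(df_random[['target_Y', 'boat_id']])

# keep one row out of every 5 within each boat (as in the original loop)
# and drop windows that reach back before the start of a boat's history
pos = df_random.groupby('boat_id').cumcount()
compil_fast = wide[pos % 5 == 0].dropna()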

Efficient pandas rolling aggregation over date range by group - Python 2.7 Windows - Pandas 0.19.2

I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something that gets the job done, but I feel there could be a more direct way of reaching the desired result.
My pandas data frame currently looks like this, with the desired output being put in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np
def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column user wants a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # find all elements of the distance matrix that are less than window (time)
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool))  # lt = lower tri of distance matrix so we get only future dates
    lt[lt >= 0.0] = 1.0  # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]

data = [
    ['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
    ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
    ['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]

list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
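One caveat worth adding (my note, not part of the original answer): the offset-based window requires the date column to be monotonically increasing within each group, so if the data is not already ordered it can be sorted first, e.g.:
# sort within each group before applying the time-aware rolling sum
df = df.sort_values(['name', 'date'])
df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')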
