Group By with sumproduct

I am working with a DataFrame with the following structure:
import pandas as pd
df = pd.DataFrame({'Date' : ['1', '1', '1', '1'],
                   'Ref' : ['one', 'one', 'two', 'two'],
                   'Price' : [50, 65, 30, 35],
                   'MktPrice' : [63, 63, 32, 32],
                   'Quantity' : [10, 15, 20, 10],
                   'MarketQuantity': [50, 50, 100, 100],
                   'Weightings' : [2, 2, 4, 4],
                   'QxWeightings' : [20, 30, 80, 40],
                   'MktQxWeightings': [100, 100, 400, 400],
                   })
I have managed to get the weighted percentage that my Quantity represents out of MarketQuantity when Price is at or above MktPrice, shown by Date and Ref:
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df.groupby(['Date', 'Ref']).apply(percentage)
Date Ref Output
1 one 0.3
1 two 0.1
However, when I group only by Date I get:
Date Output
1 0.4
which is the sum of the previous outputs, when it should be 0.14, i.e. (30+40)/(100+400).
How can I do that with groupby?

If I understand correctly, maybe something like this:
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df_new = df.groupby(['Date', 'Ref', 'MktQxWeightings']).apply(percentage).reset_index()
print(df_new)
Date Ref MktQxWeightings QxWeightings
0 1 one 100 0.3
1 1 two 400 0.1
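# weight each per-Ref percentage by its market value, then divide by the total market value
df_new.groupby('Date')[['MktQxWeightings', 'QxWeightings']].apply(
    lambda x: (x['QxWeightings'] * x['MktQxWeightings']).sum() / x['MktQxWeightings'].sum())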
Date
1 0.14
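An alternative that skips the intermediate per-Ref percentages is to build the numerator and the denominator separately and divide once per Date. This is only a sketch: it assumes the columns are numeric and that MktQxWeightings is repeated on every row of a (Date, Ref) pair, so it should be counted once per pair. The names num and den are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1', '1', '1', '1'],
    'Ref': ['one', 'one', 'two', 'two'],
    'Price': [50, 65, 30, 35],
    'MktPrice': [63, 63, 32, 32],
    'QxWeightings': [20, 30, 80, 40],
    'MktQxWeightings': [100, 100, 400, 400],
})

# Numerator: weighted quantity of the rows priced at or above market, per Date
num = df.loc[df['Price'] >= df['MktPrice']].groupby('Date')['QxWeightings'].sum()

# Denominator: market value counted once per (Date, Ref) pair, per Date
den = (df.drop_duplicates(['Date', 'Ref'])
         .groupby('Date')['MktQxWeightings'].sum())

result = num / den
print(result)
```

For the sample data this is (30+40)/(100+400). A Date with no row at or above the market price is missing from num and would come out as NaN; num.reindex(den.index, fill_value=0) is one way to turn that into 0.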

Related

How to transform index values into columns using Pandas?

I have a dictionary like this:
my_dict = {'RuleSet': {'0': {'RuleSetID': '0',
                             'RuleSetName': 'Allgemein',
                             'Rules': [{'RulesID': '10',
                                        'RuleName': 'Gemeinde Seiten',
                                        'GroupHits': '2',
                                        'KeyWordGroups': ['100', '101', '102']}]},
                       '1': {'RuleSetID': '1',
                             'RuleSetName': 'Portale Berlin',
                             'Rules': [{'RulesID': '11',
                                        'RuleName': 'Portale Berlin',
                                        'GroupHits': '4',
                                        'KeyWordGroups': ['100', '101', '102', '107']}]},
                       '6': {'RuleSetID': '6',
                             'RuleSetName': 'Zwangsvollstr. Berlin',
                             'Rules': [{'RulesID': '23',
                                        'RuleName': 'Zwangsvollstr. Berlin',
                                        'GroupHits': '1',
                                        'KeyWordGroups': ['100', '101']}]}}}
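This code snippet transforms it into a DataFrame:
rules_pd = pd.DataFrame(my_dict['RuleSet'])
rules_pd
The result (image not available) has the keys '0', '1' and '6' as columns and RuleSetID, RuleSetName and Rules as the index.
I would like to make it look different (image of the desired layout not available).
Does anyone know how to tackle this challenge?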

Use from_dict with orient='index':
out = pd.DataFrame.from_dict(my_dict['RuleSet'], orient='index')
out
  RuleSetID  ...                                              Rules
0         0  ...  [{'RulesID': '10', 'RuleName': 'Gemeinde Seite...
1         1  ...  [{'RulesID': '11', 'RuleName': 'Portale Berlin...
6         6  ...  [{'RulesID': '23', 'RuleName': 'Zwangsvollstr....
[3 rows x 3 columns]
# out.columns
# Index(['RuleSetID', 'RuleSetName', 'Rules'], dtype='object')

You could try using transpose():
rules_pd = pd.DataFrame(my_dict['RuleSet']).transpose()
print(rules_pd)
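Both suggestions should end up with the same layout: one row per rule set and RuleSetID, RuleSetName and Rules as columns. A small sketch to check that side by side, using a trimmed-down copy of the dictionary from the question (the variable names by_index and by_transpose are made up for this example):

```python
import pandas as pd

# Trimmed-down version of the dictionary from the question
my_dict = {'RuleSet': {
    '0': {'RuleSetID': '0', 'RuleSetName': 'Allgemein',
          'Rules': [{'RulesID': '10', 'GroupHits': '2'}]},
    '6': {'RuleSetID': '6', 'RuleSetName': 'Zwangsvollstr. Berlin',
          'Rules': [{'RulesID': '23', 'GroupHits': '1'}]},
}}

# Outer keys become the row labels directly
by_index = pd.DataFrame.from_dict(my_dict['RuleSet'], orient='index')

# Outer keys become columns first, then rows and columns are swapped
by_transpose = pd.DataFrame(my_dict['RuleSet']).transpose()

print(by_index)
print(by_transpose)
```

Note that the Rules column still holds a list of dicts in both cases; flattening that is a separate step.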

How to take average in a timeframe python?

I am a beginner in Python, so I kindly ask for your help. I would like to produce a document where the first column is the month (e.g. 2011.01), the second column is the number of ARD 'events' in that month, and the third column is the average of all ARD values in that month. If a month has no events, the row should read e.g. 2012.07 0 0.
I've already tried for 3 hours and now I am getting nervous.
I would really appreciate your help.
import pandas as pd
from numpy import mean
from numpy import std
from numpy import cov
from matplotlib import pyplot
from scipy.stats import pearsonr
from scipy.stats import spearmanr
data = pd.read_csv('ARD.txt',delimiter= "\t")
month = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
day = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']
year = ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']
ertek = data[:1].iloc[0].values
print(ertek)
print(data.head)
def list_to_string ( y, m, d):
    str = ""
    s = [y, m, d]
    str.join(s)
    return str
for x in year:
    for y in month:
        for i in day:
            x = 1
            ertek = data[:x].iloc[0].values
            list_to_string(x, y, i)
            if ertek[0] == list_to_string[x, y, i]:
                print("")
                x += 1
            else:
                print("")
Result:
['2011.01.05.' 0.583333333]
<bound method NDFrame.head of Date ARB
0 2011.01.05. 0.583333
1 2011.01.06. 0.583333
2 2011.01.07. 0.590909
3 2011.01.09. 0.625000
4 2011.01.10. 0.142857
... ... ...
1284 2020.12.31. 0.900000
1285 2020.12.31. 0.900000
1286 2020.12.31. 0.900000
1287 2020.12.31. 0.900000
1288 2020.12.31. 0.900000
[1289 rows x 2 columns]>
Traceback (most recent call last):
File "C:\Users\Kókai Dávid\Desktop\python,java\python\stock-trading-ml-master\venv\Scripts\orosz\oroszpred.py", line 29, in <module>
list_to_string(x, y, i)
File "C:\Users\Kókai Dávid\Desktop\python,java\python\stock-trading-ml-master\venv\Scripts\orosz\oroszpred.py", line 21, in list_to_string
str.join(s)
TypeError: sequence item 0: expected str instance, int found
Process finished with exit code 1

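I'm not sure I follow the intent of the list_to_string function; if it's for comparing dates as strings, you can sidestep that entirely by parsing the dates and resampling by month:
# dates in the file look like '2011.01.05.'
data['Date'] = pd.to_datetime(data['Date'], format='%Y.%m.%d.')
data = data.set_index('Date')
# 'MS' groups by calendar month, labelled with the first day of the month
monthly_average = data['ARB'].resample('MS').mean()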
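To get both columns the question asks for (event count and average per month, with 0 0 for empty months), something along these lines might work. It is a sketch on a made-up three-row sample in the same shape as the printed data (Date, ARB); the name monthly and the sample values are assumptions, not taken from ARD.txt.

```python
import pandas as pd

# Hypothetical sample shaped like the file in the question
data = pd.DataFrame({
    'Date': ['2011.01.05.', '2011.01.06.', '2011.03.07.'],
    'ARB': [0.5, 0.7, 0.9],
})
data['Date'] = pd.to_datetime(data['Date'], format='%Y.%m.%d.')

monthly = (data.set_index('Date')['ARB']
               .resample('MS')            # one bucket per calendar month
               .agg(['count', 'mean'])    # number of events and their average
               .fillna(0))                # months without events: mean NaN -> 0

# Label the rows like 2011.01, 2011.02, ...
monthly.index = monthly.index.strftime('%Y.%m')
print(monthly)
```

resample fills in the months between the first and last date, so February shows up here with a count of 0 even though it has no rows. monthly.to_csv('out.txt', sep='\t') would write it to a file (the file name is just an example).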

PANDAS create column by iterating row by row checking values in 2nd dataframe until all values are true

df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005'],
    'L' : ['5', '65.0', '445.0', '3200', '65000'],
    'H' : ['2', '15.5', '150.5', '1500', '54000'],
    'W' : ['5', '85.0', '640.0', '1650', '45000']
})
df2 = pd.DataFrame({
    'Rank' : ['1','2','3','4','5'],
    'Length': ['10', '100', '1000', '10000', '100000'],
    'Width' : ['10', '100', '1000', '10000', '100000'],
    'Height': ['10', '100', '1000', '10000', '100000'],
    'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
Here I have two example dataframes. df1 lists unique item numbers with their dimensions. df2 lists the maximum allowable dimensions for each rank and code, meaning that none of an item's dimensions (length, width, height) may exceed the maximum for that rank. I would like to check the dimensions in df1 against df2 until all dimension criteria are True, in order to retrieve the item's 'Rank' and 'Code'. In essence: iterate down df2 row by row until all the criteria are True.
The result should be a new df3 as follows:
ItemNo Rank Code
001 1 aa
002 2 bb
003 3 cc
004 4 dd
005 5 ee

Using numpy:
1) changed the sample data so that it's not just incrementing results
2) get the index of the row in df2 that matches the required logic
3) build df3 using the index from step 2
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005'],
    'L' : ['5', '65.0', '445.0', '5', '65000'],
    'H' : ['2', '15.5', '150.5', '5', '54000'],
    'W' : ['5', '85.0', '640.0', '5', '45000']
})
df2 = pd.DataFrame({
    'Rank' : ['1','2','3','4','5'],
    'Length': ['10', '100', '1000', '10000',100000],
    'Width' : ['10', '100', '1000', '10000',100000],
    'Height': ['10', '100', '1000', '10000',100000],
    'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
# fix up datatypes for comparisons
df1[["L","H","W"]] = df1[["L","H","W"]].astype(float)
df2[["Length","Height","Width"]] = df2[["Length","Height","Width"]].astype(float)
# row by row comparison, argmax to get first True
idx = [np.argmax((df1.loc[r,["L","H","W"]].values
                  < df2.loc[:,["Length","Height","Width"]].values).all(axis=1))
       for r in df1.index]
# finally the result
pd.concat([df1.ItemNo, df2.loc[idx,["Rank","Code"]].reset_index(drop=True)],axis=1)
  ItemNo Rank Code
0    001    1   aa
1    002    2   bb
2    003    3   cc
3    004    1   aa
4    005    5   ee
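One thing to watch with the argmax trick: np.argmax returns the position of the first True in a boolean array, but it also returns 0 when nothing is True, so an item that fits no rank would silently get rank 1. A minimal sketch of the same comparison with a guard for that case; the helper name first_fitting_row and the -1 sentinel are assumptions for this example, and it keeps the strict < used in the answer (<= may be closer to "must not exceed").

```python
import numpy as np

# Maximum allowed (Length, Height, Width) per rank, smallest rank first
limits = np.array([[10, 10, 10],
                   [100, 100, 100],
                   [1000, 1000, 1000]], dtype=float)

def first_fitting_row(item, limits):
    """Index of the first row whose limits all exceed item, or -1 if none does."""
    fits = (item < limits).all(axis=1)   # one boolean per row of limits
    # argmax alone would return 0 for an all-False array, so check any() first
    return int(np.argmax(fits)) if fits.any() else -1

print(first_fitting_row(np.array([445.0, 150.5, 640.0]), limits))
print(first_fitting_row(np.array([5000.0, 1.0, 1.0]), limits))
```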

I think you can try:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'ItemNo' : ['001', '002', '003', '004', '005', '006'],
    'L' : ['5', '65.0', '445.0', '3200', '65000', '10'],
    'H' : ['2', '15.5', '150.5', '1500', '54000','1000'],
    'W' : ['5', '85.0', '640.0', '1650', '45000', '10']
})
df2 = pd.DataFrame({
    'Rank' : ['1','2','3','4','5'],
    'Length': ['10', '100', '1000', '10000', '100000'],
    'Width' : ['10', '100', '1000', '10000', '100000'],
    'Height': ['10', '100', '1000', '10000', '100000'],
    'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
df_sort = pd.DataFrame({'W': np.searchsorted(df2['Width'].astype(float), df1['W'].astype(float)),
                        'H': np.searchsorted(df2['Height'].astype(float), df1['H'].astype(float)),
                        'L': np.searchsorted(df2['Length'].astype(float), df1['L'].astype(float))})
df1['Rank'] = df_sort.max(axis=1).map(df2['Rank'])
df1['Code'] = df1['Rank'].map(df2.set_index('Rank')['Code'])
print(df1)
Output:
ItemNo L H W Rank Code
0 001 5 2 5 1 aa
1 002 65.0 15.5 85.0 2 bb
2 003 445.0 150.5 640.0 3 cc
3 004 3200 1500 1650 4 dd
4 005 65000 54000 45000 5 ee
5 006 10 1000 10 3 cc
The core of the code is the np.searchsorted function, which finds the position at which each value (L, for example) would be inserted into the sorted limits (Length) to keep them in order. I call np.searchsorted once for each of the three dimensions, take the largest of the three positions with max(axis=1), and then assign the rank and code for that position using map.
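To make the searchsorted behaviour concrete, here is a small standalone sketch with made-up values (limits and values are illustrative names, not part of the answer). With the default side='left', a value equal to a limit gets that limit's position, which matches "must not exceed"; side='right' would push it to the next rank.

```python
import numpy as np

limits = np.array([10.0, 100.0, 1000.0, 10000.0, 100000.0])  # must be sorted
values = np.array([5.0, 10.0, 65.0, 3200.0])

# Position of the first limit that is >= each value
left = np.searchsorted(limits, values)                  # side='left' is the default
# Position of the first limit that is > each value
right = np.searchsorted(limits, values, side='right')

print(left.tolist())
print(right.tolist())
```

A value larger than every limit gets position len(limits), which has no matching row in df2, so the map in the answer would leave NaN for such an item.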

python grouping and transpose

I have a dataframe - df as below :
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
                   "Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
                   "Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
                   "Group_code":['123','123','456','678','123','434','678','365','678','987'],
                   "Amount":['100','240','450','212','432','123','543','567','232','453'],
                   "Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
Now I need to group by Customer_no and get the total amount, the top 1, 2 and 3 categories, and their percentage of the total amount.
NOTE: In my toy dataset each customer has at most 2 categories; in my original data there are more, and I need to select the top 5 categories.
My DataFrame should look like this:
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
                   "Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
                   "Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
                   "Group_code":['123','123','456','678','123','434','678','365','678','987'],
                   "Amount":['100','240','450','212','432','123','543','567','232','453'],
                   "Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123'],
                   "Total amount" :['790','790','790','644','644','123','543','567','232','453'],
                   "Top-1 Category":['Hardware','Hardware','Hardware','Electrical','Electrical','Car','House','Toy',
                                     'House','Bike123'],
                   "Top-1 Category %":['57','57','57','67','67','100','100','100','100','100'],
                   "Top-2 Category":['Electrical','Electrical','Electrical','House','House','','','','',''] ,
                   "Top-2 Category %":['43','43','43','33','33','0','0','0','0','0']})
I would appreciate your help in achieving this.
NOTE:
1) The top category is selected by grouping the categories for each customer and summing the amount per category. Whichever category has the largest amount is the Top-1 category, the next one is Top-2, and so on.
2) Top-1 category percentage: the total amount of that category divided by the customer's total amount, multiplied by 100. This is computed per customer, and similarly for the Top-2 category.

Try this:
import pandas as pd
# Your data
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
                   "Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
                   "Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
                   "Group_code":['123','123','456','678','123','434','678','365','678','987'],
                   "Amount":['100','240','450','212','432','123','543','567','232','453'],
                   "Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
# Make some columns numerical
for i in ["Customer_no","Card_no","Group_code","Amount"]:
    df[i] = pd.to_numeric(df[i])
# Total sum per customer
Total_amount = pd.DataFrame(df.groupby(["Customer_no"]).Amount.sum()).reset_index().rename(columns={'Amount':'Total amount'})
# Add the necessary helper columns: amount per customer and category
Top_1_Category = pd.DataFrame(df.groupby(['Customer_no',"Category"]).Amount.sum()).reset_index().rename(columns={'Amount':'group_sum'})
df = df.merge(Total_amount,how='left',on='Customer_no')
df = df.merge(Top_1_Category,how='left',on=['Category','Customer_no'])
# For each customer, keep the row with the largest category sum
group_top_1 = df[['Customer_no','Category','group_sum']].loc[df.groupby('Customer_no').group_sum.agg('idxmax')].rename(columns={'Category':'Top-1 Category','group_sum':'group_sum_1'})
df = df.merge(group_top_1,how='left',on='Customer_no')
# Make the column 'Top-1 Category %'
df['Top-1 Category %'] = round(100*df['group_sum_1']/df['Total amount'],0)
# Drop the helper columns
df.drop(['group_sum','group_sum_1'],axis=1,inplace=True)
You can add the Top-2 columns similarly.
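Since the real data needs the top 5 rather than the top 1, repeating the merge per rank gets tedious. One possible generalisation, sketched on a cut-down sample with already numeric amounts: rank the categories inside each customer, keep the first TOP_N, and pivot the ranks into columns. TOP_N, cat, wide and out are names made up for this sketch, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_no": [1, 1, 1, 2, 2],
    "Amount": [100, 240, 450, 212, 432],
    "Category": ["Electrical", "Electrical", "Hardware", "House", "Electrical"],
})
TOP_N = 2  # assumed parameter; the question mentions 5 for the real data

# Amount per customer and category, largest first within each customer
cat = (df.groupby(["Customer_no", "Category"], as_index=False)["Amount"].sum()
         .sort_values(["Customer_no", "Amount"], ascending=[True, False]))
cat["rank"] = cat.groupby("Customer_no").cumcount() + 1
totals = cat.groupby("Customer_no")["Amount"].transform("sum")
cat["pct"] = (100 * cat["Amount"] / totals).round(0)
cat = cat[cat["rank"] <= TOP_N]

# One row per customer, one (Category, pct) pair of columns per rank
wide = cat.pivot(index="Customer_no", columns="rank", values=["Category", "pct"])
wide.columns = [f"Top-{r} Category" + (" %" if name == "pct" else "")
                for name, r in wide.columns]

out = df.merge(wide.reset_index(), on="Customer_no", how="left")
print(out)
```

Customers with fewer than TOP_N categories get NaN in the unused columns; fillna could replace those with '' and 0 to match the layout in the question.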

Convert list of lists to custom dictionary

I'm trying, so far unsuccessfully, to convert a list of lists to a custom dictionary.
I've created the following output, saved in two lists:
headers = ['CPU', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
          ['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
          ['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
          ['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
Formatted output:
headers:
slot name id cused callc mused mallc
result:
1/0 aaa 10 0.1 15 10.73 16.00
2/0 bbb 25 0.1 20 11.39 14.00
1/0 ccc 10 0.2 10 11.50 15.00
1/0 aaa 10 1.1 15 15.10 23.00
The first n columns (3 in this case) should be concatenated to form the key name, and each of the remaining columns should become an output value.
I would like to convert it to a dictionary in the following format:
slot.<slot>.name.<name>.id.<id>.cused:<value>,
slot.<slot>.name.<name>.id.<id>.callc:<value>,
slot.<slot>.name.<name>.id.<id>.mused:<value>,
slot.<slot>.name.<name>.id.<id>.mallc:<value>,
...
for example:
dictionary = {
    'slot.1/0.name.aaa.id.10.cused': 0.1,
    'slot.1/0.name.aaa.id.10.callc': 15,
    'slot.1/0.name.aaa.id.10.mused': 10.73,
    'slot.1/0.name.aaa.id.10.mallc': 16.00,
    'slot.2/0.name.bbb.id.25.cused': 0.1,
    ...
    'slot.<n>.name.<name>.id.<id>.<value_name>': <value>
}
Can you show me how that can be done?

Updated - OP added raw lists
Now that you have updated the question to show the raw list it's even easier:
headers = ['CPU', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
          ['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
          ['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
          ['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
results = {}
for r in result:
    slot, name, _id = r[:3]
    results.update(
        {'slot.{}.name.{}.id.{}.{}'.format(slot, name, _id, k) : v
         for k, v in zip(headers[3:], r[3:])})
>>> from pprint import pprint
>>> pprint(results)
{'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.bbb.id.10.callc': '20',
'slot.1/0.name.bbb.id.10.cused': '0.1',
'slot.1/0.name.bbb.id.10.mallc': '14.00',
'slot.1/0.name.bbb.id.10.mused': '11.27',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50'}
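Notice that the output has 12 entries for 4 input rows: the first and last rows share the same slot, name and id, so the later row's values overwrite the earlier ones. If both need to be kept, one option (an assumption about what might be wanted, not something the question asks for) is to collect the values into lists. The names KEY_LEN and collected are made up for this sketch.

```python
from collections import defaultdict

headers = ['slot', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
          ['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
KEY_LEN = 3  # number of leading columns that form the key

collected = defaultdict(list)
for row in result:
    # e.g. 'slot.1/0.name.aaa.id.10'
    prefix = '.'.join(f'{h}.{v}' for h, v in zip(headers[:KEY_LEN], row[:KEY_LEN]))
    for stat, value in zip(headers[KEY_LEN:], row[KEY_LEN:]):
        # append instead of assign, so repeated keys keep every value
        collected[f'{prefix}.{stat}'].append(value)

print(dict(collected))
```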
Original file based answer
The following code will construct the required dictionary (results). The idea is that each non-header line in the file is split by whitespace into fields, and the fields are used in a dictionary comprehension to construct a dictionary for each line, which is then used to update the results dictionary.
with open('data') as f:
    # skip the 3 header lines
    for i in range(3):
        _ = next(f)
    STAT_NAMES = 'cused callc mused mallc'.split()
    results = {}
    for line in f:
        line = line.split()
        slot, name, _id = line[:3]
        results.update(
            {'slot.{}.name.{}.id.{}.{}'.format(slot, name, _id, k) : v
             for k, v in zip(STAT_NAMES, line[3:])})
Output
>>> from pprint import pprint
>>> pprint(results)
{'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50',
'slot.2/0.name.bbb.id.25.callc': '20',
'slot.2/0.name.bbb.id.25.cused': '0.1',
'slot.2/0.name.bbb.id.25.mallc': '14.00',
'slot.2/0.name.bbb.id.25.mused': '11.39'}

Try this. Note: I changed "CPU" to "slot" in the headers.
headers = ['slot', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
          ['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
          ['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
          ['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
# I get: [['1/0', '1/0', '1/0', '1/0'], ['aaa', 'bbb', 'ccc', 'aaa'], ....
transpose_result = map(list, zip(*result))
# I get: {'slot': ['1/0', '1/0', '1/0', '1/0'],
#         'mallc': ['16.00', '14.00', '15.00', '23.00'], ...
data = dict(zip(headers, transpose_result))
d = {}
for reg in ("cused", "callc", "mused", "mallc"):
    for i, val in enumerate(data[reg]):
        key = []
        for reg2 in ("slot", "name", "id"):
            key.append(reg2)
            key.append(data[reg2][i])
        key.append(reg)
        d[".".join(key)] = val
The resulting d is:
{
'slot.1/0.name.bbb.id.10.cused': '0.1',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.bbb.id.10.callc': '20',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.bbb.id.10.mallc': '14.00',
'slot.1/0.name.bbb.id.10.mused': '11.27'}

import itertools
headers = 'slot name id cused callc mused mallc'.split()
result = ['1/0 aaa 10 0.1 15 10.73 16.00'.split(),
          '2/0 bbb 25 0.1 20 11.39 14.00'.split()]
key_len = 3
d = {}
for row in result:
    key_start = '.'.join(itertools.chain(*zip(headers, row[:key_len])))
    for key_end, val in zip(headers[key_len:], row[key_len:]):
        d[key_start + '.' + key_end] = val

Another solution, with the correct types for cused, callc, mused and mallc:
labels = ['slot','name','id','cused','callc','mused','mallc']
data = ['1/0 aaa 10 0.1 15 10.73 16.00',
        '2/0 bbb 25 0.1 20 11.39 14.00',
        '1/0 ccc 10 0.2 10 11.50 15.00',
        '1/0 aaa 10 1.1 15 15.10 23.00']
data = [tuple(e.split()) for e in data]
# list() is needed on Python 3, where zip returns an iterator that cannot be indexed
data = [list(zip(labels, e)) for e in data]
results = dict()
for e in data:
    s = '%s.%s.%s' % tuple(['.'.join(e[i]) for i in range(3)])
    for i in range(3,7):
        # callc (index 4) is an integer; the other values are floats
        results['%s.%s' % (s, e[i][0])] = int(e[i][1]) if i == 4 else float(e[i][1])
print(results)
