Setting dictionary values while iterating through a 'for-loop' - python

I'm trying to create a nested dictionary with a set of values that are pulled from a for-loop, to measure growth and revenue amounts for various customer-product pairings. However, when I loop through a dataframe to set elements of the dictionary, each dictionary element ends up with the same values. What's going on here?
I have already tried changing various elements of how the lists are built, but to no avail.
'''
TP_Name = customer name
Service_Level_1 = service name
100.2014 is just a marker to show that someone has started consuming the service
tpdict is already created with necessary nesting below with empty values at each endpoint
'''
for col in pivotdf.columns:
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row + .00001
            first = False
        if row == 0 and past < .0001:
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current - past) / past
            growth = round(growth, 4)
        growthlist.append(growth)
        past = row + .00001
        amountlist.append(row)
    tpdict[TP_Name][Service_Level_1]['growth'] = growthlist
    tpdict[TP_Name][Service_Level_1]['amount'] = amountlist
'''
problem: Each value ends up being the same thing
'''
Expected results:
{'CUSTOMER NAME': {'PRODUCT1': {'growth': [unique_growthlist], 'amount': [unique_amountlist]}, 'PRODUCT2': {'growth': [unique_growthlist],'amount': [unique_amountlist]}}}

A dictionary is a collection of key-value pairs (as I am sure you know). If you ever write to a dictionary with a key that already exists, the dictionary will overwrite the value for that key.
Example:
d = dict()
d[1] = 'a' # d = {1: 'a'}
d[1] = 'b' # d = {1: 'b'}
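A related pitfall worth checking, since the code that builds tpdict isn't shown (so this is an assumption): if every endpoint of the nested dictionary was initialized from one shared inner dict, then writing through any key changes all of them at once, which matches the symptom of every element ending up the same.
# Hypothetical reconstruction of how tpdict might have been built.
inner = {'growth': [], 'amount': []}
tpdict = {'CUSTOMER': {'PRODUCT1': inner, 'PRODUCT2': inner}}  # both keys share ONE dict

tpdict['CUSTOMER']['PRODUCT1']['growth'] = [1, 2]
print(tpdict['CUSTOMER']['PRODUCT2']['growth'])  # [1, 2] -- changed as well!

# Creating a fresh dict per endpoint avoids the aliasing:
tpdict = {'CUSTOMER': {p: {'growth': [], 'amount': []} for p in ('PRODUCT1', 'PRODUCT2')}}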
Your project seems like a good fit for a namedtuple, which is basically a lightweight class/object.
My example code may be wrong because I don't know exactly how your for-loop works (commenting helps everyone), but here is an example.
I only make this recommendation because dictionaries consume roughly 33% more memory than the objects they hold (though they are much faster).
from collections import namedtuple

Customer = namedtuple('Customer', 'name products')
Product = namedtuple('Product', 'growth amount')

customers = []
for col in pivotdf.columns:
    products = []
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row + .00001
            first = False
        if row == 0 and past < .0001:
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current - past) / past
            growth = round(growth, 4)
        growthlist.append(growth)
        past = row + .00001
        amountlist.append(row)
    cur_product = Product(growth=growthlist, amount=amountlist)  # Create a new product
    products.append(cur_product)  # Add that product to our customer
    # Create a new customer with our products
    cur_customer = Customer(name=TP_Name, products=products)
    customers.append(cur_customer)  # Add our customer to our list of customers
Here customers is a list of Customer namedtuples that we can use as objects.
For example, this is how we can print them out.
for customer in customers:
    print(customer.name, customer.products)  # Print each name and their products
    for growth, amount in customer.products:
        print(growth, amount)  # Print growth and amount for each product.

Related

SQL Dictionary Appending for Large Datasets in a Loop in cx_Oracle

I am trying to append dictionaries together and then use from_dict to build the final returned data from cx_Oracle, as I heard that is more efficient than appending each returned row from SQL. However, my loop still takes a very long time (it returns a VERY large dataset: each iteration fetches data for one I.D. at roughly 12,000 rows per I.D., and there are over 700 I.D.s in the loop). How do I take advantage of from_dict so this speeds up? I don't think the code as written now is the most efficient way to do this. Any suggestions? Thanks.
Is there a more efficient way? Using concat and not append?
for iteration, c in enumerate(l, start=1):
    total = len(l)
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    print("Getting consumption data for service point I.D.:", c, " ---->", iteration, "of", total)
    cursor.arraysize = 1000000
    cursor.prefetchrows = 2
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    df_row = cursor.fetchall()
    if len(df_row) == 0:
        pass
    else:
        # Here is where I combine dictionaries, but this is only for one dataset pulled from SQL.
        # I want to combine all the dictionaries from each loop to increase efficiency.
        a = {k: [d[k] for d in df_row] for k in df_row[0]}
        AMI_data = pd.DataFrame.from_dict(a)
        #AMI.append(AMI_data)
        #final_AMI_data = pd.concat(AMI)
        # final_data.dropna(inplace = True)
# UPDATED
final_AMI_data = pd.DataFrame()
for iteration, c in enumerate(l, start=1):
    total = len(l)
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    print("Getting consumption data for service point I.D.:", c, " ---->", iteration, "of", total)
    cursor.arraysize = 1000000
    cursor.prefetchrows = 2
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    df_row = cursor.fetchall()
    if len(df_row) == 0:
        pass
    else:
        AMI_data = pd.DataFrame.from_records(df_row)
        final_AMI_data = final_AMI_data.append(AMI_data, ignore_index=False)
        # final_data.dropna(inplace = True)
You shouldn't need to re-create your dictionary if you're already using the dict-style cursor factory. (Btw, see this answer for how to make a better one.)
Assuming your df_rows looks like this after fetching all rows, with 'X' and 'Y' being example column names for the query-result:
[{'X': 'xval1', 'Y': 'yval1'},
 {'X': 'xval2', 'Y': 'yval2'},
 {'X': 'xval3', 'Y': 'yval3'}]
1. Then use .from_records() to create your dataframe:
pd.DataFrame.from_records(df_rows)
Output:
       X      Y
0  xval1  yval1
1  xval2  yval2
2  xval3  yval3
That way, you don't need to restructure your results to use with from_dict().
2. And if you want to keep adding each group of 12,000 results to the same DataFrame, use DataFrame.append() with ignore_index=True to keep adding each new group of results to the existing dataframe.
It's better to just append into your dataframe than to build a bigger and bigger dictionary only to finally create one df.
In case it wasn't clear, remove these two lines in your else:
a = {k: [d[k] for d in df_row] for k in df_row[0]}
AMI_data = pd.DataFrame.from_dict(a)
and replace it with just:
AMI_data = pd.DataFrame.from_records(df_row)
# and then to add it to your final (note that .append() returns a new DataFrame rather than modifying in place):
final_AMI_data = final_AMI_data.append(AMI_data, ignore_index=True)
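To answer the concat question: since each call to DataFrame.append() re-copies all existing rows, a pattern that usually scales better for 700+ chunks (a sketch of the idea, reusing the names from the code above) is to collect the per-I.D. frames in a list and concatenate once at the end:
frames = []
for iteration, c in enumerate(l, start=1):
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    df_row = cursor.fetchall()
    if df_row:
        frames.append(pd.DataFrame.from_records(df_row))
final_AMI_data = pd.concat(frames, ignore_index=True)  # one concatenation instead of 700+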

How can I optimize the groupby.apply(function) in Python?

I have a function that uses collections.deque to track daily stock based on FIFO. An order is fulfilled if possible and is subtracted from stock accordingly. I use the function in groupby.apply(my_function).
I'm struggling with where to place the second loop. Both loops work properly when run on their own, but I cannot get them working combined.
The dataset is about 1.5 million rows.
Thanks.
DOS = 7
WIP = 1
df_fin['list_stock'] = 0
df_fin['stock_new'] = 0

def create_stocklist(x):
    x['date_diff'] = x['dates'] - x['dates'].shift()
    x['date_diff'] = x['date_diff'].fillna(0)
    x['date_diff'] = (x['date_diff'] / np.timedelta64(1, 'D')).astype(int)
    x['list_stock'] = x['list_stock'].astype(object)
    x['stock_new'] = x['stock_new'].astype(object)
    var_stock = DOS * [0]
    sl = deque([0], maxlen=DOS)
    for i in x.index:
        order = x['order_bin'][i]
        if x['date_diff'][i] > 0:
            for p in range(0, x['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(x.return_bin[i-1])
                else:
                    sl.appendleft(0)
            sl_list = list(sl)
            sl_list.reverse()
            new_list = []
            # from here the loop does not work as I wanted it to work.
            # I want to loop over the created sl_list
            # and then start the loop above with the outcome of the loop below.
            for elem in sl_list:
                while order > 0:
                    val = max(0, elem - order)
                    order = abs(min(0, elem - order))
                    new_list.append(val)
                    break
                else:
                    new_list.append(elem)
            new_list.reverse()
            x.at[i, 'list_stock'] = new_list
            sl = deque(new_list)
    return x

df_fin.groupby(by=['ID']).apply(create_stocklist)
You do not have access to sl_list inside the second loop; you should define it in the enclosing scope, for example right at the start of the first for loop:
for i in x.index:
    # define it just here
    sl_list = []
    order = x['order_bin'][i]

Discover possible combinations in permutation triads with specific criteria

This is the spreadsheet I'm working on. As you can see, the spreadsheet is in a really messy state. I have done some cleaning of the data as per the description below:
Each column heading denotes the period/lesson in the day.
Each day has 7 periods/lessons, so Monday to Friday = 35 columns.
Each cell contains the class descriptor (the first 3 characters), the initials of the teacher (the 3 characters after the "$" sign) and the room name (the 3 characters after the "(").
I want to get teachers into groups of 3 (triads) where the following criteria are true:
At any given period/lesson in the week, 2 teachers are "free" (not teaching) and 1 teacher is teaching.
Within that very same triad, in a different period, a different one of the 3 teachers is "not free" and teaching, while the other 2 teachers are free.
In addition, in some other period, the remaining combination of 2 teachers is "free" (not teaching) and the remaining teacher is teaching.
See the following idea for further clarification:
Initials A AND B are in the set but C is not present in the set
ALSO C AND A are in the set but B is not present in the set
ALSO B AND C are in the set but A is not present in the set
All 3 criteria must be true to find the final triad,
so out of the thousands of permutations there should not be many combinations that fit these criteria.
In simple terms, what I'm trying to do is place teachers into groups of 3, where 2 teachers can go in and observe the lesson of another teacher, i.e., in any period, 2 teachers are free and one is teaching. You will see in each column all the teachers who are teaching at any given period and day. Therefore anyone who is not in that column we can deduce is not teaching.
We want the group of 3 teachers to remain as a triad so they each get to be observed. So in any other period in the week a different teacher from the same triad is teaching and the other 2 are NOT teaching.
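A direct way to express this criterion (a sketch, assuming per-period sets of teaching and free teachers have already been extracted, which the code below does build): a triad qualifies when each of its three members teaches in some period while the other two are both free.
from itertools import combinations

def find_triads(teaching, free):
    # teaching[p] / free[p]: sets of teacher initials for period p
    all_teachers = set().union(*teaching)
    def observable(t, observers):
        # t teaches in some period while both observers are free
        return any(t in teaching[p] and observers <= free[p]
                   for p in range(len(teaching)))
    return [(a, b, c) for a, b, c in combinations(sorted(all_teachers), 3)
            if observable(a, {b, c}) and observable(b, {a, c}) and observable(c, {a, b})]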
This is the code I've written so far to clean up the data and create the possible triads. I do not know if this is the best approach to the aforementioned problem, but in any case this is what I've done so far. I am currently stuck on how to find the intersections between all these triads in order to correctly identify the teachers who fulfill the above criteria.
import pandas as pd
import numpy as np
import itertools

class unique_element:
    def __init__(self, value, occurrences):
        self.value = value
        self.occurrences = occurrences

def perm_unique(elements):
    eset = set(elements)
    listunique = [unique_element(i, elements.count(i)) for i in eset]
    u = len(elements)
    return perm_unique_helper(listunique, [0]*u, u-1)

def perm_unique_helper(listunique, result_list, d):
    if d < 0:
        yield tuple(result_list)
    else:
        for i in listunique:
            if i.occurrences > 0:
                result_list[d] = i.value
                i.occurrences -= 1
                for g in perm_unique_helper(listunique, result_list, d-1):
                    yield g
                i.occurrences += 1

def findsubsets(S, m):
    return set(itertools.combinations(S, m))

csv_file = pd.read_csv('Whole_School_TT.csv')
df = csv_file.dropna(how='all')
df = csv_file.fillna(0)
cols = df.columns
df_class_name = df.copy()
df_names = df.copy()
df_room_number = df.copy()
for col in range(0, len(df.columns)):
    for row in range(0, len(df)):
        if df[cols[col]].iloc[row] != 0:  # was 'is not 0': identity checks against ints are unreliable
            text = df[cols[col]].iloc[row]
            index_dollar = df[cols[col]].iloc[row].find('$')
            r_index_dollar = df[cols[col]].iloc[row].rfind('$')
            if index_dollar != -1:
                if index_dollar == r_index_dollar:
                    df_names[cols[col]].iloc[row] = df[cols[col]].iloc[row][index_dollar+1:index_dollar+4]
                else:
                    name1 = df[cols[col]].iloc[row][index_dollar + 1:index_dollar + 4]
                    name2 = df[cols[col]].iloc[row][r_index_dollar + 1:r_index_dollar + 4]
                    df_names[cols[col]].iloc[row] = name1 + ' ' + name2
                index_hash = df[cols[col]].iloc[row].find('#')
                df_class_name[cols[col]].iloc[row] = df[cols[col]].iloc[row][:(index_dollar - 1)]
                df_room_number[cols[col]].iloc[row] = df[cols[col]].iloc[row][index_hash + 1:-1]
            else:
                df_names[cols[col]].iloc[row] = 0
                index_hash = df[cols[col]].iloc[row].find('#')
                if index_hash == -1:
                    df_class_name[cols[col]].iloc[row] = df[cols[col]].iloc[row][:3]
                    df_room_number[cols[col]].iloc[row] = 0
                else:
                    df_class_name[cols[col]].iloc[row] = df[cols[col]].iloc[row][:(index_hash - 2)]
                    df_room_number[cols[col]].iloc[row] = df[cols[col]].iloc[row][index_hash + 1:-1]
teacher_names = []
for col in range(0, len(cols)):
    period_names = df_names[cols[col]].unique()
    teacher_names.extend(period_names)
df_all_names = pd.DataFrame(teacher_names, columns=['Names'])
df_all_names = pd.DataFrame(df_all_names['Names'].unique())
df_all_names = df_all_names[(df_all_names.T != 0).any()]
mask = (df_all_names[0].str.len() == 3)
df_single_names = df_all_names.loc[mask]  # so now here we have all the teacher names in general who teach
# we will find the teachers who teach per period and the teachers who do not teach
set_of_names = set(np.array(df_single_names[0]))  # here I have all the unique teacher names
period_set_names = [0] * len(cols)
period_set_names_NO_teach = [0] * len(cols)
# here I get the names for each one of the periods
# and find the intersection with the unique teacher names in order to figure out who teaches per period
for col in range(0, len(cols)):
    period_set_names[col] = set(np.array(df_names[cols[col]]))  # get teacher names for the current period
    period_set_names_NO_teach[col] = set_of_names.difference(period_set_names[col])
    period_set_names[col] = set_of_names.intersection(period_set_names[col])
    # sanity check
    print('Teachers who teach and teachers who do not teach should together equal the full list of names: ', end='')
    print(period_set_names_NO_teach[col].union(period_set_names[col]) == set_of_names)

def get_current_period_triplets(col):
    free_period_pairs = findsubsets(period_set_names_NO_teach[col], 2)  # all the free-teacher pairs for this period
    free_period_pairs_list = list(free_period_pairs)
    period_triplets = []
    for i in range(0, len(free_period_pairs_list)):
        current_free_pair = list(free_period_pairs_list[i])
        for j in period_set_names[col]:
            temp = current_free_pair.copy()
            temp.append(j)  # list.append mutates in place and returns None
            period_triplets.append(tuple(temp))
    period_triplets = set(period_triplets)
    return period_triplets

for col in range(0, len(cols)):
    current_triplets = get_current_period_triplets(col)
    print(current_triplets)

Numpy Vs nested dictionaries, which one is more efficient in terms of runtime and memory?

I am new to numpy. I have referred to the following SO question:
Why NumPy instead of Python lists?
The final comment in the above question seems to indicate that numpy is probably slower on a particular dataset.
I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.
My options are to either use a 3D numpy array or a nested dictionary. On a reduced data set of 100*100*100, the run times were not too different.
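Before timing anything, it is worth checking the raw size of the full data set; a dense 1650×1650×1650 array of float64 similarity values is enormous, which may dominate the choice (a quick back-of-the-envelope check):
import numpy as np

n = 1650
cells = n ** 3                  # ~4.49 billion similarity values
print(cells * 8 / 1024 ** 3)    # float64: ~33.5 GiB
print(cells * 4 / 1024 ** 3)    # float32: ~16.7 GiB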
Please find the IPython code snippet below:
for id1 in range(1, count+1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    sim_score = {}
    for id2 in range(1, count+1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim
import pdb
from math import sqrt

def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0
    common_movies_or_users = []
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)
    #pdb.set_trace()
    n = len(common_movies_or_users)
    #print('No. of common movies: ' + str(n))
    if n == 0:
        return sim_score
    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))
    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))
    # Calculate Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt(float(sum1Sq - pow(sum1, 2) / n) * float(sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    sim_score = num / den
    return sim_score
What would be the best way to most precisely time the runtime with either of these options?
Any pointers would be greatly appreciated.
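For the timing itself, one standard option (a sketch, not from the original thread) is timeit from the standard library, comparing single-cell writes at the reduced 100×100×100 size:
import timeit

numpy_setup = 'import numpy as np; a = np.zeros((100, 100, 100))'
print(timeit.timeit('a[50, 50, 50] = 1.0', setup=numpy_setup, number=100000))

dict_setup = 'd = {i: {j: {} for j in range(100)} for i in range(100)}'
print(timeit.timeit('d[50][50][50] = 1.0', setup=dict_setup, number=100000))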

How can I identify an element from a list within another list

I have been trying to write a block of code that finds the index of the largest bid for each item. I was then going to use that index to identify the name of the person who bid that much money. However, no matter what I try, I can't link each person to what they have gained from the auction. Here is the code I have been writing; it has to be able to work with any information inputted.
def sealedBids():
    n = int(input('\nHow many people are in the group? '))  # determines loop lengths
    z = 0   # counter for the while loops
    g = []  # stores the names of all the people/players
    s = []  # stores the names of all items being bid on
    b = []  # stores all bids made
    f = []  # stores each person's fair share
    w = []  # stores the highest bid for each item
    q = []  # meant to store each person linked to whatever they won
    while z < n:  # Process: make b a list of nested lists
        b.append([])
        z = z + 1
    z = 0
    while z < n:
        g.append(input('Enter a bidders name: '))  # Input: person's name
        z = z + 1  # Process: store name in the g[] list
    z = 0
    i = int(input('How many items are being bid on?'))  # determines loop lengths
    while z < i:
        s.append(input('Enter the name of an item: '))  # Input: name of item
        # Process: stores names in the s[] list
        w.append(z)  # was going to swap the info inside with the info I wanted
        z = z + 1
    z = 0
    for j in range(n):  # specifies whose bid you're taking
        for k in range(i):  # specifies which item is being bid on
            # Input: takes the bid for a certain item; Process: stores it in the b[] list
            b[j].append(int(input('How much money has {0} bid on the {1}? '.format(g[j], s[k]))))
        print(' ')  # adds a space between questions so the output doesn't look crunched up
    for j in range(n):  # calculates fair share
        f.append(sum(b[j]) / n)  # adds a person's total bids, then divides by the number of bidders
    for j in range(i):  # move to the next item after every bid is compared to the stored highest bid
        for k in range(n):  # move to the next person's bid to find who bid the most on this item
            if w[j] < b[k][j]:  # compare the stored highest bid for this item to each person's bid
                w[j] = b[k][j]  # if the stored bid is smaller, replace it with the larger bid
                q.append(k)  # store an identifier for the highest bidder so far, to link with the bid later
    print(g)  # check outcome
    print(s)  # check outcome
    print(w)  # check outcome
    print(q)  # check outcome
    print(b)  # check outcome
    print(f)  # check outcome
Any advice is much appreciated.
You can use another structure for your bids. Instead of using different lists synchronized by index, you can use a dictionary and Python tuples. Maybe something like this:
items_bids = {
    item1: [(bidder1, amount), (some_other_bidder, amount), ...],
    item2: [(bidder1, amount), (some_other_bidder, amount), ...],
    ...
}
Then retrieving the max bid for each item is easy:
for item, bids in items_bids.items():
    print(max(bids, key=lambda x: x[1]))
You may design your data structure differently, as this one has fast inserts of bids but needs more time to retrieve the highest bid. Retrieving all bids made by one bidder would also be more work for the computer.
And for more maintainable code, you may use some objects with named fields instead of tuples.
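For example, a minimal sketch with a namedtuple (the Bid type here is hypothetical, not from the original answer):
from collections import namedtuple

Bid = namedtuple('Bid', 'bidder amount')

items_bids = {'Pipe': [Bid('Alice', 10), Bid('Bob', 15)]}
for item, bids in items_bids.items():
    best = max(bids, key=lambda b: b.amount)  # highest bid for this item
    print(item, best.bidder, best.amount)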
I think it would be easiest to use dictionaries with the names as keys; that way you can see what's going on:
group_size = 3
bidders = ('Alice', 'Bob', 'Eve')
items = ('Pipe', 'Wrench')
bids = {item: {bidder: 0 for bidder in bidders} for item in items}
#>>> print(bids)
#{'Pipe': {'Bob': 0, 'Alice': 0, 'Eve': 0},
# 'Wrench': {'Bob': 0, 'Alice': 0, 'Eve': 0}}

# get the money amounts
for item in bids:
    for bidder in bids[item]:
        bids[item][bidder] = int(input('How much money has {0} bid on the {1}?'.format(bidder, item)))

highest_bidders = {item: bidder for item in bids for bidder in bids[item]
                   if bids[item][bidder] == max(bids[item].values())}
print(highest_bidders)
This is horrible code - try this:
def sealedBids():
    n = int(input('\nHow many people are in the group? '))
    bidders = {}
    for i in range(n):
        bidders[input('Enter a bidders name: ')] = {}
    n = int(input('How many items are being bid on?'))
    bid_items = {}
    for i in range(n):
        bid_items[input('Enter a item name: ')] = {}
    del n
    f = []
    for bidder, bidder_bids in bidders.items():
        for bid_item, bid_item_bids in bid_items.items():
            bid = int(input('How much money has %s bid on the %s? ' % (bidder, bid_item)))
            bidder_bids[bid_item] = bid   # record the bid under the bidder
            bid_item_bids[bidder] = bid   # and under the item
        print('')
        f.append(sum(bidder_bids.values()) / len(bidders))  # what is this for?
    for bid_item, bid_item_bids in bid_items.items():
        # invert {bidder: bid} to {bid: bidder} so the highest bid keys the winner
        inv_bid_item_bids = {bid: bidder for bidder, bid in bid_item_bids.items()}
        high_bid = max(inv_bid_item_bids.keys())
        high_bidder = inv_bid_item_bids[high_bid]
        bid_items[bid_item] = (high_bidder, high_bid)
    return bid_items  # return the winners so callers can use them
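A possible way to report the winners afterwards, given the return bid_items added at the end above:
results = sealedBids()
for item, (bidder, amount) in results.items():
    print('%s wins the %s with a bid of %d' % (bidder, item, amount))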
