For-loop using dictionary key reference not working - python

I have never used Python before, but I decided to start learning it by manipulating some market data. I am having trouble with dictionaries. In the code for read_dict_price below, the statement dict_price_recalc[price_id][year_to_index(year), Q] = float(line2)/7.5 assigns float(line2)/7.5 to every array, regardless of its price_id. I wonder whether it is because I didn't initialize dict_price correctly.
def read_dict_price(dat_filename, dict_price):
    ## Load data set
    dat_file = open(dat_filename,'r')
    ## Copy arr_price
    dict_price_recalc = dict_price
    ## Iterate through each row in the data set, assigning values to variables
    for line in dat_file:
        year = int(line[11:15])
        price_id = line[0:4]
        Q = 0
        Q1 = line[19:21]
        Q2 = line[23:25]
        Q3 = line[27:29]
        Q4 = line[31:33]
        ## Truncate year_list to prepare for another run of the nested loop
        year_list = []
        year_list[:] = []
        ## Repopulate
        year_list = [Q1, Q2, Q3, Q4]
        #### This is where the mistake happens ####
        ## Iterate through each row in year_list, populating dict_price_recalc with price data
        for line2 in year_list:
            dict_price_recalc[price_id][year_to_index(year), Q] = float(line2)/7.5
            Q += 1
    return dict_price_recalc
My code initializing dict_price is below:
def init_dict_price(dat_filename):
    price_id= {}
    dat_file = open(dat_filename,'r')
    np_array = np.zeros(shape=(100,4)) # Zeros as placeholders
    np_array[:] = np.NaN
    for line in dat_file:
        key = line[:11]
        price_id[key] = np_array
    return price_id
I appreciate any pointers you can provide.

The line price_id[key] = np_array assigns the same array object to every key, so all keys point to one shared array and a write through any key shows up under all of them. You probably meant price_id[key] = np_array.copy()
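To make the aliasing concrete, here is a minimal sketch. The keys "AAAA" and "BBBB" are made-up price ids, and the array shape just mirrors init_dict_price. It contrasts a dictionary whose values all reference one array with one that holds a separate array per key.

```python
import numpy as np

keys = ["AAAA", "BBBB"]  # hypothetical price ids, for illustration only

# One array object reused as the value for every key (the situation in the question).
template = np.full((100, 4), np.nan)
shared = {key: template for key in keys}
shared["AAAA"][0, 0] = 1.0
print(shared["BBBB"][0, 0])              # the write is visible under the other key
print(shared["AAAA"] is shared["BBBB"])  # both keys hold the very same object

# A fresh array per key: writes stay local to that key.
independent = {key: np.full((100, 4), np.nan) for key in keys}
independent["AAAA"][0, 0] = 1.0
print(np.isnan(independent["BBBB"][0, 0]))
```

Note that dict_price_recalc = dict_price in read_dict_price is also only a second name for the same dictionary, not a copy. If an independent copy is wanted there, something like {k: v.copy() for k, v in dict_price.items()} would be one option.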

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed. Works for first two loops

Let me start by saying that I know this error message has posts about it, but I'm not sure what's wrong with my code. The block of code works just fine for the first two loops, but then fails. I've even tried removing the first two loops from the data to rule out issues in the 3rd loop, but no luck. I did have it set to print out the unsorted temporary list, and it just prints an empty array for the 3rd loop.
Sorry for the wall of comments in my code, but I'd rather have each line commented than cause confusion over what I'm trying to accomplish.
TL;DR: I'm trying to find and remove outliers from a list of data, but only for groups of entries that have the same number in column 0.
Pastebin with data
import numpy as np, csv, multiprocessing as mp, mysql.connector as msc, pandas as pd
import datetime
#Declare unsorted data array
d_us = []
#Declare temporary array for use in loop
tmp = []
#Declare sorted data array
d = []
#Declare Sum variable
tot = 0
#Declare Mean variable
m = 0
#declare sorted final array
sort = []
#Declare number of STDs
t = 1
#Declare Standard Deviation variable
std = 0
#Declare z-score variable
z_score = 0
#Timestamp for output files
nts = datetime.datetime.now().timestamp()
#Create output file
with open(f"calib_temp-{nts}.csv", 'w') as ctw:
    pass
#Read data from CSV
with open("test.csv", 'r', newline='') as drh:
    fr_rh = csv.reader(drh, delimiter=',')
    for row in fr_rh:
        #append data to unsorted array
        d_us.append([float(row[0]),float(row[1])])
#Sort array by first column
d = np.sort(d_us)
#Calculate the range of the data
l = round((d[-1][0] - d[0][0]) * 10)
#Declare the starting value
s = d[0][0]
#Declare the ending value
e = d[-1][0]
#Set the while loop counter
n = d[0][0]
#Iterate through data
while n <= e:
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0],row[1],diff])
    #Convert to numpy array
    tmp = np.array(tmp)
    #Sort numpy array
    sort = tmp[np.argsort(tmp[:,2])]
    #Calculate sum of differences
    for row in tmp:
        tot = tot + row[2]
    #Calculate mean
    m = np.mean(tot)
    #Calculate Standard Deviation
    std = np.std(tmp[:,2])
    #Calculate outliers and write to output file
    for y in tmp:
        z_score = (y[2] - m)/std
        if np.abs(z_score) > t:
            with open(f"calib_temp-{nts}.csv", 'a', newline='') as ct:
                c = csv.writer(ct, delimiter = ',')
                c.writerow([y[0],y[1]])
    #Reset Variables
    tot = 0
    m = 0
    n = n + 0.1
    tmp = []
    std = 0
    z_score = 0
Do this before the loop:
#Create output file
ct = open(f"calib_temp-{nts}.csv", 'w')
c = csv.writer(ct, delimiter = ',')
Then change the loop to this. Note that I have moved your initializations to the top of the loop, so you don't need to initialize them twice. Note the if tmp: line, which solves the numpy exception.
#Iterate through data
while n <= e:
    tot = 0
    m = 0
    tmp = []
    std = 0
    z_score = 0
    #Create array with difference column
    for row in d:
        if row[0] == n:
            diff = round(row[0] - row[1], 1)
            tmp.append([row[0],row[1],diff])
    #Sort numpy array
    if tmp:
        #Convert to numpy array
        tmp = np.array(tmp)
        sort = tmp[np.argsort(tmp[:,2])]
        #Calculate sum of differences
        for row in tmp:
            tot = tot + row[2]
        #Calculate mean
        m = np.mean(tot)
        #Calculate Standard Deviation
        std = np.std(tmp[:,2])
        #Calculate outliers and write to output file
        for y in tmp:
            z_score = (y[2] - m)/std
            if np.abs(z_score) > t:
                c.writerow([y[0],y[1]])
    #Reset Variables
    n = n + 0.1
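One possible reason tmp comes back empty on the third pass (an assumption, since the data is only linked, not shown) is floating-point drift: repeatedly adding 0.1 does not land exactly on the values stored in column 0, so row[0] == n stops matching. The sketch below uses a made-up starting value of 0.1 to show the effect, and two common workarounds: comparing with np.isclose, or deriving n from an integer counter and rounding.

```python
import numpy as np

start = 0.1  # hypothetical first value of column 0

# Accumulate the loop variable the same way the while loop does.
n = start
visited = []
for _ in range(3):
    visited.append(n)
    n = n + 0.1

print(visited[2] == 0.3)            # exact comparison fails on the third pass
print(np.isclose(visited[2], 0.3))  # tolerant comparison still matches

# Alternative: derive each value from an integer counter and round it,
# so rounding errors cannot build up across iterations.
stable = [round(start + k * 0.1, 1) for k in range(3)]
print(stable)
```

With the if tmp: guard from the answer the exception goes away, but groups whose key no longer matches exactly would be skipped silently, so a tolerant comparison may still be worth adding.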

Is there an optimal solution for this?

I have a large dataset containing ID, year, and program. I want to filter IDs by comparing the years of one program, say A, with those of another program, say B (for example, 2019 > 2018). I have a solution, but it involves a loop. Is there another way of doing this?
My code :
unique = list(set(finalAD['ID']))
IDFiltered = []
for i in unique:
    data = finalAD[finalAD['ID'] == i]
    AD1 = data[data['Program'].str.match('AD')]
    ind = list(AD1.index.values)
    AD2 = data.drop(ind)
    date1 = AD1['Year'].max()
    date2 = AD2['Year'].min()
    if(date2 > date1):
        IDFiltered.append(i)
newData = finalAD[finalAD['ID'].isin(IDFiltered)]
newData.reset_index(drop = True, inplace = True)
newData.head()
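One loop-free way to express the same filter is to compute the per-ID maximum and minimum with groupby and compare them in one step. This is only a sketch: the function name and the sample data are made up, and it assumes the intent is exactly what the loop does, namely keep an ID when the earliest year of its non-AD programs is later than the latest year of its AD programs.

```python
import pandas as pd

def keep_ids_where_others_follow_ad(df):
    """Keep rows whose ID has min(non-AD Year) > max(AD Year), as in the loop."""
    is_ad = df["Program"].str.match("AD")
    ad_latest = df[is_ad].groupby("ID")["Year"].max().rename("ad_latest")
    other_earliest = df[~is_ad].groupby("ID")["Year"].min().rename("other_earliest")
    # Outer-align the two per-ID series; IDs missing on one side get NaN,
    # and any comparison with NaN is False, which matches the loop's behavior.
    per_id = pd.concat([ad_latest, other_earliest], axis=1)
    keep = per_id.index[per_id["other_earliest"] > per_id["ad_latest"]]
    return df[df["ID"].isin(keep)].reset_index(drop=True)

# Made-up sample data for illustration.
finalAD = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Program": ["AD", "B", "AD", "B", "B"],
    "Year": [2017, 2018, 2019, 2018, 2018],
})
newData = keep_ids_where_others_follow_ad(finalAD)
print(newData)
```

In the sample, only ID 1 survives: ID 2 has its AD year after the other program, and ID 3 has no AD rows at all.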

How can I optimize the groupby.apply(function) in Python?

I have a function that uses collections.deque to track daily stock on a FIFO basis. An order is fulfilled if possible and subtracted from the stock accordingly. I apply the function with groupby.apply(my_function).
I am struggling with where to place the second loop. Both loops work properly when run on their own, but I cannot get them to work combined.
The dataset is about 1.5 million rows.
Thanks.
DOS = 7
WIP = 1
df_fin['list_stock'] = 0
df_fin['stock_new'] = 0
def create_stocklist(x):
    x['date_diff'] = x['dates'] - x['dates'].shift()
    x['date_diff'] = x['date_diff'].fillna(0)
    x['date_diff'] = (x['date_diff'] / np.timedelta64(1, 'D')).astype(int)
    x['list_stock'] = x['list_stock'].astype(object)
    x['stock_new'] = x['stock_new'].astype(object)
    var_stock = DOS*[0]
    sl = deque([0],maxlen=DOS)
    for i in x.index:
        order = x['order_bin'][i]
        if x['date_diff'][i] > 0:
            for p in range(0,x['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(x.return_bin[i-1])
                else:
                    sl.appendleft(0)
            sl_list = list(sl)
            sl_list.reverse()
        new_list = []
        #from here the loop does not work as I wanted it to work.
        #I want to loop over de created sl_list
        #and then start the loop above with the outcome of the loop below.
        for elem in sl_list:
            while order > 0:
                val = max(0,elem-order)
                order = (abs(min(0,elem-order)))
                new_list.append(val)
                break
            else:
                new_list.append(elem)
        new_list.reverse()
        x.at[i,'list_stock'] = new_list
        sl = deque(new_list)
    return x
df_fin.groupby(by=['ID']).apply(create_stocklist)
sl_list is not guaranteed to exist when the second loop runs, because it is only assigned inside the if branch. Define it in the enclosing scope, for example at the top of the outer for loop:
for i in x.index:
    # define it just here
    sl_list = []
    order = x['order_bin'][i]
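The order-fulfilment step itself may be easier to reason about (and to test) as a small standalone function. The sketch below is not the asker's code: consume_fifo is a hypothetical helper, and it assumes the deque is iterated oldest bin first, so the question's sl would need to be reversed or filled with append instead of appendleft.

```python
from collections import deque

def consume_fifo(stock, order):
    """Subtract `order` from `stock`, oldest bin first.

    `stock` is iterated left to right, so the leftmost bin is assumed to be
    the oldest. Returns the updated deque and the part of the order that
    could not be fulfilled.
    """
    remaining = order
    updated = deque(maxlen=stock.maxlen)
    for qty in stock:
        used = min(qty, remaining)  # take as much as this bin can give
        updated.append(qty - used)
        remaining -= used
    return updated, remaining

stock = deque([3, 0, 5], maxlen=7)
after_small, unmet_small = consume_fifo(stock, 4)
after_large, unmet_large = consume_fifo(stock, 10)
print(list(after_small), unmet_small)
print(list(after_large), unmet_large)
```

Keeping this logic out of the row loop also makes it straightforward to check edge cases, such as an order larger than the total stock, before running it over 1.5 million rows.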

Creating multiple plots in Python for loop

I want to plot T/Tmax vs R/R0 for different values of Pa in a single figure, like the one below. My code appends the values to a list, but all values of T/Tmax and R/R0 end up in a single list, which does not give a good plot. How can I get such a plot? Also, how can I write the data from the loop to an Excel sheet where column 1 is the T/Tmax list and columns 2, 3, 4, ... are the corresponding R/R0 values for different pa?
KLMDAT1 = []
KLMDAT2 = []
for j in range(z):
    pa[j] = 120000-10000*j
    i = 0
    R = R0
    q = 0
    T = 0
    while (T<Tmax):
        k1 = KLM_RKM(i*dT,R,q,pa[j])
        k2 = KLM_RKM((i+0.5)*dT,R,q+0.5*dT*k1,pa[j])
        k3 = KLM_RKM((i+0.5)*dT,R,q+0.5*dT*k2,pa[j])
        k4 = KLM_RKM((i+1)*dT,R,q+dT*k3,pa[j])
        q = q +1/6.0*dT*(k1+2*k2+2*k3+k4)
        R = R+dT*q
        if(R>0):
            KLMDAT1.append(T / Tmax)
            KLMDAT2.append(R / R0)
        if(R>Rmax):
            Rmax = R
        if (abs(q)>c or R < 0):
            break
        T=T+dT
        i = i+1
wb.save('KLM.xlsx')
np.savetxt('KLM.csv',[KLMDAT1, KLMDAT2])
plt.plot(KLMDAT1, KLMDAT2)
plt.show()
You are plotting it wrong. Your first variable needs to be T/Tmax. So initialize an empty T list, append the T values to it, divide it by Tmax, and then plot twice: first KLMDAT1 and then KLMDAT2. The following pseudocode explains it:
KLMDAT1 = []
KLMDAT2 = []
T_list = [] # <--- Initialize T list here
for j in range(z):
    ...
    while (T<Tmax):
        ...
        T=T+dT
        T_list.append(T) # <--- Append T here
        i = i+1
# ... rest of the code
plt.plot(np.array(T_list)/Tmax, KLMDAT1) # <--- Changed here
plt.plot(np.array(T_list)/Tmax, KLMDAT2) # <--- Changed here
plt.show()
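Another way to get one curve per pressure is to keep a separate list for each pa value instead of appending everything to a single pair of lists, and to call plot once per curve. The sketch below is only an illustration under assumptions: toy_radius is a made-up stand-in for the Runge-Kutta result (KLM_RKM is not shown in the question), and every curve is assumed to share the same T/Tmax grid, which would not hold if the integration breaks off early for some pa.

```python
import csv
import io
import math

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def toy_radius(t_frac, pa):
    # Hypothetical stand-in for R/R0 from the integrator; not the real model.
    return 1.0 + (pa / 120000.0) * math.sin(math.pi * t_frac)

steps = 50
t_over_tmax = [i / steps for i in range(steps + 1)]
pressures = [120000 - 10000 * j for j in range(3)]

# One list of R/R0 values per pressure, keyed by pa.
curves = {pa: [toy_radius(t, pa) for t in t_over_tmax] for pa in pressures}

fig, ax = plt.subplots()
for pa, r_over_r0 in curves.items():
    ax.plot(t_over_tmax, r_over_r0, label=f"pa = {pa}")
ax.set_xlabel("T/Tmax")
ax.set_ylabel("R/R0")
ax.legend()
# Call plt.show() or fig.savefig(...) here to view or save the figure.

# Column 1 is T/Tmax, the remaining columns hold R/R0 for each pa.
header = ["T/Tmax"] + [f"pa={pa}" for pa in pressures]
rows = [[t] + [curves[pa][k] for pa in pressures]
        for k, t in enumerate(t_over_tmax)]
buffer = io.StringIO()  # swap for open("KLM.csv", "w", newline="") to write a file
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
```

A CSV laid out this way opens directly in Excel. If the curves end at different times, each pa would need its own T/Tmax list as well.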

Setting dictionary values while iterating through a 'for-loop'

I'm trying to create a nested dictionary with a set of values that are pulled from a for-loop, to measure growth and revenue amounts for various customer-product pairings. However, when I loop through a dataframe to set elements of the dictionary, each dictionary element ends up with the same values. What's going on, here?
I have already tried changing various elements of how the lists are built, but to no avail.
'''
TP_Name = customer name
Service_Level_1 = service name
100.2014 is just a marker to show that someone has started consuming the service
tpdict is already created with necessary nesting below with empty values at each endpoint
'''
for col in pivotdf.columns:
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row+.00001
            first = False
        if row == 0 and past <.0001 :
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current-past)/past
        growth = round(growth,4)
        growthlist.append(growth)
        past = row +.00001
        amountlist.append(row)
    tpdict[TP_Name][Service_Level_1]['growth'] = growthlist
    tpdict[TP_Name][Service_Level_1]['amount'] = amountlist
'''
problem: Each value ends up being the same thing
'''
Expected results:
{'CUSTOMER NAME': {'PRODUCT1': {'growth': [unique_growthlist], 'amount': [unique_amountlist]}, 'PRODUCT2': {'growth': [unique_growthlist],'amount': [unique_amountlist]}}}
A dictionary is a key value pair (as I am sure you may know). If you ever try to write to a dictionary with a key that already exists in the dictionary then the dictionary will overwrite the value for that key.
Example:
d = dict()
d[1] = 'a' # d = {1: 'a'}
d[1] = 'b' # d = {1: 'b'}
Your project seems like it may be a good use of a namedtuple in python.
A namedtuple is basically a light weight class/object.
My example code may be wrong because I don't know how your for loop is working (commenting helps everyone). That being said here is an example.
I make this recommendation only because dictionaries consume ~33% more memory than the objects they hold (though they are much faster).
from collections import namedtuple
Customer = namedtuple('Customer', 'name products')
Product = namedtuple('Product', 'growth amount')
customers = []
for col in pivotdf.columns:
    products = []
    growthlist = []
    amountlist = []
    first = True
    TP_Name, Service_Level_1 = col.split('___')
    for row in pivotdf[col]:
        if first == True:
            past = row + .00001
            first = False
        if row == 0 and past < .0001 :
            growth = 0
        elif row != 0 and past == .00001:
            growth = 100.2014
        else:
            current = row
            growth = (current - past) / past
        growth = round(growth, 4)
        growthlist.append(growth)
        past = row + .00001
        amountlist.append(row)
    cur_product = Product(growth=growthlist, amount=amountlist) # Create a new product
    products.append(cur_product) # Add that product to our customer
    # Create a new customer with our products
    cur_customer = Customer(name=TP_Name, products=products)
    customers.append(cur_customer) # Add our customer to our list of customers
Here customers is a list of Customer namedtuples that we can use as objects.
For example this is how we can print them out.
for customer in customers:
    print(customer.name, customer.products) # Print each name and their products
    for growth, amount in customer.products:
        print(growth, amount) # Print growth and amount for each product.
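Since the question does not show how tpdict was created, another possible cause is worth checking: if the nested structure was built by reusing one inner dictionary for every key (for example with dict.fromkeys and a mutable value), then every customer and product points at the same object and each assignment overwrites the one before. The sketch below uses made-up customer and service names to show the effect and a construction that avoids it.

```python
customers = ["ACME", "GLOBEX"]        # hypothetical customer names
services = ["PRODUCT1", "PRODUCT2"]   # hypothetical service names

# dict.fromkeys stores the SAME value object under every key.
leaf = {"growth": [], "amount": []}
shared = dict.fromkeys(customers, dict.fromkeys(services, leaf))
shared["ACME"]["PRODUCT1"]["growth"] = [0.1, 0.2]
print(shared["GLOBEX"]["PRODUCT2"]["growth"])  # the other endpoint changed too

# Nested comprehensions build a fresh dict at every level.
independent = {
    customer: {service: {"growth": [], "amount": []} for service in services}
    for customer in customers
}
independent["ACME"]["PRODUCT1"]["growth"] = [0.1, 0.2]
print(independent["GLOBEX"]["PRODUCT2"]["growth"])  # still empty
```

A quick check on the real data is tpdict[a][x] is tpdict[b][y] for two different endpoints: if it prints True, the endpoints are shared.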
