Hello, I am trying to take a CSV file and iterate over each customer's data. Each customer has data for 12 months. I want to analyze each customer's yearly data, save the correlations to a new list, and repeat this until all customers have been analyzed.
For instance, here is what a customer's data might look like (simplified case):
I have been able to get this to work to generate correlations from a CSV of one customer's data. However, there are thousands of customers in my datasheet. I want to use a nested for loop to get all of the correlation values for each customer into a list/array. Each row of the list would hold one customer's correlations, and the next row would hold the next customer's.
Here is my current code:
import numpy
from numpy import genfromtxt
overalldata = genfromtxt('C:\Users\User V\Desktop\CUSTDATA.csv', delimiter=',')
emptylist = []
overalldatasubtract = overalldata[13::]
#This is where I try to use the for loop to go through all the customers. I don't know if len will give me all the rows or the number of columns.
for x in range(0,len(overalldata),11):
    for x in range(0,13,1):
        cust_months = overalldata[0:x,1]
        cust_balancenormal = overalldata[0:x,16]
        cust_demo_one = overalldata[0:x,2]
        cust_demo_two = overalldata[0:x,3]
        num_acct_A = overalldata[0:x,4]
        num_acct_B = overalldata[0:x,5]
    #Correlation Calculations
    demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
    demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
    demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
    demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
    demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
    demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
    result_correlation = [demo_one_corr_balance, demo_two_corr_balance, demo_one_corr_acct_a, demo_one_corr_acct_b, demo_two_corr_acct_a, demo_two_corr_acct_b]
    result_correlation_combined = emptylist.append(result_correlation)
    #This is where I try to delete the rows I have already analyzed.
    overalldata = overalldata[11**x::]
print result_correlation_combined
print overalldatasubtract
It seemed that my subtraction method was working, but when I tried it with my larger data set, I realized my method is totally wrong.
Would you do this a different way? I think that it can work, but I cannot find my mistake.
You use the same variable x for both loops. In the second loop x goes from 0 to 12 regardless of the customer, and since the row number is set only by x, you are stuck on the first customer.
Your double loop should look like this instead:
# loop over the customers; x_customer is the first row of each customer's block
for x_customer in range(0,len(overalldata),12):
    # loop over the months
    for x_month in range(0,12,1):
        # line number: x (x_customer already advances in steps of 12)
        x = x_customer + x_month
I changed the bounds and steps of the loops because:
loop 1: there are 12 months, so 12 lines per customer -> step = 12
loop 2: there are 12 months, so the month number ranges from 0 to 11 -> range(0,12,1)
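A minimal sketch of the block-wise idea above, assuming each customer occupies exactly 12 consecutive rows. The function name `per_customer_correlations`, the `pairs` argument, and the random demo array are illustrative assumptions, not part of the original code; the real column indices would come from the CSV.

```python
import numpy as np

def per_customer_correlations(data, pairs, rows_per_customer=12):
    """Return one row of correlation coefficients per customer block."""
    results = []
    for start in range(0, len(data), rows_per_customer):
        # Slice out this customer's 12 rows instead of deleting rows as we go
        block = data[start:start + rows_per_customer]
        if len(block) < rows_per_customer:
            break  # ignore a trailing partial block
        results.append(
            [np.corrcoef(block[:, a], block[:, b])[0, 1] for a, b in pairs]
        )
    return np.array(results)

# Made-up data: 3 customers x 12 months, 6 columns
rng = np.random.default_rng(0)
demo = rng.normal(size=(36, 6))
pairs = [(0, 1), (0, 2)]  # hypothetical column pairs to correlate
corr = per_customer_correlations(demo, pairs)
```

Slicing by `start` leaves `overalldata` untouched, so there is no need to remove already-analyzed rows inside the loop.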
This is how I solved the problem. It was a problem with the placement of my for loops: a simple indentation problem. Thank you to the poster above for the help.
for x_customer in range(0,len(overalldata),12):
    for x in range(0,13,1):
        cust_months = overalldata[0:x,1]
        cust_balancenormal = overalldata[0:x,16]
        cust_demo_one = overalldata[0:x,2]
        cust_demo_two = overalldata[0:x,3]
        num_acct_A = overalldata[0:x,4]
        num_acct_B = overalldata[0:x,5]
    #Correlation Calculations
    demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
    demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
    demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
    demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
    demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
    demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
    result_correlation = [(demo_one_corr_balance),(demo_two_corr_balance),(demo_one_corr_acct_a),(demo_one_corr_acct_b),(demo_two_corr_acct_a),(demo_two_corr_acct_b)]
    numpy.savetxt('correlationoutput.csv', (result_correlation))
    result_correlation_combined = emptylist.append([result_correlation])
    cust_delete_list = [0,(x_customer),1]
    overalldata = numpy.delete(overalldata, (cust_delete_list), axis=0)
I am writing Python code with a condition; while it is true, I want the calculations to run and update the dataframe columns. However, I am noticing that the dataframe is not getting updated and all the values are those of the first iteration only. Can an expert point out where I am going wrong? Below is my sample code:
mbd_out_ub2 = mbd_out_ub1
mbd_out_ub2_len = len(mbd_out_ub2)
plt_mbd_c1_all = pd.DataFrame()
brd2c2_all = pd.DataFrame()
### plt_mbd_c >> this is the data frame with data before the loop starts
plt_mbd_c0 = plt_mbd_c.copy()
plt_mbd_c0 = plt_mbd_c0[plt_mbd_c0['UB_OUT']==1]
while (iterc < 10):
    plt_mbd_c1 = plt_mbd_c0.copy()
    brd2c2 = plt_mbd_c1.groupby('KEY1')['NEST_VAL_PER'].agg([('KEY1_CNT','count'),('PER1c', lambda x: x.quantile(0.75))]).reset_index()
    brd2c2_all = brd2c2_all.append(brd2c2).reset_index(drop=True)
    plt_mbd_c1 = pd.merge(plt_mbd_c1,brd2c2[['KEY1','PER1c']],on='KEY1', how='left')
    del brd2c2, plt_mbd_c0
    plt_mbd_c1['NEST_VAL_PER1'] = plt_mbd_c1['PER1c'] * (plt_mbd_c1['EVAL_LP_%'] / 100)
    plt_mbd_c1['NEST_VAL_PER1'] = np.where((plt_mbd_c1['BRD_OUT_FLAG'] == 0),plt_mbd_c1['NEST_VAL'],plt_mbd_c1['NEST_VAL_PER1'] )
    plt_mbd_c1['SALESC'] = plt_mbd_c1['NEST_VAL_PER1']/plt_mbd_c1['PROJR']/plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['C_SALES_C'] = np.where(plt_mbd_c1['OUT_FLAG'] == 1,plt_mbd_c1['SALESC'],plt_mbd_c1['SALESUNIT'])
    plt_mbd_c1['NEST_VAL_PER'] = plt_mbd_c1['C_SALES_C'] * plt_mbd_c1['PROJR'] * plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['ITER'] = iterc
    plt_mbd_c1_all = plt_mbd_c1_all.append(plt_mbd_c1).reset_index(drop=True)
    plt_mbd_c0 = plt_mbd_c1.copy()
    del plt_mbd_c1
    print("iter = ",iterc)
    iterc = iterc + 1
So above I want to take the 75th percentile of a column by KEY1 and do a few calculations. The idea is that after every iteration my 75th percentile will keep decreasing, since I am updating the same column with a calculated value that is lower than the current value (because it is based on the 75th percentile). However, when I check, the values for all iterations are the same as in the first iteration. I have tried deleting the data frames, saving to a temporary data frame, and copying the dataframe, but none of these seems to work.
Please help!
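A heavily simplified sketch of the iteration described above, assuming only the `KEY1` and `NEST_VAL_PER` columns. The function name `iterate_percentile_cap` is hypothetical, and capping each value at its group's 75th percentile with `np.minimum` is a stand-in for the real calculations, which are not reproduced here. The point it illustrates is that the percentile is recomputed from the current values on every pass, and no stale percentile column is carried into the next pass.

```python
import numpy as np
import pandas as pd

def iterate_percentile_cap(df, n_iter=3):
    """Recompute the per-KEY1 75th percentile each pass and keep a snapshot."""
    current = df.copy()
    snapshots = []
    for it in range(n_iter):
        # Percentile of the *current* column, broadcast back to every row
        p75 = current.groupby("KEY1")["NEST_VAL_PER"].transform(
            lambda s: s.quantile(0.75)
        )
        # Stand-in update: cap each value at its group's 75th percentile
        current["NEST_VAL_PER"] = np.minimum(current["NEST_VAL_PER"], p75)
        snap = current.copy()
        snap["ITER"] = it
        snapshots.append(snap)
    # pd.concat replaces repeated DataFrame.append calls
    return pd.concat(snapshots, ignore_index=True)

demo = pd.DataFrame({"KEY1": ["a"] * 4, "NEST_VAL_PER": [1.0, 2.0, 3.0, 4.0]})
out = iterate_percentile_cap(demo, n_iter=3)
maxes = out.groupby("ITER")["NEST_VAL_PER"].max().tolist()
```

Using `transform` avoids the merge altogether, so there is no leftover percentile column to collide with on the next pass.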
I've written code that takes a number of n-grams, a specific index, and a threshold, and returns the values that fall within that threshold. However, it currently only compares one set of tokens (at a given index) to the sets of tokens at all other indices. I want to compare the set of tokens at every index to the set of tokens at every other index. I don't think this is a difficult question, but python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since the function would be iterating over all indices.
Let me know if that isn't clear enough and I will think more about how to say this; it is difficult because the thing I am asking about is the thing I am struggling with.
import py_stringmatching as sm
import pandas as pd
import numpy as np
data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
jac = sm.Jaccard()
def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval = num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index] #index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            #companies_id2index must be equal to token number
            ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0 #if not equal to index
    ph['prediction'] = predicted_label
    return ph.query('prediction==1')
predict_label(ph, 1, .5)
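The all-pairs comparison described in the question could be sketched roughly as follows. To keep the example self-contained it swaps in a hand-rolled q-gram tokenizer and Jaccard score instead of `py_stringmatching`; the helper names `qgrams`, `jaccard`, and `similar_pairs` are assumptions, and the tokenizer here lowercases the text and adds no padding, so scores may differ from those of `sm.QgramTokenizer`.

```python
from itertools import combinations

def qgrams(text, q=3):
    """Return the set of character q-grams of text (lowercased, unpadded)."""
    text = text.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similar_pairs(names, q=3, thresh=0.5):
    """Compare every name with every other name exactly once."""
    tokens = [qgrams(name, q) for name in names]
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        sim = jaccard(tokens[i], tokens[j])
        if sim > thresh:
            pairs.append((names[i], names[j], sim))
    return pairs

names = ["Time", "NY Times", "Atlantic"]
matches = similar_pairs(names, q=2, thresh=0.4)
```

`itertools.combinations` yields each unordered pair of indices once, which replaces the single fixed `index` argument with a loop over all indices.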
I'm having a problem manipulating different types. Here is what I'm doing:
for x in a:
    decompose_result_mult = seasonal_decompose(analysis, model="additive")
    date = decompose_result_mult.trend.to_frame()
    trend = decompose_result_mult.trend
    trend = decompose_result_mult.trend.to_frame()
    seasonal = decompose_result_mult.seasonal
    seasonal = decompose_result_mult.seasonal.to_frame()
    residual = decompose_result_mult.resid
    residual = decompose_result_mult.resid.to_frame()
    dict= {'Date':date,'region':x,'Trend':trend,'Seasonal':seasonal,'Residual':residual,'Sales':observed}
The problem is that when I run my code I get the following result:
It is a dataframe that has Series inside it. Any help? I would like to have one value per row, not one list per row.
Thank you!
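One way to get one value per row, sketched under the assumption that the trend, seasonal, residual, and observed components are pandas Series sharing the same date index. The function name `components_to_frame` and the small made-up Series are illustrative stand-ins for the output of `seasonal_decompose`, which is not called here.

```python
import pandas as pd

def components_to_frame(region, trend, seasonal, resid, observed):
    """Build a flat frame: one row per date, one scalar per cell."""
    # Passing Series (not frames) as dict values aligns them on the shared index
    frame = pd.DataFrame(
        {"Trend": trend, "Seasonal": seasonal, "Residual": resid, "Sales": observed}
    )
    frame.insert(0, "region", region)  # scalar is broadcast to every row
    return frame.rename_axis("Date").reset_index()

# Made-up stand-ins for the decomposition output of one region
idx = pd.date_range("2020-01-01", periods=3, freq="D")
trend = pd.Series([1.0, 2.0, 3.0], index=idx)
seasonal = pd.Series([0.1, -0.1, 0.0], index=idx)
resid = pd.Series([0.0, 0.2, -0.2], index=idx)
observed = trend + seasonal + resid

flat = components_to_frame("north", trend, seasonal, resid, observed)
```

Inside the loop over regions, each such frame could be collected in a list and combined at the end with `pd.concat`.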
I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns the list in ascending order, whereas I simply want the existing order preserved with duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads; I'm still learning! And yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best data structures to house data in, such that they can be recalled and used later (plus it is quite useful to name the columns without affecting the data contained in the dataframe).
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0,j] == 0) :
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0,j] = flUniqueCalendarMarker
            P1.iloc[i,j] = flUniqueCalendarMarker
            R.iloc[0,j] = i
            for k in list(range(j+1,intNbCalendarDays)):
                if (C.iloc[0,k] == 0 and P.iloc[i,k] != 0):
                    C.iloc[0,k] = flUniqueCalendarMarker
                    P1.iloc[i,k] = flUniqueCalendarMarker
                    R.iloc[0,k] = i
        elif (C.iloc[0,j] != 0 and P.iloc[i,j] != 0):
            P1.iloc[i,j] = C.iloc[0,j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can try DataFrame.drop_duplicates().
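As a small illustration of order-preserving de-duplication, here is a sketch that assumes C is a single-row frame of prices, as in the question. The values below are made up. `pd.unique` returns values in order of first appearance, unlike `np.unique`, which sorts; the second variant shows how `np.unique` can still be used by recovering the first-seen positions.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame standing in for C (values are made up)
C = pd.DataFrame([[50, 52, 52, 46, 55, 46, 48]])

# Flatten to 1-D, then keep the first occurrence of each value in order
values = C.to_numpy().ravel()
C1 = pd.DataFrame(pd.unique(values))

# NumPy-only route: take the index of each first occurrence and re-sort by it
_, first_seen = np.unique(values, return_index=True)
C1_numpy = pd.DataFrame(values[np.sort(first_seen)])
```

Note that `drop_duplicates` works on rows, so with a one-row frame it would need the transpose (`C.T.drop_duplicates()`) to have any effect.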
I have a dataframe of 600 000 x/y points with date-time information, along with another field, 'status', that holds extra descriptive information.
My objective is, for each record:
sum column 'status' over the records that are within a certain spatial-temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas data frame.
I could loop through the rows and, for each record, subset the dates of interest, then calculate distances and restrict the selection further. However, that would still be quite slow with so many records.
This takes 4.4 hours to run.
I can see that I could create a 3-dimensional kdtree with x, y, and date as epoch time. However, I am not certain how to restrict the distances properly when incorporating both dates and geographic distances.
Here is some reproducible code for you guys to test on:
import numpy.random as npr
import numpy as np
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long lat data
        laty = npr.normal(4815862, 5000,size=len(date))
        longx = npr.normal(687993, 5000,size=len(date))
        # status of interest
        status = [0,1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0,high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally','derik','james','bob','ryan','chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0,high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns = ['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    #loop through data index's
    for i in range(0, len(df)):
        l = []
        #first we will filter out the data by date to have a smaller list to compute distances for
        #create a mask to query all dates between range for date i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        #create a mask to query all users who are not user i (themselves)
        user_mask = df['user']!=df['user'].iloc[i]
        #apply masks
        dists_to_check = df[date_mask & user_mask]
        #for point i, create coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        #create array of distances to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        #for j in the date queried data
        for j in range(1, len(dists_to_check)):
            #compute the euclidean distance between point a and each point of b (the date masked data)
            x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))
            #if the distance is within our range of interest append the index to a list
            if x <=100:
                l.append(j)
        try:
            #use the list of desired index's 'l' to query a final subset of the data
            data = dists_to_check.iloc[l]
            #summarize the column of interest then append to output list
            output.append(data['status'].sum())
        except IndexError:
            #print "There were no data to add"
            output.append(0)
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way, or should I be chasing another technique?
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython:
from IPython.parallel import Client
cli = Client()
dview = cli[:]
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd
#We also need to add the time deltas and output list into the function as
#local variables as well as add the Ipython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested for anyone to suggest small speed improvements within the body of the function.
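As one possible speed-up inside the function body, here is a sketch that replaces the per-row boolean date masks with `np.searchsorted` on a date-sorted copy and computes all distances for a window in a single vectorized call. The function name `buffered_status_sum` and the `status_sum` output column are assumptions, and the four-row demo frame is made up. Note that the result is returned in date order rather than the input order.

```python
import numpy as np
import pandas as pd

def buffered_status_sum(df, before=pd.Timedelta(hours=8),
                        after=pd.Timedelta(minutes=1), radius=100.0):
    """Sum 'status' of other users' points within the time window and radius."""
    df = df.sort_values("date").reset_index(drop=True)
    dates = df["date"].to_numpy()
    xy = df[["long", "lat"]].to_numpy(dtype=float)
    users = df["user"].to_numpy()
    status = df["status"].to_numpy()
    # Sorted dates let us find each row's time window by binary search
    lo = np.searchsorted(dates, dates - before.to_timedelta64(), side="left")
    hi = np.searchsorted(dates, dates + after.to_timedelta64(), side="right")
    out = np.zeros(len(df), dtype=status.dtype)
    for i in range(len(df)):
        window = slice(lo[i], hi[i])
        # Euclidean distance from point i to every point in the window at once
        dist = np.hypot(*(xy[window] - xy[i]).T)
        keep = (dist <= radius) & (users[window] != users[i])
        out[i] = status[window][keep].sum()
    return df.assign(status_sum=out)

demo = pd.DataFrame({
    "user": ["A", "B", "C", "A"],
    "status": [1, 1, 1, 1],
    "date": pd.to_datetime(["2012-10-01 00:00", "2012-10-01 01:00",
                            "2012-10-01 02:00", "2012-10-01 12:00"]),
    "long": [0.0, 50.0, 500.0, 0.0],
    "lat": [0.0, 0.0, 0.0, 0.0],
})
result = buffered_status_sum(demo)
```

The outer loop remains, so this could still be combined with the parallel approach above; the gain comes from removing the inner Python loop and the full-frame mask on every row.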