How to Split Multiple Arrays Simultaneously - python

I have 3 lists, all of the same length. One of the lists contains numbers representing a day, and the other two lists are data which correspond to that day, e.g.
day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4....]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4....] # (effectively random numbers)
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4....] # (again, effectively random numbers)
What I need to do is to take data1 and data2 for day 1, perform operations on it, and then repeat the process for day 2, day 3, day 4 and so on.
I currently have:
def sortfile(day, data1, data2):
    x = []
    y = []
    date = []
    temp1 = []
    temp2 = []
    i = 0
    for i in range(0, len(day) - 1):
        if day[i] == day[i + 1]:
            x.append(data1[i])
            y.append(data2[i])
            i += 1
            #print x, y
        else:
            for i in range(len(x)):
                temp1.append(x)
            for i in range(len(y)):
                temp2.append(y)
            date.append(day[i])
            x = []
            y = []
            i += 1
    while i != (len(epoch) - 1):
        x.append(data1[i])
        y.append(data2[i])
        i += 1
    date.append(day[i])
    return date, temp1, temp2
This is supposed to append to the x array whilst the day stays the same, and then if it changes append all the data from the x array to the temp1 array, then clear the x array. It will then perform operations on temp1 and temp2. However, when I run this as a check (I'm aware that I'm not clearing temp1 and temp2 at any point), temp1 just fills with the full list of days and temp2 is empty. I'm not sure why this is and am open to completely restarting if a better way is suggested!
Thanks in advance.

Just zip the lists:
x = []
y = []
date = []
temp1 = []
temp2 = []
day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
zipped = list(zip(day, data1, data2))  # zip returns an iterator in Python 3, so materialise it to allow indexing
for ind, (dy, dt1, dt2) in enumerate(zipped[:-1]):
    if zipped[ind + 1][0] == dy:
        x.append(dt1)
        y.append(dt2)
    else:
        temp1 += x
        temp2 += y
        x = []
        y = []
Not sure what your while loop is doing: it sits outside the for loop, and you don't actually return or use x and y, so that code seems irrelevant and may well be the reason your function is not returning what you expect.
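Note that the loop above stops one element early and only flushes when the day changes, so the last row of each day and the entire final day never make it into temp1/temp2. Here is a minimal sketch that handles those edge cases; it assumes you want one sub-list per day (the variable names just follow the question):
zipped = list(zip(day, data1, data2))
date, temp1, temp2 = [], [], []
x, y = [], []
for ind, (dy, dt1, dt2) in enumerate(zipped):
    x.append(dt1)
    y.append(dt2)
    # Flush when this is the last row overall or the day changes on the next row.
    if ind == len(zipped) - 1 or zipped[ind + 1][0] != dy:
        date.append(dy)
        temp1.append(x)
        temp2.append(y)
        x, y = [], []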

groupby and zip are a good fit for this problem. groupby lets you group runs of sorted data together, and zip lets you access the elements at each index of day, data1, and data2 together as a tuple.
from operator import itemgetter
from itertools import groupby

day = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4]
data1 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]
data2 = [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5,1,2,3,4]

x = []
y = []
for day_num, data in groupby(zip(day, data1, data2), itemgetter(0)):
    data = list(data)
    data1_total = sum(d[1] for d in data)
    x.append(data1_total)
    data2_total = sum(d[2] for d in data)
    y.append(data2_total)
itemgetter is just a function that tells groupby to group the tuple of elements by the first element in the tuple (the day value).
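If you want the raw per-day values rather than their sums, so you can run arbitrary operations on each day's data, the same pattern works; per_day here is just an illustrative name, not something from the question:
from operator import itemgetter
from itertools import groupby

per_day = {}
for day_num, rows in groupby(zip(day, data1, data2), itemgetter(0)):
    rows = list(rows)
    per_day[day_num] = ([r[1] for r in rows], [r[2] for r in rows])

# per_day[1] -> ([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])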

Another option is to use defaultdict and simply iterate over days adding data as we go:
from collections import defaultdict

d1 = defaultdict(list)
d2 = defaultdict(list)
for n, d in enumerate(day):
    d1[d].append(data1[n])
    d2[d].append(data2[n])
This creates two dicts like {day: [value1, value2...]...}. Note that this solution doesn't require days to be sorted.
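From there, your per-day operations are just a loop over the keys; sum() below is only a stand-in for whatever operation you actually need:
for d in sorted(d1):
    # d1[d] and d2[d] hold the data1/data2 values for day d
    print(d, sum(d1[d]), sum(d2[d]))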

Running several threads is similar to running several different programs concurrently, except that they share the same data. You could start a thread for each day's slice of the arrays.
Read more in the threading documentation.
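A minimal sketch of that idea, assuming process_day stands for whatever per-day operation you need (it is not from the question). Bear in mind that for CPU-bound work Python threads do not run in parallel because of the GIL, so this mainly helps if the per-day work is I/O-bound:
import threading

def process_day(d, values1, values2, results):
    # Placeholder operation: store the per-day sums (an assumption).
    results[d] = (sum(values1), sum(values2))

results = {}
threads = []
for d in sorted(d1):  # d1/d2 as built in the defaultdict example above
    t = threading.Thread(target=process_day, args=(d, d1[d], d2[d], results))
    t.start()
    threads.append(t)
for t in threads:
    t.join()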

Related

SQL Dictionary Appending for Large Datasets in a Loop in cx_Oracle

I am trying to append dictionaries together and then use "from_dict" to get the final returned data from cx_Oracle, as I heard that is more efficient than appending each returned row from SQL. However, my loop still takes a very long time (the final result is a VERY large dataset; each loop gets data for an I.D., which returns ~12,000 rows per I.D., and there are over 700 I.D.s in the loop). How do I take advantage of "from_dict" so this speeds up? I don't think the way the code is written now is the most efficient way to do this. Any suggestions? Thanks.
Is there a more efficient way? Using concat and not append?
for iteration, c in enumerate(l, start=1):
    total = len(l)
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    print("Getting consumption data for service point I.D.:", c, " ---->", iteration, "of", total)
    cursor.arraysize = 1000000
    cursor.prefetchrows = 2
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    df_row = cursor.fetchall()
    if len(df_row) == 0:
        pass
    else:
        # Here is where I combine dictionaries, but this only covers one dataset pulled
        # from SQL. I want to combine the dictionaries from every loop to increase efficiency.
        a = {k: [d[k] for d in df_row] for k in df_row[0]}
        AMI_data = pd.DataFrame.from_dict(a)
        #AMI.append(AMI_data)
        #final_AMI_data = pd.concat(AMI)
        # final_data.dropna(inplace = True)
# UPDATED
final_AMI_data = pd.DataFrame()
for iteration, c in enumerate(l, start=1):
    total = len(l)
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    print("Getting consumption data for service point I.D.:", c, " ---->", iteration, "of", total)
    cursor.arraysize = 1000000
    cursor.prefetchrows = 2
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    df_row = cursor.fetchall()
    if len(df_row) == 0:
        pass
    else:
        AMI_data = pd.DataFrame.from_records(df_row)
        final_AMI_data.append(AMI_data, ignore_index=False)
        # final_data.dropna(inplace = True)
You shouldn't need to re-create your dictionary if you've already set a dict-style cursor rowfactory. (By the way, see this answer for how to make a better one.)
Assuming your df_row looks like this after fetching all rows, with 'X' and 'Y' as example column names from the query result:
[{'X': 'xval1', 'Y': 'yval1'},
 {'X': 'xval2', 'Y': 'yval2'},
 {'X': 'xval3', 'Y': 'yval3'}]
1. Then use .from_records() to create your dataframe:
pd.DataFrame.from_records(df_rows)
Output:
X Y
0 xval1 yval1
1 xval2 yval2
2 xval3 yval3
That way, you don't need to restructure your results to use with from_dict().
2. And if you want to keep adding each group of 12,000 results to the same DataFrame, use DataFrame.append() with ignore_index=True; note that append returns a new dataframe rather than modifying the existing one in place, so assign the result back.
It's better to append into your dataframe than to build a bigger and bigger dictionary just to create one df at the end.
In case it wasn't clear, remove these two lines in your else:
a = {k: [d[k] for d in df_row] for k in df_row[0]}
AMI_data = pd.DataFrame.from_dict(a)
and replace them with just:
AMI_data = pd.DataFrame.from_records(df_row)
# and then to add it to your final (append returns a new frame, so assign it back):
final_AMI_data = final_AMI_data.append(AMI_data, ignore_index=True)
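Since you asked about concat: in recent pandas DataFrame.append is deprecated, and the usual advice is to collect the per-ID frames in a list and concatenate once at the end, rather than growing a dataframe inside the loop. A sketch of that pattern, reusing the names from your loop (so l, data, BEGIN_DATE, sql and cursor are assumed to be set up as in your question):
frames = []
for iteration, c in enumerate(l, start=1):
    data['SP_ID'] = c
    data['BEGIN_DATE'] = BEGIN_DATE
    cursor.execute(sql, data)
    cursor.rowfactory = lambda *args: dict(zip([d[0] for d in cursor.description], args))
    rows = cursor.fetchall()
    if rows:
        frames.append(pd.DataFrame.from_records(rows))

# One concat at the end instead of repeated appends inside the loop.
final_AMI_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()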

Is there a faster/better way to do an iterative list intersection?

I'm trying to get a list of credit for any given opportunity ('opp') exclusive to the maximal combinations of 'touches' in my test DataFrame.
So for every combination of touches, I get the intersection of all unique opps (intersect_list).
Then I remove an opp if it also exists in a superset (remove_common_elements).
I'm hitting a really long run time doing this with real data, but I can't figure out a better way to do it. The intersect_list step takes the longest by far (several hours for ~65k entries of real data, where any given nested list has < 10 elements).
I can't figure out how to use an apply, a list comprehension, or some other approach to speed this up.
import pandas as pd
from itertools import combinations

test = pd.DataFrame(data={'opp': ['a','a','a','b','b','c','a','b','c','d'],
                          'touch': ['z','y','x','y','z','x','z','y','x','z']})
touches = test['touch'].unique().tolist()

test_array_indices = list(range(0, len(touches)))
test_array_combos = []
for i in range(1, len(touches) + 1):
    test_array_combos.extend(combinations(test_array_indices, i))

def intersect_list(list1, list2):
    result = [i for i in list1 if i in list2]
    return result

def remove_common_elements(list1, list2):
    result = [i for i in list1 if i not in list2]
    return result

k = []
for i in range(0, len(touches)):
    k.append(test[test['touch'] == touches[i]]['opp'].unique().tolist())

test_opp_list = []
for j in range(0, len(test_array_combos)):
    t = []
    for i in range(0, len(test_array_combos[j])):
        t.append(k[test_array_combos[j][i]])
    test_opp_list.append(t)

list_t = []
for i in range(0, len(test_opp_list)):
    list_t.append([])
for i in range(0, len(test_opp_list)):
    list_t[i] = test_opp_list[i][0]
    for j in range(0, len(test_opp_list[i])):
        list_t[i] = intersect_list(list_t[i], test_opp_list[i][j])

for i in range(0, len(list_t) - 1):
    for j in range(1, len(list_t) - i):
        list_t[i] = remove_common_elements(list_t[i], list_t[i + j])
The best thing I found so far is to turn the list of lists into a list of sets. This speeds things up by about 4x for my tiny data set, but as I understand it, the difference grows rapidly with size.
test = pd.DataFrame(data={'opp': ['a','a','a','b','b','c','a','b','c','d'],
                          'touch': ['z','y','x','y','z','x','z','y','x','z']})
touches = test['touch'].unique().tolist()

k = []
for i in range(0, len(touches)):
    k.append(set(test[test['touch'] == touches[i]]['opp'].unique().tolist()))

test_array_indices = list(range(0, len(touches)))
test_array_combos = []
for i in range(1, len(touches) + 1):
    test_array_combos.extend(combinations(test_array_indices, i))

test_opp_list_sets = []
for j in range(0, len(test_array_combos)):
    t = []
    for i in range(0, len(test_array_combos[j])):
        t.append(k[test_array_combos[j][i]])
    test_opp_list_sets.append(t)

list_s = []
for i in range(0, len(test_opp_list_sets)):
    list_s.append([])
for i in range(0, len(test_opp_list_sets)):
    list_s[i] = test_opp_list_sets[i][0].intersection(*test_opp_list_sets[i])

for i in range(0, len(list_s) - 1):
    for j in range(1, len(list_s) - i):
        list_s[i] = list_s[i] - list_s[i + j]
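For what it's worth, the same idea can be written more compactly with set.intersection and set unions. This follows your stated intent of subtracting only supersets; the dict names are purely illustrative, and enumerating every combination is still exponential in the number of touches:
from itertools import combinations

# One set of opps per touch (same content as k above, keyed by touch name).
opps_by_touch = {t: set(test.loc[test['touch'] == t, 'opp']) for t in touches}

combos = [c for r in range(1, len(touches) + 1) for c in combinations(touches, r)]

# Intersection of opps for every combination of touches.
inter = {c: set.intersection(*(opps_by_touch[t] for t in c)) for c in combos}

# Keep only opps that do not also appear in any strictly larger combination.
exclusive = {}
for c in combos:
    larger = set().union(*(inter[d] for d in combos if set(c) < set(d)))
    exclusive[c] = inter[c] - larger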

Most efficient and most pythonic way to create a NumPy array within a loop

I'm currently trying to figure out the most efficient way to create a NumPy array in a loop; here are my examples:
import numpy as np
from time import time

tic = time()
my_list = range(1000000)
a = np.zeros((len(my_list),))
for i in my_list:
    a[i] = i
toc = time()
print(toc - tic)
vs
tic = time()
a = []
my_list = range(1000000)
for i in my_list:
    a.append(i)
a = np.array(a)
toc = time()
print(toc - tic)
I was expecting the second one to be much slower than the first, because of the new memory needed at each step of the for loop, but they take roughly the same time and I was wondering why, purely out of curiosity, since I can do it either way.
I actually want to build a simple NumPy array from data extracted from a dataframe, and my current approach looks quite messy. I was wondering if there is a more pythonic way to do it. I have this dataframe and a list of labels that I need, and the simplest idea would be to do the following (the value I need is the last one of each column):
vars_outputs = ["x1", "x2", "ratio_x1_x2"]
my_df = pd.read_excel(path)
outpts = np.array(my_df[vars_outputs][-1])
However this is not possible, because some of the labels I want are not directly available in the dataframe: for example ratio_x1_x2 needs to be computed from the first two columns. So I added a dict with the missing labels and the way to compute them (it's only ratios):
missing_labels = {"ratio_x1_x2" : ["x1", "x2"]}
and check the condition and create the numpy array (hence the previous question about efficiency)
outpts = []
for var in vars_outputs:
    if var in missing_labels.keys():
        outpts.append(my_df[missing_labels[var][0]][-1] / my_df[missing_labels[var][1]][-1])
    else:
        outpts.append(my_df[var][-1])
outpts = np.array(outpts)
It seems to me way too complicated but I cannot think of an easier way to do so (especially because I need to have this specific order in my numpy output array)
The other idea I have is to add columns to the dataframe with the operations I want, but since there are roughly 8000 labels I'm not sure that's the best approach, because I would have to look through all those labels after this preprocessing step.
Thanks a lot
Here is the final code; np.fromiter() does the trick and lets me cut the number of lines by using a list comprehension:
df = pd.read_excel(path)
print(df.columns)
It outputs ['x1', 'x2']
vars_outputs = ["x1", "x2", "ratio_x1_x2"]
missing_labels = {"ratio_x1_x2": ["x1", "x2"]}

it = [df[missing_labels[var][0]].iloc[-1] / df[missing_labels[var][1]].iloc[-1]
      if var in missing_labels else df[var].iloc[-1]
      for var in vars_outputs]
t = np.fromiter(it, dtype=float)
Thanks @hpaulj, that might be very useful for me in future. I wasn't aware of the speed-up from using fromiter():
import timeit

setup = '''
import numpy as np
H, W = 400, 400
it = [(1 + 1 / (i + 0.5)) ** 2 for i in range(W) for j in range(H)]'''

fns = ['''
x = np.array([[(1 + 1 / (i + 0.5)) ** 2 for i in range(W)] for j in range(H)])
''', '''
x = np.fromiter(it, float)
x = x.reshape(H, W)
''']

for f in fns:
    print(timeit.timeit(f, setup=setup, number=100))

# gives me
# 6.905218548999983
# 0.5763416080008028
EDIT/PS: your for loop could be replaced with a comprehension feeding an iterator, something like
it = [my_df[missing_labels[var][0]][-1]
      / my_df[missing_labels[var][1]][-1] if var in missing_labels
      else my_df[var][-1] for var in vars_outputs]
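One more hedged note: np.fromiter also accepts a count argument, which lets NumPy allocate the output array in one go instead of growing it. That is mainly useful when you feed it a genuinely lazy generator rather than a list whose length is already known:
import numpy as np

H, W = 400, 400
# Lazy generator: nothing is materialised as a Python list first.
gen = ((1 + 1 / (i + 0.5)) ** 2 for i in range(W) for j in range(H))
x = np.fromiter(gen, dtype=float, count=H * W).reshape(H, W)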

Getting data arrays from CSV with loops

I have a CSV that looks like this:
0.500187550,CPU1,7.93
0.500187550,CPU2,1.62
0.500187550,CPU3,7.93
0.500187550,CPU4,1.62
1.000445359,CPU1,9.96
1.000445359,CPU2,1.61
1.000445359,CPU3,9.96
1.000445359,CPU4,1.61
1.500674877,CPU1,9.94
1.500674877,CPU2,1.61
1.500674877,CPU3,9.94
1.500674877,CPU4,1.61
The first column is time, the second the CPU used and the third is energy.
As a final result I would like to have these arrays:
Time:
[0.500187550, 1.000445359, 1.500674877]
Energy (per CPU): e.g. CPU1
[7.93, 9.96, 9.94]
For parsing the CSV I'm using:
query = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
#Arrays global time and power:
for row in query:
x = row[0]
x = float(x)
x_array.append(x) #column 0 to array
y = row[2]
y = float(y)
y_array.append(y) #column 2 to array
print x_array
print y_array
This way I get all the data for time and energy into two arrays: x_array and y_array.
Then I order the arrays:
energy_core_ord_array = []
time_ord_array = []

# Dividing array into energy and time per core:
for i in range(number_cores[0]):
    e = 0 + i
    for j in range(len(x_array) / (int(number_cores[0]))):
        time_ord = x_array[e]
        time_ord_array.append(time_ord)
        energy_core_ord = y_array[e]
        energy_core_ord_array.append(energy_core_ord)
        e = e + int(number_cores[0])
And lastly, I cut the time array to the length it should have:
final_time_ord_array = []
for i in range(len(x_array) / (int(number_cores[0]))):
    final_time_ord = time_ord_array[i]
    final_time_ord_array.append(final_time_ord)
Till here, although the code is not elegant, it works.
The problem comes when I try to get the array for each core.
I get it for the first core, but when I try to iterate for the next one I don't know how to do it, nor how to store each core's array in its own separately named variable, for example.
final_energy_core_ord_array = []

# Trunk energy core array:
for i in range(len(x_array) / (int(number_cores[0]))):
    final_energy_core_ord = energy_core_ord_array[i]
    final_energy_core_ord_array.append(final_energy_core_ord)
Using pandas (a library for handling dataframes in Python) you can do something like this, which is much quicker than processing the CSV manually the way you're doing:
import pandas as pd

csvfile = "C:/Users/Simon/Desktop/test.csv"
data = pd.read_csv(csvfile, header=None, names=['time', 'cpu', 'energy'])

times = list(pd.unique(data.time.ravel()))
print times

cpuList = data.groupby(['cpu'])
cpuEnergy = {}
for i in range(len(cpuList)):
    curCPU = 'CPU' + str(i + 1)
    cpuEnergy[curCPU] = list(cpuList.get_group('CPU' + str(i + 1))['energy'])
for k, v in cpuEnergy.items():
    print k, v
that will give the following as output:
[0.50018755000000004, 1.000445359, 1.5006748769999998]
CPU4 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU2 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU3 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
CPU1 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
Finally I got an answer using globals()... not a great idea, but it works; I'll leave it here in case someone finds it useful.
final_energy_core_ord_array = []

# Trunk energy core array:
a = 0
for j in range(number_cores[0]):
    for i in range(len(x_array) / (int(number_cores[0]))):
        final_energy_core_ord = energy_core_ord_array[a + i]
        final_energy_core_ord_array.append(final_energy_core_ord)
    globals()['core%s' % j] = final_energy_core_ord_array
    final_energy_core_ord_array = []
    a = a + 12

print 'Final time and cores:'
print final_time_ord_array
for j in range(number_cores[0]):
    print globals()['core%s' % j]
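As a hedged alternative to globals(), a plain dict keyed by core name does the same job without polluting the global namespace. This sketch assumes energy_core_ord_array already holds the values grouped core by core, as produced by the ordering loop above (written Python-3 style):
cores = {}
step = len(energy_core_ord_array) // int(number_cores[0])
for j in range(int(number_cores[0])):
    # One slice of the grouped energy values per core.
    cores['core%s' % j] = energy_core_ord_array[j * step:(j + 1) * step]

for name, values in sorted(cores.items()):
    print(name, values)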

How to reduce a collection of ranges to a minimal set of ranges [duplicate]

This question already has answers here: Union of multiple ranges (5 answers). Closed 7 years ago.
I'm trying to remove overlapping values from a collection of ranges.
The ranges are represented by a string like this:
499-505 100-115 80-119 113-140 500-550
I want the above to be reduced to two ranges: 80-140 499-550. That covers all the values without overlap.
Currently I have the following code.
cr = "100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250".split(" ")
ar = []
br = []
for i in cr:
    (left, right) = i.split("-")
    ar.append(left)
    br.append(right)

inc = 0
for f in br:
    i = int(f)
    vac = []
    jnc = 0
    for g in ar:
        j = int(g)
        if (i >= j):
            vac.append(j)
            del br[jnc]
        jnc += jnc
    print vac
    inc += inc
I split the array by - and store the range limits in ar and br. I iterate over these limits pairwise and if the i is at least as great as the j, I want to delete the element. But the program doesn't work. I expect it to produce this result: 80-125 500-550 200-250 180-185
For a quick and short solution,
from operator import itemgetter
from itertools import groupby

cr = "499-505 100-115 80-119 113-140 500-550".split(" ")
fullNumbers = []
for i in cr:
    a = int(i.split("-")[0])
    b = int(i.split("-")[1])
    fullNumbers += range(a, b + 1)

# Remove duplicates and sort
fullNumbers = sorted(list(set(fullNumbers)))

# Taken from http://stackoverflow.com/questions/2154249
def convertToRanges(data):
    result = []
    for k, g in groupby(enumerate(data), lambda (i, x): i - x):
        group = map(itemgetter(1), g)
        result.append(str(group[0]) + "-" + str(group[-1]))
    return result

print convertToRanges(fullNumbers)
# Output: ['80-140', '499-550']
For the given set in your program, output is ['80-125', '180-185', '200-250', '500-550']
The main possible drawback of this solution: it may not scale well, since it expands every range into the full list of numbers it covers.
Let me offer another solution that doesn't take time proportional to the sum of the range sizes. Aside from the sort, its running time is linear in the number of ranges.
def reduce(range_text):
    parts = range_text.split()
    if parts == []:
        return ''
    ranges = [tuple(map(int, part.split('-'))) for part in parts]
    ranges.sort()
    new_ranges = []
    left, right = ranges[0]
    for range in ranges[1:]:
        next_left, next_right = range
        if right + 1 < next_left:             # Is the next range to the right?
            new_ranges.append((left, right))  # Close the current range.
            left, right = range               # Start a new range.
        else:
            right = max(right, next_right)    # Extend the current range.
    new_ranges.append((left, right))          # Close the last range.
    return ' '.join(['-'.join(map(str, range)) for range in new_ranges])
This function works by sorting the ranges, then looking at them in order and merging consecutive ranges that intersect.
Examples:
print(reduce('499-505 100-115 80-119 113-140 500-550'))
# => 80-140 499-550
print(reduce('100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250'))
# => 80-125 180-185 200-250 500-550
