Sort (and sorted) not sorting - python

I have a data file that's built the following way:
source_id, target_id, impressions, clicks
on which I add the following columns:
pair - a tuple of the source and target
CTR - basically clicks/impressions
Lower Bound
Upper Bound
Lower/Upper bound are calculated values (it's irrelevant to my question, but for the curious ones these are the bounds for the Wilson confidence interval.
The thing is, I'm trying to sort the list by the lower bound (position = 6), descending. Tried several things (sort/sorted, using lambda vs. using itemgetter, creating a new list w/o the header and try to sort just that) and still it appears nothing changes. I have the code below.
import csv
from math import sqrt
from operator import itemgetter
#----- Read CSV ----------------------------------------------------------------
raw_data_csv = open('rawdile', "rb")
raw_reader = csv.reader(raw_data_csv)
# transform the values to ints.
raw_data = []
for rownum,row in enumerate(list(raw_reader)):
if rownum == 0: # Header
raw_data.append(row)
else:
r = [] # Col header
r.extend([int(x) for x in row]) # Transforming the values to ints
raw_data.append(r)
# Add cols for pairs (as tuple) and CTR
raw_data[0].append("pair")
for row in raw_data[1:]:
row.append((row[0],row[1])) # tuple
# row.append(float(row[3])/row[2]) # CTR
# ------------------------------------------------------------------------------
z = 1.95996398454005
def confidence(n, clicks):
if n == 0:
return 0
phat = float(clicks) / n
l_bound = ((phat + z*z/(2*n) - z * sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)) # lower bound
u_bound = ((phat + z*z/(2*n) + z * sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)) # upper bound
return phat, l_bound, u_bound
raw_data[0].extend(["CTR","Lower Bound","Upper Bound"])
for row in raw_data[1:]:
phat, l_bound, u_bound = confidence(row[2],row[3])
row.extend([phat, l_bound, u_bound])
# raw_data[1:].sort(key=lambda x: x[6], reverse=True)
sorted(raw_data[1:], key=itemgetter(6), reverse=True)
outputfile= open('outputfile.csv', 'wb')
wr = csv.writer(outputfile,quoting = csv.QUOTE_ALL)
wr.writerows(raw_data)
raw_data_csv.close()
outputfile.close()
Can anybody tell why?
Thanks!

You are sorting a slice in one attempt (which creates a new list object), and in your other attempt you ignore the return value of sorted().
You cannot sort part of a list like that; create a new list by concatenating instead:
rows = rows[:1] + sorted(raw_data[1:], key=itemgetter(6), reverse=True)

Related

Grouping pairs of combination data based on given condition

Suppose I have a huge array of data and sample of them are :
x= [ 511.31, 512.24, 571.77, 588.35, 657.08, 665.49, -1043.45, -1036.56,-969.39, -955.33]
I used the following code to generate all possible pairs
Pairs=[(x[i],x[j]) for i in range(len(x)) for j in range(i+1, len(x))]
Which gave me all possible pairs. Now, I would like to group these pairs if they are within threshold values of -25 or +25 and label them accordingly.
Any idea or advice on how to do this? Thanks in advance
If I understood correctly your problem, the code below should do the trick. The idea is to generate a dictionary whose keys are the mean value, and just keep appending data onto it:
import numpy as np #I use numpy for the mean.
#Your threshold
threshold = 25
#A dictionary will hold the relevant pairs
mylist = {}
for i in Pairs:
#Check for the threshold and discard otherwise
diff = abs(i[1]-i[0])
if(diff < threshold):
#Name of the entry in the dictionary
entry = str('%d'%int(np.mean(i)))
#If the entry already exists, append. Otherwise, create a container list
if(entry in mylist):
mylist[entry].append(i)
else:
mylist[entry] = [i]
which results in the following output:
{'-1040': [(-1043.45, -1036.56)],
'-962': [(-969.39, -955.33)],
'511': [(511.1, 511.31),
(511.1, 512.24),
(511.1, 512.35),
(511.31, 512.24),
(511.31, 512.35)],
'512': [(511.1, 513.35),
(511.31, 513.35),
(512.24, 512.35),
(512.24, 513.35),
(512.35, 513.35)],
'580': [(571.77, 588.35)],
'661': [(657.08, 665.49)]}
This should be a fast way to do that:
import numpy as np
from scipy.spatial.distance import pdist
# Input data
x = np.array([511.31, 512.24, 571.77, 588.35, 657.08,
665.49, -1043.45, -1036.56,-969.39, -955.33])
thres = 25.0
# Compute pairwise distances
# default distance metric is'euclidean' which
# would be equivalent but more expensive to compute
d = pdist(x[:, np.newaxis], 'cityblock')
# Find distances within threshold
d_idx = np.where(d <= thres)[0]
# Convert "condensed" distance indices to pair of indices
r = np.arange(len(x))
c = np.zeros_like(r, dtype=np.int32)
np.cumsum(r[:0:-1], out=c[1:])
i = np.searchsorted(c[1:], d_idx, side='right')
j = d_idx - c[i] + r[i] + 1
# Get pairs of values
v_i = x[i]
v_j = x[j]
# Find means
m = np.round((v_i + v_j) / 2).astype(np.int32)
# Print result
for idx in range(len(m)):
print(f'{m[idx]}: ({v_i[idx]}, {v_j[idx]})')
Output
512: (511.31, 512.24)
580: (571.77, 588.35)
661: (657.08, 665.49)
-1040: (-1043.45, -1036.56)
-962: (-969.39, -955.33)

Create "synthetic points"

I need to create inside a python routine, something that I am calling "synthetic points".
I have a series of data which vary between -1 and 1, however, when I put this data on a chart, they form a trapezoidal chart.
What I would like to do is create points where the same x-axis value, could take two y-axis values, and then, this will create a chart with
straight lines making a "rectangular chart"
An example the format data that I have:
0;-1
1;-1
2;-1
3;-1
4;-1
5;-1
6;-1
7;1
8;1
9;1
10;1
11;1
12;1
13;1
14;1
15;1
16;-1
17;-1
18;-1
19;-1
20;-1
For example, in this case, I would need the data assume the following format:
0;-1
1;-1
2;-1
3;-1
4;-1
5;-1
6;-1
6;1 (point 6 with two values)
7;1
8;1
9;1
10;1
11;1
12;1
13;1
14;1
15;1
15;-1 (point 15 with two values)
16;-1
17;-1
18;-1
19;-1
20;-1
So what you need to do is, always when I had a value change, this will create a new point. This makes the graph, rectangular, as the only possible values for the y variable are -1 and 1.
The code I need to enter is below. What was done next was just to put the input data in this format of -1 and 1.
arq = open('vazdif.out', 'rt')
list = []
i = 0
for row in arq:
field = row.split(';')
vaz = float(field[2])
if vaz < 0:
list.append("-1")
elif vaz > 0:
list.append("1")
n = len(list)
fou = open('res_id.out','wt')
for i in range(n):
fou.write('{};{}\n'.format(i,list[i]))
fou.close
Thank you for your help
P.s. English is not my first language, forgive my mistakes on write or on the code.
I added a new value prev_value, if the previous value is of the opposite sign (multiply with the current value < 0), it adds an extra index to the list.
I think the field[1] and field[2] are probably wrong, but I'll trust your code works so far. Similar with fou, I would replace with with open ...
arq = open('vazdif.out', 'rt')
list = []
i = 0
prev_value = 0
for row in arq:
field = row.split(';')
xxx = int(field[1])
vaz = float(field[2])
if vaz * prev_value < 0:
list.append([list[-1][0], - list[-1][1]])
if vaz < 0:
list.append([xxx, -1])
else:
list.append([xxx, 1])
prev_val = vaz
fou = open('res_id.out','wt')
for i in list:
fou.write(f'{i[0]};{i[1]}\n')
fou.close

How to reduce a collection of ranges to a minimal set of ranges [duplicate]

This question already has answers here:
Union of multiple ranges
(5 answers)
Closed 7 years ago.
I'm trying to remove overlapping values from a collection of ranges.
The ranges are represented by a string like this:
499-505 100-115 80-119 113-140 500-550
I want the above to be reduced to two ranges: 80-140 499-550. That covers all the values without overlap.
Currently I have the following code.
cr = "100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250".split(" ")
ar = []
br = []
for i in cr:
(left,right) = i.split("-")
ar.append(left);
br.append(right);
inc = 0
for f in br:
i = int(f)
vac = []
jnc = 0
for g in ar:
j = int(g)
if(i >= j):
vac.append(j)
del br[jnc]
jnc += jnc
print vac
inc += inc
I split the array by - and store the range limits in ar and br. I iterate over these limits pairwise and if the i is at least as great as the j, I want to delete the element. But the program doesn't work. I expect it to produce this result: 80-125 500-550 200-250 180-185
For a quick and short solution,
from operator import itemgetter
from itertools import groupby
cr = "499-505 100-115 80-119 113-140 500-550".split(" ")
fullNumbers = []
for i in cr:
a = int(i.split("-")[0])
b = int(i.split("-")[1])
fullNumbers+=range(a,b+1)
# Remove duplicates and sort it
fullNumbers = sorted(list(set(fullNumbers)))
# Taken From http://stackoverflow.com/questions/2154249
def convertToRanges(data):
result = []
for k, g in groupby(enumerate(data), lambda (i,x):i-x):
group = map(itemgetter(1), g)
result.append(str(group[0])+"-"+str(group[-1]))
return result
print convertToRanges(fullNumbers)
#Output: ['80-140', '499-550']
For the given set in your program, output is ['80-125', '180-185', '200-250', '500-550']
Main Possible drawback of this solution: This may not be scalable!
Let me offer another solution that doesn't take time linearly proportional to the sum of the range sizes. Its running time is linearly proportional to the number of ranges.
def reduce(range_text):
parts = range_text.split()
if parts == []:
return ''
ranges = [ tuple(map(int, part.split('-'))) for part in parts ]
ranges.sort()
new_ranges = []
left, right = ranges[0]
for range in ranges[1:]:
next_left, next_right = range
if right + 1 < next_left: # Is the next range to the right?
new_ranges.append((left, right)) # Close the current range.
left, right = range # Start a new range.
else:
right = max(right, next_right) # Extend the current range.
new_ranges.append((left, right)) # Close the last range.
return ' '.join([ '-'.join(map(str, range)) for range in new_ranges ]
This function works by sorting the ranges, then looking at them in order and merging consecutive ranges that intersect.
Examples:
print(reduce('499-505 100-115 80-119 113-140 500-550'))
# => 80-140 499-550
print(reduce('100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250'))
# => 80-125 180-185 200-250 500-550

Get percentile points from a huge list

I have a huge list (45M+ data poitns), with numerical values:
[78,0,5,150,9000,5,......,25,9,78422...]
I can easily get the maximum and minimum values, the number of these values, and the sum of them:
file_handle=open('huge_data_file.txt','r')
sum_values=0
min_value=None
max_value=None
for i,line in enumerate(file_handle):
value=int(line[:-1])
if min_value==None or value<min_value:
min_value=value
if max_value==None or value>max_value:
max_value=value
sum_values+=value
average_value=float(sum_values)/i
However, this is not what I need. I need a list of 10 numbers, where the number of data points between each two consecutive points is equal, for example
median points [0,30,120,325,912,1570,2522,5002,7025,78422]
and we have the number of data points between 0 and 30 or between 30 and 120 to be almost 4.5 million data points.
How can we do this?
=============================
EDIT:
I am well aware that we will need to sort the data. The problem is that I cannot fit all this data in one variable in memory, but I need to read it sequentially from a generator (file_handle)
If you are happy with an approximation, here is a great (and fairly easy to implement) algorithm for computing quantiles from stream data: "Space-Efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
The silly numpy approach:
import numpy as np
# example data (produced by numpy but converted to a simple list)
datalist = list(np.random.randint(0, 10000000, 45000000))
# converted back to numpy array (start here with your data)
arr = np.array(datalist)
np.percentile(arr, 10), np.percentile(arr, 20), np.percentile(arr, 30)
# ref:
# http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html
You can also hack something together where you just do like:
arr.sort()
# And then select the 10%, 20% etc value, add some check for equal amount of
# numbers within a bin and then calculate the average, excercise for reader :-)
The thing is that calling this function several times will slow it down, so really, just sort the array and then select the elements yourself.
As you said in the comments that you want a solution that can scale to larger datasets then can be stored in RAM, feed the data into an SQLlite3 database. Even if your data set is 10GB and you only have 8GB RAM a SQLlite3 database should still be able to sort the data and give it back to you in order.
The SQLlite3 database gives you a generator over your sorted data.
You might also want to look into going beyond python and take some other database solution.
Here's a pure-python implementation of the partitioned-on-disk sort. It's slow, ugly code, but it works and hopefully each stage is relatively clear (the merge stage is really ugly!).
#!/usr/bin/env python
import os
def get_next_int_from_file(f):
l = f.readline()
if not l:
return None
return int(l.strip())
MAX_SAMPLES_PER_PARTITION = 1000000
PARTITION_FILENAME = "_{}.txt"
# Partition data set
part_id = 0
eof = False
with open("data.txt", "r") as fin:
while not eof:
print "Creating partition {}".format(part_id)
with open(PARTITION_FILENAME.format(part_id), "w") as fout:
for _ in range(MAX_SAMPLES_PER_PARTITION):
line = fin.readline()
if not line:
eof = True
break
fout.write(line)
part_id += 1
num_partitions = part_id
# Sort each partition
for part_id in range(num_partitions):
print "Reading unsorted partition {}".format(part_id)
with open(PARTITION_FILENAME.format(part_id), "r") as fin:
samples = [int(line.strip()) for line in fin.readlines()]
print "Disk-Deleting unsorted {}".format(part_id)
os.remove(PARTITION_FILENAME.format(part_id))
print "In-memory sorting partition {}".format(part_id)
samples.sort()
print "Writing sorted partition {}".format(part_id)
with open(PARTITION_FILENAME.format(part_id), "w") as fout:
fout.writelines(["{}\n".format(sample) for sample in samples])
# Merge-sort the partitions
# NB This is a very inefficient implementation!
print "Merging sorted partitions"
part_files = []
part_next_int = []
num_lines_out = 0
# Setup data structures for the merge
for part_id in range(num_partitions):
fin = open(PARTITION_FILENAME.format(part_id), "r")
next_int = get_next_int_from_file(fin)
if next_int is None:
continue
part_files.append(fin)
part_next_int.append(next_int)
with open("data_sorted.txt", "w") as fout:
while part_files:
# Find the smallest number across all files
min_number = None
min_idx = None
for idx in range(len(part_files)):
if min_number is None or part_next_int[idx] < min_number:
min_number = part_next_int[idx]
min_idx = idx
# Now add that number, and move the relevent file along
fout.write("{}\n".format(min_number))
num_lines_out += 1
if num_lines_out % MAX_SAMPLES_PER_PARTITION == 0:
print "Merged samples: {}".format(num_lines_out)
next_int = get_next_int_from_file(part_files[min_idx])
if next_int is None:
# Remove this partition, it's now finished
del part_files[min_idx:min_idx + 1]
del part_next_int[min_idx:min_idx + 1]
else:
part_next_int[min_idx] = next_int
# Cleanup partition files
for part_id in range(num_partitions):
os.remove(PARTITION_FILENAME.format(part_id))
My code a proposal for finding the result without needing much space. In testing it found a quantile value in 7 minutes 51 seconds for a dataset of size 45 000 000.
from bisect import bisect_left
class data():
def __init__(self, values):
random.shuffle(values)
self.values = values
def __iter__(self):
for i in self.values:
yield i
def __len__(self):
return len(self.values)
def sortedValue(self, percentile):
val = list(self)
val.sort()
num = int(len(self)*percentile)
return val[num]
def init():
numbers = data([x for x in range(1,1000000)])
print(seekPercentile(numbers, 0.1))
print(numbers.sortedValue(0.1))
def seekPercentile(numbers, percentile):
lower, upper = minmax(numbers)
maximum = upper
approx = _approxPercentile(numbers, lower, upper, percentile)
return neighbor(approx, numbers, maximum)
def minmax(list):
minimum = float("inf")
maximum = float("-inf")
for num in list:
if num>maximum:
maximum = num
if num<minimum:
minimum = num
return minimum, maximum
def neighbor(approx, numbers, maximum):
dif = maximum
for num in numbers:
if abs(approx-num)<dif:
result = num
dif = abs(approx-num)
return result
def _approxPercentile(numbers, lower, upper, percentile):
middles = []
less = []
magicNumber = 10000
step = (upper - lower)/magicNumber
less = []
for i in range(1, magicNumber-1):
middles.append(lower + i * step)
less.append(0)
for num in numbers:
index = bisect_left(middles,num)
if index<len(less):
less[index]+= 1
summing = 0
for index, testVal in enumerate(middles):
summing += less[index]
if summing/len(numbers) < percentile:
print(" Change lower from "+str(lower)+" to "+ str(testVal))
lower = testVal
if summing/len(numbers) > percentile:
print(" Change upper from "+str(upper)+" to "+ str(testVal))
upper = testVal
break
precision = 0.01
if (lower+precision)>upper:
return lower
else:
return _approxPercentile(numbers, lower, upper, percentile)
init()
I edited my code a bit and I now think that this way works at least decently even when it's not optimal.

Python increase performance of random.sample

I am writing a function to select randomly elements stored in a dictionary:
import random
from liblas import file as lasfile
from collections import defaultdict
def point_random_selection(list,k):
try:
sample_point = random.sample(list,k)
except ValueError:
sample_point = list
return(sample_point)
def world2Pixel_Id(x,y,X_Min,Y_Max,xDist,yDist):
col = int((x - X_Min)/xDist)
row = int((Y_Max - y)/yDist)
return("{0}_{1}".format(col,row))
def point_GridGroups(inFile,X_Min,Y_Max,xDist,yDist):
Groups = defaultdict(list)
for p in lasfile.File(inFile,None,'r'):
id = world2Pixel_Id(p.x,p.y,X_Min,Y_Max,xDist,yDist)
Groups[id].append(p)
return(Groups)
where k is the number of element to select. Groups is the dictionary
file_out = lasfile.File("outPut",mode='w',header= h)
for m in Groups.iteritems():
# select k point for each dictionary key
point_selected = point_random_selection(m[1],k)
for l in xrange(len(point_selected)):
# save the data
file_out.write(point_selected[l])
file_out.close()
My problem is that this approach is extremely slow (for file of ~800 Mb around 4 days)
You could try and update your samples as you read the coordinates. This at least saves you from having to store everything in memory before running your sample. This is not guaranteed to make things faster.
The following is based off of BlkKnght's excellent answer to build a random sample from file input without retaining all the lines. This just expanded it to keep multiple samples instead.
import random
from liblas import file as lasfile
from collections import defaultdict
def world2Pixel_Id(x, y, X_Min, Y_Max, xDist, yDist):
col = int((x - X_Min) / xDist)
row = int((Y_Max - y) / yDist)
return (col, row)
def random_grouped_samples(infile, n, X_Min, Y_Max, xDist, yDist):
"""Select up to n points *per group* from infile"""
groupcounts = defaultdict(int)
samples = defaultdict(list)
for p in lasfile.File(inFile, None, 'r'):
id = world2Pixel_Id(p.x, p.y, X_Min, Y_Max, xDist, yDist)
i = groupcounts[id]
r = random.randint(0, i)
if r < n:
if i < n:
samples[id].insert(r, p) # add first n items in random order
else:
samples[id][r] = p # at a decreasing rate, replace random items
groupcounts[id] += 1
return samples
The above function takes inFile and your boundary coordinates, as well as the sample size n, and returns grouped samples that have at most n items in each group, picked uniformly.
Because all you use the id for is as a group key, I reduced it to only calculating the col, row tuple, there is no need to make it a string.
You can write these out to a file with:
file_out = lasfile.File("outPut",mode='w',header= h)
for group in samples.itervalues():
for p in group:
file_out.write(p)
file_out.close()

Categories