I have a large file with each line of the form
a b c
I would like to remove all such lines where there does not exist another line either like
b d e
or d a e
with abs(c - e) < 10.
a, b, c, d, e are all integers.
For example if the input is:
0 1 10
1 2 20
2 3 25
0 1 15
1 4 40
then the output should be
1 2 20
2 3 25
0 1 15
Is it possible to do this in anything like linear time?
One idea is to create two dictionaries of sorted lists. One for the third column values associated with first column values. The other for the third column values associated with second column values. Then when you see a b c, look up c in the sorted list you get using key a in the second dictionary and then c in the sorted list you get using key b in the first dictionary.
I don't know if this can be done in linear time. It is straightforward to do it in O(n·log n) time if there are n triplets in the input. Here is a sketch of one method (not necessarily the preferred form of implementation):
1. Make an array of markers M, initially all clear.
2. Create an array and make a copy of the input, sorted first on the middle element and then by the third element whenever middle elements are equal. (Time is O(n·log n) so far.)
3. For each distinct middle value, make a BST (binary search tree) with key = third element. (Time is O(n·log n) again.)
4. Make a hash table keyed by middle values, with data pointing at the appropriate BSTs. That is, given a middle value y and third element z, in time O(1) we can get to the BST for triplets whose middle value is y; and from that, in time O(log n), we can find the triplet with third-element value closest to z.
5. For each triplet t = (x,y,z) in turn, if its marker is not yet set, use the hash table to find the BST, if any, corresponding to x. In that BST, find the triplet u with third element closest to z. If the difference is less than 10, set the markers for t and u. (Time is O(n·log n) again.)
6. Repeat steps 2–5 but with BSTs based on first-element values rather than middle values, and lookups in step 5 based on y rather than x. (Although the matching relations are symmetric, so that we can set two markers at each cycle in step 5, some qualifying triplets may end up not marked; i.e., they are within tolerance but more distant than the nearest match that is found. It would be possible to mark all of the qualifying triplets in step 5, but that would increase worst-case time from O(n·log n) to O(n²·log n).)
7. For each marker that is set, output the corresponding triplet.
Overall time: O(n·log n). The same time can be achieved without building BST's but instead using binary searches within subranges of the sorted arrays.
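As a rough sketch of that sorted-array variant (the function name is mine, and like the sketch above, a line whose first and middle elements are equal could match itself): each triplet's third-column value is bucketed by first and by middle element, the buckets are sorted, and each check becomes a bisect lookup of the two neighbours of the insertion point.

```python
from bisect import bisect_left

def filter_triplets(triplets, tol=10):
    """Keep each (a, b, c) that has a partner (b, d, e) or (d, a, e) with |c - e| < tol."""
    by_first, by_mid = {}, {}  # third-column values keyed by first / middle element
    for a, b, c in triplets:
        by_first.setdefault(a, []).append(c)
        by_mid.setdefault(b, []).append(c)
    for vals in by_first.values():
        vals.sort()
    for vals in by_mid.values():
        vals.sort()

    def near(vals, c):
        i = bisect_left(vals, c)
        # only the neighbours of the insertion point can be the closest value
        return any(0 <= j < len(vals) and abs(vals[j] - c) < tol for j in (i - 1, i))

    return [(a, b, c) for a, b, c in triplets
            if near(by_first.get(b, []), c)   # partner (b, d, e)
            or near(by_mid.get(a, []), c)]    # partner (d, a, e)
```

On the question's example this keeps exactly the three expected lines; building the buckets is O(n·log n) and each of the n queries is O(log n).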
Edit: In python, one can build structures usable with bisect as illustrated below in excerpts from an ipython interpreter session. (There may be more efficient ways of doing these steps.) Each data item in dictionary h is an array suitable for searching with bisect.
In [1]: from itertools import groupby
In [2]: a=[(0,1,10), (1,2,20), (2,3,25), (0,1,15), (1,4,40), (1,4,33), (3,3,17), (2,1,19)]
In [3]: b=sorted((e[1],e[2],i) for i,e in enumerate(a)); print(b)
[(1, 10, 0), (1, 15, 3), (1, 19, 7), (2, 20, 1), (3, 17, 6), (3, 25, 2), (4, 33, 5), (4, 40, 4)]
In [4]: h={k:list(g) for k,g in groupby(b,lambda x: x[0])}; h
Out[4]:
{1: [(1, 10, 0), (1, 15, 3), (1, 19, 7)],
2: [(2, 20, 1)],
3: [(3, 17, 6), (3, 25, 2)],
4: [(4, 33, 5), (4, 40, 4)]}
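Given such a structure, the nearest-value lookup from step 5 is a bisect probe of the two neighbours of the insertion point; a small sketch (the helper name is mine):

```python
from bisect import bisect_left

def has_close(h, key, c, tol=10):
    """True if some triple stored under `key` has a third-column value within tol of c."""
    entries = h.get(key, [])
    vals = [e[1] for e in entries]  # third-column values, sorted within each group
    i = bisect_left(vals, c)
    # the closest stored value sits at index i or i - 1
    return any(0 <= j < len(vals) and abs(vals[j] - c) < tol for j in (i - 1, i))
```

For the session's h above, has_close(h, 1, 20) finds 19 within tolerance and returns True, while has_close(h, 2, 50) returns False.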
Like others have said, linear time may not be possible. Here is an easy O(n^2) implementation. If you sort the lists inside the dictionaries, you should be able to improve the runtime.
lines = """0 1 10
1 2 20
2 3 25
0 1 15
1 4 40"""

Adata = {}
Bdata = {}
for line in lines.split('\n'):
    a, b, c = line.split(' ')[:3]
    vals = list(map(int, [a, b, c]))
    if b in Adata:
        Adata[b].append(vals)
    else:
        Adata[b] = [vals]
    if a in Bdata:
        Bdata[a].append(vals)
    else:
        Bdata[a] = [vals]

def case1(a, b, c):
    if a in Adata:
        for val in Adata[a]:
            if abs(int(c) - val[2]) < 10:
                return True
    return False

def case2(a, b, c):
    if b in Bdata:
        for val in Bdata[b]:
            if abs(int(c) - val[2]) < 10:
                return True
    return False

out = []
for line in lines.split('\n'):
    a, b, c = line.split(' ')[:3]
    if case1(a, b, c) or case2(a, b, c):
        out.append(line)

for line in out:
    print(line)
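As noted above, sorting the lists inside the dictionaries turns each case check from a linear scan into a binary search. One possible sketch (helper names are mine; each bucket is collapsed to a sorted list of third-column values and probed with bisect):

```python
from bisect import bisect_left

def build_sorted(data):
    # collapse each bucket to a sorted list of third-column values
    return {key: sorted(v[2] for v in vals) for key, vals in data.items()}

def has_match(sorted_data, key, c, tol=10):
    vals = sorted_data.get(key, [])
    i = bisect_left(vals, c)
    # only the neighbours of the insertion point can be the closest value
    return any(0 <= j < len(vals) and abs(vals[j] - c) < tol for j in (i - 1, i))
```

With Adata and Bdata built as above, case1(a, b, c) becomes has_match(build_sorted(Adata), a, int(c)) — with the sorted structures built once, up front — so each check costs O(log n) instead of O(n).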
I think what you're looking for is something like

lines = set()
for line in infile:
    if line not in lines:
        lines.add(line)
        outfile.write(line)
Related
I have a list of tuples:
x = [(2, 10), (4, 5), (8, 10), (9, 11), (10, 15)]
I'm trying to compare the first values in all the tuples to see if they are within 1 from each other. If they are within 1, I want to aggregate (sum) the second value of the tuple, and take the mean of the first value.
The output list would look like this:
[(2, 10), (4, 5), (9, 36)]
Notice that the 8 and 10 have a difference of 2, but they're both only 1 away from 9, so they all 3 get aggregated.
I have been trying something along these lines, but it's not capturing the sequenced values like 8, 9, and 10. It's also still preserving the original values, even if they've been aggregated together.
import numpy as np

tuple_list = [(2, 10), (4, 5), (8, 10), (9, 11), (10, 15)]
output_list = []
for x1, y1 in tuple_list:
    for x2, y2 in tuple_list:
        if x1 == x2:
            continue
        if np.abs(x1 - x2) <= 1:
            output_list.append((np.mean([x1, x2]), y1 + y2))
        else:
            output_list.append((x1, y1))
output_list = list(set(output_list))
You can do it in a list comprehension using groupby (from itertools). The grouping key will be the difference between the first value and the tuple's index in the list. When the values are 1 apart, this difference will be constant and the tuples will be part of the same group.
For example: [2, 4, 8, 9, 10] minus their indexes [0, 1, 2, 3, 4] will give [2, 3, 6, 6, 6] forming 3 groups: [2], [4] and [8 ,9, 10].
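The index-difference trick can be checked on the first values alone:

```python
vals = [2, 4, 8, 9, 10]
keys = [v - i for i, v in enumerate(vals)]
print(keys)  # [2, 3, 6, 6, 6] -> groups [2], [4] and [8, 9, 10]
```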
from itertools import groupby
x = [(2, 10), (4, 5), (8, 10), (9, 11), (10, 15)]
y = [ (sum(k)/len(k), sum(v))                              # output tuple
      for i in [enumerate(x)]                              # sequence iterator
      for _, g in groupby(x, lambda t: t[0] - next(i)[0])  # group by sequence
      for k, v in [list(zip(*g))] ]                        # lists of keys & values
print(y)
[(2.0, 10), (4.0, 5), (9.0, 36)]
The for k,v in [list(zip(*g))] part is a bit tricky, but what it does is transform a list of tuples (in a group) into two lists (k and v), with k containing the first item of each tuple and v containing the second items.
e.g. if g is ((8,10),(9,11),(10,15)) then k will be (8,9,10) and v will be (10,11,15)
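That transposition step can be seen in isolation:

```python
g = [(8, 10), (9, 11), (10, 15)]
k, v = zip(*g)
print(k, v)  # (8, 9, 10) (10, 11, 15)
```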
By sorting the list first, and then using itertools.pairwise to iterate over the next and previous days, this problem starts to become much easier. On sequential days, instead of adding a new item to our final list, we modify the last item added to it. Figuring out the new sum is easy enough, and figuring out the new average is actually super easy because we're averaging sequential numbers. We just need to keep track of how many sequential days have passed and we can use that to get the average.
import itertools

def on_neighboring_days_sum_occurrances(tuple_list):
    tuple_list.sort()
    ret = []
    sequential_days = 1
    # We add the first item now and start looping on the second item,
    # so the loop can always modify ret[-1]
    ret.append(tuple_list[0])
    # itertools.pairwise is Python 3.10+ only; in older versions use
    # for prev, current in zip(tuple_list, tuple_list[1:]):
    for prev, current in itertools.pairwise(tuple_list):
        day = current[0]
        prev_day = prev[0]
        is_sequential_day = day - prev_day <= 1
        if is_sequential_day:
            sequential_days += 1
            # the run covers days day-(sequential_days-1) .. day,
            # so its mean is day - (sequential_days - 1) / 2
            avg_day = day - (sequential_days - 1) / 2
            summed_vals = ret[-1][1] + current[1]
            ret[-1] = (avg_day, summed_vals)
        else:
            sequential_days = 1
            ret.append(current)
    return ret

print(on_neighboring_days_sum_occurrances([(2, 10), (4, 5), (8, 10), (9, 11), (10, 15)]))
# [(2, 10), (4, 5), (9.0, 36)]
You can iterate through the sorted list while tracking the start of the current run of consecutive values. From the tuple after the tracked one, check whether the difference between the first elements equals the difference between the indices; while it does, the tuples belong to one consecutive run, so accumulate both the first and the second elements. When the condition breaks (or the list is exhausted), divide the accumulated first elements by the length of the run to get their average, append (average, sum) to the result, and jump ahead to where the run ended so the same tuples aren't considered again... like this

x = [(2, 10), (4, 5), (8, 10), (9, 11), (10, 15)]
x.sort()
res, i = [], 0
while i < len(x):
    sum2, avg1 = x[i][1], x[i][0]
    for j in range(i + 1, len(x)):
        if x[j][0] - x[i][0] == j - i:   # still in the same consecutive run
            sum2 += x[j][1]
            avg1 += x[j][0]
        else:
            break
    else:
        j = len(x)
    res.append((int(avg1 / (j - i)), sum2))
    i = j
print(res)

Here the while loop walks the list one run at a time; sum2 and avg1 accumulate the 2nd and 1st elements of the current run, starting from the tracked tuple. The for loop extends the run while the difference in first elements matches the difference in indices, and breaks as soon as it doesn't; the for loop's else clause sets j past the end when the run reaches the end of the list. Either way j ends up at the first index outside the run, so the accumulated first elements are averaged over the run length j - i, the tuple (avg1, sum2) is appended to res, and i jumps to j so the run's tuples are never considered again.
This problem is largely the same as a classic 0-1 knapsack problem, but with some minor rule changes and a large dataset to play with.
Dataset (product ID, price, length, width, height, weight):
(20,000 rows)
Problem:
A company is closing in fast on delivering its 1 millionth order. The marketing team decides to give the customer who makes that order a prize as a gesture of appreciation. The prize is: the lucky customer gets a delivery tote and 1 hour in the warehouse. Use the hour to fill up the tote with any products you desire and take them home for free.
Rules:
1 of each item
Combined volume < tote capacity (45 * 35 * 30 = 47250)
Item must fit individually (Dimensions are such that it can fit into the tote, e.g. 45 * 45 * 1 wouldn't fit)
Maximize value of combined products
Minimize weight on draws
Solution (using dynamic programming):
from functools import reduce

# The main solver function
def Solver(myItems, myCapacity):
    dp = {myCapacity: (0, (), 0)}
    getKeys = dp.keys
    for i in range(len(myItems)):
        itemID, itemValue, itemVolume, itemWeight = myItems[i]
        for oldVolume in list(getKeys()):
            newVolume = oldVolume - itemVolume
            if newVolume >= 0:
                myValue, ListOfItems, myWeight = dp[oldVolume]
                node = (myValue + itemValue, ListOfItems + (itemID,), myWeight + itemWeight)
                if newVolume not in dp:
                    dp[newVolume] = node
                else:
                    currentValue, loi, currentWeight = dp[newVolume]
                    if currentValue < node[0] or (currentValue == node[0] and node[-1] < currentWeight):
                        dp[newVolume] = node
    return max(dp.values())

# Generate the product of all elements within a given list
def List_Multiply(myList):
    return reduce(lambda x, y: x * y, myList)

toteDims = [30, 35, 45]
totalVolume = List_Multiply(toteDims)

productsList = []
with open('products.csv', 'r') as myFile:
    for myLine in myFile:
        myData = [int(x) for x in myLine.strip().split(',')]
        itemDims = [myDim for myDim, maxDim in zip(sorted(myData[2:5]), toteDims) if myDim <= maxDim]
        if len(itemDims) == 3:
            productsList.append((myData[0], myData[1], List_Multiply(myData[2:5]), myData[5]))

print(Solver(productsList, totalVolume))
Issue:
The output contains repeated items, i.e.
(14018, (26, 40, 62, 64, 121, 121, 121, 152, 152), 13869)
How can I correct this to make it choose only 1 of each item?
It seems that the reason your code can produce answers with duplicate items is that the inner loop iterates over all volumes generated so far, and the solution stored for an existing volume value may already have been replaced (to include the current item) before the iteration reaches it.
E.g. if your productsList contained the following
productsList = [
    # id, value, volume, weight
    [1, 1, 2, 1],
    [2, 1, 3, 2],
    [3, 3, 5, 1]
]
and
totalVolume = 10
then by the time you got to the third item, dp.keys() would contain:
10, 8, 7, 5
The order of iteration is not guaranteed, but for the sake of this example, let's assume it is as given above. Then dp[5] would be replaced by a new solution containing item #3, and later in the iteration, we would be using that as a base for a new, even better solution (except now with a duplicate item).
To overcome the above problem, you could sort the keys before the iteration (in ascending order, which is the default), like for oldVolume in sorted(getKeys()). Assuming all items have a non-negative volume, this should guarantee that we never replace a solution in dp before we have iterated over it.
Another possible problem I see above is the way we get the optimal solution at the end using max(dp.values()). In the problem statement, it says that we want to minimize weight in the case of a draw. If I'm reading the code correctly, the elements of the tuple are value, list of items, weight in that order, so below we're tied for value, but the latter choice would be preferable because of the smaller weight... however max returns the first one:
>>> max([(4, (2, 3), 3), (4, (1, 3), 2)])
(4, (2, 3), 3)
It's possible to specify the sorting key to max so something like this might work:
>>> max([(4, (2, 3), 3), (4, (1, 3), 2)], key=lambda x: (x[0], -x[-1]))
(4, (1, 3), 2)
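Putting both fixes together — iterating volumes in ascending order and breaking ties by weight in the final max — a sketch of a corrected solver might look like this (names adapted from the question, not the original code):

```python
def solver(items, capacity):
    # dp maps remaining volume -> (total value, chosen item ids, total weight)
    dp = {capacity: (0, (), 0)}
    for item_id, value, volume, weight in items:
        # ascending order: entries replaced during this pass always sit at
        # smaller volumes, which have already been visited, so an item can
        # never be added to a solution that already contains it
        for old_volume in sorted(dp):
            new_volume = old_volume - volume
            if new_volume < 0:
                continue
            cur_value, ids, cur_weight = dp[old_volume]
            node = (cur_value + value, ids + (item_id,), cur_weight + weight)
            best = dp.get(new_volume)
            if (best is None or node[0] > best[0]
                    or (node[0] == best[0] and node[2] < best[2])):
                dp[new_volume] = node
    # maximize value, then minimize weight on ties
    return max(dp.values(), key=lambda t: (t[0], -t[2]))
```

With the three-item example above and a capacity of 10, this picks each item at most once.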
In Python, I have a list of pairs (A) and a list of integers (B). A and B always have the same length. I want to know of a fast way of finding all the elements (pairs) of A that correspond to the same value in B (by comparison of indices of A and B) and then store the values in a dictionary (C) (the keys of the dictionary would correspond to elements of B). As an example, if
A = [(0, 0), (0, 1), (0, 3), (0, 6), (0, 7), (1, 3), (1, 7)]
B = [ 2, 5, 5, 1, 5, 4, 1 ]
then
C = {1: [(0, 6), (1, 7)], 2: [(0, 0)], 4: [(1, 3)], 5: [(0, 1), (0, 3), (0, 7)]}
Presently, I am trying this approach:
C = {}
for a, b in zip(A, B):
    C.setdefault(b, [])
    C[b].append(a)
While this approach gives me the desired result, I would like an approach that is much faster, since I need to work with big datasets. I will be thankful if anyone can suggest a fast way to do this (i.e. construct the dictionary C given the lists A and B).
I would have suggested

for i in range(len(B)):
    C2.setdefault(B[i], [])
    C2[B[i]].append(A[i])

as it saves the zip(A, B) step.
import collections

C = collections.defaultdict(list)
for ind, key in enumerate(B):
    C[key].append(A[ind])
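For reference, the two answers combine naturally — defaultdict avoids the per-iteration setdefault call and zip avoids the index arithmetic. All of these are O(n), so only constant factors differ from the original approach:

```python
from collections import defaultdict

A = [(0, 0), (0, 1), (0, 3), (0, 6), (0, 7), (1, 3), (1, 7)]
B = [2, 5, 5, 1, 5, 4, 1]

C = defaultdict(list)
for a, b in zip(A, B):
    C[b].append(a)

print(dict(C))
# {2: [(0, 0)], 5: [(0, 1), (0, 3), (0, 7)], 1: [(0, 6), (1, 7)], 4: [(1, 3)]}
```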
for i, (x, y, z) in enumerate(zip(analysisValues, analysisValues[1:], analysisValues[2:])):
    if all(k < 0.5 for k in (x, y, z)):
        instance = i
        break
this code iterates through an array and looks for the first 3 consecutive values that meet the condition '<0.5'
==============================
i'm working with 'timeseries' data and comparing the values at t, t+1s and t+2s
if the data is sampled at 1Hz then 3 consecutive values are compared and the code above is correct (points 0,1,2)
if the data is sampled at 2Hz then every other point must be compared (points 0,2,4) or
if the data is sampled at 3Hz then every third point must be compared (points 0,3,6)
the sample rate of input data can vary, but is known and recorded as the variable 'SRate'
==============================
please can you help me incorporate 'time' into this point-by-point analysis
You can use slice notation, offsetting the second and third sequences by SRate and 2 * SRate so the compared points are 1 s and 2 s apart:

for i, (x, y, z) in enumerate(zip(analysisValues,
                                  analysisValues[SRate:],
                                  analysisValues[2 * SRate:])):
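A self-contained check, borrowing the sample data from the other answer (SRate = 2, so each window compares points 2 and 4 indices apart):

```python
SRate = 2
analysisValues = [1, 0.4, 1, 0.4, 1, 0.4, 1, 0.4, 1]

instance = None
for i, (x, y, z) in enumerate(zip(analysisValues,
                                  analysisValues[SRate:],
                                  analysisValues[2 * SRate:])):
    if all(k < 0.5 for k in (x, y, z)):
        instance = i
        break

print(instance)  # 1  (values 0.4, 0.4, 0.4 at indices 1, 3, 5)
```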
Let us first construct a helper generator which does the following:

from itertools import tee

def sparsed_window(iterator, elements=2, step=1):
    its = tee(iterator, elements)
    for i, it in enumerate(its):
        for _ in range(i * step):
            next(it, None)  # wind forward each iterator by the needed number of items
    return zip(*its)

print(list(sparsed_window([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3, 2)))

Output:

[(1, 3, 5), (2, 4, 6), (3, 5, 7), (4, 6, 8), (5, 7, 9), (6, 8, 10)]
This helper saves us from creating several nearly identical lists in memory. It uses tee to cleverly cache only the part of the iterator that is needed.
The helper code is based on the pairwise recipe.
Then we can use this helper to get what we want:
def find_instance(iterator, hz=1):
    iterated_in_sparsed_window = sparsed_window(iterator, elements=3, step=hz)
    fitting_values = filter(lambda pair: all(el < 0.5 for el in pair[1]),
                            enumerate(iterated_in_sparsed_window))
    i, first_fitting = next(fitting_values, (None, None))
    return i

print(find_instance([1, 0.4, 1, 0.4, 1, 0.4, 1, 0.4, 1], hz=2))

Output:

1
Earlier I had a lot of wonderful programmers help me get a function done. However, the instructor wanted it in a single loop, and all the working solutions used multiple loops.
I wrote another program that almost solves the problem. Instead of using a loop to compare all the values, you use has_key to see if that specific key exists. That rids you of the need to iterate through the dictionary to find matching values, because you can just tell whether they match or not.
Again, charCount is just a function that enters the counts of the elements into a dictionary and returns the dictionary.
def sumPair(theList, n):
    for a, b in level5.charCount(theList).iteritems():
        x = n - a
        if level5.charCount(theList).get(a):
            if a == x:
                # check the frequency of the number is greater than one, so a
                # single occurrence isn't paired with itself (e.g. for 6+6=12,
                # a single 6 would otherwise return 6+6)
                if b > 1:
                    return a, x
            else:
                if level5.charCount(theList).get(a) != x:
                    return a, x

print sumPair([6,3,8,3,2,8,3,2], 9)
I need to just make this code find the sum without iteration by seeing if the current element exists in the list of elements.
You can use the collections.Counter function instead of level5.charCount.
And I don't see why you need the check if level5.charCount(theList).get(a): — a is a key you got from level5.charCount(theList), so it is always present.
So I simplified your code:
from collections import Counter

def sumPair(the_list, n):
    counts = Counter(the_list)
    for a in counts:
        x = n - a
        # the complement must itself occur in the list; if it equals a,
        # then a must occur at least twice
        if x in counts and (x != a or counts[a] > 1):
            return a, x

print(sumPair([6, 3, 8, 3, 2, 8, 3, 2], 9))  # output >>> (6, 3)
A list comprehension can also be used, like this:

>>> counts = Counter(the_list)
>>> result = [(a, n - a) for a in counts if n - a in counts and (n - a != a or counts[a] > 1)]
>>> print(result)
[(6, 3), (3, 6)]
>>> print(result[0])  # this is the result you want
(6, 3)