This is what I do to find all duplicate lines in a text file:
import regex  # third-party regex module, used here like the built-in re
# capture all lines in a buffer
r = f.readlines()
# create a list of all line numbers
lines = list(range(1, len(r) + 1))
# merge both lists
z = [list(a) for a in zip(r, lines)]
# sort the list
newsorting = sorted(z)
# put duplicates in a list
listdoubles = []
for i in range(0, len(newsorting) - 1):
    if (i + 1) <= len(newsorting):
        if (newsorting[i][0] == newsorting[i+1][0]) and (not regex.search(r'^\s*$', newsorting[i][0])):
            listdoubles.append(newsorting[i][1])
            listdoubles.append(newsorting[i+1][1])
# remove eventual duplicate line numbers
listdoubles = list(set(listdoubles))
# sort line numbers numerically
listdoubles = sorted(listdoubles, key=int)
print(listdoubles)
But it is very slow: with over 10,000 lines it takes about 10 seconds to create this list.
Is there a way to do it faster?
You can use a simpler approach:
for each line
    if it has been seen before then display it
    else add it to the set of known lines
In code:
seen = set()
for L in f:
    if L in seen:
        print(L)
    else:
        seen.add(L)
If you want to display the line numbers where duplicates appear, the code can be changed to use a dictionary mapping each line's content to the line number where that text was first seen:
seen = {}
for n, L in enumerate(f):
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n
Both dict and set in Python are based on hashing and provide constant-time lookup operations.
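As a self-contained illustration (the sample lines and the io.StringIO stand-in for the file object are made up for the demo), the dictionary version behaves like this:

```python
import io

# Hypothetical sample standing in for the open file object `f`
f = io.StringIO("apple\nbanana\napple\ncherry\nbanana\napple\n")

seen = {}
dupes = []  # (duplicate_line_no, first_seen_line_no), 1-based
for n, L in enumerate(f, start=1):
    if L in seen:
        dupes.append((n, seen[L]))
    else:
        seen[L] = n

print(dupes)  # [(3, 1), (5, 2), (6, 1)]
```

Each membership test is a single hash lookup, so the whole scan is linear in the number of lines.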
EDIT
If you need only the line number of the last duplicate of each line, the output clearly cannot be produced during processing: you first have to process the whole input before emitting anything...
# lastdup maps line content to the line number where the last
# duplicate was found. On first insertion the value is None,
# marking the line as not (yet) a duplicate.
lastdup = {}
for n, L in enumerate(f):
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None

# Now every value that is not None is the last duplicate of a line
result = sorted(x for x in lastdup.values() if x is not None)
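A quick sanity check of the idea on a small in-memory list (hypothetical data, not from the question):

```python
lastdup = {}
for n, L in enumerate(["a", "b", "a", "c", "a", "b"]):
    if L in lastdup:
        lastdup[L] = n  # overwrite: keep only the latest duplicate position
    else:
        lastdup[L] = None  # first sighting, not a duplicate

result = sorted(x for x in lastdup.values() if x is not None)
print(result)  # [4, 5] -- last duplicates of "a" and "b"
```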
Related
I want to count the unique HH:MM:xx values (e.g. 11:11:00, 11:12:00, 11:12:11) using regex. So far I am only able to count the total number of HH:MM:SS occurrences in the text, and I am not sure how to continue from here. This is my code:
import re

pattern = re.compile(r"(\d{2}):(\d{2}):(\d{2})")  # capture every HH:MM:SS pattern
path = r'C:\Users\CL\Desktop\abc.txt'
list1 = []  # to store values in a list
for line in open(path, 'r'):
    for match in re.finditer(pattern, line):  # matches 11:11:00, 11:12:00, 11:12:11
        list1.append(line)  # append to the list
total = len(list1)
print(total)  # 3
sample text
11:11:00
abc
11:12:00
abc
11:12:11
abc
The desired output should be 2 (unique values: 11:11:xx and 11:12:xx).
See below (data1.txt is your data):
from collections import defaultdict

data = defaultdict(int)
with open('data1.txt') as f:
    lines = [l.strip() for l in f.readlines()]

for line in lines:
    if line.count(':') == 2:
        data[line[:5]] += 1

print(data)
output
defaultdict(<class 'int'>, {'11:11': 1, '11:12': 2})
You could use re.findall here, followed by a list comprehension to remove duplicates:
import re

with open(path, 'r') as file:
    data = file.read()

ts = re.findall(r'(\d{2}:\d{2}):\d{2}', data)
res = []
[res.append(x) for x in ts if x not in res]
print(len(res))
If you only want to count the number of unique occurrences, you can simply do:
import re

txtfile = open(r"C:\Users\CL\Desktop\abc.txt", "r")
filetext = txtfile.read()
txtfile.close()
list1 = set(re.findall(r"(\d{2}:\d{2}):\d{2}", filetext))
total = len(list1)  # number of unique HH:MM values
print(total)  # 2
You can use parentheses to specify what you want to capture (the HH:MM). Then you can use set to remove duplicates.
Have you tried using a set instead of a list?
import re

pattern = re.compile(r"(\d{2}):(\d{2}):(\d{2})")
path = r'C:\Users\CL\Desktop\abc.txt'
s = set()  # use a set instead of a list, to avoid duplicates
for line in open(path, 'r'):
    for match in re.finditer(pattern, line):
        s.add(line[:-3])  # insert HH:MM (dropping the seconds) into the set
total = len(s)  # number of elements in s
print(total)  # 2
This way, if you try to insert an element you've already seen, there won't be multiple copies of it stored, since sets don't allow duplicates.
EDIT: As commented, we are not supposed to include seconds here, which I mistakenly did originally. Fixed now.
I have a file that puts out lines that have two values each. I need to compare the second value in every line to make sure those values are not repeated more than once. I'm very new to coding so any help is appreciated.
My thinking was to turn each line into a list with two items each, and then I could compare the same position from a couple lists.
This is a sample of what my file contains:
20:19:18 -1.234567890
17:16:15 -1.098765432
14:13:12 -1.696969696
11:10:09 -1.696969696
08:07:06 -1.696969696
Here's the code I'm trying to use. Basically I want it to ignore those first two lines and print out the third line, since it gets repeated more than once:
with open('my_file') as txt:
    for line in txt:  # this section turns the file into lists
        linelist = '%s' % (line)
        lista = linelist.split(' ')

n = 1
for line in lista:
    listn = line[n]
    listo = line[n + 1]
    listp = line[n + 2]
    if listn[1] == listo[1] and listn[1] == listp[1]:
        print line
    else:
        pass
    n += 1
What I want to see is:
14:13:12 -1.696969696
But I keep getting a "string index out of range" error on the long if statement.
You would be a lot better off using a dictionary-type structure, since a dictionary lets you check for existence quickly.
Basically, check whether the 2nd value is already a key in your dict. If it is, print the line. Otherwise, add the 2nd value as a key for later.
myDict = {}
with open('/home/dmoraine/pylearn/%s' % (file)) as txt:
    for line in txt:
        key = line.split()[1]
        if key in myDict:
            print(line)
        else:
            myDict[key] = None  # value doesn't matter
Some simple debugging highlights the functional problem:
with open('my_file.txt') as txt:
    for line in txt:  # this section turns the file into lists
        linelist = '%s' % (line)
        lista = linelist.split(' ')
        print(linelist, lista)

n = 1
for line in lista:
    print("line", n, ":\t", line)
    listn = line[n]
    listo = line[n + 1]
    listp = line[n + 2]
    print(listn, '|', listo, '|', listp)
    if listn[1] == listo[1] and listn[1] == listp[1]:
        print(line)
    n += 1
Output:
20:19:18 -1.234567890
['20:19:18', '-1.234567890\n']
17:16:15 -1.098765432
['17:16:15', '-1.098765432\n']
14:13:12 -1.696969696
['14:13:12', '-1.696969696\n']
11:10:09 -1.696969696
['11:10:09', '-1.696969696\n']
08:07:06 -1.696969696
['08:07:06', '-1.696969696\n']
line 1 : 08:07:06
8 | : | 0
In short, you've mis-handled the variables. When you get to the second loop, lista is the "words" of the final line; you've read and discarded all of the others. line iterates through these individual words. Your listn/o/p variables are, therefore, individual characters. Thus, there is no such thing as listn[1], and you get an error.
Instead, you need to build some sort of list of the floating-point numbers. For instance, using your top loop as a starting point:
float_list = []  # a list, not a dict, since we append to it
for line in txt:  # this section turns the file into lists
    lista = line.split(' ')
    my_float = float(lista[1])  # convert the second field into a float
    float_list.append(my_float)
Now you need to write code that will find duplicates in float_list. Can you take it from there?
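For reference, one possible sketch of that last step (using collections.Counter, with made-up values mirroring the sample file):

```python
from collections import Counter

# Hypothetical float_list, as it would look after parsing the sample file
float_list = [-1.234567890, -1.098765432, -1.696969696, -1.696969696, -1.696969696]

counts = Counter(float_list)  # value -> number of occurrences
repeated = [value for value, count in counts.items() if count > 1]
print(repeated)  # [-1.696969696]
```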
Ended up turning each line into a list, and then making a dictionary of all the lists. Thank you all for your help.
I am trying to figure out if it is possible to access the elements of a list around the element you are currently at. I have a list that is large (20k+ lines) and I want to find every instance of the string 'Name'. Additionally, I also want to get +/- 5 elements around each 'Name' element. So 5 lines before and 5 lines after. The code I am using is below.
search_string = 'Name'
with open('test.txt', 'r') as infile, open('textOut.txt', 'w') as outfile:
    for line in infile:
        if search_string in line:
            outfile.writelines([line, next(infile), next(infile),
                                next(infile), next(infile), next(infile)])
Getting the lines after the occurrence of 'Name' is pretty straightforward, but figuring out how to access the elements before it has me stumped. Anyone have any ideas?
20k lines isn't that much. If it's OK to read all of them into a list, we can take slices around the index where a match is found, like this:
with open('test.txt', 'r') as infile, open('textOut.txt', 'w') as outfile:
    lines = [line.strip() for line in infile.readlines()]
    n = len(lines)
    for i in range(n):
        if search_string in lines[i]:
            start = max(0, i - 5)
            end = min(n, i + 6)
            outfile.write('\n'.join(lines[start:end]) + '\n')
You can use the built-in function enumerate, which lets you iterate through both indexes and elements.
Example accessing the elements 5 indexes before and after the current element:
n = len(l)
for i, x in enumerate(l):
    print(l[max(i - 5, 0)])      # clamp to 0 so negative indexes don't wrap to the end
    print(x)
    print(l[min(i + 5, n - 1)])  # clamp to n-1 to prevent an IndexError
You need to keep track of the index of where in the list you currently are
So something like:
# Read the file into list_of_lines
index = 0
while index < len(list_of_lines):
    if list_of_lines[index] == 'Name':
        print(list_of_lines[index - 1])  # This is the previous line
        print(list_of_lines[index + 1])  # This is the next line
        # And so on...
    index += 1
Let's say you have your lines stored in your list:
lines = ['line1', 'line2', 'line3', 'line4', 'line5', 'line6', 'line7', 'line8', 'line9']
You could define a function that returns the elements grouped in runs of n consecutive items, as a generator:
def each_cons(iterable, n=2):
    if n < 2: n = 1
    i, size = 0, len(iterable)
    while i < size - n + 1:
        yield iterable[i:i + n]
        i += 1
Then, just call the function. To show the content I'm calling list on it, but you can iterate over it:
lines_by_3_cons = each_cons(lines, 3) # or any number of lines, 5 in your case
print(list(lines_by_3_cons))
#=> [['line1', 'line2', 'line3'], ['line2', 'line3', 'line4'], ['line3', 'line4', 'line5'], ['line4', 'line5', 'line6'], ['line5', 'line6', 'line7'], ['line6', 'line7', 'line8'], ['line7', 'line8', 'line9']]
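Applied to the original question, a sketch (with hypothetical lines) could scan windows and keep those whose middle line contains the search string:

```python
def each_cons(iterable, n=2):
    if n < 2: n = 1
    i, size = 0, len(iterable)
    while i < size - n + 1:
        yield iterable[i:i + n]
        i += 1

# Made-up data; for the question's +/- 5 lines, use windows of 11 and w[5]
lines = ['a', 'b', 'Name: x', 'c', 'd']
hits = [w for w in each_cons(lines, 3) if 'Name' in w[1]]
print(hits)  # [['b', 'Name: x', 'c']]
```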
I personally loved this problem. The other answers here read the whole file into memory; I think I wrote a more memory-efficient version.
Here, check this out!
myfile = open('infile.txt')

stack_print_moments = []
expression = 'MYEXPRESSION'
neighbourhood_size = 5

def print_stack(stack):
    for line in stack:
        print(line.strip())
    print('-----')

current_stack = []
for index, line in enumerate(myfile):
    current_stack.append(line)
    if len(current_stack) > 2 * neighbourhood_size + 1:
        current_stack.pop(0)
    if expression in line:
        stack_print_moments.append(index + neighbourhood_size)
    if index in stack_print_moments:
        print_stack(current_stack)

last_index = index
for index in range(last_index, last_index + neighbourhood_size + 1):
    if index in stack_print_moments:
        print_stack(current_stack)
    current_stack.pop(0)
More advanced code is here: Github link
I am trying to read a file, collect some lines, batch process them and then post process the result.
Example:
with open('foo') as input:
    line_list = []
    for line in input:
        line_list.append(line)
        if len(line_list) == 10:
            result = batch_process(line_list)
            # something to do with result here
            line_list = []
    if len(line_list) > 0:  # the total is very probably not a multiple of 10, e.g. 11
        result = batch_process(line_list)
        # something to do with result here
I do not want to duplicate the batch invocation and post-processing, so I want to know if I could dynamically append some content to input, e.g.:
with open('foo') as input:
    line_list = []
    # input.append("THE END")
    for line in input:
        if line != 'THE END':
            line_list.append(line)
        if len(line_list) == 10 or line == 'THE END':
            result = batch_process(line_list)
            # something to do with result here
            line_list = []
This way I would not have to duplicate the code in the if branch. Or is there some other way to know that I have reached the last line?
If your input is not too large and fits comfortably in memory, you can read everything into a list, slice the list into sub-lists of length 10, and loop over them.
k = 10
with open('foo') as input:
    lines = input.readlines()

slices = [lines[i:i + k] for i in range(0, len(lines), k)]
for slice in slices:
    batch_process(slice)
If you want to append a mark to the input lines, you also have to read all lines first.
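If memory is a concern, here is an alternative sketch (not part of the original answer) that uses itertools.islice to pull batches lazily, so the batch call still appears in only one place; batch_process is stubbed out here for illustration:

```python
from itertools import islice

def batches(iterable, k=10):
    """Yield lists of up to k items until the iterable is exhausted."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, k))
        if not chunk:
            return
        yield chunk

# Stand-ins for the file and for batch_process
lines = ["line %d\n" % i for i in range(23)]
sizes = [len(batch) for batch in batches(lines, 10)]
print(sizes)  # [10, 10, 3]
```

The final short batch falls out of the same loop, so there is no duplicated post-processing code.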
So I have a text file consisting of one column, where each line contains two numbers:
190..255
337..2799
2801..3733
3734..5020
5234..5530
5683..6459
8238..9191
9306..9893
I would like to discard the very first and the very last number, in this case 190 and 9893, and basically shift the rest of the numbers one spot forward.
My desired output:
255..337
2799..2801
3733..3734
5020..5234
5530..5683
6459..8238
9191..9306
I hope that makes sense; I'm not sure how to approach this.
lines = """190..255
337..2799
2801..3733"""
values = [int(v) for line in lines.split() for v in line.split('..')]
# values = [190, 255, 337, 2799, 2801, 3733]
pairs = zip(values[1:-1:2], values[2:-1:2])
# pairs = [(255, 337), (2799, 2801)]
out = '\n'.join('%d..%d' % pair for pair in pairs)
# out = "255..337\n2799..2801"
Try this:
with open(filename, 'r') as f:
    lines = f.readlines()

numbers = []
for row in lines:
    numbers.extend(row.strip().split('..'))

numbers = numbers[1:len(numbers) - 1]
newLines = ['..'.join(numbers[idx:idx + 2]) for idx in range(0, len(numbers), 2)]

with open(filename, 'w') as f:
    for line in newLines:
        f.write(line)
        f.write('\n')
Try this:
Read all of them into one list, split each line into two numbers, so you have one list of all your numbers.
Remove the first and last item from your list
Write out your list, two items at a time, with dots in between them.
Here's an example:
a = """190..255
337..2799
2801..3733
3734..5020
5234..5530
5683..6459
8238..9191
9306..9893"""
a_list = a.replace('..','\n').split()
b_list = a_list[1:-1]
b = ''
for i in range(len(a_list)/2):
b += '..'.join(b_list[2*i:2*i+2]) + '\n'
temp = []
with open('temp.txt') as ofile:
    for x in ofile:
        temp.append(x.rstrip("\n"))

for x in range(0, len(temp) - 1):
    print(temp[x].split("..")[1] + ".." + temp[x + 1].split("..")[0])
Maybe this will help:
def makeColumns(listOfNumbers):
    n = 0
    while n < len(listOfNumbers):
        print(listOfNumbers[n], '..', listOfNumbers[n + 1])
        n += 2

def trim(listOfNumbers):
    listOfNumbers.pop(0)
    listOfNumbers.pop(len(listOfNumbers) - 1)

listOfNumbers = [190, 255, 337, 2799, 2801, 3733, 3734, 5020, 5234, 5530, 5683, 6459, 8238, 9191, 9306, 9893]
makeColumns(listOfNumbers)
print()
trim(listOfNumbers)
makeColumns(listOfNumbers)
I think this might be useful too. I am reading the data from a file named list.
data = open("list", "r")
temp = []
value = []
for line in data:
    temp = line.split("..")
    value.append(temp[0])
    value.append(temp[1])

for i in range(1, len(value) - 1, 2):
    print(value[i].strip() + ".." + value[i + 1])
print(value)
After reading the data I split each line and store the pieces in the temporary list. After that, I copy the data to the main list value, which holds all of the numbers. Then I iterate from the second element to the second-to-last element to get the output of interest. strip is used to remove the '\n' character from the value.
You can later write these values to a file instead of printing them out.