Append element to list, error: list index out of range - python

I need to analyse the Brightkite network and its check-ins. Basically, I have to count the number of distinct users who checked in at each location. When I run this piece of code on a small file (just the first 300 lines cut from the original file) it works fine, but if I try to do the same with the original file I get the error
users.append(columns[4])
IndexError: list index out of range
What could it be?
Here is my code:
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Here is the structure of the data:
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411

You should use the csv module and update the counter as you go:
from collections import Counter
import csv

with open("Brightkite_totalCheckins.txt") as f:
    r = csv.reader(f, delimiter="\t")
    cn = Counter()
    users = []
    for row in r:
        # update Counter as you go, no need to build another list
        # locations is row[4] not row[0]
        cn[row[4]] += 1
        # same as columns[]
        users.append(row[0])

print(cn.most_common(10))
Output from the full file:
[('00000000000000000000000000000000', 254619), ('ee81ef22a22411ddb5e97f082c799f59', 17396), ('ede07eeea22411dda0ef53e233ec57ca', 16896), ('ee8b1d0ea22411ddb074dbd65f1665cf', 16687), ('ee78cc1ca22411dd9b3d576115a846a7', 14487), ('eefadd1aa22411ddb0fd7f1c9c809c0c', 12227), ('ecceeae0a22411dd831d5f56beef969a', 10731), ('ef45799ca22411dd9236df37bed1f662', 9648), ('d12e8e8aa22411dd90196fa5c210e3cc', 9283), ('ed58942aa22411dd96ff97a15c29d430', 8640)]
If you print the lines using repr you see the file is tab separated:
'7611\t2009-08-30T11:07:52Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-30T00:15:20Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T20:28:13Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:53:59Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:19:36Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:16:45Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T11:52:32Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
..................
The very last line is:
'58227\t2009-01-21T00:24:35Z\t33.833333\t35.833333\t9f6b83bca22411dd85460384f67fcdb0\n'
so make sure that matches and that you have not modified the file, and there will be no IndexError.
Your code fails because some lines look like '7573\t\t\t\t\n' (the first of which is line number 1909858), so stripping and splitting leaves you with ['7573'].
Using the csv module, however, gives you ['7573', '', '', '', ''].
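You can see the difference on such a line directly (a minimal sketch, using io.StringIO to stand in for the file):
import csv
import io

bad = '7573\t\t\t\t\n'
# strip() removes the trailing tabs as whitespace, so split("\t") sees a single field
print(bad.strip().split("\t"))                              # ['7573']
# csv.reader keeps the empty fields, so row[4] still exists (it is just '')
print(next(csv.reader(io.StringIO(bad), delimiter="\t")))   # ['7573', '', '', '', '']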
If you actually want a list of ten unique locations, you need to find the values that are equal to 1:
# generator expression of key/values where value == 1
unique = (tup for tup in cn.iteritems() if tup[1] == 1)
from itertools import islice
# take first 10 elements from unique
sli = list(islice(unique,10))
print(sli)
[('2d4920e7273c755704c06f2201832d89', 1), ('a4ef963e84f83133484227465e2113e9', 1), ('474f93a6585111dea018003048c10834', 1), ('413754d668b411de9a19003048c0801e', 1), ('d115daaca22411ddb75a33290983eb13', 1), ('4bac110041ad11de8fca003048c0801e', 1), ('fc706c121ec1f54e0a828548ac5e26b8', 1), ('1bcd0cf0f0bd11ddb822003048c0801e', 1), ('e6ed6c09b8994ed125f3c5ef6c210844', 1), ('493ef9b049cfb2c6c24667a931f1592172074545', 1)]
To get the count of all unique locations, we can consume the rest of our generator expression with sum (adding 1 for every element) and add the total to the length of what we already took with islice:
print(sum(1 for _ in unique) + len(sli))
Which gives you 426831 unique locations.
Using re.split or str.split is not going to work for an obvious reason:
In [13]: re.split("\s+", '7573\t\t\t\t\n'.rstrip())
Out[13]: ['7573']
In [14]: '7573\t\t\t\t\n'.rstrip().split()
Out[14]: ['7573']
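Since the stated goal was the number of distinct users who checked in at each location (rather than the raw number of check-ins), a set of users per location gives that. Here is a minimal sketch along the same lines, assuming the same tab-separated file and column order (user id in column 0, location id in column 4):
import csv
from collections import defaultdict

users_by_location = defaultdict(set)        # location id -> set of user ids
with open("Brightkite_totalCheckins.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 5 or not row[4]:
            continue                        # skip malformed lines such as '7573\t\t\t\t'
        users_by_location[row[4]].add(row[0])

# ten locations with the most distinct users
top = sorted(users_by_location.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]
for location, users in top:
    print(location, len(users))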

The problem was in your data; I checked the website data you provided. The data is not actually separated by tabs, just by spaces. I added some lines to replace the spaces with tabs and then split the line. It works now.
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Note: if you have errors like this, check your data; the error message is clear that there is no element at index 4.
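A defensive variant of the original loop simply skips any line that does not have all five columns, so a malformed line cannot raise the IndexError; a minimal sketch, assuming the tab-separated layout shown above:
from collections import Counter

locations = Counter()
with open("b.txt") as f:
    for line in f:
        columns = line.strip().split("\t")
        if len(columns) < 5:
            continue        # not enough columns to index [4], skip the line
        locations[columns[0]] += 1

print(locations.most_common(10))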
I hope this is the error you want to resolve.

Related

How can I clean this data for easier visualizing?

I'm writing a program to read a set of data rows and quantify matching sets. I have the code below, but I would like to cut or filter out the numbers at the end that keep rows from being recognized as a match.
import collections

a = "test.txt"  # This can be changed to a = input("What's the filename? ")
line_file = open(a, "r")
print(line_file.readable())  # Readable check.
#print(line_file.read())  # Prints each individual line.

# Code for quantity counter.
counts = collections.Counter()  # Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()
This is what it outputs; however, I'd like it to ignore the numbers at the end and pair the matching sets accordingly.
A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1
What I'm aiming for is for it to ignore the "-77" here and instead count the total as x3:
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
Split each element on the dashes and check the last element is a number. If so, remove it, then continue on.
from collections import Counter

def trunc(s):
    parts = s.split('-')
    if parts[-1].isnumeric():
        return '-'.join(parts[:-1])
    return s

with open('data.txt') as f:
    data = [trunc(x.rstrip()) for x in f.readlines()]

counts = Counter(data)
for k, v in counts.items():
    print(k, v)
Output
A2-W-FF-DIN 2
A2-FF-DIN 1
B12-H-BB-DD 3
C1-GH-KK-LOP 1
You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.
Here, (?P<base>.+?) is a non-greedy match of any character except for a newline grouped under the name "base", (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 occurring at the end of the "base" group and puts the digits in a group named "suffix", and \Z is the end of the string.
This is what it does in action:
>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}
So you can see, in this instance, whether or not the string has a digit suffix, the base is the same.
All together, here's a self-contained example of how it might be applied to data like this:
import collections
import re

regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP"
]

counts = collections.Counter()
# Iterates through the data and updates the counter.
for datum in sample_data:
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    counts.update((number,))

# Prints each number and prints how many instances were found.
for key, count in counts.items():
    print(f"{key}: x{count}")
For which the output is
A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1
Or in the example code you provided, it might look like this:
import collections
import re

# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

a = "test.txt"
line_file = open(a, "r")
print(line_file.readable())  # Readable check.

# Creates a new counter.
counts = collections.Counter()
with open(a) as infile:
    for line in infile:
        for number in line.split():
            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()

Python program which could auto sort and replace hashes with plain password in csv file

I have an Excel file "ex.csv" with the columns Hash, Salt, Name, and a txt file "found.txt" with decrypted hashes in the format Hash:Salt:Plain_Password. I would like to replace the Hash in "ex.csv" with the Plain_Password from "found.txt". Would like to know how I could do that :) I have written a test program that would output the Hash:Salt pairs into a separate txt file, but it is not working.
Python code -
# File Reads
a = open("ex.csv")
b = open("found.txt")
# Reading contents
ex = a.read()
found = b.read()
# Splitting files by newline
ex_s = ex.split("\n")
found_s = found.split("\n")
# Splitting them into subarrays by splitting them by ','
temp_exsp2 = []
temp_foundsp2 = []
i = 0
for item in ex_s:
    temp_exsp2[i] = item[0]  # Presumably here's an error
    i += 1
i = 0
for item in found_s:
    temp_foundsp2[i] = item[0]  # Same thing here
    i += 1
i = 0
z = 0  # Used for incrementing found array
FoundArray0 = []  # For line from ex
FoundArray1 = []  # For line from found
while i != len(ex_s):  # Main comparison loop
    for item in temp_foundsp2:  # Inner loop for looping through the whole found file
        j = 0
        if item in temp_exsp2[i]:
            FoundArray0[z] = i
            FoundArray1[z] = j
            z += 1
        j += 1
    i += 1  # Go to the next line in ex.csv
output = open("output.txt", "w")
for out in FoundArray0:
    for out2 in FoundArray1:
        output.write(str(ex_s[FoundArray0]) + ":" + str(temp_foundsp2[FoundArray1]))
FoundArray here holds the line numbers from ex.csv and found.txt. (I would like to know if there's a better way to do this ;) because I feel that it is not right.) It is giving me this error:
temp_exsp2[i] = item[0]  # Presumably here's an error
IndexError: list assignment index out of range
Samples from ex.csv:
210ac64b3c5a570e177b26bb8d1e3e93f72081fd,gx0FMxymN,user1
039e8c304c9ada05fd9cc549ac62e178edbfaed6,eVRCBE2OG,user2
Samples from found.txt
f8fa3b3da3fc71e1eaf6c18e4afef626e1fc7fc1:t7e2jlLvs:pass1
bce61cb17c381e11afbcf89ab30ae5cc8276722f:rjCAX5D6K:pass2
Maybe there's an excel function that does that :D I don't know.
I am new at python and would like to know the best way to realize this :)
Thanks ;)
To split a string that has entries separated by a specific delimiter you can use the string.split(delimiter) method.
Example:
>>> a = '123:456:abc'
>>> a.split(':')
['123', '456', 'abc']
You could also take a look at a pandas DataFrame, which can load a csv file and then lets you easily manipulate the columns and much more.
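For the actual replacement task, one approach is to build a dict from found.txt keyed by hash and then rewrite ex.csv row by row. This is a minimal sketch under the file layouts shown in the question (hash,salt,name in ex.csv and hash:salt:password in found.txt); the output filename ex_plain.csv is just a placeholder:
import csv

# hash -> plain password, parsed from found.txt
passwords = {}
with open("found.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        hash_, salt, plain = line.split(":", 2)
        passwords[hash_] = plain

# rewrite ex.csv, swapping each hash for its plain password when it is known
with open("ex.csv") as src, open("ex_plain.csv", "w", newline="") as dst:  # ex_plain.csv is a hypothetical output file
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] in passwords:
            row[0] = passwords[row[0]]
        writer.writerow(row)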

Find first line of text according to value in Python

How can I search for a value in the first "latitude, longitude" coordinate of a "file.txt" list in Python and get the 3 rows above and 3 rows below the match?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
How can I search for a value in the first "latitude, longitude" coordinate of a "file.txt" list in Python and get the 3 rows above and 3 rows below the match?
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data into a DataFrame and then easily manipulate it. As per your question, the value to check is not an exact match, so I have converted it to a string.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # imports text file as dataframe
value_to_check = 37.0459  # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break
print(data.iloc[i - 3:i + 4, :])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators that only keeps the necessary lines in memory and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
#  '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
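A quick illustration of that pitfall, using one of the lines from the question's data:
line = '37.0443,-95.5839'
print('37.04' in line)                          # True  - substring match is a false positive
print(37.04 in map(float, line.split(',')))     # False - float comparison is exact
print(37.0443 in map(float, line.split(',')))   # True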
This solution will print the elements before and after the match even if there are fewer than 3 of them.
Also, I am using strings because the question implies that you want partial matches too, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'

with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n'

for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break

left = 0
right = 0
for k in range(1, 4):  # because the last one is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1

for i in range(index - left, index + right + 1):  # because the last one is not included
    print(lines[i][0], lines[i][1])

How to sort a large number of lists to get a top 10 of the longest lists

So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temporary list.
How would I create a top-ten list containing the lists with the most elements in the whole file?
This is the code I have now:
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0
for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like:
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']
# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())

# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))

# print the first 10 lines
print(lines[:10])
Why not use collections to display the top 10? i.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))
When wanting to count and then find the one(s) with the highest counts, collections.Counter comes to mind:
from collections import Counter

lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)

print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']
Use a for loop and max() maybe? You say you've got a for loop that's placing the values into a temp array. From that you could use "max()" to pick out the largest value and put that into a list.
As an open for loop, something like appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to do the same process on the new list(s) until you get to the desired top 10 scenario.
EDIT: I've just realised that I've misread your question. You want the lists with the most elements, not the highest values. OK.
len() is a good one for this.
newlist = []
longest = 0
for templist in data:
    if len(templist) > longest:
        longest = len(templist)
        newlist.append(templist)
That would give you the current highest and from there you could create a top 10 list of lengths or of the temp lists themselves, or both.
If your data is really as shown, with each number the same length, then I would make a dictionary with key = line and value = length, get the top value/key pairs in the dictionary, and voilà. Sounds easy enough.
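A minimal sketch of that idea, using heapq.nlargest to pull out the ten longest entries (the filename is taken from the question):
import heapq

# key = line, value = number of elements on that line
lengths = {}
with open('node.txt') as f:
    for line in f:
        lengths[line] = len(line.split())

# the ten lines with the most elements
for line, count in heapq.nlargest(10, lengths.items(), key=lambda kv: kv[1]):
    print(count, line.split())
Note that identical lines collapse into a single dictionary key, which matches the key = line idea above.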

check if a string from a file exists in a list of list of strings: python

I am reading a .csv file and saving it to a matrix called csvfile, and the matrix contents look like this (abbreviated, there are dozens of records):
[['411-440854-0', '411-440824-0', '411-441232-0', '394-529791', '394-529729', '394-530626'], <...>, ['394-1022430-0', '394-1022431-0', '394-1022432-0', '***another CN with a switch in between'], ['394-833938-0', '394-833939-0', '394-833940-0'], <...>, ['394-1021830-0', '394-1021831-0', '394-1021832-0', '***Sectionalizer end connections'], ['394-1022736-0', '394-1022737-0', '394-1022738-0'], <...>, ['394-1986420-0', '394-1986419-0', '394-1986416-0', '***weird BN line check'], ['394-1986411-0', '394-1986415-0', '394-1986413-0'], <...>, ['394-529865-0', '394-529686-0', '394-530875-0', '***Sectionalizer end connections'], ['394-830900-0', '394-830904-0', '394-830902-0'], ['394-2350772-0', '394-2350776-0', '394-2350774-0', '***Sectionalizer present but no end break'], <...>]
and I am reading a text file into a variable called textfile and the content looks like this:
...
object underground_line {
    name SPU123-394-1021830-0-sectionalizer;
    phases AN;
    from SPU123-391-670003;
    to SPU123-395-899674_sectionalizernode;
    length 26.536;
    configuration SPU123-1/0CN15-AN;
}
object underground_line {
    name SPU123-394-1021831-0-sectionalizer;
    phases BN;
    from SPU123-391-670002;
    to SPU123-395-899675_sectionalizernode;
    length 17.902;
    configuration SPU123-1/0CN15-BN;
}
object underground_line {
    name SPU123-394-1028883-0-sectionalizer;
    phases CN;
    from SPU123-391-542651;
    to SPU123-395-907325_sectionalizernode;
    length 771.777;
    configuration SPU123-1CN15-CN;
}
...
I want to see if a portion of the name line in textfile (anything after SPU123- and before -0-sectionalizer) exists in the csvfile matrix. If it does not exist, I want to do something (increment a counter). I tried several ways, including the one below:
counter = 0
for noline in textfile:
    if 'name SPU123-' in noline:
        if '-' in noline[23]:
            if ((noline[13:23] not in s[0]) and (noline[13:23] not in s[1]) and (noline[13:23] not in s[2]) for s in csvfile):
                counter = counter + 1
        else:
            if ((noline[13:24] not in s[0]) and (noline[13:24] not in s[1]) and (noline[13:-24] not in s[2]) for s in csvfile):
                counter = counter + 1
print counter
This is not working. I also tried with if any((noline......) in the above code sample and it doesn't work either.
Checking for a string s in a list of lists l:
>>> l = [['str', 'foo'], ['bar', 'so']]
>>> s = 'foo'
>>> any(s in x for x in l)
True
>>> s = 'nope'
>>> any(s in x for x in l)
False
Implementing this into your code (assuming that noline[13:23] is the string your are wanting search for, and then increment counter if it is not in csvfile):
counter = 0
for noline in textfile:
    if 'name SPU123-' in noline:
        if '-' in noline[23]:
            if not any(noline[13:23] in x for x in csvfile) and not any(noline[13:23] + '-0' in x for x in csvfile):
                counter += 1
        else:
            if not any(noline[13:24] in x for x in csvfile) and not any(noline[13:24] + '-0' in x for x in csvfile):
                counter += 1
Since your matrix includes loads upon loads of values, it's very slow to iterate over it all each time.
Assemble your values into a mapping instead (a set in this case since there are no associated data) since hash table lookups are very fast:
import re
s = {v for r in matrix for v in r if re.match(r'\d[-\d]+\d$', v)}  # or any filter more appropriate for your notion of valid identifiers

if noline[13:23] in s:  # parsing the identifiers instead would be more fault-tolerant
    # do something
Due to the preliminary step, this will only start outperforming the brute-force approach beyond a certain scale.
import re, itertools
Flatten csvfile -- data is an iterator:
data = itertools.chain.from_iterable(csvfile)
Extract the relevant items from data and make it a set for performance (avoids iterating over data multiple times):
data_rex = re.compile(r'\d{3}-\d+')
data = {match.group() for match in itertools.imap(data_rex.match, data) if match}
Quantify the names that are not in data:
def predicate(match, data=data):
    '''Return True if match not found in data'''
    return match.group(1) not in data

# after SPU123- and before -0-
name = re.compile(r'name SPU123-(\d{3}-\d+)-')
names = name.finditer(textfile)
# quantify
print sum(itertools.imap(predicate, names))
