Python list comprehensions with regular expressions on a text - python

I have a text file which includes integers in the text. There are one or more integers in a line or none. I want to find these integers with regular expressions and compute the sum.
I have managed to write the code:
import re
doc = raw_input("File Name:")
text = open(doc)
lst = list()
total = 0
for line in text:
    nums = re.findall("[0-9]+", line)
    if len(nums) == 0:
        continue
    for num in nums:
        num = int(num)
        total += num
print total
But I also want to know the list comprehension version. Can someone help?

Since you want to calculate the sum of the numbers after you find them, it's better to use a generator expression with re.finditer() within sum(). Also, if the file is not very large, it is better to read it all at once rather than one line at a time.
import re
doc = raw_input("File Name:")
with open(doc) as f:
    text = f.read()
total = sum(int(g.group(0)) for g in re.finditer(r'\d+', text))
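For the list comprehension version the question asks about, here is a minimal sketch written for Python 3. The sample text and the name sample_text are made up for illustration and stand in for the contents of the file.

```python
import re

# Hypothetical sample standing in for the contents of the file.
sample_text = "Why 3 should 41 you learn\nno numbers here\n7 to write 12 programs?"

# List comprehension: one int per run of digits, collected into a list.
numbers = [int(n) for n in re.findall(r"[0-9]+", sample_text)]
total = sum(numbers)

print(numbers)
print(total)
```

The list is only needed if you want to keep the individual numbers. For the sum alone, a generator expression avoids building it.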

Related

Finding numbers in txt by reg expression, small problem

I wrote the code below; however, it finds only the first number in each line, and I am kind of stuck. If there are 2 or more numbers in a line, I'm getting only 1. What am I doing wrong? I am a beginner.
import re
fhand = open('text2.txt','r')
numlist = list()
total = 0
for line in fhand:
    line = line.rstrip()
    numbers = re.findall(r'[0-9]+', line)
    if len(numbers) < 1: continue
    for element in numbers :
        num = float(numbers[0])
        if num not in numlist:
            numlist.append(num)
        else : continue
sumlist = sum(numlist)
print(numlist)
print(sumlist)
http://py4e-data.dr-chuck.net/regex_sum_228867.txt is the text file I am using. My sum is 191882, but the result should be much bigger, because my code reads only the first number from each line. Cheers guys, I will be grateful.
melpomene already answered in the comments, but in case you need to see it, change your code to
for element in numbers :
    num = float(element)
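How about this? It reads the whole file at once. It also passes re.M, the multi-line flag, although that flag only changes how ^ and $ match, so it has no effect on this pattern.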
import re
with open('text2.txt') as f:
    s = sum(map(float,re.findall(r'[0-9]+', f.read(), re.M)))
print(s)
Returns:
425922.0
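As a quick check of what re.M does here, the following Python 3 sketch runs the same pattern with and without the flag. The two-line sample string is made up and is not the real data file.

```python
import re

# Hypothetical two-line sample; not the real data file.
sample = "3 apples and 41 pears\n7 plums"

without_flag = re.findall(r"[0-9]+", sample)
with_flag = re.findall(r"[0-9]+", sample, re.M)

# re.M only changes where ^ and $ match, and this pattern uses neither,
# so both calls find the same numbers across all lines.
print(without_flag == with_flag)
print(sum(map(float, without_flag)))
```

Reading the whole file with f.read() is what lets a single findall call see every line; the flag is not needed for that.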

How to extract all the numbers from a text file using re.findall() and compute the sum using a for-loop?

The basic outline of this problem is to read the file, look for integers using re.findall() with the regular expression [0-9]+, then convert the extracted strings to integers and sum them. I'm getting a different outcome; the sum is supposed to end with 209. Also, how can I simplify my code? Thanks (here is the txt file http://py4e-data.dr-chuck.net/regex_sum_167791.txt)
import re
hand = open("regex_sum_167791.txt")
total = 0
count = 0
for line in hand:
    count = count+1
    line = line.rstrip()
    x = re.findall("[0-9]+", line)
    if len(x)!= 1 : continue
    num = int(x[0])
    total = num + total
print(total)
Assuming that you need to sum all the numbers in your txt:
import re
total = 0
with open("regex_sum_167791.txt") as f:
    for line in f:
        total += sum(map(int, re.findall(r"\d+", line)))
print(total)
# 417209
Logic
To start with, use with when you open a file, so that the file is closed automatically once the block is done.
The following lines were removed as they seemed redundant:
count = count+1: Not used.
line = line.rstrip(): re.findall takes care of extraction, so you don't have to worry about stripping lines.
if len(x)!= 1 : continue: It seems you wanted to skip lines with no digits, but this condition also skips lines with more than one number. Since sum(map(int, re.findall("\d+", line))) returns zero for a line without digits, the check is unnecessary.
num = int(x[0]): Finally, this grabs only the first number from the line. If two or more numbers are found in a single line, this won't serve the original purpose. And since int cannot be applied directly to a list, I used map(int, ...).
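To see why no explicit check for empty lines is needed, here is a small Python 3 sketch. The sample lines are invented for illustration.

```python
import re

# Hypothetical sample lines: no number, one number, two numbers.
lines = ["no digits here", "one 5", "two 10 and 20"]

# re.findall returns an empty list when nothing matches,
# and sum() of an empty sequence is 0.
per_line = [sum(map(int, re.findall(r"\d+", line))) for line in lines]

print(per_line)
print(sum(per_line))
```

A line without digits simply contributes 0 to the total, so it does not have to be skipped.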
You were almost there:
import re
hand = open("regex_sum_167791.txt")
total = 0
for line in hand:
    line = line.rstrip()
    x = re.findall("[0-9]+", line)
    for i in x:
        total += int(i)
print(total)
Answer: 417209

Regular Expression for Finding Numbers in a Line (Python)

I'm just learning about regular expressions and I need to read in a text file and find every instance of a number and find the sum of all the numbers.
import re
sum = 0
list_of_numbers = list()
working_file = open("sample.txt", 'r')
for line in working_file:
    line = line.rstrip()
    working_list = re.findall('[0-9]+', line)
    if len(working_list) != 1:
        continue
    print(working_list)
    for number in working_list:
        num = int(number)
        list_of_numbers.append(num)
for number in list_of_numbers:
    sum += number
print(sum)
I put the print(working_list) in order to try and debug it and see if all the numbers are getting found correctly and I've seen, by manually scanning the text file, that some numbers are being skipped while others are not. I'm confused as to why as I thought my regular expression guaranteed that any string with any amount of digits will be added to the list.
Here is the file.
You're only accepting lines that have exactly one number, so a line with two numbers will be skipped because of if len(working_list) != 1: continue, which basically says "if there isn't EXACTLY one number on this line, then skip it". You may have meant something like if len(working_list) < 1: continue
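The difference between the two conditions can be sketched in Python 3 as follows. The sample lines and the helper name total_with are hypothetical and only serve the comparison.

```python
import re

# Hypothetical sample lines: no number, one number, two numbers.
lines = ["none", "one 5", "two 10 and 20"]

def total_with(skip):
    """Sum all numbers, skipping lines for which skip(found) is true."""
    total = 0
    for line in lines:
        found = re.findall("[0-9]+", line)
        if skip(found):
            continue
        total += sum(int(n) for n in found)
    return total

# Skips every line that does not hold exactly one number.
only_exactly_one = total_with(lambda found: len(found) != 1)
# Skips only the lines that hold no number at all.
skip_empty = total_with(lambda found: len(found) < 1)

print(only_exactly_one, skip_empty)
```

With != 1 the line containing 10 and 20 is dropped entirely, which is why some numbers appear to be skipped.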
I would do it like:
import re
digits_re = re.compile(r'(\d+(\.\d+)?)')
with open("sample.txt", 'r') as fh:
    numbers = [float(match[0]) for match in digits_re.findall(fh.read())]
print(sum(numbers))
or, since you're working with ints, just
import re
digits_re = re.compile(r'(\d+)')
with open("sample.txt", 'r') as fh:
    # With a single group, findall returns plain strings, so convert each match directly.
    numbers = [int(match) for match in digits_re.findall(fh.read())]
print(sum(numbers))
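What re.findall returns depends on how many capturing groups the pattern has, which matters when indexing a match with [0]. A short Python 3 sketch, using a made-up sample string:

```python
import re

text = "pay 12.5 now, 7 later"  # hypothetical sample

# Two capturing groups: each result is a tuple (whole number, fractional part).
two_groups = re.findall(r"(\d+(\.\d+)?)", text)
floats = [float(m[0]) for m in two_groups]

# One capturing group: each result is a plain string, so m[0] would be
# only its first character. Convert the whole string instead.
one_group = re.findall(r"(\d+)", text)
ints = [int(m) for m in one_group]

print(two_groups, floats)
print(one_group, ints)
```

Note that the integer pattern splits 12.5 into 12 and 5, which is fine only when the file contains whole numbers.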
import re
h = open('file.txt')
nos = list()
for ln in h:
    fi = re.findall('[0-9]+', ln)
    for i in fi:
        nos.append(int(i))
print('Sum:', sum(nos))

Regular Expressions assignment

I am taking an online class for python. I've been at this for 2 weeks. I've written the following code to find numbers in a sample text document. The problem I'm having is when I move from line to line and run the regex, it finds the first set of numbers, then skips any remaining numbers on the line and moves to the next line where it matches only the first number on the line. My code is below:
#!/usr/bin/python
import re
try:
    fname = raw_input("Enter file name: ")
    fh = open(fname)
except:
    print 'Invalid Input'
    quit()
numlist = list()
for line in fh:
    nums = re.findall('[0-9]+',line)
    if len(nums) < 1 : continue
    num = int(nums[0])
    numlist.append(num)
print (numlist)
You are explicitly telling it to skip all numbers but the first:
num = int(nums[0])
Instead, use a list comprehension to coerce to int and append the entire list using extend().
numlist.extend([int(x) for x in nums])
As others already noted, you're discarding all other numbers in the list and taking only the first element. You can use the map function to convert the numbers to int and then extend the list
for line in fh:
    nums = re.findall('[0-9]+',line)
    if len(nums) < 1 : continue
    nums = map(int, nums)
    numlist.extend(nums)
The problem is that you're not looping on nums, but only appending the first item in the nums list.
To solve this, you should iterate on nums and append each item.
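The difference between appending only the first match and adding every match can be sketched like this in Python 3. The sample lines are invented for illustration.

```python
import re

lines = ["1 and 22", "nothing", "333"]  # hypothetical sample lines

first_only = []
all_numbers = []
for line in lines:
    nums = re.findall("[0-9]+", line)
    if len(nums) < 1:
        continue
    # Keeps just the first number of the line (the behaviour in the question).
    first_only.append(int(nums[0]))
    # Adds every number found on the line.
    all_numbers.extend(int(x) for x in nums)

print(first_only)
print(all_numbers)
```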

Why is this not correct? (codeeval challenge)PYTHON

This is what I have to do https://www.codeeval.com/open_challenges/140/
I've been on this challenge for three days, please help. It is 85-90% partially solved, but not 100% solved. Why?
This is my code:
import sys
test_cases = open(sys.argv[1], 'r')
for test in test_cases:
    saver=[]
    text=""
    textList=[]
    positionList=[]
    num=0
    exists=int()
    counter=0
    for l in test.strip().split(";"):
        saver.append(l)
    for i in saver[0].split(" "):
        textList.append(i)
    for j in saver[1].split(" "):
        positionList.append(j)
    for i in range(0,len(positionList)):
        positionList[i]=int(positionList[i])
    accomodator=[None]*len(textList)
    for n in range(1,len(textList)):
        if n not in positionList:
            accomodator[n]=textList[len(textList)-1]
            exists=n
    for item in positionList:
        accomodator[item-1]=textList[counter]
        counter+=1
        if counter>item:
            accomodator[exists-1]=textList[counter]
    for word in accomodator:
        text+=str(word) + " "
    print text
test_cases.close()
This code works for me:
import sys
def main(name_file):
    _file = open(name_file, 'r')
    text = ""
    while True:
        try:
            line = _file.next()
            disordered_line, numbers_string = line.split(';')
            numbers_list = map(int, numbers_string.strip().split(' '))
            missing_number = sum(xrange(sorted(numbers_list)[0],sorted(numbers_list)[-1]+1)) - sum(numbers_list)
            if missing_number == 0:
                missing_number = len(disordered_line)
            numbers_list.append(missing_number)
            disordered_list = disordered_line.split(' ')
            string_position = zip(disordered_list, numbers_list)
            ordered = sorted(string_position, key = lambda x: x[1])
            text += " ".join([x[0] for x in ordered])
            text += "\n"
        except StopIteration:
            break
    _file.close()
    print text.strip()
if __name__ == '__main__':
    main(sys.argv[1])
I'll try to explain my code step by step, so maybe you can see the difference between your code and mine:
while True
A loop that breaks when there are no more lines.
try:
I put the code inside a try and catch the StopIteration exception, because it is raised when there are no more items in the iterator.
line = _file.next()
Use the file as an iterator, so that you do not load all the lines into memory at once.
disordered_line, numbers_string = line.split(';')
Get the unordered phrase and the numbers of every string's position.
numbers_list = map(int, numbers_string.strip().split(' '))
Convert every number from string to int
missing_number = sum(xrange(sorted(numbers_list)[0],sorted(numbers_list)[-1]+1)) - sum(numbers_list)
Get the missing number from the series of numbers; that missing number is the position of the last string in the phrase.
if missing_number == 0:
    missing_number = len(disordered_line)
Check whether the missing number is equal to 0. If so, the number that is really missing is the last position, that is, the number of strings that make up the phrase.
numbers_list.append(missing_number)
Append the missing number to the list of numbers.
disordered_list = disordered_line.split(' ')
Convert the disordered phrase into a list.
string_position = zip(disordered_list, numbers_list)
Combine every string with its respective position.
ordered = sorted(string_position, key = lambda x: x[1])
Order the combined list by the position of the string.
text += " ".join([x[0] for x in ordered])
Concatenate the ordered phrase. The remaining code is easy to understand.
UPDATE
Having looked at your code, here is my opinion on what might solve your problem.
split already returns a list, so you do not have to loop over the split content to add it to another list.
So these six lines:
for l in test.strip().split(";"):
    saver.append(l)
for i in saver[0].split(" "):
    textList.append(i)
for j in saver[1].split(" "):
    positionList.append(j)
can be converted into three:
splitted_test = test.strip().split(';')
textList = splitted_test[0].split(" ")
positionList = map(int, splitted_test[1].split(" "))
In the line positionList = map(int, splitted_test[1].split(" ")) you already convert the numbers into int, so you save these two lines:
for i in range(0,len(positionList)):
    positionList[i]=int(positionList[i])
The next lines:
accomodator=[None]*len(textList)
for n in range(1,len(textList)):
    if n not in positionList:
        accomodator[n]=textList[len(textList)-1]
        exists=n
can be converted into the next four:
missing_number = sum(xrange(sorted(positionList)[0],sorted(positionList)[-1]+1)) - sum(positionList)
if missing_number == 0:
    missing_number = len(textList)
positionList.append(missing_number)
Basically, what these lines do is calculate the missing number in the series of numbers, so that the length of the series is the same as that of textList.
The next lines:
for item in positionList:
    accomodator[item-1]=textList[counter]
    counter+=1
    if counter>item:
        accomodator[exists-1]=textList[counter]
for word in accomodator:
    text+=str(word) + " "
Can be replaced by these ones:
string_position = zip(textList, positionList)
ordered = sorted(string_position, key = lambda x: x[1])
text += " ".join([x[0] for x in ordered])
text += "\n"
This way you save lines and memory. Also use xrange instead of range.
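The steps above can be sketched in Python 3 roughly as follows. This is not the code from the answer: the function name recover and the sample inputs are hypothetical, and it finds the missing position with a set difference instead of the sum-based calculation, assuming positions run from 1 to the number of words.

```python
def recover(line):
    """Reorder the words of one 'words;positions' line (illustrative sketch)."""
    words_part, positions_part = line.strip().split(";")
    words = words_part.split(" ")
    positions = [int(p) for p in positions_part.split()]

    # Assumption: positions are 1..len(words) with exactly one left out.
    # The position that is not listed belongs to the last word.
    missing = (set(range(1, len(words) + 1)) - set(positions)).pop()
    positions.append(missing)

    # Pair each position with its word and sort by position.
    ordered = sorted(zip(positions, words))
    return " ".join(word for _, word in ordered)


print(recover("world hello;2"))
print(recover("2000 and was not However, implemented 1998 it until;9 8 3 4 1 5 7 2"))
```

The set difference also handles the case where the missing position is the first or the last one, without a special branch.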
The factors that make your code pass only partially could be:
The number of lines in the script.
The time your script takes.
The amount of memory your script uses.
What you could do is:
Use generators. #You save memory
Reduce the number of for loops; this way you save lines of code and time.
If you think something could be made simpler, do it.
Do not reinvent the wheel; if something has already been written, use it.
