How can I extract numbers containing commas from strings in Python

I am trying to find all numbers in text and return them in a list of floats.
In the text:
Commas are used to separate thousands
Several consecutive numbers are separated by a comma and a space
Numbers can be attached to words
My code extracts numbers separated by a comma and a space, as well as numbers attached to words.
However, it splits numbers that use commas as thousands separators into separate numbers:
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
list(map(int, re.findall(r'\d+', text)))
The suggestions below work beautifully.
Unfortunately, the code below returns a list of strings:
nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
print(nums)
I need to return the output as a list of floats, separated by commas and without quotation marks.
For example,
extract_numbers("1, 2, 3, un pasito pa'lante Maria")
should return [1.0, 2.0, 3.0]
Unfortunately, I have not yet been successful in my attempts. Currently, my code reads
def extract_numbers(text):
    nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
    return (("[{0}]".format(
        ', '.join(map(str, nums)))))
extract_numbers(TEXT_SAMPLE)

You may try doing a regex re.findall search on the following pattern:
\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)
Sample script:
import re
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
print(nums)
This prints:
['30', '10', '1', '2', '137', '40', '2,137,040']
Here is an explanation of the regex pattern:
\b word boundary
\d{1,3} match 1 to 3 leading digits
(?:,\d{3})* followed by zero or more thousands terms
(?:\.\d+)? match an optional decimal component
(?!\d) negative lookahead: assert the end of the number by requiring that the next character is not a digit
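To get floats rather than strings from this pattern, one option is to strip the thousands separators from each match before converting. This is only a minimal sketch: the function name extract_numbers is taken from the question, and NUMBER_RE is a name introduced here for illustration.

```python
import re

# Pattern from the answer above: 1-3 leading digits, optional ",ddd" groups,
# an optional decimal part, and no digit immediately after the match.
NUMBER_RE = re.compile(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)')

def extract_numbers(text):
    """Return every number found in text as a float."""
    # Remove the thousands separators so float() can parse each match.
    return [float(m.replace(',', '')) for m in NUMBER_RE.findall(text)]

print(extract_numbers("1, 2, 3, un pasito pa'lante Maria"))
# [1.0, 2.0, 3.0]
```

Returning the list itself, rather than a formatted string, is what makes the result print without quotation marks.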

Use a character class, [\d,], that matches digits and commas, then strip the commas before converting.
Code:
import re
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
out = [
    int(match.replace(',', ''))
    for match in re.findall(r'[\d,]+', text)
]
print(out)
Output
[30, 10, 1, 2, 137, 40, 2137040]

You need to match the commas as well, then strip them before converting each match to an integer:
list(map(lambda n: int(n.replace(',','')), re.findall(r'[\d,]+', text)))
A list comprehension expresses the same thing more readably:
[int(n.replace(',', '')) for n in re.findall(r'[\d,]+', text)]
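One caveat with the [\d,]+ pattern: it also matches a comma that stands on its own (for example after a word), and int('') then raises ValueError. A hedged sketch of one way to guard against that is below; extract_ints is a hypothetical helper name, not something from the answers above.

```python
import re

def extract_ints(text):
    """Extract integers, treating commas inside a token as thousands separators."""
    tokens = re.findall(r'[\d,]+', text)
    # Skip tokens such as "," that contain no digit at all;
    # int('') would otherwise raise ValueError.
    return [
        int(tok.replace(',', ''))
        for tok in tokens
        if any(c.isdigit() for c in tok)
    ]

print(extract_ints("yes, 1,000 and 2, 3"))
# [1000, 2, 3]
```

This still merges numbers written as "1,2" with no space into 12, so the stricter \d{1,3}(?:,\d{3})* pattern may be preferable when that matters.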

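Why not use the following?
array = re.findall(r'[0-9]+', text)
Note that this splits a comma-grouped number such as 2,137,040 into three separate matches.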

Related

Returning a fixed row (number) of numbers in a string?

In this data:
['23 2312 dfr tr 133',
'2344 fdeed',
'der3212fr342 96']
I would like a function that returns the values containing exactly a certain number of digits in a row (no more, no less). Spaces and other text don't matter. For example:
2 digits in a row:
['23', '', '96']
3 digits in a row:
['133', '', '342']
4 digits in a row:
['2312', '2344', '3212']
Thank you
One way could be using re.findall to extract the contiguous digits from the strings and keep those which have length n:
l = ['23 2312 dfr tr 133',
'2344 fdeed',
'der3212fr342 96']
import re
def length_n_digits(l,n):
    return [s for i in l for s in
            re.findall(rf'(?<!\d)\d{{{n}}}(?!\d)', i) or ['']]
Note that the doubled braces {{ and }} in the f-string produce literal braces, while the inner {n} is interpolated, so for n = 2 the pattern becomes (?<!\d)\d{2}(?!\d). The lookarounds (?<!\d) and (?!\d) ensure that the run of n digits is not preceded or followed by another digit.
length_n_digits(l, 2)
# ['23', '', '96']
length_n_digits(l, 3)
# ['133', '', '342']
length_n_digits(l, 4)
# ['2312', '2344', '3212']
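To see what the f-string in this answer expands to, it can help to print the pattern for a concrete n. A small sketch, using n = 3 and one of the sample strings purely as an illustration:

```python
import re

n = 3
# "{{" and "}}" become literal braces; "{n}" is replaced by the value of n.
pattern = rf'(?<!\d)\d{{{n}}}(?!\d)'
print(pattern)
# (?<!\d)\d{3}(?!\d)

# Only a run of exactly three digits matches; 3212 and 96 are skipped.
print(re.findall(pattern, 'der3212fr342 96'))
# ['342']
```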

Extracting Numbers from Formatted Strings with Unusual Delimiters in Python

How can I get the numbers in a formatted string like the following in Python? It has a mixed combination of delimiters such as tab, parenthesis, cm, space, and #.
I used the following code but it does not split the numbers correctly.
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
parts = re.split('\s|(?<!\d)[,.](?!\d)', s)
print(parts)
['1.0000e+036', '(1.2365e-004,6.3265e+003cm)', '(2.3659e-002,', '2.3659e-002#)']
I am trying to extract:
[1.0000e+036, 1.2365e-004, 6.3265e+003, 2.3659e-002, 2.3659e-002]
Could someone kindly help?
Update:
I tried the following regular expression, which fails to handle numbers with a positive exponent:
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
match_number = re.compile('-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')
final_list = [float(x) for x in re.findall(match_number, s)]
print(final_list)
[1.0, 36.0, 0.00012365, 6.3265, 3.0, 0.023659, 0.023659]
As can be seen, the first number is 1e36 which was parsed as two numbers 1.0 and 36.0.
You don't need to treat those items as delimiters. Rather, all you appear to need is a regex that extracts every float in the line (including exponential / engineering notation) and simply ignores the remaining characters. Comprehensive regular expressions for numbers are readily available online with a simple search.
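As an illustration of that approach, here is a minimal sketch of a float pattern that accepts both "+" and "-" in the exponent, which is what the pattern in the question's update is missing. FLOAT_RE is a name chosen here, and the pattern is one reasonable choice rather than a complete grammar for numeric literals.

```python
import re

# Optional sign, digits with an optional fraction (or a bare fraction),
# then an optional exponent whose sign may be "+" or "-".
FLOAT_RE = re.compile(r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')

s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
values = [float(x) for x in FLOAT_RE.findall(s)]
print(values)  # five floats, the first being 1e36
```

Because findall only returns the numeric matches, the parentheses, "cm", "#" and whitespace never need to be listed as delimiters.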

How to separate negative floating numbers which are hyphen separated using python?

I have a list containing floating numbers (positive or negative) which are separated by a hyphen. I would like to split them.
For example:
input: -76.833-106.954, -76.833--108.954
output: -76.833,106.954,-76.833,-108.954
I've tried re.split(r"([-+]?\d*\.)-", but it doesn't work. I get an invalid literal statement for int()
Please let me know what code you would recommend. Thank you!
Completing @PyHunterMan's answer:
You want at most one optional hyphen before each number, indicating a negative float:
import re
target = '-76.833-106.954, -76.833--108.954, 83.4, -92, 76.833-106.954, 76.833--108.954'
pattern = r'(-?\d+\.\d+)' # Find all floats, each with at most one optional hyphen at the beginning (extra hyphens are ignored)
match = re.findall(pattern, target)
numbers = [float(item) for item in match]
print(numbers)
>>> [-76.833, -106.954, -76.833, -108.954, 83.4, 76.833, -106.954, 76.833, -108.954]
You will notice this does not catch -92: although -92 is a real number, it is not written with a decimal point.
If you also want to catch integers such as -92, use:
import re
input_ = '-76.833-106.954, -76.833--108.954, 83.4, -92, 76.833-106.954, 76.833--108.954'
pattern = r'(-?\d+(\.\d+)?)'
match = re.findall(pattern, input_)
print(match)
result = [float(item[0]) for item in match]
print(result)
>>> [-76.833, -106.954, -76.833, -108.954, 83.4, -92.0, 76.833, -106.954, 76.833, -108.954]
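Note that the expected output in the question treats a hyphen that directly follows a digit as a separator, so -76.833-106.954 should yield 106.954 rather than -106.954. A hedged sketch of one way to get that behaviour is to accept the hyphen as a sign only when it is not preceded by a digit; SIGNED_RE and split_numbers are hypothetical names used for illustration.

```python
import re

# A "-" counts as a sign only when it does not directly follow a digit;
# otherwise it is treated as the separator between two numbers.
SIGNED_RE = re.compile(r'(?:(?<!\d)-)?\d+(?:\.\d+)?')

def split_numbers(text):
    """Return the numbers in text as floats, honouring separator hyphens."""
    return [float(m) for m in SIGNED_RE.findall(text)]

print(split_numbers('-76.833-106.954, -76.833--108.954'))
# [-76.833, 106.954, -76.833, -108.954]
```

The non-capturing groups also mean findall returns plain strings, so no item[0] indexing is needed.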

Extract Numeric Data from a Text file in Python

Say I have a text file with the data/string:
Dataset #1: X/Y= 5, Z=7 has been calculated
Dataset #2: X/Y= 6, Z=8 has been calculated
Dataset #10: X/Y =7, Z=9 has been calculated
I want the output to be on a csv file as:
X/Y, X/Y, X/Y
Which should display:
5, 6, 7
Here is my current approach using string.find, but it feels like a difficult way to solve this problem:
data = open('TestData.txt').read()
#index of string
counter = 1
if (data.find('X/Y=')==1):
    #extracts segment out of string
    line = data[r+6:r+14]
    r = data.find('X/Y=')
    counter += 1
    print line
else:
    r = data.find('X/Y')
    line = data[r+6:r+14]
    for x in range(0,counter):
        print line
print counter
Error: For some reason, I'm only getting the value 5. When I set up a loop, I get infinite 5's.
If you want the numbers and your txt file is formatted like the first two lines, i.e. X/Y= 6 rather than X/Y =7:
import re
result = []
with open("TestData.txt") as f:
    for line in f:
        # Lookbehind: the digits must be preceded by "Y=" and one whitespace character "\s".
        s = re.search(r'(?<=Y=\s)\d+', line)
        if s:  # re.search returns None when there is no match
            result.append(s.group())
print(result)
['5', '6', '7']
To match the pattern in your comment, you should escape the period as \. or you will also match strings like 1.2+3, because an unescaped "." has a special meaning in re: it matches any character.
So re.search(r'(?<=Counting Numbers =\s)\d\.\d\.\d',s).group()
will return only 1.2.3
If it makes it more explicit, you can use s=re.search(r'(?<=X/Y=\s)\d+',line) using the full X/Y=\s pattern.
Using the original line in your comment and updated line would return :
['5', '6', '7', '5', '5']
The (?<=Y=\s) is called a positive lookbehind assertion.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position
There are lots of nice examples in the re documentation. The text matched inside the lookbehind is not included in the returned match.
Since it appears that the entities are all on a single line, I would recommend using readline in a loop to read the file line-by-line and then using a regex to parse out the components you're looking for from that line.
Edit re: OP's comment:
One regex pattern that could be used to capture the number given the specified format in this case would be: X/Y\s*=\s*(.+),
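A line-by-line approach along these lines could be sketched as follows. This is an assumption-laden example: the sample lines are copied from the question, an in-memory buffer stands in for the CSV file, the capture group is narrowed to digits, and XY_RE and extract_xy are names introduced here.

```python
import csv
import io
import re

# Tolerates whitespace on either side of "=", e.g. "X/Y= 5" and "X/Y =7".
XY_RE = re.compile(r'X/Y\s*=\s*(\d+)')

def extract_xy(lines):
    """Return the X/Y value from each line that has one, as strings."""
    return [m.group(1) for m in map(XY_RE.search, lines) if m]

sample = [
    "Dataset #1: X/Y= 5, Z=7 has been calculated",
    "Dataset #2: X/Y= 6, Z=8 has been calculated",
    "Dataset #10: X/Y =7, Z=9 has been calculated",
]
values = extract_xy(sample)

# Write one CSV row; io.StringIO stands in for the real output file.
buffer = io.StringIO()
csv.writer(buffer).writerow(values)
print(buffer.getvalue().strip())
# 5,6,7
```

With a real file, the list of lines can be replaced by the open file object, since iterating over it yields one line at a time.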

splitting string in Python (2.7)

I have a string such as the one below:
26 (passengers:22 crew:4)
or
32 (passengers:? crew: ?)
What I'm looking to do is split up the string so that just the numbers representing the number of passengers and crew are extracted. If it's a question mark, I'd like it to be replaced by "".
I'm aware I can use string.replace("?", "") to replace the ? however how do I go about extracting the numeric characters for crew or passengers respectively? The numbers may vary from two digits to three so I can't slice the last few characters off the string or at a specific interval.
Thanks in advance
A regular expression to match those would be:
r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
with some extra whitespace tolerance thrown in.
Results:
>>> import re
>>> numbers = re.compile(r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)')
>>> numbers.search('26 (passengers:22 crew:4)').groups()
('22', '4')
>>> numbers.search('32 (passengers:? crew: ?)').groups()
('?', '?')
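The question also asks for a question mark to be replaced by "". One possible way to do that on top of this regex is sketched below; COUNTS_RE and parse_counts are hypothetical names, and returning None for non-matching input is an assumption about the desired behaviour.

```python
import re

# Same idea as the pattern above: each count is 1-3 digits or a literal "?".
COUNTS_RE = re.compile(
    r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
)

def parse_counts(text):
    """Return (passengers, crew) as strings, with '?' replaced by ''."""
    match = COUNTS_RE.search(text)
    if match is None:
        return None  # the input does not have the expected shape
    return tuple('' if g == '?' else g for g in match.groups())

print(parse_counts('26 (passengers:22 crew:4)'))   # ('22', '4')
print(parse_counts('32 (passengers:? crew: ?)'))   # ('', '')
```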
