Given the following code:
import re
file_object = open("all-OANC.txt", "r")
file_text = file_object.read()
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.findall(pattern, file_text):
    print match
I get output that stretches like this:
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
I'm trying to find phone numbers, and I am one hundred percent sure there are numbers in the file. When I search the same text in an online regex tester with the same expression, for example, I get matches.
Here is a snippet where the expression finds a match outside of Python:
"Slate on Paper," our
specially formatted print-out version of Slate, is e-mailed to readers
Friday around midday. It also can be downloaded from our
site. Those services are free. An actual paper edition of "Slate on Paper"
can be mailed to you (call 800-555-4995), but that costs money and can take a
few days to arrive."
I want output that at least recognizes the presence of a number.
It's your capture groups that are being displayed. Display the whole match:
text = '''"Slate on Paper," our specially formatted print-out version of Slate, is e-mailed to readers Friday around midday. It also can be downloaded from our site. Those services are free. An actual paper edition of "Slate on Paper" can be mailed to you (call 800-555-4995), but that costs money and can take a few days to arrive."'''
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.finditer(pattern,text):
print(match.group())
Output:
800-555-4995
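If you would rather stay with re.findall, one alternative (not part of the original answer) is to make every group non-capturing with (?:...), so that findall returns the whole match instead of a tuple of groups:
import re

text = 'can be mailed to you (call 800-555-4995), but that costs money'
# Non-capturing groups (?:...) leave findall with no groups to report,
# so it returns the full match for each occurrence.
pattern = r"(?:\+?1-)?(?:\()?[0-9]{3}(?:\))?(?:-|.)[0-9]{3}(?:-|.)[0-9]{4}"
print(re.findall(pattern, text))  # ['800-555-4995']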
On this page there is a table where each row has a plus "+" button; if you click it, more detailed info is displayed (the info I need).
Every plus "+" button has the same class, and only one piece of detailed info can be displayed at a time. So if I have clicked the first button, then the moment I click the second plus "+" button, the detailed info of the first row disappears (closes).
The thing is, every time I click a button there is some movement on the page, and I suspect that may be the cause, but it makes no sense to me. Why do I suspect this? Because the first item in the list is text from another position within the row.
I managed to click each button one by one, but I can't extract the data I want; the result is a list of blank strings like ['other text from row', '', '', '', '', '', '', '', '', ''].
The code I’m using is this:
list = []
elements = driver.find_elements_by_css_selector("span[class='buttonclass']")
for x in range(len(elements)):
    elements[x].click()
    time.sleep(2)
    results = driver.find_element_by_css_selector("td[class='class of info I want']")
    skutxt = results.text
    list.append(skutxt)
print(list)
Terminal shows: ['wrong text', '', '', '', '', '', '', '', '', '', '']
Thank you very much!
Use the Python zip() function for parallel iteration.
Try the following code:
text_list = []
elements = driver.find_elements_by_css_selector("span[class='buttonclass']")
results = driver.find_elements_by_css_selector("td[class='class of info I want']")
for element, result in zip(elements, results):
    element.click()
    time.sleep(2)
    text_list.append(result.text)
print(text_list)
But note that time.sleep(...) is a poor way to wait; you can use an explicit wait instead, as sketched below.
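For example, a minimal sketch using Selenium's WebDriverWait in place of time.sleep; it keeps the selectors from the question and assumes the pre-fetched detail cells stay attached to the DOM after each click:
from selenium.webdriver.support.ui import WebDriverWait

text_list = []
elements = driver.find_elements_by_css_selector("span[class='buttonclass']")
results = driver.find_elements_by_css_selector("td[class='class of info I want']")
wait = WebDriverWait(driver, 10)  # give each detail cell up to 10 seconds

for element, result in zip(elements, results):
    element.click()
    # Instead of a fixed sleep, poll until the detail cell actually has text.
    wait.until(lambda d: result.text.strip() != "")
    text_list.append(result.text)
print(text_list)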
I'm trying to parse output from the GNU strings utility with str.splitlines().
Here is the raw output from GNU strings:
279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n
When I parse the output with the following code:
process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
output = process.stdout
print(output)
lines = output.splitlines()
for line in lines:
    print(line)
I get a result that I don't expect and it breaks my further parsing:
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=
N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N
Can I somehow tell the splitlines() method not to trigger on the \x0c character?
The desired result should have lines which start with an offset (the 6 digits at the start of each line):
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N
I think that you actually get the expected result. But assuming ASCII or any of its derivatives (Latin-x, UTF-8, etc.), '\x0c' is the control character Form Feed, which happens to be rendered here as a one-line vertical jump.
Said differently, I would bet a coin that the resulting file contains the expected bytes, but that your further processing chokes on the control character.
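A quick illustration of the difference (a small example, not part of the original answer):
s = '279412 9k=\x0cN\n'
print(s.splitlines())  # ['279412 9k=', 'N']      -- the form feed starts a new line
print(s.split('\n'))   # ['279412 9k=\x0cN', '']  -- only \n is treated as a separator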
The documentation for str.splitlines() says it splits on a number of line boundary types, including \x0c. If you only want to split on \n explicitly, you could use str.split('\n') instead. However, note that if your text ends with a \n you will end up with an empty final element, so you may want to drop the last index if it is an empty string.
data = '279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
lines = data.split('\n')
if lines[-1] == '':
    lines.pop()
print(lines)
for line in lines:
    print(line)
OUTPUT
['279304 9k=pN', ' 279340 9k=PN', ' 279376 9k<LN', ' 279412 9k=\x0cN', ' 279448 9k<4N']
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
lines = [line.strip() for line in process.stdout.split('\n') if len(line) > 0]
Remove the call to strip() if you do want to keep the leading whitespace on every line.
Your problem arises from using the splitlines method of Unicode strings, which produces different results than the splitlines method of byte strings.
There is a CPython issue open for this problem since 2014: str.splitlines splitting on non-\r\n characters (Issue #66428, python/cpython).
Below I have added a portable splitlines function that uses the traditional ASCII line-break characters for both Unicode and byte strings and works under both Python 2 and Python 3. A poor man's version for efficiency enthusiasts is also provided.
In Python 2, type str is an 8-bit string and Unicode strings have type unicode.
In Python 3, type str is a Unicode string and 8-bit strings have type bytes.
Although there is no actual difference in line splitting between Python 2 and Python 3 Unicode and 8-bit strings, vanilla code running under Python 3 is more likely to run into trouble with the extended universal newlines approach for Unicode strings.
The following table shows which Python data type employs which splitting method.
+--------------+---------------------+---------------------+
| Split Method | Python 2            | Python 3            |
+==============+=====================+=====================+
| ASCII        | str.splitlines      | bytes.splitlines    |
+--------------+---------------------+---------------------+
| Unicode      | unicode.splitlines  | str.splitlines      |
+--------------+---------------------+---------------------+
str_is_unicode = len('a\fa'.splitlines()) > 1
def splitlines(string): # ||:fnc:||
r"""Portable definitive ASCII splitlines function.
In Python 2, type :class:`str` is an 8-bit string and Unicode strings
have type :class:`unicode`.
In Python 3, type :class:`str` is a Unicode string and 8-bit strings
have type :class:`bytes`.
Although there is no actual difference in line splitting between
Python 2 and Python 3 Unicode and 8-bit strings, when running
vanilla code under Python 3, it is more likely to run into trouble
with the extended `universal newlines`_ approach for Unicode
strings.
The following table shows which Python data type employs which
splitting method.
+--------------+---------------------------+---------------------------+
| Split Method | Python 2 | Python 3 |
+==============+===========================+===========================+
| ASCII | `str.splitlines <ssl2_>`_ | `bytes.splitlines`_ |
+--------------+---------------------------+---------------------------+
| Unicode | `unicode.splitlines`_ | `str.splitlines <ssl3_>`_ |
+--------------+---------------------------+---------------------------+
This function provides a portable and definitive method to apply
ASCII `universal newlines`_ for line splitting. The reencoding is
performed to take advantage of splitlines' `universal newlines`_
approach for Unix, DOS and Macintosh line endings.
While the poor man's version of simply splitting on \\n might seem
more performant, it falls short when a mixture of Unix, DOS and
Macintosh line endings is encountered. Just for reference, a
general implementation is presented, which avoids some common
pitfalls.
>>> test_strings = (
... "##\ftrail\n##\n\ndone\n\n\n",
... "##\ftrail\n##\n\ndone\n\n\nxx",
... "##\ftrail\n##\n\ndone\n\nx\n",
... "##\ftrail\r##\r\rdone\r\r\r",
... "##\ftrail\r\n##\r\n\r\ndone\r\n\r\n\r\n")
The global variable :data:`str_is_unicode` determines portably
whether a :class:`str` object is a Unicode string.
.. code-block:: sh
str_is_unicode = len('a\fa'.splitlines()) > 1
This allows us to define some generic conversion functions:
>>> if str_is_unicode:
...     make_native_str = lambda s, e=None: getattr(s, 'decode', lambda _e: s)(e or 'utf8')
...     make_uc_string = make_native_str
...     make_u8_string = lambda s, e=None: ((isinstance(s, str) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
... else:
...     make_native_str = lambda s, e=None: ((isinstance(s, unicode) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
...     make_u8_string = make_native_str
...     make_uc_string = lambda s, e=None: ((not isinstance(s, unicode) and (s.decode('utf8'), 1)) or (s, 1))[0]
For a portable doctest:
>>> for test_string in test_strings:
...     print('--------------------')
...     print(repr(test_string))
...     print(repr([make_native_str(_l) for _l in splitlines(make_u8_string(test_string))]))
...     print(repr([make_native_str(_l) for _l in poor_mans_splitlines(make_u8_string(test_string))]))
...     print([make_native_str(_l) for _l in splitlines(make_uc_string(test_string))])
...     print([make_native_str(_l) for _l in poor_mans_splitlines(make_uc_string(test_string))])
--------------------
'##\x0ctrail\n##\n\ndone\n\n\n'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
--------------------
'##\x0ctrail\n##\n\ndone\n\n\nxx'
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
--------------------
'##\x0ctrail\n##\n\ndone\n\nx\n'
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
--------------------
'##\x0ctrail\r##\r\rdone\r\r\r'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
--------------------
'##\x0ctrail\r\n##\r\n\r\ndone\r\n\r\n\r\n'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
For further details see
- Python 2: `5. Built-in Types - Python 2.7.18 documentation
<https://docs.python.org/2.7/library/stdtypes.html>`_
- Python 3: `Built-in Types - Python 3.10.4 documentation
<https://docs.python.org/3/library/stdtypes.html>`_
.. _`universal newlines`: https://docs.python.org/3/glossary.html
.. _`ssl2`: https://docs.python.org/2.7/library/stdtypes.html#str.splitlines
.. _`unicode.splitlines`: https://docs.python.org/2.7/library/stdtypes.html#unicode.splitlines
.. _`ssl3`: https://docs.python.org/3/library/stdtypes.html#str.splitlines
.. _`bytes.splitlines`: https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
"""
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and not isinstance(string, str))):
        # unicode string
        u8 = string.encode('utf8')
        lines = u8.splitlines()
        return [l.decode('utf8') for l in lines]
    # byte string
    return string.splitlines()
def poor_mans_splitlines(string):
    r"""Split lines on the ASCII line endings \r\n, \n and \r only."""
    if str_is_unicode:
        native_uc_type = str
    else:
        native_uc_type = unicode
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and isinstance(string, native_uc_type))):
        # unicode string
        sep = '\r\n|\n'
        if not re.search(sep, string):
            sep = '\r'
        else:
            # |:info:|
            # if there is a single newline at the end, `$` matches that newline
            # if there are multiple newlines at the end, `$` matches before the last newline
            string += '\n'
        sep_end = '(' + sep + ')$'
        # prevent additional blank line at end
        string = re.sub(sep_end, '', string)
        return re.split(sep, string)
    # byte string
    return string.splitlines()
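As a quick check against the question's data, the portable function above keeps the form feed inside its line (a small illustration, assuming splitlines() has been defined as shown):
data = '279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
print(splitlines(data))
# ['279304 9k=pN', ' 279340 9k=PN', ' 279376 9k<LN', ' 279412 9k=\x0cN', ' 279448 9k<4N']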
I am learning the re module in Python. I have found something that doesn't make sense to me and I don't know why. Here is a small example:
x = re.compile(r'(ha)*')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be nothing, no error. But I expect "hahahaha"
The same happens if I use re.compile(r'(ha)?'):
x = re.compile(r'(ha)?')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be nothing, no error. But I expect "ha"
But if I use re.compile(r'(ha)+'):
x = re.compile(r'(ha)+')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be `hahahaha`, just as expected
Why is this? Aren't re.compile(r'(ha)*') and re.compile(r'(ha)+') the same in this case?
The patterns r'(ha)+' and r'(ha)*' are not identical, and that's why they do not deliver the same result: + means 1 or more repetitions of your pattern, * means zero or more.
re.search returns "nothing" because it only looks at the first match. The first match for * is a zero-length occurrence of your (ha) pattern at the first letter of your string:
import re
x = re.compile(r'(ha)*')
c = x.findall('the man know how to hahahaha')  # get _all_ matches
print(c)
Output:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ha', '']
# one zero-length match at each position of 'the man know how to ', then 'ha' for 'hahahaha', then one final zero-length match at the end
The * and ? quantifiers allow zero repetitions, so the first match can be empty.
Docs:
Pattern.search(string[, pos[, endpos]])
Scan through string looking for the first location where this regular expression produces a match, ...
(source: https://docs.python.org/3/library/re.html#re.Pattern.search)
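To see where that first zero-length match sits, you can inspect the match object's span (a small illustration, not from the original answer):
import re

c = re.compile(r'(ha)*').search('the man know how to hahahaha')
print(c.span())         # (0, 0) -- a zero-length match at the very start
print(repr(c.group()))  # ''

c = re.compile(r'(ha)+').search('the man know how to hahahaha')
print(c.span())         # (20, 28)
print(c.group())        # hahahaha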
I have the following structure generated by bs4 in Python.
['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu']
['L10038779', '9551154555', ',', ',']
['R10831945', '9150000747, 9282109134, 9043728565', ',', ',']
['B10750123', '9952946340', '', 'Dealer', 'Bala']
['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']
I want to strip characters off so that I get something like the following:
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340 , Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
I am using print re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick']).
So basically I want to remove the "[]", the "''", the "," and the random ID at the start.
Update
onclick="try{appendPropertyPosition(this,'Y10765227','9884877926, 9283183326','','Dealer','Rgmuthu');jsb9onUnloadTracking();jsevt.stopBubble(event);}catch(e){};"
So I am scraping from this onclick attribute to get the above mentioned data.
You can use a combination of str.join and str.translate here:
>>> from string import punctuation, whitespace
>>> lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
...        ['L10038779', '9551154555', ',', ','],
...        ['R10831945', '9150000747, 9282109134, 9043728565', ',', ','],
...        ['B10750123', '9952946340', '', 'Dealer', 'Bala'],
...        ['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']]
>>> for item in lis:
...     print ", ".join(x for x in item[1:]
...                     if x.translate(None, punctuation + whitespace))
...
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340, Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
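Note that x.translate(None, ...) is Python 2 syntax. On Python 3, roughly the same idea could look like this (a sketch; str.translate there takes a table built with str.maketrans instead of a deletechars argument):
from string import punctuation, whitespace

# A translation table whose third argument lists characters to delete.
delete_table = str.maketrans('', '', punctuation + whitespace)

lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
       ['L10038779', '9551154555', ',', ','],
       ['B10750123', '9952946340', '', 'Dealer', 'Bala']]

for item in lis:
    # Keep a field only if something remains after stripping punctuation/whitespace.
    print(", ".join(x for x in item[1:] if x.translate(delete_table)))
# 9884877926, 9283183326, Dealer, Rgmuthu
# 9551154555
# 9952946340, Dealer, Bala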
I'm trying to match the output given by a Modem when asked about the network info, it looks like this:
Network survey started...
For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78
arfcn: 11,dBm: -81
arfcn: 14,dBm: -83
arfcn: 16,dBm: -83
So I have two types of expressions to match, the BCCH and the non-BCCH ones. The following code is almost working:
match = re.findall('(?:arfcn: (\d*),dBm: (-\d*))|(?:arfcn: (\d*),bsic: (\d*),dBm: (-\d*))', data)
But it seems that BOTH expressions are being matched, with the fields that were not found left blank:
>>> match
[('', '', '15', '4', '-68'), ('10', '-72', '', '', ''), ('6', '-78', '', '', ''), ('11', '-81', '', '', ''), ('14', '-83', '', '', ''), ('16', '-83', '', '', '')]
Can anyone help? Why such behaviour? I've tried changing the order of the expressions, with no luck.
Thanks!
That is how capturing groups work. Since you have five of them, there will always be five parts returned.
Based on your data, I think you could simplify your regex by making the bsic part optional. That way each row would return three parts, the middle one being empty for non BCCH-Carriers.
match = re.findall('arfcn: (\d*)(?:,bsic: (\d*))?,dBm: (-\d*)', data)
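Applied to the sample modem output, that simplified pattern should produce three-part tuples along these lines (a quick illustration, not from the original answer):
import re

data = """For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78"""

match = re.findall(r'arfcn: (\d*)(?:,bsic: (\d*))?,dBm: (-\d*)', data)
print(match)
# [('15', '4', '-68'), ('10', '', '-72'), ('6', '', '-78')]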
You have an expression with 5 groups.
Having 2 of those groups in one optional part and the other 3 in the mutually exclusive other part of your expression doesn't change that. Either 2 or 3 of the groups are going to be empty, depending on which line you matched.
If you have to match either line with one expression, there is no way around this. You can use named groups (and return a dictionary of matched groups) to make this a little easier to manage, but you will always end up with empty groups.
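For example, a small sketch of the named-group approach; the group names and sample data are just illustrative:
import re

data = 'arfcn: 15,bsic: 4,dBm: -68\narfcn: 10,dBm: -72'

pattern = (r'arfcn: (?P<arfcn>\d*)'
           r'(?:,bsic: (?P<bsic>\d*))?'
           r',dBm: (?P<dbm>-\d*)')

# groupdict() maps each named group to its text, or None if it did not participate.
for m in re.finditer(pattern, data):
    print(m.groupdict())
# {'arfcn': '15', 'bsic': '4', 'dbm': '-68'}
# {'arfcn': '10', 'bsic': None, 'dbm': '-72'}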