I'm trying to parse output from GNU Strings utility with str.splitlines()
Here is the raw output from GNU Strings:
279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n
When I parse the output with the following code:
import subprocess

process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
output = process.stdout
print(output)
lines = output.splitlines()
for line in lines:
    print(line)
I get a result that I don't expect and it breaks my further parsing:
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=
N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N
Can I somehow tell the splitlines() method not to trigger on \x0c characters?
The desired result should have lines which start with an offset (the 6 digits at the start of each line):
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N
I think that you actually get the expected result. But assuming ASCII or any of its derivatives (Latin-x, UTF-8, etc.), '\x0c' is the control character Form Feed, which happens to be rendered here as a one-line vertical jump.
Said differently, I would bet a coin that the resulting file contains the expected bytes, but that your further processing chokes on the control character.
The documentation for str.splitlines() says it splits on a number of line-boundary characters, including \x0c. If you only want to split on \n, you can use str.split('\n') instead. Note, however, that if your string ends with \n, you will end up with a trailing empty string, which you may want to drop.
data = '279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
lines = data.split('\n')
if lines[-1] == '':
    lines.pop()
print(lines)
for line in lines:
    print(line)
Output:
['279304 9k=pN', ' 279340 9k=PN', ' 279376 9k<LN', ' 279412 9k=\x0cN', ' 279448 9k<4N']
279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
lines = [line.strip() for line in process.stdout.split('\n') if len(line) > 0]
Remove the call to strip() if you do want to keep that leading whitespace on every line.
Your problem arises from using the splitlines method of Unicode strings, which produces different results than the splitlines method of byte strings.
There is a CPython issue for this problem, open since 2014: str.splitlines splitting on non-\r\n characters (Issue #66428, python/cpython).
Below I have added a portable splitlines function that uses the traditional ASCII line break characters for both Unicode and byte strings and works both under Python2 and Python3. A poor man's version for efficiency enthusiasts is also provided.
In Python 2, type str is an 8-bit string and Unicode strings have type unicode.
In Python 3, type str is a Unicode string and 8-bit strings have type bytes.
Although there is no actual difference in line splitting between Python 2 and Python 3 Unicode and 8-bit strings, when running vanilla code under Python 3, it is more likely to run into trouble with the extended universal newlines approach for Unicode strings.
The following table shows which Python data type employs which splitting method.

+--------------+--------------------+------------------+
| Split Method | Python 2           | Python 3         |
+==============+====================+==================+
| ASCII        | str.splitlines     | bytes.splitlines |
+--------------+--------------------+------------------+
| Unicode      | unicode.splitlines | str.splitlines   |
+--------------+--------------------+------------------+
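A quick Python 3 check of the difference the table describes (a minimal sketch; only the form-feed character \x0c is used as an example of the extended boundaries):

```python
# str.splitlines() also splits on \x0c (form feed); bytes.splitlines() does not
s = 'a\x0cb\nc'
print(s.splitlines())           # ['a', 'b', 'c']
print(s.encode().splitlines())  # [b'a\x0cb', b'c']
```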
import re

str_is_unicode = len('a\fa'.splitlines()) > 1
def splitlines(string):  # ||:fnc:||
    r"""Portable definitive ASCII splitlines function.
In Python 2, type :class:`str` is an 8-bit string and Unicode strings
have type :class:`unicode`.
In Python 3, type :class:`str` is a Unicode string and 8-bit strings
have type :class:`bytes`.
Although there is no actual difference in line splitting between
Python 2 and Python 3 Unicode and 8-bit strings, when running
vanilla code under Python 3, it is more likely to run into trouble
with the extended `universal newlines`_ approach for Unicode
strings.
The following table shows which Python data type employs which
splitting method.
+--------------+---------------------------+---------------------------+
| Split Method | Python 2 | Python 3 |
+==============+===========================+===========================+
| ASCII | `str.splitlines <ssl2_>`_ | `bytes.splitlines`_ |
+--------------+---------------------------+---------------------------+
| Unicode | `unicode.splitlines`_ | `str.splitlines <ssl3_>`_ |
+--------------+---------------------------+---------------------------+
This function provides a portable and definitive method to apply
ASCII `universal newlines`_ for line splitting. The reencoding is
performed to take advantage of splitlines' `universal newlines`_
approach for Unix, DOS and Macintosh line endings.
While the poor man's version of simply splitting on \\n might seem
more performant, it falls short when a mixture of Unix, DOS and
Macintosh line endings is encountered. Just for reference, a
general implementation is presented, which avoids some common
pitfalls.
>>> test_strings = (
... "##\ftrail\n##\n\ndone\n\n\n",
... "##\ftrail\n##\n\ndone\n\n\nxx",
... "##\ftrail\n##\n\ndone\n\nx\n",
... "##\ftrail\r##\r\rdone\r\r\r",
... "##\ftrail\r\n##\r\n\r\ndone\r\n\r\n\r\n")
The global variable :data:`str_is_unicode` determines portably,
whether a :class:`str` object is a Unicode string.
.. code-block:: python
str_is_unicode = len('a\fa'.splitlines()) > 1
This allows defining some generic conversion functions:
>>> if str_is_unicode:
... make_native_str = lambda s, e=None: getattr(s, 'decode', lambda _e: s)(e or 'utf8')
... make_uc_string = make_native_str
... make_u8_string = lambda s, e=None: ((isinstance(s, str) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
... else:
... make_native_str = lambda s, e=None: ((isinstance(s, unicode) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
... make_u8_string = make_native_str
... make_uc_string = lambda s, e=None: ((not isinstance(s, unicode) and (s.decode('utf8'), 1)) or (s, 1))[0]
For a portable doctest:
>>> for test_string in test_strings:
... print('--------------------')
... print(repr(test_string))
... print(repr([make_native_str(_l) for _l in splitlines(make_u8_string(test_string))]))
... print(repr([make_native_str(_l) for _l in poor_mans_splitlines(make_u8_string(test_string))]))
... print([make_native_str(_l) for _l in splitlines(make_uc_string(test_string))])
... print([make_native_str(_l) for _l in poor_mans_splitlines(make_uc_string(test_string))])
--------------------
'##\x0ctrail\n##\n\ndone\n\n\n'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
--------------------
'##\x0ctrail\n##\n\ndone\n\n\nxx'
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
['##\x0ctrail', '##', '', 'done', '', '', 'xx']
--------------------
'##\x0ctrail\n##\n\ndone\n\nx\n'
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
['##\x0ctrail', '##', '', 'done', '', 'x']
--------------------
'##\x0ctrail\r##\r\rdone\r\r\r'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
--------------------
'##\x0ctrail\r\n##\r\n\r\ndone\r\n\r\n\r\n'
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
['##\x0ctrail', '##', '', 'done', '', '']
For further details see
- Python 2: `5. Built-in Types - Python 2.7.18 documentation
<https://docs.python.org/2.7/library/stdtypes.html>`_
- Python 3: `Built-in Types - Python 3.10.4 documentation
<https://docs.python.org/3/library/stdtypes.html>`_
.. _`universal newlines`: https://docs.python.org/3/glossary.html
.. _`ssl2`: https://docs.python.org/2.7/library/stdtypes.html#str.splitlines
.. _`unicode.splitlines`: https://docs.python.org/2.7/library/stdtypes.html#unicode.splitlines
.. _`ssl3`: https://docs.python.org/3/library/stdtypes.html#str.splitlines
.. _`bytes.splitlines`: https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
"""
if ((str_is_unicode and isinstance(string, str))
or (not str_is_unicode and not isinstance(string, str))):
# unicode string
u8 = string.encode('utf8')
lines = u8.splitlines()
return [l.decode('utf8') for l in lines]
# byte string
return string.splitlines()
def poor_mans_splitlines(string):
    r"""Split on ASCII newline sequences only, for both string types."""
    if str_is_unicode:
        native_uc_type = str
    else:
        native_uc_type = unicode
    if ((str_is_unicode and isinstance(string, str))
            or (not str_is_unicode and isinstance(string, native_uc_type))):
        # unicode string
        sep = '\r\n|\n'
        if not re.search(sep, string):
            sep = '\r'
        else:
            # |:info:|
            # if there is a single newline at the end, `$` matches that newline
            # if there are multiple newlines at the end, `$` matches before the last newline
            string += '\n'
        sep_end = '(' + sep + ')$'
        # prevent an additional blank line at the end
        string = re.sub(sep_end, '', string)
        return re.split(sep, string)
    # byte string
    return string.splitlines()
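If only Python 3 needs to be supported, the round-trip idea above reduces to a few lines; a minimal sketch (the function name ascii_splitlines is my own):

```python
def ascii_splitlines(s):
    """Split only on \\n, \\r\\n and \\r by round-tripping through bytes."""
    return [part.decode('utf8') for part in s.encode('utf8').splitlines()]

print(ascii_splitlines('279412 9k=\x0cN\nnext line\n'))
# ['279412 9k=\x0cN', 'next line']
```

Because bytes.splitlines only knows the three ASCII line endings, the form feed survives inside the line.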
I am learning the re module in Python. I have found something that doesn't make sense (to me) and I don't know why. Here is a small example:
x = re.compile(r'(ha)*')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be nothing, no error. But I expect "hahahaha"
The same happens if I use re.compile(r'(ha)?'):
x = re.compile(r'(ha)?')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be nothing, no error. But I expect "ha".
But if I use re.compile(r'(ha)+'):
x = re.compile(r'(ha)+')
c = x.search('the man know how to hahahaha')
print(c.group())  # output will be `hahahaha`, just as expected.
Why is this? Aren't re.compile(r'(ha)*') and re.compile(r'(ha)+') the same in this case?
The patterns r'(ha)+' and r'(ha)*' are not identical, which is why they do not deliver the same result: + requires one or more matches of your pattern, * zero or more.
re.search returns "nothing" because it only looks at the first match. The first match for * is a zero-length occurrence of your '(ha)' pattern at the first letter of your string:
import re

x = re.compile(r'(ha)*')
c = x.findall('the man know how to hahahaha')  # get _all_ matches
print(c)
Output:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ha', '']
# t h e m a n k n o w h o w t o hahahaha
The * and ? quantifiers allow zero matches.
Docs:
Pattern.search(string[, pos[, endpos]])
Scan through string looking for the first location where this regular expression produces a match, ...
(source: https://docs.python.org/3/library/re.html#re.Pattern.search)
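The zero-width match is easy to see by inspecting the span of the first match; a small sketch:

```python
import re

s = 'the man know how to hahahaha'
m_star = re.search(r'(ha)*', s)
print(m_star.span())   # (0, 0) -- an empty match at the very first position
m_plus = re.search(r'(ha)+', s)
print(m_plus.group())  # hahahaha
```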
Given the following code:
import re
file_object = open("all-OANC.txt", "r")
file_text = file_object.read()
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.findall(pattern, file_text):
    print match
I get output that stretches like this:
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
I'm trying to find phone numbers, and I am one hundred percent sure there are numbers in the file. When I search for numbers in an online applet for example, with the same expression, I get matches.
Here is a snippet where the expression is found outside of python:
"Slate on Paper," our
specially formatted print-out version of Slate, is e-mailed to readers
Friday around midday. It also can be downloaded from our
site. Those services are free. An actual paper edition of "Slate on Paper"
can be mailed to you (call 800-555-4995), but that costs money and can take a
few days to arrive."
I want output that at least recognizes the presence of a number.
It's your capture groups that are being displayed. Display the whole match:
import re

text = '''"Slate on Paper," our specially formatted print-out version of Slate, is e-mailed to readers Friday around midday. It also can be downloaded from our site. Those services are free. An actual paper edition of "Slate on Paper" can be mailed to you (call 800-555-4995), but that costs money and can take a few days to arrive."'''
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.finditer(pattern, text):
    print(match.group())
Output:
800-555-4995
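Another option, shown here as a sketch, is to make every group non-capturing with (?:...), so that re.findall returns the whole match directly; I have also put the separator in a character class [-.], since a bare . in the original pattern matches any character:

```python
import re

text = 'can be mailed to you (call 800-555-4995), but that costs money'
pattern = r'(?:\+?1-)?(?:\()?[0-9]{3}(?:\))?[-.][0-9]{3}[-.][0-9]{4}'
print(re.findall(pattern, text))  # ['800-555-4995']
```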
I don't have much experience with xlrd/xlwt, but I have managed to access one of the files I want to collect data from. I want to collect data from all files in the directory and move it to one sheet. I was thinking that if there is some way I can store it all in one array/list, it would be easy to output to a CSV. If this is too much work and there is a simpler way, please help; otherwise I am using IDLE to play around with ideas and have come up with this so far:
>>> import xlrd, xlwt
>>> book = xlrd.open_workbook('c:\excelTry\Papineau.csv.xls')
>>> book.sheet_names()
[u'Charge Codes', u'Month']
>>> sh = book.sheet_by_index(1)
>>> #produces:
>>> sh.book
<xlrd.Book object at 0x01213BF0>
>>> for x in range(0, 10):
...     sh.row_values(x)
[u'William Papineau', u'Pay Period 11', '', '', u' ', u' ', '', '', '', u'Weekly Total', '', '', u' ', '', '', '', '', u'Weekly Total', u'Biweekly', u'Percent of Effort']
[u'Index Number', u'Index Description', 40678.0, 40679.0, 40680.0, 40681.0, 40682.0, 40683.0, 40684.0, '', 40685.0, 40686.0, 40687.0, 40688.0, 40689.0, 40690.0, 40691.0, '', u'Total', '']
[u'E45776', u'Seat Belt Study', '', 8.0, 8.0, 8.0, 8.0, u' ', '', 32.0, '', '', '', '', '', u' ', '', 0.0, 32.0, 0.4155844155844156]
[u'E43457', u'MultiScaleWaterQuality', '', '', '', '', '', 8.0, '', 8.0, '', 5.0, 8.0, u' ', '', '', '', 13.0, 21.0, 0.2727272727272727]
[u'E45125', u'GLOSS', '', '', '', '', '', '', '', 0.0, '', '', '', 8.0, 8.0, '', '', 16.0, 16.0, 0.2077922077922078]
[u'E45131', u'GLOS AOC Trib Monitoring', '', '', '', '', '', '', '', 0.0, '', '', '', '', '', 8.0, '', 8.0, 8.0, 0.1038961038961039]
This produces what looks like a list object, but every attempt I have made to manipulate or append it produces errors saying it is not subscriptable or not iterable. The file iteration will be handled with the os module using os.listdir(path) and a for loop. Any help would be greatly appreciated!
So far in your code you don't appear to be doing anything with the values you get from the worksheet. Maybe some of the code didn't get pasted into the question...
Would you be able to include the output of that last line of code?
You say that you want to store it all in one list.
Try something like this:
final = []
for rowx in xrange(sh.nrows):
    final.extend(sh.row_values(rowx))
Also:
Be careful with Windows paths. Single backslashes will work only if the following character does not, together with the backslash, form an escape sequence (e.g. \t, a tab). Other options (option 3 is probably best, unless there is a specific reason not to use it):
Raw strings: book = xlrd.open_workbook(r'c:\excelTry\Papineau.csv.xls')
Forward-slashes: book = xlrd.open_workbook('c:/excelTry/Papineau.csv.xls')
os.path.join:
book = xlrd.open_workbook(os.path.join('c:\\', 'excelTry', 'Papineau.csv.xls'))
(note that the drive needs its own trailing backslash, since os.path.join('c:', 'excelTry') produces the drive-relative path 'c:excelTry' on Windows)
data = []
for i in xrange(sh.nrows):
    data.append(sh.row_values(i))
This will append each row from the xls file to the list data, e.g. [['a', 'b'], ['c', 'd'], ['e', 'f']].
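Once the rows are collected into a list of lists like that, the stdlib csv module writes them all out in one go; a Python 3 sketch (the file name combined.csv is just an example):

```python
import csv

data = [['a', 'b'], ['c', 'd'], ['e', 'f']]  # rows gathered via sh.row_values(...)
with open('combined.csv', 'w', newline='') as out:
    csv.writer(out).writerows(data)

# Read it back to check the round trip
with open('combined.csv', newline='') as f:
    print(list(csv.reader(f)))  # [['a', 'b'], ['c', 'd'], ['e', 'f']]
```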
I use the following block of code to read lines out of a file 'f' into a nested list:
for data in f:
    clean_data = data.rstrip()
    data = clean_data.split('\t')
    t += [data[0]]
    strmat += [data[1:]]
Sometimes, however, the data is incomplete and a row may look like this:
['955.159', '62.8168', '', '', '', '', '', '', '', '', '', '', '', '', '', '29', '30', '0', '0']
It puts a spanner in the works because I would like Python to implicitly cast my list as floats, but the empty fields '' cause it to be cast as an array of strings (dtype: S12).
I could add a second 'if' statement and convert all empty fields into NULL (since 0 is wrong in this instance), but I was unsure whether this was best.
Is this the best strategy of dealing with incomplete data?
Should I edit the stream or do it post-hoc?
How you should deal with incomplete values depends on the context of your application (which you haven't mentioned yet).
For example, you can simply ignore missing values:
>>> l = ['955.159', '62.8168', '', '', '', '', '', '', '', '', '', '', '', '', '', '29', '30', '0', '0']
>>> filter(bool, l) # remove empty values
['955.159', '62.8168', '29', '30', '0', '0']
>>> map(float, filter(bool, l)) # remove empty values and convert the rest to floats
[955.15899999999999, 62.816800000000001, 29.0, 30.0, 0.0, 0.0]
Or alternatively, you might want to replace them with NULL as you mentioned:
>>> map(lambda x: x or 'NULL', l)
['955.159', '62.8168', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', '29', '30', '0', '0']
As you can see, there are many different strategies for dealing with incomplete data, and the example snippets here might help you choose the right one for your task. I prefer the functional-programming-like built-ins for stuff like this, because it's often the shortest and easiest way (and I don't think there will be any noticeable difference in execution time).
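In Python 3, filter and map return iterators rather than lists, so list comprehensions often read more naturally; the same two strategies as a sketch:

```python
l = ['955.159', '62.8168', '', '', '29', '30', '0', '0']
print([float(x) for x in l if x])       # drop empties, convert the rest to floats
print([x if x else 'NULL' for x in l])  # replace empty fields with 'NULL'
```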