I'm trying to work out how a binary settings file is structured so that I can extract some information (file locations, etc.) from it.
As far as I can tell, the interesting data sits either exactly after or near escape chars of the form b'\x03SETTING'. Here's an example with a setting I'm interested in, 'LQ':
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
\x03HTAPp\x00\x00\x00\x02\x02\x00\x00\x01\x02L\x02\x00\x00\x00\x01
\x03LQ\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a08.56d829a7_56d82956d829a0.mxf
\x03HTAPp\x00\x00\x00\x02\x02\x00\x00\x01\x02L\x02\x00\x00\x00\x01
\x03LQ\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a07.56d829a6_56d82956d829a0.mxf
so it looks like each 'sentence' starts with \x03, and the path I'm looking for here starts at the 8th byte after the start of the LQ setting marker '\x03LQ'.
The file also has other settings that I want to capture; each time, the value appears directly after an escape char, preceded by a short description of the setting and a few bytes of padding.
At the moment I am reading the binary and can find a specific path (but only if I already know how long it is):
import re

with open(file, "rb") as abin:
    data = abin.read()
    foo = re.search(b'\x03LQ', data)
    abin.seek(foo.start() + 8)  # cursor lands on the 8th byte after the match start
    eg = abin.read(32)
    # so I get the path of some length as eg...
This is not what I want: I want to read the entire bytestring up to the next escape char, then find the next occurrence of the setting and read that path too.
I'm experimenting with findall(), but it just returns a list of identical bytes objects (it seems), and I don't understand how to find each unique path, i.e. each instance of the byte string, and read from that cursor position in the data. E.g.:
bar = re.findall(b'\x03LQ', data)
for bs in bar:
    foo = re.search(bs, data)
    abin.seek(foo.start() + 8)
    eg = abin.read(64)
    print('This is just the same path each time', eg)
Pointers anyone?
The key is to look at the result of your findall(), which is just going to be:
[b'\x03LQ', b'\x03LQ', b'\x03LQ', ...]
You're only telling it to find a static string, so that's all it's going to return. To make the results useful, you can tell it to instead capture what comes after the given string. Here's an example that will grab everything after the given string until the next \x03 byte:
re.findall(rb'\x03LQ([^\x03]*)', data)
The parens tell findall() what part of the match you want, and [^\x03]* means "match any number of bytes that are not \x03". The result from your example should be:
[b'\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a08.56d829a7_56d82956d829a0.mxf\n',
b'\x00\x00\x00\\\\Media\\Render_Drive\\mediafiles\\mxf\\k70255.2\\a07.56d829a6_56d82956d829a0.mxf']
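Building on that, a hedged sketch that walks every \x03-prefixed record and pairs each setting tag with its payload. The tag pattern and the three null pad bytes before each path are assumptions read off the dump above, not a documented format:
import re

with open(file, "rb") as abin:
    data = abin.read()

# Each record looks like: \x03, a printable tag (b'LQ', b'HTAPp', ...), then
# payload bytes up to the next \x03.
records = re.findall(rb'\x03([A-Za-z]+)([^\x03]*)', data)
for tag, payload in records:
    if tag == b'LQ':
        path = payload[3:]  # skip the three \x00 pad bytes seen before each path
        print(path.decode('ascii', 'replace'))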
My project captures a log number from a Google Sheet using the gspread module. The problem is that the captured log number is a string of the form ".\1300". I only want the number in the string, but I could not remove the rest using the code below.
I tried using the .replace() function to replace "\" with "", but it failed.
a='.\1362'
a.replace('\\',"")
I should obtain the string "1362" without the symbols, but the result obtained is ".^2".
The problem is that \136 has special meaning in a string literal (similar to \n for newline, \t for tab, etc.): it is an octal escape, and 0o136 is 94, the code point of ^.
Check out the following example:
a = '.\1362'
a = a.replace('\\',"")
print(a)
b = r'.\1362'
b = b.replace('\\',"")
print(b)
Produces
.^2
.\1362
Now, if your Google Sheets module sends .\1362 instead of .\\1362, it is very likely because you are in fact supposed to receive .^2, or because there's a problem with your character encoding somewhere along the way.
The r modifier I put on the b variable means raw string, meaning Python will not interpret backslashes and will leave your string alone. This only applies to string literals typed in manually, but you could perhaps try:
a = r'{}'.format(yourStringFromGoogle)
Edit: As pointed out in the comments, the original code did in fact discard the result of the .replace() method. I've updated the code, but please note that the string interpolation issue remains the same.
When you do a = '.\1362', a will only have three characters:
a = '.\1362'
print(len(a)) # => 3
That is because \136 represents a single character. If you want to create a six-character string with a dot, a backslash, and the digits 1362, you either need to escape the backslash or create a raw string:
a = r'.\1362'
print(len(a)) # => 6
In either case, calling replace on a string will not replace the characters in that string. a will still be what it was before calling replace. Instead, replace returns a new string:
a = r'.\1362'
b = a.replace('\\', '')
print(a) # => .\1362
print(b) # => .1362
So, if you want to replace characters, calling replace is the way to do it, but you've got to save the result in a new variable or overwrite the old.
See String and Bytes literals in the official Python documentation for more information.
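To see the octal-escape behaviour directly, a quick interpreter check:
print(len('.\1362'))   # 3 -- the characters '.', '^', '2'
print('\136')          # prints ^ because 0o136 == 94, the code point of '^'
print(chr(0o136))      # ^
print(len(r'.\1362'))  # 6 -- the raw string keeps the backslash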
Your string should contain two backslashes, like .\\1362, or be declared raw with r'.\1362' (raw means the backslashes are not interpreted as escape sequences). If there is only one backslash, Python will understand \136 to mean ^, as you have seen.
What's happening here is that \1362 is being interpreted as ^2 because of the backslash escape, so you need to make the string raw before you're able to use it. You can do this by doing
a = r'{}'.format(rawInputString)
or, if you're on Python 3.6+, you can do
a = rf'{rawInputString}'
I am using s3fs read_block to distribute a CSV equally across multiple processes. Each process needs to be given a byte range to operate on, working independently of the others. Every line in the CSV needs to be processed, with no overlap.
The problem is that the beginnings and ends of byte ranges are unlikely to coincide with the beginnings and ends of lines, so some lines may get chopped off.
For example, my CSV looks like this:
beer\npizza\nwings
And I want to process this in chunks of 9 bytes. For byte range 0-9 I will get "beer", and for byte range 10-16 I will get "wings". I will never get "pizza", because the split falls in the middle of that line:
beer\npizza\nwings
__________^_______
What I need is some kind of lookahead: I want the bytes between 0-9, plus any additional bytes required to complete the current line. Then my results would be beer\npizza and wings.
Is lookahead the right way of looking at this, or is there another solution? If it is, can it be done with s3fs, or do I need a custom implementation that does the lookahead first to find the correct byte range?
Edit:
Custom implementation example:
if self._lookahead:
    # Use lookahead to find the next newline in the csv.
    self._logger.debug('Performing lookahead')
    self._logger.debug(f'{end - 1}, {self._lookahead + 1}')
    r = s3.read_block(self._s3_path, end - 1, self._lookahead + 1)
    if b'\n' not in r[:2]:
        # Range ends in the middle of a line; look ahead for the next newline.
        read_length = read_length + r.index(b'\n')
        self._logger.debug(f'New end found {read_length}')
From the s3fs documentation: you can simply pass a delimiter to the read_block() function. Hope it helps:
s3.read_block(path, offset=1000, length=10, delimiter=b'\n')
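As a hedged sketch of how that composes into delimiter-aligned chunking (the bucket path and 9-byte chunk size are made-up placeholders):
import s3fs

s3 = s3fs.S3FileSystem()
path = 'my-bucket/data.csv'   # hypothetical path
size = s3.info(path)['size']
chunk = 9

for offset in range(0, size, chunk):
    # With a delimiter, read_block shifts both the start and the end of the
    # block to newline boundaries, so no line is split across chunks.
    block = s3.read_block(path, offset, chunk, delimiter=b'\n')
    for line in block.splitlines():
        print(line)  # replace with the real per-line handler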
I want to keep track of the file pointer on a simple text file (just a few lines), after having used readline() on it. I observed that the tell() function also counts the line endings.
My questions:
How can I instruct the code to skip counting the line endings?
How can I do that regardless of the line-ending type (so it works the same whether the text file uses just \n, just \r, or both)?
You are navigating into trouble.
Don't do that: either use the number tell() gives you, or count what you have in memory, regardless of the file contents.
You won't be able to correlate a position in text read into memory with a physical place in a text file: text files are not meant for that. They are meant to be read one line at a time, or in whole: your program consumes the text and lets the OS worry about the file position.
You can open your file in binary mode, read its contents as they are into memory, and have some method of retrieving readable text from those contents as needed; doing this with a proper class can make it not that messy.
Consider the problem you already have with line endings, which can be either "\n" or "\r\n" and still count as a single character, and now imagine that situation a hundred times more complex if the file has a single UTF-8 encoded character that takes more than one byte to encode.
And even in binary files, knowing the absolute file-pointer position is only useful in a handful of situations where, usually, one would be better off using a database engine to start with.
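A two-line illustration of the multi-byte point:
s = 'é\n'
print(len(s), len(s.encode('utf-8')))  # 2 characters in memory, 3 bytes on disk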
tell is tell: it counts the number of bytes from the start of the file to the cursor. \n and \r are bytes, so they get counted. If you want to count the number of bytes but skip certain characters, you will have to do it manually:
data_read = … # data you have already read
len([b for b in data_read if b not in '\r\n'])
The bad news is that it's far more annoying to do this than just looking at tell. The good news is that it answers both your questions.
or, I suppose you could do
yourfile.tell() - data_read.count('\r') - data_read.count('\n')
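If you do want a running count that ignores line endings whatever the convention, one sketch along those lines (the file name is illustrative):
text_pos = 0
with open('example.txt', newline='') as f:  # newline='' keeps \r and \r\n intact
    for line in f:
        text_pos += len(line.rstrip('\r\n'))  # count everything except line endings
        print(text_pos)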
result = re.sub("[\r\n]", "", subject)
http://regex101.com/r/kM6dA1
The character class «[\r\n]» matches a single character: a carriage return «\r» or a line feed «\n».
I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over XML text files (sample) that are apparently encoded in UTF-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the XML files are from the eighteenth century, I need to retain the em dashes that are in the XML. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict')  # is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    # this method strips everything between angle brackets, as it should
    textwithouttags = stripTags(textwithtagsasstring)
    # clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +', ' ', nonewlines)
    print noextrawhitespace  # the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
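To reproduce the mix-up in isolation (Python 2, matching the question's print statements; the byte string is just the UTF-8 encoding of the box character):
s = '\xe2\x98\x90'             # UTF-8 bytes for u'\u2610' (BALLOT BOX)
u = u'\u2610'
print s.decode('utf-8') == u   # True: decode the bytes explicitly, then compare
try:
    s.replace(u'\u2610', '')   # mixing str and unicode forces an implicit ASCII decode
except UnicodeDecodeError as e:
    print e                    # same error as in the question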
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure, sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
soup = BeautifulSoup(f) # BS figures out the encoding
text = u''.join(soup.strings) # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text) # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
I'm writing a Python 3.2 script to find characters in a Unicode XML-formatted text file which aren't valid in XML 1.0. The file itself isn't XML 1.0, so it could easily contain characters supported in 1.1 and later, but the application which uses it can only handle characters valid in XML 1.0 so I need to find them.
XML 1.0 doesn't support any characters in the range \u0001-\u0020, except for \u0009, \u000A, \u000D, and \u0020. Above that, \u0021-\uD7FF, \uE000-\uFFFD, and \u010000-\u10FFFF are also supported ranges, but nothing else. In my Python code, I define that regex pattern this way:
re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
However, the code below isn't finding a known bad character in my sample file (\u0007, the 'bell' character.) Unfortunately I can't provide a sample line (proprietary data).
I think the problem is in one of two places: Either a bad regex pattern, or how I'm opening the file and reading in lines—i.e. an encoding problem. I could be wrong, of course.
Here's the relevant code snippet.
processChunkFile() takes three parameters: chunkfile is an absolute path to a file (a 'chunk' of 500,000 lines of the original file, in this case) which may or may not contain a bad character. outputfile is an absolute path to an optional, pre-existing file to write output to. verbose is a boolean flag to enable more verbose command-line output. The rest of the code is just getting command-line arguments (using argparse) and breaking the single large file up into smaller files. (The original file's typically larger than 4GB, hence the need to 'chunk' it.)
def processChunkFile(chunkfile, outputfile, verbose):
    """
    Processes a given chunk file, looking for invalid XML 1.0 chars.
    Outputs any line containing such a character.
    """
    badlines = []
    if verbose:
        print("Processing file {0}".format(os.path.basename(chunkfile)))
    # open given chunk file and read it as a list of lines
    with open(chunkfile, 'r') as chunk:
        chunklines = chunk.readlines()
    # check to see if a line contains a bad character;
    # if so, add it to the badlines list
    for line in chunklines:
        if badCharacterCheck(line, verbose):
            badlines.append(line)
    # output to file if required
    if outputfile is not None:
        with open(outputfile.encode(), 'a') as outfile:
            for badline in badlines:
                outfile.write(str(badline) + '\n')
    # return list of bad lines
    return badlines

def badCharacterCheck(line, verbose):
    """
    Use regular expressions to seek characters in a line
    which aren't supported in XML 1.0.
    """
    invalidCharacters = re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
    matches = re.search(invalidCharacters, line)
    if matches:
        if verbose:
            print(line)
            print("FOUND: " + matches.group(0))  # group(0) is the offending character
        return True
    return False
\u010000
Python \u escapes are four-digit only, so that's U+0100 followed by two U+0030 DIGIT ZERO characters. Use a capital-U escape with eight digits for characters outside the BMP:
\U00010000-\U0010FFFF
Note that this, and your expression in general, won't work on ‘narrow builds’ of Python, where strings are based on UTF-16 code units and characters outside the BMP are handled as two surrogate code units. (Narrow builds were the default on Windows; thankfully they go away with Python 3.3.)
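Putting that fix into the question's pattern, a quick check (assumes a wide build or Python 3.3+, per the note above):
import re

invalidCharacters = re.compile(
    "[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")

print(bool(invalidCharacters.search("bell \u0007 char")))  # True: U+0007 is invalid
print(bool(invalidCharacters.search("plain text")))        # False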
it could easily contain characters supported in 1.1 and later
(Although XML 1.1 can only contain those characters when they're encoded as numeric character references &#...;, so the file itself may still not be well-formed.)
open(chunkfile, 'r')
Are you sure the chunkfile is encoded in locale.getpreferredencoding?
The original file's typically larger than 4GB, hence the need to 'chunk' it.
Ugh, monster XML is painful. But with sensible streaming APIs (and filesystems!) it should still be possible to handle. For example, here you could process each line one at a time using for line in chunk: instead of reading all of the chunk at once using readlines().
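A sketch of that change inside processChunkFile() (the explicit encoding is an assumption; see the locale question above):
with open(chunkfile, 'r', encoding='utf-8') as chunk:
    for line in chunk:  # stream one line at a time instead of readlines()
        if badCharacterCheck(line, verbose):
            badlines.append(line)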
re.search(invalidCharacters, line)
As invalidCharacters is already a compiled pattern object you can just invalidCharacters.search(...).
Having said all that, it still matches U+0007 Bell for me.
A fast way to remove words, characters, or strings between two known tags or two known delimiter characters in a string is the direct re.sub approach shown below; the re module's matching engine runs in native C.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
# And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything between the markers and, for simple, well-delimited patterns like this, it is faster and cleaner than parsing with Beautiful Soup: instead of iterating many times, a single pass finds it all as one chunk. You can do the same with individual characters:
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
# And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out everything between (and including) the markers
And you do not need Beautiful Soup. You can also scrape data this way once you understand how it works.
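For illustration, a self-contained run of the same idea (the sample string is made up):
import re

var = 'keep<script>alert("x")</script>this[drop]too'
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
var = re.sub('<!--.*?-->', '', var)
print(var)  # keepthistoo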