PDFtotext - whitespace showing as aacute on commandline - python

I am extracting text using python from a textfile created from pdf using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following line.
The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.
In this instance, it is keeping the whitespace, which is visible after "EU." when the result is written to a text file, and similarly in HTML (Simile Exhibit).
I also printed to the command line and here I see a string of aacute. [?]
I thought the obvious way to deal with this was to search for and replace the a-acute. I've tried to do that with a compile statement and I've played with permutations of decoding the incoming text.
Oddly though, when I print "\255" I don't get an aacute, I get an o grave.
It seems likely with this odd combination of errors that I have misunderstood something fundamental. Any tips of how to begin unravelling this?
Many thanks.

The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:
print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x
and edit your question and copy/paste the result.
Second tip: When asking for help, give information about your environment:
What version of Python? What version of what operating system?
Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:
>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.
What is the relevance of "\255"? Where did you get that from?
Wild guesses while waiting for some facts to emerge:
(1) The offending character is U+00A0 NO-BREAK SPACE aka NBSP which appears in your text as "\xA0" and when sent to stdout in a Western European locale on Windows using a Command Prompt window would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.
(2) "\255" == "\xAD" implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
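A minimal sketch of how these guesses can be checked, written for Python 3 (the question does not state its Python version). The byte string raw is a made-up stand-in for the problem line, not data from the question. It shows that byte 0xA0 reads as a-acute under cp850, that the same byte is NO-BREAK SPACE under Latin-1, and that "\255" is an octal escape for 0xAD.

```python
import unicodedata

# Hypothetical stand-in for the problem line: "EU." followed by three 0xA0 bytes.
raw = b"EU.\xa0\xa0\xa0"

# A cp850 console (as in the Command Prompt example above) shows 0xA0 as a-acute.
as_console = raw.decode("cp850")

# Decoded as Latin-1, the same byte is U+00A0 NO-BREAK SPACE.
as_text = raw.decode("latin-1")
trailing_name = unicodedata.name(as_text[-1])

# "\255" is an octal escape: 0o255 == 0xAD == 173, i.e. U+00AD SOFT HYPHEN.
octal_name = unicodedata.name("\255")

# One way to normalise: turn NBSP into a plain space, then strip as usual.
cleaned = as_text.replace("\u00a0", " ").rstrip()

print(ascii(as_console), ascii(as_text), trailing_name, octal_name, ascii(cleaned))
```

Printing with ascii() rather than sending the raw characters to the console keeps the console's own encoding out of the picture, which is the point of the first tip.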

Related

Line terminator adds dot at the end of a line in npm test of Python code

I wanted to learn command line programming using Python.
I saw a to-do challenge on the internet and started to work on it by learning from the web. The challenge is to create a command line interface of a to-do app.
The challenge is titled CoronaSafe Engineering Fellowship Test Problem. Here is the challenge material on Google Drive: https://drive.google.com/drive/folders/1SyLcxnEBNRecIyFAuL5kZqSg8Dw4xnTG?usp=sharing
and there is a GitHub project at https://github.com/nseadlc-2020/package-todo-cli-task/
In the README.md I was instructed to create symbolic link for the batch file todo.bat with the name todo. Now, my first condition is that, when the symbolic link is called from the command prompt without any arguments, it must print some usage tips for the program. Finally, I have to use the npm test command to test the execution.
I ran into trouble right at the start: whenever I use a print statement, I see a dot • at the end of every string that ends with a newline. For instance,
import sys
import random
args = sys.argv[1:]
if len(args) == 0:
    print('Usage :-', end='\n')
    print('$ ./todo help # Show usage', end='')
The above statements, when executed without arguments, give the output,
Usage :-.
$ ./todo help # Show usage
Here, I noticed that because the first print statement ends with a newline, the string ends with what looks like a middle dot (•). For the second print statement, since I override the end parameter with an empty string, no newline character is output, and so the dot is not printed.
What's wrong, and how can I pass the test? My program does not print a middle dot at all.
The problem seems to be squarely inside the todo.test.js file.
In brief, Windows and Unix-like platforms have different line ending conventions (printing a line in Windows adds two control characters at the end, whilst on Unix-like systems only one is printed) and it looks like the test suite is only prepared to cope with results from Unix-like systems.
Try forcing your Python to only print Unix line feeds, or switch to a free Unix-like system for running the tests.
Alternatively, rename todo.test.js and replace it with a copy with DOS line feeds. In many Windows text editors, you should be able to simply open the file as a Unix text file, then "Save As..." and select Windows text file (maybe select "ANSI" if it offers that, though the term is horribly wrong and they should know better); see e.g. Windows command to convert Unix line endings? for many alternative solutions (many of which vividly illustrate some of the other issues with Windows; proceed with caution).
This seems to be a known issue, as noted in the README.md you shared: https://github.com/nseadlc-2020/package-todo-cli-task/issues/12 (though it imprecisely labels this as "newline UTF encoding issues"; the problem has nothing to do with UTF-8 or UTF-16).
See also the proposed duplicate Line endings (also known as Newlines) in JS strings
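A minimal sketch of what "only print Unix line feeds" means in Python 3. The helper name render is hypothetical; it prints into an in-memory stream so the bytes produced under each newline setting can be inspected directly.

```python
import io

def render(text, newline):
    """Print text to an in-memory stream and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    print(text, file=stream)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep the wrapper from closing raw when it is discarded
    return data

# Windows-style translation: every "\n" written becomes "\r\n".
windows_bytes = render("Usage :-", "\r\n")
# Unix-style: "\n" is written unchanged.
unix_bytes = render("Usage :-", "\n")

print(windows_bytes, unix_bytes)
```

On Python 3.7 or later, calling sys.stdout.reconfigure(newline="\n") at the top of the script should apply the Unix-style setting to the real standard output. Whether that alone satisfies this particular test suite is an assumption.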
I had exactly the same problem.
I replaced:
print(variable_name) # Or print("Your text here")
With:
sys.stdout.buffer.write(variable_name.encode('utf-8')) # Or sys.stdout.buffer.write("Your text here".encode('utf-8'))
It then worked fine on Windows.
First write your help string like this
help_string='Usage :-\n$ ./task add 2 hello world # Add a new item with priority 2 and text "hello world" to the list\n$ ./task ls # Show incomplete priority list items sorted by priority in ascending order\n$ ./task del INDEX # Delete the incomplete item with the given index\n$ ./task done INDEX # Mark the incomplete item with the given index as complete\n$ ./task help # Show usage\n$ ./task report # Statistics'
Then print it on the console using
sys.stdout.buffer.write(help_string.encode('utf8'))
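This problem occurs because the line endings Windows produces differ from those the npm tests expect; writing bytes to sys.stdout.buffer bypasses the newline translation that print applies. Also make sure there are no spaces before or after "\n".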
Why have multiple prints when a single Python print can include the newline in the string itself? Follow the example below:
print("Usage :-\n$ ./todo help #Show usage")
Output:
Usage :-
$ ./todo help #Show usage

Python 2.7 VSCODE doesn't get the correct length of input string

In my class we are studying Python 2.7. I am using VS Code to test the exercises.
Exercise 1: read user input and print its length. If the user writes exit, the program finishes.
My code is as follows:
myexit=False
while (myexit!=True):
    #read user input
    txt=raw_input("write a string or write exit to go out: ")
    #print the user input string
    print txt
    if (str(txt)=='exit'):
        myexit=True#exit from while
    else:
        print len(txt) #print the string length
print "finish"
When I test the code, I always get the length of the string +1.
Example: if I write foo, the output is 4 and not 3. When I write exit, I don't leave the while loop and the output is 5.
Where am I going wrong?
Have I missed a module?
Thanks for your help
I am not sure exactly why this is happening, and I don't have access to a Windows machine to test or verify it, but based on the comments above, it appears that on the version of Python you are using, raw_input strips only the newline (\n) and not the carriage return (\r). Windows uses \r\n while Unix uses \n. When raw_input returns, the \r is still on the string, hence the extra character. A useful debugging technique at the CLI is to call repr() on the value to see exactly how it is represented. This helps locate stray control or invisible characters in strings.
The function rstrip() removes all whitespace from the right side of the string, which in this case should safely remove the stray \r. It should also be safe if this code runs on a *nix-like system, as rstrip() only removes whitespace that is present. You can also specify a set of characters to strip, so if you would like to be pedantic, you could use rstrip("\r").
txt=raw_input("write a string or write exit to go out: ").rstrip("\r")
This should fix the issue while still maintaining compatibility across different versions.
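A minimal sketch of the repr() check and the targeted strip, written for Python 3 so that it runs without raw_input. The string received is a made-up stand-in for what the answer suspects raw_input returned, and strip_line_ending is a hypothetical helper name.

```python
def strip_line_ending(line):
    """Remove trailing carriage returns and newlines, and nothing else."""
    return line.rstrip("\r\n")

# Stand-in for input that arrived with a stray carriage return attached.
received = "exit\r"

# repr() makes the invisible character visible: 'exit\r', length 5.
print(repr(received), len(received))

cleaned = strip_line_ending(received)
print(repr(cleaned), len(cleaned), cleaned == "exit")
```

Passing an explicit character set to rstrip() leaves ordinary trailing spaces alone, which matters if the user's spaces are meant to count towards the length.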

Red lines coming up after strings in SublimeREPL (python)?

In writing a simple python application, I'm printing out some strings to the console in SublimeREPL (for python), using Python 2.7.8 and Sublime 3, 64 bit for Windows 8.1. However, I'm getting some very annoying red lines after each of the strings that I'm printing. Does someone know why this is happening?
I would appreciate any help.
Thanks!
The apostrophe ' character is causing Sublime's syntax highlighting engine to think that you're beginning a single-quoted string. Since ending a line with a string "open" is an error, it is being highlighted with the reddish invalid.illegal scope in your color scheme. It's nothing to worry about, it's just something you'll see happen with SublimeREPL when you have non-closing quotes on a line.
To verify this is the case, try opening a new file in Sublime, setting the syntax to Python, and pasting in the following code:
"This is a valid string"
"This is also valid even though it has a single quote ' char"
"This string is not valid
"""This string is valid, and doesn't have the red line
even though it has a newline, as it's triple-quoted"""
The middle (invalid Python syntax) line will have the red stripe from the end of the word valid to the right side of the window. The others won't.

'\b' doesn't print backspace in PyCharm console

I am trying to update the last line in PyCharm's console. Say, I print a and then I want to change it to c. However, I encounter the following problem. When I run:
print 'a\bc'
it prints
a c
while the desired output (which is also what I see in the Windows console) is:
c
Is there a way to move the cursor back in PyCharm's console? or maybe delete the whole line?
This is not a bug; it is a limitation of the interactive console found both in PyCharm and in the IDLE shell.
When using the Windows command prompt or a Linux shell, the \b character is interpreted as a backspace and applied as it is parsed. However, in the interactive console of PyCharm and IDLE, the \b character and many others are disabled, and instead you simply get the ASCII representation of the character (a white space in most cases).
It's a known bug: http://youtrack.jetbrains.com/issue/PY-11300
If you care about this, please get an account on the bug tracker and upvote the bug to give it more attention.
The \r works. I know this is the ASCII carriage return, but I use it as a workaround:
print("\ra")
print("\rc")
will yield c in the console.
By the way, backspace is an ASCII character.
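A minimal sketch of the carriage-return workaround, assuming the console honours \r (the answer above reports that PyCharm's does). The helper overwrite_line and its width parameter are hypothetical names. It writes without a trailing newline, so the next write starts again at the beginning of the same line, and pads with spaces so that a shorter text fully covers a longer one.

```python
import io
import sys

def overwrite_line(text, width=20, stream=None):
    """Return to the start of the line and write text padded to width."""
    stream = sys.stdout if stream is None else stream
    stream.write("\r" + text.ljust(width))
    stream.flush()

# Capture the output in memory to see exactly which characters are sent.
buffer = io.StringIO()
overwrite_line("a", width=5, stream=buffer)
overwrite_line("c", width=5, stream=buffer)
written = buffer.getvalue()
print(repr(written))

# On a real console the second call overwrites the first.
overwrite_line("a")
overwrite_line("c")
sys.stdout.write("\n")
```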
I just ran into the same issue in PyCharm (2019.1) and stumbled on this post. It turns out that you can use the \b character if you use the sys.stdout.write function instead of print. I wasn't able to get any of the above examples working within PyCharm using the print function.
Here's how I update the last line of text in my code assuming I don't need more than 100 characters:
import sys

# Initialize output line with spaces
sys.stdout.write(' ' * 100)
# Update line in a loop
for k in range(10):
    # Generate new line of text
    cur_line = 'foo %i' % k
    # Remove last 100 characters, write new line and pad with spaces
    sys.stdout.write('\b' * 100)
    sys.stdout.write(cur_line + ' '*(100 - len(cur_line)))
    # ... do other stuff in loop
This should generate "foo 0", which is then replaced with "foo 1", "foo 2", etc., all on the same line, each string overwriting the previous output. I'm using spaces to pad everything because different programs implement the backspace character differently: sometimes it removes the character, and other times it only moves the cursor backwards and thus still requires new text to overwrite the old.
I've got to credit the Keras library for this solution, which correctly updates the console output (including PyCharm) during learning. I found that they were using the sys.stdout.write function in their progress bar update code.

'2' character with star inside?

Edit: Determined so far: It's not the 2, it's a character before the two, hex value BF, causing the star in the following character (which happens to be 2)
I'm running an elastic-mapreduce job using python scripts I have written, and I'm getting some weird output in the form of unexpected lines. I have noticed a pattern, however. The expected lines all have unexpected '2's in the form of characters with small stars just inside the top curve of the character. That is, when I open the file in Notepad++ (but not Notepad or Word) I see some twos show up like this (excuse the links, I am unable to embed images at less than 10 rep):
In text: http://i.imgur.com/zaWtC3S.png
Zoomed in: http://i.imgur.com/bTYIlh6.png
The weird '2's also show up when I run the python scripts on my own machine (though the unexpected lines do not). Does anyone know what might be causing this? It might shed some light on the odd extra lines of output I'm getting. I'm also just genuinely curious.
Also, I thought it might have had to do with encoding/decoding I was doing to parse safe URLs, but when I took out those parts the weird '2's remained, so it wasn't that.
Thanks
You have EF BB BF in there... that's the UTF-8 encoding of the BOM (byte order mark). See http://en.wikipedia.org/wiki/Byte_order_mark. I suspect that the star in the character is your editor's way of signifying "I just got a BOM". See this earlier question. It seems to be a well-known "thing", and that thread has some suggestions for dealing with it.
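A minimal sketch of one way to deal with it in Python 3, assuming the data is UTF-8. The bytes in raw are a made-up example rather than the asker's output. Decoding with the standard "utf-8-sig" codec drops a leading BOM if one is present, while plain "utf-8" keeps it as the character U+FEFF.

```python
import codecs

# Made-up example: a UTF-8 BOM (EF BB BF) followed by ordinary text.
raw = codecs.BOM_UTF8 + "2 lines".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as an invisible first character.
with_bom = raw.decode("utf-8")

# The "utf-8-sig" codec removes a leading BOM when decoding.
without_bom = raw.decode("utf-8-sig")

print(ascii(with_bom), ascii(without_bom))
```

Note that "utf-8-sig" only removes a BOM at the very start of the data; one that ends up mid-stream, for example after concatenating files, would have to be removed separately.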
