Closed last year. This question needs details or clarity. It is not currently accepting answers.
I want to detect and/or replace weird non-emoji Unicode characters that break my tokenization pipeline, like \uf0fc, which renders like a cup/glass in some fonts. That character is not contained in the emojis package, which I tried for filtering.
Is there a class that describes all such characters?
Is there a way I can reliably detect them?
This is a character from a Private Use Area. It happens to look like a tankard in your font, but the Unicode standard doesn't mandate a specific look or meaning for these; it has whatever meaning you assign to it. The idea is that you agree upon a meaning with whoever you're communicating with - privately, meaning without getting the Unicode Consortium involved.
You can use the standard unicodedata module to check whether a character is from the Co category, or just hardcode the ranges, as described here.
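For example, a minimal sketch using unicodedata to detect and strip category-Co characters (the sample string is made up):

import unicodedata

def strip_private_use(text):
    # Keep every character whose Unicode category is not Co (Private Use)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")

print(unicodedata.category("\uf0fc"))         # 'Co'
print(strip_private_use("drink \uf0fc now"))  # 'drink  now'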
Closed 3 years ago. This question needs details or clarity. It is not currently accepting answers.
I have text in an Excel file that looks something like this:
alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu
I am trying to use a regex to match all of these words and save them into a set().
I have tried r"(\w+..*?_.*?\w+)", as seen here, but I can't seem to capture the word huhu, which does not have special characters.
Your regex captures words that have a _ in them, and huhu doesn't.
You could change your regex to match letters, numbers, underscores, and dots, one or more times:
([\w.]+)
I've forked your regex101.
If you wish to match something more precise, you might need to give us more information about your context and what exactly you are trying to match.
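A minimal sketch using the suggested pattern on the sample lines from the question:

import re

text = """alpha123_4rf
45beta_Frank
Red5Great_Sam_Fun
dan.dan_mmem_ber
huh_k
han.jk_jj
huhu"""

# \w already covers letters, digits and underscores; the class adds dots
words = set(re.findall(r"[\w.]+", text))
print(words)  # contains all seven tokens, including 'huhu'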
Closed 3 years ago. This question needs to be more focused. It is not currently accepting answers.
I am using Pytesseract for OCR, but it looks like there is no option in the documentation to extract the confidence of every character. I already have word-level confidence, but I want to know at which character the confidence gets low.
After some research, I learned there is a function tesseractExtractResult() in the Tesseract API which can give per-character confidence.
How can I use this function in Python?
Pytesseract calls Tesseract in the background as if launched in a terminal (here in the source code), so you have at your disposal only what the shell command can do - and as far as I know, you can't get character confidence that way.
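For reference, the word-level confidence the question already has comes from Tesseract's TSV output, which pytesseract exposes via image_to_data; a minimal sketch (the image path is a placeholder, and a reasonably recent pytesseract/Tesseract is assumed):

import pytesseract
from pytesseract import Output

# 'sample.png' is a placeholder for your input image
data = pytesseract.image_to_data("sample.png", output_type=Output.DICT)

# 'conf' holds one confidence value per row; -1 marks non-word rows
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(word, conf)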
I think that pyocr should be able to do so, but the function call would need to be added (maybe in tesseract_raw.py?).
Also, more as a note: it seems that python-tesseract and pytess have at least some lines of code referring to tesseractExtractResult, but their last commits were in 2015 and 2012, respectively.
Closed 6 years ago. This question needs details or clarity. It is not currently accepting answers.
As I learned in my college programming language concepts course, string length in a language may be one of three kinds:
Static: as in COBOL and Java's String class
Limited dynamic length: as in C and C++. In these languages, a special character is used to indicate the end of a string's characters, rather than maintaining the length.
Dynamic (no maximum): as in SNOBOL4, Perl, and JavaScript
Which of these options does Python's string length fall under?
Python strings are immutable. Any operation that appears to be changing a string's length is actually returning a new string.
Definitely not your third option -- a string can start out arbitrarily long but can't later change its length. Your option 1 sounds closest.
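A quick illustration of that point:

s = "abc"
before = id(s)
s += "d"                # builds a brand-new string object
print(id(s) == before)  # False: "abc" itself was never resized

big = "x" * 10**6       # a string may be created at any length,
                        # but no operation will ever change it in place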
You may find the Strings subsection of this web page helpful. Even in a language where strings can, in general, change length, that may not be true of all strings, depending on where they originate (e.g. code text) or how they are declared.
Closed 6 years ago. This question needs details or clarity. It is not currently accepting answers.
I have a keyword:
Verify Payment Method Field
    Element Text Should Be    ${paymentMethodValueField}    PDF-lasku sähköpostiin
Here are the logs:
Step 3 Fields verification :: OK: Display Customer Information fie... | FAIL |
The text of element '//div/span' should have been 'PDF-lasku s?hk?postiin' but in fact it was 'PDF-lasku s?hk?postiin'.
I need to write something like this, but I don't know how:
PDF-lasku s[ascii symbol]hk[ascii symbol]postiin
Can somebody help me?
I would probably convert the whole thing to one format or the other and then compare. Or is it important that the ASCII characters are located in certain parts of the string? If not, and you simply want to verify that what is returned is exactly what you expect, I'd probably use Encode String To Bytes for simplicity; the encoding/decoding keywords might even serve your needs if the ASCII is important.
http://robotframework.org/robotframework/latest/libraries/String.html#Encode%20String%20To%20Bytes
By using the above you could set it to ignore the characters that cannot be converted or replace them with a known character that you provide. Simply get the text first, then perform whatever manipulation you want and evaluate.
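In plain Python terms, this is roughly what ASCII encoding with replacement does, and it also shows why both strings in the log render identically:

expected = "PDF-lasku sähköpostiin"

# Non-ASCII characters (ä, ö) become '?', exactly as in the log output
print(expected.encode("ascii", errors="replace"))
# b'PDF-lasku s?hk?postiin'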
The alternative with regard to decoding/encoding if ASCII location is important is:
http://robotframework.org/robotframework/latest/libraries/BuiltIn.html#Convert%20To%20Bytes
Closed 9 years ago. This question needs details or clarity. It is not currently accepting answers.
I currently have an application in which I am receiving a long string of JPEGs as a single string. I would like to break this string into individual files, but I can't find any clear way to recognize EOFs from within Python. I imagine this is a fairly common problem, but I haven't been able to find a solution. The string should only be about 20-30 JPEGs long, so it's pretty small, but I'm not sure how to recognize the EOFs as I go through the string.
I tried just splitting on \0, but it seems that this does not quite indicate EOF for these JPEGs.
It would be better to restructure the sending of the files so that you receive them either one by one or with an unambiguous delimiter between each file in the byte stream.
If this is not possible, you can potentially use the APP0 marker and identifier. The marker would be either 0xFFE0 [length] 0x4A46494600 ("JFIF\0") or 0xFFE0 [length] 0x4A46585800 ("JFXX\0").
Best case (sketched in code below):
1. read the stream until 0xFFE0 is found;
2. read the two-byte segment length (big-endian, up to 65535 bytes, including the length bytes themselves);
3. verify the format by reading the next 5 bytes; break if it is not a NUL-terminated "JFIF" or "JFXX";
4. read the rest of the segment and deal with the payload;
5. loop back to step 1.
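A minimal sketch of that scan in Python, assuming each file begins with the standard SOI marker (0xFFD8) immediately followed by the APP0 segment described above:

import re

def split_jpegs(data):
    # SOI + APP0 marker, two arbitrary length bytes, then the identifier
    pattern = re.compile(rb"\xff\xd8\xff\xe0..(?:JFIF\x00|JFXX\x00)", re.DOTALL)
    starts = [m.start() for m in pattern.finditer(data)]
    ends = starts[1:] + [len(data)]
    # Each image runs from one start offset to the next (or to the end)
    return [data[s:e] for s, e in zip(starts, ends)]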
It would be better to deal with each file one by one however.