Regex behaves differently for the same input string - python

I am trying to find a PDF page that contains a particular string. The string is:
"statement of profit or loss"
and I'm trying to accomplish this using the following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contains the string "statement of profit or loss", the regex returned None.
On further investigation of the document, I found that the characters 'fi' in "profit" as written in the document are more congested. When I copied the string from the document and pasted it into my code, it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?

The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature' U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with ﬁ.
Combining two or more characters into a single glyph is a fairly usual operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) In fact, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem with text extraction, because the PDF can specify that certain glyphs must be decoded as a string of characters – the original characters. So possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in itself (although its use is strongly discouraged).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace('\ufb01', 'fi')  # '\ufb01' is the single ligature character ﬁ
– repeat this for the other problematic ligatures which have a Unicode codepoint: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04) (I possibly missed some more).
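A broader alternative is Unicode compatibility normalization: NFKC decomposes ligatures such as U+FB01 back into plain letters. A minimal sketch (note that NFKC also rewrites other compatibility characters, such as superscripts and fullwidth forms, so check that this suits your data):

import re
import unicodedata

text = 'statement of pro\ufb01t or loss'  # contains the 'fi' ligature U+FB01
normalized = unicodedata.normalize('NFKC', text)  # the ligature decomposes to plain 'fi'
print(re.search('statement of profit or loss', normalized, re.IGNORECASE))  # now matches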

Related

Detecting Arabic characters in regex

I have a dataset of Arabic sentences, and I want to remove non-Arabic or special characters. I used this regex in Python:
text = re.sub(r'[^ء-ي0-9]',' ',text)
It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!
I read the dataset using pandas (a Python package), like:
train = pd.read_excel('d.xlsx', encoding='utf-8')
I also tested it on the Pythex site (screenshot omitted).
What is the problem?
Edit:
The sentences in the example:
انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو
يعزو احرقو العزا -- احسنلكم والله #مصر
ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين
The characters that get incorrectly removed are not in the common Unicode range for Arabic (U+0621..U+064A), but are "hardcoded" as their initial, medial, and final forms.
Comparable to capitalization in Latin-based languages, but more strict than that, Arabic writing indicates both the start and end of words with a special 'flourish' form. In addition it also allows an "isolated" form (to be used when the character is not part of a full word).
This is usually encoded in a file as 'an' Arabic character and the actual rendering in initial, medial, or final form is left to the text renderer, but since all forms also have Unicode codepoints of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
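You can verify this in Python with the unicodedata module; a small check on one of the "hardcoded" letters from the problem sentences above:

import unicodedata

ch = '\ufeb7'  # the first letter of the second problem sentence
print(unicodedata.name(ch))               # ARABIC LETTER SHEEN INITIAL FORM
print(unicodedata.normalize('NFKC', ch))  # 'ش' (U+0634, the ordinary letter)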
Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:
Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)
and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your exclusion set, the regex will no longer delete these texts:
[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]
Depending on your browser and/or editor, you may have problems selecting this text to copy and paste it. It may be clearer to use a string specifying the exact codepoints:
[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]
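Put together, a minimal sketch of the cleanup step (the sample string is illustrative):

import re

# Keep digits, standard Arabic letters, and both Presentation Forms blocks
pattern = re.compile('[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]')
clean = pattern.sub(' ', 'ﺷﻔﻴﻖ Egypt 2012')
print(clean)  # the presentation-form letters and digits survive; Latin letters become spaces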
I experimented on Pythex and found this (with help from Regular Expression Arabic characters and numbers only): [\u0621-\u064A0-9], which catches almost all non-Arabic characters. For an unknown reason it doesn't catch 'y', so you have to add that yourself: [\u0621-\u064A0-9y]
This catches all non-Arabic characters. For special characters, I'm sorry, but I found nothing better than adding them inside the class: [\u0621-\u064A0-9y#\!\?\,]

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it and have found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove the links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after every TLD because this happens after the regular strip and text processing (so I'm only working with parts of a URL separated by spaces), and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I thought I might try to preemptively remove URLs from Twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier's accuracy. This would get rid of the string of characters that occurs after a link... specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called a positive lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I had to escape the slashes / with backslashes because the online tester requires it (Python's re module does not); I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is an additional obstacle, because lookbehinds in Python's re must have a fixed width.
So you can instead just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
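Putting the two steps together in Python (a minimal sketch; the URL is the example from the question, and the slashes are left unescaped because re does not require escaping them):

import re

tweet = 'https://twitter.com/username/status/ID'
match = re.search(r'(?<=https://twitter\.com/).*', tweet)
if match:
    print(match.group(0))                   # username/status/ID
    print(match.group(0).split('/', 1)[1])  # status/ID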
As a sidenote, blanks/spaces aren't allowed in URLs and, where needed, are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

Reading and writing PDF files with ligatures?

I am attempting to read text from a PDF file, and then later on, write that same text back to another PDF using Python. After the text is read in, the representation of the string when I print it to the console is:
Oﬃcially, it’s called
However, when I print the repr() of this text string, I see:
O\xef\xac\x83cially, it\xe2\x80\x99s called
This makes plenty of sense to me: these are ligatures of symbols from the PDFs, i.e. \xef\xac\x83 is the UTF-8 encoding of a ligature for 'ffi'. The problem is that when I write this string to a PDF, using the reportlab libraries, the PDF shows black symbols in its place (screenshot omitted).
This only happens with certain ligatures. I am wondering what I can do so that the string I write to the PDF does not contain these ligatures or if there is an efficient way to replace all of them.
It appears your input is correct, but to see the ﬃ character in your output, use a font that has one.
The font you are using here is bog-standard Arial, which does not contain it.
Some suggestions (mainly depending on your platform, but some of these are Open Source):
Arial Unicode MS
Lucida Grande
Calibri
Cambria
Corbel
Droid Sans/Droid Serif
Helvetica Neue
Ubuntu
If you don't want, or are not able, to change the font, replace the sequence \xef\xac\x83 with the plain characters ffi in your program before writing the text to PDF. (And similarly for the other ligatures you mentioned.)
What I ended up doing was copying the characters out of my text file and doing a .replace on them, i.e. str.replace('ﬀ', 'ff'). The two arguments may look identical, but the parameter on the left is the single ligature character and the parameter on the right is two f's. Also, don't forget # -*- coding: utf-8 -*-.
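A minimal sketch of that cleanup pass (the map covers the common Latin ligature codepoints; extend it for any others your documents use):

# Map the common Latin ligature codepoints to their plain-letter equivalents
LIGATURES = {
    '\ufb00': 'ff',
    '\ufb01': 'fi',
    '\ufb02': 'fl',
    '\ufb03': 'ffi',
    '\ufb04': 'ffl',
}

def expand_ligatures(text):
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text

print(expand_ligatures('O\ufb03cially, it\u2019s called'))  # Officially, it's called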

Python’s `str.format()`, fill characters, and ANSI colors

In Python 2, I’m using str.format() to align a bunch of columns of text I’m printing to a terminal. Basically, it’s a table, but I’m not printing any borders or anything—it’s simply rows of text, aligned into columns.
With no color-fiddling, everything prints as expected.
If I wrap an entire row (i.e., one print statement) with ANSI color codes, everything prints as expected.
However: If I try to make each column a different color within a row, the alignment is thrown off. Technically, the alignment is preserved; it’s the fill characters (spaces) that aren’t printing as desired; in fact, the fill characters seem to be completely removed.
I’ve verified the same issue with both colorama and xtermcolor. The results were the same. Therefore, I’m certain the issue has to do with str.format() not playing well with ANSI escape sequences in the middle of a string.
But I don’t know what to do about it! :( I would really like to know if there’s any kind of workaround for this problem.
Color and alignment are powerful tools for improving readability, and readability is an important part of software usability. It would mean a lot to me if this could be accomplished without manually aligning each column of text.
Little help? ☺
This is a very late answer, left as bread crumbs for anyone who finds this page while struggling to format text with built-in ANSI color codes.
byoungb's comment about making padding decisions on the length of pre-colorized text is exactly right. But if you already have colored text, here's a work-around:
See my ansiwrap module on PyPI. Its primary purpose is providing textwrap for ANSI-colored text, but it also exports ansilen() which tells you "how long would this string be if it didn't contain ANSI control codes?" It's quite useful in making formatting, column-width, and wrapping decisions on pre-colored text. Add width - ansilen(s) spaces to the end or beginning of s to left (or respectively, right) justify s in a column of your desired width. E.g.:
def ansi_ljust(s, width):
    needed = width - ansilen(s)
    if needed > 0:
        return s + ' ' * needed
    else:
        return s
Also, if you need to split, truncate, or combine colored text at some point, you will find that ANSI's stateful nature makes that a chore. You may find ansi_terminate_lines() helpful; it "patches up" a list of sub-strings so that each has independent, self-standing ANSI codes with an effect equivalent to the original string.
The latest versions of ansicolors also contain an equivalent implementation of ansilen().
Python doesn't distinguish between 'normal' characters and ANSI colour codes, which are also characters that the terminal interprets.
In other words, printing '\x1b[92m' to a terminal may change the terminal text colour, but Python doesn't see it as anything other than a set of 5 characters. If you use print repr(line) instead, Python will print the string-literal form, using escape codes for non-printable characters (so the ESC character, ASCII code 27, is displayed as \x1b), and you can count how many extra characters have been added.
You'll need to adjust your column alignments manually to allow for those extra characters.
Without your actual code, that's hard for us to help you with, though.
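A quick illustration of the miscount, using raw escape codes (no third-party library assumed):

plain = 'OK'
green = '\x1b[92m' + plain + '\x1b[0m'  # 9 invisible characters wrap 'OK'
print(len(plain))  # 2
print(len(green))  # 11
# Padding both to a 10-character column: the plain string gets 8 spaces,
# the coloured one gets none, so the columns no longer line up.
print('[' + plain.ljust(10) + ']')
print('[' + green.ljust(10) + ']')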
Also late to the party. Had this same issue dealing with color and alignment. Here is a function I wrote which adds padding to a string that has characters that are 'invisible' by default, such as escape sequences.
import re

def ljustcolor(text: str, padding: int, char=" ") -> str:
    # Match ANSI escape sequences, including colour codes
    pattern = r'(?:\x1B[#-_]|[\x80-\x9F])[0-?]*[ -/]*[#-~]'
    matches = re.findall(pattern, text)
    offset = sum(len(match) for match in matches)
    return text.ljust(padding + offset, char[0])
The pattern matches all ansi escape sequences, including color codes. We then get the total length of all matches which will serve as our offset when we add it to the padding value in ljust.
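A usage sketch (the column width of 20 is arbitrary):

red_name = '\x1b[31m' + 'alice' + '\x1b[0m'
print(ljustcolor(red_name, 20) + '| next column')
print(ljustcolor('bob', 20) + '| next column')

Both rows align, because the nine characters of escape codes around 'alice' are added to the padding target.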

Working with strings in python produces strange quotation marks

Currently I am working with Scrapy, which is a web crawling framework based on Python. The data is extracted from HTML using XPath. (I am new to Python.) To wrap the data, Scrapy uses items, e.g.
item = MyItem()
item['id'] = obj.select('div[@class="id"]').extract()
When the id is printed like print item['id'], I get the following output:
[u'12346']
My problem is that this output is not always in the same form. Sometimes I get an output like
"[u""someText""]"
This happens only with text, but actually there is nothing special about this text compared to other text that is handled correctly, just like the ID.
Does anyone know what the quotation marks mean? Like I said the someText was crawled like all other text data, e.g. from
<a>someText</a>
Any ideas?
Edit:
My spider crawls all pages of a blog. Here is the exact output
[u'41039'];[u'title]
[u'40942'];"[u""title""]"]
...
Extracted with
item['title'] = site.select('div[@class="header"]/h2/a/@title').extract()
I noticed that it is always the same blog posts that have these quotation marks, so they don't appear randomly. But there is nothing special about the text. E.g. this title produces quotation marks:
<a title="Xtra Pac Telekom web'n'walk Stick Basic für 9,95" href="someURL">
Xtra Pac Telekom web'n'walk Stick Basic für 9,95</a>
So my first thought was that this was because of some special characters, but there aren't any.
This happens only when the items are written to CSV; when I print them in cmd there are no quotation marks.
Any ideas?
Python can use both single ' and double " quotes as quotation marks. When it prints something out it normally chooses single quotes, but will switch to double quotes if the text it is printing contains single quotes (to avoid having to escape the quote inside the string):
So normally it prints [u'....'], but sometimes you have text that contains a ' character, and then it prints [u"...."].
Then there is an extra complication when writing to CSV. If a string written to CSV contains just a ' then it is written as it is. So [u'....'] is written as [u'....'].
But if it contains double quotes then (1) everything is put inside double quotes and (2) any double quotes are repeated twice. So [u"..."] is written as "[u""...""]". If you read the CSV data back with a csv library then this will be detected and removed, so it will not cause any problems.
So it's a combination of the text containing a single quote (making Python use double quotes) and the CSV quoting rules (which apply to double quotes, but not single quotes).
If this is a problem, the csv library has various options to change the behaviour - http://docs.python.org/library/csv.html
The Wikipedia page explains the quoting rules in more detail - the behaviour here is shown by the example with "Super, ""luxurious"" truck".
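A minimal sketch of the round trip using Python 3's csv module (writing a value that contains double quotes, then reading it back):

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['[u"someText"]'])
print(buf.getvalue().strip())  # "[u""someText""]"  <- wrapped and doubled on write

buf.seek(0)
print(next(csv.reader(buf)))   # ['[u"someText"]']  <- restored on read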
