Working with strings in Python produces strange quotation marks - python

Currently I am working with Scrapy, which is a web crawling framework based on Python. The data is extracted from HTML using XPath. (I am new to Python.) To wrap the data, Scrapy uses items, e.g.
item = MyItem()
item['id'] = obj.select('div[@class="id"]').extract()
When the id is printed with print item['id'] I get the following output:
[u'12346']
My problem is that this output is not always in the same form. Sometimes I get an output like
"[u""someText""]"
This happens only with text, but there is actually nothing special about this text compared to other text that is handled correctly, just like the ID.
Does anyone know what the quotation marks mean? Like I said, the someText was crawled like all the other text data, e.g. from
<a>someText</a>
Any ideas?
Edit:
My spider crawls all pages of a blog. Here is the exact output
[u'41039'];[u'title']
[u'40942'];"[u""title""]"
...
Extracted with
item['title'] = site.select('div[@class="header"]/h2/a/@title').extract()
I noticed that it is always the same blog posts that have these quotation marks, so they don't appear randomly. But there is nothing special about the text. E.g. this title produces quotation marks:
<a title="Xtra Pac Telekom web'n'walk Stick Basic für 9,95" href="someURL">
Xtra Pac Telekom web'n'walk Stick Basic für 9,95</a>
So my first thought was that this is because of some special chars, but there aren't any.
This happens only when the items are written to CSV; when I print them in cmd there are no quotation marks.
Any ideas?

Python can use both single ' and double " quotes as quotation marks. When it prints something out it normally chooses single quotes, but it will switch to double quotes if the text it is printing contains a single quote (to avoid having to escape the quote inside the string).
So normally it is printing [u'....'], but sometimes you have text that contains a ' character and then it prints [u"...."].
Then there is an extra complication when writing to CSV. If a string that contains just a ' is written to CSV, it is written as it is. So [u'....'] is written as [u'....'].
But if it contains double quotes then (1) everything is put inside double quotes and (2) any double quotes are repeated twice. So [u"..."] is written as "[u""...""]". If you read the CSV data back with a csv library, this will be detected and removed, so it will not cause any problems.
So it's a combination of the text containing a single quote (making Python use double quotes) and the CSV quoting rules (which apply to double quotes, but not single quotes).
If this is a problem, the csv library has various options to change the behaviour - http://docs.python.org/library/csv.html
The Wikipedia page explains the quoting rules in more detail - the behaviour here is shown by the example with "Super, ""luxurious"" truck"
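A minimal sketch of both effects using only the standard library (shown in Python 3, where the u prefix no longer appears in the repr; the quoting behaviour is the same):

import csv
import io

plain = ["12346"]          # no quote in the text
tricky = ["web'n'walk"]    # text contains a single quote

print(plain)    # ['12346']       -> repr uses single quotes
print(tricky)   # ["web'n'walk"]  -> repr switches to double quotes

# Writing the repr of each list as a CSV field shows the second effect:
buf = io.StringIO()
csv.writer(buf).writerow([str(plain), str(tricky)])
print(buf.getvalue())
# ['12346'],"[""web'n'walk""]"
# The field containing double quotes gets wrapped in quotes, with the
# embedded quotes doubled - exactly the pattern in the question's CSV output.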

Related

Regex behaves differently for the same input string

I am trying to find a PDF page containing a particular string, and the string is:
"statement of profit or loss"
and I'm trying to accomplish this using the following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contained this string "statement of profit or loss", the regex returned None.
On investigating the document further, I found that the characters 'fi' in "profit", as written in the document, are more congested. When I copied the string from the document and pasted it into my code, it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?
The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature', U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF, replaced the combination f+i by fi.
Combining two or more characters into a single glyph is a fairly common operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem when extracting text, because the PDF can specify that certain glyphs must be decoded as a string of characters – the original characters. So possibly your PDF does not contain such a definition, or the typesetting engine did not bother, because the single character ﬁ is a valid Unicode character by itself (although its use is highly discouraged).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace(u'\ufb01', 'fi')
– repeat this for other problematic ligatures that have a Unicode codepoint: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04) (I possibly missed some more).
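A broader alternative (not from the answer above, just the standard library): Unicode compatibility normalization decomposes all of these ligature codepoints in one pass:

import re
import unicodedata

text = 'statement of pro\ufb01t or loss'  # contains the fi ligature U+FB01

# NFKC compatibility normalization maps U+FB01 -> 'fi', U+FB02 -> 'fl', etc.
clean = unicodedata.normalize('NFKC', text)

print(re.search('statement of profit or loss', clean, re.IGNORECASE))
# <re.Match object; span=(0, 27), match='statement of profit or loss'>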

What is the most efficient way of matching and replacing every three newlines with an identifier?

I am working with some .txt files that don't have structure (they are messy); they represent a number of pages. In order to give them some structure, I would like to identify the number of pages, since the file itself doesn't have them. This can be done by replacing every three newlines with some annotation like:
\n
page: N
\n
Where N is the number. This is what my files look like, and I also tried a simple replace. However, that function gets confused and does not give me the expected format, which would be something like this. Any idea of how to replace the spaces with some kind of identifier, just to try to parse the files and get the position of some information (the page)?
I also tried this:
import re
replaced = re.sub('\b(\s+\t+)\b', '\n\n\n', text)
print (replaced)
If the format is as regular as you state in your problem description:
Replace every occurrence of three newlines \n with page: N
You wouldn't have to use the re module. Something as simple as the following would do the trick:
>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'
I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.
If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:
re.split(r'(?:\n\s*?){3,}', s)
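For instance, combining that pattern with the join from above on a hypothetical messy input:

import re

s = 'aaa\n \n\t\nbbb\n\n\n\nccc'  # three-plus newlines with stray spaces/tabs mixed in
pages = re.split(r'(?:\n\s*?){3,}', s)
print(''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages)))
# aaa
#     page: 1
# bbb
#     page: 2
# ccc
#     page: 3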

Removing encoded text from strings read from txt file

Here's the problem:
I copied and pasted this entire list to a txt file from https://www.cboe.org/mdx/mdi/mdiproducts.aspx
Sample of text lines:
BFLY - The CBOE S&P 500 Iron Butterfly Index
BPVIX - CBOE/CME FX British Pound Volatility Index
BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index
These lines of course appear normal in my text file, and I saved the file with utf-8 encoding.
My goal is to use Python to strip out only the symbols from this long list, e.g. BFLY, BPVIX etc., and write them to a new file.
I am using the following code to read the file and split it:
x=open('sometextfile.txt','r')
y=x.read().split()
The issue I'm seeing is that there are unfamiliar characters popping up and they are affecting my ability to filter the list. Example:
print(y[0])
BFLY
I'm guessing that these characters have something to do with the encoding, and I have tried a few different things with the codecs module without success. Using .decode('utf-8') throws an error when I try to use it on the variables x or y above. I am able to use .encode('utf-8'), which obviously makes things even worse.
The main problem shows when I try to loop through the list and remove any items that are not all upper case or contain non-alpha characters. Ex:
y[0].isalpha()
False
y[0].isupper()
False
So in this example the symbol BFLY ends up being removed from the list.
Funny thing is that these characters are not present in a txt file if I do something like:
q=open('someotherfile.txt','w')
q.write(y[0])
Any help would be greatly appreciated. I would really like to understand why this frequently happens when copying and pasting text from web pages like this one.
Why not use Regex?
I think this will catch the letters in caps
"[A-Z]{1,}/?[A-Z]{1,}[0-9]?"
This is better. I got a list of all such symbols. Here's my result.
['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX']
Here's the code
import re
reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
sym = reg_obj.findall(a)
print(sym)
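For what it's worth, the stray characters are very likely a UTF-8 byte-order mark: a file saved as UTF-8 by Notepad starts with the bytes EF BB BF, and reading it back with Windows' default cp1252 codec turns them into the three characters ï»¿. A sketch under that assumption:

import re

# Hypothetical stand-in for y[0]: a UTF-8 BOM mis-decoded as cp1252.
word = '\u00ef\u00bb\u00bfBFLY'   # looks like 'ï»¿BFLY'

print(word.isalpha(), word.isupper())  # False False - the symptoms from the question
print(re.findall(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?', word))  # ['BFLY'] - the regex still works

# Reading the file with the right codec avoids the problem entirely (Python 3):
# open('sometextfile.txt', encoding='utf-8-sig')  # strips a leading BOM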

Convert text to HTML comment in Python

I have a Python script that generates some HTML. It does so using the Python markdown library. I'd like to stick the original Markdown text in a comment at the end of the HTML, where it will occasionally be useful for debugging purposes. I've tried just plunking the Markdown text after the end of the HTML, and it doesn't work for me (Firefox). So the way I imagine this working is that I run Markdown and then simply append the Markdown source, marked as a comment, after the HTML. However, HTML is apparently somewhat finicky about what it will allow in comments. The site htmlhelp.com gives the following advice after some discussion:
For this reason, use the following simple rule to compose valid and accepted [portable] comments:
An HTML comment begins with "<!--", ends with "-->", and does not contain "--" or ">" anywhere in the comment.
(source)
So it looks like I need to do some escaping or something to get my bunch of markdown text into a form that HTML will accept as a comment. Is there an existing tool that will help me do this?
According to the w3:
Comments consist of the following parts, in exactly the following order:
- the comment start delimiter "<!--"
- text
- the comment end delimiter "-->"
The text part of comments has the following restrictions:
1. must not start with a ">" character
2. must not start with the string "->"
3. must not contain the string "--"
4. must not end with a "-" character
These are very simple rules. You could regex-enforce them, but they are so simple you don't even need that!
3 of the 4 conditions can be met with concatenation, and the other one with a simple replace(). All in all, it's a one-liner:
def html_comment(text):
    return '<!-- ' + text.replace('--', '- - ') + ' -->'
Note the spaces.
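A quick check with a hypothetical input that contains the problematic sequences:

>>> html_comment('a -- b > c -')
'<!-- a - -  b > c - -->'

The leading and trailing spaces added by the wrapper are what keep conditions 1, 2 and 4 satisfied even when the text starts with ">" or ends with "-".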
Can't you just .replace it? Ultimately, you could replace those characters with anything, but substituting with escape codes probably won't make your comment any more readable than substituting with nothing.
commented = '<!-- %s -->' % markdown_text.replace('--', '').replace('>', '')

How to convert u'\x96' to u'–' in Python

I'm porting content from an old WordPress blog to Mezzanine. I was given a JSON dump of the database, and the posts are littered with special characters that look like this: \x96, among otherwise unescaped HTML.
If I manually replace the slash with &# and append a semicolon, the character renders correctly:
so \x96 to &#x96;, which renders as –
i.e. escaped UTF-8 (hex) to HTML entity (hex).
How to do this in Python?
If &#150; is also acceptable, you can use:
>>> u'\x96'.encode('ascii', 'xmlcharrefreplace')
'&#150;'
which is even called out in the documentation (although not very clearly).
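If the goal is the actual character rather than an entity: \x96 is the Windows-1252 byte for the en dash, so a round-trip decode recovers it. This assumes the data really is mis-decoded Windows-1252, which is typical of old WordPress dumps:

s = u'\x96'

# Reinterpret the code point as a Windows-1252 byte, then decode it properly:
fixed = s.encode('latin-1').decode('cp1252')
print(fixed)  # prints – (U+2013 EN DASH)

# The entity approach from the answer, for comparison (Python 3 returns bytes):
print(s.encode('ascii', 'xmlcharrefreplace'))  # b'&#150;'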
