RSS parser + unicode decoding (Python)

I have two questions :)
I am working on an extension for my IRC bot. It is supposed to check an RSS feed for new content and post it to the channel. I am using feedparser. The only way I have found is to store every new item in a file and, every few minutes, download the RSS content and match it against the contents of that file, which in my opinion is kind of clumsy. Is there an easy way to check whether there is new content in an RSS feed? Thx
When I save content to the file, some parts are Unicode strings (special characters in the Czech language), e.g. u"xxx". But I want to save them to the file as UTF-8. How do I do that?

RSS items usually have a GUID or a link associated with them. Use the GUID if present, otherwise the link to uniquely identify each item. You'll still have to keep track of which ones you've seen before, as the RSS format doesn't tell you what changed since last time. There really is no other way, I'm afraid.
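A minimal sketch of that idea with feedparser (the seen_ids.txt filename is just a placeholder, and the seen keys are persisted between runs):

import feedparser

SEEN_FILE = 'seen_ids.txt'   # hypothetical file that stores one GUID/link per line

def new_entries(feed_url):
    try:
        with open(SEEN_FILE) as f:
            seen = set(line.strip() for line in f)
    except IOError:
        seen = set()
    fresh = []
    for entry in feedparser.parse(feed_url).entries:
        key = entry.get('id') or entry.get('link')   # prefer the GUID, fall back to the link
        if key and key not in seen:
            fresh.append(entry)
            seen.add(key)
    with open(SEEN_FILE, 'w') as f:
        f.write('\n'.join(seen))
    return fresh

You still keep track of what you have seen, but you only store one short key per item instead of the whole content.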
To save data (a unicode object) in UTF-8, simply encode it when writing to the file:
output.write(data.encode('utf8'))
Please do read the Joel Spolsky article on Unicode and the Python Unicode HOWTO, to fully understand what encoding and decoding means.
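If you prefer not to sprinkle .encode() calls everywhere, you can also open the file with an explicit encoding and write the unicode objects directly (a small sketch; output.txt is just a placeholder name):

import io

# io.open works on both Python 2 and 3 and encodes for you on each write
with io.open('output.txt', 'w', encoding='utf-8') as output:
    output.write(u'Příliš žluťoučký kůň')   # Czech text ends up as UTF-8 bytes on disk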

Related

Facebook/messenger archive contains emoji that I am unable to parse

I can't figure out how to decode Facebook's way of encoding emoji in the Messenger archive.
Hi everyone,
I'm trying to code a handy utility to explore Messenger's archive file with Python.
The messages file is "badly encoded" JSON, as stated in this other post: Facebook JSON badly encoded
Using .encode('latin1').decode('utf8') I've been able to deal with most characters such as "é" or "à" and display them correctly. But I'm having a hard time with emoji, as they seem to be encoded in a different way.
Example of a problematic emoji : \u00f3\u00be\u008c\u00ba
The encoding/decoding does not yield any errors, but Tkinter is not willing to display what the function outputs and gives "_tkinter.TclError: character U+fe33a is above the range (U+0000-U+FFFF) allowed by Tcl". Tkinter is not the real issue yet, though, because trying to display the same emoji in the console yields "ó¾º", which clearly isn't what's supposed to be displayed (it's supposed to be a crying face).
I've tried using the emoji library but it doesn't seem to help any
>>> print(emoji.emojize("\u00f3\u00be\u008c\u00ba"))
'ó¾º'
How can I retrieve the proper emoji and display it?
If it's not possible, how can I detect problematic emojis to maybe sanitize and remove them from the JSON in the first place?
Thank you in advance
.encode('latin1').decode('utf8') is correct - it results in the codepoint U+FE33A ("󾌺"). This codepoint is in a Private Use Area (PUA) (specifically Supplemental Private Use Area-A), so anyone can assign their own meaning to that codepoint (maybe Facebook wanted a crying face before Unicode had one, so they used the PUA?).
Googling for that char (https://www.google.com/search?q=󾌺) makes Google autocorrect it to U+1F62D ("😭") - sadly, I have no idea how Google maps U+FE33A to U+1F62D.
Googling for U+fe33a site:unicode.org gives https://unicode.org/L2/L2010/10132-emojidata.pdf, which lists U+1F62D as the proposed official codepoint.
As that document from Unicode lists U+FE33A as a codepoint used by Google, I searched for "android old emoji codepoints pua". Among other stuff, that turned up two actually usable results:
How to get Android emoji code point - the question links to:
https://unicodey.com/emoji-data/table.htm - a html table, that seems to be acceptably parsable
and even better: https://github.com/google/mozc/blob/master/src/data/emoji/emoji_data.tsv - a tab-separated list that maps modern codepoints to legacy PUA codepoints and other information, like this:
1F62D 😭 FE33A E72D E411[...]
https://github.com/googlei18n/noto-emoji/issues/115 - this thread links to:
https://github.com/Crissov/noto-emoji/blob/legacy-pua/emoji_aliases.txt - a machine readable document, that translates legacy PUA codepoints to modern codepoints like this:
FE33A;1F62D # Google
I included my search queries in the answer because none of the results I found are in any way authoritative - but it should be enough to get your tool working :-)
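If you go the emoji_aliases.txt route, a rough sketch of the remapping could look like this (assuming Python 3 and that the file has been downloaded locally; it only handles the simple one-to-one lines such as the FE33A;1F62D one quoted above):

def load_pua_map(path='emoji_aliases.txt'):
    mapping = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop the trailing "# Google" comments
            if not line:
                continue
            legacy, modern = line.split(';', 1)
            try:
                mapping[int(legacy, 16)] = chr(int(modern, 16))   # e.g. 0xFE33A -> '😭'
            except ValueError:
                continue   # skip multi-codepoint entries in this simple sketch
    return mapping

pua_map = load_pua_map()
text = '\u00f3\u00be\u008c\u00ba'.encode('latin1').decode('utf8')   # the "broken" emoji
print(text.translate(pua_map))   # should now print the modern emoji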

How to reliably tell the uploaded file type (text or binary)?

I have an application where users should be able to upload a wide variety of files, but I need to know, for each file, whether I can safely display its textual representation as plain text.
Using python-magic like
m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())
gives me the correct MIME type.
But sometimes, the MIME type for scripts is application/*, so simply looking for m.startswith('text/') is not enough.
Another site suggested using
m = Magic().from_buffer(cgi.FieldStorage.file.read())
and checking for 'text' in m.
Would the second approach be reliable enough for a collection of arbitrary file uploads or could someone give me another idea?
Thanks a lot.
What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?
The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.
This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think about what information you can get, what it means, and how you want to treat it.
The secure solution would be to create a list of mime types that you accept and check them with:
allowed_mime_types = [ ... ]
if m in allowed_mime_types:
That means only perfect matches are accepted. It also means that your server will reject valid files which don't have the correct mime type for some reason (missing header, magic failed to recognize the file, you forgot to mention the mime type in your list).
Or to put it another way: Why do you check the mime type of the file if you don't really care?
[EDIT] When you say
I need to know for each file, if I can safely display its textual representation as plain text.
then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).
To fix this, you will need to force your users to either save the text files in a specific encoding (UTF-8 is currently the best choice) or you need to supply a form into which users will have to paste the text.
When using a form, the user can see whether the text is encoded correctly (they see it on the screen), they can fix any problems and you can make sure that the browser sends you the text encoded with UTF-8.
If you can't do that, your only choice is to check for any bytes below 0x20 in the input with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
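A small sketch of that check, assuming the upload has been read into a byte string:

ALLOWED_CONTROL_BYTES = {0x09, 0x0a, 0x0d}   # \t, \n, \r

def looks_like_text(data):
    # any byte below 0x20 other than \t, \n and \r is a strong hint for binary data
    return not any(b < 0x20 and b not in ALLOWED_CONTROL_BYTES
                   for b in bytearray(data))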
But when users use umlauts (like when you write an application that is being used world wide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't since you don't trust the user).
[EDIT2] Since you need this to check actual source code: if you want to make sure the source code is "safe", then parse it. Most languages let you parse code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-)
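For the Python case, a minimal sketch of that idea; note that it only tells you the text parses as Python, not that it is safe to execute:

import ast

def is_python_source(text):
    try:
        ast.parse(text)   # builds the AST without running any of the code
        return True
    except (SyntaxError, ValueError):
        return False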
After playing around a bit, I discovered that I can probably use the Magic(mime_encoding=True) results!
I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.
But it does seem pretty usable to simply look for 'binary' in the reported encoding.
I think I will hang on to that, but thank you all.
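For reference, a sketch of that final check (hypothetical names; Magic(mime_encoding=True) reports values like 'us-ascii', 'utf-8' or 'binary'):

from magic import Magic

detect_encoding = Magic(mime_encoding=True)

def displayable_as_text(data):
    encoding = detect_encoding.from_buffer(data)
    return 'binary' not in encoding   # 'binary' means there is no textual representation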

Issue with scraping site with foreign characters

I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of those schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on another site with foreign universities in the exact same way, and everything works fine. But for some reason, the current scraper won't work with foreign characters (and as far as parsing foreign characters, the two scrapers are exactly the same).
Here's what I'm doing to try & make things work:
Declare encoding on the very first line of the file:
# -*- coding: utf-8 -*-
Importing & using smart_unicode from the Django framework:
from django.utils.encoding import smart_unicode
school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8',
strings_only=False, errors='strict').encode('utf-8')
Using the encode function, as seen above, chained with the smart_unicode call.
I can't think of what else I could be doing wrong. Before dealing with these scrapers, I really didn't understand much about different encodings, so it's been a bit of an eye-opening experience. I've tried reading the following, but still can't overcome this problem:
http://farmdev.com/talks/unicode/
http://www.joelonsoftware.com/articles/Unicode.html
I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings have different capacities for how many languages they support (e.g. ASCII only supports English, while UTF-8 seems to support everything). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy.
Please help!!
When extracting information from a web page, you need to determine its character encoding, similarly to how browsers do such things (analyzing HTTP headers, parsing HTML to find meta tags, and possibly guesswork based on the actual data, e.g. the presence of something that looks like a BOM in some encoding). Hopefully you can find a library routine that does this for you.
In any case, you should not expect all web sites to be UTF-8 encoded. ISO-8859-1 is still in widespread use, and in general, reading ISO-8859-1 as if it were UTF-8 results in a big mess (for any non-ASCII characters).
If you are using the requests library it will automatically decode the content based on HTTP headers. Getting the HTML content of a page is really easy:
>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
You first need to look at the <head> part of the document and see whether there's charset information:
<meta http-equiv="Content-Type" content="text/html; charset=xxxxx">
(Note that StackOverflow, this very page, doesn't have any charset info... I wonder how 中文字, which I typed assuming it's UTF-8 in here, will display on Chinese PeeCees that are most probably set up as GBK, or Japanese pasokon which are still firmly in Shift-JIS land).
So if you have a charset, you know what to expect, and can deal with it accordingly. If not, you'll have to do some educated guessing -- are there non-ASCII chars (>127) in the plain text version of the page? Are there HTML entities like &#x4E00; (一) or &eacute; (é)?
Once you have guessed/ascertained the encoding of the page, you can convert that to UTF-8, and be on your way.
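If there is no charset declaration, a library like chardet can do the educated guessing for you (a sketch; the URL is a hypothetical placeholder):

import chardet
import requests

raw_bytes = requests.get('http://example.com/rankings').content   # .content is the undecoded bytes
guess = chardet.detect(raw_bytes)               # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
html = raw_bytes.decode(guess['encoding'] or 'utf-8', errors='replace')
utf8_html = html.encode('utf-8')                # now consistently UTF-8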

How do I access both binary and text data for email processing with Python 3?

I am converting a Python 2 program to Python 3 and I'm not sure about the approach to take.
The program reads in either a single email from STDIN, or file(s) are specified containing emails. The program then parses the emails and does some processing on them.
So we need to work with the raw data of the email input, to store it on disk and do an MD5 hash of it. We also need to work with the text of the email input in order to run it through the Python email parser and extract fields, etc.
With Python 3 it is unclear to me how we should be reading in the data. I believe we need the raw binary data in order to do an md5 on it, and also to be able to write it to disk. I understand we also need it in text form to be able to parse it with the email library. Python 3 has made significant changes to the IO handling and text handling and I can't see the "correct" approach to read the email raw data and also use the same data in text form.
Can anyone offer general guidance on this?
The general guidance is convert everything to unicode ASAP and keep it that way until the last possible minute.
Remember that str is the old unicode and bytes is the old str.
See http://docs.python.org/dev/howto/unicode.html for a start.
With Python 3 it is unclear to me how we should be reading in the data.
Specify the encoding when you open the file and it will automatically give you unicode. If you're reading from stdin, you'll get unicode. You can read from stdin.buffer to get binary data.
I believe we need the raw binary data in order to do an md5 on it
Yes, you do. Encode it when you need to hash it.
and also to be able to write it to disk.
You specify the encoding when you open the file you're writing it to, and the file object encodes it for you.
I understand we also need it in text form to be able to parse it with the email library.
Yep, but since it'll get decoded when you open the file, that's what you'll have.
That said, this question is really too open ended for Stack Overflow. When you have a specific problem / question, come back and we'll help.
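To make that concrete, here is one way to keep both forms around (a sketch, assuming Python 3.2+ where email.message_from_bytes is available; the output filename is a placeholder):

import sys
import email
import hashlib

raw = sys.stdin.buffer.read()                 # raw, undecoded bytes from STDIN
digest = hashlib.md5(raw).hexdigest()         # MD5 over the raw bytes
with open(digest + '.eml', 'wb') as f:        # store the bytes on disk unchanged
    f.write(raw)
msg = email.message_from_bytes(raw)           # the email package handles the decoding
print(msg['Subject'])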

Using Python to translate Japanese to English

I am using Python to write some scripts that integrate two systems. The system scans mailboxes, searches for a specific subject line, and then parses the information from the email. One of the elements I am looking for is an HTML link, which I then fetch with curl and write the HTML code to a text file.
My question is: if the text in the email is in Japanese, are there any modules in Python that will automatically convert that text to English? Or do I have to convert the string to Unicode and then decode that?
Here is an example of what I am seeing. When I use curl to grab the text from the URL:
USB Host Stack 処理において解放されたメモリを不正に使用している
When I do a simple re.match to grab the string and write it to a file, I get this:
USB Host Stack æQtk0J0D0f0ã‰>eU0Œ0_0á0â0ê0’0Nckk0O(uW0f0D0‹0
I also get the following when I grab the email using the email module
>>> emailMessage.get_payload()
USB Host Stack =E5=87=A6=E7=90=86=E3=81=AB=E3=81=8A=E3=81=84=E3=81=A6=E8=A7=
=A3=E6=94=BE=E3=81=95=E3=82=8C=E3=81=9F=E3=83=A1=E3=83=A2=E3=83=AA=E3=82=92=
=E4=B8=8D=E6=AD=A3=E3=81=AB=E4=BD=BF=E7=94=A8=E3=81=97=E3=81=A6=E3=81=84=E3=
=82=8B
So, I guess my real question is: what steps do I have to take to get this to convert correctly? I'd really like to take the first one, which is the Japanese characters, and convert that to English.
Natural language translation is a very challenging problem, as others wrote. So look into sending strings to be translated to a service, e.g., google translate, which will translate them for you (poorly, but it's better than nothing) and send them back.
The following SO link shows one way: translate url with google translate from python script
Before you get that to work, you should sort out your encoding problems (unicode, uuencoding etc.) so that you're reading and writing text without corrupting it.
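For the encoding side, the =E5=87=A6... lines are just the quoted-printable transfer encoding, which the email module can undo before you send anything to a translation service (a sketch using the emailMessage object from the question; 'body.txt' is a placeholder name):

import io

raw = emailMessage.get_payload(decode=True)               # undoes the quoted-printable encoding, gives bytes
charset = emailMessage.get_content_charset() or 'utf-8'   # taken from the Content-Type header
text = raw.decode(charset)                                # the Japanese text as a unicode string
with io.open('body.txt', 'w', encoding='utf-8') as f:
    f.write(text)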
