Python UTF-8 encoding with Arabic

I have an encoding problem when I try to crawl YouTube (an Arabic channel):
#!/usr/bin/python
# -*- coding: utf8 -*-
from django.core.management.base import BaseCommand, CommandError
import requests, lxml, re
from lxml import html
class Command(BaseCommand):
    def handle(self, *args, **options):
        r = requests.get("https://www.youtube.com/user/aljazeerachannel/videos?view=0")
        root = lxml.html.fromstring(r.content)
        for data in root.xpath('.//*[@id="branded-page-body"]/div/div/div[1]/div/div[2]/ul/li[1]/span/span/a'):
            print data.text
The result is:
[root@vmi9105 buzzbal]# python manage.py youtube
اÙتخابات اÙÙجاÙس اÙبÙدÙØ© Ù٠سÙØ·ÙØ© عÙÙاÙ

Try this; it solved my problem in Python:
f"{yourString}".encode('latin-1').decode("utf-8")
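The output in the question looks like UTF-8 bytes that were decoded with a single-byte codec. Here is a minimal Python 3 sketch of that failure and of the repair suggested above. The sample word and the variable names are illustrative and are not taken from the crawled page:

```python
# Sample Arabic text; it stands in for the title scraped from the page.
original = "انتخابات"

# Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1 produce mojibake.
garbled = original.encode("utf-8").decode("latin-1")

# The repair: map the mojibake back to the raw bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == original)
```

If the text was garbled through Windows-1252 rather than Latin-1, encode('latin-1') can raise UnicodeEncodeError. In that case 'cp1252' may be the codec to try instead.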

Related

Cyrillic symbols do not show in Python 3

I want to write a script that takes a POST (HTTP) request and sends a response. If the POST data contains English characters, they are returned and displayed normally. But Cyrillic characters look like this:
������ ������ �������������� ������
My Python script is as follows:
# -*- coding: utf-8 -*-
print('Content-type: text/html\n')
import cgi
from imp import reload
form = cgi.FieldStorage()
text2 = form.getfirst("postvar1", "не задано")
print(text2)
I use a custom server script:
# -*- coding: utf-8 -*-
from http.server import HTTPServer, CGIHTTPRequestHandler
server_address = ("", 8080)
httpd = HTTPServer(server_address, CGIHTTPRequestHandler)
httpd.serve_forever()
The python script is saved as UTF-8.
How can this problem be solved?
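One possible cause, which is an assumption and not confirmed by the question, is that the CGI process writes its output in the platform's default encoding instead of UTF-8. The sketch below wraps a binary stream in a UTF-8 text writer. The helper name utf8_writer is hypothetical, and an in-memory buffer stands in for sys.stdout.buffer so that the example runs anywhere:

```python
import io


def utf8_writer(binary_stream):
    """Return a text stream that always encodes its output as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# In a real CGI script this would be sys.stdout.buffer (assumption).
buffer = io.BytesIO()
out = utf8_writer(buffer)

# Declare the charset in the header so the browser decodes the body as UTF-8.
out.write("Content-Type: text/html; charset=utf-8\n\n")
out.write("не задано\n")
out.flush()

payload = buffer.getvalue()
```

In the script from the question, the two print calls would then write to such a wrapper instead of the default sys.stdout.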

(python/boto sqs) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)

I cannot send messages with accented characters to SQS from Python using the AWS SDK (boto).
Versions
Python: 2.7.6
boto: 2.20.1
CODE
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import boto.sqs
from boto.sqs.message import RawMessage
# print boto.Version
sqs_conn = boto.sqs.connect_to_region(
    'my_region',
    aws_access_key_id='my_kye',
    aws_secret_access_key='my_secret_ky')
queue = sqs_conn.get_queue('my_queue')
queue.set_message_class(RawMessage)
msg = RawMessage()
body = '1 café, 2 cafés, 3 cafés ...'
msg.set_body(body)
queue.write(msg)
One solution:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Full code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import boto.sqs
from boto.sqs.message import RawMessage
import sys # <== added this line
reload(sys) # <== added this line
sys.setdefaultencoding('utf-8') # <== added this line
# print boto.Version
sqs_conn = boto.sqs.connect_to_region(
    'my_region',
    aws_access_key_id='my_kye',
    aws_secret_access_key='my_secret_ky')
queue = sqs_conn.get_queue('my_queue')
queue.set_message_class(RawMessage)
msg = RawMessage()
body = '1 café, 2 cafés, 3 cafés ...'
msg.set_body(body)
queue.write(msg)
Source: https://pythonadventures.wordpress.com/2012/03/29/ascii-codec-cant-encode-character/#comment-4672
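The position in the error message matches the first accented byte of the body: in UTF-8, the é of "café" starts with byte 0xc3 at index 5. The sketch below reproduces this in Python 3 by decoding the UTF-8 bytes as ASCII explicitly, which is what Python 2 does implicitly when a byte string is combined with a unicode string. It only illustrates the error and does not call boto. Whether boto 2.20.1 accepts an explicitly decoded body is not confirmed by the source.

```python
# The message body from the question, written with escapes for the accents.
body_bytes = "1 caf\u00e9, 2 caf\u00e9s, 3 caf\u00e9s ...".encode("utf-8")

failed_at = None
bad_byte = None
try:
    # ASCII cannot represent any byte above 0x7f, so this raises.
    body_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start            # index of the first offending byte
    bad_byte = body_bytes[exc.start]

# Decoding with the codec the bytes were actually written in succeeds.
decoded = body_bytes.decode("utf-8")
```

Decoding explicitly at the point where bytes enter the program is a commonly suggested alternative to changing the interpreter-wide default with sys.setdefaultencoding.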

ConfigParser and global variables

I have a Python file that calls a function defined in a module in another directory.
I would like to use the config variable __DATA_DIR__ directly in that function, without importing the configuration every time.
The main file looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import ConfigParser
config = ConfigParser.ConfigParser()
config.read('static.cfg')
global __DATA_DIR__
__DATA_DIR__ = config.get('Directories', '__DATA_DIR__')
from src.directory import file
file.function()
The function looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def function():
    global __DATA_DIR__
    print (__DATA_DIR__)
The configuration file looks like this:
[Directories]
__DATA_DIR__=/directorie/to/config.cfg
When I run the main program, I get this error:
NameError: global name '__DATA_DIR__' is not defined
Why not pass the config object to the function? A global statement only refers to names in the module where the function is defined, so a __DATA_DIR__ set in the main file is not visible from the imported module. This would need confirmation, but I would imagine that only the read method actually reads and parses the file, and that config.get only returns data from an internal data structure, so passing the config object and calling config.get inside the function should be pretty efficient.
So:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import ConfigParser
from src.directory import file
config = ConfigParser.ConfigParser()
config.read('static.cfg')
file.function( config )
And in your file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def function(cfg):
    # get() needs both the section name and the option name
    print(cfg.get('Directories', '__DATA_DIR__'))
Since you are importing your function, you should pass in your __DATA_DIR__ variable to it:
from src.directory import file
file.function(__DATA_DIR__)
# your other file
def function(data):
    print (data)
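Both answers come down to handing the value to the function instead of relying on a global. Here is a self-contained sketch of that idea for Python 3, where the module is named configparser (the question uses the Python 2 name ConfigParser). The configuration is read from a string so the example runs without a static.cfg file, and load_data_dir is a hypothetical helper name:

```python
import configparser

# Same content as the configuration file in the question.
CONFIG_TEXT = """
[Directories]
__DATA_DIR__ = /directorie/to/config.cfg
"""


def load_data_dir(text):
    """Parse the configuration once and return the directory setting."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("Directories", "__DATA_DIR__")


def function(data_dir):
    """Stand-in for the imported function: it receives the value as an argument."""
    return "data dir: " + data_dir


message = function(load_data_dir(CONFIG_TEXT))
print(message)
```

Passing the value explicitly keeps the imported module independent of whatever the main file happens to define.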

avoid user abort python subprocesses

I want the process I start from the script to keep running on the web server even if the user closes the page.
It doesn't seem to be working with this:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import cgi,cgitb,subprocess
print "Content-Type: text/plain;charset=utf-8"
print
form = cgi.FieldStorage()
ticker = form['ticker'].value
print subprocess.Popen(['/usr/bin/env/python','options.py',ticker])
Please help! Thanks!
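I guess this is wrong:
'/usr/bin/env/python'
As a shell command it would usually be '/usr/bin/env python', which in the Popen argument list means two separate items:
'/usr/bin/env', 'python'
But it is better to use this: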
>>> import sys
>>> sys.executable # contains the executable running this python process
'C:\\Python27\\pythonw.exe'
I usually do it like this:
p = subprocess.Popen([sys.executable,'options.py',ticker])
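Here is a sketch of the sys.executable approach for Python 3 that also detaches the child from the parent's standard streams. The detaching is an assumption about what may help in a CGI setting, where the web server may keep the request open while a child still holds the script's output. The answer above does not state this. launch_detached is a hypothetical helper name, and a trivial inline script stands in for options.py:

```python
import subprocess
import sys


def launch_detached(script_args):
    """Start a child Python process that does not share our standard streams."""
    return subprocess.Popen(
        [sys.executable] + list(script_args),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: run the child in its own session
    )


# A trivial inline script is used here instead of options.py.
child = launch_detached(["-c", "import sys; sys.exit(0)"])

# Waiting is only done here to check the result; a CGI script would return instead.
exit_code = child.wait()
```

In the CGI script from the question, the call would look like launch_detached(['options.py', ticker]), without the wait.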

Python/feedparser script won't display on CGI/ character coding

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import cgi
import string
import feedparser
count = 0
print "Content-Type: text/html\n\n"
print """<PRE><B>WORK MAINTENANCE</B></PRE>"""
d = feedparser.parse("http://www.hep.hr/ods/rss/radovi.aspx?dp=zagreb")
for opis in d:
    try:
        print """<B>Place/Time:</B> %s<br>""" % d.entries[count].title
        print """<B>Streets:</B> %s<br>""" % d.entries[count].description
        print """<B>Published:</B> %s<br>""" % d.entries[count].date
        print "<br>"
        count+= 1
    except:
        pass
I have a problem with CGI and a Python script. In the terminal the script runs fine apart from "IndexError: list index out of range", which I silenced with pass. But when I run the script through CGI, I only get the WORK MAINTENANCE line and the first d.entries[count].title repeated 9 times. So confusing...
Also, how can I set up feedparser to support Croatian (Balkan) letters: č, ć, š, ž, đ?
# -*- coding: utf-8 -*- is not working, and I am running Ubuntu server.
Thank you in advance for your help.
Regards.
for opis in d:
    try:
        print """<B>Place/Time:</B> %s<br>""" % d.entries[count].title
You're not using 'opis' in your output.
Try something like this:
for entry in d.entries:
    try:
        print """<B>Place/Time:</B> %s<br>""" % entry.title
        ....
OK, I had another problem: text that I entered manually would show up through CGI, but text from the RSS feed would not. So you need to encode before you write:
# -*- coding: utf-8 -*-
import sys, os, string
import cgi
import feedparser
import codecs
d = blablablabla
print "Content-Type: text/html; charset=utf-8\n\n"
print
for entry in d.entries:
    print """%s""" % entry.title.encode('utf-8')
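The two fixes above, iterating over the entries directly and encoding to UTF-8 before writing, can be combined in one sketch. It is written for Python 3 and does not call feedparser. The entries list is made-up sample data shaped like feed entries, and render is a hypothetical helper name:

```python
from types import SimpleNamespace

# Made-up sample entries; they stand in for d.entries from feedparser.
entries = [
    SimpleNamespace(title="Primjer č ć", description="Ulica š ž"),
    SimpleNamespace(title="Primjer đ", description="Ulica č"),
]


def render(feed_entries):
    """Build the HTML lines for all entries and return them as UTF-8 bytes."""
    lines = []
    for entry in feed_entries:  # iterate the entries, not the feed object
        lines.append("<B>Place/Time:</B> %s<br>" % entry.title)
        lines.append("<B>Streets:</B> %s<br>" % entry.description)
    # Encode once, at the point where the text leaves the program.
    return "\n".join(lines).encode("utf-8")


page = render(entries)
```

A bare except with pass, as in the question, can also hide encoding errors. Letting the exception surface while debugging may show why a line was repeated or skipped.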
