Stream tweets with tweepy and Arabic characters problem - Python

I'm trying to get Arabic tweets using the tweepy library in Python 3.6. With English it works perfectly, but when I try to get Arabic tweets I run into many problems. The problem with this last code is that Arabic characters in the tweets appear as "\u0635\u0648\u0651\u062a\u0648\u0627 "
I tried several solutions from the internet, but none of them solved my problem, because most of them only extract the "text" of the tweet and fix the encoding on that text alone, whereas I want to keep the whole tweet info as JSON.
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
access_token = '-'
access_token_secret = '-'
consumer_key = '-'
consumer_secret = '-'
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data.encode("UTF-8"))
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    stream.filter(track=["عربي"])
$ python file.py > file2.txt
The results in the text file and in the terminal:
{"created_at":"Thu Jan 17 12:12:16 +0000 2019","id":1085872428432195585,"id_str":"1085872428432195585","text":"RT #MALHACHIMI: \u0642\u0627\u062f\u0629 \u062d\u0631\u0643\u0629 \u0627\u0644\u0646\u0647\u0636\u0629 \u0635\u0648\u0651\u062a\u0648\u0627 \u0636\u062f \u0627\u0639\u062a\....etc}

If I do this with the first example in your question:
>>> print( "\u0635\u0648\u0651\u062a\u0648\u0627 ")
صوّتوا
the Arabic appears. But if you display a dict at the console without specifying how you want it displayed, Python just uses a default representation based on the ASCII character set, and anything not printable in that set is shown as escape sequences. This is because, if you wanted to paste this string into a program, your IDE's editor might have trouble coping with the Arabic: the switches between the left-to-right order of the Python code and the right-to-left order of the string are very hard to manage. The information hasn't been lost or mangled, just displayed in a lowest-common-denominator format.
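If the goal is to keep the whole tweet as JSON but make the Arabic readable, one option (a minimal sketch, assuming data is the raw JSON string Tweepy hands to on_data) is to parse it and re-serialize with ensure_ascii=False:
import json

def on_data(self, data):
    # Parse the raw JSON string, then re-serialize it without escaping
    # non-ASCII characters, so the Arabic prints as Arabic.
    tweet = json.loads(data)
    print(json.dumps(tweet, ensure_ascii=False))
    return True
If you redirect the output to a file as above, make sure the stream encoding is UTF-8 (for example by setting the PYTHONIOENCODING=utf-8 environment variable).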

Related

Handling unicode in Python 2

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
import tweepy
import json
import re
import sys
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

reload(sys)
sys.setdefaultencoding('utf-8')

class listener(StreamListener):
    def on_data(self, data):
        try:
            print data
            tweet = data.split(',"text":"')[1].split('","source')[0]
            print tweet
            saveThis = str(time.time()) + '::' + tweet
            saveFile = open("tweetDB3.csv", "a")
            saveFile.write(saveThis)
            saveFile.write("\n")
            saveFile.close()
            return True
        except BaseException, e:
            print "failed ondata,", str(e)
            time.sleep(5)

    def on_error(self, status):
        print status

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=['오늘'])
Example of the result:
1465042178.01::RT #BTS_twt: korea#\ud83c\uddf0\ud83c\uddf7 https://t.co/zwKaGo4Lcj
1465042181.76::RT #wdfrog: \ud5e4\ub7f4\ub4dc \uacbd\uc81c\uac00 \uc774\ubc88 \uc77c\ub85c \uc0ac\uacfc\ubb38\uc744 \uc62c\ub838\uc9c0\ub9cc \uc774\uc790\ub4e4\uc740 \ubd88\uacfc 3\uac1c\uc6d4 \uc804\uc778 3\uc6d4 4\uc77c\uc5d0\ub3c4 \uc55e\uc73c\ub85c \uc870\uc2ec\ud558\uaca0\ub2e4\ub294 \uc0ac\uacfc\ubb38\uc744 \uc62c\ub9b0 \ubc14 \uc788\ub2e4. \uc77c\uc774 \ucee4\uc9c8\uae4c \uba74\ud53c\ud558\ub294 \uac83\uc774\ub2c8 \uc5b8\ub860\uc911\uc7ac\uc704\uc5d0 \ud55c\uce35 \uac00\uc5f4\ucc28\uac8c \ubbfc\uc6d0\uc744 \ub123\uc74d\uc2dc\ub2e4\nhttps://t.co/Wb\u2026
Question:
If I run a Twitter API stream through the above code (tracking Korean characters), the output above is what ends up in the Excel/CSV file, shown as Unicode escapes. These escapes have corresponding Korean characters that can be recovered with print u'string'. But is it possible to have all these escapes automatically converted to Korean? I've tried to fix the Python code and tried to solve it within Excel, but no luck.
Despite the setdefaultencoding hack, you can't change the default encoding in Python 2.7. You should use Python 3, where the default encoding is UTF-8 (and you can change it).
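For example, here is a rough Python 3 sketch of the listener above (it reuses the tweetDB3.csv filename and the tweepy classes from the question):
import json
import time

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class Listener(StreamListener):
    def on_data(self, data):
        # In Python 3, str is Unicode throughout, so the Hangul prints
        # directly on a UTF-8 terminal with no setdefaultencoding hack.
        tweet = json.loads(data)
        line = '{}::{}'.format(time.time(), tweet.get('text', ''))
        print(line)
        # An explicit encoding keeps the file valid UTF-8 regardless
        # of the system locale.
        with open('tweetDB3.csv', 'a', encoding='utf-8') as f:
            f.write(line + '\n')
        return True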

How can I solve this ASCII error in Python

def scrapeFacebookPageFeedStatus(page_id, access_token):
    # -*- coding: utf-8 -*-
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
                    "status_published", "num_likes", "num_comments", "num_shares"])

        has_next_page = True
        num_processed = 0  # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()

        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)

        statuses = getFacebookPageFeedData(page_id, access_token, 100)

        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))

                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())

            # if there is no next page, we're done.
            if 'paging' in statuses.keys():
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False

    print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)

scrapeFacebookPageFeedStatus(page_id, access_token)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-43: ordinal not in range(128)
I'm writing code to scrape Facebook pages and gather all the posts into a CSV file.
The code works properly when there is only English text, but
the error above appears when I try to scrape pages that post in Arabic.
I know the solution is to use UTF-8, but I don't know how to implement it in the code.
Your problem probably is not in this code; I suspect it is in your processFacebookPageFeedStatus function. When you format your fields, you'll want to make sure that any that may contain Unicode characters are encoded (or decoded, as appropriate) as UTF-8.
field_a = u'some unicode text in here'
utf8_bytes = field_a.encode('utf-8')    # unicode -> UTF-8 byte string
original = utf8_bytes.decode('utf-8')   # back to the original unicode
Python 2's csv module doesn't support unicode, so you need to encode each field in your source data.
Debugging Unicode is a pain, but there are a lot of SO posts about different problems related to encoding/decoding Unicode.
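As an illustration, here is a Python 2 sketch that encodes every unicode field before it reaches csv.writer; the encode_row helper and the statuses.csv filename are hypothetical, not part of the question's code:
# -*- coding: utf-8 -*-
import csv

def encode_row(row):
    # Encode unicode fields to UTF-8 byte strings, which is what
    # Python 2's csv module expects; leave other values untouched.
    return [f.encode('utf-8') if isinstance(f, unicode) else f for f in row]

with open('statuses.csv', 'wb') as out:
    w = csv.writer(out)
    w.writerow(encode_row([u'status_id', u'رسالة تجريبية', 42]))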
I added this piece of code and it now works fine when I open the file in pandas; there are no other errors whatsoever for now:
import sys
reload(sys).setdefaultencoding("utf-8")

Getting lat/long coordinates from identified tweets using tweepy; getting KeyError: 'coordinates'

I'm trying to get the lat/long coordinates from identified tweets. The part I am having trouble with is the if decoded['coordinates'] != None: t.write(str(decoded['coordinates']['coordinates'])) block. I don't know exactly whether it's working or not, because sometimes ~150 tweets will be returned with coordinates as [None] before the error appears, so I believe the error comes back when a tweet with coordinates is found, and then it returns KeyError: 'coordinates'.
The following is my code:
import tweepy
import json
from HTMLParser import HTMLParser
import os

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(HTMLParser().unescape(data))
        os.chdir('/home/scott/810py/Project')
        t = open('hashtagHipster.txt', 'a')
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        #if decoded['coordinates']:
        # decoded['coordinates'] returns a few objects that are not useful,
        # like type and place which we don't want. ['coordinates'] has a
        # second thing called ['coordinates'] that returns just the lat/long.
        # it may be that the code is correct but location is so few and far
        # between that I haven't been able to capture one. This program just
        # looks for 'hipster' in the tweet. There should be a stream of tweets
        # in the shell and every time one has coordinates they should be
        # added to the file 'hashtagHipster.txt'. Let me know what you think.
        if decoded['coordinates'] != None:
            t.write(str(decoded['coordinates']['coordinates']))  # gets just [LAT][LONG]
        print '[%s] #%s: %s' % (decoded['coordinates'], decoded['user']['screen_name'], decoded['text'].encode('ascii', 'ignore'))
        print ''
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    l = StdOutListener()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    print "Showing all new tweets for #hipster:"
    # There are different kinds of streams: public stream, user stream, multi-user streams
    # In this example follow #vintage tag
    # For more details refer to https://dev.twitter.com/docs/streaming-apis
    stream = tweepy.Stream(auth, l)
    stream.filter(track=['hipster'])
Any help? thanks.
Not all tweet objects contain the 'coordinates' key, so you have to check that it exists with something like this:
if decoded.get('coordinates', None) is not None:
    coordinates = decoded.get('coordinates', '').get('coordinates', '')
Also, please note that:
"Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators."
(PEP 8)
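Applied to the listener in the question, the guarded write might look like this sketch (note that in the tweet's GeoJSON 'coordinates' field the pair is [longitude, latitude], not [LAT][LONG]):
coords = decoded.get('coordinates')
if coords is not None:
    # The nested 'coordinates' key holds the raw [longitude, latitude] pair
    t.write(str(coords['coordinates']))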

Python Requests URL with Unicode Parameters

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases in Python using the requests library.
Here is an example:
http://translate.google.com/translate_tts?tl=ja&q=ひとつ
However, when I try to use the Python requests library to download the mp3 that the endpoint returns, the resulting mp3 is blank. I have verified that I can hit this URL in requests using non-Unicode characters (via romaji) and have gotten correct responses back.
Here is a part of the code I am using to make the request:
from StringIO import StringIO

import requests

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.
Edit:
If I attempt to encode the unicode and quote it as such, I still get the same response.
# -*- coding: utf-8 -*-
from StringIO import StringIO
import urllib

import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Which returns this:
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Which seems like it should work, but doesn't.
Edit 2:
If I attempt to use urllib/urllib2, I get a 403 error.
Edit 3:
So, it seems that this problem/behavior is limited to this endpoint. If I try the following URL, a different endpoint:
http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D
from within requests and my browser, I get the same response (they match). Even if I send ASCII characters to the server, like this URL:
http://translate.google.com/translate_tts?tl=ja&q=sayonara
I get the same response as well (they match again). But if I send Unicode characters to this URL, I get a correct audio file in my browser, but not from requests, which receives an audio file with no sound:
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
So it seems this behavior is limited to the Google TTS URL?
The user agent can be part of the problem; however, it is not in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, or wget, and possibly others. That is why you are seeing an HTTP 403 response when using urllib2.urlopen() - it sets the user agent to Python-urllib/2.7 (the version might vary).
You found that setting the user agent to Mozilla/5.0 fixed the problem, but that might work because the API might assume a particular encoding based on the user agent.
What you actually should do is to explicitly specify the URL character encoding with the ie field. Your URL request should look like this:
http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
The API supports kanji, hiragana, and katakana (possibly others?). These URLs all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection to the others.
import requests

one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'
url = 'http://translate.google.com/translate_tts'

for text in one, kanji, hiragana, katakana:
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print u"{} -> {}".format(text, r.url)
    open(u'/tmp/{}.mp3'.format(text), 'wb').write(r.content)
I made this little method a while back to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK languages to CSVs, and this did the trick.
def assist(unicode_string):
    utf8 = unicode_string.encode('utf-8')
    read = utf8.decode('string_escape')
    return read  ## UTF-8 encoded string
Also, make sure you have these two lines at the beginning of your .py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
The first line is just a good Python habit: it specifies which interpreter to use for the .py file (really only useful if you have more than one version of Python on your machine). The second line specifies the encoding of the Python file. A slightly longer answer for this is given here.
Setting the User-Agent to Mozilla/5.0 fixes this issue.
from StringIO import StringIO
import urllib

import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text}, headers={'User-Agent': 'Mozilla/5.0'})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

Python - writing a variable to a text file

So this is my very first attempt at Python and at programming the Raspberry Pi. My small project is to light an LED when I get a mention on Twitter. All very simple, and the code shown below works well. My question relates to storing the previous mentions in a text file instead of a variable. Essentially the code checks the printed_ids variable for the list of tweet.ids that have already been seen, to prevent the LEDs from just flashing continually every time the program is re-run. My plan is to run the Python code as a scheduled job, but I don't want a situation where, every time I restart the Pi and run the program, it has to go through all my mentions and write each occurrence to the printed_ids variable.
So my thought was to write them to a text file instead, so that the program survives a reboot.
Any thoughts/recommendations?
Thanks for your help.
import sys
import tweepy
import RPi.GPIO as GPIO  ## Import GPIO library
import time  ## Import 'time' library. Allows use of 'sleep'

GPIO.setmode(GPIO.BOARD)  ## Use board pin numbering

CONSUMER_KEY = '******************'
CONSUMER_SECRET = '*****************'
ACCESS_KEY = '**********************'
ACCESS_SECRET = '*********************'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)

speed = 2
printed_ids = []

while True:
    for tweet in api.mentions_timeline():
        if tweet.id not in printed_ids:
            print "#%s: %s" % (tweet.author.screen_name, tweet.text)
            GPIO.setup(7, GPIO.OUT)  ## Setup GPIO Pin 7 to OUT
            GPIO.output(7, True)  ## Switch on pin 7
            time.sleep(speed)  ## Wait
            GPIO.output(7, False)  ## Switch off pin 7
            f.open('out', 'w')
            f.write(tweet.id)
            ##printed_ids.append(tweet.id)
    GPIO.cleanup()
    time.sleep(60)  # Wait for 60 seconds.
What you're looking for is called "serialization" and Python provides many options for that. Perhaps the simplest and the most portable one is the json module
import json

# read:
with open('ids.json', 'r') as fp:
    printed_ids = json.load(fp)
# TODO: handle errors if the file doesn't exist or is empty

# write:
with open('ids.json', 'w') as fp:
    json.dump(printed_ids, fp)
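Wired into the loop from the question, it might look roughly like the sketch below; the IOError/ValueError guard is one way to handle the missing-or-empty-file TODO above, and api is the tweepy.API object from the question:
import json

try:
    with open('ids.json', 'r') as fp:
        printed_ids = json.load(fp)
except (IOError, ValueError):
    # First run: the file doesn't exist yet or is empty
    printed_ids = []

for tweet in api.mentions_timeline():
    if tweet.id not in printed_ids:
        # ... flash the LED as before ...
        printed_ids.append(tweet.id)

with open('ids.json', 'w') as fp:
    json.dump(printed_ids, fp)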
