I want to use out_file in Python

I've been studying Python for two months. My goal is to do sentiment analysis, but self-study is hard, so I'd like to ask for help.
I collected data from the Twitter API and saved it to a text file in Notepad. Each line is very long, like this: {created_at":"Fri Nov 03 03:28:33 +0000 2017", ~~ id, tweet, unicode}
I converted the data to a simpler form in the IPython console (Spyder). It looks like "Fri Nov 03 03:46:46 +0000 2017 #user blah blah [hash tags] time stamp". Now I want to write the simplified data back out to a text file. My code is written as follows. What should I change in the out_file part?
try:
    import json
except ImportError:
    import simplejson as json

tweets_filename = 'C:/Users/ID500/Desktop/SA/Corpus/siri/siri_0.txt'  # Not converted data
tweets_file = open(tweets_filename, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line.strip())
        if 'text' in tweet:
            print(tweet['id'])
            print(tweet['created_at'])
            print(tweet['text'])
            print(tweet['user']['id'])
            print(tweet['user']['name'])
            print(tweet['user']['screen_name'])
            hashtags = []
            for hashtag in tweet['entities']['hashtags']:
                hashtags.append(hashtag['text'])
            print(hashtags)
            out_file = open("C:/Users/ID500/Desktop/SA/Corpus/final/fn_siri_1.txt", 'a')  # I want to put data to that path.
            out_file.write()  # What can I write here?
            out_file.close()
    except:
        continue
Thank you!

You can open both files at once with a single with statement; don't open the output file inside the loop. For example:
with open(tweets_filename) as tweets_file, open(output, "a") as out_file:
    for line in tweets_file:
        # parse the line here
        out_file.write(line)
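To answer the out_file.write() question concretely, here is a minimal sketch combining the two-file with statement with the fields printed in the question. The one-line-per-tweet output format is just one choice among many, and the file paths are left as parameters:

```python
import json

def simplify_tweets(tweets_filename, output_filename):
    """Read raw one-JSON-object-per-line tweets, append simplified lines to output."""
    with open(tweets_filename) as tweets_file, open(output_filename, "a") as out_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line.strip())
            except ValueError:
                continue  # skip lines that are not valid JSON
            if 'text' not in tweet:
                continue
            hashtags = [h['text'] for h in tweet['entities']['hashtags']]
            # One simplified line per tweet: date, user, text, hashtags
            out_file.write("%s @%s %s %s\n" % (
                tweet['created_at'],
                tweet['user']['screen_name'],
                tweet['text'].replace("\n", " "),
                hashtags,
            ))

# e.g. simplify_tweets('C:/Users/ID500/Desktop/SA/Corpus/siri/siri_0.txt',
#                      'C:/Users/ID500/Desktop/SA/Corpus/final/fn_siri_1.txt')
```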

Related

How to open large twitter file (30GB+) in Python?

I'm wondering what the proper way is to open large Twitter files streamed using tweepy in Python 3. I've used the following with smaller files, but now that my data collection is above 30GB, I'm getting memory errors:
with open('data.txt') as f:
    tweetStream = f.read().splitlines()

tweet = json.loads(tweetStream[0])
print(tweet['text'])
print(tweet['user']['screen_name'])
I've been unable to find what I need online so far so any help would be much appreciated.
Don't try and create an object that contains the entire file. Instead, as each line contains a tweet, work on the file one line at a time:
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])
        print(tweet['user']['screen_name'])
Perhaps store relevant tweets to another file or database, or produce a statistical summary. For example:
total = 0
about_badgers = 0
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        total += 1
        if "badger" in tweet['text'].lower():
            about_badgers += 1
print("Of " + str(total) + ", " + str(about_badgers) + " were about badgers.")
Catch errors relating to unparseable lines like this:
with open('data.txt') as f:
    for line in f:
        try:
            tweet = json.loads(line)
            print(tweet['text'])
            print(tweet['user']['screen_name'])
        except json.decoder.JSONDecodeError:
            # Do something useful, like write the failing line to an error log
            pass

Can't get any data on Python

I collected a lot of tweets, and I want to output only the English ones. I can collect all tweets, including non-English ones, but when I add the code for i in range (0,1000): if tweet['statuses'][i][u'lang']==u'en': to keep only English tweets, nothing gets collected, and there is no error.
In [1]: runfile('C:/Users/Desktop/tweets.py', wdir='C:/Users/Desktop')
It just runs, and there ("C:/Users/Desktop/A.txt") is no data. My code is as follows. What should I do with it?
try:
    import json
except ImportError:
    import simplejson as json

tweets_filename = 'C:/Users/Desktop/tweet.txt'  # Original tweets
tweets_file = open(tweets_filename, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line.strip())
        for i in range(0, 1000):  # This is the part for filtering English tweets
            if tweet['statuses'][i][u'lang'] == u'en':  # Same part
                if 'text' in tweet:
                    print(tweet['created_at'])
                    print(tweet['text'])
                    hashtags = []
                    for hashtag in tweet['entities']['hashtags']:
                        hashtags.append(hashtag['text'])
                    print(hashtags)
                    output = "C:/Users/Desktop/A.txt"  # Only English tweets path
                    out_file = open(output, 'a')
                    out_file.write(tweet['user']['name'] + "," + tweet['text'] + "\n \n")
                    out_file.close()
    except:
        continue
You have to read the lines of tweets_file, like this:
lines = tweets_file.readlines()
for line in lines:
    ...
Also, if you want to see the errors, don't catch them with a bare except. Some good reading: the Zen of Python.
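A likely reason nothing is written is that bare except: if each line in the file is a single tweet object, then tweet['statuses'][i] raises a KeyError on every line, and the except silently swallows it. A hedged sketch of a per-line filter that checks each tweet's own lang field instead (field names follow the Twitter JSON shown in the question):

```python
import json

def filter_english(input_path, output_path):
    """Append 'name,text' lines for English tweets only."""
    with open(input_path) as tweets_file, open(output_path, "a") as out_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line.strip())
            except ValueError:
                continue  # skip unparseable lines
            # Each line is one tweet, so check its own 'lang' field
            if tweet.get('lang') == 'en' and 'text' in tweet:
                out_file.write(tweet['user']['name'] + "," + tweet['text'] + "\n")
```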

write list of paragraph tuples to a csv file

The following code is designed to write tuples, each containing a large paragraph of text followed by two identifiers, to a single line per entry.
import urllib2
import json
import csv

base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"

writer = csv.writer(open("./data/events.csv", "a"))
writer.writerow(["description", "category_id", "subcategory_id"])

def format_event(event):
    return event["description"]["text"].encode("utf-8").rstrip("\n\r"), event["category_id"], event["subcategory_id"]

for x in range(1, 2):
    print "fetching page - {}".format(x)
    formatted_url = base_url.format(str(x))
    resp = urllib2.urlopen(formatted_url)
    data = resp.read()
    j_data = json.loads(data)
    events = map(format_event, j_data["events"])
    for event in events:
        #print event
        writer.writerow(event)
    print "wrote out events for page - {}".format(x)
The ideal format would be to have each line contain a single paragraph, followed by the other fields listed above, yet here is a screenshot of how the data comes out.
If instead I change this line to the following:
writer.writerow([event])
Here is how the file now looks:
It certainly looks much closer to what I want, but it's got parentheses around each entry, which are undesirable.
EDIT
Here is a snippet that contains a sample of the data I'm working with.
Can you try writing to the CSV file directly, without using the csv module? You can write/append comma-delimited strings to the CSV file just like writing to a typical text file. Also, the way you remove the \r and \n characters might not be working. You can use a regex to find those characters and replace them with an empty string "":
import urllib2
import json
import re

base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"

def format_event(event):
    ws_to_strip = re.compile(r"(\r|\n)")
    description = re.sub(ws_to_strip, "", event["description"]["text"].encode("utf-8"))
    return [description, event["category_id"], event["subcategory_id"]]

with open("./data/events.csv", "a") as events_file:
    events_file.write(",".join(["description", "category_id", "subcategory_id"]))
    for x in range(1, 2):
        print "fetching page - {}".format(x)
        formatted_url = base_url.format(str(x))
        resp = urllib2.urlopen(formatted_url)
        data = resp.read()
        j_data = json.loads(data)
        events = map(format_event, j_data["events"])
        for event in events:
            events_file.write(",".join(event))
        print "wrote out events for page - {}".format(x)
Change your csv writer to a DictWriter and make a few tweaks:
def format_event(event):
    return {"description": event["description"]["text"].encode("utf-8").rstrip("\n\r"),
            "category_id": event["category_id"],
            "subcategory_id": event["subcategory_id"]}
There may be a few other small things you need to do, but using DictWriter and formatting your data appropriately is the easiest way I've found to work with csv files.
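To flesh out the DictWriter suggestion, the writer setup might look like this. This is sketched in Python 3 (the question's code is Python 2), with a stand-in events list since the Eventbrite response isn't reproduced here:

```python
import csv

def format_event(event):
    # Strip newlines so each description stays on one CSV line
    return {"description": event["description"]["text"].replace("\n", " ").replace("\r", " "),
            "category_id": event["category_id"],
            "subcategory_id": event["subcategory_id"]}

fieldnames = ["description", "category_id", "subcategory_id"]

# Stand-in for j_data["events"] from the API response
events = [{"description": {"text": "A long\nparagraph"},
           "category_id": "103", "subcategory_id": "3008"}]

with open("events.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for event in events:
        writer.writerow(format_event(event))
```

DictWriter quotes fields containing commas automatically, which plain ",".join() does not.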

Yaml not working properly in Python3 version of pythonanywhere

Good day. I am trying to create a quick&dirty configuration file for my pythonanywhere code. I tried to use YAML, but the result is weird.
import os
import yaml
yaml_str = """Time_config:
Tiempo_entre_avisos: 1
Tiempo_entre_backups: 7
Tiempo_entre_pushes: 30
Other_config:
Morosos_treshold: 800
Mail_config:
Comunication_mail: ''
Backup_mail: ''
Director_mail: []
"""
try:
    yaml_file = open("/BBDD/file.yml", 'w+')
except:
    print("FILE NOT FOUND")
else:
    print("PROCESSING FILE")
    yaml.dump(yaml_str, yaml_file, default_flow_style=False)
    a = yaml.dump(yaml_str, default_flow_style=False)
    print(a)  # I make a print to debug
    yaml_file.close()
The code seems to work fairly well. However, the result seems corrupted. Both in the file and in the print it looks like this (including the "s):
"Time_config:\n Tiempo_entre_avisos: 1\n Tiempo_entre_backups: 7\n Tiempo_entre_pushes:\ \ 30\nOther_config:\n Morosos_treshold: 800\nMail_config:\n Comunication_mail:\ \ ''\n Backup_mail: ''\n Director_mail: []\n"
If I copy and paste that string in the python console, yaml gives me the intended result, which is this one:
Time_config:
Tiempo_entre_avisos: 1
Tiempo_entre_backups: 7
Tiempo_entre_pushes: 30
Other_config:
Morosos_treshold: 800
Mail_config:
Comunication_mail: ''
Backup_mail: ''
Director_mail: []
Why does this happen? Why am I not getting the intended result on the first try? Why is it printing the newline symbol (\n) instead of inserting a new line? And why does it include the " symbols?
I think you should be loading the yaml from the string first, then continuing. yaml.dump serializes whatever Python object you give it; since yaml_str is a plain str, it gets dumped as one long quoted scalar with \n escapes, which is exactly the "corrupted" output you're seeing. Parse it into a dict first:
# Everything before here is the same
print("PROCESSING FILE")
yaml_data = yaml.safe_load(yaml_str)  # parse the string into a dict first
yaml.dump(yaml_data, yaml_file, default_flow_style=False)
a = yaml.dump(yaml_data, default_flow_style=False)
print(a)  # I make a print to debug
yaml_file.close()
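The difference is easy to see in isolation: dumping the raw str yields one quoted scalar, while parsing first and dumping the resulting dict yields the block layout the question expects. A minimal sketch with PyYAML, using a shortened version of the config:

```python
import yaml  # PyYAML

yaml_str = "Time_config:\n  Tiempo_entre_avisos: 1\n"

# Dumping the raw string: YAML treats it as ONE scalar value,
# so the newlines get escaped and the whole thing is quoted.
as_string = yaml.dump(yaml_str, default_flow_style=False)

# Parsing first, then dumping: YAML sees a mapping and emits block style.
data = yaml.safe_load(yaml_str)
as_mapping = yaml.dump(data, default_flow_style=False)

print(as_string)   # one quoted scalar
print(as_mapping)  # block-style mapping
```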

Trying to parse twitter json from a text file

I am new to Python and am trying to parse "tweets" from a text file for analysis.
My test file has a number of tweets, here is an example of one:
{"created_at":"Mon May 06 17:39:59 +0000 2013","id":331463367074148352,"id_str":"331463367074148352","text":"Extra\u00f1o el trabajo en las aulas !! * se jala los cabellos","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":276765971,"id_str":"276765971","name":"Shiro","screen_name":"_Shira3mmanueL_","location":"","url":null,"description":null,"protected":false,"followers_count":826,"friends_count":1080,"listed_count":5,"created_at":"Mon Apr 04 01:36:52 +0000 2011","favourites_count":1043,"utc_offset":-21600,"time_zone":"Mexico City","geo_enabled":true,"verified":false,"statuses_count":28727,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"1A1B1F","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme9\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme9\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3608152674\/45133759fb72090ebbe880145d8966a6_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3608152674\/45133759fb72090ebbe880145d8966a6_normal.jpeg","profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/276765971\/1367525440","profile_link_color":"2FC2EF","profile_sidebar_border_color":"181A1E","profile_sidebar_fill_color":"252429","profile_text_color":"666666","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[19.30303082,-99.54709768]},"coordinates":{"type":"Point","coordinates":[-99.54709768,19.30303082]},"place":{"id":"1d23a12800a574a8","url":"http:\/\/api.twitter.com\/1\/geo\
/id\/1d23a12800a574a8.json","place_type":"city","name":"Lerma","full_name":"Lerma, M\u00e9xico","country_code":"MX","country":"M\u00e9xico","bounding_box":{"type":"Polygon","coordinates":[[[-99.552193,19.223171],[-99.552193,19.4343],[-99.379483,19.4343],[-99.379483,19.223171]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}
My code is:
import re
import json
import sys

pattern_split = re.compile(r"\W+")

def sentment_tbl(sent_file):
    # Read in AFINN-111.txt
    tbl = dict(map(lambda (w, s): (w, int(s)), [
        ws.strip().split('\t') for ws in open(sent_file)]))
    return tbl

def sentiment(text, afinn):
    # Word splitter pattern
    words = pattern_split.split(text.lower())
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))
    else:
        sentiment = 0
    return sentiment

def main():
    sent_file = sys.argv[1]
    afinn = sentment_tbl(sent_file)
    tweet_file = sys.argv[2]
    with open(tweet_file) as f:
        for line_str in f:
            print type(line_str)
            print line_str
            tweet = json.loads(line_str.read())
            print("%6.2f %s" % (sentiment(line_str, afinn)))
            #Test: text = "Finn is stupid and idiotic"
            #print("%6.2f %s" % (sentiment(text, afinn), text))

if __name__ == '__main__':
    main()
I get an error about
I get the feeling I am mixing apples and oranges and would like some experienced assistance
thanks, Chris
If you've written multiple tweets to a file, e.g.:
o.write(tweet1)
o.write(tweet2)
you will have to read them back line by line as well, because json can't decode a file made of multiple objects written one per line:
tweets = []
for line in open('test.txt', 'r'):
    tweets.append(json.loads(line))
Why don't you use the built-in JSON library instead of your loop, reading and parsing the whole file at once, as follows:
import json
jsonObj = json.load(open(tweet_file, 'r'))  # json.load reads from a file object
# Now jsonObj is the parsed structure corresponding to the JSON
You need to pass a string to json.loads:
tweet = json.loads(line_str)
as line_str is a string.
After that you need to make sure you properly pass tweet or some details in tweet to sentiment() for further processing. Note that you're now calling sentiment() with line_str and tweet isn't used (yet).
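Putting those fixes together, the parsing loop might be sketched like this (in Python 3, whereas the question's code is Python 2; the AFINN table is reduced to a plain dict for illustration):

```python
import json
import re

def sentiment(text, afinn):
    """Sum AFINN scores for each word in the text (0 for unknown words)."""
    words = re.split(r"\W+", text.lower())
    return float(sum(afinn.get(word, 0) for word in words))

def score_file(tweet_file, afinn):
    """Parse one tweet per line and score each tweet's text."""
    scores = []
    with open(tweet_file) as f:
        for line_str in f:
            tweet = json.loads(line_str)  # pass the string itself, not line_str.read()
            score = sentiment(tweet['text'], afinn)  # score the tweet text, not the raw JSON line
            print("%6.2f %s" % (score, tweet['text']))
            scores.append((score, tweet['text']))
    return scores
```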
