I want to use the urllib module to send HTTP requests and grab data. I can get the data by using the urlopen() function, but I'm not really sure how to incorporate it into classes. I really need help with the Query class to move forward. From the query I need to pull:
• Top Rated
• Top Favorites
• Most Viewed
• Most Recent
• Most Discussed
My issue is that I can't parse the XML document to retrieve this data, and I also don't know how to use classes to do it.
Here is what I have so far:
import urllib  # this allows the program to send HTTP requests and to read the responses

class Query:
    '''Performs the actual HTTP requests and initial parsing to build the Video
    objects from the response. It will also calculate the following information
    based on the video and user results.'''
    def __init__(self, feed_id, max_results):
        '''Takes as input the type of query (feed_id) and the maximum number of
        results (max_results) that the query should obtain. The correct HTTP
        request must be constructed and submitted. The results are converted
        into Video objects, which are stored within this class.
        '''
        self.feed = feed_id
        self.max = max_results
        top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
        results_str = top_rated.read()
        splittedlist = results_str.split('<entry')
        top_rated.close()
    def __str__(self):
        '''prints out the information on each video and Youtube user.'''
        pass

class Video:
    pass

class User:
    pass
# main function: This handles all the user inputs and stuff.
def main():
    useinput = raw_input('''Welcome to the YouTube text-based query application.
You can select a popular feed to perform a query on and view statistical
information about the related videos and users.
    1) today
    2) this week
    3) this month
    4) since youtube started
Please select a time (or 'Q' to quit): ''')
    secondinput = raw_input("\n1) Top Rated\n2) Top Favorited\n3) Most Viewed\n4) Most Recent\n5) Most Discussed\n\nPlease select a feed (or 'Q' to quit): ")
    thirdinput = raw_input("Enter the maximum number of results to obtain: ")

main()
toplist = []
top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
result_str = top_rated.read()
top_rated.close()
splittedlist = result_str.split('<entry')
x = splittedlist[1].find('title')         # find the title index
splittedlist[1][x: x+75]                  # string around the title (/ marks the end of the title)
w = splittedlist[1][x: x+75].find(">")    # gives you the start index
z = splittedlist[1][x: x+75].find("<")    # gives you the end index
titles = splittedlist[1][x: x+75][w+1:z]  # gives you the title!!!!
toplist.append(titles)
print toplist
I assume that your challenge is parsing XML.
results_str = top_rated.read()
splittedlist = results_str.split('<entry')
And I see you are using string functions to parse XML. Such functions, based on finite automata (regular languages), are NOT suited for parsing context-free languages such as XML. Expect them to break very easily.
For more reasons, please refer to RegEx match open tags except XHTML self-contained tags.
Solution: consider using an XML parser like ElementTree. It comes with Python and allows you to browse the XML tree Pythonically: http://effbot.org/zone/element-index.htm
You may come up with code like:
import xml.etree.ElementTree as ET
...
results_str = top_rated.read()
root = ET.fromstring(results_str)
for node in root:
    print node
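For instance, here is a minimal sketch of pulling the entry titles out of the feed; it assumes the feed is Atom, so the tags live in the http://www.w3.org/2005/Atom namespace:
import urllib
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'  # Atom namespace used by the GData feeds

feed = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
root = ET.fromstring(feed.read())
feed.close()

# Each <entry> is one video; its <title> child holds the video title.
titles = [entry.find(ATOM + 'title').text for entry in root.findall(ATOM + 'entry')]
print titles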
I also don't know how to use classes to do it.
Don't be in a rush to create classes :-)
In the above example, you are importing a module, not importing a class and instantiating/initializing it, as you would in Java. Python has powerful primitive types (dictionaries, lists) and treats modules as objects, so (IMO) you can go easy on classes.
You use classes to organize stuff, not because your teacher has indoctrinated you with "classes are good. Let's have lots of them".
Basically you want to use the Query class to communicate with the API.
    def __init__(self, feed_id, max_results, time):
        qs = ("http://gdata.youtube.com/feeds/api/standardfeeds/" + feed_id +
              "?max-results=" + str(max_results) + "&time=" + time)
        self.feed_id = feed_id
        self.max_results = max_results
        wo = urllib.urlopen(qs)
        result_str = wo.read()
        wo.close()
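As a usage sketch of that constructor (the argument values here are only illustrative, not something from your assignment):
q = Query("top_rated", 10, "today")  # feed id, max results, time window -- example values only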
I would like to use Python to convert all synonyms and plural forms of words to the base version of the word.
e.g. Babies would become baby, and so would infant and infants.
I tried writing a naive plural-to-root converter, but it doesn't always work correctly and can't handle a large number of cases.
contents = ["buying", "stalls", "responsibilities"]

# Write the stripped form back into the list so the change shows up when printing.
for i, token in enumerate(contents):
    if token.endswith("ies"):
        contents[i] = token.replace('ies', 'y')
    elif token.endswith('s'):
        contents[i] = token[:-1]
    elif token.endswith("ed"):
        contents[i] = token[:-2]
    elif token.endswith("ing"):
        contents[i] = token[:-3]

print(contents)
I have not used this library before, so take this with a grain of salt. However, NodeBox Linguistics seems to be a reasonable set of scripts that will do exactly what you are looking for if you are on macOS. Check the link here: https://www.nodebox.net/code/index.php/Linguistics
Based on their documentation, it looks like you will be able to use lines like so:
print( en.noun.singular("people") )
>>> person
print( en.verb.infinitive("swimming") )
>>> swim
etc.
In addition to the example above, another option to consider is a natural language processing library like NLTK. The reason I recommend using an external library is that English has a lot of exceptions. As mentioned in my comment, consider words like class, fling, red, geese, etc., which would trip up the rules mentioned in the original question.
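As a rough sketch of what that could look like with NLTK's WordNet lemmatizer (this assumes you have already run nltk.download('wordnet'); note that verbs need an explicit part-of-speech tag):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("babies"))           # noun lookup by default -> 'baby'
print(lemmatizer.lemmatize("buying", pos="v"))  # verbs need pos="v" -> 'buy'
print(lemmatizer.lemmatize("geese"))            # irregular plural -> 'goose'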
I built a Python library, Plurals and Countable, which is open source on GitHub. The main purpose is to get plurals (yes, multiple plurals for some words), but it also solves this particular problem.
import plurals_counterable as pluc
pluc.pluc_lookup_plurals('men', strict_level='dictionary')
will return a dictionary like the following:
{
    'query': 'men',
    'base': 'man',
    'plural': ['men'],
    'countable': 'countable'
}
The base field is what you need.
The library actually looks up the words in dictionaries, so it takes some time to request, parse and return. Alternatively, you might use the REST API provided by Dictionary.video. You'll need to contact admin#dictionary.video to get an API key. The call will look like:
import requests
import json
import logging

# Wrapped in a function (taking the word to look up) so the return statements are valid.
def get_base(word):
    url = 'https://dictionary.video/api/noun/plurals/' + word + '?key=YOUR_API_KEY'
    response = requests.get(url)
    if response.status_code == 200:
        return json.loads(response.text)['base']
    else:
        logging.error(url + ' response: status_code[%d]' % response.status_code)
        return None
I have a 7 GB XML file. It contains all the transactions in one company, and I want to filter out only the records from last year (2015).
The structure of the file is:
<Customer>
    <Name>A</Name>
    <Year>2015</Year>
</Customer>
I have also its DTD file.
I do not know how I can filter such data into a text file.
Is there any tutorial or library to be used in this regard?
Welcome!
As your data is large, I assume you've already decided that you won't be able to load the whole lot into memory. Loading it all at once would be the approach taken by a DOM-style (document object model) parser. And you've actually tagged your question 'SAX' (the Simple API for XML), which further implies you know that you need a non-memory approach.
Two approaches come to mind:
Using grep
Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:
$ grep -B 2 -A 1 "<Year>2015</Year>" test.xml
The -B and -A options instruct grep to print some lines of context around the match.
However, this approach will only work if your XML is also precisely structured plain text, which, as XML, it has absolutely no need to be. That is, your XML could have any combination of whitespace (or none at all) and still be semantically identical, but the grep approach depends on the exact whitespace arrangement.
SAX
So a more reliable non-memory approach would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you subclass a class that provides methods which are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the standard library, this class is ContentHandler. These methods include:
startElement
endElement
characters
Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine what the tag name of the element is. You might then maintain a stack of elements you have entered. When endElement(name) occurs, you might pop the top element off that stack, and possibly do some processing on the completed element. The characters(content) method is called when character data is encountered in the source document. In this method you might build up a string of the character data, which can then be processed when you encounter an endElement.
So for your specific task, something like this may work:
from xml.sax import parse
from xml.sax.handler import ContentHandler

class filter2015(ContentHandler):
    def __init__(self):
        self.elements = []            # stack of elements
        self.char_data = u''          # string buffer
        self.current_customer = u''   # name of customer
        self.current_year = u''

    def startElement(self, name, attrs):
        if name == u'Name':
            self.elements.append(u'Name')
        if name == u'Year':
            self.elements.append(u'Year')

    def characters(self, chars):
        if len(self.elements) > 0 and self.elements[-1] in [u'Name', u'Year']:
            self.char_data += chars

    def endElement(self, name):
        if len(self.elements) > 0:
            self.elements.pop()
        if name == u'Name':
            self.current_customer = self.char_data
            self.char_data = ''
        if name == u'Year':
            self.current_year = self.char_data
            self.char_data = ''
        if name == 'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year == u'2015':
                print 'Found:', self.current_customer
            # clear the buffers now that the Customer is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''

source = open('test.xml')
parse(source, filter2015())
Check out this question; it shows an XML parser you can interact with as a generator:
python: is there an XML parser implemented as a generator?
You want to use a generator so that you don't load the entire doc into memory first.
Specifically:
import xml.etree.cElementTree as ET

for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        pass  # process ticket...
Source: http://enginerds.craftsy.com/blog/2014/04/parsing-large-xml-files-in-python-without-a-billion-gigs-of-ram.html
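Adapting that pattern to the Customer/Year structure in the question might look something like the sketch below. The filename is just a placeholder, and element.clear() keeps memory use low on a 7 GB file:
import xml.etree.cElementTree as ET

# 'transactions.xml' is a placeholder filename; substitute your own file.
for event, element in ET.iterparse('transactions.xml'):
    if event == 'end' and element.tag == 'Customer':
        name = element.findtext('Name')
        year = element.findtext('Year')
        if year == '2015':
            print 'Found:', name
        element.clear()  # free the finished element so the whole file never sits in memory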
Hello all… my question is about how to find the location of a pair of geographical coordinates. Sorry the questions are fragmented; I did some searching, but putting the pieces together still hasn't given me the right way to do this.
1st question:
Geopy seems to be the best way to do geocoding, if output in non-English words is not a problem (any help is warmly welcomed for solving the non-English output issue).
However, some of the coordinates I need to search are in uninhabited areas.
An example below:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
location = geolocator.reverse("81.79875708, -42.1875")
print(location.address)
It returns:
None
When I search these coordinates on https://maps.google.com/, it shows the nearest recognized point, “Northeast Greenland National Park”, which is better than “None”.
So how can I get this “nearest recognized point”?
2nd question on using “pygmaps”
aamap = pygmaps.maps(81.79875708, -42.1875, 16)
the output is:
<pygmaps.maps instance at 0x000000000313CC08>
What's the use of it, and how can I turn it into text?
3rd question, using the “JSON” way:
response = urllib2.urlopen('https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA')
js = json.load(response)
print js['results'][0]['geometry']['location']
Is there a way to reverse this, i.e. use coordinates to find out the address?
Thanks.
I cannot answer all your questions, but regarding your second question, I suppose that you used print(aamap) to get that output.
The default behavior when printing an object that doesn't specify a way to be printed is to show its class and its memory location. That is the result you obtained:
<pygmaps.maps instance at 0x000000000313CC08>
Test this code to understand:
class NotPrintable(object):
    def __init__(self, text):
        """ Instantiation method """
        self.text = text

class Printable(object):
    def __init__(self, text):
        """ Instantiation method """
        self.text = text

    def __repr__(self):
        """ Text representation of the object, called by print """
        return "A printable object whose value is : {0}".format(self.text)

if __name__ == "__main__":
    print( NotPrintable("StackOverFlowTest") )
    print( Printable("StackOverFlowTest") )
The result is:
<__main__.NotPrintable object at 0x7f2e644fe208>
A printable object whose value is : StackOverFlowTest
The result you get indicates that you are not supposed to print the item to visualize its content. A quick search on Google led me to this site: https://code.google.com/p/pygmaps/
It provides examples of code using pygmaps, and it appears that the right way to visualize the data is:
#create the map object
mymap = pygmaps.maps(37.428, -122.145, 16)
#export the map
mymap.draw('./mymap.html')
That creates an HTML page that contains your map.
If you're interested, you could take a look at the pygmaps code at https://code.google.com/p/pygmaps/source/browse/trunk/pygmaps.py, but it seems that HTML is the only way to output the object data. If you really have to provide text output, you'll probably have to code the export yourself.
Hope it helps!
=====EDIT=====
Concerning the Google Maps JSON API, see the documentation: https://developers.google.com/maps/documentation/geocoding/
In the first lines you can find information about reverse geocoding, i.e.:
The process of [...] translating a location on the map into a human-readable address, is known as reverse geocoding.
See this particulary : https://developers.google.com/maps/documentation/geocoding/#ReverseGeocoding
Here is an example of a query for information about the location whose latitude is "40.71" and longitude is "-73.96":
https://maps.googleapis.com/maps/api/geocode/json?latlng=40.71,-73.96&key=API_KEY
Here is the documentation about API_key : https://developers.google.com/console/help/#UsingKeys
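To tie that back to the urllib2/json snippet in the question, a reverse-geocoding call might look something like this sketch (API_KEY is a placeholder you would replace with your own key):
import urllib2
import json

# 'API_KEY' is a placeholder; substitute your own Google API key.
url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng=40.71,-73.96&key=API_KEY'
response = urllib2.urlopen(url)
js = json.load(response)

# The first result's formatted_address field holds the human-readable address.
if js['results']:
    print js['results'][0]['formatted_address']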
I've been learning Python for a couple of months through online courses and would like to further my learning through a real world mini project.
For this project, I would like to collect tweets from the Twitter streaming API and store them in JSON format (though you could choose to save just the key information like status.text and status.id, I've been advised that the best way to do this is to save all the data and do the processing afterwards). However, with the addition of on_data() the code ceases to work. Would someone be able to assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age, etc.) and the sentiment of particular brands (e.g., Apple, HTC, Samsung).
In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module separately. However, while it works when there are a few keywords, it stops when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?
### code to save tweets in json ###
import sys
import tweepy
import json

consumer_key = " "
consumer_secret = " "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])
I found a way to save the tweets to a json file. Happy to hear how it can be improved!
# initialize blank list to contain tweets
tweets = []
# file name that you want to open is the second argument
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()
        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print tweet
        save_file.write(str(tweet))
In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.
Why does it break with the addition of on_data?
Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.
There are a few things I might do differently than your answer.
tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because they all refer to the same list object in memory; if that's confusing, here's a basic example of what I mean:
>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]
Notice that even though you thought you appended 7 only to foo, foo and bar actually refer to the same thing (and therefore changing one changes both).
If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()
        self.list_of_tweets = []
This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the property name from self.save_file to self.list_of_tweets because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing to human me that self.save_file is a list and save_file is a file. It helps future you and anyone else that reads your code figure out what the heck everything does/is. More on variable naming.
In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin function that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.
How do I filter results on multiple keywords?
All keywords are OR'd together in this type of search, source:
sapi.filter(track=['twitter', 'python', 'tweepy'])
This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want tweets that contain all of the terms (AND), you have to post-process by checking each tweet against the list of all terms you want to search for (a small sketch of that check follows below).
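A minimal post-processing sketch, assuming you keep the terms in a list and check the tweet text after it arrives (the helper name is just illustrative):
# Hypothetical helper: keep a tweet only if its text contains every term.
terms = ['twitter', 'python', 'tweepy']

def contains_all_terms(text, terms):
    text = text.lower()
    return all(term in text for term in terms)

# e.g. inside on_status/on_data you might do:
#     if contains_all_terms(status.text, terms):
#         ...store the tweet...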
How do I filter results based on location AND keyword?
I actually just realized that you did ask this as its own question as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:
sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
What is the best way to store/process tweets?
That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do a sentiment analysis on a lot of tweets. When you collect data, you should only collect things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country and time, only take those three things; don't save the entire JSON dump of the tweet, because it will just take up unnecessary space. A sketch of this idea follows below.
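A rough sketch of that idea, assuming you only care about a few fields and write one JSON object per line. The field names follow Twitter's standard tweet JSON; the listener class name and output filename are just placeholders:
import json
import tweepy

class SlimStreamListener(tweepy.StreamListener):  # hypothetical listener name
    def on_data(self, data):
        tweet = json.loads(data)
        # Keep only the fields we actually plan to analyse.
        slim = {
            'text': tweet.get('text'),
            'created_at': tweet.get('created_at'),
            'country': (tweet.get('place') or {}).get('country'),
        }
        with open('slim_tweets.json', 'a') as out:  # placeholder filename
            out.write(json.dumps(slim) + '\n')
        return True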
I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task).
# ...
client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets
# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except:
            # Catch any unicode errors while printing to console
            # and just ignore them to avoid breaking application.
            pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
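For the created_at TODO above, one possible sketch is to parse Twitter's date string into a datetime before inserting. Twitter's created_at strings look like 'Sat May 09 20:15:00 +0000 2015'; the helper name here is just illustrative:
from datetime import datetime

def parse_created_at(created_at):
    # Twitter's created_at strings look like 'Sat May 09 20:15:00 +0000 2015'.
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

# e.g. before inserting:
#     twitter_json['created_at'] = parse_created_at(twitter_json['created_at'])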
According to the feedparser documentation, I can turn an RSS feed into a parsed object like this:
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
but I can't find anything showing how to go the other way; I'd like to be able to manipulate 'd' and then output the result as XML:
print d.toXML()
but there doesn't seem to be anything in feedparser for going in that direction. Am I going to have to loop through d's various elements, or is there a quicker way?
Appended is a not hugely elegant but working solution: it uses feedparser to parse the feed, you can then modify the entries, and it passes the data to PyRSS2Gen. It preserves most of the feed info (the important bits anyway; there are some things that will need extra conversion, the parsed_feed['feed']['image'] element for example).
I put this together as part of a little feed-processing framework I'm fiddling about with. It may be of some use (it's pretty short; it should be less than 100 lines of code in total when done).
#!/usr/bin/env python

import datetime

# http://www.feedparser.org/
import feedparser
# http://www.dalkescientific.com/Python/PyRSS2Gen.html
import PyRSS2Gen

# Get the data
parsed_feed = feedparser.parse('http://reddit.com/.rss')

# Modify the parsed_feed data here

items = [
    PyRSS2Gen.RSSItem(
        title = x.title,
        link = x.link,
        description = x.summary,
        guid = x.link,
        pubDate = datetime.datetime(
            x.modified_parsed[0],
            x.modified_parsed[1],
            x.modified_parsed[2],
            x.modified_parsed[3],
            x.modified_parsed[4],
            x.modified_parsed[5])
        )
    for x in parsed_feed.entries
]

# make the RSS2 object
# Try to grab the title, link, language etc. from the original feed
rss = PyRSS2Gen.RSS2(
    title = parsed_feed['feed'].get("title"),
    link = parsed_feed['feed'].get("link"),
    description = parsed_feed['feed'].get("description"),
    language = parsed_feed['feed'].get("language"),
    copyright = parsed_feed['feed'].get("copyright"),
    managingEditor = parsed_feed['feed'].get("managingEditor"),
    webMaster = parsed_feed['feed'].get("webMaster"),
    pubDate = parsed_feed['feed'].get("pubDate"),
    lastBuildDate = parsed_feed['feed'].get("lastBuildDate"),
    categories = parsed_feed['feed'].get("categories"),
    generator = parsed_feed['feed'].get("generator"),
    docs = parsed_feed['feed'].get("docs"),
    items = items
)

print rss.to_xml()
If you're looking to read in an XML feed, modify it and then output it again, there's a page on the main Python wiki indicating that the RSS.py library might support what you're after (it reads most RSS and is able to output RSS 1.0). I've not looked at it in much detail though.
from xml.dom import minidom

doc = minidom.parse('./your/file.xml')
print doc.toxml()
The only problem is that it does not download feeds from the internet.
As a method of making a feed, how about PyRSS2Gen? :)
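As an aside on the download limitation mentioned above, one way around it could be to fetch the feed with urllib first and hand the string to minidom; this is just a sketch, and the URL is the example one from the question:
import urllib
from xml.dom import minidom

# Fetch the feed over HTTP, then parse the XML string locally.
feed = urllib.urlopen('http://feedparser.org/docs/examples/atom10.xml')
doc = minidom.parseString(feed.read())
feed.close()
print doc.toxml()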
I've not played with FeedParser, but have you tried just doing str(yourFeedParserObject)? I've often been surprised by various modules that have __str__ methods to just output the object as text.
[Edit] Just tried the str() method and it doesn't work on this one. Worth a shot though ;-)