Python Parsing with lxml

I've created the following scraper for NFL play-by-play data. It writes the results to a CSV file and does everything I need it to, except that I don't know how to attach a column for who actually has possession of the ball in each line of the CSV file.
I can grab the text from the "home" and "away" <tr> tags to show who is playing in the game for query purposes later, but I need the scraper to recognize when possession changes (goes from home to away or vice versa). I'm fairly new to Python and have tried different indentation, but I don't think that's the issue. Any help would be greatly appreciated. I feel like the answer is beyond my scope of understanding.
I also realize that my code probably isn't the most Pythonic, but I'm still learning. I'm using Python 2.7.9.
import lxml
from lxml import html
import csv
import urllib2
import re

game_date = raw_input('Enter game date: ')
data_html = 'http://www.cbssports.com/nfl/gametracker/playbyplay/NFL_20160109_PIT@CIN'
url = urllib2.urlopen(data_html).read()
data = lxml.html.fromstring(url)
plays = data.cssselect('tr#play')
home = data.cssselect('tr#home')
away = data.cssselect('tr#away')
csvfile = open('C:\\DATA\\PBP.csv', 'a')
writer = csv.writer(csvfile)
for play in plays:
    frame = []
    play = play.text_content()
    down = re.search(r'\d', play)
    if down == None:
        pass
    else:
        down = down.group()
    dist = re.search(r'-(\d+)', play)
    if dist == None:
        pass
    else:
        dist = dist.group(1)
    field_end = re.search(r'[A-Z]+', play)
    if field_end == None:
        pass
    else:
        field_end = field_end.group()
    yard_line = re.search(r'[A-Z]+([\d]+)', play)
    if yard_line == None:
        pass
    else:
        yard_line = yard_line.group(1)
    desc = re.search(r'\s(.*)', play)
    if desc == None:
        pass
    else:
        desc = desc.group()
    time = re.search(r'\((..*\d)\)\s', play)
    if time == None:
        pass
    else:
        time = time.group(1)
    for team in away:
        teamA = team.text_content()
        teamA = re.search(r'(\w+)\s', teamA)
        teamA = teamA.group(1)
        teamA = teamA.upper()
    for team in home:
        teamH = team.text_content()
        teamH = re.search(r'(\w+)\s', teamH)
        teamH = teamH.group(1)
        teamH = teamH.upper()
    frame.append(game_date)
    frame.append(down)
    frame.append(dist)
    frame.append(field_end)
    frame.append(yard_line)
    frame.append(time)
    frame.append(teamA)
    frame.append(teamH)
    frame.append(desc)
    writer.writerow(frame)
csvfile.close()

I guess you need to append another value to the frame for each row, as an indication of whether the possession changed.
After:
frame.append(desc)
add:
if teamA == teamH:
    frame.append("Same possession")
else:
    frame.append("Changed possession")
(Note that this assumes the team names are consistent, with no extra spaces/padding/formatting in the teamA/teamH values.)
You don't have to use strings; for example, you could put 0 for no change and 1 for a change of possession.
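As a minimal sketch of that numeric variant (same comparison as the snippet above, just a 0/1 flag instead of the strings):
# numeric flag: 0 = same possession, 1 = changed possession
frame.append(0 if teamA == teamH else 1)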
HTH
Barny

Related

Extract text from a Word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the Email using the code below.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues with it.
There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, Tel.: other times, and even Te: once; I also noticed that one of the emails is just in the last line for its distributor without the Email: prefix, and that the State isn't always in the last line... Still, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]

    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText

        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []

        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)

    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later

        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])

        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()

        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']

        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc. in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view it as a DataFrame with:
# import docx
# import pandas
content = docx.Document('HE Distributors.docx')

# pandas.DataFrame(splitParas(content.paragraphs))  # <-- all columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
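If a uniform type is preferred, here is a small, hedged post-processing sketch (it assumes the DataFrame is bound to a name such as df, which the snippet above doesn't do) that coerces "Tel" into always being a list of strings:
# hypothetical follow-up: make every "Tel" value a list of strings
df = pandas.DataFrame(splitParas(content.paragraphs))
df['Tel'] = df['Tel'].apply(
    lambda t: t if isinstance(t, list) else ([] if pandas.isna(t) else [t])
)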

Scraping only new elements from a website using requests and bs4

I want to look at a certain website to collect data from it. On the first visit to the website I collect all the data that is there, in order to ignore it. I want to perform a certain action if a new row is added (for example, print it, as here). But whenever a new item appears, it seems to print every single row on the website, even though I'm checking if the row already exists in the dictionary. I don't know how to fix it; can anyone take a look?
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup


class KillStatistics:
    def __init__(self):
        self.records = {}
        self.watched_names = ["Test"]
        self.iter = 0

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    pass
                else:
                    entry = data.text
                    entry = re.split("..?(?=[0-9][A-Z]).", data.text)
                    entry[0] = entry[0].split(", ")
                    entry[0][0] = entry[0][0].split(".")
                    entry_id, day, month, year, hour = (
                        entry[0][0][0],
                        entry[0][0][1],
                        entry[0][0][2],
                        entry[0][0][3],
                        entry[0][1],
                    )
                    message = entry[1]
                    nickname = (re.findall(".+?(?=at)", message)[0]).strip()
                    killed_by = (re.findall(r"(?<=\bby).*", message)[0]).strip()
                    if self.iter < 1:
                        """Its the first visit to the website, i want it to download all the data and store it in dictionary"""
                        self.records[
                            entry_id
                        ] = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                    elif (
                        self.iter > 1
                        and f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                        not in self.records.values()
                    ):
                        """Here I want to look into the dictionary to check if the element exists in it,
                        if not print it and add to the dictionary at [entry_id] so we can skip it in next iteration
                        Don't know why but whenever a new item appears on the website it seems to edit every item in the dictionary instead of just editing the one
                        that wasnt there"""
                        print(
                            f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                        )
                        self.records[
                            entry_id
                        ] = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
        print("---")
        self.iter += 1


ks = KillStatistics()

if __name__ == "__main__":
    while True:
        ks.parse_records()
        time.sleep(10)
The entry_ids are always the same: there are 500 rows of data with ids 1, 2, 3, ... 500, and the newest is always 1. I know I could always check id 1 to get the newest, but sometimes, for example, 10 players can die at the same time, so I would like to check them all for changes and only print the new ones.
Current output:
Velerion was killed by Rat and Cave Rat at 27-12-2021 16:53
Scrappy was killed by Cursed Queen at 27-12-2021 16:52
Velerion was killed by Rat at 27-12-2021 16:28
Velerion was killed by Rat at 27-12-2021 16:22
Velerion was killed by Rat at 27-12-2021 16:21
Velerion was killed by Rat at 27-12-2021 15:51
Shade was killed by Tentacle Slayer at 27-12-2021 15:46
Mr Yahoo was killed by Immortal Hunter at 27-12-2021 15:41
Scrappy was killed by Witch Hunter at 27-12-2021 15:39
Barbudo Arqueiro was killed by Seahorse at 27-12-2021 15:23
Emperor Martino was killed by Dark Slayer at 27-12-2021 15:14
Shade was killed by Tentacle Slayer at 27-12-2021 15:11
Head Hunter was killed by Demon Blood Slayer at 27-12-2021 15:09
Expected output:
Velerion was killed by Rat and Cave Rat at 27-12-2021 16:53
Here's what I've changed in your processing. I'm using one regex to parse all the header information. That gets me all 7 numeric fields at once, plus the overall length of the match tells me where the message starts.
Then, I'm using the timestamp to determine which data is new. The newest entry is always first, so I grab the first timestamp of the lot to use as the threshold for the next pass.
Then, I'm storing the entries in a list instead of a dict. If you don't really need to store them forever, but just want to print them, then you don't need to track the list at all.
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup

prefix = r"(\d+)\.(\d+)\.(\d+)\.(\d+)\, (\d+):(\d+):(\d+)"


class KillStatistics:
    def __init__(self):
        self.records = []
        self.latest = (0, 0, 0, 0, 0, 0, 0)
        self.watched_names = ["Test"]

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        latest = None
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    continue
                entry = data.text
                mo = re.match(prefix, entry)
                entry_id, day, month, year, hour, mm, ss = mo.groups()
                stamp = tuple(int(i) for i in (year, month, day, hour, mm, ss))
                if latest is None:
                    latest = stamp
                if stamp > self.latest:
                    rest = entry[mo.span()[1]:]
                    i = rest.find(" at ")
                    j = rest.find(" by ")
                    nickname = rest[:i]
                    killed_by = rest[j+4:]
                    msg = f"{nickname} was killed by {killed_by} at {day}-{month}-{year} {hour}"
                    print(msg)
                    self.records.append(msg)
        print("---")
        self.latest = latest


ks = KillStatistics()

if __name__ == "__main__":
    while True:
        ks.parse_records()
        time.sleep(10)
I managed to find the solution by myself. In this specific case, I needed to create a list of entries and a copy of it. Then, after each lookup, I compare the newly created list to the old one and, using set().difference(), I return the new records.
import requests
import re
import copy
import time
from datetime import date
from bs4 import BeautifulSoup


class KillStatistics:
    def __init__(self):
        self.records = []
        self.old_table = []
        self.watched_names = ["Test"]
        self.visited = False

    def parse_records(self):
        r = requests.get("http://149.56.28.71/?subtopic=killstatistics")
        soup = BeautifulSoup(r.content, "html.parser")
        table = soup.findChildren("table")
        self.records = []
        for record in table:
            for data in record:
                if data.text == "Last Deaths":
                    continue
                else:
                    entry = data.text
                    entry = re.split("..?(?=[0-9][A-Z]).", data.text)
                    entry[0] = entry[0].split(", ")
                    entry[0][0] = entry[0][0].split(".")
                    entry_id, day, month, year, hour = (
                        entry[0][0][0],
                        entry[0][0][1],
                        entry[0][0][2],
                        entry[0][0][3],
                        entry[0][1],
                    )
                    ## record times are in EST; they can be converted to Brazilian time and CET
                    message = entry[1]
                    nickname = (re.findall(".+?(?=at)", message)[0]).strip()
                    killed_by = (re.findall(r"(?<=\bby).*", message)[0]).strip()
                    record_to_add = (
                        f"{nickname},{killed_by},{day}-{month}-{year} {hour}"
                    )
                    self.records.append(record_to_add)
        if len(self.old_table) > 0:
            self.compare_records(self.old_table, self.records)
        else:
            print("Setting up initial data...")
        print("---")
        self.visited = True

    def compare_records(self, old_records, new_records):
        new_record = list(set(new_records).difference(old_records))
        if len(new_record) > 0:
            print("We got new record")
            for i in new_record:
                print(i)
                with open("/home/sammy/gp/new_records", "a") as f:
                    f.write(i + "\n")
        else:
            print("No new records")


class DiscordBot:
    def __init__(self):
        pass


if __name__ == "__main__":
    ks = KillStatistics()
    while True:
        ks.parse_records()
        ks.old_table = copy.deepcopy(ks.records)
        time.sleep(10)

Wikipedia Infobox parser with Multi-Language Support

I am trying to develop an Infobox parser in Python which supports all the languages of Wikipedia. The parser will get the infobox data and will return the data in a Dictionary.
The keys of the Dictionary will be the property which is described (e.g. Population, City name, etc...).
The problem is that Wikipedia has slightly different page contents for each language. But the most important thing is that the API response structure for each language can also be different.
For example, the API response for 'Paris' in English contains this Infobox:
{{Infobox French commune |name = Paris |commune status = [[Communes of France|Commune]] and [[Departments of France|department]] |image = <imagemap> File:Paris montage.jpg|275px|alt=Paris montage
and in Greek, the corresponding part for 'Παρίσι' is:
[...] {{Πόλη (Γαλλία) | Πόλη = Παρίσι | Έμβλημα =Blason paris 75.svg | Σημαία =Mairie De Paris (SVG).svg | Πλάτος Σημαίας =120px | Εικόνα =Paris - Eiffelturm und Marsfeld2.jpg [...]
In the second example, there isn't any 'Infobox' occurrence after the {{. Also, in the API response the name = Paris is not the exact translation for Πόλη = Παρίσι. (Πόλη means city, not name)
Because of such differences between the responses, my code fails.
Here is the code:
import urllib
from lxml import etree
import guess_language


class WikipediaInfobox():
    # Class to get and parse the Wikipedia Infobox Data
    infoboxArrayUnprocessed = []  # Maintains the order in which the data is displayed.
    infoboxDictUnprocessed = {}   # Still contains brackets and wikitext coding. Will be processed more later...
    language = "en"

    def getInfoboxDict(self, infoboxRaw):  # Get the Infobox in Dict and Array form (Unprocessed)
        if infoboxRaw.strip() == "":
            return {}
        boxLines = [line.strip().replace(" ", " ") for line in infoboxRaw.splitlines()]
        wikiObjectType = boxLines[0]
        infoboxData = [line[1:] for line in boxLines[1:]]
        toReturn = {"wiki_type": wikiObjectType}
        for i in infoboxData:
            key = i.split("=")[0].strip()
            value = ""
            if i.strip() != key + "=":
                value = i.split("=")[1].strip()
            self.infoboxArrayUnprocessed.append({key: value})
            toReturn[key] = value
        self.infoboxDictUnprocessed = toReturn
        return toReturn

    def getInfoboxRaw(self, pageTitle, followRedirect=False, resetOld=True):  # Get Infobox in Raw Text
        if resetOld:
            infoboxDict = {}
            infoboxDictUnprocessed = {}
            infoboxArray = []
            infoboxArrayUnprocessed = []
        params = {"format": "xml", "action": "query", "prop": "revisions", "rvprop": "timestamp|user|comment|content"}
        params["titles"] = "%s" % urllib.quote(pageTitle.encode("utf8"))
        qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
        url = "http://" + self.language + ".wikipedia.org/w/api.php?%s" % qs
        tree = etree.parse(urllib.urlopen(url))
        revs = tree.xpath('//rev')
        if len(revs) == 0:
            return ""
        if "#REDIRECT" in revs[-1].text and followRedirect == True:
            redirectPage = revs[-1].text[revs[-1].text.find("[[")+2:revs[-1].text.find("]]")]
            return self.getInfoboxRaw(redirectPage, followRedirect, resetOld)
        elif "#REDIRECT" in revs[-1].text and followRedirect == False:
            return ""
        infoboxRaw = ""
        if "{{Infobox" in revs[-1].text:  # -> No Multi-language support:
            infoboxRaw = revs[-1].text.split("{{Infobox")[1].split("}}")[0]
        return infoboxRaw

    def __init__(self, pageTitle="", followRedirect=False):  # Constructor
        if pageTitle != "":
            self.language = guess_language.guessLanguage(pageTitle)
            if self.language == "UNKNOWN":
                self.language = "en"
            infoboxRaw = self.getInfoboxRaw(pageTitle, followRedirect)
            self.getInfoboxDict(infoboxRaw)  # Now the parsed data is in self.infoboxDictUnprocessed
Some parts of this code were found on this blog...
I don't want to reinvent the wheel, so maybe someone has a nice solution for multi-language support and neat parsing of the Infobox section of Wikipedia.
I have seen many alternatives, like DBpedia or some other parsers that MediaWiki recommends, but I haven't found anything that suits my needs yet. I also want to avoid scraping the page with BeautifulSoup, because it can fail in some cases, but if it is necessary it will do.
If something isn't clear enough, please ask. I want to help as much as I can.
Wikidata is definitely the first choice these days if you want structured data. Still, if in the future you need to parse data from Wikipedia articles, especially as you are using Python, I can recommend mwparserfromhell, a Python library aimed at parsing wikitext that has an option to extract templates and their attributes. That won't directly fix your issue, since the templates in different languages will definitely differ, but it might be useful if you continue trying to parse wikitext.
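For illustration, a minimal sketch of that mwparserfromhell approach (assuming the raw wikitext of the page has already been fetched, e.g. by code along the lines of getInfoboxRaw above): it takes the first top-level template and returns its parameters as a dict, without depending on the English word "Infobox".
import mwparserfromhell

def infobox_to_dict(wikitext):
    wikicode = mwparserfromhell.parse(wikitext)
    templates = wikicode.filter_templates()
    if not templates:
        return {}
    # assumption: the infobox is the first top-level template, which holds for
    # many articles but not all of them
    infobox = templates[0]
    result = {"wiki_type": str(infobox.name).strip()}
    for param in infobox.params:
        result[str(param.name).strip()] = str(param.value).strip()
    return result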

Pyuno indexing issue that I would like an explanation for

The following Python LibreOffice UNO macro works, but only with the try..except statement.
The macro allows you to select text in a Writer document and send it to a search engine in your default browser.
The issue is that if you select a single piece of text, oSelected.getByIndex(0) is populated, but if you select multiple pieces of text, oSelected.getByIndex(0) is not populated. In this case the data starts at oSelected.getByIndex(1) and oSelected.getByIndex(0) is left blank.
I have no idea why this should be and would love to know if anyone can explain this strange behaviour.
#!/usr/bin/python
import os
import webbrowser
from configobj import ConfigObj
from com.sun.star.awt.MessageBoxButtons import BUTTONS_OK, BUTTONS_OK_CANCEL, BUTTONS_YES_NO, BUTTONS_YES_NO_CANCEL, BUTTONS_RETRY_CANCEL, BUTTONS_ABORT_IGNORE_RETRY
from com.sun.star.awt.MessageBoxButtons import DEFAULT_BUTTON_OK, DEFAULT_BUTTON_CANCEL, DEFAULT_BUTTON_RETRY, DEFAULT_BUTTON_YES, DEFAULT_BUTTON_NO, DEFAULT_BUTTON_IGNORE
from com.sun.star.awt.MessageBoxType import MESSAGEBOX, INFOBOX, WARNINGBOX, ERRORBOX, QUERYBOX


def fs3Browser(*args):
    # get the doc from the scripting context which is made available to all scripts
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    doc = XSCRIPTCONTEXT.getDocument()
    parentwindow = doc.CurrentController.Frame.ContainerWindow
    oSelected = model.getCurrentSelection()
    oText = ""
    try:
        for i in range(0, 4, 1):
            print("Index No ", str(i))
            try:
                oSel = oSelected.getByIndex(i)
                print(str(i), oSel.getString())
                oText += oSel.getString() + " "
            except:
                break
    except AttributeError:
        mess = "Do not select text from more than one table cell"
        heading = "Processing error"
        MessageBox(parentwindow, mess, heading, INFOBOX, BUTTONS_OK)
        return
    lookup = str(oText)
    special_c = str.maketrans("", "", '!|##"$~%&/()=?+*][}{-;:,.<>')
    lookup = lookup.translate(special_c)
    lookup = lookup.strip()
    configuration_dir = os.environ["HOME"] + "/fs3"
    config_filename = configuration_dir + "/fs3.cfg"
    if os.access(config_filename, os.R_OK):
        cfg = ConfigObj(config_filename)
        # define search engine from the configuration file
        try:
            searchengine = cfg["control"]["ENGINE"]
        except:
            searchengine = "https://duckduckgo.com"
    if 'duck' in searchengine:
        webbrowser.open_new('https://www.duckduckgo.com//?q=' + lookup + '&kj=%23FFD700 &k7=%23C9C4FF &ia=meanings')
    else:
        webbrowser.open_new('https://www.google.com/search?/&q=' + lookup)
    return None


def MessageBox(ParentWindow, MsgText, MsgTitle, MsgType, MsgButtons):
    ctx = XSCRIPTCONTEXT.getComponentContext()
    sm = ctx.ServiceManager
    si = sm.createInstanceWithContext("com.sun.star.awt.Toolkit", ctx)
    mBox = si.createMessageBox(ParentWindow, MsgType, MsgButtons, MsgTitle, MsgText)
    mBox.execute()
Your code is missing something. This works without needing an extra try/except clause:
selected_strings = []
try:
    for i in range(oSelected.getCount()):
        oSel = oSelected.getByIndex(i)
        if oSel.getString():
            selected_strings.append(oSel.getString())
except AttributeError:
    # handle exception...
    return
result = " ".join(selected_strings)
To answer your question about the "strange behaviour," it seems pretty straightforward to me: if the 0th element is empty, then there are multiple selections, which may need to be handled differently.
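For illustration only, that observation could be turned into an explicit check; a hedged sketch, assuming oSelected comes from model.getCurrentSelection() as in the question:
# when several separate text ranges are selected, index 0 is an empty placeholder
first_is_empty = not oSelected.getByIndex(0).getString()
multiple_ranges = oSelected.getCount() > 1 and first_is_empty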

How to gather personal information (age, gender, ...) of all the authors of the comments on a specific video, with the Python YouTube API

I'm using the YouTube API with Python. I can already gather all the comments of a specific video, including the names of the authors, the dates, and the content of the comments.
I can also, with a separate piece of code, extract the personal information (age, gender, interests, ...) of a specific author.
But I cannot use them in one place, i.e. I need to gather all the comments of a video, with the names of the comments' authors, and have the personal information of all those authors.
Below is the code that I developed, but I get a 'RequestError' which I don't know how to handle, and I don't know where the problem is.
import gdata.youtube
import gdata.youtube.service

yt_service = gdata.youtube.service.YouTubeService()
f = open('test1.csv', 'w')
f.writelines(['UserName', ',', 'Age', ',', 'Date', ',', 'Comment', '\n'])


def GetAndPrintVideoFeed(string1):
    yt_service = gdata.youtube.service.YouTubeService()
    user_entry = yt_service.GetYouTubeUserEntry(username=string1)
    X = PrintentryEntry(user_entry)
    return X


def PrintentryEntry(entry):
    # print required fields where we know there will be information
    Y = entry.age.text
    return Y


def GetComment(next1):
    yt_service = gdata.youtube.service.YouTubeService()
    nextPageFeed = yt_service.GetYouTubeVideoCommentFeed(next1)
    for comment_entry in nextPageFeed.entry:
        string1 = comment_entry.author[0].name.text.split("/")[-1]
        Z = GetAndPrintVideoFeed(string1)
        string2 = comment_entry.updated.text.split("/")[-1]
        string3 = comment_entry.content.text.split("/")[-1]
        f.writelines([str(string1), ',', Z, ',', string2, ',', string3, '\n'])
    next2 = nextPageFeed.GetNextLink().href
    GetComment(next2)


video_id = '8wxOVn99FTE'
comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)
for comment_entry in comment_feed.entry:
    string1 = comment_entry.author[0].name.text.split("/")[-1]
    Z = GetAndPrintVideoFeed(string1)
    string2 = comment_entry.updated.text.split("/")[-1]
    string3 = comment_entry.content.text.split("/")[-1]
    f.writelines([str(string1), ',', Z, ',', string2, ',', string3, '\n'])
next1 = comment_feed.GetNextLink().href
GetComment(next1)
I think you need a better understanding of the YouTube API and how everything relates together. I've written wrapper classes which can handle multiple types of Feeds or Entries and "fix" gdata's inconsistent parameter conventions.
Here are some snippets showing how the scraping/crawling can be generalized without too much difficulty.
I know this isn't directly answering your question; it's more high-level design, but it's worth thinking about if you're going to be doing a large amount of YouTube/gdata data pulling.
def get_feed(thing=None, feed_type=api.GetYouTubeUserFeed):
    if feed_type == 'user':
        feed = api.GetYouTubeUserFeed(username=thing)
    if feed_type == 'related':
        feed = api.GetYouTubeRelatedFeed(video_id=thing)
    if feed_type == 'comments':
        feed = api.GetYouTubeVideoCommentFeed(video_id=thing)
    feeds = []
    entries = []
    while feed:
        feeds.append(feed)
        feed = api.GetNext(feed)
    [entries.extend(f.entry) for f in feeds]
    return entries

...

def myget(url, service=None):
    def myconverter(x):
        logfile = url.replace('/', ':') + '.log'
        logfile = logfile[len('http://gdata.youtube.com/feeds/api/'):]
        my_logger.info("myget: %s" % url)
        if service == 'user_feed':
            return gdata.youtube.YouTubeUserFeedFromString(x)
        if service == 'comment_feed':
            return gdata.youtube.YouTubeVideoCommentFeedFromString(x)
        if service == 'comment_entry':
            return gdata.youtube.YouTubeVideoCommentEntryFromString(x)
        if service == 'video_feed':
            return gdata.youtube.YouTubeVideoFeedFromString(x)
        if service == 'video_entry':
            return gdata.youtube.YouTubeVideoEntryFromString(x)
    return api.GetWithRetries(url,
        converter=myconverter,
        num_retries=3,
        delay=2,
        backoff=5,
        logger=my_logger
    )

mapper = {}
mapper[api.GetYouTubeUserFeed] = 'user_feed'
mapper[api.GetYouTubeVideoFeed] = 'video_feed'
mapper[api.GetYouTubeVideoCommentFeed] = 'comment_feed'
https://gist.github.com/2303769 data/service.py (routing)
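For instance, a hedged usage sketch of the get_feed helper above (the 'comments' string matches its feed_type branches, api is assumed to be the wrapper's gdata service object, and the video id is the one from the question):
# collect every comment entry for one video across all pages
comment_entries = get_feed('8wxOVn99FTE', feed_type='comments')
for entry in comment_entries:
    print(entry.author[0].name.text)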
