I was recently helped with pulling the scores from a Yahoo NHL page so that each team prints alongside its scores. Here is my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-01-19")
content = url.read()
soup = BeautifulSoup(content)

def yahooscores():
    results = {}
    for table in soup.find_all('table', class_='scores'):
        for row in table.find_all('tr'):
            scores = []
            name = None
            for cell in row.find_all('td', class_='yspscores'):
                link = cell.find('a')
                if link:
                    name = link.text
                elif cell.text.isdigit():
                    scores.append(cell.text)
            if name is not None:
                results[name] = scores

    for name, scores in results.items():
        print('%s: %s' % (name, ', '.join(scores)) + '.')

yahooscores()
Now, first of all: I am wrapping this in a function because I will have to change the URL repeatedly to fetch the scores for every day in January.
The issue is that while I can print the scores and team names fine, what I am trying to accomplish is this:
Ottawa: 1, 1, 2.
Winnipeg: 1, 0, 0.

Pittsburgh: 2, 0, 1.
Philadelphia: 0, 1, 0.
My code doesn't do that. I was in the process of trying to get it to happen, but what complicates things is that the tables all share the same class, "scores", and I can't find anything that distinguishes one from another.
In a nutshell: I want to pair up the two teams in each game and print a blank line between games for organization.
The trouble is, you're putting the results for each team into a dict, but there's no order in a dict, so you lose track of which scores came from which table on the page (i.e. which game).
To get around this, you could just print the results directly instead of storing them, and add an extra newline in the outer for loop:
def yahooscores():
    for table in soup.find_all('table', class_='scores'):
        for row in table.find_all('tr'):
            scores = []
            name = None
            for cell in row.find_all('td', class_='yspscores'):
                link = cell.find('a')
                if link:
                    name = link.text
                elif cell.text.isdigit():
                    scores.append(cell.text)
            if name is not None:
                print('%s: %s' % (name, ', '.join(scores)) + '.')
        print()

yahooscores()
Or, if you want to store the scores and show them later, you can store the teams for each game as well and use them to group the results:
def yahooscores():
    results = {}
    games = []
    for table in soup.find_all('table', class_='scores'):
        teams = []
        for row in table.find_all('tr'):
            scores = []
            name = None
            for cell in row.find_all('td', class_='yspscores'):
                link = cell.find('a')
                if link:
                    name = link.text
                elif cell.text.isdigit():
                    scores.append(cell.text)
            if name is not None:
                results[name] = scores
                teams.append(name)
        games.append(teams)

    for teams in games:
        for name in teams:
            scores = results[name]
            print('%s: %s' % (name, ', '.join(scores)) + '.')
        print()

yahooscores()
The problem is that you're treating the table as a flat list of teams, rather than as a list of games, each of which has two teams in it.
The clean way to fix that is to change the way you parse the page so you loop over the games, then, for each game, store something like a pair of names-and-scores.
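For example, a minimal sketch of that per-game approach (reusing the same soup object and class names as in the question; the nested list structure is just one possible choice) might look like this:

def yahooscores():
    games = []  # one entry per table/game: a list of (team, scores) pairs
    for table in soup.find_all('table', class_='scores'):
        game = []
        for row in table.find_all('tr'):
            scores = []
            name = None
            for cell in row.find_all('td', class_='yspscores'):
                link = cell.find('a')
                if link:
                    name = link.text
                elif cell.text.isdigit():
                    scores.append(cell.text)
            if name is not None:
                game.append((name, scores))
        games.append(game)

    # each game now keeps its two teams together, in page order
    for game in games:
        for name, scores in game:
            print('%s: %s' % (name, ', '.join(scores)) + '.')
        print()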
But there's also a quick-and-dirty solution: if you keep the teams in order, you can pair them up after the fact. A plain dict has no inherent order, but an OrderedDict preserves insertion order. So, add import collections and change results = {} to results = collections.OrderedDict().
(Although if the only thing you ever do with this dict is iterate its items(), I'm not sure why you want a dictionary at all. Just do results = [], replace results[name] = scores with results.append((name, scores)), and then iterate over results instead of results.items().)
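As a tiny sketch of that list-based variant, using made-up score data in place of the parsed page:

# stand-in for the (team, scores) pairs appended during parsing, in page order
results = [
    ('Ottawa', ['1', '1', '2']),
    ('Winnipeg', ['1', '0', '0']),
    ('Pittsburgh', ['2', '0', '1']),
    ('Philadelphia', ['0', '1', '0']),
]

for name, scores in results:
    print('%s: %s' % (name, ', '.join(scores)) + '.')

With a plain list like this, the pairing helper below would be called as pairs(results) instead of pairs(results.items()).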
And now, if you want to print them out in pairs… well, you can make an iterator over pairs from any iterable very easily. For example:
def pairs(iterable):
    return zip(*[iter(iterable)] * 2)

for (name1, scores1), (name2, scores2) in pairs(results.items()):
    print('%s: %s' % (name1, ', '.join(scores1)) + '.')
    print('%s: %s' % (name2, ', '.join(scores2)) + '.')
    print()
Or, if you can't figure out what that means, something hacky like this works fine too:
pair_done = False
for name, scores in results.items():
    print('%s: %s' % (name, ', '.join(scores)) + '.')
    if pair_done:
        print()
    pair_done = not pair_done
… or:
for i, (name, scores) in enumerate(results.items()):
    print('%s: %s' % (name, ', '.join(scores)) + '.')
    if i % 2:
        print()
I'm building a web scraper and I'm able to print all the data I need, but I'm struggling to add the data to my CSV file. I feel like I need to add another for loop or even a function. Currently I can get it to print one row of scraped data values, but it skips the 64 other rows.
So far I've tried to put in another for loop and break each variable into its own function, but that just breaks my code. Here's what I have so far; I feel like I'm just missing something.
#Gets listing box
listingBox = searchGrid.find_elements(By.CLASS_NAME, 'v2-listing-card')

#Loops through each listing box
for listingBoxes in listingBox:
    listingUrl = []
    listingImg = []
    listingTitle = []
    listingPrice = []

    #Gets listing url
    listingUrl = listingBoxes.find_element(By.CSS_SELECTOR, 'a.listing-link')
    print("LISTING URL:", listingUrl.get_attribute('href'))

    #Gets listing image
    listingImg = listingBoxes.find_element(By.CSS_SELECTOR, 'img.wt-position-absolute')
    print("IMAGE:", listingImg.get_attribute('src'))

    #Gets listing title
    listingTitle = listingBoxes.find_element(By.CLASS_NAME, 'wt-text-caption')
    print("TITLE:", listingTitle.text)

    #Gets price
    listingPrice = listingBoxes.find_element(By.CLASS_NAME, 'currency-value')
    print("ITEM PRICE: $", listingPrice.get_attribute("innerHTML"))

    #Gets seller name
    # listingSellerName = listingBoxes.find_element(By.XPATH, '/html/body/main/div/div[1]/div/div[3]/div[8]/div[2]/div[10]/div[1]/div/div/ol/li/div/div/a[1]/div[2]/div[2]/span[3]')
    # print("SELLER NAME:", listingSellerName.get_attribute("innerHTML"))

    print("---------------")

finally:
    driver.quit()

data = {'Listing URL': listingUrl, 'Listing Thumbnail': listingImg, 'Listing Title': listingTitle, 'Listing Price': listingPrice}
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()
df.to_csv('raw_data.csv')
print('Data has been scrapped and added.')
In your code, each loop iteration resets the lists listingUrl, listingImg, etc., which is why df contains only one row of scraped data, corresponding to the last iteration. If you want to add elements to a list, you have to define the list BEFORE the loop and then use the .append() method inside the loop.
Then, instead of listingUrl.get_attribute('href') you write listingUrl[-1].get_attribute('href'), where [-1] takes the last element of the list.
listingUrl = []
listingImg = []
listingTitle = []
listingPrice = []

for listingBoxes in listingBox:
    #Gets listing url
    listingUrl.append(listingBoxes.find_element(By.CSS_SELECTOR, 'a.listing-link'))
    print("LISTING URL:", listingUrl[-1].get_attribute('href'))

    #Gets listing image
    listingImg.append(listingBoxes.find_element(By.CSS_SELECTOR, 'img.wt-position-absolute'))
    print("IMAGE:", listingImg[-1].get_attribute('src'))

    #Gets listing title
    listingTitle.append(listingBoxes.find_element(By.CLASS_NAME, 'wt-text-caption'))
    print("TITLE:", listingTitle[-1].text)

    #Gets price
    listingPrice.append(listingBoxes.find_element(By.CLASS_NAME, 'currency-value'))
    print("ITEM PRICE: $", listingPrice[-1].get_attribute("innerHTML"))
I'm trying to web scrape UFC fighter stats based on user input, using Beautiful Soup and pandas. The idea is that the user input is matched to a fighter's first and last name, and their stats are returned. Ideally I'd also like to add the option of specifying which particular stat is wanted in a separate input. I've been able to pull the HTML table headers successfully, but I don't know how to assign values to them that correspond to the matching fighter name, or how to return the associated value. Currently I split the fighter-name input into first and last name, but I don't know how to match them against the table data or return the corresponding row. The data being returned at the moment is just the first line of results (fighter 'Tom Aaron'); no lookup or matching is carried out. Are nested dictionaries the way to go? Any advice is greatly appreciated; this is my first Python project, so the code is probably all over the place.
("Which fight do you want information on?"
input - Forrest Griffin
"What information do you want?:
input - Wins
"Forrest Griffin has won 8 times"
from bs4 import BeautifulSoup
import requests
import pandas as pd
website = "http://ufcstats.com/statistics/fighters?char=a&page=all"
response = requests.get(website)
response
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find('table', {'class' : 'b-statistics__table'}).find('tbody').find_all('tr')
len(results)
#print(results)
row = soup.find('tr')
print(row.get_text())
###attempting to split the table headers and assign values
table = soup.find('table', {'class' : 'b-statistics__table'}).find('thead').find_all('tr')
#df=pd.read_html(str(table))[0]
#print(df.set_index(0).to_dict('dict'))
#firstname
first_name = str(results[0].find_all('td')[0].get_text())
print(first_name)
def first_names():
    for names in first_name:
        print(names)
    return
#first_names()
last_name = results[1].find_all('td')[1].get_text()
print(last_name)
alias = results[1].find_all('td')[2].get_text()
if len(alias) == 0:
    print("n/a")
else:
    print(alias)
height = results[1].find_all('td')[3].get_text()
print(height)
weight = results[1].find_all('td')[4].get_text()
#print(weight)
wins = results[1].find_all('td')[7].get_text()
losses = results[1].find_all('td')[8].get_text()
draws = results[1].find_all('td')[9].get_text()
###split user input into list of first + second name
x = input("Which fighter do you want to know about?")
####print(str(first_name) + " " + str(last_name) + " has " + str(wins) + " wins, " + str(losses) + " losses and " + str(draws) + ".")
y = input("What do you want to know about?")
###if user input first name is in results row 1(Tom Aarons row) - still need to search through all result names
if x.split()[0] in str(results[1].find_all('td')[0].get_text()) and x.split()[1] in str(results[1].find_all('td')[1].get_text()) and y == "wins":
    print(first_name + " " + last_name + " has won " + wins + " times.")
if x.split()[1] in str(results[1].find_all('td')[1].get_text()):
    print("ok")
else:
    print('fail')
###Tom Test
print(x.split()[0])
###if input[1] = first_name and input2[2] == second_name:
if x.split()[1] == first_name:
    print(x.split()[1])
if x.split()[0] in results[1] and x.split()[1] in results[1]:
    print('wins')
else:
    print("Who")
print(str(results[1].find_all('td')[0].get_text()))
I am using the following line of code to execute and print data from my SQL database. For some reason, that is the only command that works for me.
json_string = json.dumps(location_query_1)
My question is that when I print json_string it shows data in the following format:
Actions.py code:
class FindByLocation(Action):
    def name(self) -> Text:
        return "action_find_by_location"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        global flag
        location = tracker.get_slot("location")
        price = tracker.get_slot("price")
        cuisine = tracker.get_slot("cuisine")
        print("In find by Location")
        print(location)
        location_query = "SELECT Name FROM Restaurant WHERE Location = '%s' LIMIT 5" % location
        location_count_query = "SELECT COUNT(Name) FROM Restaurant WHERE Location = '%s'" % location
        location_query_1 = getData(location_query)
        location_count_query_1 = getData(location_count_query)
        if not location_query_1:
            flag = 1
            sublocation_view_query = "CREATE VIEW SublocationView AS SELECT RestaurantID, Name, PhoneNumber, Rating, PriceRange, Location, Sublocation FROM Restaurant WHERE Sublocation = '%s'" % (location)
            sublocation_view = getData(sublocation_view_query)
            dispatcher.utter_message(text="یہ جگہ کس ایریا میں ہے")  # Urdu: "Which area is this place in?"
        else:
            flag = 0
            if cuisine is None and price is None:
                json_string = json.dumps(location_query_1)
                print(isinstance(json_string, str))
                print("Check here")
                list_a = json_string.split(',')
                remove = ["'", '"', '[', ']']
                for i in remove:
                    list_a = [s.replace(i, '') for s in list_a]
                dispatcher.utter_message(text="Restaurants in Location only: ")
                dispatcher.utter_message(list_a)
What should I do so that the data is shown in a vertical list format (one item per line) and without the brackets and quotation marks? Thank you.
First of all, have you tried reading your data into a pandas object? I have done some programs with a sqlite database and this worked for me:
df = pd.read_sql_query("SELECT * FROM {}".format(self.tablename), conn)
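For context, here is a minimal sketch of that pandas route; the sqlite3 connection, file name and query are only placeholders, not taken from the question:

import sqlite3
import pandas as pd

conn = sqlite3.connect("restaurants.db")  # placeholder database file
df = pd.read_sql_query("SELECT Name FROM Restaurant WHERE Location = ?", conn,
                       params=("Karachi",))  # placeholder location value
print(df.to_string(index=False))  # one name per line, no brackets or quotes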
But now to the string formatting part:
# this code should do the work for you
# first of all we have our string a like yours
a="[['hallo'],['welt'],['kannst'],['du'],['mich'],['hoeren?']]"
# now we split the string into a list on every ,
list_a=a.split(',')
# this is our list with chars we want to remove
remove=["'",'"','[',']']
# now we replace all elements step by step with nothing
for i in remove:
list_a=[s.replace(i, '') for s in list_a]
print(list_a)
for z in list_a:
print(z)
The output is then:
['hallo', 'welt', 'kannst', 'du', 'mich', 'hoeren?']
hallo
welt
kannst
du
mich
hoeren?
I hope I could help.
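One more thought, beyond what was asked: since location_query_1 is presumably already a Python list before json.dumps is called, you may be able to skip the string surgery entirely and join the values with newlines. A hedged sketch, assuming getData() returns rows shaped like the example above:

rows = [['hallo'], ['welt'], ['kannst']]  # stand-in for whatever getData() returns

# flatten one level and join with newlines to get a vertical list
names = [col for row in rows for col in row]
print("\n".join(names))
# in the Rasa action, this string could be passed to dispatcher.utter_message(text=...)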
I am currently trying to download a large number of NY Times articles using their API, with Python 2.7. To do so, I was able to reuse a piece of code I found online:
from nytimesarticle import articleAPI
api = articleAPI('...')

articles = api.search(q = 'Brazil',
    fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']},
    begin_date = '20090101')
def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['id'] = i['_id']
        if i['abstract'] is not None:
            dic['abstract'] = i['abstract'].encode("utf8")
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['desk'] = i['news_desk']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.
        dic['section'] = i['section_name']
        if i['snippet'] is not None:
            dic['snippet'] = i['snippet'].encode("utf8")
        dic['source'] = i['source']
        dic['type'] = i['type_of_material']
        dic['url'] = i['web_url']
        dic['word_count'] = i['word_count']
        # locations
        locations = []
        for x in range(0, len(i['keywords'])):
            if 'glocations' in i['keywords'][x]['name']:
                locations.append(i['keywords'][x]['value'])
        dic['locations'] = locations
        # subject
        subjects = []
        for x in range(0, len(i['keywords'])):
            if 'subject' in i['keywords'][x]['name']:
                subjects.append(i['keywords'][x]['value'])
        dic['subjects'] = subjects
        news.append(dic)
    return(news)
def get_articles(date, query):
    '''
    This function accepts a year in string format (e.g. '1980')
    and a query (e.g. 'Amnesty International') and it will
    return a list of parsed articles (in dictionaries)
    for that year.
    '''
    all_articles = []
    for i in range(0, 100):  # NYT limits pager to first 100 pages. But rarely will you find over 100 pages of results anyway.
        articles = api.search(q = query,
            fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']},
            begin_date = date + '0101',
            end_date = date + '1231',
            page = str(i))
        articles = parse_articles(articles)
        all_articles = all_articles + articles
    return(all_articles)
Download_all = []
for i in range(2009, 2010):
    print 'Processing ' + str(i) + '...'
    Amnesty_year = get_articles(str(i), 'Brazil')
    Download_all = Download_all + Amnesty_year
import csv
keys = Download_all[0].keys()
with open('brazil-mentions.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(Download_all)
Without the last bit (starting with import csv) this seems to work fine. If I simply print my results (print Download_all) I can see them, although in a very unstructured way. Running the actual code, however, I get this message:
File "C:\Users\xxx.yyy\AppData\Local\Continuum\Anaconda2\lib\csv.py", line 148, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'abstract'
Since I am quite a newbie at this, I would highly appreciate your help in guiding me on how to download the news articles into a CSV file in a structured way.
Thanks a lot in advance!
Best regards
Where you have:
keys = Download_all[0].keys()
This takes the column headers for the CSV from the dictionary for the first article. The problem is that the article dictionaries do not all have the same keys, so when you reach the first one that has the extra abstract key, it fails.
It looks like you'll have problems with abstract and snippet which are only added to the dictionary if they exist in the response.
You need to make keys equal to the superset of all possible keys:
keys = Download_all[0].keys() + ['abstract', 'snippet']
Or, ensure that every dict has a value for every field:
def parse_articles(articles):
    ...
    if i['abstract'] is not None:
        dic['abstract'] = i['abstract'].encode("utf8")
    else:
        dic['abstract'] = ""
    ...
    if i['snippet'] is not None:
        dic['snippet'] = i['snippet'].encode("utf8")
    else:
        dic['snippet'] = ""
I am trying to scrape all the different variations of this webpage. For instance, the code that scrapes this webpage
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
should be the same as the code I use to scrape this webpage:
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    print list
    #Street=list.pop(0)
    #CityStateZip=list.pop(0)
    #Phone=list.pop(0)
    #City,StateZip= CityStateZip.split(',')
    #State,Zip= StateZip.split(' ')
    #ContactName = Contact.findAll('b')[1]
    #ContactEmail = Contact.findAll('a')[1]
    #Body=tbl.findAll('p')[1]
    #Website = Contact.findAll('a')[2]
    #Email = ContactEmail.text.strip()
    #ContactName = ContactName.text.strip()
    #Website = Website.text.strip()
    #Body = Body.text
    #Body = re.sub(r'[\n\r\t\xa0]','',Body).strip()
    #list.extend([Street,City,State,Zip,ContactName,Phone,Email,Website,Body])
    return list
The way I believe I need to write the code for it to work is to set it up so that print list returns the same number of values, ordered identically. Currently, the above script returns these values:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
[u'Alexandria,VA 22305']
Accounting for missing values, in order to parse this page consistently, I need the print list command to return something similar to this:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
['', u'Alexandria,VA 22305', '']
That way I will be able to manipulate values by position (as they will be in a consistent order). The problem is that I don't know how to accomplish this, as I am still very new to parsing. If anybody has any insight into how to solve the problem, I would be highly appreciative.
def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    Street = [s for s in list if ',' not in s and '-' not in s]
    CityStateZip = [s for s in list if ',' in s]
    Phone = [s for s in list if '-' in s]
    if not Street:
        Street = ''
    else:
        Street = Street[0]
    if not CityStateZip:
        CityStateZip = ''
    else:
        City, StateZip = CityStateZip[0].split(',')
        State, Zip = StateZip.split(' ')
    if not Phone:
        Phone = ''
    else:
        Phone = Phone[0]
    list = []
I figured out an alternative solution using substrings and if statements. Since there are at most three values in the list, each with defining characteristics, I realized that I could identify each value by looking for its special characters rather than relying on its position in the record.
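To round that off, here is a small self-contained sketch of that classification step; the helper name, the three-slot output shape and the empty-string defaults are my own choices, mirroring the desired output shown in the question:

def normalize_contact(values):
    """Return [street, 'City,State Zip', phone], using '' for any missing piece."""
    street = next((s for s in values if ',' not in s and '-' not in s), '')
    city_state_zip = next((s for s in values if ',' in s), '')
    phone = next((s for s in values if '-' in s), '')
    return [street, city_state_zip, phone]

print(normalize_contact([u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']))
# ['2133 Craigs Store Road', 'Afton,VA 22920', '434-882-3150']
print(normalize_contact([u'Alexandria,VA 22305']))
# ['', 'Alexandria,VA 22305', '']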