I am unable to convert the data retrieved with bs4 into a meaningful CSV file. It only writes the last set of data out of everything that is actually retrieved.
#Beautiful Soup or bs4 is a package I will be using to allow me to parse the HTML data which I will be retrieving from a website.
#parsing is the conversion of the raw HTML text into a structured form which the program can navigate and a human can understand.
#(Converting data from one format to another) with bs4
from bs4 import BeautifulSoup
#requests is an HTTP library which allows me to send requests to websites and retrieve data using Python. This is helpful as
#the website is written in a different language, so it allows me to retrieve what I want and read it as well.
import requests
#import writer
url= "https://myanimelist.net/anime/season"
#requesting to get data using 'requests' and gain access as well.
#have to check the response before moving forward to ensure there is no problem retrieving data.
page= requests.get(url)
#print(page)
#<Response [200]> response was "200" meaning "Successful responses"
soup = BeautifulSoup(page.content, 'html.parser')
#here I parse the page content that was retrieved.
#for this to identify the HTML and determine what we will be producing (retrieving data) for each item on the page, we had to
#find the parent element which contains all the info we need to make our data categories.
lists = soup.select("[data-genre]")
#we add _ after class to make class_ because without the underscore the program identifies it as the Python keyword
#when really it is a CSS class
all_data = []
#must create a loop to find the titles separately as there are a lot that will come up
for list in lists:
    #identify and find the classes which include the title of the shows, show ratings, members watching, and release dates
    #added .text.replace in order to get rid of the "\n" spacing which was in the HTML
    title = list.find("a", class_="link-title").text.replace("\n", "")
    rating = list.find("div", class_="score").text.replace("\n", "")
    members = list.find("div", class_="scormem-item member").text.replace("\n", "")
    release_date = list.find("span", class_="item").text.replace("\n", "")
    all_data.append(
        [title.strip(), rating.strip(), members.strip(), release_date.strip()]
    )
print(*all_data, sep="\n")
#testing for errors and making sure locations are correct to withdraw/request the data
#this allows us to create and close a csv file. using 'w' to allow writing
from csv import writer
#defining the header row for the csv file I will be writing
#organizing chart
header=['Title', 'Show Rating', 'Members', 'Release Date']
info= [title.strip(), rating.strip(), members.strip(), release_date.strip()]
with open('shows.csv', 'w', encoding='utf8', newline='') as f:
    #will write onto our file 'f'
    writing = writer(f)
    #use our writer to write a row in the file
    writing.writerow(header)
    writing.writerow(info)
I tried to change the definition of list, but to no avail. This is currently what I get, even though it should be much longer.
Instead of writing just the last line with
writing.writerow(info)
you need to write all of the lines:
writing.writerows(all_data)
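Putting the whole writing step together, a minimal sketch reusing the header and all_data from your script:

from csv import writer

header = ['Title', 'Show Rating', 'Members', 'Release Date']

with open('shows.csv', 'w', encoding='utf8', newline='') as f:
    writing = writer(f)
    writing.writerow(header)     # one header row
    writing.writerows(all_data)  # then every scraped row at once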
I have a table, found below and stored as "table". It contains the following:
http://pastebin.com/aBFLpU4U
My code captures the correct information, but I need to know how to get each piece of the information into its own variable. I appreciate any help with this; I have only been playing with BeautifulSoup for a week, so forgive me. I have looked all over Stack Overflow and haven't found an answer that works for me.
This is the output I see: http://pastebin.com/fiYQvBix
import sys, locale, os, re, urllib2
import lxml.etree, requests
from bs4 import BeautifulSoup as bSoup
# Website that we are scraping:
BASE_URL = 'https://www.biddergy.com/detail.asp?id='
#ID = raw_input("Enter listing #: ")
ID = str(330998) # defined constant for debugging
# Store response in soup:
response = requests.get(BASE_URL+ID)
soup = bSoup(response.text)
# Find auction info <table>
table = soup.find('table', cellpadding="2")
#### Everything above this line works great ####
for row in table.find_all('tr'):
    for col in row.find_all("td"):
        print(col.string)
Well, I have figured it out.
data = []
for row in table.find_all('tr'):
    for cols in row.find_all('td', text=True):
        for col in cols:
            data.append(col.strip())
Then data can be extracted from the data[] list and saved into the respective variables.
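For example, if the cells always come back in the same order, you can unpack the list into named variables (the names below are hypothetical; match them to your table's actual columns):

# hypothetical column names for illustration -- adjust to your table's layout
item_title = data[0]
current_bid = data[1]
time_left = data[2]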
Thanks to all who have read my question!
I am trying to pull information from an HTML file of the link http://dl.acm.org/results.cfm?CFID=376026650&CFTOKEN=88529867. For every paper title, I need the authors, journal name, and abstract, but I am getting repetitive versions of each on their own before getting them together. That is, I first get a list of all the titles, then all the authors, then all the journals, then all the abstracts, and only then do I get them grouped per title (title first, then the respective authors, journal name, and abstract). I only need them together, not individually. Please help.
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests
import re
f = open('acmpage.html', 'r') #open html file stored locally
html = f.read() #read from the html file and store the content in 'html'
soup = BeautifulSoup(html)
pret = soup.prettify()
soup1 = BeautifulSoup(pret)
for content in soup1.find_all("table"):
    soup2 = BeautifulSoup(str(content))
    pret2 = soup2.prettify()
    soup3 = BeautifulSoup(pret2)
    for titles in soup3.find_all('a', target = '_self'): #to print title
        print "Title: ",
        print titles.get_text()
    for auth in soup3.find_all('div', class_ = 'authors'): #to print authors
        print "Authors: ",
        print auth.get_text()
    for journ in soup3.find_all('div', class_ = 'addinfo'): #to print name of journal
        print "Journal: ",
        print journ.get_text()
    for abs in soup3.find_all('div', class_ = 'abstract2'): # to print abstract
        print "Abstract: ",
        print abs.get_text()
You are searching for each list of information separately, so there is little question as to why you see each type of information listed separately.
Your code is also full of redundancies: you only need to import one version of BeautifulSoup (the first import is shadowed by the second), and you don't need to re-parse the elements two times either. You also import two different URL-loading libraries, then ignore both by loading a local file instead.
Search for the table rows containing the title information instead, then per table row, parse out the information contained.
For this page, with its more complex (and frankly, disorganized) layout with multiple tables, it'd be easiest just to go up to the table row per title link found:
from bs4 import BeautifulSoup
import requests
resp = requests.get('http://dl.acm.org/results.cfm',
                    params={'CFID': '376026650', 'CFTOKEN': '88529867'})
soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

for title_link in soup.find_all('a', target='_self'):
    # find the parent row to base the rest of the search on
    row = next(p for p in title_link.parents if p.name == 'tr')
    title = title_link.get_text()
    authors = row.find('div', class_='authors').get_text()
    journal = row.find('div', class_='addinfo').get_text()
    abstract = row.find('div', class_='abstract2').get_text()
The next() call loops over a generator expression that goes over each parent of the title link until a <tr> element is found.
Now you have all the information grouped per title.
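For example, to verify the grouping you can print each record as one unit at the end of the loop body above:

    print(title)
    print(authors)
    print(journal)
    print(abstract)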
You need to find the first addinfo div, then creep forward to find the publisher in a div further on in the document. You will need to go up the tree to the enclosing <tr>, then get the next sibling <tr>. Then search inside that row for the next data item (the publisher).
Once you have done this for all the items you need to display, issue a single print command for all the items you've found.
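A minimal sketch of that traversal, assuming the publisher div carries a 'publisher' class (an assumption; check the actual markup):

addinfo = soup.find('div', class_='addinfo')
# climb up to the enclosing <tr>, then step to the next row
row = next(p for p in addinfo.parents if p.name == 'tr')
next_row = row.find_next_sibling('tr')
# 'publisher' is an assumed class name for illustration
publisher = next_row.find('div', class_='publisher').get_text()
print(addinfo.get_text(), publisher)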
I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. I am able to get the source code of the HTML page, but I need to draw specific numbers from that page. For instance, the web page looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program where all I would have to do is type in 'bigdrizzle13' and it outputs those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously-commented program. It could use a lot more error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys

URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)

# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id':'mini_player'})

# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]

# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))
    # Skip the first td, which is an image
    data = data[1:]
    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
You can use Beautiful Soup to parse the HTML.
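For example, a minimal sketch that dumps the text of every table row on the page; narrow the search once you know which table holds the scores:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

URL = 'http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13'
soup = BeautifulSoup(urlopen(URL).read())
# print the text of every cell, row by row
for row in soup.findAll('tr'):
    print [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]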
Sorry if you feel this has been asked before, but I have read the related questions and, being quite new to Python, I could not find how to write this request in a clean manner.
For now I have this minimal Python code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2
br = Browser()
br.open("http://www.atpworldtour.com/Rankings/Singles.aspx")
filename = "rankings.html"
FILE = open(filename,"w")
html = br.response().read();
soup = BeautifulSoup(html);
links = soup.findAll('a', href=re.compile("Players"));
for link in links:
    print link['href'];
FILE.writelines(html);
It retrieves all the links where the href contains the word 'Players'.
Now the HTML I need to parse looks something like this:
<tr>
<td>1</td>
<td><a href="/Tennis/Players/Top-Players/Roger-Federer.aspx">Federer, Roger (SUI)</a></td>
<td>10,550</td>
<td>0</td>
<td>19</td>
</tr>
The 1 is the rank of the player.
I would like to be able to retrieve this data in a dictionary:
rank
name of the player
link to the detailed page (here /Tennis/Players/Top-Players/Roger-Federer.aspx)
Could you give me some pointers, or if this is easy enough, help me build the piece of code? I am not sure how to formulate the request in Beautiful Soup.
Anthony
Searching for the players using your method will work, but it will return three results per player. It's easier to search for the table itself and then iterate over the rows (except the header):
table = soup.find('table', 'bioTableAlt')
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    #retrieve data from cells...
To get the data you need:
rank = cells[0].string
player = cells[1].a.string
link = cells[1].a['href']
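Putting it together, a minimal sketch that collects each row into the dictionary shape you asked for:

players = []
table = soup.find('table', 'bioTableAlt')
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    players.append({
        'rank': cells[0].string,
        'name': cells[1].a.string,
        'link': cells[1].a['href'],
    })
print players[0]
# e.g. {'rank': u'1', 'name': u'Federer, Roger (SUI)',
#       'link': u'/Tennis/Players/Top-Players/Roger-Federer.aspx'}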