Just started learning python (3.8), building a scraper to get some football stats. Here's the code so far.
I originally wanted to pull a div with id = 'div_alphabet' which is clearly in the html tree on the website, but for some reason bs4 wasn't pulling it in. I investigated further and noticed that when I pull in the parent div 'all_alphabet' and then look for all child divs, 'div_alphabet' is missing. The only thing weird about the html structure is the long block comment that sits right above 'div_alphabet'. Is this a potential issue?
https://www.pro-football-reference.com/players
import requests
from bs4 import BeautifulSoup

# Base URL of the stats site; all scraped hrefs are relative to it.
URL = 'https://www.pro-football-reference.com/'

# Fetch the homepage and locate the "Players" navigation link.
homepage = requests.get(URL)
home_soup = BeautifulSoup(homepage.content, 'html.parser')
players_nav_URL = home_soup.find(id='header_players').a['href']

# Follow the nav link to the players directory page.
players_directory_page = requests.get(URL + players_nav_URL)
players_directory_soup = BeautifulSoup(players_directory_page.content, 'html.parser')

# NOTE(review): 'div_alphabet' appears to sit inside an HTML comment in the
# page source, so html.parser will not expose it as an element — only the
# wrapper 'all_alphabet' is visible. TODO confirm against the live page.
alphabet_nav = players_directory_soup.find(id='all_alphabet')
all_letters = alphabet_nav.find_all('div')
print(all_letters)

# Run the CSS selector once and derive both lists from the same anchors,
# instead of traversing the document twice with an identical query.
anchors = players_directory_soup.select('ul.page_index li div a')
links = [a['href'] for a in anchors]
names = [a.get_text() for a in anchors]
This gives you two lists: the relative links of the alphabetised players, and their names.
I wouldn't concern yourself with the div_alphabet it doesn't have any useful information.
Here we are selecting the ul tag with class "page_index". But you'll get a list, so we need to do a for loop and grab the href attribute. The get_text() also gives you the names.
If you haven't come across list comprehensions then this would also be acceptable.
# Non-comprehension version: walk the matching anchors and collect the
# href attribute and the link text into two separate lists.
links = []
for anchor in players_directory_soup.select('ul.page_index li div a'):
    links.append(anchor['href'])

names = []
for anchor in players_directory_soup.select('ul.page_index li div a'):
    names.append(anchor.get_text())
Something like this code will do it:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site does not block the request.
headers = {'User-Agent': 'Mozilla/5.0 '}
resp = requests.get('https://www.pro-football-reference.com/players/', headers=headers)
page_soup = BeautifulSoup(resp.text, 'lxml')

# Each div under ul.page_index groups the anchors for one letter.
letter_divs = page_soup.select('ul.page_index li div')
for letter_div in letter_divs:
    # Print every href in the group, one per line (each f-string ends in \n).
    print(*[f'{a.get("href")}\n' for a in letter_div.select('a')])
A more useful way to do this is to make a DataFrame with pandas of it and save it as a csv or something:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the players directory with a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://www.pro-football-reference.com/players/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# One [name, absolute-url] pair per anchor in the alphabetical index.
players = [
    [link.get_text(strip=True),
     'https://www.pro-football-reference.com' + link.get('href')]
    for link in soup.select('ul.page_index li div a')
]
print(players[0])

# Persist the whole directory as a CSV for later use.
df = pd.DataFrame(players, columns=['Player name', 'Url'])
print(df.head())
df.to_csv('players.csv', index=False)
Related
I'm trying to grab all prices from a website, using the xpath. all prices have the same xpath, and only [0], or I assume the 1st item works... let me show you:
# Fetch the page, round-trip it through BeautifulSoup, and hand the
# serialized HTML to lxml so we can query it with an absolute XPath.
response = requests.get(URL, headers=HEADERS)
parsed = BeautifulSoup(response.content, "html.parser")
dom_tree = etree.HTML(str(parsed))
# Only the first match is printed; the path pins one specific <li>.
matches = dom_tree.xpath('/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li[1]/article/div[1]/div[2]/div')
print(matches[0].text)
This successfully prints the 1st price!!!
I tried changing "[0].text" to 1, to print the 2nd item but it returned "out of range".
Then I was trying to think of some For loop that would print All Items, so I could create an average.
Any help would be Greatly appreciated!!!
I apologize — I have now edited the code in below.
from bs4 import BeautifulSoup
from lxml import etree
import requests

URL = "https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709"
# BUG FIX: HEADERS was referenced below but its definition was commented
# out, so the script died with a NameError before making the request.
# Define a browser-like User-Agent (replace with your own if needed).
HEADERS = {'User-Agent': 'Mozilla/5.0'}

webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
# Absolute XPath to one specific price node; [0] is the only match because
# the path pins a single li[3] element.
print(dom.xpath('/html/body/div[10]/div[4]/section/div/div/div[2]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/ul/li[3]/strong')[0].text)
You could just use css selectors which, in this instance, are a lot more readable. I would also remove some of the offers info to leave just the actual price.
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

# Fetch the GPU listing page with a browser-like User-Agent.
resp = requests.get("https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709",
                    headers={'User-Agent': 'Mozilla/5.0'})
page = bs(resp.text, features="lxml")

prices = {}
for container in page.select('.item-container'):
    # Drop the offers node (if present and non-empty) so only the actual
    # price text remains when we read .price-current below.
    offers = container.select_one('.price-current-num')
    if offers:
        offers.decompose()
    title = container.select_one('.item-title').text
    # [:-1] trims one trailing character from the price text — presumably a
    # leftover separator after removing the offers node; matches original.
    prices[title] = container.select_one('.price-current').get_text(strip=True)[:-1]
pprint(prices)
prices as list of floats
import requests, re
from bs4 import BeautifulSoup as bs
from pprint import pprint

# Fetch the GPU listing page with a browser-like User-Agent.
r = requests.get("https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709",
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.text, features="lxml")

prices = []
for i in soup.select('.item-container'):
    # Remove the offers-count node so only the price itself is parsed.
    if a := i.select_one('.price-current-num'):
        a.decompose()
    # BUG FIX: use a raw string for the regex — '\$' inside a normal string
    # is an invalid escape sequence (DeprecationWarning now, an error in
    # future Python versions). Strip '$' and thousands separators, then
    # convert to float.
    text = i.select_one('.price-current').get_text(strip=True)[:-1]
    prices.append(float(re.sub(r'\$|,', '', text)))
pprint(prices)
I'm trying to scrape the urls of the ads on "Marktplaats" website (link is provided below).
As you can see I'm looking for 30 URLs. These URLs are placed inside a 'href' field and all start with "/a/auto-s/". Unfortunately, I only keep getting the first few URLs. I found out that on this sites all the data is places within "<li class = "mp-Listing mp-Listing--list-item"> ... </li>". Does anyone have an idea how to fix it? (you can see that you won't find all the URLs of the ads when you run my code)
Link:
https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true
My code:
import requests
from bs4 import BeautifulSoup

url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"

# Download the listing page and collect every ad container element.
page_response = requests.get(url)
listing_soup = BeautifulSoup(page_response.text, 'html.parser')
listing_items = listing_soup.find_all(class_='mp-Listing mp-Listing--list-item')
print(listing_items)
You can try something like this:
import requests
from bs4 import BeautifulSoup


def parse_links(url):
    """Return the href of the first anchor inside every listing item on the page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [li.a.get('href')
            for li in soup.find_all(class_="mp-Listing mp-Listing--list-item")]


url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
Output
/a/auto-s/oldtimers/a1302359148-allis-chalmers-ed40-1965.html
/a/auto-s/bestelauto-s/a1258166221-opel-movano-2-3-cdti-96kw-2018.html
/a/auto-s/oldtimers/a1302359184-chevrolet-biscayne-bel-air-1960.html
/a/auto-s/renault/a1240974413-ruim-aanbod-rolstoelauto-s-www-autoland-nl.html
/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
/a/auto-s/bestelauto-s/a1212708374-70-x-kleine-bestelwagens-lage-km-scherpe-prijzen.html
/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
/a/auto-s/bestelauto-s/a1295698562-35-x-renault-kangoo-2013-t-m-2015-v-a-25000-km.html
/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
/a/auto-s/dodge/a1279463634-5-jaar-ram-dealer-garantie-lage-bijtelling.html
/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
/a/auto-s/bestelauto-s/a1299134880-citroen-berlingo-1-6-hdi-2017-airco-sd-3-zits-v-a-179-p-m.html
/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
/a/auto-s/vrachtwagens/m1458101965-scania-torpedo.html
/a/auto-s/toyota/m1458101624-toyota-yaris-1-0-12v-vvt-i-aspiration-5dr-2012.html
/a/auto-s/dodge/a1279447576-5-jaar-ram-dealer-garantie-en-historie-bekijk-onze-website.html
You can also build the actual url of the page by appending 'https://www.marktplaats.nl' to li.a.get('href'). So, your whole code should look like this:
import requests
from bs4 import BeautifulSoup


def parse_links(url):
    """Return absolute ad URLs: site root prefixed to each listing's relative href."""
    base = 'https://www.marktplaats.nl'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [base + li.a.get('href')
            for li in soup.find_all(class_="mp-Listing mp-Listing--list-item")]


url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
It should produce the output like this:
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359157-morris-minor-cabriolet-1970.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359138-mercedes-benz-g-500-guard-pantzer-1999.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
https://www.marktplaats.nl/a/auto-s/volkswagen/a1279696849-vw-take-up-5-d-radio-airco-private-lease.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
https://www.marktplaats.nl/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
https://www.marktplaats.nl/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
https://www.marktplaats.nl/a/auto-s/citroen/a1277007710-citroen-c1-feel-5-d-airco-private-lease-vanaf-189-euro-mnd.html
https://www.marktplaats.nl/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
https://www.marktplaats.nl/a/auto-s/peugeot/a1298813052-private-lease-occasion-outlet-prive-lease.html
https://www.marktplaats.nl/a/auto-s/audi/m1458114563-audi-a4-2-0-tfsi-132kw-avant-multitronic-nl-auto.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1452983872-mercedes-a-klasse-2-0-cdi-a200-5drs-aut-2007-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
https://www.marktplaats.nl/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
https://www.marktplaats.nl/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
https://www.marktplaats.nl/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
https://www.marktplaats.nl/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
Good luck!
I am trying to retrieve data from a website and to add an object for each row of data. I am new to python and I clearly miss something, because I only get one object. What I'm trying to get is all the objects, organised as key-value pairs:
import urllib.request
import bs4 as bs

url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')

# Collect every artist name, track title, and download link in document order.
artists = [art.text
           for span in soup.find_all('span', {'class': 'ptxt-artist'})
           for art in span.find_all('a')]
tracks = [track.text
          for span in soup.find_all('span', {'class': 'ptxt-track'})
          for track in span.find_all('a')]
links = [a.get('href') for a in soup.find_all('a', {'title': 'Download'})]

# BUG FIX: the original built a single dict from the values that *leaked*
# out of the loops above (i.e. only the last artist/track/link), so exactly
# one object was ever produced. Zip the three parallel lists so each row on
# the page becomes its own dict.
tracks_info = [
    {'artist': artist, 'track': track, 'link': link}
    for artist, track, link in zip(artists, tracks, links)
]
for info in tracks_info:
    print(info)
I failed to add an object for each element I get from the website — I'm clearly doing something wrong (or failing to do something), and any help would be much appreciated!
You could use a slightly different struture and syntax such as below.
I use a contains CSS class selector to retrieve the rows of info as the id is different for each track
The CSS selector combination of div[class*="play-item gcol gid-electronic tid-"]
looks for div elements with class attribute having value containing play-item gcol gid-electronic tid-.
Within that the various columns of interest are then selected by their class name and a descendant css selector is used for the a tag element for the final download link.
import urllib.request
import bs4 as bs
import pandas as pd

url = 'http://freemusicarchive.org/search/?quicksearch=drake/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(html, 'html.parser')

headRow = ['Artist', 'TrackName', 'DownloadLink']
tracks_Info = []
# Each matching div is one track row; pick the columns out by class, and the
# download link via a descendant selector under .playicn.
for item in soup.select('div[class*="play-item gcol gid-electronic tid-"]'):
    artist = item.select_one(".ptxt-artist").text.strip()
    track = item.select_one(".ptxt-track").text
    link = item.select_one(".playicn a").get('href')
    tracks_Info.append([artist, track, link])

df = pd.DataFrame(tracks_Info, columns=headRow)
print(df)
when i run this code it gives me an empty bracket. Im new to web scraping so i dont know what im doing wrong.
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
# BUG FIX: with requests' default User-Agent, Amazon serves a robot-check
# page that contains none of the result markup, so findAll() matched
# nothing and the script printed []. Send a browser-like User-Agent.
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')
# btw the space is also there in the html code
container = soup.findAll('li', {'class': 's-result-item celwidget '})
print(container)
results:
[]
What i tried is to grab the html code from the site, and to soup trough the li tags where all the information is stored so I can print out all the information in a for loop.
Also if someone wants to explain how to use BeautifulSoup we can always talk.
Thank you guys.
So a working code that grabs product and price would could look something like this.
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
# A browser-like User-Agent avoids Amazon's robot-check page.
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
    h2 = cont.h2.text.strip()
    # Amazon lists prices in two ways. If the sx-price markup is absent,
    # find() returns None and .text raises AttributeError — fall back then.
    # BUG FIX: narrowed from a bare `except:`, which would also have
    # swallowed KeyboardInterrupt/SystemExit and masked real bugs.
    try:
        currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
        price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
    except AttributeError:
        price = cont.find('span', {'class': 'a-size-base a-color-base'})
    print('Product: {}, Price: {}'.format(h2, price))
Let me know if that helps you further...
Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
from bs4 import BeautifulSoup

# This is the original url http://www.kmdvalg.dk/
# BUG FIX: the original mixed Python 2 into a Python 3 project — it used
# `urllib2` (removed in Python 3) and statement-form `print i`. It also
# imported requests without using it; use requests for the fetch instead.
soup = BeautifulSoup(requests.get('http://www.kmdvalg.dk/').content, 'html.parser')

# Collect the href of every anchor on the front page.
my_list = []
all_links = soup.find_all("a")
for link in all_links:
    my_list.append(link["href"])

# Entries 1..92 are the 92 constituency result pages to follow.
for i in my_list[1:93]:
    print(i)
# The output shows all the links that I would like to follow and gather information from. How do I do that?
Here is my solution using lxml. It's similar to BeautifulSoup
from lxml import html
import requests

page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
# BUG FIX: XPath attribute tests use '@', not '#' (the '#' characters were
# most likely a markdown-rendering artifact), and Python 3 print needs
# parentheses. Also dropped the redundant bare `import lxml`.
my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href')  # grab all links
print('Length of all links = ', len(my_list))
my_list is a list consist of all links. And now you can use for loop to scrape information inside each page.
We can for loop through each links. Inside each page, you can extract information as example. This is only for the top table.
# Scrape the key/value status table at the top of every constituency page.
table_information = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    # BUG FIX: '@' (not '#') is the XPath attribute syntax; with '#' these
    # expressions raise XPathEvalError.
    table_key = tree.xpath('//td[@class="statusHeader"]/text()')
    table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
    # One zip of (page-url, key, value) triples per page.
    table_information.append(zip([t] * len(table_key), table_key, table_value))
For table below the page,
# Scrape the vote-count cells from the lower table on each page.
table_information_below = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    # BUG FIX: '@' (not '#') is the XPath attribute syntax; with '#' these
    # expressions raise XPathEvalError.
    l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
    l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
    table_information_below.append([t] + l1 + l2)
Hope this help!
A simple approach would be to iterate through your list of urls and parse them each individually:
# Visit each collected url in turn and build a soup for it.
for page_url in my_list:
    page_soup = BeautifulSoup(urllib2.urlopen(page_url).read())
    # then parse each page individually here
Alternatively, you could speed things up significantly using Futures.
from requests_futures.sessions import FuturesSession


def my_parse_function(html):
    """Use this function to parse each page"""
    soup = BeautifulSoup(html)
    all_paragraphs = soup.find_all('p')
    return all_paragraphs


session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]
# BUG FIX: the original iterated an undefined name `results`; the pending
# requests live in `futures`. Each .result() is a Response object, so pass
# its text to the parser rather than the Response itself.
page_results = [my_parse_function(future.result().text) for future in futures]
This would be my solution for your problem
import requests
from bs4 import BeautifulSoup


def spider():
    """Crawl the main page and visit every letter-group link found on it."""
    response = requests.get("http://www.kmdvalg.dk/main")
    soup = BeautifulSoup(response.text, 'html.parser')
    for group in soup.findAll('div', {'class': 'LetterGroup'}):
        anchor = group.find('a')
        target = anchor.get('href')
        print(anchor.getText())
        print(target)
        # spider2(href) call a second function from here that is similar to this one(making url = to herf)
        spider2(target)
        print("\n")


def spider2(linktofollow):
    """Print the first table cell of every primary row on the linked page."""
    response = requests.get(linktofollow)
    soup = BeautifulSoup(response.text, 'html.parser')
    for row in soup.findAll('tr', {'class': 'tableRowPrimary'}):
        print(row.find('td').getText())
    print("\n")


spider()
its not done... i only get a simple element from the table but you get the idea and how its supposed to work.
Here is my final code that works smooth. Please let me know if I could have done it smarter!
import urllib.request
from bs4 import BeautifulSoup

# BUG FIX: the original was Python 2 code (`urllib2`, `print >> f`) and
# would not run on Python 3. Ported: urllib.request for fetching,
# print(..., file=f) for output, and the built-in open() handles the
# iso-8859-1 encoding that codecs.open provided.
with open("eu2015valg.txt", "w", encoding="iso-8859-1") as f:
    soup = BeautifulSoup(urllib.request.urlopen('http://www.kmdvalg.dk/').read(), 'html.parser')

    # Gather every link on the front page; entries 1..92 are the result pages.
    liste = []
    alle_links = soup.find_all("a")
    for link in alle_links:
        liste.append(link["href"])

    for url in liste[1:93]:
        soup = BeautifulSoup(urllib.request.urlopen(url).read().decode('iso-8859-1'), 'html.parser')
        tds = soup.findAll('td')
        stemmernu = soup.findAll('td', class_='StemmerNu')
        # One semicolon-separated line of the nine fields of interest per page.
        print(tds[5].string, ";", tds[12].string, ";", tds[14].string, ";", tds[16].string, ";",
              stemmernu[0].string, ";", stemmernu[1].string, ";", stemmernu[2].string, ";",
              stemmernu[3].string, ";", stemmernu[6].string, ";", stemmernu[8].string, ";", '\r\n',
              file=f)