The big goal is to find specific house bills.
With this code I am trying to select the link: /legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D to narrow down my search to house bills.
from bs4 import BeautifulSoup
import urllib2
import re

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
for link in soup.find_all('a'):
    soup_links = link.get('href')

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
print r1.findall(soup_links)
When I do this I get an empty list instead of the link.
It isn't my regular expression, because the following works:
r2 = re.compile(r'\S+congress\S+chamber\S+House\S+')
newstring = '/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D'
print r2.findall(newstring)
You are re-assigning a new value to soup_links on each iteration; when the loop ends, soup_links holds only the last href value, so your regex is only ever tested against that one URL.
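If you want to keep the explicit loop, collect every match as you go instead of overwriting the variable; a minimal sketch of that fix:

from bs4 import BeautifulSoup
import urllib2
import re

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')

matches = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and r1.search(href):  # skip anchors without an href
        matches.append(href)
print matches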
BeautifulSoup can do the searching for you:
import re
soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
soup_links = [l['href'] for l in soup.find_all('a', href=r1)]
print soup_links
This produces the one matching link:
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> import re
>>> soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
>>> r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
>>> [l['href'] for l in soup.find_all('a', href=r1)]
['/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D']
If you only expect one link to match, use soup.find() instead of soup.find_all():
soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
soup_link = soup.find('a', href=r1)
print soup_link['href']
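Note that soup.find() returns None when nothing matches, so it is safer to check before indexing (reusing soup and r1 from above):

soup_link = soup.find('a', href=r1)
if soup_link is not None:
    print soup_link['href']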
I have this code that extracts all the numbers on the website. If I want to get a specific value, how can I do it? I did this, but it doesn't work:
import urllib
import re
import requests
from bs4 import *

url = requests.get("http://python-data.dr-chuck.net/comments_216543.html")
soup = BeautifulSoup(url.content, "html.parser")
sum = 0
tags = soup('span')
for tag in tags:
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        print(i[1])
To get the tag containing "Coby", you can pass a custom function to .find():
import requests
from bs4 import BeautifulSoup
url = requests.get("http://python-data.dr-chuck.net/comments_216543.html")
soup = BeautifulSoup(url.content, "html.parser")
coby = soup.find(lambda tag: tag.name == "tr" and "Coby" in tag.text)
print(coby.get_text(separator=" "))
Output:
Coby 95
Or, to only get the comment, use .find_next():
print(coby.find_next("span", class_="comments").get_text())
Output:
95
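If the end goal is the total of all the numbers (the unused sum = 0 in the question hints at that), you can skip the regex and add up the span texts directly, reusing the soup object from above; this sketch assumes every span with class "comments" holds a plain integer:

total = sum(int(span.get_text()) for span in soup.find_all("span", class_="comments"))
print(total)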
I'm a newbie at web scraping. Here is what I do:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('a', {'href': re.compile("r'\b?20\b'")})
print (res)
and get
[]
My goal is this fragment
<script language="javascript" type="text/javascript">
cont = new Array();
count = new Array();
for (i=1979; i <=2015; i++){count[i]=0};
cont[1979] = "<li><a href='?1979_1#24jan'>24 января</a>" +
..............
cont[2016] = "<li><a href='?2016/2016_spr#cur'>Весенняя серия</a>" +
"<li><a href='?2016/2016_sum#cur'>Летняя серия</a>" +
"<li><a href='?2016/2016_aut#cur'>Осенняя серия</a>" +
"<li><a href='?2016/2016_win#cur'>Зимняя серия</a>";
And I try to get a result like this:
'?2016/2016_spr#cur'
'?2016/2016_sum#cur'
'?2016/2016_aut#cur'
'?2016/2016_win#cur'
From 2000 up to the present (that is the reason for the '20' in "r'\b?20\b'"). Can you help me, please?
Preliminaries:
>>> import requests
>>> import bs4
>>> page = requests.get('http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
Having done this, it might seem that the most straightforward way of identifying the script element would be:
>>> scripts = soup.findAll('script', text=bs4.re.compile('cont = new Array();'))
However, scripts proves to be an empty list, most likely because the unescaped parentheses in 'cont = new Array();' are treated as an empty regex group rather than literal text, so the pattern never matches; escaping them (r'cont = new Array\(\);') should fix that.
The basic approach works if I choose a different target within the script, but it would appear that it's unsafe to depend on the exact formatting of the contents of a JavaScript script element.
>>> scripts = soup.find_all(string=bs4.re.compile('i=1979'))
>>> len(scripts)
1
Still, this might be good enough for you. Just note that the script also contains a change function at the end, which you will need to discard.
A safer approach might be to look for the containing table element, then the second td element within that and finally the script within that.
>>> table = soup.find_all('table', class_='common_table')
>>> tds = table[0].findAll('td')[1]
>>> script = tds.find('script')
Again, you will need to discard function change.
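Whichever way you locate the script, the hrefs still have to be pulled out of the JavaScript text, e.g. with a regex; a sketch based on the href='?20...' quoting visible in the fragment from the question:

>>> import re
>>> re.findall(r"href='(\?20[^']*)'", script.text)

This should give you the links from 2000 onwards, such as '?2016/2016_spr#cur'.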
You can use get('attribute') and then filter the results if needed:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = [link.get('href') for link in soup.find_all('a')]
print (res)
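Two caveats here: the '?20...' links you are after are built inside a <script> block, so they never exist as real <a> elements and soup.find_all('a') will not see them; for those, use a regex over the script text as sketched in the previous answer. For links that do exist as anchors, the filtering step can be a simple list comprehension:

res = [href for href in res if href and href.startswith('?20')]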
I've been trying to extract just the links corresponding to the jobs on each page, but for some reason they don't print when I execute the script. No errors occur.
For the inputs I entered engineering and toronto, respectively. Here is my code:
import requests
from bs4 import BeautifulSoup
import webbrowser
jobsearch = input("What type of job?: ")
location = input("What is your location: ")
url = ("https://ca.indeed.com/jobs?q=" + jobsearch + "&l=" + location)
r = requests.get(url)
rcontent = r.content
prettify = BeautifulSoup(rcontent, "html.parser")
all_job_url = []
for tag in prettify.find_all('div', {'data-tn-element': "jobTitle"}):
    for links in tag.find_all('a'):
        print(links['href'])
You should be looking for the anchor (<a>) tag. It looks like this:
<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=3611ac98c0167102&fccid=459dce363200e1be" ...>Project <b>Engineer</b></a>
Call soup.find_all and iterate over the result set, extracting the links through the href attribute.
import requests
from bs4 import BeautifulSoup
# valid query, replace with something else
url = "https://ca.indeed.com/jobs?q=engineer&l=Calgary%2C+AB"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
all_job_url = []
for tag in soup.find_all('a', {'data-tn-element': "jobTitle"}):
    all_job_url.append(tag['href'])
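The extracted hrefs are relative paths like /rc/clk?jk=..., so to visit them you will probably want to make them absolute first, e.g. with urljoin from the standard library:

from urllib.parse import urljoin

full_urls = [urljoin("https://ca.indeed.com", href) for href in all_job_url]
for u in full_urls:
    print(u)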
Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
import urllib2
from bs4 import BeautifulSoup
# This is the original url http://www.kmdvalg.dk/
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
my_list = []
all_links = soup.find_all("a")
for link in all_links:
    link2 = link["href"]
    my_list.append(link2)

for i in my_list[1:93]:
    print i
# The output shows all the links that I would like to follow and gather information from. How do I do that?
Here is my solution using lxml. It's similar to BeautifulSoup:
import lxml
from lxml import html
import requests
page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href')  # grab all links
print 'Length of all links = ', len(my_list)
my_list now holds all the links, and you can loop over it to scrape the information inside each page. The example below extracts only the top table:
table_information = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    table_key = tree.xpath('//td[@class="statusHeader"]/text()')
    table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
    table_information.append(zip([t]*len(table_key), table_key, table_value))
For the table further down the page:
table_information_below = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
    l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
    table_information_below.append([t]+l1+l2)
Hope this helps!
A simple approach would be to iterate through your list of urls and parse them each individually:
for url in my_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # then parse each page individually here
Alternatively, you could speed things up significantly using Futures.
from requests_futures.sessions import FuturesSession
from bs4 import BeautifulSoup

def my_parse_function(html):
    """Use this function to parse each page"""
    soup = BeautifulSoup(html)
    all_paragraphs = soup.find_all('p')
    return all_paragraphs

session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]
page_results = [my_parse_function(future.result().content) for future in futures]
This would be my solution to your problem:
import requests
from bs4 import BeautifulSoup

def spider():
    url = "http://www.kmdvalg.dk/main"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('div', {'class': 'LetterGroup'}):
        anc = link.find('a')
        href = anc.get('href')
        print(anc.getText())
        print(href)
        # call a second function similar to this one from here (with url = href)
        spider2(href)
        print("\n")

def spider2(linktofollow):
    url = linktofollow
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('tr', {'class': 'tableRowPrimary'}):
        anc = link.find('td')
        print(anc.getText())
    print("\n")

spider()
It's not finished; I only grab a single element from the table, but you get the idea of how it's supposed to work.
Here is my final code, which works smoothly. Please let me know if I could have done it smarter!
import urllib2
from bs4 import BeautifulSoup
import codecs

f = codecs.open("eu2015valg.txt", "w", encoding="iso-8859-1")
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

liste = []
alle_links = soup.find_all("a")
for link in alle_links:
    link2 = link["href"]
    liste.append(link2)

for url in liste[1:93]:
    soup = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
    tds = soup.findAll('td')
    stemmernu = soup.findAll('td', class_='StemmerNu')
    print >> f, tds[5].string, ";", tds[12].string, ";", tds[14].string, ";", tds[16].string, ";", stemmernu[0].string, ";", stemmernu[1].string, ";", stemmernu[2].string, ";", stemmernu[3].string, ";", stemmernu[6].string, ";", stemmernu[8].string, ";", '\r\n'
f.close()
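One thing you could arguably do smarter is to let the csv module build the semicolon-separated lines instead of the long print statement; a sketch of the same loop (same Python 2 idioms, not tested against the live site):

import csv
import urllib2
from bs4 import BeautifulSoup

with open("eu2015valg.txt", "wb") as f:
    writer = csv.writer(f, delimiter=";")
    soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
    liste = [link["href"] for link in soup.find_all("a")]
    for url in liste[1:93]:
        page = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
        tds = page.findAll('td')
        stemmernu = page.findAll('td', class_='StemmerNu')
        row = [tds[i].string for i in (5, 12, 14, 16)]
        row += [stemmernu[i].string for i in (0, 1, 2, 3, 6, 8)]
        # the csv module in Python 2 expects bytes, so encode each cell
        writer.writerow([(cell or u"").encode("iso-8859-1") for cell in row])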
I have tried hard to get the link (i.e. /d/Hinchinbrook+25691+Masjid-Bilal) from the "result" below using BeautifulSoup in Python. Please help.
result:
<div class="subtitleLink"><a href="/d/Hinchinbrook+25691+Masjid-Bilal"><b>Masjid Bilal</b></a></div>
code:
url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class" : "subtitleLink"})
for result in results:
    print result
    br = result.find('a')
    pos = br.get_text()
    print pos
import urllib2
from bs4 import BeautifulSoup
url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
    print link.get('href')
This should work if you want all the links. Let me know if it doesn't.
The get_text method returns only the string components of a tag. To get the link here, reference it as an attribute. For this specific instance, you can change br.get_text() to br['href'] to get your desired result.
...
>>> br = result.find('a')
>>> pos = br['href']
>>> print pos
/d/Hinchinbrook+25691+Masjid-Bilal
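Putting it together, the loop from the question can print both the name and the link, skipping any div that has no anchor:

for result in soup.findAll("div", {"class": "subtitleLink"}):
    br = result.find('a')
    if br is not None:
        print br.get_text(), br['href']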