How do I grab a title and link from a website? - python

I'm currently working on a web scraper on this website (https://www.allabolag.se). I would like to grab the title and link to every result on the page, and I'm currently stuck.
<a data-v-4565614c="" href="/5566435201/grenspecialisten-forvaltning-ab">Grenspecialisten Förvaltning AB</a>
This is an example from the website where I would like to grab the href and >Grenspecialisten Förvaltning AB<, as it contains both the title and the link. How would I go about doing that?
The code I currently have looks like this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'}
url = 'https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
questions = soup.findAll('div', {'class': 'tw-flex'})
for item in questions:
    title = item.find('a', {''}).text
    print(title)
Any help would be greatly appreciated!
Best regards :)

The results are embedded in the page as JSON. To decode it, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
}
url = "https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
data = json.loads(
    soup.find(attrs={":search-result-default": True})[":search-result-default"]
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for result in data:
    print(
        "{:<50} {}".format(
            result["jurnamn"], "https://www.allabolag.se/" + result["linkTo"]
        )
    )
Prints:
Grenspecialisten Förvaltning AB https://www.allabolag.se/5566435201/grenspecialisten-forvaltning-ab
Peab Fastighetsutveckling Syd AB https://www.allabolag.se/5566998430/peab-fastighetsutveckling-syd-ab
BayWa r.e. Nordic AB https://www.allabolag.se/5569701377/baywa-re-nordic-ab
Kronetorp Park Projekt AB https://www.allabolag.se/5567196539/kronetorp-park-projekt-ab
SVENSKA HUSCOMPAGNIET AB https://www.allabolag.se/5568155583/svenska-huscompagniet-ab
Byggnadsaktiebolaget Gösta Bengtsson https://www.allabolag.se/5561081869/byggnadsaktiebolaget-gosta-bengtsson
Tectum Byggnader AB https://www.allabolag.se/5562903582/tectum-byggnader-ab
Winthrop Engineering and Contracting AB https://www.allabolag.se/5592128176/winthrop-engineering-and-contracting-ab
SPI Global Play AB https://www.allabolag.se/5565082897/spi-global-play-ab
Trelleborg Offshore & Construction AB https://www.allabolag.se/5560557711/trelleborg-offshore-construction-ab
M.J. Eriksson Entreprenad AB https://www.allabolag.se/5567043814/mj-eriksson-entreprenad-ab
Solix Group AB https://www.allabolag.se/5569669574/solix-group-ab
Gripen Betongelement AB https://www.allabolag.se/5566646427/gripen-betongelement-ab
BLS Construction AB https://www.allabolag.se/5569814345/bls-construction-ab
We Construction AB https://www.allabolag.se/5590705116/we-construction-ab
Helsingborgs Fasad & Kakel AB https://www.allabolag.se/5567814248/helsingborgs-fasad-kakel-ab
Gat & Kantsten Sverige AB https://www.allabolag.se/5566564919/gat-kantsten-sverige-ab
Bjärno Byggsystem AB https://www.allabolag.se/5566743190/bjarno-byggsystem-ab
Bosse Sandells Bygg Aktiebolag https://www.allabolag.se/5564391158/bosse-sandells-bygg-aktiebolag
Econet Vatten & Miljöteknik AB https://www.allabolag.se/5567388953/econet-vatten-miljoteknik-ab
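The key step above — pulling JSON out of the Vue-serialized attribute — can be sketched offline on a minimal snippet. The markup shape and field names below are assumptions modeled on the answer, not the live page:

```python
import html
import json
import re

# Hand-made stand-in for the page: the framework serializes its props as an
# HTML-escaped JSON attribute (attribute name taken from the answer above).
snippet = (
    '<search :search-result-default="'
    '[{&quot;jurnamn&quot;:&quot;Example AB&quot;,'
    '&quot;linkTo&quot;:&quot;5566435201/example-ab&quot;}]'
    '"></search>'
)

# Grab the raw attribute value, unescape the entities, then parse as JSON.
raw = re.search(r':search-result-default="([^"]*)"', snippet).group(1)
results = json.loads(html.unescape(raw))

for r in results:
    print(r["jurnamn"], "https://www.allabolag.se/" + r["linkTo"])
```

BeautifulSoup already unescapes attribute values for you, which is why the answer can pass the attribute straight to json.loads.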

Related

Scrape data with <Script type="text/javascript" using beautifulsoup

I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "Login for Price" banner, but the price is present in the HTML inside a <Script type="text/javascript" tag, and I'm unable to pull it out. The specific link I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/
My current code is below; the last line is the one I'm using to pull the text out.
```
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl="https://www.chadwellsupply.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
productlinks = []
for x in range(1, 3):
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=')
    soup = BeautifulSoup(response.content,'html.parser')
    productlist = soup.find_all('div', class_="product-header")
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
print(soup.find('div',class_="product-title").text.strip())
print(soup.find('p',class_="status").text.strip())
print(soup.find('meta',{'property':'og:url'}))
print(soup.find('div',class_="tab-pane fade show active").text.strip())
print(soup.find('div',class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script',{'type':'text/javascript'}).text.strip())
```
Below is the chunk of script from the website (I tried to paste it directly here, but it wouldn't format correctly) that I'm expecting it to pull, but what it gives me is
"window.dataLayer = window.dataLayer || [];"
HTML From website
Ideally I'd like to just pull the price out, but if I can at least get the whole chunk of data out, I can manually extract the price.
You can use the re/json modules to search for and parse the HTML data (BeautifulSoup cannot parse JavaScript; another option is to use selenium).
import re
import json
import requests
url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"
html_doc = requests.get(url).text
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)
print(data)
Prints:
{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
Then for price you can do:
print(data["price"])
Prints:
759
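The regex can be exercised without hitting the site. The script body below is a hand-made stand-in shaped like the ga('ec:addProduct', …) call, not the real page source:

```python
import json
import re

# Hand-made stand-in for the inline analytics script (not the real page).
script = """
window.dataLayer = window.dataLayer || [];
ga('ec:addProduct', {"id": "301078", "brand": "Hotpoint", "price": "759"});
"""

# Capture the first argument of the addProduct call and parse it as JSON.
match = re.search(r"ga\('ec:addProduct', (.*?)\);", script)
data = json.loads(match.group(1))
print(data["price"])  # -> 759
```

The non-greedy (.*?) stops at the first ); after the opening parenthesis, which is what keeps the capture limited to the object literal.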
A hacky alternative to regex is to select for a function in the scripts. In your case, the script contains function(i,s,o,g,r,a,m).
from bs4 import BeautifulSoup
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
for el in soup.find_all("script"):
    if "function(i,s,o,g,r,a,m)" in el.text:
        scripttext = el.text
You can then select the data.
extracted = scripttext.split("{")[-1].split("}")[0]
my_json = json.loads("{%s}" % extracted)
print(my_json)
#{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
Then get the price.
print(my_json["price"])
#759
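The brace-splitting step can likewise be checked offline on a one-line stand-in for the captured script text (the contents are assumed; only the shape matters):

```python
import json

# One-line stand-in for the matched <script> text (contents assumed).
scripttext = """ga('ec:addProduct', {"id": "301078", "price": "759"});"""

# Take the text between the last '{' and the first '}' that follows it.
extracted = scripttext.split("{")[-1].split("}")[0]
my_json = json.loads("{%s}" % extracted)
print(my_json["price"])  # -> 759
```

Note this only works while the object literal contains no nested braces; the regex approach above is more robust in that case.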

Web-scraping: Accessing text information within a large list

Example: https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771
I am trying to access the number of garage spaces for several real estate listings. The only problem is that the location of the number of garage spaces isn't always in the 9th location of the list. On some pages it is earlier, and on other pages it is later.
garage = info[9].strip().replace('\n','')[15]
where
info = soup.find_all('ul', {'class': "list-default"})
info = [t.text for t in info]
and
header = {"user agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}
page = requests.get(url, headers = header)
page.reason
requests.utils.default_user_agent()
soup = bs4.BeautifulSoup(page.text, 'html5lib')
What is the best way for me to obtain how many garage spaces a house listing has?
You can use the CSS selector li:contains("Garage Spaces:"), which will find the <li> tag containing the text "Garage Spaces:".
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
garage_spaces = soup.select_one('li:contains("Garage Spaces:")')
if garage_spaces:
    garage_spaces = garage_spaces.text.split()[-1]
    print('Found Garage spaces! num =', garage_spaces)
Prints:
Found Garage spaces! num = 2
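As a side note, newer soupsieve releases prefer the spelling :-soup-contains(); :contains() still works as an alias. The selector can be verified on an inline snippet (the markup below is a stand-in for the listing page, not the real source):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the listing markup (shape assumed from the question).
html_doc = """
<ul class="list-default">
  <li>Garage Spaces: 2</li>
  <li>Cooling: Central</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# ':contains' matches a tag by its contained text.
li = soup.select_one('li:contains("Garage Spaces:")')
print(li.text.split()[-1])  # -> 2
```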

Web scraping twitter

I want to scrape a Twitter page to download tweets for a specific search word. I am not able to fetch all the tweets recursively; rather, I can only fetch 20 tweets. Please help me fetch all the tweets. Below is the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
company_name = 'ABC'
url = 'https://twitter.com/search?q=%23%27%20%20%20' + company_name + '&src=typd&lang=en'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
req = requests.get(url, headers=headers);#print(req)
data = req.text;# print(data)
# soup = BeautifulSoup(data, "lxml");# print(soup)
soup = BeautifulSoup(data, "html.parser");# print(soup)
tweets = [p.text for p in soup.findAll('p', class_='tweet-text')]
# print(tweets)
df = pd.DataFrame()
df['Tweet'] = tweets
print(df.head())
print(df.shape)

Scraping Table With Python/BS4

I'm trying to scrape the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7. However, I'm unable to get anywhere close to it.
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
table=soup.findAll('table', {'id':"team_stats", "class":"stats_table"})
print table
I thought something like the above code would work but no luck.
The problem in this case is that the "Team Stats" table is located inside a comment in the HTML source which you download with requests. Locate the comment and reparse it with BeautifulSoup into a "soup" object:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
soup = BeautifulSoup(comment, "html5lib")
table = soup.find("table", id="team_stats")
print(table)
And/or, you can load the table into, for example, a pandas dataframe which is very convenient to work with:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
df = pd.read_html(comment)[0]
print(df)
Prints:
Unnamed: 0 DEN CAR
0 First Downs 11 21
1 Rush-Yds-TDs 28-90-1 27-118-1
2 Cmp-Att-Yd-TD-INT 13-23-141-0-1 18-41-265-0-1
3 Sacked-Yards 5-37 7-68
4 Net Pass Yards 104 197
5 Total Yards 194 315
6 Fumbles-Lost 3-1 4-3
7 Turnovers 2 4
8 Penalties-Yards 6-51 12-102
9 Third Down Conv. 1-14 3-15
10 Fourth Down Conv. 0-0 0-0
11 Time of Possession 27:13 32:47
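The comment trick itself can be demonstrated on a minimal inline document (the markup below is a toy stand-in, not the real box score page):

```python
from bs4 import BeautifulSoup, Comment

# Toy document: the table is hidden inside an HTML comment, as on the site.
html_doc = """
<div id="all_team_stats">
<!-- <table id="team_stats"><tr><td>First Downs</td><td>11</td></tr></table> -->
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Locate the comment node mentioning the table id, then re-parse its text.
comment = soup.find(string=lambda x: isinstance(x, Comment) and "team_stats" in x)
table = BeautifulSoup(comment, "html.parser").find("table", id="team_stats")
print(table.td.text)  # -> First Downs
```

Comment is a subclass of NavigableString, which is why the answer's isinstance check on NavigableString also matches it.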

Beautifulsoup parsing error

I am trying to extract some information about an App on Google Play and BeautifulSoup doesn't seem to work.
The link is this(say):
https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)
0 #How is this 0?! There is clearly a div with that class
I decided to go all in, didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?
You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
html = r.content
soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()
print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
    return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com
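The title/content sibling lookup in get_info can be checked against a minimal snippet (markup shape assumed from the answer; the values are placeholders):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the "additional information" markup (shape assumed).
html_doc = """
<div class="meta-info">
  <div class="title"> Size </div>
  <div class="content">1.4M</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

def get_info(soup, text):
    # Find the label div whose stripped text equals `text`, then read the
    # adjacent content div. string= is the current name for the text= argument.
    return soup.find("div", class_="title", string=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)

print(get_info(soup, "Size"))  # -> 1.4M
```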
