I need to do some real estate market research, and for this I need the prices and other values of new houses.
So my idea was to go to the website where I can get that information.
Go to the main search page and scrape all the RealEstateIDs that would navigate me directly to the single pages for each house, where I can then extract the info I need.
My problem is: how do I get all the real estate IDs from the main page and store them in a list, so I can use them in the next step to build the URLs that lead to the actual pages?
I tried it with BeautifulSoup but failed because I don't understand how to search for a specific word and extract what comes after it.
The html code looks like this:
""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"#xsi.type":"search:ApartmentBuy","#id":"110356727","title":"
Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it every time (here: 110356727) and store it in a list, so that I can use the IDs later.
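For the "search for a specific word and extract what comes after it" part, a plain regex over the page source is enough. A minimal sketch, assuming the raw HTML is already in a string named html_text:
import re
# Capture the digits that follow every occurrence of "realEstateId":
ids = re.findall(r'"realEstateId":(\d+)', html_text)
print(ids)  # e.g. ['110356727', ...]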
Edit:
import time
import urllib.request
from urllib.request import urlopen
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
from requests import get
url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
def expose_IDs():
    resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('resultListModel')
    tickers = []
    for row in table.findAll('realestateID')[1:]:
        ticker = row.findAll(',')[0].text
        tickers.append(ticker)
    with open("exposeID.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

expose_IDs()
Something like this? There are 68 keys in a dictionary that are IDs. I use a regex to grab the same script you are after, trim off an unwanted trailing character, then load it with json.loads and access the JSON object as shown below.
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text  # locate the script that contains resultListModel
script = r.findall(data)[0].rstrip(',')  # strip the trailing comma so json.loads accepts it
results = json.loads(script)
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)
Since the website updated:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)
ids = [item['#id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]
print(ids)
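With the IDs in hand, you can build the single-page URLs the question asks for by string formatting. The /expose/ path below is an assumption about how the site addresses its detail pages, so verify it against a real listing URL:
# Hypothetical detail-page pattern built from the scraped IDs.
urls = ['https://www.immobilienscout24.de/expose/{}'.format(i) for i in ids]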
Related
I want to webscrape a few urls. This is what I do:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
url_2021_int = ["https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html","https://www.ecb.europa.eu/press/inter/date/2020/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2019/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2018/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2017/html/index_include.en.html"]
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text)
    titles_int = soup_int.select(".title a")
    titles_int = [data.text for data in titles_int]
However, I get data only for the last url (2017).
What am I doing wrong?
Thanks!
When you use req_int = requests.get(url) in the loop, the req_int variable is overwritten on each iteration.
If you want to store the requests.get(url) results in a list variable, you can use
req_ints = [requests.get(url) for url in url_2021_int]
However, it seems logical to process the data in the same loop:
all_titles = []  # collect the titles from every page
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text, "html.parser")
    titles_int = [data.text for data in soup_int.select(".title a")]
    all_titles.extend(titles_int)  # keep each page's results instead of overwriting them
Note that you can specify "html.parser" as a second argument to the BeautifulSoup call, since the documents you are parsing are HTML documents.
I'm trying to extract player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:
player_id = 'malcolm-brogdon-1'
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
pos = page_soup.p.find("strong").next_sibling.strip()
pos
However, I want to be able to do this in a more dynamic way (that is, to locate "Position:" and then find what comes after it). There are other players whose webpages are structured slightly differently, and my current code doesn't return the position for them (e.g. Cat Barber).
I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.
You can select the element that contains the text "Position:" and then the next text sibling:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = soup.select_one('strong:contains("Position")').find_next_sibling(text=True).strip()
print(pos)
Prints:
Guard
EDIT: Another version:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = (
soup.find("strong", text=lambda t: "Position" in t)
.find_next_sibling(text=True)
.strip()
)
print(pos)
I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):
<script>
...
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...
I can identify the script I need with the following code:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[0]
However, I'm at a loss for how to extract the values I want. (I'm very new to Python.)
This thread recommended the following solution for a similar problem (edited to reflect the HTML I'm working with).
data = json.loads(all_scripts[0].get_text()[27:])
However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).
What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.
You can parse the content of the <script> with the json module and then get your values. For example:
import re
import json
import requests
url = 'https://www.gofundme.com/f/eric-stevens-care-trust'
txt = requests.get(url).text
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
# print( json.dumps(data, indent=4) ) # <-- uncomment this to see all data
print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code =', data['feed']['campaign']['location']['postal_code'])
Prints:
Campaign Hearts = 4817
Postal Code = 90012
The more libraries you use, the more inefficient the code becomes! Here is a simpler solution:
# Fetch the website content.
import requests

url = "https://www.gofundme.com/f/eric-stevens-care-trust"
a = requests.post(url).content
print(a)

# Split the raw text around the values to extract them.
campaign_hearts = str(a, 'utf-8').split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)

postal_code = str(a, 'utf-8').split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
Your json.loads was failing because of the final semicolon. It will work if you use a regex to extract only the object string (excluding the final semicolon).
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
This should be fine for now; I might try to write a pure lxml version, or at least improve the search for the element.
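A minimal sketch of that pure-lxml idea, assuming the same window.initialState assignment is present in the page; the XPath and regex here are untested assumptions that mirror the BeautifulSoup version above:
import json
import re
import requests
from lxml import html

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
tree = html.fromstring(page.content)
# Pick the <script> whose text contains the state assignment.
script_text = tree.xpath('//script[contains(text(), "window.initialState")]/text()')[0]
data = json.loads(re.findall(r'window\.initialState = ({.*?});', script_text)[0])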
This solution uses regex to get only the JSON data, without the window.initialState = and semicolon.
import json
import re
import requests
from bs4 import BeautifulSoup
url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"
req = requests.get(url_1)
soup = BeautifulSoup(req.content, 'lxml')
script_tag = soup.find('script')
raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)
json_content = json.loads(raw_json)
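The requested values can then be read from the parsed object, using the same paths as in the first answer:
print(json_content['feed']['campaign']['campaign_hearts'])
print(json_content['feed']['campaign']['location']['postal_code'])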
Good afternoon! How do I make BeautifulSoup grab only what is between multiple sets of "[:" and ":]"? So far I have the entire page in my soup, but it does not have tags, sadly.
I have tried a couple of things so far:
soup.findAll(text="[")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
import bs4 as bs
import urllib.request
source = urllib.request.urlopen("https://login.microsoftonline.com/common/discovery/keys").read()
soup = bs.BeautifulSoup(source,'lxml')
# ---------------------------------------------
# prior script that I was playing with trying to tackle this issue
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set URL to scrape new certs from
newcerts = "https://login.microsoftonline.com/common/discovery/keys"
# Connect to the URL
response = requests.get(newcerts)
# Parse HTML and save to BeautifulSoup Object
soup = BeautifulSoup(response.text, "html.parser")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
End goal is to retrieve the public PKI keys from Azure's website at https://login.microsoftonline.com/common/discovery/keys
Not sure if this is what you meant to grab. Try the script below:
import json
import requests
url = 'https://login.microsoftonline.com/common/discovery/keys'
res = requests.get(url)
jsonobject = json.loads(res.content)
for item in jsonobject['keys']:
    print(item['x5c'])
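Note that this endpoint returns plain JSON rather than HTML, so json.loads on the raw response is all that's needed; there is nothing for BeautifulSoup to parse here.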
I want to scrape the airplane arrivals from a website with Python 2.7 and export them to Excel, but something is wrong with my code:
import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
filename=r'output.csv'
resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
url = "https://www.flightradar24.com/data/airports/bud/arrivals"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = soup.find('div', { "class" : "row cnt-schedule-table"})
print data
I need the contents of the div with class row cnt-schedule-table. What am I doing wrong?
I believe the problem is that you are trying to get data from a JavaScript loaded data-set. Instead of loading from the page directly you'll need to mimic the requests for the data that the page is making to populate it.
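A hedged sketch of that approach: open the browser's developer tools, watch the network tab for the JSON request the page makes to load the schedule, and call that endpoint directly with requests. The endpoint URL and parameters below are illustrative assumptions, not a documented API:
import requests

# Hypothetical endpoint observed in the browser's network tab; verify before use.
api_url = "https://api.flightradar24.com/common/v1/airport.json"
params = {"code": "bud", "plugin-setting[schedule][mode]": "arrivals"}
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject requests without one

resp = requests.get(api_url, params=params, headers=headers)
data = resp.json()
# Inspect `data` to locate the arrivals list in whatever structure the site returns.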