BS4 for HTML with special chars - python

I need some help: for a project I have to parse information from a real estate website.
Somehow I am able to parse almost everything, but the page contains a one-liner I've never seen before.
The code itself is too large to post, but here is an example snippet:
<div class="d-none" data-listing='{"strippedPhotos":[{"caption":"","description":"","urls":{"1920x1080":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_hd.jpg","800x600":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_l.jpg","228x171":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_m.jpg","80x60":"https:\/\/ot.ingatlancdn.com\/d6\/07
Can you please help me identify what this is, and maybe suggest a solution for how to parse all the contained info into a pandas DataFrame?
Edit, code added:
other = []
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
hdr = {'User-Agent': 'Mozilla/5.0'}
site= "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
data = soup.find_all('div', id="listing", class_="d-none", attrs="data-listing")
data

You could access the value of the attribute and convert the string via json.loads():
data = json.loads(soup.find('div', id='listing', class_='d-none').get('data-listing'))
Then simply create your DataFrame via pandas.json_normalize():
pd.json_normalize(data['strippedPhotos'])
Example
Because the expected result is not clear, this should just point you in a direction:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import json

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
data = json.loads(soup.find('div', id='listing', class_='d-none').get('data-listing'))

### all data
pd.json_normalize(data)

### only strippedPhotos
pd.json_normalize(data['strippedPhotos'])
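As a side note, the \/ sequences inside the attribute are ordinary JSON escapes for the forward slash; json.loads resolves them automatically, so the URLs come out clean:

```python
import json

# '\/' is a valid JSON escape for '/', so json.loads restores plain slashes
raw = '{"1920x1080": "https:\\/\\/ot.ingatlancdn.com\\/d6\\/07\\/32844921_216401477_hd.jpg"}'
urls = json.loads(raw)
print(urls['1920x1080'])  # → https://ot.ingatlancdn.com/d6/07/32844921_216401477_hd.jpg
```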

Related

How to get the url from extracted information from a website

So basically I am stuck on a problem where I don't know how to get the URL out of the data extracted from a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())
I get a lot of information in the output, but the only thing I need is the URL. I hope someone can help me.
P.S:
It gives me this information:
{"response":{"items":[{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm","filesize":"20259","width":1280,"height":720,"name":"1521967932778.webm","board":"b","thread":"262671212"},{"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm","filesize":"12055","width":1280,"height":720,"name":"1526793203110.webm","board":"b","thread":"261549765"}...
But I only need this part out of all of it:
https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm (not exactly this URL, just an example)
You can do it this way. The response is JSON, so parse it with the json module first; a BeautifulSoup object cannot be indexed like a dictionary:
import json

data = json.loads(req.content)
url_array = []
for item in data['response']['items']:
    url_array.append(item['url'])
If the API returns JSON data, it is better to just parse it directly.
The URL produces JSON data. BeautifulSoup can't parse JSON; to grab it, you can follow the next example.
import requests

data = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1').json()
url = data['response']['items'][0]['url']
if url:
    url = url.replace('.webm', '.mp4')
print(url)
Output:
https://2ch.hk/b/src/263361969/16451225633240.mp4
The problem is that you are telling BeautifulSoup to parse JSON data as HTML. You can get the URL you need more directly with the following code:
import json
import requests
from bs4 import BeautifulSoup
req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')
data = json.loads(req.content)
my_url = data['response']['items'][0]['url']
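If you need every URL rather than only the first item, a list comprehension over the items works; a sketch against a small stand-in for the API payload (the URLs below are just the examples from the question):

```python
import json

# Stand-in for the API response; a real run would use
# requests.get(...).json() instead of this literal.
payload = json.loads('{"response": {"items": ['
                     '{"url": "https:\\/\\/2ch.hk\\/b\\/src\\/262671212\\/16440825183970.webm"},'
                     '{"url": "https:\\/\\/2ch.hk\\/b\\/src\\/261549765\\/16424501976450.webm"}]}}')

urls = [item['url'] for item in payload['response']['items']]
print(urls)
```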

Web scraping several a href

I would like to scrape this page with Python: https://statusinvest.com.br/acoes/proventos/ibovespa.
With this code:
import requests
from bs4 import BeautifulSoup as bs
URL = "https://statusinvest.com.br/acoes/proventos/ibovespa"
page = 1
req = requests.get(URL+str(page))
soup = bs(req.text, 'html.parser')
container = soup.find('div', attrs={'class','list'})
dividends = container.find('a')
for dividend in dividends:
    links = dividend.find_all('a')
    print(links)
But it doesn't return anything.
Can someone help me please?
Edited: see the updated code below to access any data you mentioned in the comment; you can modify it according to your needs, as all the data on that page is inside the data variable.
Updated Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links = []
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
print("Date Com Data")
for datecom in data["dateCom"]:
    print(f"{datecom['code']}\t{datecom['companyName']}\t{datecom['companyNameClean']}\t{datecom['companyId']}\t{datecom['companyId']}\t{datecom['resultAbsoluteValue']}\t{datecom['dateCom']}\t{datecom['paymentDividend']}\t{datecom['earningType']}\t{datecom['dy']}\t{datecom['recentEvents']}\t{datecom['recentEvents']}\t{datecom['uRLClear']}")
print("\nDate Payment Data")
for datePayment in data["datePayment"]:
    print(f"{datePayment['code']}\t{datePayment['companyName']}\t{datePayment['companyNameClean']}\t{datePayment['companyId']}\t{datePayment['companyId']}\t{datePayment['resultAbsoluteValue']}\t{datePayment['dateCom']}\t{datePayment['paymentDividend']}\t{datePayment['earningType']}\t{datePayment['dy']}\t{datePayment['recentEvents']}\t{datePayment['recentEvents']}\t{datePayment['uRLClear']}")
print("\nProvisioned Data")
for provisioned in data["provisioned"]:
    print(f"{provisioned['code']}\t{provisioned['companyName']}\t{provisioned['companyNameClean']}\t{provisioned['companyId']}\t{provisioned['companyId']}\t{provisioned['resultAbsoluteValue']}\t{provisioned['dateCom']}\t{provisioned['paymentDividend']}\t{provisioned['earningType']}\t{provisioned['dy']}\t{provisioned['recentEvents']}\t{provisioned['recentEvents']}\t{provisioned['uRLClear']}")
Looking at the source code of that website, one could fetch the JSON directly and get your desired links; follow the code below.
Code:
import json
import requests
from bs4 import BeautifulSoup as bs
url = "https://statusinvest.com.br"
links=[]
req = requests.get(f"{url}/acoes/proventos/ibovespa")
soup = bs(req.content, 'html.parser')
data = json.loads(soup.find('input', attrs={'id': 'result'})["value"])
for datecom in data["dateCom"]:
    links.append(f"{url}{datecom['uRLClear']}")
for datePayment in data["datePayment"]:
    links.append(f"{url}{datePayment['uRLClear']}")
for provisioned in data["provisioned"]:
    links.append(f"{url}{provisioned['uRLClear']}")
print(links)
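The link-building step is just prefixing each site-relative uRLClear path with the site root; a minimal sketch with made-up sample paths:

```python
url = "https://statusinvest.com.br"

# Hypothetical stand-in for the parsed 'result' JSON; the real keys per the
# code above are dateCom, datePayment and provisioned, each holding rows
# with a 'uRLClear' path.
data = {"dateCom": [{"uRLClear": "/acoes/aaaa1"}],
        "datePayment": [{"uRLClear": "/acoes/bbbb2"}],
        "provisioned": []}

links = []
for section in ("dateCom", "datePayment", "provisioned"):
    for row in data[section]:
        links.append(f"{url}{row['uRLClear']}")
print(links)  # → ['https://statusinvest.com.br/acoes/aaaa1', 'https://statusinvest.com.br/acoes/bbbb2']
```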
Output:
Let me know if you have any questions :)

How to scrape specific IDs from a Webpage

I need to do some real estate market research, and for this I need the prices and other values from new houses.
So my idea was to go to the website where I get the information.
Go to the main search page and scrape all the RealEstateIDs that would navigate me directly to the single page for each house, where I can then extract the info I need.
My problem is how to get all the real estate IDs from the main page and store them in a list, so I can use them in the next step to build the URLs that lead to the actual pages.
I tried it with BeautifulSoup but failed because I don't understand how to search for a specific word and extract what comes after it.
The html code looks like this:
""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"#xsi.type":"search:ApartmentBuy","#id":"110356727","title":"
Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it each time (here: 110356727) and store the numbers in a list so that I can use them later.
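For "search for a specific word and extract what comes after it", a regular expression over the raw page text is the simplest tool; a sketch using the snippet above:

```python
import re

# The snippet from the page source shown above
html = '"realEstateId":110356727,"newHomeBuilder":"false"'

# Capture the digits that follow every "realEstateId": occurrence
ids = re.findall(r'"realEstateId":(\d+)', html)
print(ids)  # → ['110356727']
```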
Edit:
import time
import urllib.request
from urllib.request import urlopen
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
from requests import get
url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
def expose_IDs():
    resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('resultListModel')
    tickers = []
    for row in table.findAll('realestateID')[1:]:
        ticker = row.findAll(',')[0].text
        tickers.append(ticker)
    with open("exposeID.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

expose_IDs()
Something like this? There are 68 keys in the dictionary that are IDs. I use a regex to grab the same script you are after and trim off an unwanted character, then load it with json.loads and access the JSON object.
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
#resultListModel:
results = json.loads(script)
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)
Since the website updated:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)
ids = [item['#id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]
print(ids)

Using requests with bs4 and or json

Here is the source of the page I am looking for (Page Source).
If the page source is not working, here is the link for the source only: "view-source:https://sports.bovada.lv/baseball/mlb"
Here is the link: Link to page
I am not too familiar with using bs4, but here is the script below, which works but does not return anything I need.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://sports.bovada.lv/baseball/mlb/game-lines-market-group')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())
I can return the soup just fine, but what I see from inspecting the site and what comes back in the soup are not the same.
Here is a sample of what I can see from inspect.
The goal is to extract the team, pitcher, odds and total runs, which I can clearly see in the inspect version. When I print soup, that information does not come with it.
Then I dove a little further, and at the bottom of the page source I can see an iframe, and below that what looks like a JSON dictionary with everything I am looking to extract, but running a similar script to retrieve JSON data does not work like I had hoped:
import requests
req = requests.get('view-source:https://sports.bovada.lv//baseball/mlb/game-lines-market-group')
data = req.json()['itemList']
print(data)
I believe I should be using bs4, but I am confused about why the same HTML is not being returned.
The data is dynamic JSON which the page embeds into the HTML. To access it with BeautifulSoup you need to grab the variable in the source which contains the JSON data, then load it into json, and you can access it from there.
From the link you gave, the data comes from var swc_market_lists =.
So in the source it will look like:
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136","link":"/baseball/mlb/game-lines-market-group","baseLink":"/baseball/mlb/game-lines-market-........
Now you can use swc_market_lists in a regular-expression pattern to return only that script, and soup.find to return just that section.
Because .text will include the var part, I have returned the data from the start of the JSON string; in this case from index 23, which is the first {.
This means you now have a string of JSON data which you can then load as JSON and manipulate as required.
Hopefully you can work with this to find what you want:
from bs4 import BeautifulSoup as bs4
import requests
import json
import re
from pprint import pprint

def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script')  # find all scripts..
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    return script.text[23:]

test1 = get_data()
json_data = json.loads(test1)
pprint(json_data['items'])
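Instead of hard-coding the [23:] slice, the regex's capture group can pull out the JSON object itself; a sketch against a stand-in for the script text (sample values only, and assuming the object sits on one line):

```python
import json
import re

# Stand-in for the <script> contents; a real run would take soup.find(...).text
script_text = 'var swc_market_lists = {"items": [{"description": "Game Lines", "id": "136"}]};'

# Greedy .* captures up to the last closing brace on the line
match = re.search(r"swc_market_lists\s*=\s*(\{.*\})", script_text)
data = json.loads(match.group(1))
print(data['items'][0]['description'])  # → Game Lines
```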

Need code to get specific text from url

I am trying to figure out what I need to add to this code so that, after the URL source is read, I can eliminate everything but the text between tags and then print the result.
import urllib.request
req = urllib.request.Request('http://myurlhere.com')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page)
You would need an HTML parser.
Example using BeautifulSoup (it supports Python 3.x):
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request('http://onlinepermits.co.escambia.fl.us/CitizenAccess/Cap/CapDetail.aspx?Module=Building&capID1=14ACC&capID2=00000&capID3=00386&agencyCode=ESCAMBIA')
response = urllib.request.urlopen(req)
soup = BeautifulSoup(response, 'html.parser')
print(soup.find('td', id='ctl00_PlaceHolderMain_PermitDetailList1_owner').div.table.text)
Prints:
SNB HOTEL INC2607 WILDE LAKE BLVD PENSACOLA FL 32526
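If you want to stay inside the standard library instead of installing BeautifulSoup, html.parser can also strip tags and keep only the text; a minimal sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data receives only the text between tags, never the tags
        text = data.strip()
        if text:
            self.chunks.append(text)

extractor = TextExtractor()
extractor.feed('<table><tr><td>SNB HOTEL INC</td><td>PENSACOLA FL 32526</td></tr></table>')
print(' '.join(extractor.chunks))  # → SNB HOTEL INC PENSACOLA FL 32526
```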
