Here is the source of the page I am looking for: view-source:https://sports.bovada.lv/baseball/mlb
And here is the link to the page itself: https://sports.bovada.lv/baseball/mlb
I am not too familiar with using bs4, but here is my script below, which runs but does not return anything I need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://sports.bovada.lv/baseball/mlb/game-lines-market-group')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())
I can return the soup just fine, but what I see from inspecting the site and what comes back in the soup are not the same.
Here is a sample of what I can see from inspect.
The goal is to extract the team, pitcher, odds and total runs, which I can clearly see in the inspect version. When I print the soup, that information does not come with it.
Then I dove a little further: at the bottom of the page source I can see an iframe, and below that what looks like a JSON dictionary with everything I am looking to extract. But running a similar script to retrieve the JSON data does not work like I had hoped:
import requests
req = requests.get('view-source:https://sports.bovada.lv//baseball/mlb/game-lines-market-group')
data = req.json()['itemList']
print(data)
I believe I should be using bs4, but I am confused about why the same HTML is not being returned.
The data is dynamic: the page's JavaScript takes JSON embedded in the source and renders it into the HTML.
To access it with BeautifulSoup you need to grab the var in the source that contains the JSON data, then load it with the json module, and you can access it from there.
On the page you linked, the data is assigned to var swc_market_lists =.
So in the source it will look like
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136","link":"/baseball/mlb/game-lines-market-group","baseLink":"/baseball/mlb/game-lines-market-........
Now you can use swc_market_lists in a regular expression pattern to match only that script.
Use soup.find to return just that section.
Because .text will include the var swc_market_lists = prefix, I return the data from the start of the JSON string; in this case from index 23 (the 24th character), which is the first {.
This means you now have a string of JSON data which you can then load with json and manipulate as required.
Hopefully you can work with this to find what you want.
from bs4 import BeautifulSoup as bs4
import requests
import json
from pprint import pprint
import re

def get_data():
    url = 'https://sports.bovada.lv/baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    soup = bs4(r.text, 'lxml')
    # match the <script> tag whose text contains the var assignment
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    # drop the leading 'var swc_market_lists = ' (23 characters) so only JSON remains
    return script.text[23:]

test1 = get_data()
json_data = json.loads(test1)
pprint(json_data['items'])
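To go one step further, here is a minimal sketch of walking the parsed result. Only the description, id and link keys are visible in the JSON snippet quoted above; anything deeper (teams, pitchers, odds, totals) is an assumption you would need to confirm by inspecting the full structure with pprint first.
for item in json_data['items']:
    # these keys appear in the quoted snippet; drill further for odds etc.
    print(item['description'], item['id'], item['link'])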
I need some help: for a project I have to parse information from a real estate website.
Somehow I am able to parse almost everything, but it has a one-liner which I've never seen before.
The code itself is too large to post, but here is an example snippet:
<div class="d-none" data-listing='{"strippedPhotos":[{"caption":"","description":"","urls":{"1920x1080":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_hd.jpg","800x600":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_l.jpg","228x171":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_m.jpg","80x60":"https:\/\/ot.ingatlancdn.com\/d6\/07
Can you please help me identify this, and maybe suggest a solution for parsing all the contained info into a pandas DataFrame?
Edit, code added:
other = []
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
hdr = {'User-Agent': 'Mozilla/5.0'}
site= "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
data = soup.find_all('div', id="listing", class_="d-none", attrs="data-listing")
data
You could access the value of the attribute and convert the string via json.loads():
data = json.loads(soup.find('div', id="listing", class_="d-none", attrs="data-listing").get('data-listing'))
Then simply create your DataFrame via pandas.json_normalize():
pd.json_normalize(data['strippedPhotos'])
Example
Because the expected result is not clear, this should just point you in a direction:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import json
hdr = {'User-Agent': 'Mozilla/5.0'}
site= "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
data = json.loads(soup.find('div', id="listing", class_="d-none", attrs="data-listing").get('data-listing'))
### all data
pd.json_normalize(data)
### only strippedPhotos
pd.json_normalize(data['strippedPhotos'])
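Going one step further, the nested urls dict (with size keys like 1920x1080, visible in the snippet above) gets flattened by json_normalize into dotted column names, so you can pick out a single size per photo. The exact columns depend on the real data-listing JSON:
df = pd.json_normalize(data['strippedPhotos'])
# column names follow the nested keys from the snippet, e.g. 'urls.1920x1080'
print(df['urls.1920x1080'])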
So, I am doing some web scraping using Beautiful Soup.
I've tried a lot of things but can't reach this part of the code:
I tried this (and other variations), but it returns an empty list:
iptu = [iptu.get_text() for iptu in soup.find_all("article", {"data-clickstream":"iptuPrices"})]
How can I share the HTML, as it's too big to copy and paste?!
From your image, it looks like the data you want is in a JSON string in an attribute of the article tag. If so, then perhaps something like this can get you started.
from bs4 import BeautifulSoup
import json
import requests
url = 'https://www.zapimoveis.com.br/aluguel/casas-de-condominio/agr+rj++barra-e-recreio/'
user_agent = {'User-agent': 'Mozilla/5.0'}
resp = requests.get(url, headers=user_agent)
soup = BeautifulSoup(resp.text, features="html.parser")
prices = []
for a in soup.find_all('article'):
    b = a.get('data-clickstream')
    if not b:
        continue
    # iptuPrices holds numeric strings; convert and sum them per article
    o = json.loads(b)
    prices.append(sum(map(float, o['iptuPrices'])))
print(prices)
I'm trying to parse the content within a box-like container located at the very bottom of this website, but I can't find it in the page source. I've tried to create a script to reach it anyway.
import requests
from bs4 import BeautifulSoup
url = 'https://www.proxy-list.download/HTTPS'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
item = soup.select_one("a#btn3").text
print(item)
Output I'm having:
Copy to clipboard
I'm after this:
104.248.115.236:80
104.248.53.46:3128
104.236.248.219:3128
104.248.115.236:3128
104.248.115.236:8080
104.248.184.16:8080
This is how that content is visible on that page:
Try this link https://www.proxy-list.download/api/v0/get?l=en&t=https (which you can find using dev tools) to get them all, as shown below:
import requests

url = 'https://www.proxy-list.download/api/v0/get?l=en&t=https'
r = requests.get(url)
# the JSON payload holds the proxies under the 'LISTA' key
for item in r.json()[0]['LISTA']:
    proxy = f"{item['IP']}:{item['PORT']}"
    print(proxy)
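As a quick usage sketch, once you have a proxy you can pass it to requests via its proxies parameter. The echo URL below is just a common test endpoint, and any given proxy from the list may of course be dead by the time you try it:
import requests

proxy = '104.248.115.236:3128'  # one of the entries printed above
try:
    r = requests.get('https://httpbin.org/ip',
                     proxies={'https': f'http://{proxy}'}, timeout=5)
    print(r.text)  # should echo the proxy's IP, not yours
except requests.RequestException as e:
    print('proxy failed:', e)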
I'm new to web scraping, and for learning purposes I want to find all the href links on the https://retty.me/ website.
But I found that my code only finds one link on that website. When I viewed the page source it had many links that didn't print. I also printed the full page, and it contains only one link.
What did I do wrong?
Please correct me.
Here is my Python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

data = []
html = urlopen('https://retty.me')
soup = BeautifulSoup(html, 'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
file = open('scraped_data.txt', 'w')
for item in data:
    file.write("%s\n" % item)
file.close()
If you enter the message shown in the HTML you get back into Google Translate, it says "We apologize for your trouble".
They don't want people scraping their site, so they filter requests based on the user agent. You just need to add a user agent to the request header that looks like a browser.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

data = []
url = 'https://retty.me'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
html = urlopen(req)
soup = BeautifulSoup(html, 'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
for item in data:
    print(item)
In fact, this particular site only requires the presence of the User-Agent header and will accept any value, even an empty string. The requests library, as mentioned by Rishav, provides a user agent by default; that's why it works there without adding a custom header.
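A minimal sketch to check that claim with the same urllib approach as above; the empty User-Agent is taken from the observation in this answer, not from anything the site documents:
from urllib.request import urlopen, Request

# even an empty User-Agent string should get the real page back
req = Request('https://retty.me', headers={'User-Agent': ''})
html = urlopen(req).read()
print(len(html))  # a full page rather than the short apology message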
I don't know why the website returns different HTML when used with urllib, but you can use the excellent requests library, which is much easier to use than urllib anyway.
from bs4 import BeautifulSoup
import re
import requests
data = []
html = requests.get('https://retty.me').text
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
print(data)
You can find the official documentation for requests here and for Beautiful Soup here.
import requests
from bs4 import BeautifulSoup
# your Response object called response
response = requests.get('https://retty.me')
# your html as string
html = response.text
#verify that you get the correct html code
print(html)
#make the html, a soup object
soup = BeautifulSoup(html, 'html.parser')
# initialization of your list
data = []
# append to your list all the URLs found within a page’s <a> tags
for link in soup.find_all('a'):
    data.append(link.get('href'))
#print your list items
print(data)
I am scraping data using Beautiful Soup. I have a list of URLs I want to loop my code through, so I need to include a variable in the urllib2.Request command. When I add a variable to urllib2.Request I get this error (from line 1240 of urllib2.py):
raise URLError('unknown url type: %s' % type)
Here is my code:
from bs4 import BeautifulSoup
import urllib2
webstring = "/DIRECTORY/"+"'"
webfull = "urllib2.Request('http://www.caao.org"+webstring+", None, headers)"
print webfull
#webfull prints: urllib2.Request('http://www.caao.org/DIRECTORY/', None, headers)
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(webfull).read()
soup = BeautifulSoup(html)
print soup
The variable webfull prints out the correct code. I can cut and paste it into urlopen and it will work. Just like this:
from bs4 import BeautifulSoup
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request('http://www.caao.org/DIRECTORY/', None, headers)).read()
soup = BeautifulSoup(html)
print soup
I've tried using multiple websites as tests, and I have tried using triple-double quotes on certain strings (like below), but I always get the same unknown url type error.
webstring = "/DIRECTORY/"+"'"
web1 = """'http://www.caao.org"""+webstring+", None, headers)"
As a side note:
I'm new to Python, and I am trying to scrape data from multiple pages within the same website. The code above is meant to let me run down my list of URLs and run my Beautiful Soup code on each page. If there is an easier way to loop through a list of URLs and use urllib2.urlopen to open each page so I can run my scraping code, let me know.
Just construct your URL dynamically, then pass it to the functions. Don't pass a string representation of the functions you wish to call - that won't work.
from bs4 import BeautifulSoup
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
webstring = "/DIRECTORY/"
url = "http://www.caao.org"+webstring
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
print soup
The problem you are having here is that you are trying to interpret a string as if it were a piece of code.
urllib2 expects the string you pass in to be just a URL, not Python source. What you probably should do is:
from bs4 import BeautifulSoup
import urllib2
webstring = "/DIRECTORY/"
url = "http://www.caao.org"+webstring
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
print soup
Here the string is just the URL you want, and you then pass it into urllib2.Request.
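And to address the side note about looping through a list of URLs: once the URL is built as a plain string, a simple loop works. The paths below are made-up examples, not real pages on caao.org:
from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
paths = ["/DIRECTORY/", "/ANOTHER-PAGE/"]  # example paths only
for path in paths:
    url = "http://www.caao.org" + path
    html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    soup = BeautifulSoup(html)
    print soup  # run your scraping code on each page here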