I have a request that sends specific headers and a payload to get PDF content.
In the browser's Network tab the response looks like this:
But when I use the Python requests and Beautiful Soup modules, neither can parse this response, and it can't be written to a file in a readable form.
Here is a part of what I got:
//OK[1,["\x3Chtml\x3E\n\x3Chead\x3E\n\x3CMETA http-equiv\x3D\"Content-Type\" content\x3D\"text/html; charset\x3DUTF-8\"\x3E\n\x3Ctitle\x3EДанные ... \x3C/h2\x3E\n\x3C/div\x3E\n\x3C/div\x3E\n\x3C/body\x3E\n\x3C/html\x3E\n"],0,7]
I tried splitting the text to keep only the part that starts and ends with the html tag, but Beautiful Soup couldn't convert the hex escapes back to normal characters. .encode() and .decode('utf-8') didn't help either.
What would you recommend?
I hope you are doing it this way:
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "lxml")
If not, do it this way; it should work, because when you want the text of an HTML response you should use the .text property.
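If the body you get back is still the //OK[...] wrapper shown above, that is not an encoding problem: the HTML is embedded as a JavaScript string literal (a GWT-style RPC response), so .decode('utf-8') has nothing to fix. A minimal sketch (the helper name is made up, and url/headers stand in for the ones from your request) that rewrites the \xNN escapes before parsing:

import re
import requests
from bs4 import BeautifulSoup

def unescape_js_hex(payload):
    # turn JavaScript \xNN escapes (e.g. \x3C -> '<', \x3D -> '=') into real characters
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), payload)

raw = requests.get(url, headers=headers).text  # use requests.post if your request sends a payload
html = unescape_js_hex(raw).replace('\\n', '\n')  # the \n in the payload are literal too
start = html.find('<html')
end = html.rfind('</html>') + len('</html>')
soup = BeautifulSoup(html[start:end], 'lxml')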
I want to copy the text from this website (https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2) to use later in a Python script.
How can I do this? (It doesn't really work with requests...)
If you google Python web scraping you will find a lot of information!
Basically you start by executing
response = requests.get(url)
which provides you with the HTML content of the webpage. Now you can use Beautiful Soup to navigate through the content to get what you need.
First we need to create a soup:
soup = BeautifulSoup(response.text, "lxml")
in which we can now find the content. If, for example, we want to find all the URLs in the webpage, we can use:
soup.find_all('a')
Here is a complete example for printing all the URLs of a webpage:
import requests
from bs4 import BeautifulSoup
url = "https://google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for link in soup.find_all('a'):
    print(link)
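If you want just the address rather than the whole tag, read each link's href attribute:

for link in soup.find_all('a'):
    print(link.get('href'))  # the bare URL, or None if the tag has no href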
Here is the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
As the information Johann was looking for isn't static but loaded dynamically, I'm writing a second answer to explain how I got it.
When visiting the webpage https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2, open the developer tools of your browser (in my case Firefox, opened by pressing F12).
When the developer tools are open, click on the "network" tab, which will be empty at this point.
Reload the page by clicking the reload arrow or by pressing F5.
Now we can see requests being loaded in the "network" tab.
As we are looking for data loaded after the main page content, we look for "xml" or "json" responses in the "type" column.
Right-click a response of the correct type and click "open in new tab".
If multiple responses match, test each one until you find the information you are looking for.
In this case we found https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml?_=1623933794858
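From there you can request the XML directly and parse it. A minimal sketch (the "xml" parser needs lxml installed, and the tag names inside the file still have to be inspected by hand):

import requests
from bs4 import BeautifulSoup

# URL found in the "network" tab; the ?_=... part is just a cache-busting timestamp
url = "https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml?_=1623933794858"
response = requests.get(url)
soup = BeautifulSoup(response.content, "xml")
print(soup.prettify()[:500])  # look at the structure before extracting anything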
I am trying to extract data from the website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return empty results when I use the class 'list-nw'.
I tried different parsers with BS, but got the same result. On closer look, I noticed that the view-source has the data I need, so I fetch the page content as text, which has the data, rather than going through the class.
How do I extract the entire array for the key "LstrationaleDetails" inside the variable var Model (line 793 of the source) using regex?
I tried several regexes but was unable to. Is regex the only option, or can I use Scrapy or BS? I'm also confused about how to store it after extracting; if it were JSON I could deserialize it. I was thinking of something along the lines of split and eval.
I tried this for BS.
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen(quote_page)  # quote_page is the URL above
soup = BeautifulSoup(page, 'html5lib')  # parser name is 'html5lib', not 'html5lib.parser'
print(soup)
Thanks for the help.
Credit to @t.m.adam.
You can use the following regex to extract the Model data from the source HTML. Use the DOTALL flag so the pattern can match across newlines. A User-Agent header is required.
import requests
import re
import json

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
# capture everything between "var Model =" and the statement that follows it
# ('Ratinoal' mirrors the spelling in the page's own script)
data = re.search(r'var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
result = json.loads(data)
for item in result['LstrationaleDetails']:
    print(item)
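To answer the storage part of the question: after json.loads, result is a plain Python dict, so no split or eval is needed; it can be written straight back out with json.dump (the file name here is made up):

with open('rationale.json', 'w', encoding='utf-8') as f:
    json.dump(result['LstrationaleDetails'], f, ensure_ascii=False, indent=2)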
Somebody is handing my function a BeautifulSoup object (BS4) that they obtained using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
    url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.
If the url variable is a string containing an actual URL, then you should forget BeautifulSoup here and just use that url variable directly. BeautifulSoup is for parsing HTML code, not a bare URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
If on the other hand url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link depends on what that HTML looks like. Doing a find to get the first a tag, then extracting the href value, would be one way:
>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'
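And if what you actually need is the URL of the page the soup was built from: a BeautifulSoup object simply doesn't store it, so the cleanest fix is to have the caller pass the URL (or the whole response) alongside the soup. A sketch of that pattern:

import requests
from bs4 import BeautifulSoup

def doSomethingUseful(soup, url):
    # the origin URL has to travel with the soup explicitly;
    # BeautifulSoup keeps only the parsed markup
    print(url, len(soup.find_all('a')))

url = "https://example.com"
response = requests.get(url)
doSomethingUseful(BeautifulSoup(response.text, "html.parser"), url)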
Relatively new to BeautifulSoup. Attempting to obtain the raw HTML from a locally saved HTML file. I've looked around and found that I should probably be using Beautiful Soup for this. Though when I do this:
from bs4 import BeautifulSoup
url = r"C:\example.html"
soup = BeautifulSoup(url, "html.parser")
text = soup.get_text()
print(text)
An empty string is printed out. I assume I'm missing some step. Any nudge in the right direction would be greatly appreciated.
The first argument to BeautifulSoup is the actual HTML, not a URL or a file path. Open the file, read its contents, and pass that in.
Touching upon the previous answer, there are two ways to open an HTML file:
1.
with open("example.html") as fp:
    soup = BeautifulSoup(fp)
2.
soup = BeautifulSoup(open("example.html"))
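Putting that together with the original snippet, a corrected version would be (assuming the file is UTF-8):

from bs4 import BeautifulSoup

path = r"C:\example.html"
with open(path, encoding="utf-8") as fp:  # pass the file's contents, not its path
    soup = BeautifulSoup(fp, "html.parser")

text = soup.get_text()
print(text)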
I'm getting started with web scraping and would like to get the URLs from the page provided below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text, "html.parser")
Now I have all the info of the page in the soup content, and I would like to get the URLs of all the homes shown in the image.
When I inspect any of the homes, Chrome opens this DOM element in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id="lis-results">, but I need a way to navigate to the element. Actually, I need all the URLs (391,479) in a text file.
Zillow has an API and also a Python wrapper for the convenience of this kind of data job, and I'm looking at the code now. All I need are the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed listings.
The issue is that the request you send doesn't return the URLs. In fact, if I look at the response (using e.g. Jupyter), the home links are nowhere in the HTML.
I would suggest a different strategy: these kinds of websites often load their data via JSON files.
From the Network tab of the Web Developer tools in Firefox you can find the URL used to request the JSON file:
Now, with this file you can get all the information needed.
import json
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page) # request the json file
json_response = json.loads(response.text) # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the JSON, you will find a lot of useful information.
The list of all the URLs can be found with
links = [i.attrs['href'] for i in soup.findAll("a",{"class":"hdp-link"})]
Every URL appears twice. If you want them to be unique, you can fix the list, or otherwise look for "hdp-link routable" in the class above.
But I always prefer more than less!
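If you do want each URL only once and written to a text file, as the question asked, a small sketch (the output file name is arbitrary):

seen = set()
with open('zillow_urls.txt', 'w') as f:
    for link in links:
        if link not in seen:  # every URL appears twice in the soup
            seen.add(link)
            f.write(link + '\n')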