Read scripts on a site using Python

I'm currently trying to write a Python script that notifies me by mail when a site updates its selection of apartments. However, when I use Beautiful Soup, the site doesn't return a list of items, but rather the script that selects all relevant houses instead of the results of that script. Is there any way for me to retrieve the HTML of a site as I would normally see it as a user? This is the rather simple code I've written, in case that helps:
import requests
from bs4 import BeautifulSoup

html = "https://somesite"  # placeholder for the real site
response = requests.get(html)
text = BeautifulSoup(response.text, "html.parser")
text.find_all("script")

You need to execute the JavaScript the way a web browser does, then parse the resulting HTML. I use Selenium; there are other tools.
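A minimal sketch of the Selenium route (the specific browser and the URL are placeholders, not from the original answer):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # any installed browser driver works
driver.get("https://somesite")  # placeholder URL
# page_source is the HTML after the JavaScript has run
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(soup.get_text())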

import requests
from bs4 import BeautifulSoup

html = "https://somesite"  # placeholder for the real site
response = requests.get(html)
text = BeautifulSoup(response.text, "html.parser")
text.text  # returns the text of the whole document (note: script contents are included unless you remove the script tags first)


Python, extract text from webpage

I am working on a project where I am crawling thousands of websites to extract text data; the end use case is natural language processing.
EDIT: since I am crawling hundreds of thousands of websites, I cannot tailor scraping code to each one, which means I cannot search for specific element IDs. The solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but on the sides, top, and bottom there may be links or text about other subjects, promotions, or other content.
With .get_text() all the text on the page is returned in one go, combining the relevant parts with the irrelevant ones. Is there another function similar to .get_text() that returns all the text as a list, where every list item is a specific section of the text, so that it can be known where new subjects start and end?
As a bonus, is there a way to identify the main body of text on a web page?
Below are snippets you can use to query data the way you want using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the body's children in list form
print(soup.body.contents)
# Print the first div found on the page
print(soup.find('div'))
# Print all divs on the page in list form
print(soup.find_all('div'))
# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all elements matching the CSS selectors, in list form
print(soup.select('.required-css-selector'))  # pass the selectors as a string
# Print an attribute's value
print(soup.find(id='someid').get("attribute-name"))
# You can also break one large query into multiple queries
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenges implementing this, or if your requirement is something else.
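For the bonus question (identifying the main body of text on a page), the snippets above won't do it on their own. One general-purpose option, not mentioned in the original answer, is a content-extraction library such as trafilatura; a minimal sketch, assuming it suits your pages:

import trafilatura

downloaded = trafilatura.fetch_url('https://yoursite/page')
# extract() heuristically strips navigation, sidebars and footers,
# returning the main article text (or None if nothing is found)
print(trafilatura.extract(downloaded))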

Crawl a webpage which is generated by JavaScript

I want to crawl the data from this website.
I only need the text "Pictograph - A spoon 勺 with something 一 in it".
I checked Network -> Doc and I think the information is hidden here.
I found a line that reads:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
I think this generates the part of the page that we can see at the link.
However, I don't know what this code is. It has HTML, but it also contains so many function() calls.
Update
If the code is JavaScript, I would like to know how I can crawl the website without using Selenium.
Thanks!
This page uses JavaScript to add this element. Using Selenium I can get the HTML after this element is added, and then I can search for the text in that HTML. This HTML has a strange construction - all the text is in one tag, so the part we want has no special tag to find it by. But it is the last text in that tag and it starts with "Formation:", so I use BeautifulSoup to get all the text with all subtags using get_text(), and then split('Formation:') to get the text after that label.
import selenium.webdriver
from bs4 import BeautifulSoup as BS

driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')
soup = BS(driver.page_source, 'html.parser')
driver.quit()
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]
print(text.strip())
Selenium may be slower, but it was faster to create a solution with it.
If I could find the URL that the JavaScript uses to load the data, I would use it without Selenium, but I didn't see this information in the XHR responses. A few responses were compressed (probably gzip) or encoded; maybe the text was in there, but I didn't try to uncompress/decode them.
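For what it's worth, requests decompresses gzip- and deflate-encoded responses transparently when the server declares them via the Content-Encoding header, so a compressed XHR endpoint can often be read directly; a sketch with a placeholder URL:

import requests

# requests sends Accept-Encoding: gzip, deflate by default
# and decodes the body automatically based on Content-Encoding
response = requests.get('https://example.com/xhr-endpoint')  # placeholder
print(response.text)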

Getting the XPath from an HTML document

https://next.newsimpact.com/NewsWidget/Live
I am trying to code a Python script that will grab a value from an HTML table at the link above. The link above is the site I am trying to grab from, and this is the code I have written. I think my XPath may be incorrect, because it has been doing fine on other elements, but the path I'm using returns/prints nothing.
from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)
# This should create a list with the target cell's text:
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')
print('Value: ', value)
What is strange is that when I view the page source, I can't find the table I am trying to pull from.
Thank you for your help!
The required data is absent from the initial page source - it comes from an XHR request. You can get it as below:
import requests
response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
first_previous = response['Items'][0]['Previous'] # Current output - "2.632"
second_previous = response['Items'][1]['Previous'] # Currently - "0.2"
first_forecast = response['Items'][0]['Forecast'] # ""
second_forecast = response['Items'][1]['Forecast'] # "0.3"
You can parse the response as a plain Python dict and pull out all the required data.
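For instance, to walk every row instead of indexing items one by one (a sketch; only the 'Items', 'Previous' and 'Forecast' keys are confirmed above):

for item in response['Items']:
    # each item is a plain dict, as shown above
    print(item['Previous'], item['Forecast'])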
Your problem is simple: requests doesn't handle JavaScript at all. The values are JS-generated!
If you really need to run this XPath, you need to use a module capable of understanding JS, like spynner.
You can test whether you need JS by first using curl, or by disabling JS in your browser. With Firefox: type about:config in the navigation bar, then search for javascript.enabled and double-click it to toggle between true and false.
In Chrome, open the Chrome dev tools; the option to disable JavaScript is in the settings there.
Check https://github.com/makinacorpus/spynner
Another (possible) problem: use tree = html.fromstring(page.text), not tree = html.fromstring(page.content).

Python: Perform Google Search and extract only the content from the individual top 10 results

I am trying to write a script which performs a Google search for the input keyword and returns only the content from the top 10 URLs.
Note: "Content" specifically refers to the content being requested by the searched term, found in the body of the returned URLs.
I am done with the search and top-10 URL retrieval part. Here is the script:
from google import search
top_10_links = search(keyword, tld='com.in', lang='en', stop=10)
However, I am unable to retrieve only the content from the links without knowing their structure. I can scrape content from a particular site by finding the class etc. of the tags using dev tools, but I can't figure out how to get content from the top 10 result URLs, since every searched term yields different URLs (and different sites have different CSS selectors), so it would be pretty hard to find the CSS class of the required content for each. Here is sample code to extract content from one particular site:
content_dict = {}
i = 1
for page in links:
    print(i, ' # link: ', page)
    article_html = get_page(page)  # get_page() returns the page's html
    soup = BeautifulSoup(article_html, 'lxml')
    content = soup.find('div', {'class': 'entry-content'}).get_text()
    content_dict[page] = content
    i += 1
However, the CSS class changes across sites. Is there some way I can get this script working and extract the desired content?
You can't do scraping without knowing the structure of what you're scraping. But there is a package that does something similar: take a look at newspaper.
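A minimal sketch of the newspaper (newspaper3k) workflow for a single URL - it uses its own heuristics to locate the main article body, so no per-site CSS classes are needed:

from newspaper import Article

article = Article('https://example.com/some-article')  # placeholder URL
article.download()
article.parse()
print(article.title)
print(article.text)  # the extracted main body text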

Is it possible to find this link text with requests?

At the URL https://www.airbnb.com/rooms/3093543, there is a map that loads near the bottom of the page containing a ‘neighborhood’ box that says Presidio. It’s stored in a link tag whose text is Presidio.
I'm trying to get it with this:
profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
print profile.select('div[id="hover-card"]')[0].find('a').text
# div[id="hover-card"] is not found
I’m not sure if this is a dynamic value that can only be retrieved with another module, or whether it is possible to get it with requests.
You can get that data via another element.
Try this:
profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
print profile.select('meta[id="_bootstrap-neighborhood_card"]')[0]
And if needed request the map via:
https://www.airbnb.pt/locations/api/neighborhood_tiles.json?ids%5B%5D=ID
Where the ID in the above URL is given by the neighborhood_basic_info attribute in the first print.
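A sketch of how the two steps might chain together - the exact shape of the meta tag's payload isn't shown above, so the 'content' attribute and the key path into neighborhood_basic_info are assumptions:

import json
import requests
from bs4 import BeautifulSoup

profile = BeautifulSoup(requests.get("https://www.airbnb.com/rooms/3093543").content, "html.parser")
meta = profile.select('meta[id="_bootstrap-neighborhood_card"]')[0]
# Assumption: the meta tag carries JSON in its "content" attribute
data = json.loads(meta["content"])
# Assumption: neighborhood_basic_info contains an "id" field
neighborhood_id = data["neighborhood_basic_info"]["id"]
tiles = requests.get(
    "https://www.airbnb.pt/locations/api/neighborhood_tiles.json",
    params={"ids[]": neighborhood_id},
).json()
print(tiles)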
