I am new to web crawling, thanks for helping out. The task I need to perform is to obtain the full HTTP response returned from a Google search. When searching on Google with a keyword in the browser, the returned page contains a section:
Searches related to XXXX (where XXXX is the searched words)
I need to extract this section of the web page. From my research, most of the current packages for Google crawling are not able to extract this section of information. I tried to use urllib2, with the following code:
import urllib2
url = "https://www.google.com.sg/search?q=test&ie=&oe=#q=international+business+machine&spf=187"
req = urllib2.Request(url, headers={'User-Agent' : 'Mozilla/5.0'})
con = urllib2.urlopen( req )
strs = con.read()
print strs
I am getting a large chunk of text which looks like a legit HTTP response, but within the text, there isn't any content related to my search keyword "international business machine". I know Google probably detects that this is not a request from an actual browser and hence hides this info. May I know if there is any way to bypass this and obtain the "related search" section of the Google result? Thanks.
As pointed out by @anonyXmous, the useful post to refer to is here:
Google Search Web Scraping with Python
with
from requests import get
keyword = "international business machine"
url = "https://google.com/search?q="+keyword
raw = get(url).text
print raw
I am able to get the needed text in "raw"
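Once the HTML is in "raw", the related-searches block still has to be extracted. Below is a minimal regex sketch; note that the div id "brs" and the sample markup are assumptions for illustration only, so inspect the actual response to find the markers Google really uses:

```python
import re

def extract_related(html):
    """Return the anchor texts inside a (hypothetical) related-searches block."""
    # The id "brs" is an assumption for this sketch; check the real response.
    block = re.search(r'<div id="brs".*?</div>', html, re.DOTALL)
    if not block:
        return []
    # Keep only the link texts inside the block.
    return re.findall(r'<a[^>]*>(.*?)</a>', block.group(0))

sample = ('<div id="brs"><a href="/search?q=ibm+stock">ibm stock</a>'
          '<a href="/search?q=ibm+careers">ibm careers</a></div>')
print(extract_related(sample))  # ['ibm stock', 'ibm careers']
```

For real pages an HTML parser such as BeautifulSoup is more robust than a regex, but the idea is the same: locate the container, then collect the anchor texts.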
I am trying to perform a GET request on TCG Player via Requests in Python. I checked the site's robots.txt, which specifies:
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This is my first time seeing a robots.txt file.
My code is as follows:
import requests
url = "http://www.tcgplayer.com"
r = requests.get(url)
print(r.text)
I cannot include r.text in my post because the character limit would be exceeded.
I would have expected to receive the HTML content of the webpage, but I got an 'unfriendly' response instead. What is the meaning of the text above? Is there a way to get the HTML so I can scrape the site?
By 'unfriendly' I mean:
The HTML that is returned does not match the HTML that is produced by typing the URL into my web browser.
This is probably due to some server-side rendering of web content, as indicated by the empty <div id="app"></div> block in the scraped result. To properly handle such content, you will need to use a more advanced web scraping tool, like Selenium. I'd recommend this tutorial to get started.
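As for the robots.txt shown in the question: it is not what hides the content. It simply allows every path for every user agent while asking crawlers to wait 10 seconds between requests. Such rules can be checked programmatically with the standard library's urllib.robotparser; a small sketch:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the question.
robots_txt = """\
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any user agent may fetch any path...
print(rp.can_fetch("my-bot", "https://www.tcgplayer.com/"))  # True
# ...but is asked to wait 10 seconds between requests.
print(rp.crawl_delay("my-bot"))  # 10
```

In a real crawler you would call rp.set_url(...) and rp.read() against the live robots.txt instead of pasting its content.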
I'm trying to scrape data from this review site. It first goes through the first page, checks if there's a 2nd page, and then goes to that too. The problem is when getting to the 2nd page: the page takes time to update, and I still get the first page's data instead of the 2nd's.
For example, if you go here, you will see how it takes time to load page 2 data.
I tried to put a timeout or sleep, but it didn't work. I'd prefer a solution with minimal package/browser dependency (like webdriver.PhantomJS()), as I need to run this code in my employer's environment and I'm not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
from bs4 import BeautifulSoup

softwareadvice = 'https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/'
hdr = {'User-Agent': 'Mozilla/5.0'}

req = Request(softwareadvice, headers=hdr)
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")
true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while true:
    true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
    if not true:
        break
    req = Request(softwareadvice + '?review.page=2', headers=hdr)
    sleep(10)
    webpage = urlopen(req, timeout=10)
    sleep(10)
    webpage = webpage.read().decode('utf-8')
    parsed_html = BeautifulSoup(webpage, features="lxml")
The reviews are loaded from an external source via an Ajax request. You can use this example to see how to load them:
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
    "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)

params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))
params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.
I have been scraping many types of websites and I think in the world of scraping, there are roughly 2 types of websites.
The first one is "URL-based" websites (i.e. you send a request with a URL, and the server responds with HTML tags from which elements can be directly extracted), and the second one is "JavaScript-rendered" websites (i.e. the only response you get is JavaScript, and you can only see HTML tags after it is run).
In the former case, you can freely navigate through the website with bs4. But in the latter case, you cannot always rely on URLs as a rule of thumb.
The site you are going to scrape is built with Angular.js, which is based on client-side rendering. So, the response you get is the JavaScript code, not HTML tags with page content in it. You have to run the code to get the content.
About the code you introduced:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to get the HTML elements. That is why you get the same page (response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I don't think there is an appropriate way to do this. You can use Selenium for this one: you can literally wait until the page fully loads, you can click buttons and anchors, or get the page content at any time.
Headless browsers in selenium might work, which means you don't have to see the controlled browser opening on your computer.
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.
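Before reaching for Selenium, a rough way to check up front whether a page is client-side rendered is to measure how much visible text the raw HTML actually contains, since an empty JavaScript shell has almost none. A heuristic sketch using only the standard library (the 100-character threshold is an arbitrary assumption):

```python
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Counts characters of visible text, ignoring script contents."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script:
            self.text_chars += len(data.strip())

def looks_client_side_rendered(html, threshold=100):
    """Heuristic: very little visible text suggests a JS-rendered shell."""
    p = TextCounter()
    p.feed(html)
    return p.text_chars < threshold

empty_shell = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
print(looks_client_side_rendered(empty_shell))  # True
```

If this returns True for a site, plain requests/bs4 will not be enough and a browser-driving tool is the way to go.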
I've been working on a project to reverse-engineer Twitter's app to scrape public posts from Twitter using an unofficial API, with Python. (I want to create an "alternative" app, which is simply a localhost that can search for a user and get their posts.)
I've been searching and reading everything related to REST, AJAX, and the python modules requests, requests-html, BeautifulSoup, and more.
I can see when looking at twitter on the devtools (for example on Marvel's profile page) that the only relevant requests being sent (by POST and GET) are the following: client_event.json and UserTweets?variables=... .
I understood that these are the relevant messages being received by cleaning the network tab and recording only when I scroll down and load new tweets - these are the only messages that came up which aren't random videos (I cleaned the search using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).
I am new to this field, so I am probably doing something wrong. I can see on UserTweets the response I'm looking for - a beautiful JSON with all the data I need - but I am unable, no matter how much I've been trying to, to access it.
I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored.
I've been trying to send a GET request to:
https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D
by doing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')
but I get back an HTML script that either says Chromium is unsupported, or just a static page without the javascript updating the DOM.
All help appreciated.
Thank you
P.S
I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-))
Before you dive into the actual code, I would first build the correct request to Twitter. I would use a 3rd-party tool focused on REST and APIs, such as Postman, to build and test the required request, and only then write the actual code.
From your question it seems that you'll be using an open API of Twitter, which means you'll only need to send an x-guest-token and basic Bearer authorization in your request headers.
The Bearer is static - you can just browse to twitter and copy/paste
it from the dev tools network monitor.
To get the x-guest-token you'll need something dynamic, because it has an expiration. What I would suggest is to send a curl request to Twitter, parse the token from the response, and put it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).
After you have both of the above, build the required GET request in Postman and test whether you get back the correct response. Only after you have everything working in Postman, write the same in Python or any other language.**
**You can use Postman snippets which automatically generates the code needed in many programming languages.
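As a side note, the long percent-encoded variables part of the UserTweets URL is just URL-encoded JSON, so the request can be built programmatically instead of hand-crafting the string. A sketch (only a few of the fields visible in the question's URL are shown):

```python
import json
from urllib.parse import urlencode

# A subset of the fields from the question's URL; the full set has more flags.
variables = {
    "userId": "15687962",
    "count": 20,
    "includePromotedContent": True,
}
# Compact JSON, then percent-encode it as the "variables" query parameter.
query = urlencode({"variables": json.dumps(variables, separators=(",", ":"))})
url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?" + query
print(url)
```

Building the URL this way makes it easy to tweak the cursor or count between requests without re-encoding anything by hand.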
@TripleS, here is an example of how one may extract JSON data from window.__INITIAL_STATE__ and write it to a text file.
import requests
import re
import json

# get page
result = requests.get('https://twitter.com/ThePSF')

# Extract json from "window.__INITIAL_STATE__={....};"
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured json data
twitter_json = json.loads(json_string)

# Save structured json data to a text file that may help
# you to orient yourself and possibly pick some parts you
# are interested in (if there are any)
with open('twitter_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))
I've just tried the same, but with requests, not the requests_html module. I could get all the site contents, but I would not call it "beautiful".
Also, now I am blocked from accessing the site without logging in, and I think I will probably be blocked again after a few more tries of this script; I've only run it twice.
Here is my small example. Use the official Twitter API instead.
import requests
import bs4

def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup)

if __name__ == '__main__':
    example()
To select a single element with bs4, use select_one (note that select returns a list):
some_text = soup.select_one('locator').getText()
I found one tool for scraping Twitter that has quite a lot of stars on GitHub: https://github.com/twintproject/twint. I did not try it myself and hope it is legal.
What you're missing is the bearer and guest token needed to make your request. If I just hit your endpoint with curl and no headers I get no response. However, if I add headers for the bearer token and guest token then I get that json you're looking for:
curl https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'x-guest-token: 1452696114205847552'
You can get the bearer token (which may not expire that often) and the guest token (which does expire, I think) like this:
The HTML of the Twitter page you go to links a file called main.<some random numbers>.js. Within that JavaScript file is the bearer token. You can recognize it because it is a long string starting with lots of A's.
Take the bearer token and call https://api.twitter.com/1.1/guest/activate.json, using the bearer token as an authorization header:
curl 'https://api.twitter.com/1.1/guest/activate.json' -X POST -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
In python this looks like:
import requests
import json
url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D"
headers = {"authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA", "x-guest-token": "1452696114205847552"}
resp = requests.get(url, headers=headers)
j = json.loads(resp.text)
And now that variable, j, holds your beautiful JSON. One warning: sometimes the response back can be so big that it doesn't seem to fit into a single response. If this happens, you'll notice resp.text isn't valid JSON, but just some portion of a big blob of JSON. To fix this, you'll need to adapt the request to use "stream=True" and stream out the whole response before you try to parse it as JSON.
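The accumulate-then-parse idea from that warning can be sketched like this; the chunks below are simulated, but with requests you would get them from resp.iter_content() after passing stream=True:

```python
import json

def json_from_chunks(chunks):
    """Join streamed byte chunks into one buffer, then parse as JSON."""
    return json.loads(b"".join(chunks).decode("utf-8"))

# Simulated stream: the JSON body arrives split across several chunks.
chunks = [b'{"data": {"user":', b' {"id": "15687962"}', b'}}']
print(json_from_chunks(chunks))  # {'data': {'user': {'id': '15687962'}}}
```

The key point is to never call json.loads on a partial buffer; collect everything first, then parse once.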
I want to detect malicious sites using python.
Now, I've tried using the requests module to get the contents of a website, then I would search for malicious words in it. But I didn't get it to work.
This is all my code: link code
req_check = requests.get(url)
if 'malicious words' in req_check.content:
print ('[Your Site Detect Red Page] ===> '+url)
else:
print ('[Your Site Not Detect Red Page] ===> '+url)
It doesn't work because you're using the requests library wrong.
In your code, you essentially only get the HTML of the virus site (lines of code: req_check = requests.get(url, verify=False) and if 'example for detect ' in req_check.content: {source: https://pastebin.com/6x24SN6v})
In Chrome, the browser runs through a database of known virus links (its more complicated than this) and sees if the link is safe. However, the requests library does not do this. Instead, you're better off using their API. If you want to see how the API can be used in conjunction with requests, you can see my answer on another question: Is there a way to extract information from shadow-root on a Website?
Sidenote: redage() is never called?
Tell the user to enter a website, then use selenium or something to upload the url to virustotal.com
I would comment that your indentation might be messed up where there is other code. Otherwise, it should work flawlessly.
Edit 2
It appeared that the OP was after a way to detect malicious sites in Python. This is documentation from VirusTotal explaining how to leverage their APIs.
Now to give you a working example, this will print a list of engines reporting positive:
import requests

apikey = '<api_key>'

def main():
    scan_url('https://friborgerforbundet.no/')

def scan_url(url):
    # submit the URL for scanning
    params = {'apikey': apikey, 'url': url}
    response = requests.post('https://www.virustotal.com/vtapi/v2/url/scan', data=params)
    scan_id = response.json()['scan_id']

    # fetch the scan report
    report_params = {'apikey': apikey, 'resource': scan_id}
    report_response = requests.get('https://www.virustotal.com/vtapi/v2/url/report', params=report_params)
    scans = report_response.json()['scans']

    # collect engines that flagged the URL
    positive_sites = []
    for key, value in scans.items():
        if value['detected']:
            positive_sites.append(key)
    print(positive_sites)

if __name__ == '__main__':
    main()
I am trying to parse the following page for the list of similar songs:
http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936
The list of similar songs is not present in the page source, but it is present when I use 'Inspect Element' in the browser.
How do I do it?
Current code:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936'
request = urllib2.Request(url)
lyricsPage = urllib2.urlopen(request).read()
soup = BeautifulSoup(lyricsPage)
The code to generate the links is:
for p in soup.find_all('p'):
    s = p.find('a', {"class": 'title'}).get('href')
Which methods are available to do this?
This is probably handled by some Ajax calls, so it will not be in the page source.
I think you would need to "monitor network" through the developer tools in the browser and look for the requests you are interested in,
e.g. a randomly picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
To see the response, enter the above URL in the browser and look at the content that comes back.
So you need to simulate these requests in Python, and after you get the response, parse it for the interesting details.
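One detail about that audioscrobbler URL: it contains a callback=jQuery... parameter, so the server returns JSONP, i.e. JSON wrapped in a function call. Either drop the callback parameter when simulating the request, or strip the wrapper before parsing. A sketch of the stripping approach:

```python
import json
import re

def parse_jsonp(text):
    """Strip a JSONP wrapper like callbackName({...}); and parse the JSON."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    # If there is no wrapper, assume the text is already plain JSON.
    return json.loads(match.group(1) if match else text)

# A tiny simulated JSONP response shaped like the track.getSimilar payload.
sample = 'jQuery1703329798618797213_1380004055342({"similartracks": {"track": []}});'
print(parse_jsonp(sample))  # {'similartracks': {'track': []}}
```

With the wrapper removed, the result is ordinary JSON that can be navigated like any other dict.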