I wrote Python code to search for an image on Google with some Google dork keywords. Here is the code:
def showD(self):
    # Ask the user for a domain to restrict the search with a site: dork.
    self.text, ok = QInputDialog.getText(self, 'Write A Keyword', 'Example:"twitter.com"')
    if ok:
        self.google()

def google(self):
    filePath = self.imagePath
    domain = self.text
    searchUrl = 'http://www.google.com/searchbyimage/upload'
    multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': '', 'q': f'site:{domain}'}
    # Google answers the upload with a redirect to the results page.
    response = requests.post(searchUrl, files=multipart, allow_redirects=False)
    fetchUrl = response.headers['Location']
    webbrowser.open(fetchUrl)

App = QApplication(sys.argv)
window = Window()
sys.exit(App.exec())
I just can't figure out how to display the URL of the search results in my program. I tried this code:
import requests
from bs4 import BeautifulSoup

query = "twitter"
search = query.replace(' ', '+')
results = 15
url = f"https://www.google.com/search?q={search}&num={results}"

requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")

for link in links:
    link_href = link.get('href')
    # Result links look like /url?q=...; skip anchors without href and cache links.
    if link_href and "url?q=" in link_href and "webcache" not in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            print(link_href.split("?q=")[1].split("&sa=U")[0])
            # print(title[0].getText())
            print("------")
But it only works for a normal Google search keyword and fails when I try to adapt it to the results of a Google image search: it doesn't display any results.
Currently there is no simple way to scrape Google's "Search by image" using plain HTTPS requests. Before responding to this type of request, they presumably check whether the user is real using several sophisticated techniques. Even your working code example does not work for long — it tends to get banned by Google after 20-100 requests.
All public Python solutions that really scrape Google with images use Selenium and imitate real user behaviour, so you can go this way yourself. The Python Selenium bindings are not hard to get used to, except maybe for the setup process.
The best of them, for my taste, is hardikvasa/google-images-download (7.8K stars on GitHub). Unfortunately, this library does not accept an image path or binary image data as input; it only has a similar_images parameter, which expects a URL. Nevertheless, you can try to use it with a http://localhost:1234/... URL (you can easily set one up this way).
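For example, here is a minimal sketch of that localhost trick, serving the image's folder with Python's built-in http.server (the port, directory and file name are placeholders; the similar_images, limit and print_urls arguments come from the library's README):

import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from google_images_download import google_images_download

# Serve the folder that contains the query image on localhost:1234.
handler = partial(SimpleHTTPRequestHandler, directory="/path/to/images")
server = HTTPServer(("localhost", 1234), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Point similar_images at the locally served file (placeholder name).
downloader = google_images_download.googleimagesdownload()
downloader.download({
    "similar_images": "http://localhost:1234/query.jpg",
    "limit": 10,
    "print_urls": True,
})
server.shutdown()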
You can check all these questions and see that all the solutions use Selenium for this task.
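If you decide to drive the real page with Selenium yourself, the rough shape is below. This is only a sketch: the aria-label selector is an assumption, and Google changes its markup often, so expect to adjust it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://images.google.com")

# Open the "search by image" dialog; the selector is a guess.
driver.find_element(By.CSS_SELECTOR, "[aria-label='Search by image']").click()

# File inputs can usually be fed a local path directly, even when hidden.
wait = WebDriverWait(driver, 10)
upload = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='file']")))
upload.send_keys("/path/to/image.jpg")

# Once Google redirects, the result URL can be shown in your UI.
wait.until(lambda d: "search" in d.current_url)
print(driver.current_url)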
I am writing a program that should read certain data from a website and output only some of it (data from a table). However, I ran into a problem. I wrote a program that logs into the website, but from there I have to go to the next page and then open the document with the data. Unfortunately, I have no idea how to switch pages, open the document, and read out the data.
Does anyone have an idea how I can get there?
from bs4 import BeautifulSoup
import requests

User = ''
Pass = ''
LOGIN_URL = ''
LOGIN_API_URL = ''

def main():
    session_requests = requests.session()

    result = session_requests.get(LOGIN_URL)
    cookies = result.cookies
    soup = BeautifulSoup(result.content, "html.parser")
    auth_token = soup.find("input", {'name': 'logintoken'}).get('value')

    payload = {'username': User, 'password': Pass, 'logintoken': auth_token}

    result = session_requests.post(
        LOGIN_API_URL,
        data=payload,
        cookies=cookies
    )

    # Report successful login
    print("Login succeeded: ", result.ok)
    print("Status code:", result.status_code)
    print(result.text)

    # Get Data

    # Close Session
    session_requests.close()
    print('Session closed')

# Entry point
if __name__ == '__main__':
    main()
You should read up on Selenium with Python. Since there are no specific URLs or login details (which you shouldn't post here anyway), it would be quite hard for any of us to create a working example, since we don't have anything to work with.
Try using Selenium from the link above, and if you have any questions or run into any issues from there, come back and ask that specific question.
BS4 and requests can be powerful, but Selenium emulates a web browser and lets you move through websites like a "human" would. Start there.
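To give a rough idea of the shape, here is a minimal sketch; the URLs, field names and table selector are placeholders, since the real ones weren't posted:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Log in (URL and field names are placeholders).
driver.get("https://example.com/login")
driver.find_element(By.NAME, "username").send_keys("user")
driver.find_element(By.NAME, "password").send_keys("pass")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# The session carries over, so you can simply load the next page.
driver.get("https://example.com/documents/data")

# Read the table row by row.
for row in driver.find_elements(By.CSS_SELECTOR, "table tr"):
    cells = [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
    if cells:
        print(cells)

driver.quit()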
The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
So I did:
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag on 500px.
But now the script stops running and nothing happens.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"

source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)
    print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript, using an XHR request to the following API.
So you can reach it directly via the API.
Note that you can increase the rpp=50 parameter to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to write it directly!
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        # The server names the file via Content-Disposition; strip the "filename=" prefix.
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
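Since the query string already carries page=1, you can also walk the whole feed by incrementing that parameter. A short sketch (the query string is trimmed here, and the empty-list stopping condition is an assumption about how the API signals the last page):

import requests

base = ("https://api.500px.com/v1/photos?rpp=50&feature=editors"
        "&image_size%5B%5D=2048&formats=jpeg%2Clytro&page={}")

with requests.Session() as req:
    page = 1
    while True:
        photos = req.get(base.format(page)).json().get('photos', [])
        if not photos:  # assumption: an empty list marks the last page
            break
        for item in photos:
            print(item['url'])
        page += 1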
Looking at the page you're trying to scrape, I noticed something: the data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run JS on the pages it pulls. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag, you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are either to see what APIs they're calling to get this data (check the network tab of your browser's inspector; if you're lucky, they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is Selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine the tooling around it would drastically increase the complexity of what you're trying to do.
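For completeness, a Selenium version of the original loop might look roughly like the sketch below. It assumes the photo_link class really is present once the JS has run; the explicit wait is the part plain requests cannot do.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://500px.com/editors")

# Wait until the JS framework has rendered the photo links.
links = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.photo_link"))
)
for link in links:
    print(link.get_attribute("href"))

driver.quit()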
I want to detect malicious sites using python.
Now, I've tried using the requests module to get the contents of a website and then search for malicious words in it, but I didn't get it to work.
This is my full code: link code
req_check = requests.get(url)

if 'malicious words' in req_check.content:
    print('[Your Site Detect Red Page] ===> ' + url)
else:
    print('[Your Site Not Detect Red Page] ===> ' + url)
It doesn't work because you're using the requests library wrong.
In your code, you essentially only get the HTML of the virus site (the lines req_check = requests.get(url, verify=False) and if 'example for detect ' in req_check.content: from your paste: https://pastebin.com/6x24SN6v).
In Chrome, the browser runs through a database of known virus links (it's more complicated than this) and sees if the link is safe. However, the requests library does not do this. Instead, you're better off using their API. If you want to see how the API can be used in conjunction with requests, you can see my answer on another question: Is there a way to extract information from shadow-root on a Website?
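If by "their API" you mean the database Chrome consults, that is exposed as the Google Safe Browsing Lookup API. A minimal sketch, assuming you have an API key (the body follows the v4 threatMatches:find request format; the client fields are placeholders):

import requests

API_KEY = "<api key>"
url_to_check = "http://example.com/"

body = {
    "client": {"clientId": "my-checker", "clientVersion": "1.0"},
    "threatInfo": {
        "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
        "platformTypes": ["ANY_PLATFORM"],
        "threatEntryTypes": ["URL"],
        "threatEntries": [{"url": url_to_check}],
    },
}

resp = requests.post(
    "https://safebrowsing.googleapis.com/v4/threatMatches:find",
    params={"key": API_KEY},
    json=body,
)
# An empty JSON object means no threat was found.
print(resp.json().get("matches", "no matches"))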
Side note: redage() is never called?
Tell the user to enter a website, then use Selenium or something similar to submit the URL to virustotal.com.
I would note that your indentation might be messed up where there is other code. Otherwise, it should work flawlessly.
Edit 2
It appears that the OP was after a way to detect malicious sites in Python. This documentation from VirusTotal explains how to leverage their APIs.
Now, to give you a working example, this will print a list of engines reporting positive:
import requests

apikey = '<api_key>'

def main():
    scan_url('https://friborgerforbundet.no/')

def scan_url(url):
    # Submit the URL and get a scan id back.
    params = {'apikey': apikey, 'url': url}
    response = requests.post('https://www.virustotal.com/vtapi/v2/url/scan', data=params)
    scan_id = response.json()['scan_id']

    # Fetch the report for that scan id.
    report_params = {'apikey': apikey, 'resource': scan_id}
    report_response = requests.get('https://www.virustotal.com/vtapi/v2/url/report', params=report_params)
    scans = report_response.json()['scans']

    # Collect the engines that flagged the URL.
    positive_sites = []
    for key, value in scans.items():
        if value['detected']:
            positive_sites.append(key)
    print(positive_sites)

if __name__ == '__main__':
    main()
I am setting up code to check the reputation of any URL, e.g. http://go.mobisla.com/, on the website "https://www.virustotal.com/gui/home/url".
First, the very basic thing I am doing is extracting all the website contents using BeautifulSoup, but it seems the information I am looking for is in a shadow-root (open) -- div.detections and span.individual-detection.
Example Copied Element from Webpage results:
No engines detected this URL
I am new to Python and wondering if you can share the best way to extract this information.
I tried the requests.get() function, but it doesn't give the required information.
import requests
from bs4 import BeautifulSoup

url_check = "deloplen.com:443"
url = "https://www.virustotal.com/gui/home/url"

req = requests.get(url + url_check)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
I expect to see "2 engines detected this URL", along with a detection example like "Dr.Web: Malicious".
If you scrape their website, it'll only return VirusTotal's loading screen, as this isn't the proper way.
Instead, what you're supposed to do is use their public API to make requests. However, you'll have to make an account to obtain a Public API Key.
You can use this code, which is able to retrieve the JSON info about the link. However, you'll have to fill in the API key with yours.
import requests, json

user_api_key = "<api key>"
resource = "deloplen.com:443"

# feel free to remove this, it just makes the output look nicer
def pp_json(json_thing, sort=True, indents=4):
    if type(json_thing) is str:
        print(json.dumps(json.loads(json_thing), sort_keys=sort, indent=indents))
    else:
        print(json.dumps(json_thing, sort_keys=sort, indent=indents))

response = requests.get("https://www.virustotal.com/vtapi/v2/url/report?apikey=" + user_api_key + "&resource=" + resource)
json_response = response.json()

pp_json(json_response)
If you want to learn more about the API, you can use their documentation.
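To get the "N engines detected this URL" figure the question asks about, you can read the positives and total fields from the same report. A short sketch; it assumes the scan has completed (response_code == 1 in the JSON):

# Assumes json_response comes from the /url/report call above.
if json_response.get("response_code") == 1:
    print(f"{json_response['positives']} engines detected this URL "
          f"(out of {json_response['total']})")
    # List the engines that flagged it, e.g. "Dr.Web -> malicious site".
    for engine, result in json_response["scans"].items():
        if result["detected"]:
            print(engine, "->", result["result"])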
I was trying to develop a Python script for my friend which would take a link to a public album and count the likes and comments of every photo with the "requests" module. This is the code of my script:
import re
import requests

def get_page(url):
    r = requests.get(url)
    content = r.text.encode('utf-8', 'ignore')
    return content

if __name__ == "__main__":
    url = 'https://www.facebook.com/media/set/?set=a.460132914032627.102894.316378325074754&type=1'
    content = get_page(url)
    content = content.replace("\n", '')
    chehara = r"(\d+) likes and (\d+) comments"
    cpattern = re.compile(chehara)
    result = re.findall(cpattern, content)
    for jinish in result:
        print "likes " + jinish[0] + " comments " + jinish[1]
But the problem here is that it only parses the likes and comments of the first 28 photos, and no more. What is the problem? Can somebody please help?
[Edit: the "requests" module just loads the web page; that is, the variable content contains the full HTML source of the Facebook page of the linked album.]
Use the Facebook Graph API.
For albums, it's documented here:
https://developers.facebook.com/docs/reference/api/album/
Use the limit attribute for testing, since it's rather slow:
http://graph.facebook.com/460132914032627/photos/?limit=10
EDIT
I just realized that like_count is not part of the JSON; you may have to use FQL for that.
If you want to see the next page, you need to add the after attribute to your request, as in this URL:
https://graph.facebook.com/albumID/photos?fields=likes.summary(true),comments.summary(true)&after=XXXXXX&access_token=XXXXXX
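In Python, that paging loop might look roughly like the sketch below. The album ID and access token are placeholders, and the cursor handling follows the Graph API's standard paging format:

import requests

ALBUM_ID = "460132914032627"  # placeholder album ID
ACCESS_TOKEN = "XXXXXX"       # placeholder token

url = f"https://graph.facebook.com/{ALBUM_ID}/photos"
params = {
    "fields": "likes.summary(true),comments.summary(true)",
    "access_token": ACCESS_TOKEN,
    "limit": 25,
}

while True:
    data = requests.get(url, params=params).json()
    for photo in data.get("data", []):
        likes = photo.get("likes", {}).get("summary", {}).get("total_count", 0)
        comments = photo.get("comments", {}).get("summary", {}).get("total_count", 0)
        print("likes", likes, "comments", comments)

    # Standard Graph API cursor paging: follow the 'after' cursor until it's gone.
    after = data.get("paging", {}).get("cursors", {}).get("after")
    if not after:
        break
    params["after"] = after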
You could take a look at this JavaScript project for reference.