I am trying to scrape all the URLs listed in a CSV file. The script should open the CSV file, read each URL, and open each one to find and grab the Source, Author, and License info. It then needs to follow the respective git link to see whether there is a license file; if there is, it should download the license and save the results to a CSV file.
I have the code below in place; however, I am receiving the following error upon reading the first URL in my file:
No connection adapters were found for "['https://tools.kali.org/information-gathering/ace-voip']"
Actual Error:
File "ommitted", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
InvalidSchema: No connection adapters were found for "['https://tools.kali.org/information-gathering/ace-voip']"
I think this is happening because of the added "[' in front of my URL; however, this doesn't exist in my file of listed URLs.
I am new to Python and appreciate any and all help on this.
import ssl
import zlib
import csv
import urllib.request, urllib.parse, urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
#Testing ssl and reading url
#urllib.request.urlopen('https://google.com').read()
# create an SSL context that skips certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://tools.kali.org/tools-listing'
html = urllib.request.urlopen(url, context=ctx)  # .read().decode('utf-8')
de_data = zlib.decompress(html.read(), 16 + zlib.MAX_WBITS)  # gunzip the compressed response body
print(de_data)
soup = BeautifulSoup(de_data, 'lxml')
data = []
for url in soup.find_all('a', href=True, text=True):
    print(url['href'])
    data.append(url['href'])
print(data)
####New Replacement for above that works removing spaces########
with open('kalitools.csv', 'w') as file:
    for url in data:
        file.write(str(url) + '\n')
# loading csv file with URLS and parsing each
######TESTING Reading URLS########
with open('E:/KaliScrape/kalitools.txt', 'r') as f_urls, open('omitted/output.txt', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Source', 'Author', 'License'])
    print(csv_urls)
    for line in csv_urls:
        r = requests.get(line)  # .text
        soup = BeautifulSoup(r, 'lxml')
        # r = requests.get(line[0], verify=False)  # .text
        # for line in csv_urls:
        #     line = 'https://' + line if 'https' not in line else line
        #     source = urlopen(line).read()
        src = soup.find('li')
        print('Source:', src.text)
        auth = soup.find('li')
        print('Author:', auth.text)
        lic = soup.find('li')
        print('License:', lic.text)
        csv_output.writerow([src.text, auth.text, lic.text])
So, the problem is that you are getting a list, and you just need to pick the element at index zero:
for line in csv_urls:
    r = requests.get(line[0])  # .text
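To see why: csv.reader yields each row as a list of fields, even for a one-column file, so line is a one-element list and line[0] is the string requests expects. A minimal sketch, using the URL from the error above:
import csv
import io

# csv.reader yields each row as a *list* of fields, even for a
# one-column file, so requests.get(line) receives a list, not a string
rows = csv.reader(io.StringIO("https://tools.kali.org/information-gathering/ace-voip\n"))
for row in rows:
    print(row)     # ['https://tools.kali.org/information-gathering/ace-voip']
    print(row[0])  # https://tools.kali.org/information-gathering/ace-voip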
Related
I am using the following code to read the URLs in a text file and save the results in another text file:
import requests
with open('text.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        f = requests.get(url)
        print(url)
        print(f.text)
file = open("output.txt", "a")  # output file
For some reason I am getting a {"error":"Permission denied"} message for each URL. I can paste the URL in the browser and get the correct response. I also tried the following code, and it worked OK on a single URL.
import requests
link = "http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524"
f = requests.get(link)
print(f.text, file=open("output11.txt", "a"))
The txt file contains the following URLs:
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=22_Topografikartta_20k%2F3%2F3742%2F374207
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4432
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=21_Peruskartta_20k%2F3%2F3341%2F334112
I assume I am missing something very simple... Any clues?
Many thanks
Each line has a trailing newline. Simply strip it:
for url in f:
    url = url.rstrip('\n')
    ...
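Put together with the question's loop, a minimal sketch (file names taken from the question) looks like this:
import requests

# strip the trailing newline from each line before requesting it
with open('text.txt', 'r') as f, open('output.txt', 'a') as out:
    for url in f:
        url = url.rstrip('\n')
        response = requests.get(url)
        out.write(response.text)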
You have to use the content from the response. You can use this code in a loop:
import requests

download_url = "http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524"
response = requests.get(download_url, stream=True)
with open("document.txt", 'wb') as file:
    file.write(response.content)  # the with-block closes the file for you
print("Completed")
I'm trying to automate the download of docs via Selenium.
I'm using requests.get() to download the file after extracting the URL from the website:
import time
import requests

url = 'https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
myfile = requests.get(url)
open('/Users/hemanthj/Downloads/AB Test/' + "A-Acc-USD" + '.pdf', 'wb').write(myfile.content)
time.sleep(3)
The file is downloaded but is corrupted when I try to open it. The file size is only a few KB at most.
I tried adding the header info from this thread too but no luck:
Corrupted PDF file after requests.get() with Python
What within the headers makes the download work? Any solutions?
The problem was an incorrect URL: it loaded HTML instead of a PDF.
Looking through the site, I found the URL that you were looking for.
Try this code, and then open the document with a PDF reader program.
import requests
import pathlib

def load_pdf_from(url: str, filename: pathlib.Path) -> None:
    response: requests.Response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=1024):
                pdf_file.write(chunk)
    else:
        print(f"Failed to load pdf: {url}")

url: str = 'https://www.schroders.com/hkrprewrite/retail/en/attachment2.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
target_filename: pathlib.Path = pathlib.Path.cwd().joinpath('loaded_pdf.pdf')
load_pdf_from(url, target_filename)
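Since the symptom was HTML arriving where a PDF was expected, a quick inspection of the response makes that diagnosis visible; this sketch checks what the original URL actually returns:
import requests

response = requests.get('https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf')
# an HTML error page gives itself away in the headers and leading bytes
print(response.headers.get('Content-Type'))  # a real PDF reports 'application/pdf'
print(response.content[:5])                  # a real PDF starts with b'%PDF-'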
I tried to crawl a restaurant's address from the Google front-page information panel but am getting an "urllib.error.HTTPError: HTTP Error 403: Forbidden" error, and the program does not run.
I am a fresher in Python web scraping; please help.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import json
import re
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

# get google URL.
url = "https://www.google.com/search?q=barbeque%20nation%20-%20noida"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, 'html.parser')
the_page = soup.prettify("utf-8")
hotel_json = {}
for line in soup.find_all('script', attrs={"type": "application/ld+json"}):
    details = json.loads(line.text.strip())
    # the JSON-LD block carries the name and address used below
    hotel_json["name"] = details["name"]
    hotel_json["address"] = details["address"]["streetAddress"]
    break
with open(hotel_json["name"] + ".html", "wb") as file:
    file.write(the_page)
with open(hotel_json["name"] + ".json", 'w') as outfile:
    json.dump(hotel_json, outfile, indent=4)
Add a user-agent header:
request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
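Applied to the question's code, a minimal sketch (Google may still rate-limit or block automated queries, so treat this as illustrative):
import urllib.request

url = "https://www.google.com/search?q=barbeque%20nation%20-%20noida"
# a browser-like user-agent avoids the blanket 403 for script clients
request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = urllib.request.urlopen(request)
print(response.status)  # 200 once the request is no longer rejected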
I am trying to download a CSV file from a web portal. When doing it manually, we log in to the URL and click the Download CSV button, and it prompts for saving. We are using Python 3.
I am trying to do this via a Python script; when we execute the script we get the HTML page with the name Download CSV, and when we click on that we get a CSV file through it.
import urllib.request
import requests

session = requests.session()
payload = {'j_username': 'avinash.reddy', 'j_password': 'password'}
r = session.post('https://url_of_the_portal/auth/login', data=payload)
r = session.get('URL_of_the_page_where_the_csv_file_exists')
url = 'https://url_of_the_portal/review/download/bm_sis'
print('done')
urllib.request.urlretrieve(url, "Download CSV")
I think it should look like this, plus your login creds:
import csv
import urllib.request

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
# decode the byte stream so csv.reader sees text lines
lines = (line.decode('utf-8') for line in response)
cr = csv.reader(lines)
for row in cr:
    print(row)
Else...
import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
# decode_unicode=True makes iter_lines yield str, which csv.reader needs
text = r.iter_lines(decode_unicode=True)
reader = csv.reader(text, delimiter=',')
Else...
import requests
from contextlib import closing
import csv

url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(decode_unicode=True), delimiter=',', quotechar='"')
    for row in reader:
        # Handle each row here...
        print(row)
How to read a CSV file from a URL with Python?
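Coming back to the portal itself: urllib.request.urlretrieve opens a fresh, unauthenticated connection, so it fetches the login/HTML page rather than the CSV. A hedged sketch that keeps the logged-in session for the download (endpoint placeholders reused from the question):
import requests

session = requests.session()
session.post('https://url_of_the_portal/auth/login',
             data={'j_username': 'avinash.reddy', 'j_password': 'password'})
# download with the same session so the auth cookies are sent
with session.get('https://url_of_the_portal/review/download/bm_sis', stream=True) as r:
    with open('download.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)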
I know how to save page source code using urllib2
import urllib2

page = urllib2.urlopen('http://example.com')
page_source = page.read()
with open('page_source.html', 'w') as fid:
    fid.write(page_source)
But how do I save the source using urllib3? With a PoolManager?
Use .data, like this:
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
with open('page_source.html', 'wb') as fid:  # r.data is bytes, so write in binary mode
    fid.write(r.data)
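If you want a text file instead, decode the bytes first (utf-8 is an assumption about the page's encoding):
# decode before writing in text mode; utf-8 assumed
with open('page_source.html', 'w', encoding='utf-8') as fid:
    fid.write(r.data.decode('utf-8'))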