Trouble with requests' get in Python - python

I am trying to automatically download files from a website using a list of URLs that I already have. The relevant part of my code looks like this:
for url in urls:
    if len(url) != 0:
        print url
Running this prints the URLs as strings, as expected. However, when I add one new line as below:
for url in urls:
    if len(url) != 0:
        print url
        r = requests.get(url)
an error appears: "Invalid URL u'Document Detail': No schema supplied." Before it breaks, the loop is supposed to print a URL, and previously it did; now it prints "Document Detail" instead. I'm not quite sure why this happens or how to resolve it.
Any help would be appreciated!
EDIT
import csv

urls = []
with open('filename.csv', 'rb') as f:
    reader = csv.reader(f)
    count = 0
    for row in reader:
        urls.append(row[34])

With reference to my comment, "Document Details" is the header of your csv. Skip it. Here's one way to do it.
urls = []
with open('filename.csv', 'rb') as f:
    read = f.readlines()
    urls = [row.split(",")[34] for row in read[1:]]

It is also possible that the layout of your CSV file has changed and the URL is no longer at row[34] (the 35th column, since indexing is zero-based).
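Another way to skip the header, if you prefer to keep csv.reader (a minimal sketch, assuming the same filename.csv and column layout as above):
import csv

urls = []
with open('filename.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # skip the "Document Detail" header row
    for row in reader:
        if row[34]:
            urls.append(row[34])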

Then you could convert url to a string explicitly:
for url in urls:
    if len(url) != 0:
        print str(url)
        r = requests.get(str(url))
And maybe you could share a small piece of your .csv file, please.

Related

Python: Read URL's from text file and save result error

I am using the following code to read the URLs in a text file and save the results in another text file:
import requests

with open('text.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        f = requests.get(url)
        print(url)
        print(f.text, file=open("output.txt", "a"))  # output file
For some reason I am getting a {"error":"Permission denied"} message for each URL. I can paste the URL into the browser and get the correct response. I also tried the following code and it worked OK on a single URL.
import requests
link = "http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524"
f = requests.get(link)
print(f.text, file=open("output11.txt", "a"))
The txt file contains the following URLs:
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=22_Topografikartta_20k%2F3%2F3742%2F374207
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4432
http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=21_Peruskartta_20k%2F3%2F3341%2F334112
I assume I am missing something very simple...Any clues?
Many thanks
Each line has a trailing newline. Simply strip it:
for url in f:
    url = url.rstrip('\n')
    ...
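Applied to the loop above, a minimal sketch (assuming each response body should be appended to output.txt, and using a separate name for the response so the file handle f isn't overwritten):
import requests

with open('text.txt', 'r') as f:            # text file containing the URLs
    with open('output.txt', 'a') as out:    # output file
        for url in f:
            url = url.rstrip('\n')          # strip the trailing newline
            r = requests.get(url)           # keep the response in r, not f
            print(url)
            out.write(r.text)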
You have to use the content from the response. You can use this code in a loop:
import requests

download_url = "http://vanhatpainetutkartat.maanmittauslaitos.fi/getFiles.php?path=W50%2F4%2F4524"
response = requests.get(download_url, stream=True)
with open("document.txt", 'wb') as file:
    file.write(response.content)   # the with block closes the file automatically
print("Completed")

How to add "https://www.example.com/" before scraped URLs in Python that don't already have it

I'm a rookie using Python and I'm trying to scrape a list of URLs from a website and send them to a .CSV file, but I keep getting a bunch of URLs that are only partial. They don't have "https://www.example.com" before the rest of the URL. I've found that I need to add something like "['https://www.example.com{0}'.format(link) if link.startswith('/') else link for link in url_list]" to my code, but where am I supposed to add it? And is that even what I should add? Thanks for any help! Here is my code:
url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = link.get('href')
        if url:
            writer.writerow([url])
f.close()
If you notice anything else that should be changed please let me know. Thank you!
A simple if statement will achieve this. Just check whether https://www.example.com is already in the URL, and if it isn't, prepend it.
url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = link.get('href')
        # updated
        if url != '#' and url is not None:
            # added
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
f.close()
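A slightly more robust alternative (a sketch, assuming relative hrefs should be resolved against https://www.example.com and that soup comes from the question's code) is urllib.parse.urljoin, which also handles links like ../page.html correctly:
import csv
from urllib.parse import urljoin

base = 'https://www.example.com'
url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = link.get('href')
        if url and url != '#':
            # relative hrefs are resolved against base; absolute ones are left alone
            writer.writerow([urljoin(base, url)])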

Removing all characters after URL?

Basically, I'm trying to remove all the characters after the URL extension in a URL, but it's proving difficult. The application works off a list of various URLs with various extensions.
Here's my source:
import requests
from bs4 import BeautifulSoup
from time import sleep

# takes user input for path of panels they want tested
import_file_path = input('Enter the path of the websites to be tested: ')
# takes user input for path of exported file
export_file_path = input('Enter the path of where we should export the panels to: ')

# reads imported panels
with open(import_file_path, 'r') as panels:
    panel_list = []
    for line in panels:
        panel_list.append(line)

x = 0
for panel in panel_list:
    url = requests.get(panel)
    soup = BeautifulSoup(url.content, "html.parser")
    forms = soup.find_all("form")
    action = soup.find('form').get('action')
    values = {
        soup.find_all("input")[0].get("name"): "user",
        soup.find_all("input")[1].get("name"): "pass"
    }
    print(values)
    r = requests.post(action, data=values)
    print(r.headers)
    print(r.status_code)
    print(action)
    sleep(10)
    x += 1
What I'm trying to achieve is an application that automatically tests your username/password against a list of URLs provided in a text document. However, BeautifulSoup returns an incomplete URL when crawling for action attributes, i.e. instead of returning the full http://example.com/action.php it will return action.php as it appears in the page source. The only way I can think to get past this would be to restate the 'action' variable as 'panel' with all characters after the URL extension removed, followed by 'action'.
Thanks!
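One way to get the absolute action URL without trimming strings by hand would be to join the relative action against the URL the page was actually fetched from. A minimal sketch (the function name resolve_action is just illustrative):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def resolve_action(panel_url):
    """Fetch a panel page and return its form's action as an absolute URL."""
    response = requests.get(panel_url.strip())
    soup = BeautifulSoup(response.content, "html.parser")
    action = soup.find('form').get('action')   # may be relative, e.g. "action.php"
    # resolve the relative action against the URL the response came from
    return urljoin(response.url, action)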

web scraping and 403 forbidden: My web scraper is blocked by a website, what should I do to make request?

I wrote a script to pull data from a website, but after several runs my requests started returning 403 Forbidden.
What should I do about this issue?
My code is below:
import requests, bs4
import csv

links = []
with open('1-432.csv', 'rb') as urls:
    reader = csv.reader(urls)
    for i in reader:
        links.append(i[0])

info = []
nbr = 1
for url in links:
    # Problem is here.
    sub = []
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    start = soup.find('em')
    forname = soup.find_all('b')
    name = []
    for b in forname:
        name.append(b.text)
    name = name[7]
    sub.append(name.encode('utf-8'))
    for b in start.find_next_siblings('b'):
        if b.text in ('Category:', 'Website:', 'Email:', 'Phone'):
            sub.append(b.next_sibling.strip().encode('utf-8'))
    info.append(sub)
    print('Page ' + str(nbr) + ' is saved')
    with open('Canada_info_4.csv', 'wb') as myfile:
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        for u in info:
            wr.writerow(u)
    nbr += 1
What should I do to keep making requests to the website?
Example url is http://www.worldhospitaldirectory.com/dr-bhandare-hospital/info/43225
Thanks.
There are a bunch of different things that could be the problem, and depending on their blacklisting policy it might be too late to fix.
At the very least, scraping like this is generally considered rude behavior: you're hammering their server. Try putting a time.sleep(10) inside your main loop.
Secondly, try setting your user agent.
A better solution though would be to see if they have an API you can use.
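For example, a sketch of the first two suggestions combined, dropped into the main loop from the question (the user agent string here is only a placeholder):
import time
import requests

# placeholder user agent string; any common browser UA should do
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

for url in links:
    r = requests.get(url, headers=headers)
    # ... parse and save the page as in the original script ...
    time.sleep(10)   # pause between requests so the server isn't hammered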

LXML unable to retrieve webpage using link from file

Hi, this may look like a repost but it is not. I recently posted a similar question, but this is a different issue linked to that problem. As seen in the previous question (LXML unable to retrieve webpage with error "failed to load HTTP resource"), I am now able to read and print the article if the link is on the first line of the file. However, once I try to do it multiple times, it comes back with an error (http://tinypic.com/r/2rr2mau/8).
import lxml.html
from urlparse import urljoin

def fetch_article_content_cna(i):
    BASE_URL = "http://channelnewsasia.com"
    f = open('cnaurl2.txt')
    line = f.readlines()
    print line[i]
    url = urljoin(BASE_URL, line[i])
    t = lxml.html.parse(url)
    # print t.find(".//title").text
    content = '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()'))
    return content
cnaurl2.txt
/news/world/tripoli-fire-rages-as/1287826.html
/news/asiapacific/korea-ferry-survivors/1287508.html
The lines read from the file keep their trailing newline, which breaks the joined URL. Strip it before joining:
url = urljoin(BASE_URL, line[i].strip())
