I am new to Python and just wanted to know if this is possible: I have scraped a URL using urllib and want to edit different pages.
Example:
http://test.com/All/0.html
I want the 0.html to become 50.html and then 100.html and so on ...
found_url = 'http://test.com/All/0.html'
base_url = 'http://test.com/All/'
for page_number in range(0,1050,50):
    url_to_fetch = "{0}{1}.html".format(base_url, page_number)
That should give you URLs from 0.html to 1000.html
If you want to use urlparse (as suggested in the comments to your question):
import urlparse
found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")
for page_number in range(0,1050,50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    print parsed_url.geturl()
Executing this script would give you the following:
http://test.com/All/0.html
http://test.com/All/50.html
http://test.com/All/100.html
http://test.com/All/150.html
http://test.com/All/200.html
http://test.com/All/250.html
http://test.com/All/300.html
http://test.com/All/350.html
http://test.com/All/400.html
http://test.com/All/450.html
http://test.com/All/500.html
http://test.com/All/550.html
http://test.com/All/600.html
http://test.com/All/650.html
http://test.com/All/700.html
http://test.com/All/750.html
http://test.com/All/800.html
http://test.com/All/850.html
http://test.com/All/900.html
http://test.com/All/950.html
http://test.com/All/1000.html
Instead of printing inside the for loop, you can use the value of parsed_url.geturl() as needed. As mentioned, if you want to fetch the content of each page, you can use the Python requests module in the following manner:
import requests
import urlparse

found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")

for page_number in range(0, 1050, 50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    # print parsed_url.geturl()
    url = parsed_url.geturl()
    try:
        r = requests.get(url)
        if r.status_code == 200:
            with open(str(page_number) + '.html', 'w') as f:
                f.write(r.content)
    except Exception as e:
        print "Error scraping - " + url
        print e
This fetches the content from http://test.com/All/0.html up to http://test.com/All/1000.html and saves each page into its own file. The file name on disk is taken from the URL, i.e. 0.html to 1000.html.
Depending on the performance of the site you are trying to scrape, you might see considerable delays when running the script. If performance is important, you can consider using grequests.
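For instance, a rough sketch with grequests (assuming the same base_url pattern as above; the concurrency of 5 is an arbitrary choice) could look like this:
import grequests

base_url = 'http://test.com/All/'
urls = ["{0}{1}.html".format(base_url, n) for n in range(0, 1050, 50)]

# Build the request objects lazily, then send them concurrently, 5 at a time.
reqs = (grequests.get(u) for u in urls)
for resp in grequests.map(reqs, size=5):
    # Failed requests come back as None.
    if resp is not None and resp.status_code == 200:
        with open(resp.url.rsplit('/', 1)[-1], 'wb') as f:
            f.write(resp.content)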
Related
I have an idx file:
https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx
I could open the idx file with the following code a year ago, but the code doesn't work now. Why is that? How should I modify the code?
import requests
import urllib
from bs4 import BeautifulSoup
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url).content
data_format = byte_data.decode('utf-8').split('------')
content = data_format[-1]
data_list = content.replace('\n','|').split('|')
for index, item in enumerate(data_list):
    if '.txt' in item:
        if data_list[index - 2] == '10-K':
            entry_list = data_list[index - 4: index + 1]
            entry_list[4] = "https://www.sec.gov/Archives/" + entry_list[4]
            master_data.append(entry_list)
print(master_data)
If you inspect the contents of the byte_data variable, you will find that it does not contain the actual content of the idx file; the server instead returns a page meant to block scraping bots like yours. You can find more information in this answer: Problem HTTP error 403 in Python 3 Web Scraping
In this case, the fix is simply to send a User-Agent header with the request.
import requests
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url, allow_redirects=True, headers={"User-Agent": "XYZ/3.0"}).content
# Your further processing here
On a side note, your processing prints an empty list because the if condition is never met for any of the lines, so do not take that output to mean this solution does not work.
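If you want to verify that, a quick check (just a sketch, reusing the data_list built by your own processing) is to count the matching form types before the positional filtering:
# Sketch: count how many entries in data_list are exactly '10-K'
# before applying the index-based checks.
print(sum(1 for item in data_list if item == '10-K'))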
Right now I have a script that can print out basic info on Airbnb listings based on a specified URL:
from bs4 import BeautifulSoup
import requests
import csv
headers = {'User-Agent':'Google Chrome, Windows 10'}
url = 'https://www.airbnb.com/s/Tokyo--Japan/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=april&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&checkin=2021-04-09&checkout=2021-04-23&adults=3&source=structured_search_input_header&search_type=autocomplete_click&query=Tokyo%2C%20Japan&place_id=ChIJ51cu8IcbXWARiRtXIothAS'
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.content,'lxml')
for item in soup.select('[itemprop=itemListElement]'):
    try:
        print('----------------------------------------')
        print(item.select('a')[0]['aria-label'])  # Title
        print(item.select('a')[0]['href'])  # URL
        print(item.select('._krjbj')[0].get_text())  # Price
        print(item.select('._krjbj')[2].get_text())  # Total price
        print(item.select('._kqh46o')[0].get_text())  # Facilities
        print(item.select('._kqh46o')[1].get_text())  # Amenities
        print(item.select('._18khxk1')[0].get_text())  # Rating with number of reviews in parentheses
        # print(name)
        print('----------------------------------------')
    except Exception as e:
        # raise e
        print('')
I'd like the output from this stored in a CSV file. Here's an attempt using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import csv
headers = {'User-Agent':'Google Chrome, Windows 10'}
f = csv.writer(open('airbnbscraping.csv', 'w'))
f.writerow(["title", "weburl", "nightprice", "totalprice", "facilities", "amenities", "ratings"])
for item in soup.select('[itemprop=itemListElement]'):
    try:
        title = item.select(('a')[0]['aria-label'])
        weburl = item.select(('a')[0]['href'])
        nightprice = str(('._krjbj')[0].get_text())
        totalprice = str(('._krjbj')[2].get_text())
        facilities = str(('._kqh46o')[0].get_text())
        amenities = str(('._kqh46o')[1].get_text())
        ratings = str(('._18khxk1')[0].get_text())
    except Exception as e:
        # raise e
        continue
    f.writerow([title, weburl, nightprice, totalprice, facilities, amenities, ratings])
Currently only the header row is written to the csv...how could I get the desired values into the table as well? Would I have to use .find and .find_all instead?
I would use a context manager to handle the file open/write and re-work some of your selectors.
Move the try/except to where it actually needs to handle an exception.
Please note that the classes look dynamic, so scraping by the current class names is not particularly robust. I would instead look for relationships between more stable-looking elements/attributes.
I also needed to change your user-agent to one the server was happy with; otherwise only the header row would be written, because the initial soup.select used for the loop would return no results.
from bs4 import BeautifulSoup
import requests
import csv
headers = {'User-Agent':'Mozilla/5.0'}
url = 'https://www.airbnb.com/s/Tokyo--Japan/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=april&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&checkin=2021-04-09&checkout=2021-04-23&adults=3&source=structured_search_input_header&search_type=autocomplete_click&query=Tokyo%2C%20Japan&place_id=ChIJ51cu8IcbXWARiRtXIothAS'
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.content,'lxml')
with open("airbnbscraping.csv", "w", encoding="utf-8-sig", newline='') as f:
    w = csv.writer(f, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["title", "weburl", "nightprice", "totalprice", "facilities", "amenities", "ratings"])
    for item in soup.select('[itemprop=itemListElement]'):
        title = item.select_one('._8s3ctt a')['aria-label']
        weburl = 'https://www.airbnb.co.uk/' + item.select_one('a')['href']
        nightprice = item.select_one('._olc9rf0').text
        totalprice = item.select_one('button span:contains("total")').text.split(' ')[0]
        facilities = item.select_one('._kqh46o').get_text()
        amenities = item.select_one('[itemprop=itemListElement] ._kqh46o + ._kqh46o').get_text()
        try:
            ratings = item.select_one('._10fy1f8').text
        except:
            ratings = 'None'
        w.writerow([title, weburl, nightprice, totalprice, facilities, amenities, ratings])
This question already has answers here:
Download large file in python with requests
The goal is for the program to take a user-given Instagram URL and allow downloading and saving the picture.
I've got the main part in place but can't understand how to go further and use the filtered, correct URL to download and save the picture to my computer.
My code so far:
EDIT: I added a download line, but I can't seem to get the right file type? It saves with whatever name and extension I give it, but I can't open the file:
import requests
import re
import shutil
def get_response(url):
    r = requests.get(url)
    while r.status_code != 200:
        r.raw.decode_content = True
        r = requests.get(url, stream=True)
    return r.text
def prepare_urls(matches):
    return list({match.replace("\\u0026", "&") for match in matches})
url = input('Enter Instagram URL: ')
response = get_response(url)
vid_matches = re.findall('"video_url":"([^"]+)"', response)
pic_matches = re.findall('"display_url":"([^"]+)"', response)
vid_urls = prepare_urls(vid_matches)
pic_urls = prepare_urls(pic_matches)
if vid_urls:
    print('Detected Videos:\n{0}'.format('\n'.join(vid_urls)))
    print("Can't download video, the provided URL must be of a picture.")
if pic_urls:
    print('Detected Pictures:\n{0}'.format('\n'.join(pic_urls)))
    from urllib.request import urlretrieve
    dst = 'Instagram picture.jpg'
    urlretrieve(url, dst)
    # EDIT ^
if not (vid_urls or pic_urls):
    print('Could not recognize the media in the provided URL.')
I think this might help...
import requests
from bs4 import BeautifulSoup as bs
import json
import os.path
insta_url = 'https://www.instagram.com'
inta_username = input('enter username of instagram : ')
response = requests.get(f"{insta_url}/{inta_username}/")
if response.ok:
    html = response.text
    bs_html = bs(html, features="lxml")
    bs_html = bs_html.text
    index = bs_html.find('profile_pic_url_hd') + 21
    remaining_text = bs_html[index:]
    remaining_text_index = remaining_text.find('requested_by_viewer') - 3
    string_url = remaining_text[:remaining_text_index].replace("\\u0026", "&")
    print(string_url, "\ndownloading...")
    while True:
        filename = 'pic_ins.jpg'
        file_exists = os.path.isfile(filename)
        if not file_exists:
            with open(filename, 'wb+') as handle:
                response = requests.get(string_url, stream=True)
                if not response.ok:
                    print(response)
                for block in response.iter_content(1024):
                    if not block:
                        break
                    handle.write(block)
        else:
            continue
        break
print("completed")
You can change the name of the downloaded image by changing the filename variable.
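For example (just a sketch; the username-based file name is an arbitrary choice):
# Hypothetical tweak: name the file after the queried profile instead of 'pic_ins.jpg'.
filename = f"{inta_username}_profile_pic.jpg"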
After running the code, the downloaded file is 0 bytes. I tried writing the response too, and also tried using a buffer.
What am I doing wrong, and what else can I try? Please help.
import urllib2
from bs4 import BeautifulSoup
import os
import pandas as pd
storePath='/home/vinaysawant/BankIFSCCodes/'
def DownloadFiles():
    # Remove the trailing / you had, as that gives a 404 page
    url = 'https://rbi.org.in/scripts/Bs_viewcontent.aspx?Id=2009'
    conn = urllib2.urlopen(url)
    html = conn.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    # Select all A elements with href attributes containing URLs starting with http://
    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        # Make sure it has one of the correct extensions
        if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
            continue
        filename = href.rsplit('/', 1)[-1]
        print href
        print("Downloading %s to %s..." % (href, filename))
        # urlretrieve(href, filename)
        u = urllib2.urlopen(href)
        f = open(storePath + filename, 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (filename, file_size)
        print("Done.")
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8) * (len(status) + 1)
            print status,
        f.close()
    exit(1)
DownloadFiles()
I also tried
import urllib
urllib.retreive(url)
I tried using urllib2 and urllib3 as well.
I am not good with pandas and urllib2, but since there is no answer to this question yet, here is my guess: I think the problem is that you are trying to download the first url,
url = 'https://rbi.org.in/scripts/Bs_viewcontent.aspx?Id=2009'
which you define here and never change; you then open it,
u = urllib2.urlopen(url)
and after that you try to download whatever that url points to:
buffer = u.read(block_sz)
Instead, I guess you should try to download the href.
So try to change this
u = urllib2.urlopen(url)
to this:
u = urllib2.urlopen(href)
The problem is that the redirection to HTTPS is done via JavaScript instead of HTTP headers, so urllib doesn't follow it. However, you could use replace on the links and change the protocol manually.
href = link.get('href').replace('http://', 'https://')
While that resolves the issue, it's not a bad idea to have urlopen in a try-except block.
try:
u = urllib2.urlopen(href)
except Exception as e:
print(e)
continue
Here is my code
from selenium import webdriver
import urllib
import os

driver = webdriver.Chrome()
path = "/home/winpc/test/python/dup/new"
def get_link_urls(url,driver):
    driver.get(url)
    url = urllib.urlopen(url)
    content = url.readlines()
    urls = []
    for link in driver.find_elements_by_tag_name('a'):
        elem = driver.find_element_by_xpath("//*")
        source_code = elem.get_attribute("outerHTML")
        test = link.get_attribute('href')
        if str(test) != 'None':
            file_name = test.rsplit('/')[-1].split('.')[0]
            file_name_formated = file_name + "Copy.html"
            with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
                temp_file.write(source_code.encode('utf-8'))
        urls.append(link.get_attribute('href'))
    return urls
urls = get_link_urls("http://localhost:8080",driver)
sub_urls = []
for url in urls:
    if str(url) != 'None':
        sub_urls.extend(get_link_urls(url, driver))
This code properly navigates each and every link, but every time it copies only the first HTML page. I need to save the source code of each page it navigates to. The saving is done by the code below:
file_name_formated = file_name + "Copy.html"
with open(os.path.join(path, file_name_formated), 'wb') as temp_file:
    temp_file.write(source_code.encode('utf-8'))
First of all, you're overwriting url again and again inside the function, so fix that first.
For saving the page source through Selenium, you can use driver.page_source.
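For instance, a minimal sketch of the function rewritten that way (reusing the path, os and driver from your code; the file-naming scheme mirrors yours and is only illustrative):
def get_link_urls(url, driver):
    driver.get(url)
    # Save the source of the page that was just loaded, named after its URL.
    file_name = url.rsplit('/')[-1].split('.')[0] or 'index'
    with open(os.path.join(path, file_name + "Copy.html"), 'wb') as temp_file:
        temp_file.write(driver.page_source.encode('utf-8'))
    # Collect this page's outgoing links for the caller to follow.
    return [a.get_attribute('href')
            for a in driver.find_elements_by_tag_name('a')
            if a.get_attribute('href')]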
Additionally, if you want this code to be faster, consider using the requests module.
response = requests.get(url).content