I'm building this Shopify scraper to scrape shop properties like address, phone, email, etc., and I'm receiving urllib.error.HTTPError: HTTP Error 404: Not Found. The CSV is being created with the header, but none of the information is being scraped. Why isn't the address being scraped?
import csv
import json
from urllib.request import urlopen
import sys

base_url = sys.argv[1]
url = base_url + '/shopprops.json'

def get_page(page):
    data = urlopen(url + '?page={}'.format(page)).read()
    shopprops = json.loads(data)['shopprops']
    return shopprops

with open('shopprops.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Address1'])

    page = 1
    shopprops = get_page(page)
    while shopprops:
        for shop in shopprops:
            address1 = shop['address1']
            row = [address1]
            writer.writerow(row)
        page += 1
        shopprops = get_page(page)
It looks like the issue is with:
data = urlopen(url + '?page={}'.format(page)).read()
and:
shopprops = get_page(page)
That article is crappy for a few reasons, which might help you move on. First off, you can't scrape a shop the way that author describes, by just asking for products.json. You get a small payload of a few products at best, with no really interesting information exposed. Shopify is wise to that.
So before you invest too much effort in your scraper, you might want to re-think what you're doing and try a different approach than this one.
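As a quick sanity check before you write any more scraper code, you can hit the endpoint once and look at what actually comes back. This is only a minimal sketch, assuming a public shop URL passed on the command line; products.json is the path the article relies on, and a 404 or a near-empty payload at this step is exactly the problem described above:
import json
import sys
from urllib.request import urlopen
from urllib.error import HTTPError

base_url = sys.argv[1]  # the shop's public URL passed on the command line

try:
    with urlopen(base_url + '/products.json?page=1') as resp:
        payload = json.loads(resp.read())
        # Inspect what the endpoint actually exposes before building a full scraper.
        print('HTTP status:', resp.status)
        print('Top-level keys:', list(payload.keys()))
        print('Products returned:', len(payload.get('products', [])))
except HTTPError as e:
    # A 404 here means the endpoint is simply not available for this shop.
    print('Request failed:', e.code, e.reason)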
Related
I made a script for scraping pages of some shop looking for out of stock items. It looks like this:
import requests
from bs4 import BeautifulSoup

urls = ['https://www.someurla', 'https://www.someurlb']

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    if len(soup.find_all('li', class_='out-of-stock')) > 0:
        print(soup.title)
Now I would like to make this list of URLs updatable without touching the script itself. So I'm thinking about a detached file that would serve as a flat database. I think that would be more appropriate than a relational DB, because I don't really need one.
I would like some opinions from more experienced Python users: is this an appropriate approach, and if so, what is the best way to do it, with a plain text file or with a .py file? What libraries are good for this task? Or are there better approaches altogether?
Go with a simple JSON file. Something like this:
import os
import json

url_file = '<path>/urls.json'

urls = []
if os.path.isfile(url_file):
    with open(url_file, 'rb') as f:
        urls = json.load(f)['urls']
else:
    print('No URLs found to load')

print(urls)
# hook in your script here...
JSON structure for this particular example:
{
    "urls": [
        "http://example.com",
        "http://google.com"
    ]
}
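If you also want to be able to add URLs without opening the script, the same file can be written back with json.dump. A minimal sketch along those lines, assuming the same urls.json layout shown above:
import json

url_file = '<path>/urls.json'  # same file as above

def add_url(new_url):
    # Load whatever is already there, append the new URL, and write it all back.
    try:
        with open(url_file, 'r') as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {'urls': []}
    if new_url not in data['urls']:
        data['urls'].append(new_url)
    with open(url_file, 'w') as f:
        json.dump(data, f, indent=4)

add_url('https://www.someurlc')  # hypothetical shop URL to watch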
Thank you for taking an interest in my question. I'm currently studying Computer Science at university, and I believe I have a pretty good grasp of Python programming. With that in mind, and now that I'm learning full-stack development, I wanted to develop a web crawler in Python (since I hear it's good at that) to skim through sites like Manta and Tradesi looking for small businesses without websites, so that I can get in touch with their owners and do some pro bono work to kickstart my career as a web developer. The problem is, I have never made a web crawler before, in any language, so I thought the helpful folk at Stack Overflow could give me some insight about web crawlers, particularly how I should go about learning to make them, and ideas on how to implement one for those particular websites.
Any input is appreciated. Thank you, and have a good day/evening!
Here is a way to loop through a list of values, build a URL from each one, and import data from it.
import urllib
import json

# Read the list of dates (one per line) from a text file.
thedates = open("C:/Users/rshuell001/Desktop/dates/dates.txt").read()
dateslist = thedates.split("\n")

for thedate in dateslist:
    # Create (or truncate) the output file for this date.
    myfile = open("C:/Users/rshuell001/Desktop/dates/" + thedate + ".txt", "w+")
    myfile.close()

    # Assuming dates are stored as YYYY-MM-DD; split them for the query string.
    theyear, themonth, theday = thedate.split("-")
    htmltext = urllib.urlopen("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=" + themonth + "&day=" + theday + "&year=" + theyear)
    data = json.load(htmltext)
    datapoints = data["data_values"]

    # Append each data point as a comma-separated line.
    myfile = open("C:/Users/rshuell001/Desktop/dates/" + thedate + ".txt", "a")
    for point in datapoints:
        myfile.write(thedate + "," + str(point[0]) + "," + str(point[1]) + "\n")
    myfile.close()
Here is another example, this time paging through a business directory and pulling a few fields from each listing:
import requests
from bs4 import BeautifulSoup

# Note the URL ends with "page=" so the page number can be appended below.
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.findAll('div', {'class': 'jobs-item'})

    for title in firme:
        # Each listing has a name plus two "description" blocks (address and contact).
        title1 = title.findAll('h6')[0].text
        print(title1)
        adresa = title.findAll('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.findAll('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )

    # Advance the page counter once per page, not once per listing.
    current_page += 1
Keep in mind there are many, many ways to do this kind of thing, and every site is different from every other, so the final result you come up with will be highly customized and very specific in its intended use.
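As a side note, the page_line string built in the loop above is never written anywhere. If you want to keep the scraped fields instead of only printing them, it can be appended to a file right after it is built; a minimal sketch, with a hypothetical output filename:
# Inside the "for title in firme:" loop, right after page_line is built:
with open('imenik_results.txt', 'a', encoding='utf-8') as out:  # hypothetical output file
    out.write(page_line + '\n\n')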
I wrote a script to pull data from a website, but after several requests the site starts returning 403 Forbidden.
What should I do about this?
My code is below:
import requests, bs4
import csv

links = []
with open('1-432.csv', 'rb') as urls:
    reader = csv.reader(urls)
    for i in reader:
        links.append(i[0])

info = []
nbr = 1
for url in links:
    # Problem is here.
    sub = []
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    start = soup.find('em')
    forname = soup.find_all('b')
    name = []
    for b in forname:
        name.append(b.text)
    name = name[7]
    sub.append(name.encode('utf-8'))
    for b in start.find_next_siblings('b'):
        if b.text in ('Category:', 'Website:', 'Email:', 'Phone'):
            sub.append(b.next_sibling.strip().encode('utf-8'))
    info.append(sub)
    print('Page ' + str(nbr) + ' is saved')
    with open('Canada_info_4.csv', 'wb') as myfile:
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        for u in info:
            wr.writerow(u)
    nbr += 1
What should I do to keep making requests to the website?
An example URL is http://www.worldhospitaldirectory.com/dr-bhandare-hospital/info/43225
Thanks.
There are a bunch of different things that could be the problem, and depending on what their blacklisting policy is, it might be too late to fix.
At the very least, scraping like this is generally considered to be dick behavior. You're hammering their server. Try putting a time.sleep(10) inside your main loop.
Secondly, try setting a User-Agent header so your requests look like they come from a browser rather than a script.
A better solution though would be to see if they have an API you can use.
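A minimal sketch of both suggestions combined; the User-Agent string is just an example of a browser-like value, and the URL list stands in for the links you load from 1-432.csv:
import time
import requests

# Example browser-like User-Agent; any realistic value is better than the default.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

links = ['http://www.worldhospitaldirectory.com/dr-bhandare-hospital/info/43225']

for url in links:
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        # Back off or log the failure instead of hammering the server further.
        print('Got', r.status_code, 'for', url)
    # ... parse r.text with BeautifulSoup as before ...
    time.sleep(10)  # pause between requests so you are not hammering the server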
I am trying to extract a list of golf course names and addresses from the Garmin website using the script below.
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print name
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
After running the script, I don't get the full data I need; it produces seemingly random values rather than a complete set. I need to extract information from 893 pages and get a list of at least 18,000 courses, but after running this script I only get 122. How do I fix it so that it produces a CSV with the complete set of golf courses from the Garmin website? I already adjusted the page numbers to match the Garmin site's setup, where the per_page offset advances in steps of 20.
Just taking a guess here, but try checking r.status_code and confirming that it's 200 for every page? It's possible you're not actually getting valid responses for most of the pages.
Stab in the dark.
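A small sketch of that check, reusing the URL pattern from the question, so you can see which pages actually respond with results before parsing them:
import requests

# Same URL pattern as in the question; only the offset changes.
url_template = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}"

for i in range(5):  # check a handful of pages first
    url = url_template.format(i * 20)
    r = requests.get(url)
    # Anything other than 200, or a suspiciously small body, means the parsing
    # code never sees real results for that page.
    print(i, r.status_code, len(r.content))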
I am trying to use urllib to access a website and then pull out the page source so I can collect some data from it. I know how to do this for public websites, but I don't know how to use urllib for password-protected pages. I know the username and password; I am just very confused about how to get urllib to submit the correct credentials and then route me to the page I actually want to scrape. Currently, my code looks like this. The problem is that it is bringing back the login page's source.
from tkinter import *
from tkinter import filedialog
import csv
from re import findall
import urllib.request

def info():
    file = filedialog.askopenfilename()
    fileR = open(file, 'r')
    hold = csv.reader(fileR, delimiter=',', quotechar='|')
    aList = []
    for item in hold:
        if item[1] and item[2] == "":
            print(item[1])
            url = "www.example.com/id=" + item[1]
            request = urllib.request.urlopen(url)
            html = request.read()
            data = str(html)
            person = findall('''\$MainContent\$txtRecipient\"\stype=\"text\"\svalue=\"([^\"]+)\"''', data)
        else:
            pass
    fileR.close()
Remember, I am using Python 3.3.3. Any help would be appreciated!
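One common pattern for form-based logins with plain urllib is to install a cookie handler, POST the credentials to the login form once, and then reuse the same opener for the protected pages. This is only a rough sketch under assumptions: the login URL and the field names (username, password) here are hypothetical and need to be read from the <form> element on the site's actual login page.
import urllib.request
import urllib.parse
import http.cookiejar

# Keep cookies between requests so the logged-in session survives.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# Hypothetical login endpoint and form field names; take the real ones
# from the login page's HTML form.
login_url = "http://www.example.com/login"
credentials = urllib.parse.urlencode({
    "username": "my_user",
    "password": "my_password",
}).encode("utf-8")

# POST the credentials once; the server should respond with a session cookie.
opener.open(login_url, credentials)

# Now request the page you actually want with the same opener.
response = opener.open("http://www.example.com/id=12345")
html = response.read().decode("utf-8", errors="replace")
print(html[:200])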