thank you for taking an interest in my question. I'm currently studying Computer Science in university, and I believe that I have a pretty good grasp of Python programming. With that in mind, and now that I'm learning full-stack development, I wanted to develop a web crawler in Python (since I hear that it's good at that) to skim through sites like Manta and Tradesi looking for small businesses without websites so that I can get in touch with their owners and do some pro-bono work to kickstart my career as a web developer. Problem is, I have never made a web crawler before, in any language, so I thought that the helpful folk at Stack Overflow could give me some insight about web crawlers, particularly how I should go about learning how to make them, and ideas on how to implement it for those particular websites.
Any input is appreciated. Thank you, and have a good day/evening!
Here is one way to loop through a list of values (dates, in this example), build a URL for each, and pull data from it.
import urllib.request
import json

# Read the list of dates, one per line, and drop any blank lines.
with open("C:/Users/rshuell001/Desktop/dates/dates.txt") as f:
    dateslist = [line for line in f.read().split("\n") if line]

for thedate in dateslist:
    # Assumes each date is written as month-day-year (e.g. "10-31-2015");
    # adjust the split to match the format in your dates.txt.
    themonth, theday, theyear = thedate.split("-")
    url = ("http://www.hockey-reference.com/friv/dailyleaders.cgi"
           "?month=" + themonth + "&day=" + theday + "&year=" + theyear)

    # The response is expected to be JSON shaped like {"data_values": [...]}.
    htmltext = urllib.request.urlopen(url)
    data = json.load(htmltext)
    datapoints = data["data_values"]

    # Write one comma-separated line per data point for this date.
    with open("C:/Users/rshuell001/Desktop/dates/" + thedate + ".txt", "w") as myfile:
        for point in datapoints:
            myfile.write(thedate + "," + str(point[0]) + "," + str(point[1]) + "\n")
Here is another example, this time using requests and BeautifulSoup to page through a business directory:
import requests
from bs4 import BeautifulSoup

# The page number is appended below, so the base URL ends at "page=".
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.find_all('div', {'class': 'jobs-item'})

    for title in firme:
        title1 = title.find_all('h6')[0].text
        print(title1)
        adresa = title.find_all('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.find_all('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        # Collected fields for this listing; write page_line to a file or database here.
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )

    current_page += 1
Keep in mind, there are many ways to do this kind of thing, and every site is different from the others, so the final result you come up with will be highly customized and very specific in its intended use.
Related
I'm building this Shopify scraper to scrape shop properties like address, phone, email, etc., and I'm receiving a urllib.error.HTTPError: HTTP Error 404: not found. The CSV is being created with the header, but none of the information is scraped. Why isn't the address being scraped?
import csv
import json
from urllib.request import urlopen
import sys

base_url = sys.argv[1]
url = base_url + '/shopprops.json'

def get_page(page):
    data = urlopen(url + '?page={}'.format(page)).read()
    shopprops = json.loads(data)['shopprops']
    return shopprops

with open('shopprops.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Address1'])

    page = 1
    shop = get_page(page)
    while shopprops:
        for shop in shopprops:
            address1 = shop['address1']
            row = [address1]
            writer.writerow(row)
        page += 1
        shopprops = get_page(page)
It looks like the issue is with:
data = urlopen(url + '?page={}'.format(page)).read()
and:
shopprops = get_page(page)
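Separately, there is a variable mix-up in the paging loop: the first call assigns to shop, but the while condition tests shopprops, which is never defined at that point. A minimal sketch of how that part could be restructured (this does not address the 404 itself):

page = 1
shopprops = get_page(page)  # assign to the same name the while condition tests
while shopprops:            # stops once a page comes back empty
    for shop in shopprops:
        writer.writerow([shop['address1']])
    page += 1
    shopprops = get_page(page)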
That article is flawed for a few reasons, which might help you decide to move on. First off, you can't scrape a shop the way that author suggests, by just asking for products.json. At best you get a very small payload of a few products, with no really interesting information exposed. Shopify is wise to that.
So before you invest too much effort in your scraper, you might want to rethink what you're doing and try a different approach than this one.
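For illustration only, this is roughly what the products.json approach looks like. The shop domain below is hypothetical, the endpoint may not be exposed at all, and even when it is, it returns only catalogue-level product fields, not owner or contact details:

import json
from urllib.request import urlopen

# Hypothetical shop domain used for the example.
shop_url = 'https://example-shop.myshopify.com'

# Public product listing, if the shop exposes it; the payload is limited.
data = urlopen(shop_url + '/products.json?limit=250').read()
products = json.loads(data).get('products', [])

for product in products:
    # Only basic product data is available here: no addresses, emails, or phone numbers.
    print(product.get('title'), product.get('handle'))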
I'm using a web-scraping script, without a headless browser, to scrape about 500 entries from Transfermarkt for a personal project.
According to best practices, I need to randomize my scraping pattern, add delays, and handle errors and slow page loads in order to scrape Transfermarkt without raising any flags.
I understand how Selenium and ChromeDriver can help with all of these to scrape more safely, but I've used requests and BeautifulSoup to build a much simpler scraper:
import requests, re, ast
from bs4 import BeautifulSoup
import pandas as pd

i = 1
url_list = []

# Collect links to the top-25 player profiles on each of the first 20 list pages.
while True:
    page = requests.get('https://www.transfermarkt.us/spieler-statistik/wertvollstespieler/marktwertetop?page=' + str(i),
                        headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_page = BeautifulSoup(page, 'lxml')

    all_links = []
    for link in parsed_page.find_all('a', href=True):
        all_links.append(str(link['href']))

    r = re.compile('.*profil/spieler.*')
    player_links = list(filter(r.match, all_links))
    for plink in range(0, 25):
        url_list.append('https://www.transfermarkt.us' + player_links[plink])

    i += 1
    if i > 20:
        break

# From each profile page, grab the link to the market-value graph page.
final_url_list = []
for i in url_list:
    int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
    graph_a = graph_container.find('a')
    graph_link = graph_a.get('href')
    final_url_list.append('https://www.transfermarkt.us' + graph_link)

# Pull the market-value data embedded in each graph page.
for url in final_url_list:
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    p = re.compile(r"'data':(.*)}\],")
    s = p.findall(r.text)[0]
    s = s.encode().decode('unicode_escape')
    data = ast.literal_eval(s)
    # rest of the code to write scraped info below this
I was wondering whether this is generally considered a safe enough way to scrape a website like Transfermarkt if I add time.sleep() from the time library, as detailed here, to create a delay (long enough for the page to load, say 10 seconds) so that I can scrape the 500 entries without raising any flags.
I would also forgo randomized clicks (which I think can only be done with Selenium/ChromeDriver) to mimic human behavior, and was wondering whether leaving those out is also OK for scraping safely.
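For reference, here is a minimal sketch of how a randomized delay and basic error handling could be added to the loop above. The delay range, retry count, and helper name are assumptions on my part, not site-specific guidance:

import time, random
import requests

def polite_get(url, max_tries=3):
    # Wait a randomized interval between requests, and retry a couple of
    # times on network errors or non-200 responses before giving up.
    for attempt in range(max_tries):
        time.sleep(random.uniform(5, 15))
        try:
            r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
            if r.status_code == 200:
                return r.text
        except requests.RequestException:
            pass
    return None

# Example use inside the scraping loop:
# page = polite_get('https://www.transfermarkt.us/...')
# if page is None:
#     continue  # skip this entry rather than hammering the server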
I am quite new to Python and am building a web scraper, which will scrape the following page and the links in it: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is that the page displays only the first 10 search results by default, whereas I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but it remains static no matter which display option is selected. I have also looked at the Network section of Chrome's Developer Tools, but can't figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re

response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")

urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
    urls.append(a['href'])

pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]

for pages in pagesToCrawl:
    html = requests.get(pages)
    soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
    nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
    tbody = soupObjs.find('div', {"id": "collapse8"}).find('tbody')
    offers = tbody.find('td').next_sibling.next_sibling.next_element
    seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
    print('Firm name:', nameOfFirm)
    print('Offers:', offers)
    print('Seeking:', seeking)
    print('Hireback Rate:', int(offers) / int(seeking))
Replacing your response call with the code below seems to work. The site controls how many results it displays through a DISPLAYNUM cookie, which your request wasn't sending.
response = requests.get(
    'https://www.nalpcanada.com/Page.cfm',
    params={'PageID': 33},
    cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was a ValueError raised by the line below for certain links (like YLaw Group) that don't seem to list "offers" and/or "seeking":
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.
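If you want to keep the calculation, one possible way to handle those cases is shown below. Skipping the rate for that firm is an arbitrary choice, just a sketch:

try:
    print('Hireback Rate:', int(offers) / int(seeking))
except (ValueError, ZeroDivisionError):
    # offers/seeking is missing or non-numeric for this firm; skip the rate.
    print('Hireback Rate: n/a')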
I wrote a script to pull data from a website, but after several requests it returns 403 Forbidden.
What should I do about this issue?
My code is below:
import requests, bs4
import csv

links = []
with open('1-432.csv', 'rb') as urls:
    reader = csv.reader(urls)
    for i in reader:
        links.append(i[0])

info = []
nbr = 1
for url in links:
    # Problem is here.
    sub = []
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    start = soup.find('em')
    forname = soup.find_all('b')
    name = []
    for b in forname:
        name.append(b.text)
    name = name[7]
    sub.append(name.encode('utf-8'))
    for b in start.find_next_siblings('b'):
        if b.text in ('Category:', 'Website:', 'Email:', 'Phone'):
            sub.append(b.next_sibling.strip().encode('utf-8'))
    info.append(sub)
    print('Page ' + str(nbr) + ' is saved')
    with open('Canada_info_4.csv', 'wb') as myfile:
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        for u in info:
            wr.writerow(u)
    nbr += 1
What should I do to make successful requests to the website?
Example url is http://www.worldhospitaldirectory.com/dr-bhandare-hospital/info/43225
Thanks.
There are a bunch of different things that could be the problem, and depending on what their blacklisting policy is, it might be too late to fix.
At the very least, scraping like this is generally considered rude behavior: you're hammering their server. Try putting a time.sleep(10) inside your main loop.
Secondly, try setting your User-Agent header. See here or here.
A better solution though would be to see if they have an API you can use.
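If you do keep scraping, a minimal sketch of both suggestions together might look like the following; the delay length and the User-Agent string here are arbitrary choices, not values the site requires:

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

for url in links:
    r = requests.get(url, headers=headers)
    # ... parse and save as before ...
    time.sleep(10)  # give their server a break between requests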
I am doing a simple crawling task: crawling news comments from Yahoo News (http://news.yahoo.com/peter-kassig-mother-isis-twitter-133155662.html).
And this is my code:
import urllib
url2 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
url1 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=10&pageNumber=1&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
url15 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=10&pageNumber=15&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
u1 = urllib.urlopen(url1)
u2 = urllib.urlopen(url2)
u15 = urllib.urlopen(url15)
data1 = u1.read()
data2 = u2.read()
data15 = u15.read()
# data15 is the same as data2!!!
I know the comments are fetched with a GET request (I found this in the Network tab of Chrome's Developer Tools), which means I should be able to crawl them just by requesting these URLs.
There are only two differences (pageNumber and offset) among url1, url2, and url15.
But even though url1 is for pageNumber=1 and url15 is for pageNumber=15, they return the same data!
I don't know why.
This is my first naive web crawling task.
Thank you in advance.
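One detail that might be worth double-checking (an assumption on my part, not a confirmed cause): url15 still uses offset=10, the same value as url1, while url2 uses offset=20. If the endpoint actually paginates on offset rather than pageNumber, the offset has to advance along with the page. A sketch, sticking with the urllib.urlopen call from the question:

import urllib

# Same query string as above, with offset and pageNumber filled in per page.
base = ('http://news.yahoo.com/_xhr/contentcomments/get_comments/'
        '?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full'
        '&count=10&sortBy=highestRated&isNext=true'
        '&offset={offset}&pageNumber={page}'
        '&_media.modules.content_comments.switches._enable_view_others=1'
        '&_media.modules.content_comments.switches._enable_mutecommenter=1'
        '&enable_collapsed_comment=1')

for page in range(1, 16):
    # Keep offset in step with the page: 10 comments per page in this example.
    url = base.format(offset=10 * page, page=page)
    data = urllib.urlopen(url).read()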