I am working on a simple crawling task: collecting news comments from Yahoo News (http://news.yahoo.com/peter-kassig-mother-isis-twitter-133155662.html).
And this is my code:
import urllib
url2 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
url1 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=10&pageNumber=1&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
url15 = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=f8bf9dc7-1692-3283-825e-2d506952f57b&_device=full&count=10&sortBy=highestRated&isNext=true&offset=10&pageNumber=15&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1'
u1 = urllib.urlopen(url1)
u2 = urllib.urlopen(url2)
u15 = urllib.urlopen(url15)
data1 = u1.read()
data2 = u2.read()
data15 = u15.read()
# data15 is the same as data2!
I know the comments are fetched with GET requests (seen in the Network tab of Chrome's Developer Tools), which means I can crawl the comments just by requesting these URLs.
There are only two differences (pageNumber and offset) among url1, url2, and url15.
Though url1 is for pageNumber=1 and url15 is for pageNumber=15, they return the same data!
I don't know why.
This is my first naive web crawling task.
Thank you in advance.
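For what it's worth, building the query string programmatically keeps the two varying parameters from drifting apart. This is a hedged Python 3 sketch: based on url1 and url2 it assumes the offset should advance by count per page (page 1 → offset 10, page 2 → offset 20); note that url15 above still has offset=10.

```python
from urllib.parse import urlencode

BASE = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/'
CONTENT_ID = 'f8bf9dc7-1692-3283-825e-2d506952f57b'

def comments_url(page_number, count=10):
    # Derive the offset from the page number instead of hard-coding it,
    # so pageNumber and offset always stay consistent with each other.
    params = {
        'content_id': CONTENT_ID,
        '_device': 'full',
        'count': count,
        'sortBy': 'highestRated',
        'isNext': 'true',
        'offset': page_number * count,
        'pageNumber': page_number,
    }
    return BASE + '?' + urlencode(params)
```

If the server pages by offset rather than pageNumber, keeping the two in sync this way would also explain why hand-edited URLs with mismatched values return unexpected pages.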
I tried to get the number of followers of a given Twitter account by scraping Twitter, using both BeautifulSoup and XPath, but none of the code works.
This is some of my sample testing code for it,
import requests
from bs4 import BeautifulSoup

url = "https://twitter.com/BarackObama"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
div_tag = soup.find_all('main', {"class": "css-1dbjc4n r-1habvwh r-16xksha r-1wbh5a2"})
When I try to inspect the content I scraped with the code below,
import requests
t=requests.get('https://twitter.com/BarackObama')
print(t.content)
it does not include any of the data, such as the follower count.
Please help me with this.
When your code parses the Twitter URL, it only gets the initial page load: the page HTML arrives, but the values and other important data (including the followers) are filled in afterwards by JavaScript, so they never appear in the scraped content. There is a Twitter Python API through which you can get the followers with api.GetFollowers().
The relevant API endpoint is followers/ids. Using TwitterAPI you can do the following:
from TwitterAPI import TwitterAPI, TwitterPager
api = TwitterAPI(YOUR_CONSUMER_KEY,
                 YOUR_CONSUMER_SECRET,
                 YOUR_ACCESS_TOKEN_KEY,
                 YOUR_ACCESS_TOKEN_SECRET)
count = 0
r = TwitterPager(api, 'followers/ids')
for item in r.get_iterator():
    count = count + 1
print(count)
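By default followers/ids returns the followers of the authenticating account; to count a different account's followers, the endpoint accepts a screen_name request parameter (a hedged sketch, untested without credentials). The counting itself is plain iterator exhaustion and can live in a small helper:

```python
def count_items(iterator):
    # Exhaust a (possibly paged) iterator, counting what it yields.
    total = 0
    for _ in iterator:
        total += 1
    return total

# Untested sketch with TwitterAPI (requires valid credentials):
# pager = TwitterPager(api, 'followers/ids', {'screen_name': 'BarackObama'})
# print(count_items(pager.get_iterator()))
```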
I'm using a web scraping script, without a headless browser, to scrape about 500 inputs from Transfermarkt for a personal project.
According to best practices, I need to randomize my scraping pattern, add delays, and handle errors/loading delays in order to scrape Transfermarkt without raising any flags.
I understand how Selenium and ChromeDriver can help with all of this for safer scraping, but I've used requests and BeautifulSoup to create a much simpler web scraper:
import requests, re, ast
from bs4 import BeautifulSoup
import pandas as pd

i = 1
url_list = []
while True:
    page = requests.get('https://www.transfermarkt.us/spieler-statistik/wertvollstespieler/marktwertetop?page=' + str(i), headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_page = BeautifulSoup(page, 'lxml')
    all_links = []
    for link in parsed_page.find_all('a', href=True):
        all_links.append(str(link['href']))
    r = re.compile('.*profil/spieler.*')
    player_links = list(filter(r.match, all_links))
    for plink in range(0, 25):
        url_list.append('https://www.transfermarkt.us' + player_links[plink])
    i += 1
    if i > 20:
        break
final_url_list = []
for i in url_list:
    int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
    graph_a = graph_container.find('a')
    graph_link = graph_a.get('href')
    final_url_list.append('https://www.transfermarkt.us' + graph_link)
for url in final_url_list:
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    p = re.compile(r"'data':(.*)}\],")
    s = p.findall(r.text)[0]
    s = s.encode().decode('unicode_escape')
    data = ast.literal_eval(s)
    # rest of the code to write scraped info below this
I was wondering if this is generally considered a safe enough way to scrape a site like Transfermarkt if I add the time.sleep() method from the time library, as detailed here, to create a delay long enough for the page to load (say, 10 seconds), so that I can scrape the 500 inputs without raising any flags.
I would also forgo randomized clicks (which I think can only be done with Selenium/ChromeDriver) to mimic human behavior, and was wondering whether excluding those would also be OK for scraping safely.
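One common pattern, sketched here under the assumption that a uniformly random 5 to 15 second pause is acceptable for 500 requests, is to wrap the fetch in a helper that sleeps a randomized interval first, so requests are spaced irregularly rather than on an exact 10-second clock (polite_get and random_delay are my names, not part of any library):

```python
import random
import time

import requests

def random_delay(min_s=5.0, max_s=15.0):
    # Draw the pause uniformly so request timing is not perfectly regular.
    return random.uniform(min_s, max_s)

def polite_get(url, min_s=5.0, max_s=15.0):
    # Sleep a randomized interval, then fetch with a browser-like UA.
    time.sleep(random_delay(min_s, max_s))
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
```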
I am quite new to Python and am building a web scraper, which will scrape the following page and links in them: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is that the page defaults to displaying the first 10 search results, but I want to scrape all 150 (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but it remains static no matter which display option is selected. I have also looked at the Network section of Chrome's Developer Tools, but can't figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re
response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")
urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
urls.append(a['href'])
pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]
for pages in pagesToCrawl:
    html = requests.get(pages)
    soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
    nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
    tbody = soupObjs.find('div', {"id":"collapse8"}).find('tbody')
    offers = tbody.find('td').next_sibling.next_sibling.next_element
    seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
    print('Firm name:', nameOfFirm)
    print('Offers:', offers)
    print('Seeking:', seeking)
    print('Hireback Rate:', int(offers) / int(seeking))
Replacing your requests.get call with this code seems to work. The reason is that you weren't passing the cookie that controls the number of displayed results.
response = requests.get(
    'https://www.nalpcanada.com/Page.cfm',
    params={'PageID': 33},
    cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was a ValueError raised by this line when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking":
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.
Thank you for taking an interest in my question. I'm currently studying Computer Science at university, and I believe I have a pretty good grasp of Python programming. With that in mind, and now that I'm learning full-stack development, I want to build a web crawler in Python (since I hear it's good at that) to skim through sites like Manta and Tradesi looking for small businesses without websites, so that I can get in touch with their owners and do some pro bono work to kickstart my career as a web developer. The problem is that I have never made a web crawler before, in any language, so I'm hoping the helpful folk at Stack Overflow can give me some insight into web crawlers: in particular, how I should go about learning to make one, and ideas on how to implement one for those particular websites.
Any input is appreciated. Thank you, and have a good day/evening!
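As a hedged starting point rather than a full answer, a minimal breadth-first crawler can be built from the same requests + BeautifulSoup pieces used elsewhere on this page. The helper names below are mine, and a real crawler for sites like Manta would also need robots.txt checks, rate limiting, and persistence:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def same_domain_links(html, base_url):
    # Collect every <a href>, resolved against base_url, that stays
    # on the same domain as base_url.
    domain = urlparse(base_url).netloc
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    for a in soup.find_all('a', href=True):
        link = urljoin(base_url, a['href'])
        if urlparse(link).netloc == domain:
            links.add(link)
    return links

def crawl(start_url, max_pages=50):
    # Breadth-first: fetch a page, queue its in-domain links, repeat.
    seen, queue, visited = {start_url}, [start_url], []
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        visited.append(url)
        for link in same_domain_links(resp.text, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

From there, "finding businesses without websites" becomes a per-page parsing problem: inspect each listing page's HTML for whatever element holds the website field and record listings where it is empty.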
Here is a way to loop through a list of dates and import data for each.
import urllib
import json

with open("C:/Users/rshuell001/Desktop/dates/dates.txt") as f:
    dateslist = f.read().split("\n")

for thedate in dateslist:
    # Assumes each line looks like "11-17-2014" (month-day-year).
    themonth, theday, theyear = thedate.split("-")
    htmltext = urllib.urlopen("http://www.hockey-reference.com/friv/dailyleaders.cgi?month=" + themonth + "&day=" + theday + "&year=" + theyear)
    data = json.load(htmltext)
    datapoints = data["data_values"]
    myfile = open("C:/Users/rshuell001/Desktop/dates/" + thedate + ".txt", "w+")
    for point in datapoints:
        myfile.write(str(point[0]) + "," + str(point[1]) + "\n")
    myfile.close()
import requests
from bs4 import BeautifulSoup

base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.findAll('div', {'class': 'jobs-item'})
    for title in firme:
        title1 = title.findAll('h6')[0].text
        print(title1)
        adresa = title.findAll('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.findAll('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
    current_page += 1
Keep in mind there are many, many ways to do this kind of thing, and each site is different from all the others, so the final result you come up with will be highly customized and very specific in its intended use.
I am scraping an entire article management system storing thousands of articles. My script works, but the problem is that BeautifulSoup and requests both take a long time to determine whether a page is an actual article or an "article not found" page. I have approximately 4000 articles, and by my calculation the script will take days to complete.
for article_url in edit_article_list:
    article_edit_page = s.get(article_url, data=payload).text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
    # Section
    if article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) == None:
        continue
    else:
        for thing in article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected":"selected"}):
            f.write(thing.get_text(strip=True) + "\t")
The first if check determines whether the URL is good or bad.
edit_article_list is built by:
for count in range(87418, 307725):
    edit_article_list.append(login_url + "AddEditArticle.aspx?ArticleID=" + str(count))
My script right now checks for the bad and good urls and then scrapes the content. Is there any way I can get the valid urls of similar pattern using requests while making the url list?
To skip articles which don't exist, you need to disallow redirects and check the status code:
for article_url in edit_article_list:
    r = requests.get(article_url, data=payload, allow_redirects=False)
    if r.status_code != 200:
        continue
    article_edit_page = r.text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
    # Section
    if article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) == None:
        continue
    else:
        for thing in article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected":"selected"}):
            f.write(thing.get_text(strip=True) + "\t")
I do, though, recommend parsing the article list page for the actual URLs: you are currently firing off over 200,000 requests while expecting only 4,000 articles, which is a lot of overhead and traffic, and not very efficient!
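If parsing the list page isn't possible, another hedged option is to probe the IDs with lightweight HEAD requests (no body download) spread over a thread pool. check_url and find_valid below are my names, and the 200-versus-redirect assumption is the same one the status-code check above relies on:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def check_url(session, url):
    # HEAD fetches only the status line and headers; with redirects
    # disabled, a missing article shows up as a 3xx rather than a
    # fully rendered "not found" page.
    r = session.head(url, allow_redirects=False)
    return url if r.status_code == 200 else None

def find_valid(urls, workers=10):
    # A Session reuses the underlying connection across probes.
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: check_url(session, u), urls)
    return [u for u in results if u]
```

Even so, 200,000 HEAD probes are still 200,000 requests to the server, so the list-page approach remains the better fix when it is available.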