I'm trying to scrape all news headlines from Google News (note: not via news.google.com) with the following conditions:
i. keyword(s),
ii. specific date range,
iii. sorted by date, and
iv. able to loop through the pages
This is the link for a regular Google search with the specified keywords:
https://www.google.com/search?q=migrant%2Bcaravans&rlz=1C1GCEA_enUS827US827&sxsrf=ACYBGNT3ExxxPO5PSo9Cgp91M37sVBHLMA:1576086735805&source=lnms&tbm=nws&sa=X&ved=2ahUKEwji9pbQlK7mAhWIxFkKHWDQCCcQ_AUoAXoECBAQAw&biw=1680&bih=939
And this is the link for my Google search with the same keywords, sorted by date and restricted to a date range:
https://www.google.com/search?q=migrant%2Bcaravans&rlz=1C1GCEA_enUS827US827&tbs=cdr:1,cd_min:1/1/2017,cd_max:12/31/2017,sbd:1&tbm=nws&sxsrf=ACYBGNRZjtVzEEfuEKcHjuOYUmubi5pT3g:1576086970386&source=lnt&sa=X&ved=0ahUKEwjc1oTAla7mAhWExVkKHQlVB_YQpwUIIA&biw=1680&bih=939&dpr=1
This is a sample of my code that is able to scrape the headlines from a regular search without any of the conditions imposed:
import sys
import requests
from bs4 import BeautifulSoup

def scrape_news_summaries(topic, pagenum=1):
    #time.sleep(randint(0, 2))
    url = "http://www.google.com/search?q=" + topic + "&tbm=nws&dpr=" + str(pagenum)
    r = requests.get(url)
    if r.status_code != 200:
        print('status code for ' + url + ' was ' + str(r.status_code))
        sys.exit(-1)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

scrape_news_summaries("migrant+caravans")
This is the code with the URL altered to include a date range and sort the search by date:
def scrape_news_date_range(query, min_date, max_date, pagenum=1):
    url = "https://www.google.com/search?q=" + query + "&rlz=1C1GCEA_enUS827US827&tbs=cdr:1,cd_min:" + min_date + ",cd_max:" + max_date + ",sbd:1&tbm=nws/*,ned=es_sv*/&dpr=" + str(pagenum)
    r = requests.get(url)
    if r.status_code != 200:
        print('status code for ' + url + ' was ' + str(r.status_code))
        sys.exit(-1)
    soup = BeautifulSoup(r.text, "html.parser")
    #return soup
    print(soup)

scrape_news_date_range("migrant+caravans", "1/1/2017", "12/1/2017")
It doesn't return the same content as the second link I shared above; instead, it returns the content of a regular search.
I greatly appreciate any help with this! Thank you so much!
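One direction worth trying (a sketch, not a verified fix, since Google's result markup and parameters change often): let requests encode the tbs value instead of concatenating it into the URL, paginate with the start parameter, and send a browser-like User-Agent, which often changes the HTML Google serves:

import requests
from bs4 import BeautifulSoup

def scrape_news_date_range(query, min_date, max_date, page=0):
    # "start" steps through result pages (10 results per page); "tbs" carries
    # the custom date range (cdr/cd_min/cd_max) and sort-by-date (sbd) flags.
    params = {
        "q": query,
        "tbm": "nws",
        "tbs": "cdr:1,cd_min:{},cd_max:{},sbd:1".format(min_date, max_date),
        "start": page * 10,
    }
    # The default requests User-Agent often gets a stripped-down page; a
    # browser-like string is an assumption and may still not be enough.
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get("https://www.google.com/search", params=params, headers=headers)
    r.raise_for_status()
    return BeautifulSoup(r.text, "html.parser")

soup = scrape_news_date_range("migrant caravans", "1/1/2017", "12/31/2017")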
I am trying to create a function that scrapes college baseball team roster pages for a project. I have created a function that crawls the roster page and gets a list of the links I want to scrape, but when I try to scrape the individual links for each player, the request works but cannot find the data that is on their page.
This is the link to the page I am crawling from at the start:
https://gvsulakers.com/sports/baseball/roster
These are just functions that I call within the function that I am having a problem with:
import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined elsewhere (e.g. a User-Agent dict)

def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return lopr
Here is what I am having an issue with: when I assign type1_roster to a variable and print it, I only get an empty list. Ideally it should contain data about a player or players from a player's roster page.
# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
        return(find_data(player_))
A number of things:
I would pass the headers as a global
I think you are slicing the link one character too late for player_
You need to re-work the logic of find_data(), as the data is present in a mixture of element types rather than table/tr/td elements, e.g. it is found in spans. The HTML attributes are nice and descriptive and will support targeting content easily
You can target the player links from the landing page more tightly with the CSS selector list shown below. This removes the need for multiple loops as well as the use of list(set())
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=HEADERS)
    #print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)

def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
        return(find_data(player_))

print(type1_roster('gvsulakers'))
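If the intent is to return every player rather than stopping at the first link, a small variation (my assumption about the goal, not part of the answer above) is to collect the results instead of returning inside the loop:

def type1_roster_all(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # accumulate (first_name, bio) for every player instead of returning the first
    return [find_data(url + link[23:]) for link in player_links]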
I'm trying to collect all the links in the RSS feed of this Google News page using Beautiful Soup and append them to a list. I'm probably overcomplicating it, but I can't seem to do it with this loop, which iterates through a list of search terms I want to scrape Google News for.
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# terms: a list of search terms, defined elsewhere
for t in terms:
    raw_url = "https://news.google.com/rss/search?q=" + t + "&hl=en-US&gl=US&ceid=US%3Aen"
    url = raw_url.replace(" ", "-")
    req = Request(url)
    html_page = urlopen(req)
    soup = BeautifulSoup(html_page, "lxml")
    links = []
    links.append(re.findall("href=[\"\'](.*?)[\"\']", str(html_page), flags=0))
    print(links)
The list comes up empty every time. My regex is probably off...
Any ideas?
Let BeautifulSoup help you by extracting all of the <item> tags, but because the link is not part of an embedded tag, you need to do the rest by hand. This does what you want, I think.
from bs4 import BeautifulSoup
import requests

terms = ['abercrombie']

for t in terms:
    url = f"https://news.google.com/rss/search?q={t}&hl=en-US&gl=US&ceid=US%3Aen"
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.text, "lxml")
    for item in soup.find_all("item"):
        link = str(item)
        i = link.find("<link/>")
        j = link.find("<guid")
        print(link[i+7:j])
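As a side note (my addition, not part of the answer above): if lxml is installed, asking BeautifulSoup for its XML parser keeps the <link> elements intact, so the string slicing is not needed:

from bs4 import BeautifulSoup
import requests

url = "https://news.google.com/rss/search?q=abercrombie&hl=en-US&gl=US&ceid=US%3Aen"
# The "xml" parser does not self-close <link> the way the HTML parsers do,
# so item.link.text holds the article URL directly.
soup = BeautifulSoup(requests.get(url).text, "xml")
print([item.link.text for item in soup.find_all("item")])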
I am trying to search the SEC website to find the first occurrence of "10-Q" or "10-K" and retrieve the link found under the "Interactive Data" button on the website.
The URL that I am trying to retrieve the link from is:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=&dateb=20200506&owner=exclude&count=40
The result link should be:
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v
The code I am currently using:
import requests
from bs4 import BeautifulSoup

date1 = "20200506"
ticker = "AAPL"

URL = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + ticker + '&type=&dateb=' + date1 + '&owner=exclude&count=40'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='seriesDiv')
rows = results.find_all('tr')

for row in rows:
    document = row.find('td', string='10-Q')
    link = row.find('a', id="interactiveDataBtn")
    if None in (document, link):
        continue
    print(document.text)
    print(link['href'])
This code returns all the links for the 10-Qs, but it should cover both 10-Q and 10-K.
Can someone help me mold this code so that it only returns the link of the first occurrence of 10-Q or 10-K?
Thanks
The quickest solution is to use a lambda in the .find() method.
For example:
import requests
from bs4 import BeautifulSoup

date1 = "20200506"
ticker = "AAPL"

URL = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + ticker + '&type=&dateb=' + date1 + '&owner=exclude&count=40'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='seriesDiv')
rows = results.find_all('tr')

for row in rows:
    document = row.find(lambda t: t.name == 'td' and ('10-Q' in t.text or '10-K' in t.text))
    link = row.find('a', id="interactiveDataBtn")
    if None in (document, link):
        continue
    print(document.text)
    print('https://www.sec.gov' + link['href'])
Prints both 10-Q and 10-K links:
10-Q
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v
10-Q
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000010&xbrl_type=v
10-K
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-19-000119&xbrl_type=v
10-Q
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-19-000076&xbrl_type=v
10-Q
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-19-000066&xbrl_type=v
EDIT: To get only the first occurrence of each, you can use a dictionary. On each iteration, check whether the key (the string 10-Q or 10-K) is already in the dictionary and, if not, add it:
links = dict()

for row in rows:
    document = row.find(lambda t: t.name == 'td' and ('10-Q' in t.text or '10-K' in t.text))
    link = row.find('a', id="interactiveDataBtn")
    if None in (document, link):
        continue
    if document.text not in links:
        links[document.text] = 'https://www.sec.gov' + link['href']

print(links)
Prints:
{'10-Q': 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v',
'10-K': 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-19-000119&xbrl_type=v'}
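One practical caveat (my addition, not part of the original answer): EDGAR has since started rejecting requests that do not declare a User-Agent identifying the caller, so if page.content comes back empty, passing a header along these lines may help (the name and address below are placeholders):

# Placeholder contact details; EDGAR asks for a real name/email in the User-Agent.
page = requests.get(URL, headers={'User-Agent': 'Sample Name sample@example.com'})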
This code gets the page. My problem is that I need to scrape the content of users' comments, not the number of comments. It is nested inside the number-of-comments section, but I am not sure how I can access the link, parse through it, and scrape the user comments.
import time
import requests
from bs4 import BeautifulSoup

request_list = []
id_list = [0]

for i in range(0, 200, 25):
    response = requests.get("https://www.reddit.com/r/CryptoCurrency/?count=" + str(i) + "&after=" + str(id_list[-1]), headers={'User-agent': 'No Bot'})
    soup = BeautifulSoup(response.content, 'lxml')
    request_list.append(soup)
    id_list.append(soup.find_all('div', attrs={'data-type': 'link'})[-1]['data-fullname'])
    print(i, id_list)
    if i % 100 == 0:
        time.sleep(1)
In the code below I tried writing a function that is supposed to access the nested comments, but I have no clue.
def extract_comment_contents(request_list):
    comment_contents_list = []
    for i in request_list:
        if response.status_code == 200:
            for each in i.find_all('a', attrs={'data-inbound-url': '/r/CryptoCurrency/comments/'}):
                comment_contents_list.append(each.text)
        else:
            print("Call failed at request ", i)
    return comment_contents_list
fetch_comment_contents_list = extract_comment_contents(request_list)
print(fetch_comment_contents_list)
For each thread, you need to send another request to get the comments page. The url for the comments page can be found using soup.find_all('a', class_='bylink comments may-blank'). This will give all the a tags that have the url for the comments page. I'll show you one example to get to the comments page.
r = requests.get('https://www.reddit.com/r/CryptoCurrency/?count=0&after=0')
soup = BeautifulSoup(r.text, 'lxml')

for comments_tag in soup.find_all('a', class_='bylink comments may-blank', href=True):
    url = comments_tag['href']
    r2 = requests.get(url)
    soup = BeautifulSoup(r2.text, 'lxml')
    # Your job is to parse this soup object and get all the comments.
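For that last parsing step, one possible starting point (an assumption about the old-reddit markup, where comment bodies render inside div elements with the md class, so it also picks up the post's self-text and may need filtering):

# soup here is the comments-page soup from the loop above.
comment_texts = [div.get_text(" ", strip=True)
                 for div in soup.find_all('div', class_='md')]
print(comment_texts[:5])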
I'm trying to scrape a list of URLs from the European Parliament's Legislative Observatory. I do not type in any search keyword, in order to get all links to documents (currently 13172). I can easily scrape the first 10 results displayed on the website using the code below. However, I want all the links, so that I would not need to somehow press the next-page button. Please let me know if you know of a way to achieve this.
import requests, bs4, re

# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'

# function gets a list of links to the procedures
def links_to_procedures(url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text)  # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]  # getting a list of links of the procedure title
    return links

print(links_to_procedures(url_main))
You can follow the pagination by specifying the page GET parameter.
First, get the results count, then calculate the number of pages to process by dividing that count by the number of results per page. Then, iterate over the pages one by one and collect the links:
import re
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content, 'html.parser')

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print("Results found: " + str(num_results))

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

links = []
for page in range(1, num_results // results_per_page + 1):
    print("Current page: " + str(page))
    url = base_url.format(page=page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print(links)