I wrote a for-loop that I thought was extracting the text from the HTML elements I had selected using the BeautifulSoup library. It looks like this:
url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        full_link = urllib.parse.urljoin("https://www.researchgate.net/", link["href"])
        print(full_link)
    print(p.text)
I noticed that it was printing out more than what I had indicated in the body of the loop. After trying to debug each of the individual items (title, abstract, etc.), I realized the loop was not accessing them at all.
For example, if I commented them all out, or removed them entirely, it still gave the exact same output:
for p in papers:
    print(p.text)
    print("")
(This ^ gives me the exact same output as the code with the contents in the body.)
Somehow the loop is not even reading the elements it's supposed to use while iterating through p. How can I get it to recognize the code in the body of the loop and extract the desired elements as I have (or thought I had) defined them?
The problem is that you have a space in the class you specified:
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
I removed the space and your code worked, so retry it using this corrected line.
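For reference, here is a minimal sketch of the loop once the class string is fixed, with None-guards added before touching .text. Matching on a single distinctive class token is also more forgiving than the full class string, and note that ResearchGate's markup may have changed since this was asked:

import urllib.parse
import requests
from bs4 import BeautifulSoup as bsoup

url = "https://www.researchgate.net/profile/David_Severson"
soup = bsoup(requests.get(url).text, "lxml")

item = soup.find("div", {"class": "section section-research"})
papers = item.find_all("div", {"class": "nova-o-stack__item"})  # no stray space

for p in papers:
    # match on one distinctive class token instead of the whole class string
    title = p.find("div", class_="nova-v-publication-item__title")
    link = p.find("a", class_="nova-v-publication-item__type", href=True)
    if title:  # guard: find() returns None when nothing matches
        print(title.get_text(strip=True))
    if link:
        print(urllib.parse.urljoin("https://www.researchgate.net/", link["href"]))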
I am trying to run a loop in a web-scraping script that uses Beautiful Soup to extract data from this page. The loop goes through each div tag and extracts 4 different pieces of information. It searches an h3, a div, and 2 span tags. But when I add ".text" I get errors from 'date', 'soldprice', and 'shippingprice'. The error says:
AttributeError: 'NoneType' object has no attribute 'text'
I can get the text value from 'title', but nothing else when I put ".text" at the end of the line or in the print function. The script extracts the correct information when it is run; I just don't want the HTML tags.
results = soup.find_all("div", {"class": "s-item__info clearfix"}) #to separate the section of text for each item on the page
for item in results:
    product = {
        'title': item.find("h3", attrs={"class": "s-item__title s-item__title--has-tags"}).text,
        'date': item.find("div", attrs={"class": "s-item__title--tag"}),  # .find("span", attrs={"class": "POSITIVE"}),
        'soldprice': item.find("span", attrs={"class": "s-item__price"}),
        'shippingprice': item.find("span", attrs={"class": "s-item__shipping s-item__logisticsCost"}),
    }
    print(product)
The problem is that before the offers there is another div with class="s-item__info clearfix", but without date, soldprice, or shippingprice.
You have to add a find() so that you search only inside the offers:
results = soup.find('div', class_='srp-river-results clearfix').find_all("div", {"class": "s-item__info clearfix"})
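Putting it together, a minimal sketch of the loop with that narrower starting point, plus a small guard so .text is only taken from elements that were actually found (the class names are copied from the question and may no longer match eBay's current markup):

def text_or_none(tag):
    # return stripped text if find() matched, else None instead of raising
    return tag.get_text(strip=True) if tag else None

results = soup.find('div', class_='srp-river-results clearfix').find_all(
    "div", {"class": "s-item__info clearfix"})

for item in results:
    product = {
        'title': text_or_none(item.find("h3", attrs={"class": "s-item__title s-item__title--has-tags"})),
        'date': text_or_none(item.find("div", attrs={"class": "s-item__title--tag"})),
        'soldprice': text_or_none(item.find("span", attrs={"class": "s-item__price"})),
        'shippingprice': text_or_none(item.find("span", attrs={"class": "s-item__shipping s-item__logisticsCost"})),
    }
    print(product)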
When I run this code:
url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = bsoup(data, "lxml")
for a in soup.find_all('a', {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True):
    print(a['href'])
It returns all of the links no problem.
When I make it part of a more complicated loop with other elements:
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    print(p['href'])
    print(p.text)
It no longer returns the href (link) that I want, and instead gives me KeyError: 'href'.
Why is it no longer returning the links?
Fetch the href from the link element, not from p.
Ex:
from bs4 import BeautifulSoup
import requests
url = "https://www.researchgate.net/profile/David_Severson"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
item = soup.find("div", {"class": "section section-research"})
papers = [paper for paper in item.find_all("div", {"class": "nova-o-stack__item"})]
for p in papers:
    title = p.find("div", {"class": "nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__title nova-v-publication-item__title--clamp-3"})
    abstract = p.find("div", {"class": "nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-publication-item__description nova-v-publication-item__description--clamp-3"})
    views = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__metrics"})
    date = p.find("li", {"class": "nova-e-list__item publication-item-meta-items__meta-data-item"})
    authors = p.find("ul", {"class": "nova-e-list nova-e-list--size-m nova-e-list--type-inline nova-e-list--spacing-none nova-v-publication-item__person-list"})
    link = p.find("a", {"class": "nova-e-badge nova-e-badge--color-green nova-e-badge--display-block nova-e-badge--luminosity-high nova-e-badge--size-l nova-e-badge--theme-solid nova-e-badge--radius-m nova-v-publication-item__type"}, href=True)
    if link:
        print(link["href"])
Your p in papers is a div element, not an a element as in the previous code snippet; that is why you get the href KeyError. You probably want link['href'], assuming link is not None.
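As an aside, a sketch of a looser match that avoids the brittle full class string, assuming the distinctive token nova-v-publication-item__type is still present in the markup:

# a single class token matches any tag whose class list contains it
link = p.find("a", class_="nova-v-publication-item__type", href=True)
if link:
    print(link["href"])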
I started learning Python today and so it is not a surprise that I am struggling with some basics. I am trying to parse data from a school website for a project and I managed to parse the first page. However, there are multiple pages (results are paginated).
I have an idea about how to go about it, i.e., run through the URLs in a loop since I know the URL format, but I have no idea how to proceed. I figured it would be better to somehow search for the "next" button and run the function if it is there, and stop otherwise.
I would appreciate any help I can get.
import requests
from bs4 import BeautifulSoup
url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})
for item in g_data:
    for li in item.findAll('li'):
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        for resultContact in li.findAll('ul', {"class": "resultContact"}):
            for resultContact in li.findAll('a', {"class": "resultMainNumber"}):
                print(resultContact.text)
First, you can assume a maximum number of pages in the directory (if you know the pattern of the URL). I am assuming the URL is of the form http://base_url/page. Then you can write this:
base_url = 'http://www.myschoolwebsite.com'
total_pages = 100
def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})
    for item in g_data:
        for li in item.findAll('li'):
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            for resultContact in li.findAll('ul', {"class": "resultContact"}):
                for resultContact in li.findAll('a', {"class": "resultMainNumber"}):
                    print(resultContact.text)

for page in range(1, total_pages):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break
    parse_content(response)
I would make an array with all the URLs and loop through it, or if there is a clear pattern, write a regex to search for that pattern.
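If you would rather follow the "next" button, as the question suggests, here is a minimal sketch; the rel="next" attribute on the link is an assumption about the site's markup, so adjust the selector to whatever the real button looks like:

import requests
from bs4 import BeautifulSoup

url = 'http://www.myschoolwebsite.com/1'
while url:
    r = requests.get(url)
    parse_content(r)  # the function defined above
    soup = BeautifulSoup(r.content, 'lxml')
    # hypothetical selector: a link marked rel="next"
    next_link = soup.find('a', rel='next')
    url = next_link['href'] if next_link else None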
I am trying to scrape some review data with beautiful soup, and it will only let me grab a single element:
from urllib.request import urlopen

BASE_URL = "http://consequenceofsound.net/category/reviews/album-reviews/"
html = urlopen(BASE_URL + section_url).read()
soup = BeautifulSoup(html, "lxml")
meta = soup.find("div", {"class": "content"}).h1
wordage = [s.contents for s in meta]
This lets me grab a single review's title from the page. When I change find to find_all, though, I can no longer access h1 on that line, so I end up with code like this:
meta = soup.find("div", {"class": "content"})
wordage = [s.h1 for s in meta]
and I'm unable to find a way to isolate the contents.
meta = soup.find_all("div", {"class": "content"})
wordage = [s.h1 for s in meta if s.h1 not in ([], None)]
link = [s.a['href'] for s in wordage]
Note the addition of the 'not in' test. It seems that, on occasion, empty lists and None values end up in the soup, so this is an important safeguard.
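Alternatively, a short sketch using a CSS selector to pull only the h1 anchors that actually exist, under the same assumptions about the page's markup:

# select only anchors inside h1 headings within div.content
anchors = soup.select("div.content h1 a[href]")
titles = [a.get_text(strip=True) for a in anchors]
links = [a['href'] for a in anchors]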
I am scraping an article with BeautifulSoup. I want to scrape all of the p tags within the article body, except for a certain section. I was wondering if someone could give me a hint as to what I am doing wrong. I don't get an error; it just doesn't produce anything different. At the moment it grabs the word "Print" from the undesirable section and prints it with the other p tags.
Section I want to ignore: soup.find("div", {'class': 'add-this'})
url: http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Retrieve all of the paragraphs
tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
for tag in tags:
    ptags = soup.find("div", {'class': 'add-this'})
    for tag in ptags:
        txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
    else:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')
One option is to just pass recursive=False so you don't search for p tags inside any other elements of the fullstory div:
tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
for tag in tags:
    print(tag.text)
This grabs only the top-level paragraphs from the div and prints the complete article:
10 April 2014 The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
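Another option, sketched here: remove the unwanted section from the tree first with decompose(), and then every remaining p tag in the story is safe to take:

# drop the share/print widget before collecting paragraphs
unwanted = soup.find("div", {'class': 'add-this'})
if unwanted:
    unwanted.decompose()

for tag in soup.find("div", {'id': 'fullstory'}).find_all('p'):
    print(tag.text)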