I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine with BS4 in Python, but ideally each link would come with the year, season, title, and episode associated with it.
I've started with the code below, but I don't know how to loop through the page to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class': 'md wiki'}  # so it can be reused later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try the following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
It's better to put the second argument of find into a dict; find_all will return all elements that match your search.
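Building on that, here's a minimal sketch of how you might pair each strong season/series heading with the table that follows it and pull out the links with their context. It assumes the headings and tables appear in matching order, one table per heading, which you'd want to verify against the page's actual markup:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')

wiki = soup.find('div', {'class': 'md wiki'})
titles = wiki.find_all('strong')
tables = wiki.find_all('table')

# Assumption: headings and tables alternate in matching order on the page.
for title, table in zip(titles, tables):
    for a in table.find_all('a', href=True):
        # a.text is typically the episode label; title.text supplies the
        # season/series context for that link.
        print(title.text, '|', a.text, '|', a['href'])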
Here is the website from which I want to scrape the number of reviews.
I want to extract the number 272, but it returns None every time.
I have to use BeautifulSoup.
I tried:
import requests
from bs4 import BeautifulSoup

sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)
The output is an empty tag, and I want to go further inside that tag.
The number of reviews is retrieved dynamically from a URL you can find in the network tab. You can simply extract it from response.text with a regex. The endpoint is part of a defined ajax handler.
You can find a lot of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js
You can trace back through a whole lot of jQuery if you really want. tl;dr: I think you only need to add the product_id to a constant string. For example:
import requests, re

p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']

with requests.Session() as s:
    for product_id in ids:
        r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
        print(int(p.findall(r.text)[0]))
I just started programming.
I have the task of extracting data from an HTML page into Excel, using Python 3.7.
My problem is that the website contains more URLs, and behind those URLs are yet more URLs; I need the data behind the third level of URLs.
My first problem is: how can I tell the program to choose only specific links from a ul, rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use find_all and be specific about tags like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry, I can't make comments because of <50 reputation, or I would have.
Updated answer based on my understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

# First level: navigation links, keeping only those that lead to "bausteine" pages
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup2 = BeautifulSoup(response, 'html.parser')

        # Second level: the base-page link inside each Baustein page
        secondlink = "https://www.bsi.bund.de/" + str(soup2.find("a", {"class": "RichTextIntLink Basepage"})["href"].split(';')[0])
        res = urllib.request.urlopen(secondlink).read()
        soup3 = BeautifulSoup(res, 'html.parser')

        # Third level: the actual content div
        listoftext = soup3.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
I am trying to get the blog content from this blog post, and by content I just mean the first six paragraphs. This is what I've come up with so far:

import requests
from bs4 import BeautifulSoup

url = 'http://www.fashionpulis.com/2017/08/being-proud-too-soon.html'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
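The question asks for only the first six paragraphs. How to split depends on the blog's markup, so here is a minimal sketch, assuming the paragraphs come out of get_text as blank-line-separated blocks (you'd confirm this against the real post body, and iterate p tags instead if the post uses them):

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')

for item in soup.select("div[id^='post-body-']"):
    # Split the body text into blocks and keep the first six non-empty ones.
    blocks = [b.strip() for b in item.get_text('\n').split('\n') if b.strip()]
    for block in blocks[:6]:
        print(block)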
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python. However, I haven't found any query string parameters to work with; maybe you can start something out of this approach.
What I find most obvious to do right now is something like this (a sketch of the first two steps follows the list):
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in a previous answer
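Here is a minimal sketch of the first two steps, assuming the monthly archive pages already link to each post with its full http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html URL (the selector is a guess to be checked against the real archive markup):

import requests
from bs4 import BeautifulSoup

post_urls = []
year = 2017
for month in range(1, 13):
    archive_url = f"http://www.fashionpulis.com/{year}/{month:02d}/"
    soup = BeautifulSoup(requests.get(archive_url).text, 'html.parser')
    # Assumption: post links on the archive page point straight at the
    # /$YEAR/$MONTH/$TITLE.html pages, so we just collect matching hrefs.
    for a in soup.select("a[href*='.html']"):
        href = a['href']
        if f"/{year}/{month:02d}/" in href and href not in post_urls:
            post_urls.append(href)

print(len(post_urls), 'post URLs collected')
# Each URL can then be fed to the scraping code from the previous answer.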
For an extracurricular school project, I'm learning how to scrape a website. As you can see from the code below, I am able to scrape a form called 'elqFormRow' off of one page.
How would one go about scraping all occurrences of 'elqFormRow' on the whole website? I'd like to return the URL where each form was located into a list, but I'm running into trouble while doing so because I don't know how.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.find_all('a', href=True)
urls = []
for tag in tags:
    if domain in tag['href']:
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
urllib stinks; use requests. If you need to go multiple levels down into the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse.
Scrape responsibly: try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also avoid putting the page you want to scrape in the question.
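Here is a minimal sketch of that multi-level idea, staying on the same domain and crawling a fixed number of levels (the function names and the levels parameter are illustrative, not part of the code above):

import bs4 as bs
import requests

domain = "engage.hpe.com"

def find_urls(url):
    # Return the same-domain links found on a single page.
    soup = bs.BeautifulSoup(requests.get(url).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True) if domain in a['href']]

def crawl(start_url, levels):
    seen = {start_url}
    frontier = [start_url]
    # Call the URL-finding step once per level, de-duplicating as we go
    # so the same page is never fetched twice.
    for _ in range(levels):
        next_frontier = []
        for url in frontier:
            for found in find_urls(url):
                if found not in seen:
                    seen.add(found)
                    next_frontier.append(found)
        frontier = next_frontier
    return seen

urls = crawl('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage', levels=2)
print(urls)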
I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it takes the link from the title as well as from the item colour. I have tried a few things, but I cannot get BS to parse only the title link and produce a list with one of each link instead of two. I imagine it's something really simple that I'm missing. Thanks in advance.
This code will give you the result without duplicates (using set() may also be a good idea, as @Tarum Gupta suggested), but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Gets all divs with class inner-article, then searches for an a tag with
# the name-link class that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.find_all("div", {"class": "inner-article"})
linksList1 = []
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
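Note that set() does not preserve order, so if the order of the links matters, an order-preserving de-duplication is a better fit. A small sketch using a standard Python idiom (dicts keep insertion order in Python 3.7+), not part of the answer above:

# Keeps the first occurrence of each link, in the original order.
uniqueLinks = list(dict.fromkeys(linksList1))
print(uniqueLinks)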