Follow link in forum to scrape thread (comments) using BS4 - python

I have a forum with 3 threads. I am trying to scrape the data in all three posts, so I need to follow the href link to each post and scrape the data. This is giving me an error and I'm not sure what I am doing wrong...
import csv
import time
from bs4 import BeautifulSoup
import requests
source = requests.get('https://mainforum.com').text
soup = BeautifulSoup(source, 'lxml')
#get the thread href (thread_link)
for threads in soup.find_all('p', class_='small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')  # there are three threads and this gets all 3 links
    print(thread_link)
The rest of the code below is where I am having the issue:
# request the individual thread links
for follow_link in thread_link:
    response = requests.get(follow_link)
    # parse thread link
    soup = BeautifulSoup(response, 'lxml')
    # print data
    for p in soup.find_all('p'):
        print(p)

As to your schema error...
You're getting the schema error because you are overwriting the same variable with one link over and over. Then you attempt to iterate over that link as if it were a list of links. At this point it is a string, and you just iterate through its characters (starting with 'h'), hence the error.
See here: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied
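To illustrate the fix before the fuller answer below, here is a minimal sketch (continuing from the question's code and the same placeholder forum URL, so the selectors are assumptions): collect the hrefs into a list first, then loop over that list.
thread_links = []
for threads in soup.find_all('p', class_='small'):
    thread_links.append(threads.a.get('href'))   # keep all three hrefs instead of overwriting one variable

for follow_link in thread_links:                 # now iterates over links, not over the characters of a string
    response = requests.get(follow_link)
    thread_soup = BeautifulSoup(response.content, 'lxml')
    for p in thread_soup.find_all('p'):
        print(p.text)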
As to the general query and how to solve something like this...
If I were to do this, the flow would go as follows:
Get the three hrefs (similar to what you've already done)
Use a function that scrapes the thread hrefs individually and returns whatever you want them to return
Save/append that returned information wherever you want.
Repeat
Something like this, perhaps:
import csv
import time
from bs4 import BeautifulSoup
import requests
source = requests.get('https://mainforum.com')
soup = BeautifulSoup(source.content, 'lxml')
all_thread_info = []
def scrape_thread_link(href):
    response = requests.get(href)
    # parse thread link
    soup = BeautifulSoup(response.content, 'lxml')
    # return data
    return [p.text for p in soup.find_all('p')]
#get the thread href (thread_link)
for threads in soup.find_all('p', class_='small'):
    this_thread_info = {}
    this_thread_info["thread_name"] = threads.text
    this_thread_info["thread_link"] = threads.a.get('href')
    this_thread_info["thread_data"] = scrape_thread_link(this_thread_info["thread_link"])
    all_thread_info.append(this_thread_info)
print(all_thread_info)
There's quite a lot left unspecified in the original question, so I made some assumptions. Hopefully you can see the gist, though.
Also note I prefer to use the .content of the response instead of .text.

@Darien Schettler I made some changes/adjustments to the code; I'd love to hear if I messed up somewhere.
all_thread_info = []
def scrape_thread_link(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content, 'lxml')
    for Thread in soup.find_all(id='discussionReplies'):
        Thread_Name = Thread.find_all('div', class_='xg_user_generated')
        for Posts in Thread_Name:
            print(Posts.text)
for threads in soup.find_all('p', class_='small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.append(thread_data)
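One observation about the adjusted code (an editorial note, not from the original thread): scrape_thread_link now prints the posts but never returns anything, so thread_data will be None for every entry appended to all_thread_info. A minimal sketch of a version that returns the posts instead:
def scrape_thread_link(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content, 'lxml')
    posts = []
    for thread in soup.find_all(id='discussionReplies'):
        for post in thread.find_all('div', class_='xg_user_generated'):
            posts.append(post.text)   # collect each post's text instead of only printing it
    return posts                      # so thread_data holds the posts rather than None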

Related

How to select all links of apps from app store and extract its href?

from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = f'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl through all links of the shown apps. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but the links list is empty.
What should I do?
Just check this code, I think it is what you want:
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
You can change this part:
for link in soup.find_all("a", href=pattern):
    # do something
to check for a keyword, I think.
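For example, a rough sketch of that keyword check (the keyword, URL, and attribute handling here are assumptions, not taken from the question):
import re
import requests
from bs4 import BeautifulSoup

keyword = "youtube"                            # hypothetical keyword to filter on
pattern = re.compile("^(/)")                   # same relative-link pattern as above
html = requests.get("your_URL").text           # placeholder URL, as in the gist
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=pattern):
    href = link.attrs.get("href", "")
    if keyword in href.lower():                # keep only links whose URL mentions the keyword
        print(href)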
You are cooking a soup, so first of all taste it and check whether everything you expect is actually in it.
The ResultSet of your selection is empty because the structure in the response differs a bit from the one you expected based on the developer tools.
To get the list of links, select more specifically:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']

beautifulsoup returns none when using find() for an element

I'm trying to scrape this site to retrieve the years of each paper that's been published. I've managed to get titles to work, but when it comes to scraping the years it returns none.
I've broken it down, and the results of 'none' occur when it goes into the for loop, but I can't figure out why this happens when it worked with titles.
import requests
from bs4 import BeautifulSoup
URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
When it gets to paperResults, it gives the breakdown of the section I've selected within the results from the line above.
Any suggestions on how to retrieve the years would be greatly appreciated.
Change this
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
To this
for singlepaper in paperResults:
    paperyear = singlepaper.find('span', itemprop="datePublished")
    print(paperyear.string)
You were looking for a class when you needed to be parsing a span... if you print paperResults you will see that datePublished is an itemprop in a span element.
Try this:
import requests
from bs4 import BeautifulSoup
URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    print(paperyear)
It worked for me.

Exporting data from HTML to Excel

I just started programming.
I have the task of extracting data from an HTML page to Excel.
I am using Python 3.7.
My problem is that I have a website with more URLs inside it.
Behind these URLs are again more URLs.
I need the data behind the third URL.
My first problem would be: how can I tell the program to choose only specific links from a ul rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
links= link.get("href")
if "katalog" in links:
for link in soup.find_all("a", href=re.compile("alle_")):
links = link.get("href")
print(soup.get_text())
There are many ways; one is to use "find_all" and try to be specific with the tags like "a", just like you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract. We would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a",{"class":"RichTextIntLink NavNode"}):
firstlinks = firstlink.get("href")
if "bausteine" in firstlinks:
bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
response = urllib.request.urlopen(bausteinelinks).read()
soup = BeautifulSoup(response, 'html.parser')
secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a",{"class":"RichTextIntLink Basepage"})["href"]).split(';'))[0])
res = urllib.request.urlopen(secondlink).read()
soup = BeautifulSoup(res, 'html.parser')
listoftext = soup.find_all("div",{"id":"content"})
for text in listoftext:
print (text.text)

Python BeautifulSoup Paragraph Text only

I am very new to anything web-scraping related, and as I understand it, Requests and BeautifulSoup are the way to go for that.
I want to write a program which emails me only one paragraph of a given link every couple of hours (trying a new way to read blogs through the day).
Say this particular link 'https://fs.blog/mental-models/' has a paragraph each on different models.
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Now soup has a wall of bits before the paragraph text begins: <p> this is what I want to read </p>
soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?
Thanks
Loop over the soup.findAll('p') to find all the p tags and then use .text to get their text:
Furthermore, do all that under a div with the class rte since you don't want the footer paragraphs.
from bs4 import BeautifulSoup
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
divTag = soup.find_all("div", {"class": "rte"})
for tag in divTag:
    pTags = tag.find_all('p')
    for tag in pTags[:-2]:  # to trim the last two irrelevant-looking lines
        print(tag.text)
OUTPUT:
Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
 
If you want the text of all the p tags, you can just loop over them using the find_all method:
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup)
data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)
EDIT:
Here is the code to have them separately in a list. You can then apply a loop to the result list to remove empty strings, unused characters like \n, etc.
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
print(result)
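As a rough sketch of that cleanup step (assuming "unused characters" means the surrounding whitespace and newlines):
# continuing from the `result` list built above
cleaned = [text.strip() for text in result if text.strip()]   # trim \n and spaces, drop empty strings
print(cleaned)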
Here is the solution:
from bs4 import BeautifulSoup
import requests
import Clock
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
Clock.schedule_interval(print(result), 60)
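Note that, as written, print(result) runs immediately and schedule_interval receives its return value (None) rather than a callable, so nothing is actually scheduled. A plain standard-library sketch of the intended periodic behaviour (keeping the 60-second interval from the snippet above):
import time

while True:
    print(result)    # `result` as built by the scraping code above
    time.sleep(60)   # wait a minute between prints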

BeautifulSoup webscraper issue: can't find certain divs/tables

I'm having issues with scraping pro-football-reference.com. I'm trying to access the "Team Offense" table but can't seem to target the div/table.
The best I can do is:
soup.find('div', {'id': 'all_team_stats'})
which doesn't return the table nor its immediate div wrapper. The following attempts return None:
soup.find('div', {'id':'div_team_stats'})
soup.find('table', {'id':'team_stats'})
I've already scraped different pages simply by:
soup.find('table', {'id': 'table_id'})
but I can't figure out why it's not working on this page. Below is the code I've been working with. Any help is much appreciated!
from bs4 import BeautifulSoup
import urllib2
def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(page, 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")
    tableStats = soup.find('table', {'id': 'team_stats'})
    return tableStats

print get_player_totals()
EDIT:
Thanks for all the help everyone. Both of the provided solutions below have been successful. Much appreciated!
Just remove the comments with re.sub before you pass to bs4:
from bs4 import BeautifulSoup
import urllib2
import re
comm = re.compile("<!--|-->")
def make_soup(url):
    page = urllib2.urlopen(url)
    soupdata = BeautifulSoup(comm.sub("", page.read()), 'lxml')
    return soupdata

def get_player_totals():
    soup = make_soup("http://www.pro-football-reference.com/years/2015/")
    tableStats = soup.find('table', {'id': 'team_stats'})
    return tableStats

print get_player_totals()
You will see the table when you run the code.
A couple of thoughts here: first, as mentioned in the comments, the actual table is commented out and is not, per se, part of the DOM (i.e., not accessible to the parser as-is).
In this situation, you can loop over the comments found and try to get the information via regular expressions (though this is heavily discussed and mostly disliked on Stack Overflow; see here for more information). Last but not least, I would recommend requests rather than urllib2.
That being said, here is a working code example:
from bs4 import BeautifulSoup, Comment
import requests, re
def make_soup(url):
    r = requests.get(url)
    soupdata = BeautifulSoup(r.text, 'lxml')
    return soupdata

soup = make_soup("http://www.pro-football-reference.com/years/2015/")

# get the comments
comments = soup.findAll(text=lambda text: isinstance(text, Comment))

# look for the table with the id "team_stats"
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')
for comment in comments:
    try:
        table = rx.search(comment.string).group(0)
        print(table)
        # break the loop if found
        break
    except:
        pass
Always favour a parser solution over this one; the regex part especially is very error-prone.
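For reference, a minimal sketch of such a parser-based variant (an adaptation of the answer above, not code from the thread): re-parse each comment's text with BeautifulSoup and use find() instead of a regex.
from bs4 import BeautifulSoup, Comment
import requests

soup = BeautifulSoup(requests.get("http://www.pro-football-reference.com/years/2015/").text, 'lxml')

# parse each HTML comment as its own document and look for the table there
for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    table = BeautifulSoup(comment, 'lxml').find('table', {'id': 'team_stats'})
    if table is not None:
        print(table)
        break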
