Scraping first post from a phpBB3 forum with Python - python

I have a link like this:
http://www.arabcomics.net/phpbb3/viewtopic.php?f=98&t=71718
The first post in that phpBB3 topic contains links. How do I get the links from the first post? I tried this, but it isn't working:
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.arabcomics.net/phpbb3/viewtopic.php?f=98&t=71718'
response = requests.get(url)
soup = bs(response.text, 'html5lib')
itemstr = soup.findAll('div', {'class': 'postbody'})
for link in itemstr.findAll('a'):
    links = link.get('href')
    print(links)

Big oof my man, just use a regex for this. There's no need for bs4, and the regex will keep working even if they redesign the site.
import re
myurlregex=re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))\" class=\"postlink\"''')
url = re.findall(myurlregex,response.text)[0]
Also, as a coder, regex is a skill you will always need.
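For completeness: the reason the original BeautifulSoup attempt fails is that findAll returns a ResultSet (a list of elements), and a ResultSet has no findAll method of its own; you have to index into it or loop over it. A minimal sketch of that fix, run against an inline HTML snippet (hypothetical markup standing in for the live forum page):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text from the live forum page (hypothetical markup).
html = """
<div class="postbody"><a class="postlink" href="http://example.com/a">one</a>
<a class="postlink" href="http://example.com/b">two</a></div>
<div class="postbody"><a class="postlink" href="http://example.com/c">three</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# findAll returns a ResultSet (a list), so index [0] to get the first post...
first_post = soup.findAll("div", {"class": "postbody"})[0]
# ...and only then call findAll("a") on that single element.
links = [a.get("href") for a in first_post.findAll("a")]
print(links)
```

On the real page the same pattern applies, with response.text in place of the inline snippet.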

Related

Getting a string in a specific tag with Beautiful Soup

I'm trying to get all the titles from the website https://webscraper.io/test-sites, using Beautiful Soup. The title (in this case 'E-commerce site') is always contained in the following part of the markup:
<h2 class="site-heading">
<a href="/test-sites/e-commerce/allinone">
E-commerce site
</a>
</h2>
I can't get at that part. I've already tried different things, but, for example, the code that seems most intuitive to me doesn't work:
import re
from bs4 import BeautifulSoup
import requests
url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html)
string = soup.find_all("h2", string=re.compile("E-commerce")
How can I get just the titles, in this case 'E-commerce site', into a list?
You are close, but there are a few issues:
Your find_all(...) line is missing a closing parenthesis, so it is a syntax error as written.
You are not passing any parser when parsing r_html; I have used html.parser here.
I don't see any need for regex (re) in your problem.
The titles are inside h2 tags with the class name site-heading, so you can select them directly.
This code selects all the titles and prints them:
from bs4 import BeautifulSoup
import requests

url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
headings = soup.find_all("h2", class_='site-heading')
for i in headings:
    print(i.text.strip())
Output:
E-commerce site
E-commerce site with pagination links
E-commerce site with popup links
E-commerce site with AJAX pagination links
E-commerce site with "Load more" buttons
E-commerce site that loads items while scrolling
Table playground
import requests
from bs4 import BeautifulSoup

url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
h2s = soup.find_all("h2")
for h2 in h2s:
    print(h2.text.strip())
This will give you the text of every h2 on the page.
Let me know if this helps you.
If I understand you correctly, you want to get a list of all the titles available. You could do something like this:
titles = [x.getText().strip() for x in soup.find_all("h2", {"class": "site-heading"})]

Webscraping using BeautifulSoup (Jupyter Notebook)

Good afternoon,
I am fairly new to web scraping. I am trying to scrape a dataset from an open data portal, just to figure out how to scrape a website.
I am trying to scrape a dataset from data.toerismevlaanderen.be.
This is the dataset I want: https://data.toerismevlaanderen.be/tourist/reca/beer_bars
I always end up with an HTTP error: HTTP Error 404: Not Found
This is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://data.toerismevlaanderen.be/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.findAll('a')
one_a_tag = soup.findAll('a')[35]
link = one_a_tag['href']
download_url = 'https://data.toerismevlaanderen.be/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/tourist/reca/beer_bars_')+1:])
time.sleep
What am I doing wrong?
The issue is the following:
link = one_a_tag['href']
print(link)
This returns a link: https://data.toerismevlaanderen.be/
Then you are adding this link to download_url by doing:
download_url = 'https://data.toerismevlaanderen.be/'+ link
Therefore, if you print(download_url), you get:
https://data.toerismevlaanderen.be/https://data.toerismevlaanderen.be/
which is not a valid URL.
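As an aside (not part of the original answer), the standard library's urllib.parse.urljoin handles exactly this situation: it returns absolute hrefs unchanged and resolves relative ones against the base URL, so you never end up gluing two absolute URLs together. A small sketch:

```python
from urllib.parse import urljoin

base = 'https://data.toerismevlaanderen.be/'

# An absolute href is returned unchanged...
print(urljoin(base, 'https://data.toerismevlaanderen.be/tourist/reca/beer_bars'))
# ...while a relative href is resolved against the base URL.
print(urljoin(base, 'tourist/reca/beer_bars'))
```

Both calls print the same valid URL, https://data.toerismevlaanderen.be/tourist/reca/beer_bars.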
UPDATE BASED ON COMMENTS
The issue is that there is no tourist/activities/breweries anywhere in the text you scrape.
If you write:
for link in soup.findAll('a'):
    print(link.get('href'))
you see all the a href values, and none of them contains tourist/activities/breweries.
But if you want just the link data.toerismevlaanderen.be/tourist/activities/breweries, you can do:
download_url = link + "tourist/activities/breweries"
There is an API for this so I would use that
e.g.
import requests
r = requests.get('https://opendata.visitflanders.org/tourist/reca/beer_bars.json?page=1&page_size=500&limit=1').json()
You get many absolute links in return. Adding them to the original URL for new requests therefore won't work; simply requesting the 'link' you grabbed will work instead.

Exporting data from HTML to Excel

I just started programming.
My task is to extract data from an HTML page into Excel, using Python 3.7.
My problem is that I have a website with more URLs inside, and behind those URLs are yet more URLs. I need the data behind the third URL.
My first problem is: how can I tell the program to choose only specific links from a ul, rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use find_all and be specific about the tags, like a, just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry, I can't make comments because I have <50 reputation, or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';'))[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)

How to check whether the url is of format "/askwiki/questions/<any number>" in Python

I am trying to learn web scraping using BeautifulSoup and Python.
I scraped a list of URLs from a website, and I want to display the text of all the links that are in the format "/askwiki/questions/<some number>", like "/askwiki/questions/4" or "/askwiki/questions/123".
import requests
from bs4 import BeautifulSoup

url = 'http://unistd.herokuapp.com/askrec'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("a")
for link in links:
    if #url is of my desired format
        print link.text
What should I write in the if statement?
I am new to Python as well as to web scraping. It may be a really stupid question, but I can't work out what to write there.
I tried things like:
if "/askwiki/questions/[0-9]+ " in link.get("href"):
if "/askwiki/questions/[0-9]?" in link.get("href"):
but it's not working.
P.S. There are other links too, like '/askwiki/questions/tags' and '/askwiki/questions/users'.
Edit: Using regex to identify only those with numbers at the end.
import re

for link in links:
    url = str(link.get('href'))
    if re.findall(r'/askwiki/questions/\d+', url):
        print(link)
You're on the right track! The missing component is the re module.
I think what you want is something like this:
import re

matcher = re.compile(r"/askwiki/questions/[0-9]+")
if matcher.search(link.get("href")):
    print(link.text)
Alternatively, you can just drop the number component, if you're only really looking for links with "/askwiki/questions" in:
if "/askwiki/questions" in link.get("href"):
    print(link.text)
Try something like:
for link in links:
    href = link.get("href")
    if href.startswith("/askwiki/questions/"):
        print(link.text)
If you want to use a regex (i.e. what you have, [0-9]+), you have to import the re library. Check out the re documentation on finding patterns!

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python. However, I haven't found any query string parameters to work with; maybe you can build something out of this approach.
What I find most obvious to do right now is something like this:
Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
Build the URLs using the titles and the according months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
Scrape the text as described by Shahin in a previous answer
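A minimal sketch of step 2, building post URLs from scraped (year, month, title) triples; the sample triples below are made-up placeholders, except the first, which is the post from the question:

```python
# (year, month, title-slug) triples as they would come out of step 1;
# these sample values are placeholders for illustration.
posts = [
    ("2017", "08", "being-proud-too-soon"),
    ("2017", "03", "some-other-title"),
]

# The URL pattern is always /$YEAR/$MONTH/$TITLE.html
base = "http://www.fashionpulis.com/{year}/{month}/{title}.html"
urls = [base.format(year=y, month=m, title=t) for (y, m, t) in posts]

for u in urls:
    print(u)
```

Each URL can then be fed to the scraping code from step 3.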
