Retrieve all strings in a webpage in Python

I am trying to retrieve all strings from a webpage using BeautifulSoup and return a list of all the retrieved strings.
I have two approaches in mind:
1) Find all elements whose text is not empty, append the text to a result list, and return it. I am having a hard time implementing this, as I couldn't find a way to do it in BeautifulSoup.
2) Use BeautifulSoup's "find_all" method to find all the tags I am looking for, such as "p" for paragraphs, "a" for links, etc. The problem I am facing with this approach is that, for some reason, find_all returns duplicated output. For example, if a website has a link with the text "Get Hired", I receive "Get Hired" more than once in the output.
I am honestly not sure how to proceed from here, and I have been stuck for several hours trying to figure out how to get all strings from a webpage.
Would really appreciate your help.

Use .stripped_strings to get all the strings with surrounding whitespace stripped.
.stripped_strings - Read the Docs.
Here is the code that returns a list of strings present inside the <body> tag.
import requests
from bs4 import BeautifulSoup

url = 'YOUR URL GOES HERE...'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# Every piece of text inside <body>, with surrounding whitespace stripped
b = soup.find('body')
list_of_strings = [s for s in b.stripped_strings]
list_of_strings will have a list of all the strings present in the URL.
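If duplicates still appear (the problem described in the question), one way to deduplicate while preserving document order is dict.fromkeys; this is a minimal addition to the answer above, not part of it:
# Dict keys are unique and insertion-ordered, so this dedupes in order
unique_strings = list(dict.fromkeys(list_of_strings))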

Post the code that you've used.
If I remember correctly, something like this should get the complete page into one variable, page, and the raw HTML of the page will then be available as page.text (note this is the HTML source, not just the visible text):
import requests

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)  # the raw HTML source of the page

Related

Find a string in a string which starts and ends with different string in Python

I have the complete HTML of a page, and from it I need to find its GA (Google Analytics) ID. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From the above string I need to get UA-4444444444-1, which starts with "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't have any success. Please let me know what mistake I am making.
Thanks
It seems that you are overthinking it; you could just search for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
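For example, run against the sample string from the question:
import re

raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
print(re.findall(r"UA-\d+-\d+", raw_html))  # prints ['UA-4444444444-1']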
Never use regex to parse the HTML itself; BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML, then apply the regex to the text located inside them.
import re
from bs4 import BeautifulSoup

html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = BeautifulSoup(html, 'lxml')
pattern = re.compile(r"UA-[0-9]+-[0-9]+")

ids = []
for script in soup.find_all("script"):
    # findall returns an empty list for scripts without an ID, so extend is safe
    ids.extend(pattern.findall(script.text))
print(ids)  # ['UA-4444444444-1']

How to get all links containing a phrase from a changing website

I want to retrieve all links from a website that contain a specific phrase.
An example on a public website would be to retrieve all videos from a large YouTube channel (for example, Linus Tech Tips):
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")

current_link = ''
for link in soup.find_all('a'):
    current_link = link.get('href')
    print(current_link)
Now I have three problems here:
1) How do I get only hyperlinks containing a phrase like "watch?v="?
2) Most hyperlinks aren't shown: in the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
3) All hyperlinks appear twice. How can I select each hyperlink only once?
Any suggestions?
How do I get only hyperlinks containing a phrase like "watch?v="
Add a single if statement before your print statement:
if 'watch?v=' in current_link:
    print(current_link)
All hyperlinks appear twice. How can I select each hyperlink only once?
Store each hyperlink in a dictionary as a key and set the value to any arbitrary number (dictionaries only allow a single entry per key, so you won't be able to add duplicates).
Something like this:
myLinks = {}  # declare a dictionary to hold your data
if 'watch?v=' in current_link:
    print(current_link)
    myLinks[current_link] = 1
You can iterate over the keys (links) in the dictionary like this:
for link in myLinks:
    print(link)
This will print all the links in your dictionary
Most hyperlinks aren't shown: in the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
I'm unsure how to directly get around the scripting on the page you have directed us to, but you could always crawl the links you get from the initial scrape, rip new links off the side panels, and traverse them in turn; this should give you most, if not all, of the links you want (a minimal sketch follows the next snippet).
To do so you would want another dictionary to store the links you have already traversed, and to check whether you have visited them before. You can check for a key in a dictionary like so:
if key in myDict:
    print('myDict has this key already!')
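A minimal sketch of that crawling idea (not from the original answer; the max_pages cap is an illustrative assumption, and YouTube's script-rendered pages will still hide many links from a plain HTTP fetch):
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, phrase, max_pages=10):
    visited = set()  # pages already fetched, so we never traverse one twice
    found = set()    # matching links, deduplicated
    queue = [start_url]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])
            if phrase in link:
                found.add(link)       # collect links containing the phrase
            elif link not in visited:
                queue.append(link)    # traverse other links later
    return sorted(found)

For example: crawl('https://www.youtube.com/user/LinusTechTips/videos', 'watch?v=').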
I would use the requests library (for Python 3):
import requests

SearchString = "SampleURL.com"
response = requests.get(SearchString, stream=True)
zeta = str(response.content)
with open("File.txt", "w") as l:
    l.write(zeta)

# And now open up the file with the information written to it
jello = []
with open("File.txt", "r") as x:
    for line in x:
        jello.append(line)

# "salePrice" is used as a unique marker in the page source; in Chrome, press F12
# and inspect until the item is highlighted to find such a marker for your page.
t = jello[0].split('"salePrice":', 1)[1].split(",", 1)[0]
# This returns only a single result; loop over File.txt to find all of them.
I hope this helps; I'll keep an eye on this thread if you need more help.
Part One and Three:
Create a list and append links to the list:
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")

links = []  # see here
for link in soup.find_all('a'):
    links.append(link.get('href'))  # and here
Then convert the list to a set and back to a list to remove duplicates (note this does not preserve the original order):
links = list(set(links))
Now return the items of interest:
clean_links = [i for i in links if i and 'watch?v=' in i]
(The added "i and" guard skips None entries from anchor tags that have no href attribute.)
Part Two:
In order to navigate through the site you may need more than just Beautiful Soup. Scrapy has a great API that lets you pull down a page and explore how you want to parse parent and child elements with XPath. I highly encourage you to try Scrapy and use its interactive shell to tweak your extraction method.

Scrape any string with Python + Beautiful Soup that contains 5 numbers

I'm living in Germany, where ZIP codes are in most cases a 5-digit number, e.g. 53525. I would really like to extract that information from a website using Beautiful Soup.
I am new to Python/Beautiful Soup and I am not sure how to translate "find every five digits in a row, followed by a space" into Python.
import requests
import re
from bs4 import BeautifulSoup

source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
soup.find_all(NOTSUREHERE)
In the simplest scenario:
NOTSUREHERE should be replaced by name='tag_name', where tag_name is a tag in which you are certain to find ZIP codes (and no other numerical field that could be mistaken for a ZIP code).
Then, each element of that object should be passed to re.findall(regex, string), where regex = '([0-9]{5})' (from what I understand of the intended pattern) and string is the element from which you're extracting ZIP codes.
import requests
import re
from bs4 import BeautifulSoup

source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
tag_list = soup.find_all(name='tag_name')

match_list = []
for tag in tag_list:
    match_list.append(re.findall('([0-9]{5})', str(tag)))
You should watch out for possible matches that aren't ZIP codes. It could be a case of refining the soup.find_all() call by adding more arguments. The documentation gives you even more options, but the attrs argument can be set to {'target_attribute': 'target_att_value'}, an attribute and value that definitely mark a tag containing a ZIP code.
EDIT: Regarding possible empty elements, this link has a very straightforward solution: Removing empty elements from an array in Python
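As an illustration of both points, continuing from the snippet above (the 'div' tag and 'addr' class here are hypothetical placeholders, not from the original answer):
# Hypothetical: only search tags whose class marks an address block
tag_list = soup.find_all('div', attrs={'class': 'addr'})
match_list = [re.findall('([0-9]{5})', str(tag)) for tag in tag_list]
match_list = [m for m in match_list if m]  # drop empty (no-match) entries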

How to get a specific word from html page using beautiful soup in python

I have to extract specific words from an HTML page and count the number of times each word is repeated. How do I do this using Beautiful Soup in Python? How do I pass the URL into the soup and then count the words?
This is my code till now. I have no idea what to do next.
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))
You could get all the text in the page using
soup.get_text()
After assigning that to a variable, you can then use the .count() method to find the number of times a certain string appears in the page, e.g.
text = soup.get_text()
print(text.count('word'))
To make sure you aren't matching words inside other words, you could split the text on whitespace and count exact matches in the resulting list. For example, 'house' matching inside 'houses' would be fixed by this; see the sketch below.
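A minimal sketch of that idea (the lowercasing and punctuation stripping are extra robustness assumptions, not part of the original answer):
text = soup.get_text()
# Split on whitespace and count exact, case-insensitive matches only,
# so 'house' no longer matches inside 'houses'
words = [w.strip('.,!?;:"()').lower() for w in text.split()]
print(words.count('word'))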

Stripping HTML Tags from Forum using Python/bs4

I am a (very) new Python user and decided some of my first work would be to grab some lyrics from a forum and sort them by word frequency. I obviously haven't gotten to the frequency part yet, but the following code does not work for obtaining the string values I want, resulting in "AttributeError: 'ResultSet' object has no attribute 'getText'":
from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.thefewgoodmen.com/thefgmforum/threads/gdr-marching-songs-section-b.14998'
wp = urllib.request.urlopen(url)
soup = BeautifulSoup(wp.read(), 'html.parser')
message = soup.findAll("div", {"class": "messageContent"})
words = message.getText()  # this is the line that raises the AttributeError
print(words)
If I alter the code to have getText() operate on the soup object:
words = soup.getText()
I, of course, get all of the string values throughout the webpage, rather than those limited to only the class messageContent.
My question, therefore, is two-fold:
1) Is there a simple way to limit the tag-stripping to only the intended sections?
2) What simple thing am I not understanding, such that I cannot have getText() operate on the message object?
Thanks.
message in this case is a BeautifulSoup ResultSet, which is a list of BeautifulSoup Tag objects. What you need to do is call getText on each element of message, like so:
words = [item.getText() for item in message]
Similarly, if you are just interested in a single Tag (say the first one, for the sake of argument), you could get its content with:
words = message[0].getText()
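Since the original goal was a word-frequency sort, a minimal continuation (an addition, not part of the original answer) could feed those strings into collections.Counter:
from collections import Counter

# Flatten all message texts into words and tally them
freq = Counter(word.lower()
               for item in message
               for word in item.getText().split())
print(freq.most_common(10))  # the ten most frequent words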
