Web crawling extracts same sentence twice


When I run the code below, the output shows each sentence twice.
How can I solve this problem?
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find_all('p')
for item in data:
    print(item.string)

Changing your parser to lxml will fix this.
If you inspect the website, you will see empty p tags around the p tag of each review, and they confuse html.parser.
The snippet below changes the parser where the soup is created; then, when looping over the data list, you can skip the empty p tags by checking whether their string attribute has content.
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, "lxml")
data = soup.find_all('p')
for item in data:
    if item.string:
        print(item.string)
This prints each review only once.
If you get an error about not having a builder, run
pip install lxml
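
One way the duplication can arise, sketched offline: html.parser keeps a p tag nested inside another p tag, and .string on a tag with a single child tag returns that child's string. The markup below is invented for illustration and is not copied from the site; filtering out p tags that contain another p is an alternative to switching parsers, not the method from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a review paragraph wrapped in another <p> tag.
html = "<div><p><p>Great course.</p></p></div>"
soup = BeautifulSoup(html, "html.parser")

# html.parser keeps the nesting, and .string on a tag with exactly one
# child tag returns that child's string, so both <p> tags report the text.
all_strings = [p.string for p in soup.find_all("p")]

# Keep only <p> tags that contain no nested <p> and have a non-empty string.
unique_strings = [
    p.string
    for p in soup.find_all("p")
    if p.find("p") is None and p.string
]

print(all_strings)
print(unique_strings)
```

If the real page nests its tags differently, the filter condition would need adjusting.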

Related

Webscraping output []

Hey, I just wanted to try web scraping with Python, and I have no idea why this doesn't work.
As output I get [] and nothing else.
Does anyone have an idea? If I go to the website and search for the element, I can find it.
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://osu.ppy.sh/users/20488254").text
soup = BeautifulSoup(html_text, "lxml")
job = soup.find("div", class_ = "profile-detail__col profile-detail__col--bottom-right")
print(job)

Player info is loaded dynamically with JavaScript, so you can't scrape it from the rendered markup using plain bs4. Luckily, the site provides the user info in JSON format inside a script tag. If you open the page source and look for json-user, you will see this tag:
<script id="json-user" type="application/json">
{"avatar_url":"https:\/\/a.ppy.sh\/20488254?1622470835.jpeg","country_code":"AT","default_group":"default","id":20488254,...
</script>
You can grab the JSON inside that tag and get any information about the player. Here is what it would look like:
import json
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://osu.ppy.sh/users/20488254").text
soup = BeautifulSoup(html_text, "lxml")
json_data = json.loads(soup.find('script', {'id':'json-user'}).string)
Now, say you are looking for the player's global rank. All you need to do is find the correct keys to navigate there:
player_rank = json_data['statistics']['global_rank']
# -> 199303
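
The same steps can be tried offline. This is a minimal sketch that parses a shortened, hand-written stand-in for the page source (the real JSON has many more keys) and guards against the script tag or a key being missing, which the snippet above does not do.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the real page source.
html = """
<html><body>
<script id="json-user" type="application/json">
{"id": 20488254, "country_code": "AT", "statistics": {"global_rank": 199303}}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", {"id": "json-user"})

# The tag can disappear if the site changes its layout, so check first.
if tag is None or tag.string is None:
    raise SystemExit("json-user script tag not found")

json_data = json.loads(tag.string)

# .get() returns None instead of raising KeyError when a key is absent.
player_rank = json_data.get("statistics", {}).get("global_rank")
print(player_rank)
```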

Python web scraping - Why do I get this error?

I want to get text from a website using bs4 but I keep getting this error and I don't know why. This is the error: TypeError: slice indices must be integers or None or have an __index__ method.
This is my code:
from urllib.request import urlopen
import bs4
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
text = html.find("div", {"class":"gc-score__title"})#the error is in this line
print(text)

On this line:
text = html.find("div", {"class":"gc-score__title"})
you are calling the str.find method, not the bs4.BeautifulSoup.find method.
So if you do
soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.find("div", {"class":"gc-score__title"})
print(text)
you will get rid of the error.
That said, the site is using JavaScript, so this will not yield what you expect. You will need to use tools like Selenium to scrape this site.
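
The error itself can be reproduced without any network access. In this small sketch the sample string is made up; it only shows why str.find rejects a dict as its second argument.

```python
# Made-up sample standing in for the decoded page.
html = '<div class="gc-score__title">Dallas vs Pittsburgh</div>'

# str.find(sub, start, end): the second argument must be an integer index,
# so passing a dict of attributes raises TypeError.
try:
    html.find("div", {"class": "gc-score__title"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Called correctly, str.find returns a character offset (or -1), never a tag.
offset = html.find("div")

print(error_message)
print(offset)
```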

First, if you want BeautifulSoup to parse the data, you need to ask it to do that.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.oddsshark.com/nfl/dallas-pittsburgh-odds-august-5-2021-1410371"
page = urlopen(url)
html_bytes = page.read()
soup = BeautifulSoup(html_bytes, "html.parser")
Then you can use soup.find to find <div> tags:
text = soup.find("div", {"class":"gc-score__title"})
That will eliminate the error. You were calling str.find because html is a string, and to pick tags out you need to call the find method of a bs4.BeautifulSoup object.
But besides eliminating the error, that line won't do what you want. It won't return anything, because the data at that url does not contain the tag <div class="gc-score__title">.
Copy the contents of html_bytes to a text editor to confirm this.
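
Instead of pasting into a text editor, the same check can be done in code. This is a rough sketch: contains_class is a hypothetical helper, and the two byte strings are invented stand-ins for what page.read() might return.

```python
def contains_class(html_bytes: bytes, class_name: str) -> bool:
    """Return True if the raw, unparsed response mentions the class name."""
    return class_name.encode("utf-8") in html_bytes


# Invented responses: one with the tag in the raw HTML, one where the
# content would only appear after JavaScript runs in a browser.
static_page = b'<div class="gc-score__title">Final</div>'
js_page = b'<div id="app"></div><script src="bundle.js"></script>'

print(contains_class(static_page, "gc-score__title"))
print(contains_class(js_page, "gc-score__title"))
```

If the check is False for the real response, no parser will find the tag, and a browser-driving tool is likely needed.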

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.

Try this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
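
To see what the div[id^='post-body-'] selector matches, here is a small offline sketch. The markup is a guess at the general shape of such a page, not a copy of it; only the id value comes from the first snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: a sidebar, one post body, and a comments block.
html = """
<div class="sidebar">Blog Archive</div>
<div class="post-body entry-content" id="post-body-604825342214355274">
First paragraph.<br/>Second paragraph.
</div>
<div id="comments">A comment</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id^='post-body-'] matches any div whose id starts with that prefix,
# so the sidebar and the comments block are skipped.
posts = [
    div.get_text(" ", strip=True)
    for div in soup.select("div[id^='post-body-']")
]
print(posts)
```

The prefix match avoids hard-coding the numeric id, which differs from post to post.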

I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with, so maybe you can build something from this approach instead.
The most obvious thing to do right now is something like this:
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in a previous answer
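
The URL-building part of this approach can be sketched as follows. build_archive_url and build_post_url are hypothetical helper names, and the zero-padded month is inferred from the example URLs in this thread.

```python
BASE = "http://www.fashionpulis.com"


def build_archive_url(year: int, month: int) -> str:
    """Monthly archive page, e.g. .../2017/03/ (month is zero-padded)."""
    return f"{BASE}/{year:04d}/{month:02d}/"


def build_post_url(year: int, month: int, title: str) -> str:
    """Single post page following the $YEAR/$MONTH/$TITLE.html pattern."""
    return f"{BASE}/{year:04d}/{month:02d}/{title}.html"


print(build_archive_url(2017, 3))
print(build_post_url(2017, 8, "being-proud-too-soon"))
```

The title slugs would still have to be scraped from the archive pages first.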

cut html in half using python beautifulsoup

I'm trying to scrape a website and I need to cut the HTML code in half. The problem is that the HTML code is not really well organized and I can't just use findAll.
Here is my code to parse the HTML code :
resultats = requests.get(URL)
bs = BeautifulSoup(resultats.text, 'html.parser')
What I want to do is split bs at each <h2> I find.
The solution might be really simple but I can't find it...
edit : the website, here

This scrapes the whole text, without the HTML markup:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urlopen(url)
html = resultats.read()
soup = BeautifulSoup(html, 'html5lib')
text = soup.get_text()  # Extracts the text from the HTML
print(text)
If you want to leave certain information out, you could add this:
text = re.sub('yourRegex', '', text, flags=re.DOTALL).strip()
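
For the splitting the question asks about, one possible approach is to walk the siblings that follow each h2 until the next h2. This is only a sketch, and it assumes the headings and their content are siblings under the same parent, which may not hold on the real page; the sample markup is invented.

```python
from bs4 import BeautifulSoup, Tag

# Invented markup: two <h2> sections whose content follows as siblings.
html = "<div><h2>Avril</h2><p>a1</p><p>a2</p><h2>Mars</h2><p>m1</p></div>"
soup = BeautifulSoup(html, "html.parser")

sections = {}
for h2 in soup.find_all("h2"):
    parts = []
    for sibling in h2.next_siblings:
        # Stop at the next heading: it starts a new section.
        if isinstance(sibling, Tag) and sibling.name == "h2":
            break
        # Skip bare strings between tags; keep the text of each element.
        if isinstance(sibling, Tag):
            parts.append(sibling.get_text(strip=True))
    sections[h2.get_text(strip=True)] = parts

print(sections)
```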

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the web page; instead, the nbaLiveStatTxSm element is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.

You can use this logic to extract any text.
This code allows you to extract any data between any tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break

It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
That is no surprise, since Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
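
The same conclusion can be checked by parsing the fetched fragment directly. This sketch uses Python 3 and the div exactly as printed in the question, and lists the text of each p tag by class.

```python
from bs4 import BeautifulSoup

# The div as it appears in the fetched page, copied from the question.
raw = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)

soup = BeautifulSoup(raw, "html.parser")
status = soup.find("div", {"class": "nbaModTopStatus"})

# Map each <p>'s first class name to its text; empty tags give "".
texts = {p["class"][0]: p.get_text() for p in status.find_all("p")}
print(texts)
```

The two "Sm" entries come back as empty strings because the fetched HTML has nothing inside those tags.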

Changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup.
Corrected the way you were constructing the BeautifulSoup object.
Fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after.
With some minor modifications to your code as listed above, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
