I'm trying to scrape a website, and I need to split the HTML into pieces. The problem is that the HTML is not really well organized, so I can't just use findAll.
Here is my code to parse the HTML:
resultats = requests.get(URL)
bs = BeautifulSoup(resultats.text, 'html.parser')
What I want to do is divide bs at each <h2> I find:
The solution might be really simple, but I can't find it...
Edit: the website, here.
This scrapes the whole text, without any HTML in it:
import re
import urllib.request
from bs4 import BeautifulSoup

url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urllib.request.urlopen(url)
html = resultats.read()
soup = BeautifulSoup(html, 'html5lib')
soup = soup.get_text()  # extracts the text from the HTML
print(soup)
If you want to leave certain information out, you could add this:
soup = re.sub(re.compile('yourRegex', re.DOTALL), '', soup).strip()
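If you do want to cut the tree at each <h2>, one way is to walk each heading's siblings until the next <h2>. A minimal sketch (the inline HTML is just a stand-in for your parsed page, and it assumes the headings and their content sit at the same level of the tree):

from bs4 import BeautifulSoup

html = "<h2>A</h2><p>one</p><h2>B</h2><p>two</p><p>three</p>"
bs = BeautifulSoup(html, 'html.parser')

sections = []
for h2 in bs.find_all('h2'):
    # gather everything between this <h2> and the next one
    body = []
    for sib in h2.next_siblings:
        if sib.name == 'h2':
            break
        body.append(sib)
    sections.append((h2.get_text(), body))

for title, body in sections:
    print(title, '->', ''.join(str(b) for b in body))

On pages where each section is wrapped in its own container, you would iterate over those containers instead.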
Beautiful Soup returns parts of the HTML as escape sequences like "\u041f\u0440\u0438\u043b\u043e\u0436\".
How could this be solved?
The problem is with the page https://life.com.by/.
import requests
from bs4 import BeautifulSoup
result = requests.get('https://life.com.by/company/news')
soup = BeautifulSoup(result.content, 'html.parser')
print(soup)
It returns:
var engTranslations = {"test":"test","or":"or","main_page":"Main Page","consultant":"Consultant","android_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0Android","apple_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0iOS","tv_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0Android\u00a0TV","notebook_description":"\u0421\u043c\u043e\u0442\u0440\u0435\u0442\u044c \u0432\u00a0\u0431\u0440\u0430\u0443\u0437\u0435\u0440\u0435","new_sim":"New SIM-card","add_traffic":"Add traffic","change_tariff":"\u0421\u043c\u0435\u043d\u0438\u0442\u044c \u0442\u0430\u0440\u0438\u0444\u043d\u044b\u0439\u00a0\u043f\u043b\u0430\u043d","tariff_changing":"\u0421\u043c\u0435\u043d\u0430 \u0442\u0430\u0440\u0438\u0444\u0430","for_a_day":"For a day","for_a_week":"For a week","for_a_month":"For a month","connect":"Activate","call_back_me":"\u041f\u0435\u0440\u0435\
If I just print part of it, it returns a human-readable string:
print('\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435\u043f\u043e\u043c\u043e\u0436\u0435\u043c')
Приложение поможем
I don't exactly understand what you want... but from what I understood, try this:
soup.encode('utf-8').decode('unicode-escape')
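Those \uXXXX sequences are escape codes inside a JavaScript string literal in the page source, so another option is to pull that literal out and parse it as JSON, which decodes the escapes for you. A rough sketch (it assumes the engTranslations object is valid JSON on a single line and ends with "};"):

import json
import re
import requests

result = requests.get('https://life.com.by/company/news')
# grab the object literal assigned to engTranslations in the page source
match = re.search(r'var engTranslations = (\{.*?\});', result.text)
if match:
    translations = json.loads(match.group(1))
    print(translations['android_description'])  # prints readable Cyrillic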
I am trying to scrape this page: https://ntrs.nasa.gov/search.
I am using the code below, and Beautiful Soup is finding only 3 tags when there are many more. I have tried using the html5lib, lxml, and html.parser parsers, but none of them have worked.
Can you advise what the problem might be, please?
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL
url = 'https://ntrs.nasa.gov/search'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.content, "html5lib")
# soup = BeautifulSoup(response.text, "html5lib")
# soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "lxml")

# loop through all a-tags
for a_tag in soup.findAll('a'):
    if a_tag.has_attr('title'):
        if a_tag['title'] == 'Download Document':
            link = a_tag['href']
            download_url = 'https://ntrs.nasa.gov' + link
            urllib.request.urlretrieve(download_url, './' + link[link.find('/citations/')+1:11])
It is dynamically pulled from a script tag. You can regex out the JavaScript object which contains the download URL, handle some string replacements for HTML entities, parse it as JSON, then extract the desired URL:
import requests, re, json
r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;','"'))
print('https://ntrs.nasa.gov' + data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results'][0]['downloads'][0]['links']['pdf'])
You could append the ?attachment=true but I don't think that is required.
Your problem stems from the fact that the page is rendered using JavaScript, and the actual page source is only a few script and style tags.
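If regexing the page source ever gets too brittle, the usual alternative for JavaScript-rendered pages is to let a real browser render them first, for example with Selenium (a sketch, assuming Chrome and a matching chromedriver are installed):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://ntrs.nasa.gov/search')
time.sleep(5)  # crude wait for the client-side app to finish rendering

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(len(soup.find_all('a')))  # should now include the rendered result links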
I am testing a program that prints the content of Wikipedia to the prompt. I already got some output, but it's a bit messy, so I thought I'd get only the content of the <p> and <b> tags, which are the two Wikipedia uses to show the content. Here is my code:
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for x in soup.find_all('p').find_all('b'):
    print(x.string)
The question mark is there because Wikipedia shows the language in that spot, so it depends. As you can see, I chained one more .find_all because I didn't know how else to add it. Sorry for my bad English and my bad code; I am not very familiar with this field. Thanks.
BeautifulSoup.find_all returns a ResultSet, which is essentially a list of elements. You need to iterate through that list and perform the second find_all on each element yourself.
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for elem in soup.find_all('p'):
    for x in elem.find_all('b'):
        print(x.string)
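Alternatively, a CSS selector expresses the same nesting in a single call:

for x in soup.select('p b'):
    print(x.string)

select('p b') matches every <b> inside a <p>, which is the same set of elements the nested loops visit.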
I just started programming.
I have the task of extracting data from an HTML page into Excel, using Python 3.7.
My problem is that I have a website with more URLs inside it, and behind those URLs there are more URLs again. I need the data behind the third URL.
My first problem is how I can tell the program to choose only specific links from a <ul>, rather than every <ul> on the page.
from bs4 import BeautifulSoup
import urllib.request
import requests
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use find_all and be specific about the tags, like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry, I can't make comments because I have <50 reputation, or I would have.
Updated answer, based on my understanding:
from bs4 import BeautifulSoup
import urllib.request
import requests

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';'))[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
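Since the same fetch-and-parse step happens three times here, a small helper keeps the nesting readable (a sketch; get_soup is a hypothetical name, not part of the original code):

import urllib.request
from bs4 import BeautifulSoup

def get_soup(url):
    # fetch a page and return its parsed tree
    return BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")

soup = get_soup(bausteinelinks)  # replaces each urlopen/BeautifulSoup pair above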
HTML noob here, so I could be misunderstanding something about the HTML document; bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container of the first ATL-WAS game on the page to see it for yourself). But when I run the code above, my code doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: Clarification: the problem isn't retrieving the text within the tag. The problem is that when I take the HTML code from the website and print it out in Python, something I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract any text; this code lets you extract the data between any tags.
Output: FINAL
import urllib
from bs4 import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)

indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands:
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
This is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> is empty, by running:
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
With some minor modifications, your code runs for me. I:
- changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup
- corrected the way you were constructing the BeautifulSoup object
- fixed your find statement, then used the .text command to get the string representation of the text in the HTML you're after
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
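As a small aside, if the concatenated "LiveFinal " output from the div lookup above bothers you, get_text with a separator keeps the two statuses apart:

status = soup.find("div", {"class": "nbaModTopStatus"})
print status.get_text(" ", strip=True) ## "Live Final"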