python how to parse html [duplicate]

This question already has answers here:
Parsing HTML page using beautifulsoup
(2 answers)
Closed 9 years ago.
I want to parse an HTML div in Python with BeautifulSoup.
My code:
import urllib2
from bs4 import BeautifulSoup

url = "http://german-bash.org/action/random"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for div in soup.findAll('div', 'zitat'):
    output = div.find('span', 'quote_zeile').string
    print(output)
I want to get all the spans in the div "div.zitat", but it doesn't work.

You are only fetching the first such <span> in each div. Use .findAll() again to find all of them:
html = response.read()
soup = BeautifulSoup(html, from_encoding=response.info().getparam('charset'))
for div in soup.findAll('div', 'zitat'):
    for span in div.findAll('span', 'quote_zeile'):
        print(span.string)
This produces:
>>> for div in soup.findAll('div', 'zitat'):
...     for span in div.findAll('span', 'quote_zeile'):
...         print(span.string)
...
<bigg-el> war grad schuhe kaufen
<bigg-el> konnte mich nicht entscheiden...
<bigg-el> dann hab ich halt beide genommen :D
<Sputnik> links UND rechts??
<Busstard> weisste was scheisse is?
<Parteibruder> Nöa!
<Busstard> wenn man seinen eigenen vadda um halb zehn im puff trifft....>(
<Parteibruder> LOOOOOOL
<Psyman2> Saw? gib mal ne Zusammenfassung
<Alf777> Gewalt, Blut, Folter, psychische Schäden und abgetrennte Körperteile
<Psyman2> Aha, normale Hauptschule also
<Alf777> rofl
<complex> ich verhüte nicht. frauen hüten sich vor mir -.-
<MeX> Da steht im neuen Plus-Prospekt doch tatsächlich:
<MeX> Achtung: Silvesterknaller nur aus 2 Meter Entfernung entzünden!
<MeX> Wie stellen die sich das vor? Mit nem Flammenwerfer?
Note that I also tell BeautifulSoup what encoding to use, taken from the response Content-Type header; this is important for this page because without it BeautifulSoup guesses wrong.
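The getparam() call in this answer is Python 2 API. As a minimal sketch of the same idea on Python 3: the headers object of a urllib.request response is an email.message.Message subclass, so get_content_charset() can read the charset from the Content-Type header. The header value below is a made-up stand-in, not what german-bash.org actually sends, and the helper name charset_from_headers is hypothetical.

```python
from email.message import Message


def charset_from_headers(headers, default=None):
    """Return the charset named in the Content-Type header, or `default`."""
    return headers.get_content_charset(failobj=default)


# Stand-in for `urllib.request.urlopen(url).headers`; the value is illustrative.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
charset = charset_from_headers(headers)
print(charset)

# With a real response, the result would be passed on like this:
#   soup = BeautifulSoup(response.read(), "html.parser", from_encoding=charset)
```

If the server sends no charset at all, the helper returns the default, and BeautifulSoup falls back to its own detection when from_encoding is None.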

Related

Using python BeautifulSoup to retrieve values from a website using jinja placeholders

I am trying to access translation fields from a dictionary-like website using Python and BeautifulSoup as follows:
import requests
from bs4 import BeautifulSoup

word = "death"
url = "https://tatoeba.org/eng/sentences/search?query={}&from=eng&to=deu".format(word)
vstr = requests.get(url).content
soup = BeautifulSoup(vstr, features="html.parser", from_encoding='utf8')
rows = soup.findAll('div', {"class": "sentence-and-translations"})
for row in rows:
    if row:
        a = row.find('div', {'class': 'translation'})
        a = a.find('div', {'layout': 'row'})
        print(a)
However, the pieces that I would like to access only show up as their jinja tag representation, like this:
<div class="text" dir="{{translation.dir}}" flex="" lang="{{translation.lang_tag}}">
Is there a good way to retrieve the values without using a virtual browser such as Selenium?

The data is embedded in the page as JavaScript. With a little tweaking, you can extract it with the re and json modules:
import html
import json
import re

import requests
from bs4 import BeautifulSoup

kw = 'death'
lang = 'deu'
url = 'https://tatoeba.org/eng/sentences/search?query={kw}&from=eng&to={lang}'
soup = BeautifulSoup(requests.get(url.format(kw=kw, lang=lang)).content, 'html.parser')
for div in soup.select('[sentence-and-translations][ng-init]'):
    # pull the argument list out of the vm.init(...) call
    i = re.search(r'vm\.init\((.*)\)', div['ng-init']).group(1)
    # wrap the arguments in [] and double-quote the language code so the result is valid JSON
    s = '[' + html.unescape(i).replace(f"'{lang}'", f'"{lang}"') + ']'
    data = json.loads(s)
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    print(data[1]['text'])
    for trans in data[2] + data[3]:
        print(trans['text'])
    print('-' * 80)
Prints:
I escaped death.
Ich bin dem Tod entkommen.
Ich bin dem Tode von der Schippe gesprungen.
Ich bin dem Tod entronnen.
--------------------------------------------------------------------------------
Death before dishonor!
Lieber tot als ehrlos.
--------------------------------------------------------------------------------
Think about death.
Gedenke des Todes.
--------------------------------------------------------------------------------
Death is imminent.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Victory, or death!
Sieg oder Tod!
--------------------------------------------------------------------------------
I am Death.
Ich bin der Tod.
--------------------------------------------------------------------------------
Death is inevitable.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Death is certain.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Death take me.
Der Tod soll mich holen.
--------------------------------------------------------------------------------
Death is scary.
Der Tod macht Angst.
--------------------------------------------------------------------------------
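The string handling in that loop can be tried without a network request. The sketch below runs the same three steps (regex capture, entity unescaping, JSON decoding) on a made-up ng-init value. The five-argument shape of the sample is an assumption inferred from the indices the answer uses (data[1], data[2], data[3]); the real attribute on tatoeba.org may look different, and parse_ng_init is a hypothetical helper name.

```python
import html
import json
import re

# Made-up attribute value; only its overall shape follows the answer above.
ng_init = (
    "vm.init([], {&quot;text&quot;: &quot;I escaped death.&quot;}, "
    "[{&quot;text&quot;: &quot;Ich bin dem Tod entkommen.&quot;}], "
    "[{&quot;text&quot;: &quot;Ich bin dem Tod entronnen.&quot;}], 'deu')"
)


def parse_ng_init(value, lang):
    """Turn the argument list of vm.init(...) into a Python list."""
    # Capture everything between the outer parentheses.
    args = re.search(r"vm\.init\((.*)\)", value, re.DOTALL).group(1)
    # Decode HTML entities, then double-quote the bare language code,
    # because JSON does not accept single-quoted strings.
    args = html.unescape(args).replace(f"'{lang}'", f'"{lang}"')
    return json.loads("[" + args + "]")


data = parse_ng_init(ng_init, "deu")
print(data[1]["text"])
for trans in data[2] + data[3]:
    print(trans["text"])
```

The replace() step only repairs the one single-quoted argument; if other single-quoted strings appeared in the call, json.loads would still reject the input.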

Python regex: re.search() does not find string

I have trouble using the re.search() method. I am trying to extract an image link from the following string:
div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine GrÃ¼nde mehr NICHT auf 1.1.3 zu springen!</p>\n<p><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></p>\n<p>Photo: factoryjoe
I want to extract the URL of the first image, and the URL only.
This is my code:
imageURLObject = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', match)
The result should be https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
Instead, the method returns None.
But if I use the regex re.search(r'http.*?\.(jpg|png|JPG|PNG)', match), without the (?!.*http) lookahead, the match starts at the first http and runs until .(jpg|png|JPG|PNG), so this is returned:
http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
Can someone help me please ? :-)

Use Beautiful Soup for HTML parsing:
https://beautiful-soup-4.readthedocs.io/en/latest/
from bs4 import BeautifulSoup
html = """
<div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine GrÃ¼nde mehr NICHT auf 1.1.3 zu springen!</p>\n<p><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></p>\n<p>Photo: factoryjoe
"""
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'beitragstext'})
for i in links:
    print(i.find('img')['src'])
Output:
https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
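If a regular expression is still preferred for a one-off string, one option is to forbid quotes, whitespace and angle brackets inside the match, so that it cannot run from one attribute into the next. This is a swapped-in pattern, not the asker's lookahead: http(?!.*http) can only match at an "http" that has no further "http" after it on the same line, which is a likely reason for the None result. The sample string is a shortened, hypothetical version of the markup from the question, and first_image_url is an illustrative helper name.

```python
import re

# A URL here is "http(s)://" followed by characters that cannot end an
# attribute value, up to the first .jpg/.png extension (case-insensitive).
IMAGE_URL = re.compile(r"""https?://[^\s"'<>]+?\.(?:jpg|png)""", re.IGNORECASE)


def first_image_url(markup):
    """Return the first image URL found in `markup`, or None."""
    match = IMAGE_URL.search(markup)
    return match.group(0) if match else None


sample = (
    '<a href="http://www.flickr.com/photos/factoryjoe/372948722/">'
    '<img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" '
    'alt="372948722_6ec4028a80.jpg" /></a>'
)
print(first_image_url(sample))
```

The pattern skips the flickr link because the closing quote stops the match before any image extension is reached. For anything beyond a quick extraction, the HTML parser approach in the answer is the more robust choice.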

How to parse data using BeautifulSoup4?

The following is a sample from the .xml file:
<title>Kaufsignal für Marriott International</title>
<link>https://insideparadeplatz.ch/2015/03/06/kaufsignal-fuer-marriott-international/</link>
<pubDate>Fri, 06 Mar 2015 </pubDate>
<content:encoded>
<![CDATA[
<p class="p1">
<span class="s1">Mit Marken wie Bulgari, Ritz-Carlton, Marriott und weiteren ist Marriott International nach sämtlichen Kriterien, die vom <a href="http://www.obermatt.com/de/home.html">
<span class="s2">Obermatt-System</span></a></span> bewertet werden, ein interessantes Investment. Der Titel ist relativ gesehen günstig, das Unternehmen sollte weiter überproportional wachsen, und es ist solide finanziert, mit einem guten Verhältnis von Eigenkapital und Schulden. Über alle Kategorien gesehen landet die
<span class="s3">Marriott-Aktie</span></a>, die derzeit an der Technologiebörse Nasdaq bei rund 84 Dollar gehandelt wird, in der Wochenauswertung im Total-Ranking auf dem ersten Platz.
<img class="aligncenter wp-image-17092 size-full" src="https://insideparadeplatz.ch/wp-content/uploads/2015/03/Total-Ranking-6-Mar-2015.png" alt="Total-Ranking 6 Mar 2015" width="873" height="627" /></a></p>]]>
</content:encoded>
What I'm trying to do: using beautifulsoup4, I am able to extract 'title', 'link' and 'pubDate'. The problem is 'content:encoded'. I want to extract the 'img' from 'content:encoded' for my 'img_list'. I've tried many solutions, but all I get is None.
title = []
link = []
pub_date = []
img_list = []
for item in soup.find_all('item'):
    for t in item.find_all('title'):
        title.append(t.text)
for item in soup.find_all('item'):
    for l in item.find_all('link'):
        link.append(l.text)
for item in soup.find_all('item'):
    for date in item.find_all('pubDate'):
        pub_date.append(date.text)
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        data.text
I tried:
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        for img in data.find_all('img'):
            img_list.append(img.text)
but got nothing. What am I missing here?

I think you're going to have trouble getting that img data out.
for item in soup.find("content:encoded"):
    print(item)
    print(type(item))
Then see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
So bs4 thinks it is a string, and you will need to parse it manually or perhaps feed that string into a new bs4 object.
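A minimal sketch of the "feed the string into a new bs4 object" idea follows. To keep the example independent of how a particular HTML parser treats CDATA, the outer feed is read with the standard library's xml.etree.ElementTree instead of bs4, which is a substitution and not what the answer itself uses. The feed is a trimmed, hypothetical version of the sample in the question. The xmlns:content URI is the usual RSS content-module namespace and is assumed, since the question does not show it. The name image_sources is made up for illustration.

```python
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# Trimmed, hypothetical feed; the namespace declaration is an assumption.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Kaufsignal für Marriott International</title>
<content:encoded><![CDATA[<p class="p1">Text
<img class="aligncenter" src="https://insideparadeplatz.ch/wp-content/uploads/2015/03/Total-Ranking-6-Mar-2015.png" alt="Total-Ranking 6 Mar 2015" /></p>]]></content:encoded>
</item>
</channel>
</rss>"""

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def image_sources(feed_xml):
    """Collect the src of every <img> inside each item's content:encoded."""
    img_list = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        encoded = item.find("content:encoded", NS)
        if encoded is None or not encoded.text:
            continue
        # The CDATA payload arrives as plain text; parse it as HTML again.
        inner = BeautifulSoup(encoded.text, "html.parser")
        img_list.extend(img["src"] for img in inner.find_all("img", src=True))
    return img_list


print(image_sources(FEED))
```

Note that the loop in the question appends img.text, which is empty for an img element; the URL lives in the src attribute.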

Scrape to get only english text from translated google patent page

<section itemprop="claims" itemscope>
<h2>Claims (<span itemprop="count">20</span>)</h2>
<aside>Translated from <span itemprop="translatedLanguage">German</span></aside>
<div itemprop="content" html><div mxw-id="PCLM47023824" lang="DE" load-source="patent-office" class="claims"><div class="claim"><div num="1" class="claim"><div class="claim-text"> <span class="notranslate" onmouseover="_tipon(this)" onmouseout="_tipoff()"><span class="google-src-text" style="direction: ltr; text-align: left">1. Einrichtung zum Installieren von Versorgungsleitungen und/oder Datenleitungen für mehrere Arbeitsplätze, insbesonde re miteinander und/oder mit einer zentralen Einrichtung ver bundenen Computer-Arbeitsplätze oder dergleichen in einem Räum, <b>dadurch gekennzeichnet</b> , dass ein aus vorbereiteten Ele menten gerüstartig aufbaubares System vorgesehen ist, das un terhalb einer Decke ( <b>12</b> ) des Raumes und oberhalb einer norma len Greifhöhe anbringbare Kanäle ( <b>18</b> , <b>18</b> ') zur Aufnahme von Versorgungsleitungen und/oder Datenleitungen enthält, wobei an die Kanäle ( <b>18</b> , <b>18</b> ') nach unten gerichtete, Arbeitsplätzen zu geordnete Säulen ( <b>21</b> , <b>27</b> ) anschließbar sind, die mit in Greif höhe anzuordnenden Versorgungsanschlüssen ( <b>23</b> ) versehen sind.</span> 1. 
means for installing supply lines and / or data lines for several jobs, insbesonde re each other and / or with a central facility ver affiliated computer workstations or the like in a cavities, <b>characterized in</b> that a Ele from prepared elements is like a framework to-assemble system provided which 'contains for receiving supply lines and / or data lines, wherein the channels <b>(18,</b> <b>18</b> un terhalb a ceiling <b>(12)</b> of the room and above a norma len picking height attachable channels <b>(18,</b> <b>18)')</b> downward, jobs to subordinate columns <b>(21,</b> <b>27)</b> are connected, which are provided with height in Griffin to be arranged supply terminals <b>(23).</b></span> </div></div></div><div class="claim-dependent"><div num="2" class="claim"> <div class="claim-text"> <span class="notranslate" onmouseover="_tipon(this)" onmouseout="_tipoff()"><span class="google-src-text" style="direction: ltr; text-align: left">2. Einrichtung nach Anspruch 1, dadurch gekennzeichnet, dass für die Kanäle ( <b>18</b> ) Hängehalter ( <b>19</b> ) zum Aufhängen an der De cke ( <b>12</b> ) des Raumes vorgesehen sind.</span> 2. Device according to claim 1, characterized in that the channels <b>(18)</b> hanging holder <b>(19)</b> for hanging on the De blocks <b>(12)</b> of the space are provided.</span> </div></div></div></div>
</section>
For more reference, you can check the source code of this page.
The code above is just a sample from that page.
I tried BeautifulSoup(page).findAll('Section')[5].getText(), but it returns both the German and the English text.
Can anyone tell me a way to extract only the English words, either with BeautifulSoup functions or with a regular expression?

BeautifulSoup(page).findAll('Section')[5] only gets the content of the sixth section tag. You need to parse it further. Try something like:
soup = BeautifulSoup(page, 'html.parser')
section = soup.findAll('section')[5]
content = section.find('span', {'class': 'notranslate'})  # finds the span that holds the text
eng_text = content.contents[1:]  # drops the first child, the span with the German source text
print(eng_text[0].strip())
Output:
1. means for installing supply lines and / or data lines for several jobs, insbesonde re each other and / or with a central facility ver affiliated computer workstations or the like in a cavities,
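The output above stops at the first <b> tag because only the first text node is printed. As a hedged alternative sketch, the German source spans (class google-src-text in the sample) can be removed from the tree first, after which get_text() returns each English claim in full. The markup below is claim 2 from the sample in the question with the event-handler attributes left out; whether the live page still uses these class names is an assumption.

```python
from bs4 import BeautifulSoup

# Claim 2 from the sample above, without the onmouseover/onmouseout attributes.
SAMPLE = (
    '<div class="claim-text"> <span class="notranslate">'
    '<span class="google-src-text">2. Einrichtung nach Anspruch 1, dadurch '
    'gekennzeichnet, dass für die Kanäle ( <b>18</b> ) Hängehalter ( <b>19</b> ) '
    'zum Aufhängen an der De cke ( <b>12</b> ) des Raumes vorgesehen sind.</span> '
    '2. Device according to claim 1, characterized in that the channels <b>(18)</b> '
    'hanging holder <b>(19)</b> for hanging on the De blocks <b>(12)</b> of the '
    'space are provided.</span> </div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# Drop every German source span, including its nested <b> tags.
for source_span in soup.find_all("span", class_="google-src-text"):
    source_span.decompose()

# What remains inside each "notranslate" span is the English translation.
claims = [
    " ".join(span.get_text().split())  # collapse runs of whitespace
    for span in soup.find_all("span", class_="notranslate")
]
for claim in claims:
    print(claim)
```

Because decompose() changes the tree in place, parse the page again if the German text is needed later.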

Extract the text from `p` within `div` with BeautifulSoup

I am very new to web-scraping with Python, and I am really having a hard time with extracting nested text from within HTML (p within div, to be exact). Here is what I got so far:
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
This works fine:
links = soup.findAll('a', {'title': 'zur Antwort'})
for link in links:
    print(link['href'])
This extraction works fine:
table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print(x)
This is the output:
<div class="content-question">
<p>[...] Die Verhandlungen über die mögliche Visabefreiung für
türkische Staatsbürger per Ende Ju...
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur
Antwort">mehr »</a>
</p>
</div>
Now, I want to extract the text between <p> and </p>. This is the code I use:
table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print(x['p'])
However, Python raises a KeyError.

The following code finds and prints the text of the first p element in each div with the class "content-question":
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print(x.find('p').text)
# Another way to retrieve the divs:
# table = soup.select('div[class="content-question"]')
The following is the printed text of the first p element in table:
[...] Die Verhandlungen über die mögliche Visabefreiung für türkische Staatsbürger per Ende Juni sind noch nicht abgeschlossen, sodass nicht mit Sicherheit gesagt werden kann, ob es zu diesem Zeitpunkt bereits zu einer Visabefreiung kommt. Auch die genauen Modalitäten einer solchen Visaliberalisierung sind noch nicht ausverhandelt. Prinzipiell ist es jedoch so, dass Visaerleichterungen bzw. -liberalisierungen eine Frage von Reziprozität sind, d.h. dass diese für beide Staaten gelten müssten. [...]
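The KeyError in the question comes from how bs4 tags handle square brackets: tag['name'] looks up an HTML attribute of that tag, while child elements are reached with find() or attribute-style access such as tag.p. The sketch below shows both on a shortened copy of the div from the question; the markup is trimmed for illustration and is not the live page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the div printed in the question.
SAMPLE = """
<div class="content-question">
<p>[...] Die Verhandlungen über die mögliche Visabefreiung
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur Antwort">mehr »</a>
</p>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
div = soup.find("div", class_="content-question")

# Square brackets read attributes of the tag itself.
print(div["class"])

# "p" is a child element, not an attribute of the div, so this lookup fails.
try:
    div["p"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# find() returns the first matching child, or None if there is none.
p = div.find("p")
text = " ".join(p.get_text().split()) if p is not None else ""
print(text)
```

Checking for None before calling get_text() avoids an AttributeError on divs that happen to contain no p element.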
