How to parse data using BeautifulSoup4? - python

Following is a sample from the .xml file:
<title>Kaufsignal für Marriott International</title>
<link>https://insideparadeplatz.ch/2015/03/06/kaufsignal-fuer-marriott-international/</link>
<pubDate>Fri, 06 Mar 2015 </pubDate>
<content:encoded>
<![CDATA[
<p class="p1">
<span class="s1">Mit Marken wie Bulgari, Ritz-Carlton, Marriott und weiteren ist Marriott International nach sämtlichen Kriterien, die vom <a href="http://www.obermatt.com/de/home.html">
<span class="s2">Obermatt-System</span></a></span> bewertet werden, ein interessantes Investment. Der Titel ist relativ gesehen günstig, das Unternehmen sollte weiter überproportional wachsen, und es ist solide finanziert, mit einem guten Verhältnis von Eigenkapital und Schulden. Über alle Kategorien gesehen landet die
<span class="s3">Marriott-Aktie</span></a>, die derzeit an der Technologiebörse Nasdaq bei rund 84 Dollar gehandelt wird, in der Wochenauswertung im Total-Ranking auf dem ersten Platz.
<img class="aligncenter wp-image-17092 size-full" src="https://insideparadeplatz.ch/wp-content/uploads/2015/03/Total-Ranking-6-Mar-2015.png" alt="Total-Ranking 6 Mar 2015" width="873" height="627" /></a></p>]]>
</content:encoded>
What I'm trying to do: using beautifulsoup4 I'm able to extract 'title', 'link' and 'pubDate', but the problem is 'content:encoded'. I want to extract the 'img' from 'content:encoded' for my 'img_list'. I've tried many solutions, but all I get is None.
title = []
link = []
date = []
img_list = []

for item in soup.find_all('item'):
    for t in item.find_all('title'):
        title.append(t.text)

for item in soup.find_all('item'):
    for l in item.find_all('link'):
        link.append(l.text)

for item in soup.find_all('item'):
    for d in item.find_all('pubDate'):
        date.append(d.text)

for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        data.text
I tried:
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        for img in data.find_all('img'):
            img_list.append(img.text)
but got nothing. What am I missing here?

I think you're going to have trouble getting that img data out.
for item in soup.find("content:encoded"):
    print(item)
    print(type(item))
Then see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
So bs4 treats the CDATA content as a plain string, and you will need to parse it manually or re-feed that string into a new bs4 object.
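A minimal sketch of that re-feed approach, assuming soup was built from the feed exactly as in the question; the second BeautifulSoup pass and the switch to img.get('src') (img.text is empty for an img tag, which is why img_list stayed empty) are my additions:
from bs4 import BeautifulSoup

img_list = []
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        # data.text is the raw CDATA payload; parse it as HTML in a second pass
        inner = BeautifulSoup(data.text, 'html.parser')
        for img in inner.find_all('img'):
            # the image address lives in the src attribute, not in the tag text
            img_list.append(img.get('src'))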

Related

BeautifulSoup: Get all elements that contain a certain attribute

I have a file that I want to make soup with; this is only an excerpt:
<p facs="#facs_14_TextRegion_1624023825755_566">
<lb facs="#facs_14_line_1624023773342_555" n="N001"/>Und wiewol eyn solcher Mensch behelt eyn <w>Frei<pc>-</pc>
<lb/>heyt</w> seines willens, doch schwach und verletzet, daraus
<lb facs="#facs_14_r1l25" n="N003"/>als eynem brunnen herfliessen alle ehrliche tugent und <w>tha<pc>-</pc>
<lb/>ten</w> der heiden sampt iren wirckungen, So kan er doch vor
<lb facs="#facs_14_r1l27" n="N005"/>der gnaden und ehe dann er wider zu recht gebracht ist,
<lb facs="#facs_14_r1l28" n="N006"/>die gerechtigkeyt, die vor Gott gilt, nit erreychen, sonder
<lb facs="#facs_14_r1l29" n="N007"/>ist vil mehr eyn knecht der Sünden, des Teüfels eygen
<lb facs="#facs_14_r1l30" n="N008"/>und eyn feind Gottes und allen straffen diser welt <w>under<pc>-</pc>
<lb/>worffen</w>, dann er wirdt beschwärt mit hunger, kummer,
<lb facs="#facs_14_r1l32" n="N010"/>durst, frost, hitz, schmertzen, kranckheyt und wirdt entlich
<lb facs="#facs_14_r1l33" n="N011"/>durch den tod zerstöret. Dann <q>die sünde ist durch eynen
<lb facs="#facs_14_line_1623337148499_448" n="N012"/><note place="margin-right" facs="#facs_14_TextRegion_1624023908878_596">
<lb facs="#facs_14_line_1623337734932_504" n="N001"/>Roma. 5.</note>menschen inn die welt kommen unnd durch die sünde der
<lb facs="#facs_14_r1l35" n="N013"/>tod</q><note type="annotation"><ref type="biblical" cRef="Rm_5,12">Röm 5,12</ref>.</note>. Es seind aber die straffen diser ersten übertrettung
<lb facs="#facs_14_r1l36" n="N014"/>denen, die von neuem durch die genad geboren werden,
<lb facs="#facs_14_r1l37" n="N015"/>mit den sündern gemeyn, wiewol sie den neugebornen auch
<lb facs="#facs_14_r1l38" n="N016"/>zur übung auffgelegt werden, Aber den ungerechten und
<lb facs="#facs_14_r1l39" n="N017"/>Gottlosen werden sie zur straff zugeschickt.
</p>
Now, what I want to do is get all ref elements and their content, regardless of what the cRef's text actually is. (It's a standardised set of shorthands denoting bible passages.) I am not sure whether I can use cRefs = soup.findAll(attrs={"type":"biblical"}), as there might be elements that are type="biblical" but don't have a cRef. All elements that have a cRef should be type="biblical", though, and I am not the one tagging the document.
Any hints on how to do that?
Edit: my code so far:
import bs4 as bs

with open('interim.xml', 'r') as f:
    file = f.read()

soup = bs.BeautifulSoup(file, 'lxml')
Refs = soup.find_all(attrs={'type':'biblical'})
print(Refs)
expected output:
[<ref cref="Gn_1,26-27" type="biblical">Gen 1,26f.</ref>, <ref cref="Sir_15,14" type="biblical">Sir 15,14</ref>, <ref cref="Rm_5,12" type="biblical">Röm 5,12</ref> ]
(only an excerpt) - it might happen that not all of the elements share type="biblical", though.
for an individual element in Refs I can use Refs[i].get('cref') for retrieving the value, which I also need at a later time.
Is there a way in BS to retrieve only elements that have a certain attribute?
Edit 2: I found out that I can do
Refs = soup.find_all(attrs={'cref':True})
All the best and thanks in advance,
K
After some more coffee, I searched again and found that
Refs = soup.find_all(attrs={'cref':True})
works flawlessly for what I want.
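Putting the two pieces together, a minimal sketch under the question's own assumptions (the interim.xml file and the lxml parser; note that lxml lowercases attribute names, which is why cRef is matched as cref):
import bs4 as bs

with open('interim.xml', 'r') as f:
    soup = bs.BeautifulSoup(f.read(), 'lxml')

# attrs={'cref': True} matches every element that carries a cref attribute,
# whatever its value, and regardless of whether type="biblical" is also set
for ref in soup.find_all(attrs={'cref': True}):
    print(ref.get('cref'), ref.text)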

Using python BeautifulSoup to retrieve values from a website using jinja placeholders

I am trying to access translation fields from a dictionary-like website using Python and BeautifulSoup, as follows:
import requests
from bs4 import BeautifulSoup

word = "death"
url = "https://tatoeba.org/eng/sentences/search?query={}&from=eng&to=deu".format(word)
vstr = requests.get(url).content
soup = BeautifulSoup(vstr, features="html.parser", from_encoding='utf8')

rows = soup.findAll('div', {"class": "sentence-and-translations"})
for row in rows:
    if row:
        a = row.find('div', {'class': 'translation'})
        a = a.find('div', {'layout': 'row'})
        print(a)
However, the pieces that I would like to access only show up as their jinja tag representations, like this:
<div class="text" dir="{{translation.dir}}" flex="" lang="{{translation.lang_tag}}">
Is there a good way to retrieve the values without making use of virtual browsers like selenium?
The data is embedded in the page in JavaScript form. With a little tweaking, you can extract the information with the re and json modules:
import html
import json
import re

import requests
from bs4 import BeautifulSoup

kw = 'death'
lang = 'deu'
url = 'https://tatoeba.org/eng/sentences/search?query={kw}&from=eng&to={lang}'

soup = BeautifulSoup(requests.get(url.format(kw=kw, lang=lang)).content, 'html.parser')

for div in soup.select('[sentence-and-translations][ng-init]'):
    # the ng-init attribute holds a JavaScript call vm.init(...);
    # grab its argument list with a regex
    i = re.search(r'vm\.init\((.*)\)', div['ng-init']).group(1)
    # the language code is single-quoted JavaScript; swap to double
    # quotes so the whole argument list parses as JSON
    s = '[' + html.unescape(i).replace(f"'{lang}'", f'"{lang}"') + ']'
    data = json.loads(s)

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    print(data[1]['text'])
    for trans in data[2] + data[3]:
        print(trans['text'])
    print('-' * 80)
Prints:
I escaped death.
Ich bin dem Tod entkommen.
Ich bin dem Tode von der Schippe gesprungen.
Ich bin dem Tod entronnen.
--------------------------------------------------------------------------------
Death before dishonor!
Lieber tot als ehrlos.
--------------------------------------------------------------------------------
Think about death.
Gedenke des Todes.
--------------------------------------------------------------------------------
Death is imminent.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Victory, or death!
Sieg oder Tod!
--------------------------------------------------------------------------------
I am Death.
Ich bin der Tod.
--------------------------------------------------------------------------------
Death is inevitable.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Death is certain.
Der Tod ist unvermeidlich.
Der Tod ist unausweichlich.
--------------------------------------------------------------------------------
Death take me.
Der Tod soll mich holen.
--------------------------------------------------------------------------------
Death is scary.
Der Tod macht Angst.
--------------------------------------------------------------------------------

Scrape to get only english text from translated google patent page

<section itemprop="claims" itemscope>
<h2>Claims (<span itemprop="count">20</span>)</h2>
<aside>Translated from <span itemprop="translatedLanguage">German</span></aside>
<div itemprop="content" html><div mxw-id="PCLM47023824" lang="DE" load-source="patent-office" class="claims"><div class="claim"><div num="1" class="claim"><div class="claim-text"> <span class="notranslate" onmouseover="_tipon(this)" onmouseout="_tipoff()"><span class="google-src-text" style="direction: ltr; text-align: left">1. Einrichtung zum Installieren von Versorgungsleitungen und/oder Datenleitungen für mehrere Arbeitsplätze, insbesonde re miteinander und/oder mit einer zentralen Einrichtung ver bundenen Computer-Arbeitsplätze oder dergleichen in einem Räum, <b>dadurch gekennzeichnet</b> , dass ein aus vorbereiteten Ele menten gerüstartig aufbaubares System vorgesehen ist, das un terhalb einer Decke ( <b>12</b> ) des Raumes und oberhalb einer norma len Greifhöhe anbringbare Kanäle ( <b>18</b> , <b>18</b> ') zur Aufnahme von Versorgungsleitungen und/oder Datenleitungen enthält, wobei an die Kanäle ( <b>18</b> , <b>18</b> ') nach unten gerichtete, Arbeitsplätzen zu geordnete Säulen ( <b>21</b> , <b>27</b> ) anschließbar sind, die mit in Greif höhe anzuordnenden Versorgungsanschlüssen ( <b>23</b> ) versehen sind.</span> 1. means for installing supply lines and / or data lines for several jobs, insbesonde re each other and / or with a central facility ver affiliated computer workstations or the like in a cavities, <b>characterized in</b> that a Ele from prepared elements is like a framework to-assemble system provided which 'contains for receiving supply lines and / or data lines, wherein the channels <b>(18,</b> <b>18</b> un terhalb a ceiling <b>(12)</b> of the room and above a norma len picking height attachable channels <b>(18,</b> <b>18)')</b> downward, jobs to subordinate columns <b>(21,</b> <b>27)</b> are connected, which are provided with height in Griffin to be arranged supply terminals <b>(23).</b></span> </div></div></div><div class="claim-dependent"><div num="2" class="claim"> <div class="claim-text"> <span class="notranslate" onmouseover="_tipon(this)" onmouseout="_tipoff()"><span class="google-src-text" style="direction: ltr; text-align: left">2. Einrichtung nach Anspruch 1, dadurch gekennzeichnet, dass für die Kanäle ( <b>18</b> ) Hängehalter ( <b>19</b> ) zum Aufhängen an der De cke ( <b>12</b> ) des Raumes vorgesehen sind.</span> 2. Device according to claim 1, characterized in that the channels <b>(18)</b> hanging holder <b>(19)</b> for hanging on the De blocks <b>(12)</b> of the space are provided.</span> </div></div></div></div>
</section>
For more reference, you can check the source code of this page. The above markup is just a sample from that page.
I tried BeautifulSoup(page).findAll('section')[5].getText(), but it is giving both German and English text.
Can anyone tell me a way to extract only the English text, either with BeautifulSoup functions or with a regular expression?
BeautifulSoup(page).findAll('section')[5] gets just the content of the 6th section tag. You need further parsing. Try something like:
soup = BeautifulSoup(page, 'html.parser')
section = soup.findAll('section')[5]
content = section.find('span', {'class': 'notranslate'})  # the span holding both texts
eng_text = content.contents[1:]  # drop the nested German span, keep the English nodes
print(eng_text[0].strip())
Output:
1. means for installing supply lines and / or data lines for several jobs, insbesonde re each other and / or with a central facility ver affiliated computer workstations or the like in a cavities,
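To cover every claim instead of only the first one, one could loop over all of the notranslate spans; a minimal sketch, assuming the structure shown above (the German original sits in the nested google-src-text span, and the English follows it inside the same parent):
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
section = soup.findAll('section')[5]

english_parts = []
for span in section.findAll('span', {'class': 'notranslate'}):
    src = span.find('span', {'class': 'google-src-text'})
    if src:
        src.extract()  # drop the German source text from the tree
    # what remains in the span is the English translation
    english_parts.append(span.get_text(' ', strip=True))

print('\n'.join(english_parts))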

Extract the text from `p` within `div` with BeautifulSoup

I am very new to web scraping with Python, and I am really having a hard time extracting nested text from within HTML (p within div, to be exact). Here is what I have so far:
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
This works fine:
links = soup.findAll('a', {'title': 'zur Antwort'})
for link in links:
    print(link['href'])
This extraction works fine:
table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print(x)
This is the output:
<div class="content-question">
<p>[...] Die Verhandlungen über die mögliche Visabefreiung für
türkische Staatsbürger per Ende Ju...
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur
Antwort">mehr »</a>
</p>
</div>
Now, I want to extract the text between <p> and </p>. This is the code I use:
table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print(x['p'])
However, Python raises a KeyError.
The following code finds and prints the text of each p element in the divs with the class "content-question". (Subscripting a tag, as in x['p'], looks up an attribute named p, which is why you got a KeyError; use .find('p') to reach a child element.)
from bs4 import BeautifulSoup
import urllib

url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')

table = soup.findAll('div', attrs={"class": "content-question"})
for x in table:
    print x.find('p').text

# Another way to retrieve the divs:
# table = soup.select('div[class="content-question"]')
The following is the printed text of the first p element in table:
[...] Die Verhandlungen über die mögliche Visabefreiung für türkische Staatsbürger per Ende Juni sind noch nicht abgeschlossen, sodass nicht mit Sicherheit gesagt werden kann, ob es zu diesem Zeitpunkt bereits zu einer Visabefreiung kommt. Auch die genauen Modalitäten einer solchen Visaliberalisierung sind noch nicht ausverhandelt. Prinzipiell ist es jedoch so, dass Visaerleichterungen bzw. -liberalisierungen eine Frage von Reziprozität sind, d.h. dass diese für beide Staaten gelten müssten. [...]

python how to parse html [duplicate]

This question already has answers here:
Parsing HTML page using beautifulsoup
(2 answers)
Closed 9 years ago.
I want to parse an html div in Python with BeautifulSoup.
My code:
import urllib2
from bs4 import BeautifulSoup

url = "http://german-bash.org/action/random"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

for div in soup.findAll('div', 'zitat'):
    output = div.find('span', 'quote_zeile').string
    print(output)
I want to get all the spans in the div "zitat", but it doesn't work.
You are only fetching the first such <span>. Use .findAll() again to find all such spans:
html = response.read()
soup = BeautifulSoup(html, from_encoding=response.info().getparam('charset'))
for div in soup.findAll('div', 'zitat'):
    for span in div.findAll('span', 'quote_zeile'):
        print span.string
This produces:
>>> for div in soup.findAll('div', 'zitat'):
...     for span in div.findAll('span', 'quote_zeile'):
...         print span.string
...
<bigg-el> war grad schuhe kaufen
<bigg-el> konnte mich nicht entscheiden...
<bigg-el> dann hab ich halt beide genommen :D
<Sputnik> links UND rechts??
<Busstard> weisste was scheisse is?
<Parteibruder> Nöa!
<Busstard> wenn man seinen eigenen vadda um halb zehn im puff trifft....>(
<Parteibruder> LOOOOOOL
<Psyman2> Saw? gib mal ne Zusammenfassung
<Alf777> Gewalt, Blut, Folter, psychische Schäden und abgetrennte Körperteile
<Psyman2> Aha, normale Hauptschule also
<Alf777> rofl
<complex> ich verhüte nicht. frauen hüten sich vor mir -.-
<MeX> Da steht im neuen Plus-Prospekt doch tatsächlich:
<MeX> Achtung: Silvesterknaller nur aus 2 Meter Entfernung entzünden!
<MeX> Wie stellen die sich das vor? Mit nem Flammenwerfer?
Note that I also tell BeautifulSoup what encoding to use, taken from the response Content-Type header; this is important for this page because without it BeautifulSoup guesses wrong.
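For reference, a sketch of the same charset-passing idea in Python 3, where urllib2 is replaced by urllib.request and the header is read with get_content_charset(); this port is my assumption, not part of the original answer:
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://german-bash.org/action/random')
# take the charset from the Content-Type header so BeautifulSoup
# does not have to guess the page encoding
charset = response.headers.get_content_charset()
soup = BeautifulSoup(response.read(), 'html.parser', from_encoding=charset)

for div in soup.findAll('div', 'zitat'):
    for span in div.findAll('span', 'quote_zeile'):
        print(span.string)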
