Python regex: re.search() does not find string - python

I am having trouble with the re.search() method. I am trying to extract an image link from the following string:
div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!</p>\n<p><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></p>\n<p>Photo: factoryjoe
I want to extract the URL of the first image, and the URL only.
This is my code:
imageURLObject = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', match)
The result should be https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
Instead, the method returns None.
But if I use the regex re.search(r'http.*?\.(jpg|png|JPG|PNG)', match), without the (?!.*http) lookahead, the match starts at the first http and runs until .(jpg|png|JPG|PNG), so this is returned:
http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
Can someone help me please ? :-)

Use Beautiful Soup for HTML parsing:
https://beautiful-soup-4.readthedocs.io/en/latest/
from bs4 import BeautifulSoup
html = """
<div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!</p>\n<p><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></p>\n<p>Photo: factoryjoe
"""
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'beitragstext'})
for i in links:
    print(i.find('img')['src'])
>>> https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg
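If you would rather stay with re: the negative lookahead (?!.*http) rejects every "http" that is followed by another "http" later on the same line. Only the last "http" on a line can start a match, and if no image extension follows that one, re.search() returns None. Below is a minimal sketch of an alternative, assuming the URL contains no whitespace, quotes, or angle brackets. The sample string is made up for illustration (the flickr links around the image are an assumption, not the question's exact input):

```python
import re

# Hypothetical sample: one flickr link before the image and another after it.
html = (
    '<p><a href="http://www.flickr.com/photos/factoryjoe/372948722/">'
    '<img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" '
    'alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a></p>'
    '<p>Photo: <a href="http://www.flickr.com/photos/factoryjoe/">factoryjoe</a></p>'
)

# Pattern from the question: only the last "http" passes the lookahead,
# and nothing ending in .jpg/.png follows it, so there is no match.
original = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', html)

# Alternative: forbid characters that cannot occur inside the URL, so a match
# can never run across a closing quote into the next attribute or tag.
image_url = re.compile(r'https?://[^\s"\'<>]+?\.(?:jpg|png)', re.IGNORECASE)
fixed = image_url.search(html)

print(original)
print(fixed.group(0))
```

The character class does the work here: the flickr URL ends at its closing quote before any image extension appears, so the search moves on to the image URL.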

Related

Selenium does not load <li> inside <ul> inside <div>

I am new to Selenium, Python, and programming in general but I am trying to write a small web scraper. I have encountered a website that has multiple links but their HTML code is not available for me using
soup = bs4.BeautifulSoup(html, "lxml")
The HTML-Code is:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">[event]
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
<li class="vertical-page-list--item is-detailed infite-nodes--list-item" style="display: list-item;">
<li class="...>
...
</ul>
</div>
</div>
</div>
But soup only contains this part, missing the li classes:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
</ul>
</div>
</div>
</div>
It has something to do with the [event] after the div, but I can't figure out what to do. My guess was that it is some lazy-loaded code, but using
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
or directly moving to the element
actions = ActionChains(driver)
actions.move_to_element(driver.find_element_by_xpath("//div['infinite-nodes=']")).perform()
did not yield any results. This is the code I am using:
# Enable headless Firefox for Selenium
options = Options()
#options.headless = True
options.add_argument("--headless")
options.page_load_strategy = 'normal'
driver = webdriver.Firefox(options=options, executable_path=r'C:\bin\geckodriver.exe')
print ("Headless Firefox Initialized")
# Load html source code from webpage
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get("https://www.volkswagen-newsroom.com/de/pressemitteilungen?container_context=lg%2C1.0")
SCROLL_PAUSE_TIME = 2
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait to load page
time.sleep(SCROLL_PAUSE_TIME)
print("Scrolled down to bottom")
# Extract html code
driver.find_element_by_xpath("//div['infinite-nodes=']").click() #just testing
time.sleep(SCROLL_PAUSE_TIME)
html = driver.page_source.encode('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
Could anyone help me please?
When you visit the page in a browser and log the network traffic, you'll see that every time the page loads (or you press the "Mehr Pressemitteilungen anzeigen" button, i.e. "show more press releases"), an XHR (XMLHttpRequest) is sent to what looks like an API. The response is JSON that also contains HTML, and it's this HTML that holds the list-item elements you're looking for. You don't need Selenium for this:
def get_article_titles():
    import requests
    from bs4 import BeautifulSoup as Soup
    url = "https://www.volkswagen-newsroom.com/de/pressemitteilungen"
    params = {
        "container_context": "lg,1.0",
        "next": "1"
    }
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }
    while True:
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        data = response.json()
        params["next"] = data["next"]
        soup = Soup(data["html"], "html.parser")
        for tag in soup.select("h3.page-preview--title > a"):
            yield tag.get_text().strip()
def main():
    from itertools import islice
    for num, title in enumerate(islice(get_article_titles(), 10), start=1):
        print("{}.) {}".format(num, title))
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
1.) Volkswagen Konzern, BASF, Daimler AG und Fairphone starten Partnerschaft für nachhaltigen Lithiumabbau in Chile
2.) Verkehrsausschuss-Vorsitzender Cem Özdemir informiert sich über Transformation im Elektro-Werk in Zwickau
3.) Astypalea: Start der Transformation zur smarten, nachhaltigen Insel
4.) Vor 60 Jahren: Fußball-Legende Pelé zu Besuch im Volkswagen Werk Wolfsburg
5.) Novum unter den Kompakten: Neuer Polo ist mit „IQ.DRIVE Travel Assist“ teilautomatisiert unterwegs
6.) Der neue Tiguan Allspace – ab sofort bestellbar
7.) Volkswagen startet Vertriebsoffensive im deutschen Markt
8.) Vor 70 Jahren: Volkswagen erhält ersten Beirat
9.) „Experience our Volkswagen Way to Zero“ – neue Ausstellung im DRIVE. Volkswagen Group Forum für Gäste geöffnet
10.) Jetzt bestellbar: Der neue ID.4 GTX
>>>
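To see the shape of the idea without hitting the network, here is a rough offline sketch. The canned payloads, the titles in them, and the assumption that "next" becomes null on the last page are all made up for illustration. The only things taken from the code above are the "next" and "html" keys and the h3.page-preview--title > a structure. It uses the standard library's html.parser instead of bs4 so it runs anywhere:

```python
import json
from html.parser import HTMLParser

# Hypothetical canned responses keyed by cursor (not real data from the site).
PAGES = {
    "1": json.dumps({
        "next": "2",
        "html": '<li><h3 class="page-preview--title"><a href="/a"> First title </a></h3></li>',
    }),
    "2": json.dumps({
        "next": None,  # assumption: the last page signals the end with null
        "html": '<li><h3 class="page-preview--title"><a href="/b">Second title</a></h3></li>',
    }),
}


class TitleCollector(HTMLParser):
    """Collects the text of <a> tags inside <h3 class="page-preview--title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_heading = False
        self._buffer = None  # list of text chunks while inside the <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "page-preview--title" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None
        elif tag == "h3":
            self._in_heading = False


collector = TitleCollector()
cursor = "1"
while cursor is not None:
    data = json.loads(PAGES[cursor])   # decode the JSON envelope
    collector.feed(data["html"])       # parse the HTML fragment it carries
    cursor = data["next"]              # follow the cursor to the next page

print(collector.titles)
```

Unlike the while True loop above, this version stops once the cursor runs out. Whether the real endpoint ever returns an empty "next" is something you would have to check in the network tab.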

How to parse data using BeautifulSoup4?

The following is a sample from the .xml file:
<title>Kaufsignal für Marriott International</title>
<link>https://insideparadeplatz.ch/2015/03/06/kaufsignal-fuer-marriott-international/</link>
<pubDate>Fri, 06 Mar 2015 </pubDate>
<content:encoded>
<![CDATA[
<p class="p1">
<span class="s1">Mit Marken wie Bulgari, Ritz-Carlton, Marriott und weiteren ist Marriott International nach sämtlichen Kriterien, die vom <a href="http://www.obermatt.com/de/home.html">
<span class="s2">Obermatt-System</span></a></span> bewertet werden, ein interessantes Investment. Der Titel ist relativ gesehen günstig, das Unternehmen sollte weiter überproportional wachsen, und es ist solide finanziert, mit einem guten Verhältnis von Eigenkapital und Schulden. Über alle Kategorien gesehen landet die
<span class="s3">Marriott-Aktie</span></a>, die derzeit an der Technologiebörse Nasdaq bei rund 84 Dollar gehandelt wird, in der Wochenauswertung im Total-Ranking auf dem ersten Platz.
<img class="aligncenter wp-image-17092 size-full" src="https://insideparadeplatz.ch/wp-content/uploads/2015/03/Total-Ranking-6-Mar-2015.png" alt="Total-Ranking 6 Mar 2015" width="873" height="627" /></a></p>]]>
</content:encoded>
Using beautifulsoup4, I'm able to extract 'title', 'link', and 'pubDate'. The problem is 'content:encoded': I want to extract the 'img' from it for my 'img_list', but everything I've tried returns None.
title = []
link = []
date = []
img_list = []
for item in soup.find_all('item'):
    for t in item.find_all('title'):
        title.append(t.text)
for item in soup.find_all('item'):
    for l in item.find_all('link'):
        link.append(l.text)
for item in soup.find_all('item'):
    for d in item.find_all('pubDate'):
        date.append(d.text)
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        data.text
I tried:
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        for img in data.find_all('img'):
            img_list.append(img.text)
but got nothing. What am I missing here?
I think you're going to have trouble getting that img data out. Try this:
for item in soup.find("content:encoded"):
    print(item)
    print(type(item))
Then see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
So bs4 treats the CDATA content as a plain string, not as tags. You will need to parse it manually, or feed that string into a new bs4 object.
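A minimal sketch of that two-pass idea, written with the standard library only (xml.etree.ElementTree for the feed, html.parser for the embedded HTML) so it runs without bs4; this is a swapped-in technique, not the bs4 route itself. The feed below is a shortened, made-up wrapper around the question's sample, and the xmlns:content declaration is assumed to be present in the real file (ElementTree needs it to resolve the content: prefix):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the real feed; only the structure matters here.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Kaufsignal für Marriott International</title>
<content:encoded><![CDATA[<p class="p1">Some text
<img class="aligncenter wp-image-17092 size-full" src="https://insideparadeplatz.ch/wp-content/uploads/2015/03/Total-Ranking-6-Mar-2015.png" alt="Total-Ranking 6 Mar 2015" width="873" height="627" /></p>]]></content:encoded>
</item>
</channel>
</rss>"""

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


class ImgSrcCollector(HTMLParser):
    """Second pass: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


img_list = []
root = ET.fromstring(FEED)
for item in root.iter("item"):
    # First pass: the CDATA section arrives as the element's plain text.
    encoded = item.find("content:encoded", NS)
    if encoded is None or not encoded.text:
        continue
    collector = ImgSrcCollector()
    collector.feed(encoded.text)
    img_list.extend(collector.sources)

print(img_list)
```

Note that the image URL lives in the src attribute, so img.text (as in the question's attempt) would be empty even once the img tag is found.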

Unable to fetch full data inside <div>

HTML:
<div>
Está en: <b>
Inicio /
Valle Del Cauca /
Cali /
Zona Sur /
Zona Sur /
<a>Los Naranjos Conjunto Campestre</a></b>
</div>
I am unable to fetch all the <a> tags inside the <div> tag.
My code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
soup = BeautifulSoup(page.content, 'html.parser')
first = soup.find('div' , 'breadcrumb left')
link = first.find('div')
a_link = link.findAll('a')
print (a_link)
The above code prints only the first <a> tag:
[Inicio]
The following output is required from the above HTML:
Valle Del Cauca
Cali
Zona Sur
Zona Sur
I'm not sure why nothing after the first '/' inside the <b> tag is printed.
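You can use the lxml parser instead. html.parser normalizes the source before BS4 builds the tree, and the two parsers can repair markup differently, so the result may not match the page you see: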
soup = BeautifulSoup(page.content, 'lxml')

Extract the text from `p` within `div` with BeautifulSoup

I am very new to web-scraping with Python, and I am really having a hard time with extracting nested text from within HTML (p within div, to be exact). Here is what I got so far:
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
This works fine:
links=soup.findAll('a',{'title':'zur Antwort'})
for link in links:
    print(link['href'])
This extraction works fine:
table = soup.findAll('div',attrs={"class":"content-question"})
for x in table:
    print(x)
This is the output:
<div class="content-question">
<p>[...] Die Verhandlungen über die mögliche Visabefreiung für
türkische Staatsbürger per Ende Ju...
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur
Antwort">mehr »</a>
</p>
</div>
Now, I want to extract the text within p and /p. This is the code I use:
table = soup.findAll('div',attrs={"class":"content-question"})
for x in table:
    print(x['p'])
However, Python raises a KeyError.
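x['p'] looks up an HTML attribute named p on the div, not a child tag, which is why you get a KeyError. The following code finds and prints the text of the first p element in each div with the class "content-question":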
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
table = soup.findAll('div',attrs={"class":"content-question"})
for x in table:
    print(x.find('p').text)
# Another way to retrieve tables:
# table = soup.select('div[class="content-question"]')
The following is the printed text of the first p element in table:
[...] Die Verhandlungen über die mögliche Visabefreiung für türkische Staatsbürger per Ende Juni sind noch nicht abgeschlossen, sodass nicht mit Sicherheit gesagt werden kann, ob es zu diesem Zeitpunkt bereits zu einer Visabefreiung kommt. Auch die genauen Modalitäten einer solchen Visaliberalisierung sind noch nicht ausverhandelt. Prinzipiell ist es jedoch so, dass Visaerleichterungen bzw. -liberalisierungen eine Frage von Reziprozität sind, d.h. dass diese für beide Staaten gelten müssten. [...]

python how to parse html [duplicate]

This question already has answers here:
Parsing HTML page using beautifulsoup
(2 answers)
Closed 9 years ago.
I want to parse an HTML div in Python with BeautifulSoup.
my code:
url = "http://german-bash.org/action/random"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for div in soup.findAll('div','zitat'):
    output = div.find('span', 'quote_zeile').string
    print(output)
I want to get all the spans in the div "div.zitat", but it doesn't work.
You are only fetching the first such <span>. Use .findAll() again to find all of them:
html = response.read()
soup = BeautifulSoup(html, from_encoding=response.info().getparam('charset'))
for div in soup.findAll('div','zitat'):
    for span in div.findAll('span', 'quote_zeile'):
        print span.string
This produces:
>>> for div in soup.findAll('div','zitat'):
...     for span in div.findAll('span', 'quote_zeile'):
...         print span.string
...
<bigg-el> war grad schuhe kaufen
<bigg-el> konnte mich nicht entscheiden...
<bigg-el> dann hab ich halt beide genommen :D
<Sputnik> links UND rechts??
<Busstard> weisste was scheisse is?
<Parteibruder> Nöa!
<Busstard> wenn man seinen eigenen vadda um halb zehn im puff trifft....>(
<Parteibruder> LOOOOOOL
<Psyman2> Saw? gib mal ne Zusammenfassung
<Alf777> Gewalt, Blut, Folter, psychische Schäden und abgetrennte Körperteile
<Psyman2> Aha, normale Hauptschule also
<Alf777> rofl
<complex> ich verhüte nicht. frauen hüten sich vor mir -.-
<MeX> Da steht im neuen Plus-Prospekt doch tatsächlich:
<MeX> Achtung: Silvesterknaller nur aus 2 Meter Entfernung entzünden!
<MeX> Wie stellen die sich das vor? Mit nem Flammenwerfer?
Note that I also tell BeautifulSoup what encoding to use, taken from the response Content-Type header; this is important for this page because without it BeautifulSoup guesses wrong.
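The getparam() call above is Python 2. A minimal sketch of reading the same charset on Python 3, assuming urllib.request is used: its response headers are an email.message.Message subclass, which provides get_content_charset(). To keep the example offline, a hand-built Message stands in for response.headers, and the header value and bytes are made up:

```python
from email.message import Message

# Stand-in for response.headers as returned by urllib.request.urlopen(url).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# Returns the charset parameter in lower case, or the fallback if it is missing.
charset = headers.get_content_charset(failobj="utf-8")

# Bytes as they might arrive from the server, decoded with the declared charset.
raw = "<Parteibruder> Nöa!".encode("iso-8859-1")
text = raw.decode(charset)

print(charset)
print(text)
```

With bs4 you would pass the value on as from_encoding=charset, exactly as in the answer above, rather than decoding by hand.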
