Python - Retrieve specific "object" from url with BeautifulSoup - python

I'm trying to parse a specific "item" on a site, but I don't know if it's a class, object, id, or something else.
My code:
soup = BeautifulSoup(urllib2.urlopen(myURL))
divdata = soup.find('div')
print(divdata)
And it returns:
<div data-store='{"Auth":{"cookie":null,"user":null,"timestamp":1485297666762},"Blocked":{},"Broadcast":
{"forceUpdate":false,"failed":[],"pending":[],"error":
{"isNotFound":false,"isServerError":false,"isUnavailable":false}},"BroadcastCache":{"broadcasts":{"ID1":{"broadcast":
{"data":{"class_name":"Broadcast","id":"ID1","state":"running,
....(more)....
So I want to retrieve "running", or whatever is in "state".
I tried
statedata = soup.find('div', {"class":"state"})
But it returns nothing. What is the correct way to retrieve it?

import json
div_tag = soup.find('div', {'data-store':True})
data_string = div_tag['data-store'] # get data string
json.loads(data_string)['BroadcastCache']['broadcasts']['ID1']['broadcast']['data']['state'] # convert data string to python dict and get state
out:
'running'

The correct syntax is soup.find_all('div', class_='state').
Note the trailing underscore in class_.
It's unlikely to work in your case without modification, though: the div's attribute is actually data-store, and "state" is just part of a string, not the content of a tag. You could just use string.find('"state"') on that one.
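A minimal sketch of that plain-string approach, using a shortened, hypothetical data-store value in place of the full payload (the json.loads route shown earlier is still the more robust option):

```python
# Trimmed stand-in for the div's data-store attribute value.
data_string = '{"BroadcastCache":{"broadcasts":{"ID1":{"broadcast":{"data":{"state":"running"}}}}}}'

idx = data_string.find('"state"')            # position of the key
open_q = data_string.find('"', idx + 7)      # opening quote of the value
close_q = data_string.find('"', open_q + 1)  # closing quote of the value
state = data_string[open_q + 1:close_q]
print(state)  # running
```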

Related

Webcrawler: extracting string out of array using Python3 on mac

I have a problem with writing a web crawler to extract currency rates:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

url = "https://wechselkurse-euro.de/"
r = requests.get(url)
rates = []
status = r.status_code
if status != 200:
    print("Something went wrong while parsing the website " + url)
temp = BeautifulSoup(r.text, "html.parser")
current_date = temp.select(".ecb")[0].text.strip().split(" ")[5]
#rates_array = temp.select(".kurz_kurz2.center", limit=20).string
rates_array = temp.select(".kurz_kurz2.center", limit=20)
#for i in rates_array:
#    rate = rates_array[i].string
#    rates.append(rate)
rates = list(map(lambda x: re.search(r">\d{1}\.\d{4}", x), rates_array))
print(rates)
#rate_1EUR_to_USD =
#rate_1EUR_to_GBP =
I tried several ways, which are commented out; none of them work and I don't know why. Especially .string not working is surprising to me, since rates_array seems to carry all the information of the bs4 object, including that there is a td tag <td class="kurz_kurz2 center" title="Aktueller Wechselkurs am 3.4.2020">0.5554</td> where I just want the string within the tag (the value 0.5554 in the example above). This should be easy, but nothing works. What am I doing wrong?
The regular expression should not be the problem; I tested it on RegExr.
I tried using the map function, as currently active in the code, but I can't convert the map object to a list as I am supposed to.
select().string returns an empty list, and the same happens when I use regular expressions to search through the strings saved in rates_array the old-school way, iterating over every item with a for loop.
String as attribute of bs4-object
Your rates_array contains Beautiful Soup tag objects, not strings. So you'll have to access their text property in order to get the values. For example:
rates = [o.text for o in rates_array]
Now rates contains:
['0.5554', '0.1758']
I would recommend checking the locator first.
Are you sure that rates_array is not empty?
Also, try:
rates_array[i].text
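A minimal, self-contained sketch of the .text approach, using stand-in HTML in place of the live page's table cells:

```python
from bs4 import BeautifulSoup

# Stand-in for the rate cells on the live page.
html = """
<td class="kurz_kurz2 center" title="Aktueller Wechselkurs am 3.4.2020">0.5554</td>
<td class="kurz_kurz2 center" title="Aktueller Wechselkurs am 3.4.2020">0.1758</td>
"""
soup = BeautifulSoup(html, "html.parser")
rates_array = soup.select(".kurz_kurz2.center", limit=20)

# These are Tag objects, not strings: take .text from each one.
rates = [o.text for o in rates_array]
print(rates)  # ['0.5554', '0.1758']
```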

String argument behaves differently within my script

I've tried to parse the text out of some html elements using the string argument, the way it is described here, but failed miserably. I've tried two different ways, but every time I encountered the same AttributeError.
How can I use string argument in this very case to fetch the text?
I've tried with:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
try:
    item = soup.find("caption",string="ASIC registration").text
    #item = soup.find("caption",string=re.compile("ASIC registration",re.I)).text
except AttributeError:
    item = ""
print(item)
Expected output (only using string argument):
ASIC registration
How can I use string argument in this very case to fetch the text?
You can't
Note:
I am assuming you mean making some change to the string parameter in
item = soup.find("caption",string="ASIC registration").text
As given in the documentation
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.find("caption")
print(item.string)
Output
None
Here the .string is None as caption has more than one child.
If you are trying to get the parent (caption tag in this case) with the text, you could do
item = soup.find(string=re.compile('ASIC registration')).parent
which will give
<caption><span class="toggle open"></span>ASIC registration</caption>
Of course, calling .text on this parent tag will give the full text within that tag, even if the matched string is not the full text within it.
item = soup.find(string=re.compile('ASIC')).parent.text
will give an output
ASIC registration
The issue you're running into is that the string argument searches for strings instead of for tags as it states in the documentation you linked.
The syntax you are using:
soup.find("caption",string="ASIC registration")
is for finding tags.
For finding strings:
soup.find(string=re.compile('ASIC'))
With the first one you are saying: find a caption tag whose .string matches your string. The caption tag here has no such .string, so nothing is returned.
The second one is saying find the string that contains 'ASIC', so it returns the string.
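A minimal sketch of the two behaviours side by side, reusing the HTML from the question (with the stdlib html.parser in place of lxml):

```python
import re
from bs4 import BeautifulSoup

htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement, "html.parser")

# Tag search: caption's .string is None (it has several children), so no match.
print(soup.find("caption", string=re.compile("ASIC")))  # None

# String search: matches the NavigableString itself; .parent climbs back to the tag.
found = soup.find(string=re.compile("ASIC")).parent
print(found.name)  # caption
```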
Turns out the string parameter doesn't work if a tag has a child tag. The following code is stupid, but it works:
real_item = ""
try:
    items = soup.find_all("caption")
    r = re.compile(u"ASIC registration", re.I)
    for item in items:
        for s in item.strings:
            if r.search(str(s)):
                real_item = item
                break
except AttributeError:
    real_item = ""
print(real_item)

Python Get Links Script - Needs Wildcard search

I have the below code: when you put in a URL with a bunch of links, it will return the list to you. This works well, except that I only want links that start with ..., and this returns EVERY link, including ones like home/back/etc. Is there a way to use a wildcard or a "starts with" function?
from bs4 import BeautifulSoup
import requests
url = ""
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))
Also, is there a way to export to Excel? I am not great with Python, and to be honest I am not sure how I got this far.
Thanks,
Here is an updated version of your code that will get all https hrefs from the page:
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com"
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    if str.startswith(tag.get('href'), 'https'):
        print(tag.get('href'))
If you want to get hrefs that start with something other than https, change the 2nd to last line :)
References:
https://www.tutorialspoint.com/python/string_startswith.htm
You could use startswith() :
for tag in tags:
    if tag.get('href').startswith('pre'):
        print(tag.get('href'))
For your second question: Is there a way to export to Excel - I've been using a python module XlsxWriter.
import xlsxwriter
# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()
# Some data we want to write to the worksheet.
expenses = (
['Rent', 1000],
['Gas', 100],
['Food', 300],
['Gym', 50],
)
# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')
workbook.close()
XlsxWriter allows the code to follow basic Excel conventions; being new to Python, I still found it easy to get this up, running, and working on the first attempt.
If tags.get returns a string, you should be able to filter on whatever start string you want like so:
URLs = [URL for URL in [tag.get('href') for tag in tags]
if URL.startswith('/some/path/')]
Edit:
It turns out that in your case, tags.get doesn't always return a string. For tags that don't contain links, the return type is NoneType, and we can't use string methods on NoneType. It's easy to check if the return value of tags.get is None before using the string method startswith on it.
URLs = [URL for URL in [tag.get('href') for tag in tags]
if URL is not None and URL.startswith('/some/path/')]
Notice the addition of URL is not None and. That check has to come before URL.startswith, otherwise Python will try to use a string method on None and complain. You can read this just like an English sentence, which highlights one of the great things about Python: the code is easier to read than just about any other programming language, which makes it really good for communicating ideas to other people.

Extract URL using beautifulsoup 4

I'm trying to extract a URL using BS4. I can get to the correct location, but I'm not sure how to remove the <img> tag from around the URL. I tried adding .text, however this just returned nothing.
vid_screenshot = (soup('a', {'class':'mp4Thumb'}))[0].contents[0]
>> <img src="www.fgfg.com/dsfasdf.jpg"/>
desired result
>> www.fgfg.com/dsfasdf.jpg
This did not work; it returned nothing:
(soup('a', {'class':'mp4Thumb'}))[0].contents[0].text
Would anyone know how to strip these tags..?
You have the HTML tag, you need to take the src attribute:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot['src']
This assumes there is always going to be a src attribute on the tag. You can also use the .get() method to return None if the attribute is not present:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src')
or you can give .get() a second argument to return if the attribute is missing:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src', 'default value')
See the Attributes documentation section.

Python Regular Expression string exclusion

Using BeautifulSoup to parse source code for scraping:
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
currentTempSite = BeautifulSoup(theTempSite)
lightwaveEmail = currentTempSite('input')[7]
#<input type="Hidden" name="bb_recipient" value="comm2342#gmail.com" />
How can I re.compile lightwaveEmail so that only comm2342#gmail.com is printed?
You're kinda going about it the wrong way: you're using numbered indexes to find the tag you want, but BeautifulSoup can find tags for you based on their name or attributes, which makes it a lot simpler.
You want something like
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
soup = BeautifulSoup(theTempSite)
tag = soup.find("input", { "name" : "bb_recipient" })
print tag['value']
If the question is how to get the value attribute from the tag object, then you can use it as a dictionary:
lightwaveEmail['value']
You can find more information about this in the BeautifulSoup documentation.
If the question is how to find in the soup all input tags with such a value, then you can look for them as follows:
soup.findAll('input', value=re.compile(r'comm2342#gmail.com'))
You can find a similar example also in the BeautifulSoup documentation.
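A self-contained sketch of that attribute lookup, using stand-in HTML (and the modern find_all spelling in place of the older findAll):

```python
from bs4 import BeautifulSoup

# Stand-in for the form input from the question.
html = '<input type="Hidden" name="bb_recipient" value="comm2342#gmail.com" />'
soup = BeautifulSoup(html, "html.parser")

# Find the input by its name attribute, then read value like a dict key.
tag = soup.find("input", {"name": "bb_recipient"})
print(tag["value"])  # comm2342#gmail.com
```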