Using BeautifulSoup to extract items from a dictionary - python

How do I extract "loginError" and its value from the untis dict below?
<script type="text/javascript">
window.untisUIVersion = 2;
window.untisMomentLocale= "de";
window.untis__webpack_public_path__ = "https://content.webuntis.com/WebUntis/static/2022.14.2/js/untis/";
untis = {
config: {"mode":"STANDARD","locale":"de-at","contextPath":"/WebUntis","licence":{"name":"HTBLA Weiz","name2":"A-8160, Dr.Karl-Widdmannstr. 40"},"mandantName":"htbla-weiz","mandant":16270,"customerNumber":70284,"imageServiceConfig":{"customLogo":true},"loginServiceConfig":{"ssoType":"none","samlProviderLabel":"","idpName":"","loginError":"Invalid user name and/or password","lastUserName":"","lastMandantName":"htbla-weiz"},
};
</script>

There are multiple ways to approach this, but you will have to do the transformation yourself; how the transformation process works is entirely up to you.
Here is how I would do it.
First, get the string data from the web, or wherever your data comes from:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
results = soup.find('script')        # the first <script> tag (the one with the untis config here)
stringtext = results.get_text()      # its raw JavaScript source as a string
Then transform the text data into JSON format (because it is easy to work with). In my example, I split the string into an array on semicolons, strip leading and trailing whitespace, collapse runs of consecutive whitespace into a single space, and load the remainder as a JSON object:
import json
import re

array = stringtext.split(';')
untis = array[3].strip()                      # the 'untis = { config: ...' statement
untis = re.sub(r"\s\s+", " ", untis)          # collapse repeated whitespace
structjson = json.loads(untis[18:-6] + "}")   # slice out the inner config object (indices fitted to this exact script text)
With the JSON object you can now look for what you need.
In[] : print(structjson["loginServiceConfig"]["loginError"])
Out[]: Invalid user name and/or password
This is just the logic behind working with BeautifulSoup. I do not think you can extract "loginError" as a key with BeautifulSoup itself; the only way is to extract the script as text and work on it directly in Python. You can automate the logic better by using a regex search to identify the indices of the opening and closing curly braces and extract the value between them. If you are working with large datasets, you should not store the values in a list but instead transform the selected text on the fly, or use a Big Data framework like Apache Spark.
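As a minimal sketch of that brace-matching idea (the helper name extract_object and the 'config:' anchor are my own choices, not from the original answer), a small depth counter avoids hard-coded slice indices:

import json

def extract_object(text, anchor):
    """Parse the JSON object that starts at the first '{' after `anchor`."""
    start = text.index('{', text.index(anchor))
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:                     # braces balanced: object complete
                return json.loads(text[start:i + 1])
    raise ValueError("unbalanced braces after %r" % anchor)

# note: this simple counter assumes no braces inside quoted string values
config = extract_object(stringtext, 'config:')
print(config["loginServiceConfig"]["loginError"])  # Invalid user name and/or password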

Related

How to substring with specific start and end positions where a set of characters appear?

I am trying to clean the data I scraped from their links. I have over 100 links in a CSV I'm trying to clean.
This is what a link looks like in the CSV:
"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
I've observed that scraping this URL for HTML data doesn't go well, and I need to get the URL embedded inside it instead.
I want the substring that starts after &url= and ends at &ct, as that's where the real URL resides.
I've read posts like this, but couldn't find one that handles an ending string too. I've tried an approach from this using the substring package, but it doesn't work for more than one character.
How do I do this? Preferably without using third party packages?
I don't understand the problem.
If you have a string, then you can use string functions like .find() and slicing [start:end]:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&ct=')
text[start:end]
But url= and ct= may come in a different order, so it is safer to search for the first & after url=:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&', start)
text[start:end]
EDIT:
There is also the standard module urllib.parse for working with URLs - splitting or joining them.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
url, query = urllib.parse.splitquery(text)
data = urllib.parse.parse_qs(query)
data['url'][0]
In data you have a dictionary:
{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
'ct': ['ga'],
'rct': ['j'],
'sa': ['t'],
'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
EDIT:
Python warns that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
parts = urllib.parse.urlparse(text)
data = urllib.parse.parse_qs(parts.query)
data['url'][0]
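One convenience worth noting: parse_qs also percent-decodes the values for you, so characters such as %2C come back as literal commas. A quick check (the query string here is made up for illustration):

import urllib.parse

query = "url=https%3A%2F%2Fexample.com%2Fa%2Cb&ct=ga"
print(urllib.parse.parse_qs(query))
# {'url': ['https://example.com/a,b'], 'ct': ['ga']}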

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a webscraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would let me pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page has two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). Both tags have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings that are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source, and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something else to actually do something with the value...
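For completeness, a minimal sketch of the two fixes hinted at above: the split needs the escape sequence '\n' (backslash-n) rather than '/n', and the second pagination list can be indexed directly from find_all. This assumes the page still renders two <ul class="pagination"> elements:

from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
soup = BeautifulSoup(requests.get(source).text, 'lxml')

pagination = soup.find_all('ul', class_='pagination')
second = pagination[1]  # the page-number list at the bottom
labels = [li.get_text(strip=True) for li in second.find_all('li')]
print(labels)  # e.g. ['«', '1', '2', '3', '»']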

How do I remove double quotes from within retrieved JSON data

I'm currently using BeautifulSoup to web-scrape listings from a jobs website, outputting the data as JSON extracted from the site's HTML code.
I fix bugs with regex as they come along, but this particular issue has me stuck. When scraping a job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded in the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into strings, clean out the HTML leftovers, then convert the string into JSON. However, I've hit a snag due to text within the job listing using quotes. Since the actual data is large, I'll use a substitute.
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
Now the above won't run in Python, but the string I extracted from the job listing's HTML is pretty much in that format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I'm not at all sure how to address this issue.
EDIT Here's the actual code leading to the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re
uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()
listing_soup = BeautifulSoup(page_html, "lxml")
json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings
extracted_json_str = ''.join(json_script)
## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern=r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>",  # last part gets rid of </p> and </strong>
                                   repl='',
                                   string=extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern=r"\\u2019",
                                   repl=r"'",
                                   string=extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
                                   repl=r" -",
                                   string=extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
                                   repl="",
                                   string=extracted_json_str_CLEAN3)
## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
I do know what's causing the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control"). The way I've been extracting information from these job listings, a simple instance of someone using quotes blows up the whole approach. Surely there's a better way to build this script without liabilities like this (and without having to use regex to fix each breakdown as it arises).
Thanks!
You need to apply the escape sequence (\) if you want a double quote (") inside a value. So your string input to json.loads() should look like below.
example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
json.loads can parse this.
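A quick demonstration (using the corrected substitute string from above):

import json

example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
print(json.loads(example_string)["Category_X"])  # Here is where the "PROBLEM" lies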
# When you are extracting the text, I think you should add a check for this.
# example:
if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
    print(extraction)
In this case you convert every " in the extraction into '. I mean, you will need to convert something, because Python gives you a way to use both quote styles: if you want to use " inside a string, you either invert the symbols or escape them:
example:
"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
# in case the condition is met
if "\"" in item:
    # use this to swap the quote style
    item = item.replace("\"", "\'")
    # or use this to escape the quotes instead
    item = item.replace("\"", "\\\"")
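As a side note on the asker's code: JSON-LD blocks are normally valid JSON as-is, and the CLEAN4 step that strips every backslash also removes the \" escapes protecting the inner quotes, which is plausibly what triggers the delimiter error. A minimal sketch that parses the raw block without any regex cleanup (assuming the tag is present and non-empty; the "title" key is an assumption based on the usual JobPosting schema):

from bs4 import BeautifulSoup
from urllib.request import urlopen
import json

page_html = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html").read()
soup = BeautifulSoup(page_html, "lxml")

tag = soup.find("script", {"type": "application/ld+json"})
data = json.loads(tag.string)   # no backslash stripping, so the escapes survive
print(data.get("title"))        # key assumed from the JobPosting schema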

extracting a string from html using HTMLParser & Python2.7

I am developing an Alexa skill, and therefore only the standard Python (2.7) libraries are available to me; I don't have BeautifulSoup4 available to use.
I'm trying to identify the below string within a full HTML page and then extract the string '104 spaces' from it. '104 spaces' varies, while the rest of the markup stays constant:
<p class="jp1"><strong>Car Parking</strong>104 spaces</p>
Is this possible to do with HTMLParser, or alternatively could this be done using a regex search? I appreciate that regex is not the best tool for parsing HTML; however, given that I am using urllib2, which hands me the HTML as a string, and that I am looking to extract a specific string that follows another specific string, it may suffice in this case.
One option I am thinking of is:
s1 = "<strong>Car Parking</strong>104 spaces</p>"
s2 = "<strong>Car Parking</strong>"
print s1[s1.index(s2) + len(s2):]
This returns all of the text from "104 spaces" onwards. So how can I isolate just this segment of text?
Thanks
I don't use Python, but here is the regex idea (shown in JavaScript). After this you need to strip the strong and p tags manually, which should be easy in Python.
var test = '<p class="jp1"><strong>Car Parking</strong>104 spaces</p>'
test = test.match(/[<][/]strong[>](.*)[<][/]p[>]/gmi)
// at this point test is: [ '</strong>104 spaces</p>' ]
test = test.join('')
test = test.replace('</strong>','')
test = test.replace('</p>','')
console.log('test: ', test)
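Translated to the question's Python 2.7 setting with the stdlib re module (a sketch; print() with a single argument works in both Python 2 and 3):

import re

html = '<p class="jp1"><strong>Car Parking</strong>104 spaces</p>'
match = re.search(r'<strong>Car Parking</strong>(.*?)</p>', html)
if match:
    print(match.group(1))  # 104 spaces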

Exclude certain keyword from URL

I am successfully able to get the URL using my technique, but the point is that I need to change the URL slightly, to this: "http://www.example.com/static/p/no-name-0330-227404-1.jpg", whereas in the img tag I get this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(r'http.*?\.jpg'), etree.tostring(imagesList).decode("utf-8")) for imagesList in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print (imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from the URL, and I have no idea why the URL is inside two sets of square brackets.
If you intend to remove just the product keyword, you can simply use the .replace() method. Otherwise you can construct a regular expression to manipulate the string. Below is an example of the replace approach.
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "") # gives u "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not as clean a solution, in that it is harder to understand). However, it is better than the first approach because it dynamically discards the last -word (e.g. -product), whatever it is.
What I have done is capture 3 parts of the URL, omit the middle part (the -product bit), and combine parts 1 and 3 to form your URL.
import re
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print (match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
If all the images have the word "product" in them, could you just do a simple string replace and remove that word? Whatever you are trying to do (including renaming files), I see that as the simplest solution.
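On the asker's side question: the doubled square brackets appear because a list (built by the list comprehension) is appended into another list. A hedged sketch that keeps imagesList flat by reading the src attribute directly instead of regex-matching the serialized tag (the snippet stands in for productTree, which the question builds elsewhere):

from lxml import html as lxml_html

snippet = ('<div class="swiper-wrapper">'
           '<img src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg">'
           '</div>')
productTree = lxml_html.fromstring(snippet)

imagesList = []
for img in productTree.xpath('//*[@class="swiper-wrapper"]/img'):
    src = img.get('src')                       # read the attribute directly
    if src and src.endswith('.jpg'):
        imagesList.append(src.replace('-product', ''))
print(imagesList)  # ['http://www.example.com/static/p/no-name-0330-227404-1.jpg']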