How do I remove double quotes from within retrieved JSON data - python

I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.
I fix bugs with regex as they come along, but this particular issue has me stuck. When scraping the job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded in the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into strings, clean out the HTML leftovers, then parse the string as JSON. However, I've hit a snag because text within the job listing uses quotes. Since the actual data is large, I'll just use a substitute.
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
Now the above won't run in Python, but the string I have that has been extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I'm not at all sure how to address this issue.
EDIT Here's the actual code leading to the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re
uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()
listing_soup = BeautifulSoup(page_html, "lxml")
json_script = listing_soup.find("script", {"type":"application/ld+json"}).strings
extracted_json_str = ''.join(json_script)
## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>", # last is to get rid of </p> and </strong>
repl='',
string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",
repl = r"'",
string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
repl=r" -",
string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
repl="",
string = extracted_json_str_CLEAN3)
## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control" ). The way I've been going about extracting information from these job listings, a simple instance of someone using quotes causes my whole approach to blow up. Surely there's got to be a better way to build this script without such liabilities like this (as well as having to use regex to fix each breakdown as they arise).
Thanks!

You need to escape the double quote with a backslash (\) if you want a double quote (") inside a value. So the string you pass to json.loads() should look like this:
example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
json.loads can parse this.
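A sketch of another angle, assuming the JSON-LD on the page was valid before any cleaning: the last regex step in the question strips every backslash, which also removes the \" escapes that keep inner quotes legal. Parsing first and cleaning the parsed values afterwards avoids that. The raw string below is a made-up stand-in for the script text, not the real listing.

```python
import json
import re

# Hypothetical stand-in for the text of the <script type="application/ld+json"> tag.
raw = '{"title": "Program Manager", "description": "<p>Perform \\"quality control\\" checks</p>"}'

# Stripping backslashes before parsing destroys the \" escapes and breaks the JSON.
try:
    json.loads(raw.replace("\\", ""))
    stripped_ok = True
except json.JSONDecodeError:
    stripped_ok = False

# Parse first. strict=False also tolerates raw newlines/tabs inside string values.
listing = json.loads(raw, strict=False)

# Clean HTML leftovers from the parsed value, not from the raw JSON text.
description = re.sub(r"</?[^>]+>", "", listing["description"])

print(stripped_ok)
print(description)
```

With this ordering, quotes inside the job description never need special handling, because they are already plain characters by the time the cleanup runs.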

# When you extract this, I think you should check for it.
# Example:
if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
print(extraction)
This converts every " in extraction to '. Some conversion is needed: Python lets you use either quote character, so to put " inside a string you either swap the delimiters or escape the quote:
Example:
"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
# in case the condition is met
if "\"" in item:
    # use this
    item = item.replace("\"", "\'")
    # or use this
    item = item.replace("\"", "\\\"")
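If you are the one building the JSON text from Python strings, a minimal sketch (the variable names are only for illustration) is to let json.dumps do the escaping instead of calling replace by hand:

```python
import json

# A value that contains double quotes.
item = 'Here is where the "PROBLEM" lies'

# json.dumps escapes the inner quotes while serialising.
encoded = json.dumps({"Category_X": item})

# json.loads reverses it, giving back the original text unchanged.
decoded = json.loads(encoded)

print(encoded)
print(decoded["Category_X"] == item)
```

This keeps the original double quotes in the data rather than turning them into single quotes.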

Related

Using BeautifulSoup to extract items from a dictionary

How do I extract "loginError" and its value from the untis dict?
<script type="text/javascript">
window.untisUIVersion = 2;
window.untisMomentLocale= "de";
window.untis__webpack_public_path__ = "https://content.webuntis.com/WebUntis/static/2022.14.2/js/untis/";
untis = {
config: {"mode":"STANDARD","locale":"de-at","contextPath":"/WebUntis","licence":{"name":"HTBLA Weiz","name2":"A-8160, Dr.Karl-Widdmannstr. 40"},"mandantName":"htbla-weiz","mandant":16270,"customerNumber":70284,"imageServiceConfig":{"customLogo":true},"loginServiceConfig":{"ssoType":"none","samlProviderLabel":"","idpName":"","loginError":"Invalid user name and/or password","lastUserName":"","lastMandantName":"htbla-weiz"},
};
</script>
There are multiple ways to approach this, but you will have to transform the text yourself. How the transformation works is entirely up to you.
This is how I would do it.
First, get the string data from the web, or wherever your data comes from.
import json
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")
results = soup.find('script')
stringtext = results.get_text()
Then transform the text into JSON (because it is easy to work with). In my example, I split the string into an array on the semicolon (;), strip leading and trailing whitespace, collapse runs of consecutive whitespace into a single space, and load the rest as a JSON object.
array = stringtext.split(';')
untis = array[3].strip()
untis = re.sub(r"\s\s+", " ", untis)
structjson = json.loads(untis[18:-6]+"}")
With the JSON object you can now look for what you need.
In[] : print(structjson["loginServiceConfig"]["loginError"])
Out[]: Invalid user name and/or password
This is just the logic behind working with BeautifulSoup. I do not think you can extract "loginError" as a key with BeautifulSoup alone. The only way is to extract the script as text and work on it directly in Python. You can automate this better by using a regex search to find the indexes of the opening and closing curly braces and extracting what lies between them. If you are working with large datasets, do not store the values in a list; transform the selected text on the fly, or use a big-data framework like Apache Spark.
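One way to sketch the "find the braces" idea without magic slice offsets is to locate the config: key with a regex and let json.JSONDecoder.raw_decode from the standard library read exactly one JSON value from that position. The script_text below is a shortened, hypothetical stand-in for the real script, and it assumes the object after config: is valid JSON.

```python
import json
import re

# Shortened, hypothetical stand-in for the <script> text from the question.
script_text = '''
window.untisUIVersion = 2;
untis = {
    config: {"mode":"STANDARD","loginServiceConfig":{"ssoType":"none","loginError":"Invalid user name and/or password"}},
};
'''

# Find where the JSON value after "config:" starts.
match = re.search(r"config\s*:\s*", script_text)

# raw_decode parses one JSON value starting at the given index and
# returns the value plus the index where it ended.
config, end_index = json.JSONDecoder().raw_decode(script_text, match.end())

login_error = config["loginServiceConfig"]["loginError"]
print(login_error)
```

Because raw_decode stops at the end of the object, the trailing JavaScript after it does not need to be trimmed off first.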

How to substring with specific start and end positions where a set of characters appear?

I am trying to clean the data I scraped from their links. I have over 100 links in a CSV I'm trying to clean.
This is what a link looks like in the CSV:
"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
I've observed that scraping this for HTML data doesn't go well and I have to get the URL present inside this.
I want to get the substring which starts with &url= and ends at &ct as that's where the real URL resides.
I've read posts like this but couldn't find one for ending str too. I've tried an approach from this using the substring package but it doesn't work for more than one character.
How do I do this? Preferably without using third party packages?
I don't fully understand the problem.
If you have a string, you can use string methods like .find() and slicing with [start:end].
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&ct=')
text[start:end]
But url= and ct= may appear in a different order, so it is better to search for the first & after url=.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&', start)
text[start:end]
EDIT:
There is also the standard module urllib.parse for working with URLs, e.g. to split or join them.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
url, query = urllib.parse.splitquery(text)
data = urllib.parse.parse_qs(query)
data['url'][0]
data then holds a dictionary:
{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
'ct': ['ga'],
'rct': ['j'],
'sa': ['t'],
'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
EDIT:
Python warns that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
parts = urllib.parse.urlparse(text)
data = urllib.parse.parse_qs(parts.query)
data['url'][0]
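Since the question mentions more than 100 links, here is a small sketch that wraps the urlparse()/parse_qs() approach in a function and applies it to a list. The function name and the shortened sample links are assumptions for illustration; links without a url parameter are kept unchanged.

```python
from urllib.parse import urlparse, parse_qs

def extract_target_url(link):
    """Return the value of the 'url' query parameter, or None if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else None

# Hypothetical, shortened sample links.
links = [
    "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/a&ct=ga",
    "https://www.somenewswebsite.com/news/b",
]

# Fall back to the original link when there is nothing to unwrap.
cleaned = [extract_target_url(link) or link for link in links]
print(cleaned)
```

The rows read from the CSV can be passed through the same function one by one.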

How to convert a BeautifulSoup tag to JSON?

I have an element of type bs4.element.Tag, the product of web scraping. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears in a plain <script> </script> tag, so I had to do scripts = soup.find_all('script') and go through them until I got to the one that interests me: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is a bs4.element.Tag, I tried script.attrs, which returns {}. Then I tried to convert it to JSON with json.loads(str(script)), and it throws the exception: 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
You can use the json module to extract the data, but first you need to locate the right part of the page; you can use the re module for that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
BS4 does not parse JavaScript; from the point of view of BS4's Tag object, the text in a <script> tag is, well, just text. I don't have any idea what this script looks like (since you didn't post it and I'm not going to bother trying to find it), but if you expected script['goodsInfo'] to return the value of a JS variable named 'goodsInfo' then, bad news, it's not going to work that way.
Also, JavaScript is not JSON, so the chances that a JS snippet will be valid JSON are rather small, to say the least. The proper syntax to test it would be quite simply the same as the one you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the JS code looks like, it may be a matter of a few lines of basic string parsing / regexp stuff, or it may require a proper JavaScript parser etc.
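To make the difference concrete, here is a small sketch using a made-up script string (the real page's script was not posted, so its shape here is an assumption). Parsing the whole tag fails at character 0 because it starts with <script>, while the object literal after goodsInfo: parses fine once it is cut out.

```python
import json
import re

# Hypothetical stand-in for str(script); the real script will look different.
script_html = '''<script>
var productIntroData = {
  goodsInfo: {"detail": {"goods_name": "Example Dress", "brand": "SHEIN"}},
};
</script>'''

# The tag as a whole is not JSON, so parsing fails on the very first character.
try:
    json.loads(script_html)
    error_pos = None
except json.JSONDecodeError as err:
    error_pos = err.pos

# Cut out the object literal on the goodsInfo line, then parse only that.
# The greedy .* stays on one line because "." does not match newlines.
txt = re.findall(r"goodsInfo\s*:\s*(\{.*\})", script_html)[0]
data = json.loads(txt)

print(error_pos)
print(data["detail"]["goods_name"])
```

This only works when the extracted literal happens to be valid JSON; unquoted keys or single-quoted strings inside it would need a real JavaScript parser.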

Unable to parse some information from a webpage using search keywords

I've created a script to parse some information related to some songs from a website. When I try with this link or this one, my script works flawlessly. As far as I can tell, when I append my search keyword after this portion https://www.billboard.com/music/, I get the desired page with the information.
However, things go wrong when I try with these keywords 1 Of The Girls or Al B. Sure! or Ashford & Simpson and so on.
I can't figure out how to append the above keywords after the base link https://www.billboard.com/music/ to locate the pages with information.
Script I've tried with:
import requests
from bs4 import BeautifulSoup
LINK = "https://www.billboard.com/music/Adele"
res = requests.get(LINK)
soup = BeautifulSoup(res.text,"lxml")
scores = [item.text for item in soup.select("[class$='-history__stats'] > p > span")]
print(scores)
Result I'm getting (as expected):
['4 No. 1 Hits', '6 Top 10 Hits', '13 Songs']
The result appears on that page just after the chart history.
How can I fetch some information from a webpage using critical search keywords?
I don't know all the use cases, but the obvious pattern in the cases mentioned is that special characters are stripped out (without leaving whitespace in their place), words are lowercased, and spaces are then replaced with "-". The tricky bit may be the definition and handling of special characters.
e.g.
https://www.billboard.com/music/ashford-simpson
https://www.billboard.com/music/al-b-sure
https://www.billboard.com/music/1-of-the-girls
You could start by writing something to perform those string manipulations and then test the response code. Perhaps see whether the site's JS files contain any form of validation.
EDIT:
Do multiple blanks between words become a single blank before being replaced with "-"?
Answer developed with @Mithu for preparing search terms:
import re
keywords = ["Y?N-Vee","Ashford & Simpson","Al B. Sure!","1 Of The Girls"]
spec_char = ["!","#","$","%","&","'","(",")","*","+",",",".","/",":",";","<","=",">","?","@","[","]","^","_","`","{","|","}","~",'"',"\\"]
for elem in keywords:
    refined_keywords = re.sub('-+','-' , ''.join(i.replace(" ","-") for i in elem.lower() if i not in spec_char))
    print(refined_keywords)
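The same steps can also be sketched as a reusable function that returns the full link. This is a regex-based variant, not the code above: it assumes keywords only need ASCII letters and digits kept (accented characters would be dropped), and the names to_slug and BASE are made up for the example.

```python
import re

BASE = "https://www.billboard.com/music/"

def to_slug(keyword):
    """Lowercase, drop special characters, and join the words with '-'."""
    slug = keyword.lower()
    # Keep only ASCII letters, digits, whitespace and hyphens (an assumption).
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)
    # Collapse runs of whitespace/hyphens into a single hyphen.
    slug = re.sub(r"[\s-]+", "-", slug)
    return slug.strip("-")

keywords = ["Y?N-Vee", "Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]
links = [BASE + to_slug(keyword) for keyword in keywords]
for link in links:
    print(link)
```

Each resulting link can then be requested and its status code checked, as suggested earlier in the answer.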

Python 3 - Getting some strings from a HTTPrequest response

I'm having a hard time extracting data from an HTTP request response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
The value 42136 basically means that the string 'loginfield' exists in response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' #continues extracting of strings until it stops where I want it to stop extracting.
Anybody got an idea of how I should approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, "html.parser")
print(soup.find('title').text)
Should print:
Some title here
But that depends on whether it's the first title or not.
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
str.find() returns -1 if the substring does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first char of that string.
Since you edited your question:
>>> r.text.find('loginfield')
42136
That means the string "loginfield" starts at offset 42136 in the text. You could display, say, 200 characters starting at that position like this:
>>> print(r.text[42136:42136+200])
To find the various values you are looking for, you have to figure out where they are relative to that position.
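As a rough sketch of that idea, two find() calls can cut out everything between a start marker and an end marker. The html string and the helper name slice_between are made up for the example; for real pages an HTML parser is still the sturdier choice.

```python
# Hypothetical stand-in for r.text.
html = ("<html><head><title>Some title here</title></head>"
        "<body><div id='loginfield'>login form</div></body></html>")

def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker through end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return text[start:end + len(end_marker)]

title_tag = slice_between(html, "<title>", "</title>")
div_tag = slice_between(html, "<div id='loginfield'", "</div>")
missing = slice_between(html, "<table", "</table>")

print(title_tag)
print(div_tag)
print(missing)
```

The None return mirrors the -1 that find() gives when a marker is not present.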
