I have the string below, and I am able to grab the 'text' I wanted (the text is wrapped between a pattern). The code is given below:
import re

val1 = '[{&quot;vmdId&quot;:&quot;Text1&quot;,&quot;vmdVersion&quot;:&quot;text2&quot;,&quot;vmId&quot;:&quot;text3&quot;},{&quot;vmId&quot;:&quot;text4&quot;,&quot;vmVersion&quot;:&quot;text5&quot;,&quot;vmId&quot;:&quot;text6&quot;}]'
temp = val1.split(',')
list_len = len(temp)
for i in range(0, list_len):
    var = temp[i]
    found = re.findall(r':&quot;([^(]*)\&quot\;', var)
    print ''.join(found)
I would like to replace the values (Text1, text2, text3, etc.) with new values provided by the user, or read from another XML. (Text1, text2, ... are totally random alphanumeric data.) Below are some details:
Text1 = somename
text2 = alphanumeric value
text3 = somename
Text4 = somename
text5 = alphanumeric value
text6 = somename
anstring =
[{&quot;vmdId&quot;:&quot;newText1&quot;,&quot;vmdVersion&quot;:&quot;newtext2&quot;,&quot;vmId&quot;:&quot;newtext3&quot;},{&quot;vmId&quot;:&quot;newtext4&quot;,&quot;vmVersion&quot;:&quot;newtext5&quot;,&quot;vmId&quot;:&quot;newtext6&quot;}]
I decided to go with replace(), but later realized the data is not constant, hence I am seeking help again. Any help would be appreciated. Also, please let me know if I can improve the way I am grabbing the values right now, as I am new to regex.
You can do this by using backreferences in combination with re.sub:
import re

val1 = '[{&quot;vmdId&quot;:&quot;Text1&quot;,&quot;vmdVersion&quot;:&quot;text2&quot;,&quot;vmId&quot;:&quot;text3&quot;},{&quot;vmId&quot;:&quot;text4&quot;,&quot;vmVersion&quot;:&quot;text5&quot;,&quot;vmId&quot;:&quot;text6&quot;}]'
ansstring = re.sub(r'(?<=:&quot;)([^(]*)', r'new\g<1>', val1)
print ansstring
\g<1> refers to the text matched by the first group, i.e. the first ().
EDIT
Maybe a better approach would be to decode the string, change the data, and encode it again. This should allow you to access the values more easily.
import sys

# python2 version
if sys.version_info[0] < 3:
    import HTMLParser
    html = HTMLParser.HTMLParser()
    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

    html.escape = html_escape
else:
    import html

import json

val1 = '[{&quot;vmdId&quot;:&quot;Text1&quot;,&quot;vmdVersion&quot;:&quot;text2&quot;,&quot;vmId&quot;:&quot;text3&quot;},{&quot;vmId&quot;:&quot;text4&quot;,&quot;vmVersion&quot;:&quot;text5&quot;,&quot;vmId&quot;:&quot;text6&quot;}]'
print(val1)

unescaped = html.unescape(val1)
json_data = json.loads(unescaped)
for d in json_data:
    d['vmId'] = 'new value'

new_unescaped = json.dumps(json_data)
new_val = html.escape(new_unescaped)
print(new_val)
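If the new values come from the user rather than being a fixed string, the same decode/modify/encode round trip works with a dict of replacements. A minimal sketch over the already-unescaped JSON (the replacement keys and values here are hypothetical):

```python
import json

# hypothetical replacements supplied by the user
replacements = {'vmId': 'text99', 'vmdVersion': 'v2.0'}

unescaped = '[{"vmdId": "Text1", "vmdVersion": "text2", "vmId": "text3"}]'
json_data = json.loads(unescaped)
for d in json_data:
    # overwrite only the keys the user asked to change
    for key, new_val in replacements.items():
        if key in d:
            d[key] = new_val

new_unescaped = json.dumps(json_data)
print(new_unescaped)
```

You would then re-escape new_unescaped with html.escape as above if the consumer expects entities.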
I hope this helps.
I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues with it.
There are some inconsistencies in the document - phone numbers sometimes start with Tel:, other times with Tel.:, and even Te: once; I noticed one of the emails is just on the last line for that distributor without the Email: prefix; and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]

    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText

        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []

        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)

    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later

        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])

        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()

        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']

        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can at least be double-checked manually.
The function combines "Mobile" into "Tel", even though they're extracted with separate regexes.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
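If you'd rather have "Tel" always be a list, a tiny helper (hypothetical, not part of the function above) could normalize it after the fact:

```python
def as_list(value):
    """Wrap a bare string in a single-element list; leave lists unchanged."""
    return value if isinstance(value, list) else [value]

# e.g. as_list('040-27848142') -> ['040-27848142']
```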
After this, you can just view as DataFrame with:
# import docx
# import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs))  # <-- all columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
I have a list of dictionaries, and within the dictionary is a list.
{
    "Credentials": [
        {
            "realName": "Mark Toga",
            "toolsOut": [
                "TL-482940",
                "TL-482940"
            ],
            "username": "291F"
        },
        {
            "realName": "Burt Mader",
            "toolsOut": [],
            "username": "R114"
        },
        {
            "realName": "Tim Johnson",
            "toolsOut": [
                "TL-482940"
            ],
            "username": "E188"
        }
    ]
}
I am attempting to parse this file so that it shows something like this:
Mark Toga: TL-482940, TL-482940
Tim Johnson: TL-482940
Omitting Burt Mader, as he has no tools out.
I have it to a point where it displays the above, but with Burt Mader still included (GUI output).
Edit: Here is a printout of newstr6 rather than the GUI image. I do want the GUI for my application, but for ease of reading:
Mark Toga: 'TL-482940', 'TL-482940',
Burt Mader: ,
Tim Johnson: 'TL-482940'
Here is my current code (I'm sure there are many efficiency improvements to be made, but I mostly care about omitting the dictionary with the empty list):
## importing libraries
import json
from tkinter import *
from tkinter import ttk
from functools import partial
import pprint

mainWin = Tk()
mainWin.geometry('400x480')
mainWin.title('Select Tooling')

with open('Inventory.json', 'r+') as json_file:
    data = json.load(json_file)

credData = data['Credentials']
noSID = [{k: v for k, v in d.items() if k != 'username'} for d in credData]
print(noSID)

pp = pprint.pformat(noSID)
ps = str(pp)
newstr1 = ps.replace('[', '')
newstr2 = newstr1.replace(']', '')
newstr3 = newstr2.replace('{', '')
newstr4 = newstr3.replace('}', '')
newstr5 = newstr4.replace("'realName': '", "")
newstr6 = newstr5.replace("', 'toolsOut'", "")

text = Label(mainWin, text=newstr6)
text.pack()
quitButton = Button(mainWin, text="Log Out", command=lambda: mainWin.destroy())
quitButton.pack()
mainWin.mainloop()
This smells like an X-Y Problem. You don't want to display the person that has no tools checked out, but you don't really need to remove them from the list to do that. You're relying on pprint to convert a dictionary to a string, and then messing with that string. Instead, just build the string from scratch, and don't include the people with no tools checked out.
data = json.load(json_file)
credData = data['Credentials']

# Since you're the one creating the string, you can choose what you want to put in it
# No need to create a NEW dictionary without the username keys
outstr = ""
for person in credData:
    outstr += person['realName'] + ": " + ", ".join(person['toolsOut']) + "\n"
print(outstr)
This prints:
Mark Toga: TL-482940, TL-482940
Burt Mader:
Tim Johnson: TL-482940
Now, since you want to ignore the persons that don't have any tools, add that condition.
outstr = ""
for person in credData:
    if person['toolsOut']:
        outstr += person['realName'] + ": " + ", ".join(person['toolsOut']) + "\n"
print(outstr)
And you get:
Mark Toga: TL-482940, TL-482940
Tim Johnson: TL-482940
if person['toolsOut'] is equivalent to if len(person['toolsOut']) != 0, because empty lists are falsy.
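A quick illustration of that truthiness rule:

```python
tools_empty = []
tools_out = ["TL-482940"]

# empty list -> False, non-empty list -> True in a boolean context
print(bool(tools_empty))  # False
print(bool(tools_out))    # True
```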
If you really want to remove the elements of credData that have empty toolsOut keys, you can use this same condition in a list comprehension.
credData2 = [person for person in credData if person['toolsOut']]
You could filter out the unwanted items when you load the "credential" list:
credData = [d for d in data['Credentials'] if d.get("toolsOut")]
or you could have a separate variable for the filtered credentials
credWithTools = [d for d in credData if d.get("toolsOut")]
Just filter your list of dictionaries by applying a certain condition. In this case, the content associated with the dict key toolsOut should be truthy:
def process_data(list_of_dicts, field):
    res = []
    for item in list_of_dicts:
        if item[field]:
            res.append(item)
    return res

credData = process_data(data["Credentials"], "toolsOut")
I wrote some code that grabs the numbers I need from this website, but I don't know what to do next.
It grabs the numbers from the table at the bottom. The ones under calving ease, birth weight, weaning weight, yearling weight, milk and total maternal.
#!/usr/bin/python

import urllib2
from bs4 import BeautifulSoup
import pyperclip

def getPageData(url):
    if not ('abri.une.edu.au' in url):
        return -1
    webpage = urllib2.urlopen(url).read()
    soup = BeautifulSoup(webpage, "html.parser")
    # This finds the epd tree and saves it as a searchable list
    pedTreeTable = soup.find('table', {'class': 'TablesEBVBox'})
    # This puts all of the epds into a list.
    # it looks for anything in pedTreeTable with a td tag.
    pageData = pedTreeTable.findAll('td')
    pageData.pop(7)
    return pageData

def createPedigree(animalPageData):
    '''make animalPageData much more useful. Strip the text out and put it in a dict.'''
    animals = []
    for animal in animalPageData:
        animals.append(animal.text)
    prettyPedigree = {
        'calving_ease': animals[18],
        'birth_weight': animals[19],
        'wean_weight': animals[20],
        'year_weight': animals[21],
        'milk': animals[22],
        'total_mat': animals[23]
    }
    for animalKey in prettyPedigree:
        if animalKey != 'year_weight' and animalKey != 'dam':
            prettyPedigree[animalKey] = stripRegNumber(prettyPedigree[animalKey])
    return prettyPedigree

def stripRegNumber(animal):
    '''returns the animal with its registration number stripped'''
    lAnimal = animal.split()
    strippedAnimal = ""
    for word in lAnimal:
        if not word.isdigit():
            strippedAnimal += word + " "
    return strippedAnimal

def prettify(pedigree):
    '''Takes the pedigree and prints it out in a usable format'''
    s = ''
    pedString = ""
    # this is also ugly, but it was the only way I found to format with a variable
    cFormat = '{{:^{}}}'
    rFormat = '{{:>{}}}'
    # row 1 of string
    s += rFormat.format(len(pedigree['calving_ease'])).format(
        pedigree['calving_ease']) + '\n'
    # row 2 of string
    s += rFormat.format(len(pedigree['birth_weight'])).format(
        pedigree['birth_weight']) + '\n'
    # row 3 of string
    s += rFormat.format(len(pedigree['wean_weight'])).format(
        pedigree['wean_weight']) + '\n'
    # row 4 of string
    s += rFormat.format(len(pedigree['year_weight'])).format(
        pedigree['year_weight']) + '\n'
    # row 5 of string
    s += rFormat.format(len(pedigree['milk'])).format(
        pedigree['milk']) + '\n'
    # row 6 of string
    s += rFormat.format(len(pedigree['total_mat'])).format(
        pedigree['total_mat']) + '\n'
    return s

if __name__ == '__main__':
    while True:
        url = raw_input('Input a url you want to use to make life easier: \n')
        pageData = getPageData(url)
        s = prettify(createPedigree(pageData))
        pyperclip.copy(s)
        if len(s) > 0:
            print 'the easy string has been copied to your clipboard'
I've just been using this code for easy copying and pasting. All I have to do is insert the URL, and it saves the numbers to my clipboard.
Now I want to use this code on my website; I want to be able to insert a URL in my HTML code, and it displays these numbers on my page in a table.
My questions are as follows:
How do I use the python code on the website?
How do I insert collected data into a table with HTML?
It sounds like you would want to use something like Django. Although the learning curve is a bit steep, it is worth it, and it (of course) supports Python.
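Whichever framework you pick, the server-side piece is small: the dict returned by createPedigree just needs to be rendered as an HTML table string that the framework drops into a template. A rough sketch of that rendering step (the sample dict and its values are made up):

```python
def pedigree_to_html_table(pedigree):
    """Render a pedigree dict as a simple two-column HTML table."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(trait, value)
        for trait, value in pedigree.items()
    )
    return ("<table><tr><th>Trait</th><th>Value</th></tr>"
            + rows + "</table>")

# hypothetical dict shaped like createPedigree's output
sample = {'calving_ease': '+4.1', 'birth_weight': '+2.2'}
html_table = pedigree_to_html_table(sample)
```

In a Django view you would pass a string like this (or, better, the raw dict) into a template; the URL the user submits via a form would be handed to getPageData on the server.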
I'm trying to work with Chinese text and big data in Python.
Part of the work is cleaning the text of some unneeded data. For this goal I am using regexes. However, I have met some problems, both with Python regex and with the PyCharm application:
1) The data is stored in PostgreSQL and is viewed well in the columns; however, after selecting it and pulling it into a variable, it is displayed as squares.
When the value is printed to the console, it looks like:
Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)
So I presume there is no problem with the application encoding, but rather with the encoding in the debugger; however, I did not find any solutions for such behaviour.
2) An example of a regex I need to take care of is one that removes the values between Chinese brackets, including the brackets themselves. The code I used is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from pprint import pprint
import sys, locale, os

columnString = row[columnName]
startFrom = valuestoremove["startsTo"]
endWith = valuestoremove["endsAt"]
isInclude = valuestoremove["include"]
escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
nonASCIIregex = re.compile('([^\x00-\x7F])')
if escapeCharsRegex.match(startFrom):
    startFrom = re.escape(startFrom)
if escapeCharsRegex.match(endWith):
    endWith = re.escape(endWith)
if isInclude:
    regex = startFrom + '(.*)' + endWith
else:
    regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
if nonASCIIregex.match(regex):
    p = re.compile(ur'' + regex)
else:
    p = re.compile(regex)
row[columnName] = p.sub("", columnString).strip()
But the regex has no effect on the given string.
I've made a test with the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string
And it is work fine for me.
The only difference between those two code examples is that in the first, the regex values come from a txt file with JSON, encoded as UTF-8:
{
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "1"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}
The Chinese brackets from the file are also displayed as squares:
I can't find an explanation or any solution for this behavior, thus the community's help is needed.
Thanks for the help.
The problem is that the text you're reading in isn't getting understood as Unicode correctly (this is one of the big gotchas that prompted sweeping changes for Python 3k). Instead of:
data_file = myfile.read()
You need to tell it to decode the file:
data_file = myfile.read().decode("utf8")
Then continue with json.loads, etc, and it should work out fine. Alternatively,
data = json.load(myfile, "utf8")
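For example, with a UTF-8 byte string standing in for the file contents (the value below is made up), the decode step looks like this:

```python
import json

# UTF-8 encoded bytes, as read from the file
raw = b'{"brand": "\xe6\x9b\xbc\xe7\xa7\x80\xe9\x9b\xb7\xe6\x95\xa6"}'
text = raw.decode("utf8")   # bytes -> unicode text
data = json.loads(text)
print(data["brand"])        # 曼秀雷敦
```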
After many searches and consultations, here is a solution for Chinese text (it also works for mixed- and non-mixed-language text):
import codecs

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)
    if isInclude:
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'
    p = re.compile(codecs.encode(unicode(regex), "utf-8"))  # <-- the key line
    delimiter = ' '
    if localization == 'CN':
        delimiter = ''
    row[columnName] = p.sub(delimiter, columnString).strip()
As you can see, we encode every regex to UTF-8, so that it matches the PostgreSQL db value.
Using Python's (2.7) 'json' module, I'm looking to process various JSON feeds. Unfortunately some of these feeds do not conform to JSON standards - specifically, some keys are not wrapped in double quotation marks ("). This is causing Python to bug out.
Before writing an ugly-as-hell piece of code to parse and repair the incoming data, I thought I'd ask - is there any way to allow Python to either parse this malformed JSON or 'repair' the data so that it would be valid JSON?
Working example
>>> import json
>>> json.loads('{"key1":1,"key2":2,"key3":3}')
{'key3': 3, 'key2': 2, 'key1': 1}
Broken example
>>> import json
>>> json.loads('{key1:1,key2:2,key3:3}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\json\__init__.py", line 310, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 346, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 362, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)
I've written a small REGEX to fix the JSON coming from this particular provider, but I foresee this being an issue in the future. Below is what I came up with.
>>> import re
>>> s = '{key1:1,key2:2,key3:3}'
>>> s = re.sub('([{,])([^{:\s"]*):', lambda m: '%s"%s":'%(m.group(1),m.group(2)),s)
>>> s
'{"key1":1,"key2":2,"key3":3}'
You're trying to use a JSON parser to parse something that isn't JSON. Your best bet is to get the creator of the feeds to fix them.
I understand that isn't always possible. You might be able to fix the data using regexes, depending on how broken it is:
j = re.sub(r"{\s*(\w)", r'{"\1', j)
j = re.sub(r",\s*(\w)", r',"\1', j)
j = re.sub(r"(\w):", r'\1":', j)
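Applied to the broken example from the question, those three substitutions produce valid JSON:

```python
import re
import json

j = '{key1:1,key2:2,key3:3}'
j = re.sub(r"{\s*(\w)", r'{"\1', j)   # quote the first key
j = re.sub(r",\s*(\w)", r',"\1', j)   # quote keys after commas
j = re.sub(r"(\w):", r'\1":', j)      # close the quote before the colon
print(j)              # {"key1":1,"key2":2,"key3":3}
print(json.loads(j))
```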
Another option is to use the demjson module, which can parse JSON in non-strict mode.
The regular expressions pointed out by Ned and cheeseinvert don't take into account the case when the match is inside a string.
See the following example (using cheeseinvert's solution):
>>> fixLazyJsonWithRegex ('{ key : "a { a : b }", }')
'{ "key" : "a { "a": b }" }'
The problem is that the expected output is:
'{ "key" : "a { a : b }" }'
Since JSON tokens are a subset of python tokens, we can use python's tokenize module.
Please correct me if I'm wrong, but the following code will fix a lazy json string in all the cases:
import tokenize
import token
from StringIO import StringIO

def fixLazyJson (in_text):
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    result = []
    for tokid, tokval, _, _, _ in tokengen:
        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        result.append((tokid, tokval))

    return tokenize.untokenize(result)
So in order to parse a JSON string, you might want to encapsulate a call to fixLazyJson for when json.loads fails (to avoid performance penalties for well-formed JSON):
import json

def json_decode (json_string, *args, **kwargs):
    try:
        return json.loads (json_string, *args, **kwargs)
    except:
        json_string = fixLazyJson (json_string)
        return json.loads (json_string, *args, **kwargs)
The only problem I see when fixing lazy JSON is that if the JSON is malformed, the error raised by the second json.loads won't reference the line and column in the original string, but in the modified one.
As a final note, I just want to point out that it would be straightforward to update either of these methods to accept a file object instead of a string.
BONUS: Apart from this, people usually like to include C/C++ comments when JSON is used for configuration files. In that case, you can either remove the comments using a regular expression, or use the extended version below and fix the JSON string in one pass:
import tokenize
import token
from StringIO import StringIO

def fixLazyJsonWithComments (in_text):
    """ Same as fixLazyJson but removing comments as well """
    result = []
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    sline_comment = False
    mline_comment = False
    last_token = ''

    for tokid, tokval, _, _, _ in tokengen:

        # ignore single line and multi line comments
        if sline_comment:
            if (tokid == token.NEWLINE) or (tokid == tokenize.NL):
                sline_comment = False
            continue

        # ignore multi line comments
        if mline_comment:
            if (last_token == '*') and (tokval == '/'):
                mline_comment = False
            last_token = tokval
            continue

        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        # detect single-line comments
        elif tokval == "//":
            sline_comment = True
            continue

        # detect multiline comments
        elif (last_token == '/') and (tokval == '*'):
            result.pop()  # remove previous token
            mline_comment = True
            continue

        result.append((tokid, tokval))
        last_token = tokval

    return tokenize.untokenize(result)
Expanding on Ned's suggestion, the following has been helpful for me:
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
In a similar case, I have used ast.literal_eval. AFAIK, this won't work when a constant like null (corresponding to Python None) appears in the JSON.
Given that you know about the null/None predicament, you can:
import ast
decoded_object = ast.literal_eval(json_encoded_text)
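For instance, it happily handles the single-quoted keys and values that json.loads rejects:

```python
import ast

broken = "{'key1': 1, 'key2': 'two'}"   # not valid JSON, but a valid Python literal
decoded = ast.literal_eval(broken)
print(decoded['key2'])   # two
```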
In addition to Ned's and cheeseinvert's suggestions, adding (?!/) should avoid the mentioned problem with URLs:
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
j = re.sub(r",\s*]", "]", j)