Python: print regex matches in a JSON string
I'm using a small Python script to print every occurrence of the website 'northwest.hall.*' (where the wildcard * is a number) in a very large JSON string pulled from a URL.
I have this so far:
import urllib, json, re
url = 'http://graphite.website.com/render/?target=stats.web.northwest.hall.*&format=json'
response = urllib.urlopen(url)
data = json.loads(response.read())
code = re.findall('northwest', data)
print code
This should return a list of the 30 occurrences of northwest.hall.<number> in the JSON being parsed, but I get the following error instead:
Traceback (most recent call last):
File "/Users/arin/Desktop/scripts/code_parser2.py", line 7, in <module>
code = re.findall('community', data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
New to Python (sure you can tell).
Thanks in advance.
Use
data = response.read()
to get the JSON string from the server.
Using
data = json.loads(response.read())
you convert that string into a Python dictionary; re.findall() expects a string (or buffer), not a dict, which is why you get the TypeError.
EDIT:
import re
data = """
stats.web.northwest.hall.01
stats.web.northwest.hall.223
stats.web.northwest.hall.31
stats.web.northwest.hall.4
"""
print re.findall(r'stats\.web\.northwest\.hall\.(\d+)', data)  # escape the dots so they match literal dots only
['01', '223', '31', '4']
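Putting the pieces together, a minimal corrected version of the original script might look like this (same Graphite URL as in the question; the dots in the pattern are escaped so they only match literal dots):

import urllib
import re

url = 'http://graphite.website.com/render/?target=stats.web.northwest.hall.*&format=json'
response = urllib.urlopen(url)

# keep the raw JSON text -- re.findall() needs a string, not a parsed dict
data = response.read()

# collect the numeric suffix of every northwest.hall.<number> occurrence
codes = re.findall(r'northwest\.hall\.(\d+)', data)
print codes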
Related
Convert json query to insert a variable and re-convert it to json query
I am kinda frustrated. I copied the following Metabase query string from the network tab in the browser:

query = "{\"database\":17,\"query\":{\"source-table\":963,\"filter\":[\"and\",[\"=\",[\"field\",17580,null],\"XXXXXX_XXXXXX\"],[\"=\",[\"field\",17599,null],\"**chl-43d813dd-05a7-45b8-a5b0-8eb960289aa5**\"]],\"fields\":[[\"field\",17579,null],[\"field\",17569,null],[\"field\",17572,null],[\"field\",17586,null],[\"field\",17592,{\"temporal-unit\":\"default\"}],[\"field\",17611,null],[\"field\",17582,null],[\"field\",17597,null],[\"field\",17603,null],[\"field\",17607,null],[\"field\",17576,null],[\"field\",17588,null],[\"field\",17596,null],[\"field\",17608,null],[\"field\",17587,{\"temporal-unit\":\"default\"}],[\"field\",17578,{\"temporal-unit\":\"default\"}],[\"field\",17602,null],[\"field\",17606,null],[\"field\",17605,{\"temporal-unit\":\"default\"}],[\"field\",17601,null],[\"field\",17590,null],[\"field\",17580,null],[\"field\",17598,{\"temporal-unit\":\"default\"}],[\"field\",17577,null],[\"field\",164910,null],[\"field\",46951,null],[\"field\",46952,{\"temporal-unit\":\"default\"}]]},\"type\":\"query\",\"middleware\":{\"js-int-to-string?\":true,\"add-default-userland-constraints?\":true}}"

As the next step I wanted to convert it to a string so I could replace the bold reference with a variable. The string looks like this:

query = '{"database\":17,\"query\":{\"source-table\":963,\"filter\":[\"and\",[\"=\",[\"field\",17580,null],\"XXXXXXXX-XXXXXXXX\"],[\"=\",[\"field\",17599,null],\"'+channelRef+'\"]],\"fields\":[[\"field\",17579,null],[\"field\",17569,null],[\"field\",17572,null],[\"field\",17586,null],[\"field\",17592,{\"temporal-unit\":\"default\"}],[\"field\",17611,null],[\"field\",17582,null],[\"field\",17597,null],[\"field\",17603,null],[\"field\",17607,null],[\"field\",17576,null],[\"field\",17588,null],[\"field\",17596,null],[\"field\",17608,null],[\"field\",17587,{\"temporal-unit\":\"default\"}],[\"field\",17578,{\"temporal-unit\":\"default\"}],[\"field\",17602,null],[\"field\",17606,null],[\"field\",17605,{\"temporal-unit\":\"default\"}],[\"field\",17601,null],[\"field\",17590,null],[\"field\",17580,null],[\"field\",17598,{\"temporal-unit\":\"default\"}],[\"field\",17577,null],[\"field\",164910,null],[\"field\",46951,null],[\"field\",46952,{\"temporal-unit\":\"default\"}]]},\"type\":\"query\",\"middleware\":{\"js-int-to-string?\":true,\"add-default-userland-constraints?\":true}}'

With q = json.dumps(query) the result looks exactly as I want:

q = "{\"database\":17,\"query\":{\"source-table\":963,\"filter\":[\"and\",[\"=\",[\"field\",17580,null],\"XXXXXXXX-XXXXXXXX\"],[\"=\",[\"field\",17599,null],\"**chl-caabef81-f081-4532-9b6e-ac20b3d4c6cf**\"]],\"fields\":[[\"field\",17579,null],[\"field\",17569,null],[\"field\",17572,null],[\"field\",17586,null],[\"field\",17592,{\"temporal-unit\":\"default\"}],[\"field\",17611,null],[\"field\",17582,null],[\"field\",17597,null],[\"field\",17603,null],[\"field\",17607,null],[\"field\",17576,null],[\"field\",17588,null],[\"field\",17596,null],[\"field\",17608,null],[\"field\",17587,{\"temporal-unit\":\"default\"}],[\"field\",17578,{\"temporal-unit\":\"default\"}],[\"field\",17602,null],[\"field\",17606,null],[\"field\",17605,{\"temporal-unit\":\"default\"}],[\"field\",17601,null],[\"field\",17590,null],[\"field\",17580,null],[\"field\",17598,{\"temporal-unit\":\"default\"}],[\"field\",17577,null],[\"field\",164910,null],[\"field\",46951,null],[\"field\",46952,{\"temporal-unit\":\"default\"}]]},\"type\":\"query\",\"middleware\":{\"js-int-to-string?\":true,\"add-default-userland-constraints?\":true}}"

But when I use this query string to send an API request, I get the following error message(s):

{"via":[{"type":"java.lang.ClassCastException"}],"trace":[],"message":null}

Traceback (most recent call last):
File "c:\Users\XXXX\Documents\XXXXXXXX\Test.py", line 308, in <module>
main()
File "c:\Users\XXXX\Documents\XXXXXXXX\Test.py", line 114, in main
some_function(XXXX, window, selected_path)
File "c:\Users\XXXX\Documents\XXXXXXXX\Test.py", line 290, in some_function
dataframe = DataFrame(result)
File "C:\Users\XXXX\AppData\Roaming\Python\Python310\site-packages\pandas\core\frame.py", line 756, in __init__
raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!

Does anyone have an idea? Thank you very much in advance!
You can use the built-in json module:

import json

query = "{\"database\":17,\"query\":{\"source-table\":963,\"filter\":[\"and\",[\"=\",[\"field\",17580,null],\"XXXXXX_XXXXXX\"],[\"=\",[\"field\",17599,null],\"**chl-43d813dd-05a7-45b8-a5b0-8eb960289aa5**\"]],\"fields\":[[\"field\",17579,null],[\"field\",17569,null],[\"field\",17572,null],[\"field\",17586,null],[\"field\",17592,{\"temporal-unit\":\"default\"}],[\"field\",17611,null],[\"field\",17582,null],[\"field\",17597,null],[\"field\",17603,null],[\"field\",17607,null],[\"field\",17576,null],[\"field\",17588,null],[\"field\",17596,null],[\"field\",17608,null],[\"field\",17587,{\"temporal-unit\":\"default\"}],[\"field\",17578,{\"temporal-unit\":\"default\"}],[\"field\",17602,null],[\"field\",17606,null],[\"field\",17605,{\"temporal-unit\":\"default\"}],[\"field\",17601,null],[\"field\",17590,null],[\"field\",17580,null],[\"field\",17598,{\"temporal-unit\":\"default\"}],[\"field\",17577,null],[\"field\",164910,null],[\"field\",46951,null],[\"field\",46952,{\"temporal-unit\":\"default\"}]]},\"type\":\"query\",\"middleware\":{\"js-int-to-string?\":true,\"add-default-userland-constraints?\":true}}"

my_json = json.loads(query)  # make edits (works like a dict)
query = json.dumps(my_json)
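As an illustration of the "make edits" step, here is a sketch that swaps in the channel reference. The query string below is trimmed to the relevant structure, and the ["query"]["filter"][2][2] index path is an assumption based on the filter layout shown in the question:

import json

# trimmed stand-in for the full query string from the question (structure only)
query = ('{"database":17,"query":{"source-table":963,'
         '"filter":["and",'
         '["=",["field",17580,null],"XXXXXX_XXXXXX"],'
         '["=",["field",17599,null],"chl-43d813dd-05a7-45b8-a5b0-8eb960289aa5"]]},'
         '"type":"query"}')

channelRef = "chl-caabef81-f081-4532-9b6e-ac20b3d4c6cf"  # the value to inject

payload = json.loads(query)                    # now a plain dict
payload["query"]["filter"][2][2] = channelRef  # second "=" clause, value slot (assumed layout)
query = json.dumps(payload)                    # back to a JSON string for the API call
print(query)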
I don't see a bold reference in your JSON string, but this is all handled with the json library:

import json

query = "YOUR QUERY STRING"
obj = json.loads(query)      # parse the JSON string into a dict
# Make your changes to obj here
new_query = json.dumps(obj)  # serialize back to a JSON string
NameError in function to retrieve JSON data
I'm using Python 3.6.1 and have the following code, which successfully retrieves data in JSON format:

import urllib.request, json, pprint

url = "https://someurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
pprint.pprint(data)

I want to wrap this in a function so I can reuse it. This is what I have tried in a file called getdata.py:

from urllib.request import urlopen
import json

def get_json_data(url):
    response = urlopen(url)
    return json.loads(response.read())

and this is the error I get after importing the file and attempting to print out the response:

>>> import getdata
>>> print(getdata.get_json_data("https://someurl"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Nick\getdata.py", line 6, in get_json_data
from urllib.request import urlopen
NameError: name 'urllib' is not defined

I also tried this and got the same error:

import urllib.request, json

def get_json_data(url):
    response = urllib.request.urlopen(url)
    return json.loads(response.read())

What do I need to do to get this to work? Cheers.
It's working now! I think the problem was the Hydrogen add-on I have for the Atom editor. I uninstalled it, tried again, and it worked. Thanks for looking.
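For reference, a self-contained version of the helper (the URL is the placeholder from the question) runs fine from a plain Python 3 interpreter, which supports the conclusion that the editor environment, not the code, was the problem:

# getdata.py
from urllib.request import urlopen
import json


def get_json_data(url):
    """Fetch a URL and decode its JSON body into Python objects."""
    with urlopen(url) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # substitute a real endpoint that returns JSON
    print(get_json_data("https://someurl"))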
lxml.etree.XPathEvalError: Invalid expression
I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:

response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)

and still get the error:

Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/@href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.
The expression '//divass="campaign"]/a/@href' is not syntactically correct and does not make much sense. Instead you meant to check the class attribute:

//div[@class="campaign"]/a/@href

Now, that would help to avoid the Invalid expression error, but the expression would still find nothing. This is because the data is not there in the response that requests receives. You would need to mimic what the browser does to get the desired data and make an additional request to get the JavaScript file containing the campaigns. Here is what works for me:

import ast
import re
import requests
from lxml import html

with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]

    # get the script
    response = session.get(script_url)
    data = ast.literal_eval(re.match(r'document.write\((.*?)\);$', response.content).group(1))

    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)

Prints:

['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
 ...
 'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481']
You went wrong when building the XPath. If you want to collect all the hrefs, your XPath should look like this:

hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))

or in one line:

hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode
I know many people have encountered this error before, but I couldn't find the solution to my problem. I have a URL that I want to normalize:

url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.unquote(path)
path = urllib.quote(path, safe="%/")

This gives an error message:

/usr/lib64/python2.6/urllib.py:1236: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
File "url_normalization.py", line 246, in <module>
logging.info(get_canonical_url(url))
File "url_normalization.py", line 102, in get_canonical_url
path = urllib.quote(path, safe="%/")
File "/usr/lib64/python2.6/urllib.py", line 1236, in quote
res = map(safe_map.__getitem__, s)
KeyError: u'\xc3'

I tried removing the unicode indicator "u" from the URL string, and then I do not get the error message. But how can I get rid of the unicode automatically, given that I read the URL directly from a database?
urllib.quote() does not handle Unicode properly. To get around this, you can call the .encode() method on the URL when reading it (or on the variable you read from the database). So run url = url.encode('utf-8'). With this you get:

import urllib
import urlparse
from urlparse import urlsplit

url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
url = url.encode('utf-8')
scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.unquote(path)
path = urllib.quote(path, safe="%/")

and then your output for the path variable will be:

>>> path
'/Dienste/Fachbeitr%C3%A4ge.aspx'

Does this work?
ElementTree XML parsing and urllib2.urlopen
I'm opening a URL using:

response = urllib2.urlopen(url, data, timeout=_TIMEOUT)

and response.read() gives the following output:

<XMLlookup licenseid="X4X6X42" reason="OK" status="1" />

but when I want to parse it using ElementTree, like so:

print response.read()
t = ET.parse(response)
r = t.getroot()
print r.attrib.get('status')

it gives me the following error message:

File "<string>", line 62, in parse
File "<string>", line 38, in parse
cElementTree.ParseError: no element found: line 1, column 0

However, when I remove the line response.read(), the code works fine. What am I doing wrong?
You can only read a response once, because it is a file-like object (in fact an addinfourl). Subsequent calls to read will return an empty string, since you have already read the whole text. So either do not call read before using ET.parse(response), or store the result in a string and use that for ElementTree:

txt = response.read()
# do what you want with txt (without changing it)
t = ET.fromstring(txt)
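A small stand-alone illustration of that one-shot behaviour, using StringIO as a stand-in for the urllib2 response (any file-like object behaves the same way; the XML is the sample output from the question):

from StringIO import StringIO
import xml.etree.cElementTree as ET

# stand-in for the urllib2 response object
response = StringIO('<XMLlookup licenseid="X4X6X42" reason="OK" status="1" />')

first = response.read()   # the whole document
second = response.read()  # '' -- the stream is exhausted, nothing is left to parse

# parse from the string you saved, not from the drained response
root = ET.fromstring(first)
print root.attrib.get('status')  # prints: 1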
You need to use:

t = ET.fromstring(response.read())

instead of:

response.read()
t = ET.parse(response)
r = t.getroot()

Try:

resp = response.read()
t = ET.fromstring(resp)
r = t.getroot()

or:

t = ET.fromstring(response.read())
r = t.getroot()

Also, you should note that not all HTML is parsable as XML. If your request returns XHTML then you will be fine, but otherwise you will get a very similar error to what you are seeing.