Python regex - multiple search

Here is what I'm trying to accomplish:
Using Python mechanize I open a site.
If the content does not match my regex, I open another site.
Then I search the new page using another regex.
And the extracted code:
m = re.search(r'<td>(?P<alt>\d+)', response.read())
...
m = re.search(r'<td>(?P<alt>\w+)', response.read())
print m.group('alt')
I'm getting:
AttributeError: 'NoneType' object has no attribute 'group'
If I comment out the second search, everything is fine. I don't understand this behaviour.
Searching for this error led me to this Stack Overflow issue and to this one, but to no avail: neither of them solved my problem.
I don't care about efficiency here, so I don't use re.compile.

Assuming response is a file-like object, calling read() a second time may return an empty string, because the first call already consumed the stream. Read it once and reuse the result:
data = response.read()
m = re.search(r'<td>(?P<alt>\d+)', data)
m = re.search(r'<td>(?P<alt>\d+)', data)
print m.group('alt')
Why would you call search multiple times?
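A minimal sketch of both pitfalls, using StringIO as a stand-in for the mechanize response (an assumption; any file-like object behaves the same way), together with a guard against search() returning None:
from StringIO import StringIO
import re

response = StringIO('<td>42</td>')
data = response.read()       # first read() returns the whole body
print repr(response.read())  # '' -- the stream is already exhausted

m = re.search(r'<td>(?P<alt>\d+)', data)
if m:                        # re.search() returns None when nothing matches
    print m.group('alt')     # prints: 42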

Related

Splitting website URL to keywords, multiple splits

I am currently creating a tool which scans the URL of a website and returns the keywords as a list. For example, for google.com/images the tool should give out:
{"google", "images"}
I know how to filter out the .com part, but the problem is that I can't split the already split parts again, so I end up with only the results of the first split. How do I split these parts again?
First run split(".") -> {"google", "com/images"}
Second run split("/") -> {"google", "com", "images"}
because then I can filter out things like the .com part. I'm writing this in Python, and this is my code at the moment.
First the error:
" AttributeError: 'list' object has no attribute 'split' "
so the problem is that this is a list object and I can't split it again.
Now the code
url_content = input('Enter url: ')
url_split1 = url_content.split('.')
url_split2 = url_split1.split('/')
url_split3 = url_split2.split('-')
url_split4 = url_split3.split('&')
filtered = {'com', 'net'}
print(url_split4)
for key in url_split4:
    if key not in filtered:
        print(key)
You can use replace:
url_content = input('Enter url: ').replace('/','.').replace('-','.').replace('&','.')
and then split it once:
url_split1 = url_content.split('.')
You can either use Python's built-in regular expressions library as follows:
import re
re.split(r'\.|&|-|/', url_content)
or you may use the string replace method:
url_content.replace(".", "/").replace("&", "/").replace("-", "/").split("/")
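Putting the pieces together, a minimal sketch of the regex approach plus the question's filter step (the character class [.&/-] is equivalent to the alternation above; the filtered set is taken from the question):
import re

url_content = input('Enter url: ')
# Split on any of the separators: . & - /
parts = re.split(r'[.&/-]', url_content)
filtered = {'com', 'net'}
# Drop empty pieces and filtered parts such as the TLD
keywords = [part for part in parts if part and part not in filtered]
print(keywords)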

Python Selenium: Alternative to string formatting for lists?

I am trying to take the value from the input and put it into the browser.find_elements_by_xpath("//div[@class='v1Nh3 kIKUG _bz0w']") call. However, the string formatting doesn't work, since find_elements_by_xpath returns a list, hence it throws the AttributeError.
Does anyone know any alternatives to use with lists (possibly without iterating over each element)?
xpath_to_links = input('Enter the xpath to links: ')
posts = browser.find_elements_by_xpath("//div[@class='{}']").format(devops)
AttributeError: 'list' object has no attribute 'format'
It looks like the reason for the error is that you are calling format in the wrong place: instead of operating on the string "//div[@class='{}']", you call it on the list returned by find_elements_by_xpath. Could you please try replacing your code with one of the following lines?
posts = browser.find_elements_by_xpath("//div[@class='{}']".format(devops))
posts = browser.find_elements_by_xpath(f"//div[@class='{devops}']")
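Both lines build the XPath string first and only then hand it to find_elements_by_xpath. Note that the f-string variant requires Python 3.6 or newer; on older versions, use the str.format form.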

Python 2.7 replace all instances of NULL / NONE in complex JSON object

I have the following code..
.... rest api call >> response
rsp = response.json()
print json2html.convert(rsp)
which results in the following
error: Can't convert NULL!
I therefore started looking into ways to replace all the None / null values in my JSON response, but I'm having trouble: the JSON returned from the API is complex and nested many levels deep, and I don't know where the nulls will actually appear.
From what I can tell, I need to iterate over the dictionary recursively, check for any values that are None, and rebuild the object with those values replaced, but I don't really know where to start.
If you look at json2html's source, it seems you have a different problem, and the error message is not helping.
Try to use it like this:
print json2html.convert(json=rsp)
By the way, since I've already contributed a bit to that project, I've opened the following PR prompted by this question: https://github.com/softvar/json2html/pull/20
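For reference, if you ever do need the recursive replacement the question contemplates, a minimal Python 2.7 sketch (replacing None with an empty string is an assumption; pick whatever placeholder json2html should render):
def replace_none(obj, replacement=''):
    # Rebuild dicts and lists recursively, swapping None for the replacement
    if isinstance(obj, dict):
        return dict((k, replace_none(v, replacement)) for k, v in obj.items())
    if isinstance(obj, list):
        return [replace_none(v, replacement) for v in obj]
    return replacement if obj is None else obj

rsp = replace_none(response.json())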

Find Hyperlinks in Text using Python (Follow-up to another post)

Regarding (Extracting a URL in Python), I have a follow-up question. Note: I'm new to SO and Python, so feel free to correct me on etiquette.
I pulled the regex from the above post and this works fine for me:
myString = """ <iframe width="640" height="390" src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0" allowfullscreen></iframe> """
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
However what I really need to do is loop through a data set that I have previously retrieved from a database. So I did the below, which gives me a strange error, also below.
# Note: "data" here is actually a list of strings, not a data set
for pseudo_url in data:
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
Error:
Traceback (most recent call last):
File "find_and_email_bad_press_urls.py", line 136, in <module>
main()
File "find_and_email_bad_press_urls.py", line 14, in main
scrubbed_urls = extract_urls_from_raw_data(raw_url_data)
File "find_and_email_bad_press_urls.py", line 47, in extract_urls_from_raw_data
print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
AttributeError: 'NoneType' object has no attribute 'group'
When I Google this I find tons of irrelevant posts, so I was hoping SO could shed some light. My hunch is that the regex is blowing up on some null data, a special character, etc., but I don't know enough about Python to figure it out. Casting to a string didn't help either.
Any ideas or workarounds to power through this would be much appreciated!
Your regex is not finding a URL in every string in data. You should check that you have a match before calling group():
for pseudo_url in data:
    m = re.search("(?P<url>https?://[^\s]+)", pseudo_url)
    if m:
        print m.group("url")
You don't need the call to str() either if pseudo_url is already a string.
And as @Blender suggested in his comment, if data is really lines read from an HTML file, you may want to consider using Beautiful Soup instead of regex for this.
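A minimal sketch of that Beautiful Soup route, assuming the strings in data contain HTML and that the links live in anchor href or iframe src attributes (both assumptions here):
from bs4 import BeautifulSoup

for html in data:
    soup = BeautifulSoup(html, "html.parser")
    # Read URLs out of attributes instead of regex-matching raw markup
    for a in soup.find_all("a", href=True):
        print a["href"]
    for frame in soup.find_all("iframe", src=True):
        print frame["src"]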

Convert Google search results into JSON in Python 3.1

I am writing a Python program that feeds a search term to google using the google search API and downloads the first 10 results. I was able to do this in Python 2.6 as follows:
query = urllib.parse.urlencode({'q': 'searchterm', 'start': k}, doseq=False)
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' \
% (query)
results = urllib.urlopen(url)
resultsjson = json.loads(results.read())
betterResults += resultsjson["responseData"]["results"]
Google's search API returns the results as JSON, so I used the above code to download the results into a JSON object and parse them into a list (betterResults).
When I switched over to Python 3, my program began throwing exceptions. Apparently, in Python 2.6 the object returned by urlopen() is a file-like object that can be loaded into JSON. In Python 3.1, the object returned is an HTTPResponse object, which does have a read() method, as the json module requires, but read() returns bytes. I was therefore unable to access the information as I had in 2.6.
Is there any way to access the json returned by google? How can I get the results in Python 3 and be able to select which fields I want, as I was able to do with the json?
Thank you very much,
bsg
You'll need to decode the bytes object if you want to use it with json.loads:
resultjson = json.loads(results.read().decode())
The docs also suggest passing the encoding parameter to the loads function:
json.loads(results.read(), encoding=<encoding-type>)
I think Lennart has an explanation how to get the encoding-type.
The object returned by urlopen is file-like, so you are wrong there. But you use json.loads(), which expects a string; it is json.load() that expects a file-like object.
However, json.load() expects the result of the read() method to be a string, while of course the read() here returns bytes, so you need to decode from bytes to a string first.
So, something like this:
query = urllib.parse.urlencode({'q': 'searchterm', 'start': k}, doseq=False)
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
results = urllib.request.urlopen(url)
# The charset is the last piece of the Content-Type header, e.g. "...; charset=utf-8"
encoding = results.getheader('content-type').split('=')[-1]
resultsjson = json.loads(results.read().decode(encoding))
betterResults += resultsjson["responseData"]["results"]
Might work. (I didn't test it).
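If the server omits the charset= part, that split will misfire; a slightly more defensive variant uses get_content_charset() on the response headers (the UTF-8 fallback is an assumption):
encoding = results.headers.get_content_charset() or 'utf-8'
resultsjson = json.loads(results.read().decode(encoding))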
