Find Hyperlinks in Text using Python (Follow-up to another post)

In regards to (Extracting a URL in Python) I have a follow-up question. Note: I'm new to SO and Python, so feel free to correct me on etiquette.
I pulled the regex from the above post and this works fine for me:
myString = """ <iframe width="640" height="390" src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0" allowfullscreen></iframe> """
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
However what I really need to do is loop through a data set that I have previously retrieved from a database. So I did the below, which gives me a strange error, also below.
# Note: "data" here is actually a list of strings, not a data set
for pseudo_url in data:
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
Error:
Traceback (most recent call last):
  File "find_and_email_bad_press_urls.py", line 136, in <module>
    main()
  File "find_and_email_bad_press_urls.py", line 14, in main
    scrubbed_urls = extract_urls_from_raw_data(raw_url_data)
  File "find_and_email_bad_press_urls.py", line 47, in extract_urls_from_raw_data
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
AttributeError: 'NoneType' object has no attribute 'group'
When I Google this I find tons of irrelevant posts, so I was hoping SO could shed some light. My hunch is that the regex is blowing up on some null data, special character, etc., but I don't know enough about Python to figure it out. Casting to a string didn't help either.
Any ideas or workarounds to power through this would be much appreciated!

Your regex is not finding a url in every string in data. You should check to make sure you have a match before making the call to group:
import re

for pseudo_url in data:
    # re.search returns None when nothing matches, so test before calling group()
    m = re.search(r"(?P<url>https?://[^\s]+)", pseudo_url)
    if m:
        print(m.group("url"))
You don't need the call to str() either if pseudo_url is already a string.
And as @Blender suggested in his comment, if data is really lines read from an HTML file, you may want to consider using Beautiful Soup instead of regex for this.
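The same guard can be wrapped in a small helper. This is only a sketch: extract_urls and the sample rows are made up for illustration, and the pattern is slightly stricter than the one in the question (it also stops at quotes and angle brackets, an assumption that suits URLs sitting inside HTML attributes).

```python
import re

# Assumption: a URL ends at whitespace, a quote, or an angle bracket, so a URL
# inside an HTML attribute does not drag the closing quote along with it.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(rows):
    """Return the first URL found in each row, skipping rows without one."""
    urls = []
    for row in rows:
        if not isinstance(row, str):
            continue  # e.g. None coming back from a nullable database column
        match = URL_PATTERN.search(row)
        if match is not None:  # search() returns None when nothing matches
            urls.append(match.group(0))
    return urls


# Hypothetical rows standing in for the data pulled from the database.
sample_rows = [
    '<iframe src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0"></iframe>',
    "no link in this row",
    None,
    "see https://example.com/page for details",
]

print(extract_urls(sample_rows))
```

Rows without a URL are silently dropped here; if you need to know which rows failed, collect them in a second list instead of skipping them.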

Conversion of string to dictionary

I have captured a string from a REST get request and have placed it in a variable. The string is:
{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}
I am trying to convert it to a dictionary so I can iterate through it and capture the bucket name, size and object count. I have tried eval():
bucket_dict = eval(bucket_info)
but the program errors out with a:
Traceback (most recent call last):
  File "./test.py", line 83, in <module>
    for k,b in bucket_dict.items():
AttributeError: 'tuple' object has no attribute 'items'
When I print the value of bucket_dict I get:
({'name': 'na1mailboxarchive', 'objectCount': 49564710, 'dataBytes': 36253526882451}, {'name': 'na1mailboxarchive2', 'objectCount': 17616567, 'dataBytes': 13409204616615})
I think the foul up is the () at the beginning and the end of the dictionary. Nothing else I have tried works either.
Try this instead
import ast
string = '{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}'
result = ast.literal_eval(string)
print(result)
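Here result is a tuple of two dictionaries, because the string contains two comma-separated dict literals. Loop over the tuple and call .items() on each dictionary, rather than on the tuple itself.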
I got it figured out.
Firstly, the JSON returned by the REST API GET is badly formatted. I will take that up with the vendor. Secondly, I used some of the information from @PrashantKumar and @MisterMiyagi to suss out the issue I was having. In my original code I had loaded the list with:
bucket_info = [acct_string[acct_string_start+11:acct_len-4]]
The variable was capturing the leading "[" and trailing "]" as part of the string. Once I removed them, the list behaved correctly and I can now work with it. Thank you for the information and the trail markers.
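Since the payload is JSON, another option is the json module rather than eval(). The sketch below assumes the captured string always has the shape shown in the question (JSON objects separated by commas, without the enclosing brackets), so wrapping it in "[" and "]" turns it into a valid JSON array. The variable names are illustrative.

```python
import json

# The string exactly as shown in the question: two objects, no enclosing brackets.
bucket_info = (
    '{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},'
    '{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}'
)

# Assumption: adding the brackets back yields a well-formed JSON array.
buckets = json.loads("[" + bucket_info + "]")

# Each element is now a plain dict, so key lookups and .items() work.
for bucket in buckets:
    print(bucket["name"], bucket["objectCount"], bucket["dataBytes"])
```

Unlike eval(), json.loads() never executes the text it parses, which matters when the string comes from a remote service.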

Python: Issue with 'unexpected end of pattern'

I'm doing a small project on sentiment analysis using Twitter data. I have a sample CSV file containing the data, but before doing the sentiment analysis I have to clean it up, and there is one part where I am stuck. Here's the code.
tweets['source'][2] ## Source is an attribute in csv file containing values
Out[51]: u'Twitter for Android'
I want to clean the source data: I don't want the values to include web links and tags.
Here's the code for cleaning the source:
tweets['source_new'] = ''
for i in range(len(tweets['source'])):
    m = re.search('(?)(.*)', tweets['source'][i])
    try:
        tweets['source_new'][i]=m.group(0)
    except AttributeError:
        tweets['source_new'][i]=tweets['source'][i]
tweets['source_new'] = tweets['source_new'].str.replace('', ' ', case=False)
But when I executed the code, I got this error:
Traceback (most recent call last):
  File "<ipython-input-50-f92a7f05ad1d>", line 2, in <module>
    m = re.search('(?)(.*)', tweets['source'][i])
  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 146, in search
    return _compile(pattern, flags).search(string)
  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 251, in _compile
    raise error, v # invalid expression
error: unexpected end of pattern
I got an error saying 'error: unexpected end of pattern'. Can anyone help me with this? I can't find the problem in the code I am working on.
I should start by stating that using a regular expression for this task is not a good idea [1][2].
That being said, I see two ways to accomplish this, depending on your context:
If you don't really know what tags you are going to encounter
We can get the HTML text value doing the following:
# Replace any HTML tag with an empty string
value = re.sub(r'<[^>]*>', '', tweets['source'][i])
tweets['source_new'][i] = value
If you know what tags you are going to encounter (recommended)
This would be my recommended approach (if you really need to use regular expressions), as it is more explicit and less prone to any surprises.
# Replace any HTML "a" tag (opening or closing, any case) with an empty string
value = re.sub(r'(?i)</?a[^>]*>', '', tweets['source'][i])
tweets['source_new'][i] = value
Alternatively, you can take a look at How to remove HTML tags from a String on Python for other options and approaches.
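As a rough, self-contained illustration of the first approach, the sketch below strips every tag from a couple of made-up source values and also checks that the pattern from the question is rejected by the regex compiler. The helper names and the sample href are hypothetical; only the tag-stripping pattern comes from the answer above.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(value):
    """Remove anything that looks like an HTML tag from value."""
    return TAG_PATTERN.sub("", value)


def is_valid_pattern(pattern):
    """Return True if the re module can compile pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Hypothetical values resembling a tweet's source field.
sources = [
    '<a href="http://example.com/android" rel="nofollow">Twitter for Android</a>',
    "Twitter Web Client",
]

cleaned = [strip_tags(source) for source in sources]
print(cleaned)

# "(?" opens an extension group and must be followed by a valid extension
# character, so the pattern from the question fails to compile.
print(is_valid_pattern("(?)(.*)"))
```

Checking a pattern once with re.compile() before entering the loop surfaces this kind of mistake immediately, instead of on the first row.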
[1] Using a Regex to remove HTML tags from a string
[2] Using Regex to parse HTML

Type error that shows on console, but not on pythontutor.com

Ok, so I have a block of code that I am trying to debug, and I usually use Pythontutor.com to step through the code to see where it is going wrong. Problem is, the exact code works on the website, but not in my console.
row = []
row.append("Acid Arrow")
testList = ['Detect', 'Discern', 'Summon', 'Call', 'Binding']
nameList = row[0].split(' ')
print testList, nameList
a = list(set(testList) & set(nameList))
The error I am getting is this:
C:\Users\User\Dropbox\D&D\SpellBag>livingSpell.py
['Detect', 'Discern', 'Summon', 'Call', 'Binding'] ['Acid', 'Arrow']
Traceback (most recent call last):
  File "C:\Users\User\Dropbox\D&D\SpellBag\livingSpell.py", line 121, in <module>
    sb = spellBook(r'allSpells.csv')
  File "C:\Users\User\Dropbox\D&D\SpellBag\livingSpell.py", line 27, in __init__
    a = list(set(testList) & set(nameList))
TypeError: 'str' object is not callable
The above code works flawlessly on PythonTutor, but fails when I run it in the console. It is intended to check whether any word from the list appears in the spell name; if one does, the spell is skipped and the loop moves on. It should be returning an empty list, but instead I get the error.
The line that has the error is a = list(set(testList) & set(nameList)), and the error says "'str' object is not callable." This means the Python interpreter tried to call a function and found out it wasn't actually a function. This is the same error you would get if you typed "bad_code"(), since the string "bad_code" is not a function.
From the traceback alone it's impossible to say which of the two is affected, but either list or set has been rebound to a string and no longer refers to Python's built-in type. That snippet works fine by itself on pythontutor.com because the offending assignment happens somewhere earlier in your file (the traceback places the failing line at line 27 of livingSpell.py, inside __init__, so there is code before it that the snippet doesn't show). In fact, if you started a blank file containing only the snippet you posted here on Stack Overflow, it would run perfectly. Check for anything like list = ... or set = ... in your original source code.
It is a somewhat common convention in Python to avoid clashes with built-in names and keywords (list, set, or, if, with, while, etc.) by appending an underscore to the name. In this case, that would mean writing either list_ = ... or set_ = .... A better practice in general, though, is to come up with a specific name that describes the variable exactly. For example, you might use used_spell_list instead of list (just guessing; I have no idea how the name was overwritten).
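To make the failure concrete, here is a minimal sketch that reproduces the same TypeError by shadowing list inside a function. The function names and the "Acid Arrow spells" value are invented for the demonstration; the real assignment in livingSpell.py is unknown.

```python
test_list = ['Detect', 'Discern', 'Summon', 'Call', 'Binding']
name_list = "Acid Arrow".split(' ')


def overlap(words, names):
    """Return the words that appear in both lists (built-ins untouched)."""
    return list(set(words) & set(names))


def overlap_with_shadowed_list(words, names):
    """Same expression, but with the name list rebound to a string first."""
    list = "Acid Arrow spells"  # hypothetical earlier assignment that shadows the built-in
    try:
        return list(set(words) & set(names))
    except TypeError as exc:
        return str(exc)


print(overlap(test_list, name_list))                     # []
print(overlap_with_shadowed_list(test_list, name_list))  # 'str' object is not callable
```

Renaming the offending variable (for example to used_spell_list) restores the built-in without any other change to the expression.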

Trying to edit private dicom tag

I'm currently trying to edit a private dicom tag which is causing problems with a radiotherapy treatment, using pydicom in python. Bit of a python newbie here so bear with me.
The dicom file imports correctly into python; I've attached some of the output in the first image from the commands
ds = dicomio.read_file("xy.dcm")
print(ds)
This returns the following data:
[Image: pydicom output]
The highlighted tag is the one I need to edit.
When trying something like
ds[0x10,0x10].value
This gives the correct output:
'SABR Spine'
However, trying something along the lines of
ds[3249,1000]
or
ds[3249,1000].value
returns the following output:
Traceback (most recent call last):
  File "<pyshell#64>", line 1, in <module>
    ds[3249,1000].value
  File "C:\Users\...\dataset.py", line 317, in __getitem__
    data_elem = dict.__getitem__(self, tag)
KeyError: (0cb1, 03e8)
If I try accessing [3249,1010] via the same method, it returns a KeyError of (0cb1, 03f2).
I have tried adding the tag to the _dicom_dict.py file, as highlighted in the second image:
[Image: end of _dicom_dict.py]
Have I done this right? I'm not even sure if I'm accessing the tags correctly - using
ds[300a,0070]
gives me 'SyntaxError: invalid syntax' as the output, for example, even though this is present in the file as fraction group sequence. I have also been made aware that [3249,1000] is connected to [3249,1010] somehow, and apparently since they are proprietary tags, they cannot be edited in Matlab, however it was suggested they could be edited in python for some reason.
Thanks a lot
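DICOM tags are written in hexadecimal, but 3249 and 1000 in your code are decimal integer literals. The KeyError shows those same numbers in hexadecimal, (0cb1, 03e8), which is not the tag you meant.
You could try:
ds[0x3249,0x1000]
Writing the group and element as hexadecimal literals makes the key match the tag as it is displayed in the printed dataset.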
You can apparently access them directly as strings:
ds['3249', '1000']
However, your issue is that you are trying to access a data element that is nested several layers deep. Based on your output at the top, I would suggest trying:
first_list_item = ds['300a', '0070'][0]
for item in first_list_item['300c', '0004']:
    print(item['3249','1000'])
Essentially, a data element from the top level Dataset object can be either a list or another Dataset object. Makes parsing the data a little harder, but probably unavoidable.
Have a look at this for more info.
As Andrew Guy notes in his last comment, you need to get the first sequence item for 300a,0070. Then get the second sequence item from the 300c,0004 sequence in that item. In that sequence item, you should be able to get the 3249,1000 attribute.
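The KeyError values in the question can be reproduced without pydicom at all. The sketch below uses a made-up helper, format_tag, that formats a (group, element) pair the way the error message prints it, to show the difference between decimal and hexadecimal literals.

```python
def format_tag(group, element):
    """Format a (group, element) pair as four-digit hexadecimal, like the KeyError does."""
    return "({:04x}, {:04x})".format(group, element)


# Decimal literals, as typed in the question: not the intended tag.
print(format_tag(3249, 1000))      # (0cb1, 03e8)
print(format_tag(3249, 1010))      # (0cb1, 03f2)

# Hexadecimal literals: the tag as it appears in the dataset listing.
print(format_tag(0x3249, 0x1000))  # (3249, 1000)
```

This also explains the SyntaxError for ds[300a,0070]: without the 0x prefix, 300a is not a valid Python number.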

In interpreting Jinja2 templates, why does code object have a null filename?

I'm trying to pick apart Jinja2's TemplateSyntaxError to see why it doesn't tell me the exact file name in which a syntax error is found.
I'm actually introducing this error in a sub-template on purpose to try to better understand this templating system. Upon getting the syntax error, I see File "<unknown>", line 4, in template in my Flask preview server. The line number is correct, but the debugger appears confused about the file from which the problem originated, which is very annoying. I'm uncertain as of yet what the name of the code object, template represents.
As someone has pointed out, the <unknown> is used here as a throwaway when the filename value of the code object is null. After reading through a few references for code objects, I've not yet had luck wrapping my head around this weirdness.
Someone appears to get a similar error in this github issue.
May it be that it's just an arbitrary value provided by Jinja2 to some dynamically generated code?
>>> code = compile('print("test")', '<unknown>', 'exec')
>>> code
<code object <module> at 0x1064b6e30, file "<unknown>", line 1>
>>> exec code
test
>>> code.co_filename
'<unknown>'
And there it seems to be indeed in jinja2/debug.py - translate_syntax_error.
The explanation is rather straightforward. The whole machinery, from flask.render_template_string through jinja2.Environment.from_string down to the Jinja2 exception handlers, does not keep track of the origin of the template string passed in.
While it would be possible to tunnel some more information top down, what would be the benefit of it anyway? In case of an exception you get a complete stack trace with appropriate local information available on each level of it, including from where you passed in the string and the line number in the template string that erred, e.g.:
  File "jinja2-uknown-filename.py", line 7, in index
    return flask.render_template_string("this is a \n \n {% test %}")
...
  File "<unknown>", line 3, in template
TemplateSyntaxError: Encountered unknown tag 'test'.
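The role of that filename can be seen with the built-in compile() alone, independent of Jinja2. In this sketch the second argument is just a label: Python stores it on the code object and on any SyntaxError, without checking that such a file exists. The path "templates/index.html" is a hypothetical name chosen for illustration.

```python
# A label that looks like a real file: it is stored as-is on the code object.
code = compile("value = 1", "templates/index.html", "exec")
print(code.co_filename)

# When the caller has no meaningful name to pass, a placeholder is all that
# later error reports can show.
try:
    compile("value = = 1", "<unknown>", "exec")
except SyntaxError as exc:
    reported_filename = exc.filename
    reported_line = exc.lineno

print(reported_filename, reported_line)
```

So a "<unknown>" entry in a traceback only means that whoever compiled the source did not supply a name for it, which is what one would expect for a template handed over as a bare string.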
