Python: Issue with 'unexpected end of pattern' - python

I'm doing a small project on sentiment analysis using Twitter data. I have a sample CSV file containing the data, but before doing the sentiment analysis I have to clean it up. There is one part where I am stuck. Here's the code.
tweets['source'][2] ## Source is an attribute in csv file containing values
Out[51]: u'Twitter for Android'
I want to clean the source data. I don't want the values to be shown with web links and tags.
Here's the code for cleaning the source:
tweets['source_new'] = ''
for i in range(len(tweets['source'])):
    m = re.search('(?)(.*)', tweets['source'][i])
    try:
        tweets['source_new'][i]=m.group(0)
    except AttributeError:
        tweets['source_new'][i]=tweets['source'][i]
tweets['source_new'] = tweets['source_new'].str.replace('', ' ', case=False)
But when I executed the code, I got this error:
Traceback (most recent call last):
  File "<ipython-input-50-f92a7f05ad1d>", line 2, in <module>
    m = re.search('(?)(.*)', tweets['source'][i])
  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 146, in search
    return _compile(pattern, flags).search(string)
  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 251, in _compile
    raise error, v # invalid expression
error: unexpected end of pattern
The error says 'error: unexpected end of pattern'. Can anyone help me with this? I can't find the problem in the code I am working on.

I should start by stating that using a regular expression for this task is not a good idea [1][2].
That being said, I see two ways to accomplish this, depending on your context:
If you don't really know what tags you are going to encounter
We can get the HTML text value doing the following:
# Replace any HTML tag with empty string
value = re.sub(r'<[^>]*>', '', tweets['source'][i])
tweets['source_new'][i] = value  # assign to row i only, inside the loop over i
If you know what tags you are going to encounter (recommended)
This would be my recommended approach (if you really need to use regular expressions), as it is more explicit and less prone to any surprises.
# Replace any HTML "a" tag with empty string
value = re.sub(r'(?i)</?a[^>]*>', '', tweets['source'][i])
tweets['source_new'][i] = value  # assign to row i only, inside the loop over i
Alternatively, you can take a look at How to remove HTML tags from a String on Python for other options and approaches.
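For completeness, here is a minimal sketch of the first approach as a standalone script, written for Python 3. The sample string and the helper name strip_tags are assumptions made for illustration, since the question does not show the value of tweets['source'] with its markup. The last lines also show why the original pattern fails: "(?" opens a special group construct and has to be followed by a valid extension such as ":" or "P<name>", so "(?)" is rejected when the pattern is compiled. The exact error message differs between Python versions.

```python
import re

# Hypothetical sample value; assumed to resemble the real "source" field.
sample_source = ('<a href="http://twitter.com/download/android" '
                 'rel="nofollow">Twitter for Android</a>')

# Matches any HTML tag: "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Return text with every HTML tag removed."""
    return TAG_RE.sub('', text)


print(strip_tags(sample_source))

# The pattern from the question is invalid, so compiling it raises re.error.
pattern_error = None
try:
    re.compile('(?)(.*)')
except re.error as exc:
    pattern_error = exc
    print('invalid pattern:', exc)
```

With pandas, the same pattern can probably be applied to the whole column at once with tweets['source'].str.replace(r'<[^>]*>', '', regex=True), which avoids the explicit loop.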
[1] Using a Regex to remove HTML tags from a string
[2] Using Regex to parse HTML

Related

Conversion of string to dictionary

I have captured a string from a REST get request and have placed it in a variable. The string is:
{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}
I am trying to convert it to a dictionary so I can iterate through it and capture the bucket name, size and object count. I have tried eval():
bucket_dict = eval(bucket_info)
but the program errors out with a:
Traceback (most recent call last):
  File "./test.py", line 83, in <module>
    for k,b in bucket_dict.items():
AttributeError: 'tuple' object has no attribute 'items'
When I print the value of bucket_dict I get:
({'name': 'na1mailboxarchive', 'objectCount': 49564710, 'dataBytes': 36253526882451}, {'name': 'na1mailboxarchive2', 'objectCount': 17616567, 'dataBytes': 13409204616615})
I think the foul up is the () at the beginning and the end of the dictionary. Nothing else I have tried works either.
Try this instead
import ast
string = '{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}'
result = ast.literal_eval(string)
print(result)
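result is returned as a tuple of two dictionaries (the string contains two comma-separated objects), so loop over the tuple instead of calling .items() on it.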
I got it figured out.
Firstly, the JSON returned from the REST API GET is badly formatted. I will take that up with the vendor. Secondly, I used some of the information from #PrashantKumar and #MisterMiyagi to suss out the issue I was having. In my original code I had loaded the list with:
bucket_info = [acct_string[acct_string_start+11:acct_len-4]]
The variable was capturing the leading "[" and trailing "]" as a part of the string. Once I removed them then the list behaved correctly and I now can work with it. Thank you for the information and the trail markers.
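Since the payload is JSON, another option is the standard json module instead of eval() or ast.literal_eval(). The captured string is a comma-separated run of objects rather than a valid JSON document, so this sketch wraps it in square brackets first. It assumes bucket_info holds exactly the string shown in the question.

```python
import json

# The string as captured in the question: two JSON objects separated by a comma.
bucket_info = (
    '{"name":"na1mailboxarchive","objectCount":49564710,"dataBytes":36253526882451},'
    '{"name":"na1mailboxarchive2","objectCount":17616567,"dataBytes":13409204616615}'
)

# Wrapping the text in [ ] turns it into a valid JSON array of objects.
buckets = json.loads('[' + bucket_info + ']')

# Each element is now a plain dict.
for bucket in buckets:
    print(bucket['name'], bucket['objectCount'], bucket['dataBytes'])
```

Unlike eval(), json.loads() only parses data and never executes code, which matters when the text comes from a remote service.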

How can I find this pattern in a text document?

So I'm practicing some regex in Python, and essentially I want to look through a log of transaction numbers and see if any of them return an error such as Error in phone Activation.
I was successful in searching in a dictionary for something that starts with Error and then ends with Activation, so that if it was tablet, watch, etc , it would still find the error. However, as a bulk text file, it will not successfully find the pattern.
So the code I used to find it in a dictionary was such that the dictionary key was a transaction number and the error (or lack thereof) was the value:
for i in Transaction_Log:
    if bool(re.search("^Error.* Activation$", Transaction_Log[i])):
        print("Found requested error in transaction number " + i)
        error_count += 1
This works; however, the same search can't find anything in a text file laid out like this:
Transnum: 20190510001 error: Error in phone Activation,
Transnum: 20190510002 error: none,
Transnum: 20190510003 error: Error in tablet Activation,
Ideally, it can find the type of errors, and when successful I can make a counter to see how many there are, however my boolean statement is not True when searching this way through a text file.
Searching for just the word Error does work though.
With the help of #CAustin, I figured out that I was searching for the wrong pattern due to the line not starting with error and the ending of the line also having a comma at the end. By removing both anchors, I was able to find what I needed to find in this example, so for anyone else looking for something similar it was this...
for line in testingDoc:
    if bool(re.search("Error.* Activation", line)):
        print("found error in transaction")
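As a rough illustration of the unanchored search, the sketch below runs it over the three sample lines from the question and also captures the transaction number and device type. The capture groups and the variable names are additions for illustration and are not part of the original code.

```python
import re

# Sample log text, copied from the lines shown in the question.
log_text = """Transnum: 20190510001 error: Error in phone Activation,
Transnum: 20190510002 error: none,
Transnum: 20190510003 error: Error in tablet Activation,
"""

# No ^ or $ anchors: each line starts with "Transnum" and ends with a comma,
# so an anchored "^Error.* Activation$" can never match a whole line.
pattern = re.compile(r"Transnum: (\d+) error: Error in (\w+) Activation")

found = []
for line in log_text.splitlines():
    m = pattern.search(line)
    if m:
        # group(1) is the transaction number, group(2) the device type.
        found.append((m.group(1), m.group(2)))

for transnum, device in found:
    print("Found " + device + " activation error in transaction " + transnum)
print("error count:", len(found))
```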

Trying to edit private dicom tag

I'm currently trying to edit a private dicom tag which is causing problems with a radiotherapy treatment, using pydicom in python. Bit of a python newbie here so bear with me.
The dicom file imports correctly into python; I've attached some of the output in the first image from the commands
ds = dicomio.read_file("xy.dcm")
print(ds)
This returns the following data:
[image: pydicom output]
The highlighted tag is the one I need to edit.
When trying something like
ds[0x10,0x10].value
This gives the correct output:
'SABR Spine'
However, trying something along the lines of
ds[3249,1000]
or
ds[3249,1000].value
returns the following output:
Traceback (most recent call last):
  File "<pyshell#64>", line 1, in <module>
    ds[3249,1000].value
  File "C:\Users\...\dataset.py", line 317, in __getitem__
    data_elem = dict.__getitem__(self, tag)
KeyError: (0cb1, 03e8)
If I try accessing [3249,1010] via the same method, it returns a KeyError of (0cb1, 03f2).
I have tried adding the tag to the _dicom_dict.py file, as highlighted in the second image:
[image: end of _dicom_dict.py]
Have I done this right? I'm not even sure if I'm accessing the tags correctly - using
ds[300a,0070]
gives me 'SyntaxError: invalid syntax' as the output, for example, even though this is present in the file as fraction group sequence. I have also been made aware that [3249,1000] is connected to [3249,1010] somehow, and apparently since they are proprietary tags, they cannot be edited in Matlab, however it was suggested they could be edited in python for some reason.
Thanks a lot
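It looks like the numbers you are passing are being read as decimal integers, while DICOM tags are written in hexadecimal: 3249 in decimal is 0x0cb1 and 1000 is 0x03e8, which is exactly the tag the KeyError reports.
You could try:
ds[0x3249,0x1000]
Writing the group and element as hexadecimal literals makes the lookup use the tag you intended.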
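The decimal versus hexadecimal mix-up can be checked with plain Python, without pydicom. In this small sketch, format_tag is a hypothetical helper (not part of pydicom) that prints a group/element pair in the same four-digit hexadecimal style as the KeyError above.

```python
def format_tag(group, element):
    """Format a (group, element) pair as four-digit hexadecimal values."""
    return "({:04x}, {:04x})".format(group, element)


# Decimal literals: these are the values the failing lookups actually used.
print(format_tag(3249, 1000))
print(format_tag(3249, 1010))

# Hexadecimal literals: these correspond to the tag shown in the file listing.
print(format_tag(0x3249, 0x1000))
```

This also explains the SyntaxError for ds[300a,0070]: 300a is not a valid Python number unless it is written as 0x300a.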
You can apparently access them directly as strings:
ds['3249', '1000']
However, your issue is that you are trying to access a data element that is nested several layers deep. Based on your output at the top, I would suggest trying:
first_list_item = ds['300a', '0070'][0]
for item in first_list_item['300c', '0004']:
    print(item['3249','1000'])
Essentially, a data element from the top level Dataset object can be either a list or another Dataset object. Makes parsing the data a little harder, but probably unavoidable.
Have a look at this for more info.
As Andrew Guy notes in his last comment, you need to get the first sequence item for 300a,0070. Then get the second sequence item from the 300c,0004 sequence in that item. In that sequence item, you should be able to get the 3249,1000 attribute.

Find Hyperlinks in Text using Python (Follow-up to another post)

In regards to (Extracting a URL in Python) I have a follow-up question. Note: I'm new to SO and Python, so feel free to correct me on etiquette.
I pulled the regex from the above post and this works fine for me:
myString = """ <iframe width="640" height="390" src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0" allowfullscreen></iframe> """
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
However what I really need to do is loop through a data set that I have previously retrieved from a database. So I did the below, which gives me a strange error, also below.
# Note: "data" here is actually a list of strings, not a data set
for pseudo_url in data:
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
Error:
Traceback (most recent call last):
  File "find_and_email_bad_press_urls.py", line 136, in <module>
    main()
  File "find_and_email_bad_press_urls.py", line 14, in main
    scrubbed_urls = extract_urls_from_raw_data(raw_url_data)
  File "find_and_email_bad_press_urls.py", line 47, in extract_urls_from_raw_data
    print re.search("(?P<url>https?://[^\s]+)", str(pseudo_url)).group("url")
AttributeError: 'NoneType' object has no attribute 'group'
When I Google this I find tons of irrelevant posts, so I was hoping SO could shed some light. My hunch is that the regex is blowing up on some null data, special character, etc., but I don't know enough about Python to figure it out. Casting to a string didn't help either.
Any ideas or workarounds to power through this would be much appreciated!
Your regex is not finding a url in every string in data. You should check to make sure you have a match before making the call to group:
for pseudo_url in data:
    # re.search returns None when there is no match, so test before calling group()
    m = re.search(r"(?P<url>https?://[^\s]+)", pseudo_url)
    if m:
        print(m.group("url"))
You don't need the call to str() either if pseudo_url is already a string.
And as #Blender suggested in his comment, if data is really lines read from an HTML file, you may want to consider using Beautiful Soup instead of regex for this.
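A self-contained version of this check might look like the sketch below (Python 3). The sample list and the function name extract_urls are made up for illustration. The character class is also tightened to [^\s"] so that the closing quote of an HTML attribute is not captured as part of the URL; that is a small change from the pattern in the question.

```python
import re

# Same idea as the original pattern, but the URL also stops at a double quote.
URL_RE = re.compile(r'(?P<url>https?://[^\s"]+)')


def extract_urls(items):
    """Return the first URL found in each string, skipping strings without one."""
    urls = []
    for item in items:
        m = URL_RE.search(item)
        if m is not None:  # re.search returns None when nothing matches
            urls.append(m.group("url"))
    return urls


# Hypothetical sample data: one string with a URL and one without.
data = [
    '<iframe width="640" height="390" '
    'src="http://www.youtube.com/embed/24WIANESD7k?rel=0" frameborder="0"></iframe>',
    'a row from the database with no link in it',
]

print(extract_urls(data))
```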

Python regex - multiple search

Here is what I'm trying to accomplish:
Using python mechanize I open a site
If content does not match my regex I open another site
I perform searching using another regex
And the extracted code:
m = re.search('<td>(?P<alt>\d+)', response.read())
...
m = re.search('<td>(?P<alt>\w+)', response.read())
print m.group('alt')
I'm getting:
AttributeError: 'NoneType' object has no attribute 'group'
If I uncomment the second search everything is fine. I don't understand this behaviour.
Such an error redirected me to this stackoverflow issue and to this - but to no avail - neither of these solved my problem.
I don't care about efficiency here so I don't use compile.
Assuming response is a file-like object, calling read() a second time will likely return an empty string, because the first call already consumed the content. Read it once and reuse the result:
data = response.read()
m = re.search(r'<td>(?P<alt>\d\d*)', data)
m = re.search(r'<td>(?P<alt>\d\d*)', data)
print(m.group('alt'))
Why would you call search multiple times?
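The effect is easy to reproduce with an in-memory file object. This sketch uses io.StringIO as a stand-in for the mechanize response (an assumption, since the real response object is not shown), and the HTML snippet is made up for illustration.

```python
import io
import re

# Stand-in for the HTTP response: any file-like object behaves the same way.
response = io.StringIO("<tr><td>42</td><td>abc</td></tr>")

first_read = response.read()   # returns the whole content
second_read = response.read()  # the stream is exhausted, so this is empty
print(repr(second_read))

# Searching the empty second read can never match, hence the None result.
no_match = re.search(r"<td>(?P<alt>\w+)", second_read)
print(no_match)

# Reading once and reusing the text lets both patterns be tried safely.
m = re.search(r"<td>(?P<alt>\d+)", first_read)
if m is None:
    m = re.search(r"<td>(?P<alt>\w+)", first_read)
if m is not None:
    print(m.group("alt"))
```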
