Searching two things - python

I am using re and would like to extract the text between two strings. My problem is that the text I want may end with either a newline (\n) or another string, and I want the match to succeed in both cases. The reason is that some of my documents were created incorrectly and lack the newline, so I have to take the text up to the newline and then check whether it contains the corresponding string.
I have tried this:
recipients = re.search('Recipients:(.*)\n', body)
reciBody = re.search('(.*)Notes', recipients.group(1).encode("utf-8"))
Later on I am trying to split this by using:
recipientsList = reciBody.group(1).encode("utf-8").split(',')
The problem is I am getting this error if there is no corresponding string:
recipientsList = reciBody.group(1).encode("utf-8").split(',')
AttributeError: 'NoneType' object has no attribute 'group'
What other ways can I use? Or how can I handle this error?

I'm assuming nothing needs to be done if the group isn't found. Simplest is to just skip the error.
try:
    recipientsList = reciBody.group(1).encode("utf-8").split(',')
except AttributeError:
    pass  # nothing needs to be done
Instead of pass you may need to set recipientsList to something else
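Another option, sketched here under some assumptions, is to fold both terminators into one pattern and test for None before calling group. The function name extract_recipients is hypothetical, and the sketch assumes Python 3 str input (so no encode call) and that the list ends at the first "Notes", newline, or end of text.

```python
import re

def extract_recipients(body):
    """Return the recipients as a list, or [] when no match is found."""
    # Non-greedy capture stops at whichever comes first:
    # the word "Notes", a newline, or the end of the text.
    match = re.search(r'Recipients:(.*?)(?:Notes|\n|$)', body)
    if match is None:
        # re.search returns None on failure, which is what caused
        # the AttributeError when .group() was called on it.
        return []
    return [name.strip() for name in match.group(1).split(',') if name.strip()]
```

With this shape the missing-newline documents and the well-formed ones go through the same code path, and the caller always gets a list back.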

Related

JSON - How to return the location of an error?

When I try to read a JSON file into Python using Python's built in package json, I get back a JSONDecodeError that looks something like this:
JSONDecodeError: Expecting value: line 1 column 233451 (char 233450)
Is there any way to return the location of the error (in this case, 233450)? What I want is something like:
try:
    json.loads(my_json)
except:
    error_loc = json.get_error(my_json)
where error_loc = 233450 - or even just the entire error message as a string, I can extract the number myself.
Context: I'm trying to load some very poorly formatted (webscraped) JSONs into Python. Many of the errors are related to the fact that the text contained in the JSONs contains quotes, curly brackets, and other characters that the json reader interprets as formatting - e.g.
{"category": "this text contains "quotes", which messes with the json reader",
"category2": "here's some more text containing "quotes" - and also {brackets}"},
{"category3": "just for fun, here's some text with {"brackets and quotes"} in conjunction"}
I managed to eliminate the majority of these situations using regex, but now there's a small handful of cases where I accidentally replaced necessary quotes. Looking through the JSONs manually, I don't actually think it's possible to catch all the bad formatting situations without replacing at least one necessary character. And in almost every situation, the issue is just one missing character, normally towards the very end...
If I could return the location of the error, I could just revert the replaced character and try again.
I feel like there has to be a way to do this, but I don't think I'm using the correct terms to search for it.
You can catch the error as the variable error with except json.decoder.JSONDecodeError as error. The JSONDecodeError object then has an attribute pos, which gives the index in the string at which decoding failed. lineno and colno give the line number and column number, as you would see them when opening the file in an editor.
import json

try:
    json.loads(string_with_json)
except json.decoder.JSONDecodeError as error:
    error_pos = error.pos        # index into the string
    error_lineno = error.lineno  # line number
    error_colno = error.colno    # column number
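As a small self-contained sketch of the same idea, the helper below returns the failure position together with the message, or None when the text parses. The name find_json_error is hypothetical; pos and msg are attributes of json.JSONDecodeError.

```python
import json

def find_json_error(text):
    """Return (pos, msg) for the first decoding error, or None if text is valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        # pos is the index into text where decoding failed;
        # msg is the unformatted message, e.g. "Expecting value".
        return error.pos, error.msg
    return None
```

The returned index could then be used to look at the characters around the failure, for example text[pos - 20:pos + 20], before deciding which replacement to revert.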

Is there a way to strip the end of a string until a certain character is reached?

I'm working on a side project for myself and have stumbled on an issue that I'm not sure how to solve. I have a URL, for argument's sake let's say https://stackoverflow.com/xyz/abc. I'm attempting to strip the end of the URL so that I am only left with https://stackoverflow.com/xyz/.
Initially I tried to use the strip function and specify a length/position to remove up to, but realized that the other URLs I'm working with are not the same length (i.e. URL 1 = /xyz/abc, URL 2 = /xyz/abcd).
Is there any advice for achieving this? I looked into using the regular expression operations in Python, but was unsure how to apply them to this use case. Ideally I would like to write a function that starts from the end of the string and strips away all characters until the first '/' is reached. Any advice would be appreciated.
Thanks
Why not just use rfind, which starts from the end?
>>> string = 'https://stackoverflow.com/xyz/abc'
>>> string = string[:string.rfind('/')+1]
>>> print(string)
https://stackoverflow.com/xyz/
And if you don't want the character either (the / in this case), simply remove the +1.
Keep in mind however that this only works if the string actually contains the character you are looking for.
If you want to protect against this, you will have to use the following:
string = 'https://stackoverflow.com/xyz/abc'
idx = string.rfind('/')
if idx != -1:
    string = string[:idx+1]
Unless, obviously, you do want to end up with an empty string in case the character is not found.
Then the first example works just fine.
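Since the question asks for a function, here is one possible sketch using str.rpartition, which splits at the last occurrence of the separator and reports whether it was found. The name strip_last_segment is hypothetical, and returning the input unchanged when there is no separator is an assumption about the desired behaviour.

```python
def strip_last_segment(url, sep='/'):
    """Drop everything after the last sep, keeping the sep itself."""
    head, found, _tail = url.rpartition(sep)
    if not found:
        # No separator in the string: leave it untouched
        # (an assumption; returning '' would also be defensible).
        return url
    return head + found
```

rpartition avoids the -1 check because the middle element of its result is empty when the separator is missing.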
If you don't want to use regex, you can combine split() and join().
lol = 'https://stackoverflow.com/xyz/abc'
splt = lol.split('/')[:-1]
'/'.join(splt)
Output (note that the trailing slash is dropped):
'https://stackoverflow.com/xyz'

In python, is there a way to remove all text following the last instance of a delimiter?

I'm trying to create a random text generator in Python. I'm using Markovify to produce the required text, with a filter so that it does not start generating text unless the first word is capitalized. To prevent it from ending "mid sentence", I want the program to search from the back of the output to the front and remove all text after the last (for instance) period. I want it to ignore all other instances of the selected delimiter(s). I have no idea how many instances of the delimiter will occur in the generated text, nor do I have any way to know in advance.
While looking into this I found rsplit(), and tried using that, but ran into a problem.
tweet = buff.rsplit('.')[-1]
The above is what I tried first, and I thought it was working until I noticed that all of the lines printed with that had only a single sentence in them. Never more than that. The problem seems to be that the text is being dumped into an array of strings, and the [-1] bit is calling just one entry from that array.
tweet = buff.rsplit('.') - buff.rsplit('.')[-1]
Next I tried the above. The thinking was that it would remove the last entry in the array, and then I could just print what remained. It... didn't go to plan. I get an "unsupported operand type" error, specifically tied to the attempt to subtract. Not sure what I'm missing at this point.
.rsplit has a second, optional argument, maxsplit, i.e. the maximum number of splits to perform. You could use it in the following way:
txt = 'some.text.with.dots'
all_but_last = txt.rsplit('.', 1)[0]
print(all_but_last)
Output:
some.text.with
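Note that rsplit drops the delimiter itself. For the generated-sentence use case, a variant that keeps the final period and accepts several delimiters might look like the sketch below. The name trim_to_last_delimiter and the default delimiter set are assumptions, as is the choice to return the text unchanged when no delimiter occurs.

```python
def trim_to_last_delimiter(text, delimiters=".!?"):
    """Cut text after the last delimiter, keeping that delimiter."""
    # rfind returns -1 for a delimiter that never occurs, so the
    # maximum is -1 only when none of them is present.
    last = max(text.rfind(d) for d in delimiters)
    if last == -1:
        return text
    return text[:last + 1]
```

Earlier delimiters are ignored, so the number of sentences in the generated text does not matter.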

Replace Strings in Python 3

I'm trying to replace a string in python like this:
private_ips.replace("{",'')
The error I get back is this:
Traceback (most recent call last):
File ".\aws_ec2_list_instances.py", line 39, in <module>
private_ips.replace("{",'')
AttributeError: 'set' object has no attribute 'replace'
What am I doing wrong?
private_ips is a set object. You can use replace only on strings.
To represent the set as a string, use this code snippet:
private_ips_as_string = '{' + ', '.join(str(elem) for elem in private_ips) + '}'
Let's back up a little ...
tree = objectpath.Tree(instance)
private_ips = set(tree.execute('$..PrivateIpAddress'))
Your initial problem is that you specifically converted the return value into a set. If you don't want a set, then don't convert it to one, or convert it back to something more useful to you. Since you've failed to provide a Minimal, complete, verifiable example, we can't fix everything, but I'll use an intuitive leap here ...
tree.execute returns a list of IP addresses.
You're using set to remove duplicate addresses in a list.
If so, you're fine up to this point. To get the address as a string, I think you want to iterate through the items in the set:
for ip_addr in private_ips:
    # Handle ip_addr, a single IP address seen as a str.
    print(ip_addr)  # for example
If you need the addresses lined up, you can always convert back to a list with
private_ips = list(private_ips)
... and if you know there is exactly one addr that you want as a string, you can grab it in one step with
single_ip = list(private_ips)[0]
... or just grab it directly from your function's return value:
single_ip = tree.execute('$..PrivateIpAddress')[0]
To explain what did happen to you:
You called a function that returns a sequence of some sort.
You converted that sequence to a set, a common technique for removing duplicates.
You tried to remove braces from the set, as if it were a string.
The problem is that a set does not have braces. Those braces are a notational convenience; they exist only in the __repr__ (output string representation) of the data type, not in the set itself. You cannot manipulate that representation. This would be something like trying to remove the up-vote and down-vote arrows from this question by editing the question text: you can't do it, because those are part of the delivery framework.
Similarly, you cannot remove the quotation marks from the ends of a string, because they're not part of the string.
To get rid of the braces, you quit using a set: reach inside and pull out the contents as an individual element.
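A short sketch of the point about braces, using made-up sample addresses in place of the real tree.execute result: the braces appear only when the set is converted to its string representation, and joining the elements yields a plain string without them.

```python
# Hypothetical sample data standing in for the real query result.
private_ips = {"10.0.0.5", "10.0.0.7"}

# The braces exist only in the string representation of the set.
as_repr = str(private_ips)

# Joining the elements gives a brace-free string; sorted() is used
# because sets have no guaranteed order.
joined = ", ".join(sorted(private_ips))
print(joined)
```

If exactly one address is expected, next(iter(private_ips)) pulls it out without building an intermediate list.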

Medium.com Invalid Json?

I am trying to fetch the latest posts from Medium.com so for example, I go here
https://medium.com/circle-blog/latest?format=json
But when I copy and paste that entire JSON into JSONEditorOnline.org, I get an error saying
Error: Parse error on line 1:
])}while(1);</x>{"su
^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
I realize the error is because of the random stuff in the front
])}while(1);</x>
So how would I remove that using Python?
After I remove, I want to dump it into a JSON file
with open('medium.json', 'w') as json1:
    json1.write(json.dumps(JSONWITHWHILE(1)REMOVED))
How would I go about doing this?
I wouldn't bother with it since it's obviously not a valid JSON but if you need it, you can try to locate the first opening curly-bracket and simply remove everything before it:
valid_json = broken_json[broken_json.find('{'):]
Explanation:
broken_json.find('{') returns the position (index) of the first occurrence of the character { in the string broken_json
broken_json[X:] - is a string slice, it returns the substring of broken_json starting on the position X
An advantage over LeKhan's solution is that if the response ever becomes valid JSON, your code will still work with this fix in place. Also, his solution will return broken JSON if the substring </x> appears inside any of the fields (which may be valid).
Note: it's probably not a bug but it's there intentionally for some reason. There's for example the module Medium JSON feed which handles it very similarly - it's also stripping everything before the first opening curly-bracket.
According to this article, it's there to prevent "JSON hacking", whatever it means.
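Putting the slice together with parsing, a minimal sketch could look like this. The name strip_json_prefix is hypothetical, and the sketch assumes the payload is a JSON object, so the first { marks its start.

```python
import json

def strip_json_prefix(raw):
    """Parse a response whose JSON object is preceded by a guard prefix."""
    start = raw.find('{')
    if start == -1:
        raise ValueError("no JSON object found in response")
    # Everything before the first '{' is discarded, then the rest is parsed.
    return json.loads(raw[start:])

data = strip_json_prefix('])}while(1);</x>{"success": true}')
print(data)
```

The parsed result can then be written out with json.dump(data, f) inside the with open('medium.json', 'w') block from the question.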
You can try splitting that string by </x> and then get the second index:
clean_json = raw_json.split('</x>')[1]
Medium doesn't provide JSON objects, but it does provide RSS feeds, so you could convert the RSS feed to JSON. Use the link below and replace userName with your user name.
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/<userName>
For this question, you can use the link below:
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/circle-blog
