I'm currently trying to scrape data from a website and want to save it automatically. I want to format the data beforehand so I can use it as CSV or similar. The JSON is:
{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}
My code is:
x = '{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}'
y = json.loads(x)
print(y["total"])
But because of the {"counts": at the beginning and the corresponding } at the end, I can't just use it as normal JSON: parsing breaks and it prints an error message (json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)). When I remove those characters manually, it works again.
How can I get rid of only those 2 parts?
I think the json library should handle this, but if it really is an issue you can't get rid of, you could remove all occurrences of the { and } characters in the string you are receiving with the .replace() function.
This requires chaining a call for every character you want to remove and is not an optimal solution, but as long as the strings you are processing are not long and/or you are not concerned about efficiency, this should do just fine.
Example:
cleaned = my_var.replace('{', '').replace('}', '')
(Note that str.replace returns a new string rather than modifying my_var in place, so you need to keep the result.)
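For what it's worth, the sample in the question already parses as valid JSON, so stripping braces shouldn't be necessary; the total value just sits one level down under "counts". A minimal sketch:

```python
import json

x = '{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}'
y = json.loads(x)

# "total" is nested inside the "counts" object, so index through it
print(y["counts"]["total"])  # 369
```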
When I try to read a JSON file into Python using the built-in json package, I get back a JSONDecodeError that looks something like this:
JSONDecodeError: Expecting value: line 1 column 233451 (char 233450)
Is there any way to return the location of the error (in this case, 233450)? What I want is something like:
try:
    json.loads(my_json)
except:
    error_loc = json.get_error(my_json)
where error_loc = 233450 - or even just the entire error message as a string, I can extract the number myself.
Context: I'm trying to load some very poorly formatted (webscraped) JSONs into Python. Many of the errors are related to the fact that the text contained in the JSONs contains quotes, curly brackets, and other characters that the json reader interprets as formatting - e.g.
{"category": "this text contains "quotes", which messes with the json reader",
"category2": "here's some more text containing "quotes" - and also {brackets}"},
{"category3": "just for fun, here's some text with {"brackets and quotes"} in conjunction"}
I managed to eliminate the majority of these situations using regex, but now there's a small handful of cases where I accidentally replaced necessary quotes. Looking through the JSONs manually, I don't actually think it's possible to catch all the bad formatting situations without replacing at least one necessary character. And in almost every situation, the issue is just one missing character, normally towards the very end...
If I could return the location of the error, I could just revert the replaced character and try again.
I feel like there has to be a way to do this, but I don't think I'm using the correct terms to search for it.
You can catch the error as the variable error with except json.decoder.JSONDecodeError as error. The JSONDecodeError object then has an attribute pos that gives the index in the string at which the JSON decoding error occurred. lineno and colno give the line and column number, as when opening the file graphically in an editor.
try:
    json.loads(string_with_json)
except json.decoder.JSONDecodeError as error:
    error_pos = error.pos
    error_lineno = error.lineno
    error_colno = error.colno
I currently have binary data that looks like this:
test = b'Got [01]:\n{\'test\': [{\'message\': \'foo bar baz \'\n "\'secured\', current \'proposal\'.",\n \'name\': \'this is a very great name \'\n \'approves something of great order \'\n \'has no one else associated\',\n \'status\': \'very good\'}],\n \'log-url\': \'https://localhost/we/are/the/champions\',\n \'status\': \'rockingandrolling\'}\n'
As you can see this is basically JSON.
So what I did was the following:
test.decode('utf8').replace("Got [01]:\n{", '{').replace("\n", "").replace("'", '"')
This basically turned it into a string and got it as close to valid JSON as possible. Unfortunately, it doesn't fully get there: when I convert it to a string, it keeps all these stupid spaces and line breaks, which are hard to parse out with all the .replace() calls I keep using.
Is there any way to make the decoded output a single line, so I can parse the string and turn it into JSON format?
I have also used a regex for this specific case, and it works, but because this binary data is generated dynamically every time, it may be a tad different in where the line breaks and spaces are. So a regex is too brittle to catch every case.
Any thoughts?
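Since the payload after the prefix is a Python dict repr (single quotes) rather than JSON, one approach worth trying is ast.literal_eval, which parses Python-literal syntax directly with no quote replacement at all. A sketch on a simplified payload (the dict below is a trimmed stand-in for the real one):

```python
import ast
import json

raw = (b"Got [01]:\n"
       b"{'status': 'rockingandrolling',\n"
       b" 'log-url': 'https://localhost/we/are/the/champions'}\n")

text = raw.decode('utf8')
# Drop everything before the first opening brace, then parse the rest
# as a Python literal; whitespace and line breaks don't matter here
payload = text[text.find('{'):]
data = ast.literal_eval(payload)

as_json = json.dumps(data)  # now it really is JSON
```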
I'm trying to create a random text generator in Python. I'm using Markovify to produce the required text, plus a filter that won't let it start generating unless the first word is capitalized. To prevent it from ending mid-sentence, I want the program to search from the back of the output to the front and remove all text after the last (for instance) period, ignoring all other instances of the selected delimiter(s). I have no idea how many instances of the delimiter will occur in the generated text, and no way to know in advance.
While looking into this I found rsplit(), and tried using that, but ran into a problem.
tweet = buff.rsplit('.')[-1]
The above is what I tried first, and I thought it was working until I noticed that all of the lines printed with it had only a single sentence in them, never more than that. The problem seems to be that the text is being split into a list of strings, and the [-1] bit takes just the last entry of that list.
tweet = buff.rsplit('.') - buff.rsplit('.')[-1]
Next I tried the above. The thinking was that it would remove the last entry of the list, and then I could just print what remained. It... didn't go to plan: I get an "unsupported operand type" error, specifically tied to the attempt to subtract a string from a list. Not sure what I'm missing at this point.
.rsplit has a second, optional argument, maxsplit, i.e. the maximum number of splits to do. You could use it the following way:
txt = 'some.text.with.dots'
all_but_last = txt.rsplit('.', 1)[0]
print(all_but_last)
Output:
some.text.with
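Applied to the tweet case, the trick is to keep index [0] (everything before the last period) rather than [-1], e.g.:

```python
buff = "First sentence. Second sentence. Trailing frag"

# Split once from the right, keep everything before the last period,
# and put the final period back
tweet = buff.rsplit('.', 1)[0] + '.'
print(tweet)  # First sentence. Second sentence.
```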
I am trying to fetch the latest posts from Medium.com so for example, I go here
https://medium.com/circle-blog/latest?format=json
But when I copy and paste that entire JSON into JSONEditorOnline.org, I get an error saying:
Error: Parse error on line 1:
])}while(1);</x>{"su
^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
I realize the error is because of the random stuff at the front:
])}while(1);</x>
So how would I remove that using Python?
After I remove it, I want to dump it into a JSON file:
with open('medium.json', 'w') as json1:
    json1.write(json.dumps(json_with_while1_removed))
How would I go about doing this?
I wouldn't bother with it, since it's obviously not valid JSON, but if you need it, you can locate the first opening curly bracket and simply remove everything before it:
valid_json = broken_json[broken_json.find('{'):]
Explanation:
broken_json.find('{') returns the position (index) of the first occurrence of the character { in the string broken_json
broken_json[X:] - is a string slice, it returns the substring of broken_json starting on the position X
An advantage over LeKhan's solution is that when that JSON becomes valid, your code will still work even with this fix in place. Also, his solution will return broken JSON if the payload contains the substring </x> inside one of its fields (which may be valid).
Note: it's probably not a bug; it's there intentionally for some reason. For example, the Medium JSON feed module handles it very similarly: it also strips everything before the first opening curly bracket.
According to this article, it's there to prevent "JSON hacking" (also known as JSON hijacking), whatever that means.
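Putting the strip together with the write-to-file step from the question (the prefix string mirrors the one Medium returns; the payload here is a made-up stand-in):

```python
import json

raw = '])}while(1);</x>{"success": true, "payload": {"posts": []}}'

# Strip everything before the first opening curly bracket
valid_json = raw[raw.find('{'):]
data = json.loads(valid_json)

# Re-serialize the cleaned data to a file
with open('medium.json', 'w') as json1:
    json1.write(json.dumps(data))
```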
You can try splitting the string on </x> and then taking the second element:
clean_json = raw_json.split('</x>')[1]
Medium doesn't provide JSON objects, but it does provide RSS feeds, so you can convert the RSS feed to a JSON object. Use the link below, replacing <userName> with your username:
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/<userName>
For this question, you can use the link below:
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/circle-blog
I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as separate items, it runs the data together as one long string. According to my Google searches, this should not be happening.
What am I doing wrong?
Code:
with open('Ped.dta', 'w+') as p:
    p.write(str(recnum))        # Add record number to top of file
    for x in range(recnum):
        p.write(dte[x])         # Write date
        p.write(str(stp[x]))    # Write steps number
Since you do not show your data or your output, I cannot be sure. But it seems you are trying to use the write method like the print function, and there are important differences.
Most important, write does not follow its written characters with any separator (like space by default for print) or end (like \n by default for print).
Therefore there is no space between your date and steps number, and no newline between the records, because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x])       # Write date
p.write(' ')          # Space separator
p.write(str(stp[x]))  # Write steps number
p.write('\n')         # Line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement @abarnert's suggestion (in a comment) and show you how to get the advantages of the print function while still writing to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.
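A self-contained sketch of the print(file=...) version, writing to an in-memory buffer (io.StringIO) instead of a real file just to show the resulting layout; dte and stp here are made-up sample data:

```python
import io

dte = ['2023-01-01', '2023-01-02']   # sample dates, already text
stp = [5000, 7500]                   # sample step counts
recnum = len(dte)

buf = io.StringIO()
print(recnum, file=buf)              # record count on its own line
for x in range(recnum):
    # print adds the space separator and the trailing newline for us
    print(dte[x], stp[x], file=buf)

contents = buf.getvalue()
print(contents)
```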
Try adding a newline ('\n') character at the end of your lines, as you see in the docs. This should solve the problem of 'listing the items as single items', but the file you create may still not be greatly structured.
For your further Google searches, you may want to check out serialization, as well as the json and csv formats, covered in the Python standard library.
Your question would have benefited from a very small example of the recnum variable. Also, the original f.close() is not necessary since you have a with statement; see here on SO.