Medium.com invalid JSON? - Python

I am trying to fetch the latest posts from Medium.com. For example, I go here:
https://medium.com/circle-blog/latest?format=json
But when I copy and paste that entire JSON into JSONEditorOnline.org, I get an error saying:
Error: Parse error on line 1:
])}while(1);</x>{"su
^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
I realize the error is because of the extra text at the front:
])}while(1);</x>
So how would I remove that using Python?
After I remove it, I want to dump it into a JSON file:
with open('medium.json', 'w') as json1:
    json1.write(json.dumps(json_with_prefix_removed))
How would I go about doing this?

I wouldn't bother with it, since it's obviously not valid JSON, but if you need it you can locate the first opening curly bracket and simply remove everything before it:
valid_json = broken_json[broken_json.find('{'):]
Explanation:
broken_json.find('{') returns the position (index) of the first occurrence of the character { in the string broken_json.
broken_json[X:] is a string slice; it returns the substring of broken_json starting at position X.
An advantage over LeKhan's solution is that if the JSON ever becomes valid, your code will keep working even with this fix in place. Also, his solution returns broken JSON whenever the payload contains the substring </x> inside one of its fields (which may be valid).
Note: it's probably not a bug; it's there intentionally for some reason. For example, the Medium JSON feed module handles it very similarly: it also strips everything before the first opening curly bracket.
According to this article, the prefix is there to prevent "JSON hijacking": the unparseable prefix stops the response from being executed as a script by a malicious third-party page.
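Putting the pieces together for the asker's goal, here is a minimal sketch (it assumes the third-party requests library is available and that Medium still serves the payload with this prefix):
import json

import requests  # third-party; pip install requests

url = 'https://medium.com/circle-blog/latest?format=json'
broken_json = requests.get(url).text

# Strip everything before the first opening curly bracket,
# i.e. the ])}while(1);</x> prefix.
valid_json = broken_json[broken_json.find('{'):]
data = json.loads(valid_json)

with open('medium.json', 'w') as json1:
    json1.write(json.dumps(data))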

You can try splitting the string on </x> and then taking the second element:
clean_json = raw_json.split('</x>')[1]

Medium doesn't provide plain JSON objects, but it does provide RSS feeds, so you can convert the RSS feed into a JSON object. Use the link below, substituting your username for <userName>:
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/<userName>
For this question, you can use the link below:
https://api.rss2json.com/v1/api.json?rss_url=https://medium.com/feed/circle-blog
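A minimal sketch of fetching the converted feed (again assuming the requests library; the items and title keys follow rss2json's documented response shape):
import requests  # third-party; pip install requests

url = ('https://api.rss2json.com/v1/api.json'
       '?rss_url=https://medium.com/feed/circle-blog')
feed = requests.get(url).json()

# rss2json returns the converted posts under the "items" key.
for item in feed['items']:
    print(item['title'])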

Related

JSON - How to return the location of an error?

When I try to read a JSON file into Python using the built-in json package, I get back a JSONDecodeError that looks something like this:
JSONDecodeError: Expecting value: line 1 column 233451 (char 233450)
Is there any way to return the location of the error (in this case, 233450)? What I want is something like:
try:
    json.loads(my_json)
except:
    error_loc = json.get_error(my_json)
where error_loc = 233450, or even just the entire error message as a string; I can extract the number myself.
Context: I'm trying to load some very poorly formatted (webscraped) JSONs into Python. Many of the errors are related to the fact that the text contained in the JSONs contains quotes, curly brackets, and other characters that the json reader interprets as formatting - e.g.
{"category": "this text contains "quotes", which messes with the json reader",
"category2": "here's some more text containing "quotes" - and also {brackets}"},
{"category3": "just for fun, here's some text with {"brackets and quotes"} in conjunction"}
I managed to eliminate the majority of these situations using regex, but now there's a small handful of cases where I accidentally replaced necessary quotes. Looking through the JSONs manually, I don't actually think it's possible to catch all the bad formatting situations without replacing at least one necessary character. And in almost every situation, the issue is just one missing character, normally towards the very end...
If I could return the location of the error, I could just revert the replaced character and try again.
I feel like there has to be a way to do this, but I don't think I'm using the correct terms to search for it.
You can catch the error as the variable error with except json.decoder.JSONDecodeError as error. The JSONDecodeError object then has an attribute pos that gives the index in the string at which the decoding error occurred. lineno and colno give the line and column numbers, like the ones you see when opening the file in a graphical editor.
import json

try:
    json.loads(string_with_json)
except json.decoder.JSONDecodeError as error:
    error_pos = error.pos
    error_lineno = error.lineno
    error_colno = error.colno
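For the asker's repair-and-retry workflow, here is a minimal sketch built on error.pos (the re-inserted quote is purely illustrative; the actual repair depends on which character your regex removed):
import json

def load_with_repairs(text, max_attempts=10):
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.decoder.JSONDecodeError as error:
            # Illustrative repair: put a quote back just before the
            # position where decoding failed, then try again.
            text = text[:error.pos] + '"' + text[error.pos:]
    raise ValueError('could not repair the JSON')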

Python strip only single specific characters from text/json

I'm currently trying to scrape data from a website and want to save it automatically. I want to format the data first so I can use it as CSV or similar. The JSON is:
{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}
My code is:
x = '{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}'
y = json.loads(x)
print(y["total"])
But due to the {"counts": at the beginning and the corresponding } at the end, I can't just use it like a normal JSON file: the formatting breaks and it just throws an error (json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)). When I remove the characters manually, it works again.
How can I get rid of only those two parts?
I think the json library should be able to handle this, but if it really is an issue you can't get around, you could remove all occurrences of the { and } characters from the string you are receiving with the .replace() function.
This requires chaining a call for every character you want to remove and is not an optimal solution, but as long as the strings you are processing are not long and you are not concerned about efficiency, it should do just fine.
Example:
my_var.replace('{', '').replace('}', '')
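That said, for this particular payload the json library alone is enough, as suggested above: the total just lives one level down, under the counts key. A minimal sketch:
import json

x = '{"counts":{"default":"27","quick_mode1":"48","quick_mode2":"13","custom":"281","quick_mode3":"0","total":369}}'
y = json.loads(x)

# "total" sits inside the nested "counts" object, not at the top level.
print(y["counts"]["total"])  # 369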

Remove quotes inside XML text tag

I asked this question a couple of days ago and am now facing an issue in some XMLs after iterating through all of them.
I have found that some values have quotes inside, like <restaurant>L'amour</restaurant>, and when I try to parse it into a dictionary it generates an error because of the single quote character inside the value. Is there a way to add double quotes to, preferably, all the values inside the XML so that the error can be avoided, and then remove the double quotes after the list of dictionaries is generated?
Or perhaps there is another approach to this issue? Thank you very much.
Edit:
This is an example of the string I am having trouble with:
s1 = "{'uno': 'l'ebe'}"
ast.literal_eval(mydict(s1))
This throws an invalid syntax error.
Have you tried replacing the value?
value.replace("'", "\"")
Then you can revert that when displaying. Or you can escape the single quote when saving the value into the dictionary, so it's stored escaped.
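A minimal sketch of that escaping idea (the variable names are illustrative; the inner apostrophe is escaped before the string reaches ast.literal_eval, so the parser no longer mistakes it for a closing delimiter):
import ast

value = "l'ebe"
# Escape the single quote so it survives inside a single-quoted literal.
escaped = value.replace("'", "\\'")
s1 = "{'uno': '%s'}" % escaped
print(ast.literal_eval(s1))  # {'uno': "l'ebe"}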

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC; however, this results in a quote mark around every field and I don't know why.
If I try one of the other quoting options, like MINIMAL, I end up with an error message about the date value, 2008-01-09, not being a float.
I have tried creating a dialect and adding the quoting option to the CSV reader and writer, but nothing I have tried produces an exact match to the original data.
Has anyone had this same problem and found a solution?
When writing, quoting=csv.QUOTE_NONNUMERIC leaves values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem is likely that, when reading, a csv.reader turns every row it reads into a list of strings (if you read the documentation carefully, you'll see that a reader does not perform automatic data type conversion).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes, because everything you write is a string.
Edit: of course, the date fields will still be quoted, because they are not numbers, which means you cannot get exactly the expected behaviour using the standard csv.writer.
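A minimal sketch of that conversion step (the file names are placeholders; numeric-looking fields are turned back into numbers after reading, so QUOTE_NONNUMERIC only quotes the genuine strings, dates included, on the way out):
import csv

def convert(field):
    # Turn numeric-looking strings back into numbers; leave
    # everything else (including dates) as strings.
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

with open('input.csv', newline='') as src, \
        open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC)
    for row in csv.reader(src):
        writer.writerow([convert(f) for f in row])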
Are you sure you have a problem? The behavior you're describing is correct: the csv module encloses strings in quotes only when it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing commas, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.
Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the quoted input fields could contain commas is usually not a big deal: if you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
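A quick demonstration of that automatic quoting with the default (QUOTE_MINIMAL) writer; the sample row is made up:
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['BYRNESS, RAW', 41301888])
# Only the field containing a comma gets quoted:
print(buf.getvalue())  # "BYRNESS, RAW",41301888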

How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of HTML files. I am looking for particular parts of the files. There can be small variations in the way the files were created.
For example, in one file I can have a section heading (after I converted it to upper case, then split and rejoined the text to get rid of possibly inconsistent whitespace):
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I try to run on the first string to remove the '\x97' does not seem to work.
There are a fair number of variations of keys with various representations of entities, so I would really like to build the dictionary more or less automatically, so that I have something like:
key_dict = {u'KEY1A\x97RISKFACTORS': 'KEY1ARISKFACTORS', 'KEY1ARISKFACTORS': 'KEY1ARISKFACTORS', . . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the HTML entities are handled.
What I specifically tried to do is:
new_string = u'KEY1A\x97RISKFACTORS'.replace('|', '')
I got an error.
Sorry, I have been at this since last night. SLott pointed out something, and I see I used the wrong label; I hope this makes more sense now.
You are correct that if S1 = 'A' and S2 = u'A', then S1 == S2. Instead of assuming this, though, you can do a simple test:
key_dict = {u'A': 'Value1',
            'A': 'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string = u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|', '')
There's a problem here: \x97 is the value you're trying to replace in the target string, but your search string is '|', which is hex value 0x7C (ASCII and Unicode) and clearly not the value you need to replace. Even if the target and search string were both ASCII or both Unicode, you'd still never find the '\x97'. The second problem is that you are searching for a non-Unicode string in a Unicode string. The easiest solution, and the one that makes the most sense, is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97', '')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
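Since the goal is to build the key dictionary more or less automatically, one possible sketch is to normalize every heading down to its ASCII letters and digits before comparing (normalize is a hypothetical helper, not something from the answers above):
import re

def normalize(heading):
    # Uppercase, then drop anything that isn't an ASCII letter or digit,
    # so variants containing stray entity characters such as u'\x97'
    # collapse to the same key.
    return re.sub(r'[^A-Z0-9]', '', heading.upper())

print(normalize(u'KEY1A\x97RISKFACTORS') == normalize('KEY1ARISKFACTORS'))  # True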
