I have a JSON in string as follows:
'''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}'''
I want to convert it as:
{"col1":"value1",
"col2":[ {'col3':'val3'}, {'col3':'val4'}]}
And I want to read this into a PySpark DataFrame. How do I convert the list inside the string to a JSON struct?
The (whole) data is not a JSON string, because ' characters are not allowed in JSON structures. The best option would be to go back to wherever this is generated and correct the malformed data before going onwards.
Once you have corrected the bad data, you can do:
import json
result = json.loads('''{"col1":"value1", "col2":[{"col3":"val3"},{"col3":"val4"}]}''')
If you can't change how the data is given to you, one solution would be to string-replace the bad characters (but this might cause all sorts of trouble along the way):
import json
result = json.loads('''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}''')
# The outer document parses, but col2 is still a string: swap the quotes and parse it too.
result['col2'] = json.loads(result['col2'].replace("'", '"'))
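If you need to do the same fix inside PySpark (the question asks about reading this into a DataFrame), here is a hedged sketch using from_json, assuming an existing SparkSession named spark; the same caveat about blind quote replacement applies:

from pyspark.sql import functions as F, types as T

# Hypothetical one-row DataFrame holding the malformed string from the question.
df = spark.createDataFrame(
    [("value1", "[{'col3':'val3'},{'col3':'val4'}]")],
    ["col1", "col2"],
)

inner_schema = T.ArrayType(
    T.StructType([T.StructField("col3", T.StringType())])
)

# Replace the bad single quotes, then parse the string into an array of structs.
df = df.withColumn(
    "col2",
    F.from_json(F.regexp_replace("col2", "'", '"'), inner_schema),
)
df.show(truncate=False)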
Either way, I would go back and re-work the way you get the data for the most reliable results. But that is not JSON-data as it stands now. At least not in the sense you think it is.
I'm trying to get some data stored in a .bson file into a jupyter notebook.
Per this answer and this answer, the accepted answer is basically to use the bson module from PyMongo, and then the following code
FILE = "file.bson"
with open(FILE, 'rb') as f:
data = bson.decode_all(f.read())
Now, data is a list of length 1.
data[0] is a dictionary.
The first key in this dictionary is "a":
data[0]["a"] is a dictionary with keys tag and data, and
data[0]["a"]["data"] is exactly what it should be, a list of integers that I can work with in Python.
On the other hand, the second key in this dictionary is "b",
but now data[0]["b"] is a dictionary with keys tag, type, size, and data,
and data[0]["b"]["data"] is of type bytes, and I'm not sure how to work with it.
I have never worked with bson before, so any input is appreciated. However, some of my questions are
Does anyone have a good ref on how to work with bson in python?
Does anyone know why a gets read in a readable way (not bytes), but b gets read in with more keys and is not readable (bytes as opposed to integers)?
I was really hoping decode_all would take care of everything; does anyone know why it doesn't / what I should do differently? I've tried applying decode_all again to the stuff still in bytes, but I get the error message InvalidBSON: invalid message size.
Does anyone have a solution for my goal, of getting the information from data[0]["b"]["data"] in a usable format (i.e. a list of integers)?
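For what it's worth, here is the kind of decoding I suspect is needed, though the element size and byte order are pure guesses on my part (presumably the tag, type, and size keys say what they really are):

import struct

raw = data[0]["b"]["data"]                       # the bytes object from above
count = len(raw) // 4                            # guessed: 4 bytes per element
ints = list(struct.unpack("<%di" % count, raw))  # guessed: little-endian int32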
I have persisted a list of floats in a CSV file and it appears thus (a single row):
"[6.61501123e-04 1.23390303e-04 1.59454121e-03 2.17852772e-02
:
3.02987776e-04 3.83064064e-03 6.90607396e-04 3.30468375e-03
2.78064613e-02]"
Now when reading it back into a list, I am using the ast.literal_eval approach:
probs = [float(p) for p in ast.literal_eval(row['prob_array'])]
And I get this error:
probs = [float(p) for p in ast.literal_eval(row['prob_array'])]
File "/Users/santino/anaconda/lib/python2.7/ast.py", line 49, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/Users/santino/anaconda/lib/python2.7/ast.py", line 37, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
[6.61501123e-04 1.23390303e-04 1.59454121e-03 2.17852772e-02
^
SyntaxError: invalid syntax
Not sure how I can instruct ast to read the exponent syntax, or am I wrong in assuming it's the exponent syntax that is causing the exception.
Edit: I used csv.DictWriter to persist into the csv file. Is there a different way I should be persisting?
Edit2:
with open("./input_file.csv","w") as r:
writer = csv.DictWriter(r,fieldnames=["item_id","item_name","prob_array"])
writer.writeheader()
res_list = ...
for i,res in enumerate(res_list):
row_dict = {}
row_dict['item_id'] = id_list[i]
row_dict['prob_array'] = res
row_dict['item_name'] = item_list[i]
writer.writerow(row_dict)
CSV only stores string columns. Using it to store strings, ints, floats, and a few other basic types is fine, as long as you manually convert the objects: if you did str(i) on an int, you can get the int back with int(s).
But that isn't true for a list of floats. There's no function you can use to get back the result of str(lst) on an arbitrary list.[1] And it isn't true for… whatever you have, which seems to be most likely a numpy array or Pandas Series… either.[2]
If you can store each float as a separate column, instead of storing a list of them in a single column, that's the easiest answer. But it may not be appropriate.[3]
So, you just need to pick some other function to use in place of the implicit str, which can be reversed with a simple function call. There are formats designed for persisting data to strings—JSON, XML, even a nested CSV—so that's the first place to look.
Usually JSON should be the first one you look at. As long as it can handle all of your data (and it definitely can here), it's dead simple to use, someone's already thought through all the annoying edge cases, and there's code to parse it for every platform in the universe.
So, you write the value like this:
row_dict['prob_array'] = json.dumps(res)
And then you can read it back like this:
prob_array = json.loads(row['prob_array'])
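Putting the two halves together with the DictWriter code from the question (the field names come from the question; the sample data here is made up for illustration):

import csv
import json

rows = [{"item_id": 1, "item_name": "foo", "prob_array": [6.615e-04, 2.78e-02]}]

# Write: serialize the list to a JSON string before handing it to csv.
with open("input_file.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["item_id", "item_name", "prob_array"])
    writer.writeheader()
    for row in rows:
        writer.writerow(dict(row, prob_array=json.dumps(row["prob_array"])))

# Read: parse the JSON string back into a real list of floats.
with open("input_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        prob_array = json.loads(row["prob_array"])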
If prob_array is actually a numpy array or a Pandas Series or something rather than a list, you'll want to either convert through list, or use the numpy or Pandas JSON methods instead of the stdlib module.
The only real problem here is that if you want the CSV to be human-readable/editable, the escaped commas and quotes could be pretty ugly.
In this case, you can define a simpler format that's still easy to write and parse for your specific data, and also more human-readable, like just space-separated floats:
row_dict['prob_array'] = ' '.join(map(str, res))
prob_array = [float(val) for val in row['prob_array'].split()]
[1] Sometimes you can use ast.literal_eval, but relying on that is never a good idea, and it isn't working here.
[2] The human-readable format used by numpy and Pandas is even less parser-friendly than the one used by Python lists. You could switch to their repr instead of their str, but it still isn't going to work with ast.literal_eval.
[3] For an obvious example, imagine a table with two different arbitrary-length lists…
I was recommended some code from someone here but I still need some help. Here's the code that was recommended:
import json
def write_data(data, filename):
    with open(filename, 'w') as outfh:
        json.dump(data, outfh)

def read_data(filename):
    with open(filename, 'r') as infh:
        json.load(infh)
The writing code works fine, but the reading code doesn't return any strings after inputting the data:
read_data('player.txt')
Another thing I'd like to be able to do is specify a line to be printed. Something that is also pretty important for this project / exercise is being able to append strings to my file. Thanks to anyone who can help.
Edit: I need to be storing strings in the file that I can convert to values, e.g.:
name="Karatepig"
Is something I would store so I can recall the data if I ever need to load a previous set of data or something of that sort.
I'm very much a noob at Python, so I don't know what would be best, whether a list or a dictionary; I haven't really used a dictionary before, and I also have no idea yet how I'm going to convert strings into values.
Suppose you have a dictionary like this:
data = {'name': 'Karatepig', 'score': 10}
Your write_data function is fine, but your read_data function needs to actually return the data after reading it from the file:
def read_data(filename):
    with open(filename, 'r') as infh:
        data = json.load(infh)
    return data

data = read_data('player.json')
Now suppose you want to print the name and update the score:
print(data['name'])
data['score'] += 5 # add 5 points to the score
You can then write the data back to disk using the write_data function. You cannot use the json module to append data to a file; you need to actually read the file, modify the data, and write the file again. This is not so bad for small amounts of data, but consider using a database for larger projects.
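For example, using the two functions above, an "append" is really a full read-modify-write cycle:

data = read_data('player.json')    # read everything back in
data['score'] += 5                 # modify the data in memory
write_data(data, 'player.json')    # write the whole file out again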
Whether you use a list or dictionary depends on what you're trying to do. You need to explain what you're trying to accomplish before anyone can recommend an approach. Dictionaries store values assigned to keys (see above). Lists are collections of items without keys and are accessed using an index.
It isn't clear what you mean by "convert strings into values." Notice that the above code shows the string "Karatepig" using the dictionary key "name".
I'm trying to parse a JSON blob with Pandas without parsing the nested JSON structures. Here's an example of what I mean.
import json
import pandas as pd
x = json.loads('{"test":"something", "yes":{"nest":10}}')
df = pd.DataFrame(x)
When I do df.head() I get the following:
           test  yes
nest  something   10
What I really want is ...
        test           yes
1  something  {"nest": 10}
Any ideas on how to do this with Pandas? I have workaround ideas, but I'm parsing GBs of JSON files and do not want to be dependent on a slow for loop to convert and prep the information for Pandas. It would be great to do this efficiently while still utilizing the speed of Pandas.
Note: There's been a correction to this question to fix an error in my reference to JSON objects.
I'm trying to parse a JSON blob with Pandas
No you're not. You're just constructing a DataFrame out of a plain old Python dict. That dict might have been parsed from JSON somewhere else in your code, or it may never have been JSON in the first place. It doesn't matter; either way, you're not using Pandas's JSON parsing. In fact, if you did try to construct a DataFrame directly out of a JSON string, you would get a PandasError.
If you do use Pandas parsing, you can use its options, as documented in pandas.read_json. For example:
>>> j = '{"test": "something", "yes": {"nest": 10}}'
>>> pd.read_json(j, typ='series')
test something
yes {u'nest': 10}
dtype: object
(Of course that's obviously a Series, not a DataFrame. But I'm not sure exactly what you want your DataFrame to be here…)
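That said, if the goal really is the one-row DataFrame shown in the question, one option (a sketch, and not necessarily fast for GBs of input) is to wrap the parsed dict in a list so Pandas treats it as a single record; note the index is 0 by default, not 1:

>>> import json
>>> pd.DataFrame([json.loads(j)])
        test           yes
0  something  {u'nest': 10}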
But if you've already parsed the JSON elsewhere, you obviously can't use Pandas's data parsing on it.
Also:
… and do not want to be dependent on a slow for loop to convert and prep the information for Pandas …
Then use, e.g., a dict comprehension, generator expression, itertools function, or something else that can do the looping in C instead of in Python.
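For instance (a sketch; the file name blobs.jsonl and its one-JSON-object-per-line layout are assumptions):

import json
import pandas as pd

# Parse each line into a dict and hand the whole list to Pandas at once.
with open("blobs.jsonl") as f:
    df = pd.DataFrame([json.loads(line) for line in f])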
However, I doubt that the speed of looping over the JSON strings is actually a real performance issue here, compared to the cost of parsing the JSON, building the Pandas structures, etc. Figure out what's actually taking the time by profiling, then optimize that, instead of just picking some random part of your code and hoping it makes a difference.
In Django, I'm getting some values from a select field using request.POST.getlist('tags'), so when I store this information in MySQL I end up with something like this: u"['literature']". I think this is pretty reasonable and even desirable since I don't want to use another table to store this information. Obviously, the problem comes when I try to retrieve that information because, as expected, I get this:
u'['
u'u'
u"'"
u'l'
u'i'
u't'
u'e'
.
.
.
(assuming this tag is literature, for example).
How can I transform this unicode object into a Python list? Is there a better approach?
Thanks in advance
Short answer: Create another table.
Databases are designed to be used in a particular way, why try to force them to store information in a way they are not meant to?
There are other solutions to this, but the best answer is use the database as it was intended, it will be easier in the long run.
Use json to convert to JSON before writing, and from JSON after reading. Or use one of the several implementations of JSONField in the wild.
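A minimal sketch, where entry is a hypothetical model instance with a text field named tags:

import json

# Before saving: serialize the list to a JSON string.
tags = request.POST.getlist('tags')    # e.g. ['literature']
entry.tags = json.dumps(tags)          # stored as the string '["literature"]'
entry.save()

# After reading: parse the JSON string back into a real list.
tags = json.loads(entry.tags)          # ['literature'] again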
Does this help?
>>> import ast
>>> lst = ast.literal_eval(u"['literature']")
>>> lst
['literature']
>>> isinstance(lst, list)
True
But the better approach would be to properly serialize the list before storing it as a string. You could use one of the existing pickle implementations, json, or roll your own (since it does not have to be generic, it could be a simple one-liner like "SENTINEL".join(list)... not that I'd recommend the latter, though).