Can Pandas read a nested JSON blob without parsing sub JSON structures? - python

I'm trying to parse a JSON blob with Pandas without parsing the nested JSON structures. Here's an example of what I mean.
import json
import pandas as pd
x = json.loads('{"test":"something", "yes":{"nest":10}}')
df = pd.DataFrame(x)
When I do df.head() I get the following:
           test  yes
nest  something   10
What I really want is ...
        test           yes
1  something  {"nest": 10}
Any ideas on how to do this with Pandas? I have workaround ideas, but I'm parsing GBs of JSON files and do not want to be dependent on a slow for loop to convert and prep the information for Pandas. It would be great to do this efficiently while still utilizing the speed of Pandas.
Note: This question has been corrected to fix an error in my reference to JSON objects.

I'm trying to parse a JSON blob with Pandas
No you're not. You're just constructing a DataFrame out of a plain old Python dict. That dict might have been parsed from JSON somewhere else in your code, or it may never have been JSON in the first place. It doesn't matter; either way, you're not using Pandas's JSON parsing. In fact, if you did try to construct a DataFrame directly out of a JSON string, you would get a PandasError.
If you do use Pandas parsing, you can use its options, as documented in pandas.read_json. For example:
>>> j = '{"test": "something", "yes": {"nest": 10}}'
>>> pd.read_json(j, typ='series')
test something
yes {u'nest': 10}
dtype: object
(Of course that's obviously a Series, not a DataFrame. But I'm not sure exactly what you want your DataFrame to be here…)
But if you've already parsed the JSON elsewhere, you obviously can't use Pandas's JSON parsing on it.
Also:
… and do not want to be dependent on a slow for loop to convert and prep the information for Pandas …
Then use, e.g., a dict comprehension, generator expression, itertools function, or something else that can do the looping in C instead of in Python.
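For example, a minimal sketch, assuming the blobs arrive one JSON object per line (the file name is made up):
import json
import pandas as pd
# hypothetical input file: one JSON object per line
with open("blobs.jsonl") as f:
    df = pd.DataFrame([json.loads(line) for line in f])
The list comprehension keeps the per-record loop tight, and json.loads leaves nested objects as plain dicts in their column, which is exactly the behavior asked for above.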
However, I doubt that the speed of looping over the JSON strings is actually a real performance issue here, compared to the cost of parsing the JSON, building the Pandas structures, etc. Figure out what's actually taking the time by profiling, then optimize that, instead of just picking some random part of your code and hoping it makes a difference.
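For instance, a quick way to check the split between JSON parsing and DataFrame construction (the sample data is made up):
import timeit
import json
import pandas as pd
sample = ['{"test": "a", "yes": {"nest": 1}}'] * 10000
print(timeit.timeit(lambda: [json.loads(s) for s in sample], number=10))
records = [json.loads(s) for s in sample]
print(timeit.timeit(lambda: pd.DataFrame(records), number=10))
If parsing dominates, a faster JSON parser will buy you far more than restructuring the loop.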

Related

How to get the HTML from dataframe without writing to a file?

Let's say I have a large pandas Dataframe
df = pd.read_csv("temp.csv")
Now I want to get the HTML representation of this dataframe in memory, not by writing it to a file.
Is there a thing like the following?
html_object=df.to_html()
Writing html is incredibly slow.
Yes there is: just use the code you wrote.
html_object=df.to_html()
Yes, there literally is df.to_html(). By default, this will return a string.
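A minimal demonstration (the frame here is made up):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
html_object = df.to_html()  # no file is written
print(type(html_object))    # <class 'str'>
If you do want a file instead, passing a path or buffer as the first argument, as in df.to_html("out.html"), writes there and returns None.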

python/pandas : Pandas changing the value adding extra digits in values [duplicate]

I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" to the corresponding to_csv call.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine, not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex separators), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype=str and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
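A minimal sketch of that trick (the file name, separator and column name are made up; the regex separator is what forces the python engine):
import pandas as pd
df = pd.read_csv("data.txt", sep=r";+", engine="python", dtype=str)
df["columnname"] = df["columnname"].astype("float64")
Converting after the fact bypasses the csv parser's fast-but-lossy converter, which is why the spurious digits disappear.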

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, a matrix in which the JSON attributes become the column names. You will then have to manipulate it, which is preferable anyway, since operations on DataFrames are optimized for processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
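If you only have the raw JSON text rather than a wrapper object, a minimal sketch using plain dict indexing (the key path mirrors the attribute chain above and is an assumption about your payload):
import json
data = json.loads(raw_json)  # raw_json being the API response text
user_id = data["attributes"]["social_connections"]["discord"]["user_id"]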

Map JSON string to struct in PySpark

I have a JSON in string as follows:
'''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}'''
I want to convert it as:
{"col1":"value1",
"col2":[ {'col3':'val3'}, {'col3':'val4'}]}
And I want to read this into a PySpark dataframe. How do I convert the list inside the string to a JSON struct?
The (whole) data is not a JSON string, because ' characters are not valid string delimiters in JSON. The best option would be to go back to wherever this is generated and correct the malformed data before going onwards.
Once you have corrected the bad data, you can do:
import json
result = json.loads('''{"col1":"value1", "col2":[{"col3":"val3"},{"col3":"val4"}]}''')
If you can't change how the data is given to you, one solution is to string-replace the bad characters (though this can cause all sorts of trouble along the way):
import json
result = json.loads('''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}''')
result['col2'] = json.loads(result['col2'].replace("'", '"'))
Either way, I would go back and re-work the way you get the data for the most reliable results. But that is not JSON-data as it stands now. At least not in the sense you think it is.
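If you do need the result inside a PySpark dataframe, here's a minimal sketch, assuming the inner quotes have already been corrected (the schema and column names just mirror the example):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# outer object, with col2 still serialized as a string of (now valid) JSON
raw = '{"col1":"value1", "col2":"[{\\"col3\\":\\"val3\\"},{\\"col3\\":\\"val4\\"}]"}'
df = spark.createDataFrame([(raw,)], ["value"])

outer = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
])
inner = ArrayType(StructType([StructField("col3", StringType())]))

parsed = (df
    .withColumn("obj", F.from_json("value", outer))
    .select("obj.col1", F.from_json("obj.col2", inner).alias("col2")))
parsed.show(truncate=False)
from_json is applied twice: once for the outer object, and once more on col2 to turn the embedded string into an array of structs.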

Converting python Dataframe to Matlab file

I am trying to convert a python Dataframe to a Matlab (.mat) file.
I initially have a txt file (an EEG signal) that I import using pandas.read_csv:
MyDataFrame = pd.read_csv("data.txt", sep=';', decimal='.')
data.txt is a 2D array with labels.
In order to convert it to .mat, I tried this solution, where the idea is to convert the dataframe into a dictionary of lists, but after trying every aspect of it I was still unsuccessful.
scipy.io.savemat('EEG_data.mat', {'struct':MyDataFrame.to_dict("list")})
It did create a .mat file, but it did not save my dataframe properly: all the values are basically gone, and the remaining labels are empty when you look into them.
I also tried using mat4py which is designed to export python structures into Matlab files, but it did not work either. I don't understand why, because converting my dataframe to a dictionary of lists is exactly what should be done according to the mat4py documentation.
I believe that the reason the previous solutions haven't worked for you is that your DataFrame column names are not valid MATLAB struct field names, because they contain spaces and/or start with digit characters.
When I do:
import pandas as pd
import scipy.io
MyDataFrame = pd.read_csv('eeg.txt',sep=';',decimal='.')
truncDataFrame = MyDataFrame[0:1000] # reduce data size for test purposes
scipy.io.savemat('EEGdata1.mat', {'struct1':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with the 4 fields reltime, datetime, iSensor and quality. Each of these has 1000 elements, so the data from these columns has been converted, but the rest of your data is missing.
However if I first rename the DataFrame columns:
truncDataFrame.rename(columns=lambda x:'col_' + x.replace(' ', '_'), inplace=True)
scipy.io.savemat('EEGdata2.mat', {'struct2':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with 36 fields. This is not the same format as your mat4py solution but it does contain (as far as I can see) all the data from the source DataFrame.
(Note that in your question, you are creating a .mat file that contains a variable called struct and when this is loaded into MATLAB it masks the builtin struct datatype - that might also cause issues with subsequent MATLAB code.)
I finally found a solution thanks to this post. There, the poster did not create a dictionary of lists but a dictionary of integers, which worked on my side. It is a small example, easily reproducible. Then I tried to manually add lists by entering values like [1, 2], and it did not work. But what worked was when I manually added tuples!
MyDataFrame needs to be converted to a dictionary and if a dictionary of lists doesn't work, try with tuples.
For beginners: lists are written with [] and tuples with ().
This worked for me:
import mat4py as mp
EEGdata = MyDataFrame.apply(tuple).to_dict()
mp.savemat('EEGdata.mat',{'structs': EEGdata})
EEGdata.mat should now be readable by Matlab, as it is on my side.
