What exactly happens in the I/O of JSON files? - python

I struggled with the following for a couple of hours yesterday. I figured out a workaround, but I'd like to understand a little more of what's going on in the background and, ideally, I'd like to remove the intermediate file from my code just for the sake of elegance. I'm using Python, by the way, and files_df starts off as a pandas DataFrame.
Can you help me understand why the following code gives me an error?
files_json = files_df.to_json(orient='records')
for file_json in files_json:
    print(file_json)  # do stuff
But this code works?
files_json = files_df.to_json(orient='records')
with open('export_json.json', 'w') as f:
    f.write(files_json)
with open('export_json.json') as data:
    files_json = json.load(data)
for file_json in files_json:
    print(file_json)  # do stuff
Obviously, the export/import is converting the data somehow into a usable format. I would like to understand that a little better and know if there is some option within the pandas files_df.to_json command to perform the same conversion.

json.load is the opposite of json.dump, but here you export from a pandas DataFrame into a file and then import it again with the standard library into a plain Python structure.
Try files_df.to_dict.
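If I'm reading the answer above right, the key point is that to_json returns a single JSON string, so the first loop iterates over it character by character. A minimal sketch of two ways to skip the intermediate file (the toy files_df here is just a stand-in for the DataFrame in the question):

import json
import pandas as pd

files_df = pd.DataFrame({'name': ['a.txt', 'b.txt'], 'size': [10, 20]})  # stand-in data

# Option 1: parse the JSON string back into a list of dicts in memory
files_json = files_df.to_json(orient='records')
for file_json in json.loads(files_json):
    print(file_json)  # do stuff

# Option 2: skip JSON entirely and work with a list of dicts directly
for file_json in files_df.to_dict(orient='records'):
    print(file_json)  # do stuff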

Related

Converting JSON file to SQLITE or CSV

I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the rather large JSON file I was working with. I tried using a Python module called sqlbiter, but again, like the others, it was never really able to output or convert the file.
I'm not sure where to go from here. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me; I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data processing task as follows:
First, you need to read the JSON file using with, open and json.load.
Second, you need to change the format of your file a bit by changing the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, you can now utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, you can utilize pandas's to_csv function to write your DataFrame to disk as a CSV file.
It would look something like this:
import pandas as pd
import json

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
You need to load your JSON file and parse it so that all the fields are available, or load the contents into a dictionary. Then you could use pyodbc to write those fields to the database, or write them to a CSV file (import csv first).
But this is just a general idea. You will need to study Python and work out how to do each step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
    cursor1.commit()
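As a concrete (if hypothetical) illustration of the database route, here is a minimal sketch using the standard-library sqlite3 module instead of pyodbc, assuming the JSON has already been loaded into a list of dicts as in the other answer; the table and column names are placeholders:

import json
import sqlite3

with open('./airports.json.txt', 'r') as f:
    records = list(json.load(f).values())  # one dict per airport

conn = sqlite3.connect('airports.db')
cur = conn.cursor()
# Hypothetical schema; adjust the columns to match the fields in your JSON
cur.execute('CREATE TABLE IF NOT EXISTS airports (code TEXT, name TEXT)')
cur.executemany(
    'INSERT INTO airports (code, name) VALUES (?, ?)',
    [(r.get('code'), r.get('name')) for r in records],
)
conn.commit()
conn.close()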

Producing pandas DataFrame from table in text file

I have some data in a text file which looks like this:
(v14).K TaskList[Parameter Estimation].(Problem)Parameter Estimation.Best Value
5.00885e-007 3.0914e+007
5.75366e-007 2.99467e+007
6.60922e-007 2.99199e+007
I'm trying to get this data into a pandas dataframe. The code I've written below partially works but has formatting issues:
import pandas

def parse_PE_results(results_file):
    with open(results_file) as f:
        data = f.readlines()
    parameter_value = []
    best_value = []
    for i in data:
        split = i.split('\t')
        parameter_value.append(split[0])
        best_value.append(split[1].rstrip())
    pv = pandas.Series(parameter_value, name=parameter_value[0])
    bv = pandas.Series(best_value, name=best_value[0])
    df = pandas.DataFrame({parameter_value[0]: pv, best_value[0]: bv})
    return df
I get the feeling that there must be an easier, more 'pythonic' way of building a data frame from text files. Would anybody happen to know what that is?
Use pandas.read_csv. The entire parse_PE_results function can be replaced with
df = pd.read_csv(results_file, delimiter='\t')
You'll also enjoy better performance by using read_csv instead of calling
data=f.readlines() and looping through it line by line.
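If the long header line in the file is more of a nuisance than a help, you could also skip it and supply your own column names (a sketch, assuming the two-column tab-separated layout shown in the question):

import pandas as pd

df = pd.read_csv(results_file, sep='\t', skiprows=1,
                 names=['parameter_value', 'best_value'])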

how to read a data file including "pandas.core.frame, numpy.core.multiarray"

I came across a .df file which is encoded in a binary format, but when I open it in Vim I can still see strings like "pandas.core.frame" and "numpy.core.multiarray". So I guess it is related to Python. However, I know little about the Python language. Though I have tried the pandas and numpy modules, I failed to read the file. Could you give any suggestions on this issue? Thank you in advance. Here is the Dropbox link to the DF file: https://www.dropbox.com/s/b22lez3xysvzj7q/flux.df
Looks like a DataFrame stored with pickle; use read_pickle() to read it:
import pandas as pd
df = pd.read_pickle('flux.df')
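Once it loads, the usual DataFrame inspection calls will confirm what the file contains, for example:

print(df.head())    # first few rows
print(df.dtypes)    # column types
print(df.shape)     # (rows, columns)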

Pandas read_stata() with large .dta files

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The reason for the '%time' is because I am interested in benchmarking this against R's read.dta().
My questions are:
Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?
Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:
def load_large_dta(fname):
    import sys
    import pandas as pd

    reader = pd.read_stata(fname, iterator=True)
    df = pd.DataFrame()
    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            df = df.append(chunk, ignore_index=True)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass
    print('\nloaded {} rows'.format(len(df)))
    return df
I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.
This notebook shows it in action.
For all the people who end up on this page, please upgrade pandas to the latest version. I had this exact problem with a stalled computer during load (a 300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.
Currently, it's v0.16.2. There have been significant improvements to speed, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
There is a simpler way to solve it using Pandas' built-in function read_stata.
Assume your large file is named large.dta.
import pandas as pd

reader = pd.read_stata("large.dta", chunksize=100000)
df = pd.DataFrame()
for itm in reader:
    df = df.append(itm)
df.to_csv("large.csv")
Question 1.
There's not much I can say about this.
Question 2.
Consider exporting your .dta file to .csv using Stata command outsheet or export delimited and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R and compare with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.
Run help outsheet in Stata for details of the export.
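The round trip would look roughly like this (the export command runs inside Stata; the filename and data_dir are just the ones from the question):

# In Stata:
#   export delimited using "my_dta_file.csv", replace
# (older Stata versions: outsheet using "my_dta_file.csv", comma)

import pandas as pd
df = pd.read_csv(data_dir + 'my_dta_file.csv')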
You should not be reading a 3 GB+ file into an in-memory data object; that's a recipe for disaster (and has nothing to do with pandas).
The right way to do this is to memory-map the file and access the data as needed.
You should consider converting your file to a more appropriate format (csv or hdf) and then you can use the Dask wrapper around pandas DataFrame for chunk-loading the data as needed:
from dask import dataframe as dd

# If you don't want to use all the columns, make a selection
columns = ['column1', 'column2']
data = dd.read_csv('your_file.csv', usecols=columns)
This will transparently take care of chunk-loading, multicore data handling and all that stuff.
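Dask DataFrames are lazy, so nothing is actually read from disk until you ask for a result, for example:

mean_value = data['column1'].mean().compute()  # triggers the chunked read and computation
first_rows = data.head()                       # head() computes a small sample eagerly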

Python Excel parsing data with xlrd

Fairly simple; I've got the data I want out of the excel file, but can't seem to find anything inside the XLRD readme that explains how to go from this:
xldate:40397.007905092592
number:10000.0
text:u'No'
number:0.1203
number:0.096000000000000002
number:0.126
to their respective python datatypes. Any ideas?
Did you try the documentation's help? --> date_function
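For example, xlrd's own date helpers can convert that xldate value once you know the workbook's datemode (a sketch; the filename and cell position are placeholders):

import xlrd

book = xlrd.open_workbook('your_file.xls')   # placeholder filename
sheet = book.sheet_by_index(0)
cell = sheet.cell(0, 0)                      # placeholder cell position

if cell.ctype == xlrd.XL_CELL_DATE:
    # datemode is needed because Excel workbooks use one of two date epochs
    year, month, day, hour, minute, second = xlrd.xldate_as_tuple(cell.value, book.datemode)
    print(year, month, day, hour, minute, second)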
I had the same issue and used the following as a last resort:
def numobj2fl(p):
    return float(str(p).split(":")[1])
for converting the 'number object' to float.
