I have a JSON file with over a million rows, so I am trying to minimize the number of times I have to run through it all to get one aspect of it into an RDD.
Right now, I load each row into a list:
import json

data = []
with open('in/json-files/sites.json') as f:
    for line in f:
        data.append(json.loads(line))
Then, I make another list and extract the aspect into that:
data_companies = []
for line in range(1, len(data)):
    data_companies.append(data[line]['company'])
Then, I parallelize this into an RDD so that I can analyze it. I am worried about how much memory this will take up, so is there an easier and faster way to do this? I have tried loading the JSON file like this, but it won't work:
data.append(json.loads(line['company']))
As your data is structured (JSON), you can look into Spark SQL:
https://spark.apache.org/docs/2.4.0/sql-programming-guide.html
https://spark.apache.org/docs/2.4.0/sql-data-sources-json.html
You can load your JSON directly into a DataFrame and select the particular column you need for your analysis.
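For example, a minimal sketch, assuming a SparkSession is available and using the file path from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sites").getOrCreate()

# spark.read.json handles newline-delimited JSON out of the box
df = spark.read.json("in/json-files/sites.json")

# select only the column you care about; nothing is pulled into the driver
companies = df.select("company")
companies.show(5)

# if you still need an RDD for your existing analysis code
companies_rdd = companies.rdd.map(lambda row: row["company"])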
Related
I have a very large JSON file in the form of multiple objects (one per line). For a small dataset, this works:
data = pd.read_json(file, lines=True)
but on the same, larger dataset it crashes on an 8 GB RAM computer, so I tried to convert it to a list first with the code below:
data = []
with open(file) as f:
    for i in f:
        d = json.loads(i)
        data.append(d)
then convert the list into a DataFrame with
df = pd.DataFrame(data)
This does convert it into a list fine, even with the large dataset file, but it crashes when I try to convert the list into a DataFrame, presumably because it uses too much memory.
I have also tried:
data = []
with open(file) as f:
    for i in f:
        d = json.loads(i)
        df = pd.DataFrame([d])
I thought it would append the rows one by one, but I think it still creates one large copy in memory at once instead, so it still crashes.
How would I convert the large JSON file into a DataFrame in chunks so that it limits the memory usage?
There are several possible solutions, depending on your specific case. Since we don't have a data example or information on the data structure, I can offer the following:
If the data in the JSON file is numeric, consider breaking it into chunks, reading each one, and converting the columns to the smallest suitable type (float32, int, etc.), since pandas defaults to float64, which is more memory intensive (a sketch of this follows below).
Use Dask for bigger-than-memory datasets like yours.
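For the chunked route from the first suggestion, a rough sketch, assuming `file` is the path to the newline-delimited JSON file as in the question:

import pandas as pd

chunks = []
# lines=True reads newline-delimited JSON; chunksize yields DataFrames lazily
for chunk in pd.read_json(file, lines=True, chunksize=100_000):
    # downcast numeric columns to shrink the in-memory footprint
    for col in chunk.select_dtypes(include="number").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    chunks.append(chunk)

# the final frame still has to fit in memory, just with smaller dtypes;
# if even that is too big, Dask is the better fit
df = pd.concat(chunks, ignore_index=True)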
To avoid the intermediate data structures, you can use a generator:
import json
import pandas as pd

def load_jsonl(filename):
    with open(filename) as fd:
        for line in fd:
            yield json.loads(line)

df = pd.DataFrame(load_jsonl(filename))
I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.
I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file
In pandas, I am trying to use skiprows to select only the rows that I need.
# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)
rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)
I would expect this code to return a small dataframe pretty fast. However, what it does is start consuming memory for several minutes until the system becomes unresponsive. I'm guessing that it reads the whole dataframe first and gets rid of rows2skip later.
Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?
Try this
train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
If you only want to read the first n rows:
train = pd.read_csv(..., nrows=n)
If you only want to read 100 rows starting at row n (nrows is a count, not an end position):
train = pd.read_csv(..., skiprows=n, nrows=100)
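To combine the chunked reader with your rows2keep, one possible sketch, assuming the row numbers refer to data rows (i.e. excluding the header):

import pandas as pd

chunk_size = 150_000
wanted = set(rows2keep)          # the ~20 row numbers from the question

pieces = []
for i, chunk in enumerate(pd.read_csv('train.csv', chunksize=chunk_size)):
    start = i * chunk_size
    # positions of the wanted rows inside this chunk, if any
    local = [pos - start for pos in wanted if start <= pos < start + chunk_size]
    if local:
        pieces.append(chunk.iloc[local])

signal = pd.concat(pieces, ignore_index=True)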
chunksize should help in limiting the memory usage. Alternatively, if you only need a small number of lines, a possible way is to first read the required lines outside of pandas and then feed read_csv only that subset. The code could be:
import io

lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))
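If train.csv has a header row, you may also want to keep it; a small variation on the same idea (rows2keep as in the question):

import io
import pandas as pd

lines_to_keep = set(rows2keep)

with open('train.csv') as f:
    header = next(f)                 # keep the header line
    kept = [line for i, line in enumerate(f, start=1) if i in lines_to_keep]

signal = pd.read_csv(io.StringIO(header + ''.join(kept)))

This still scans the whole file once, but it never holds more than the header and the selected lines in memory.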
I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with Python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried using a python module called sqlbiter but again, like the others, was never really able to output or convert the file.
I'm not sure where to go now. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data processing task as follows:
First, you need to read the JSON file using with, open and json.load.
Second, you need to change the format of your file a bit by changing the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, you can now utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, you can utilize pandas's to_csv function to write your DataFrame as a CSV file into disk.
It would look something like this:
import pandas as pd
import json

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)                 # a dict keyed by airport

l = list(j.values())                 # list of per-airport dicts
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
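Since the question also mentions SQLite, the same DataFrame can be written to a database instead of (or in addition to) the CSV. A short sketch using the standard sqlite3 module and pandas' to_sql; the file and table names here are just examples:

import sqlite3

conn = sqlite3.connect('./airports.db')
# writes the DataFrame built above into a table named 'airports'
df.to_sql('airports', conn, if_exists='replace', index=False)
conn.close()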
You need to load your JSON file and parse it so that all the fields are available, or load the contents into a dictionary. Then you can use pyodbc to write these fields to the database, or write them to a CSV file using the csv module.
But this is just a general idea. You need to study Python and learn how to do every step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
    cursor1.commit()
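A slightly more concrete sketch of that idea, using parameterized queries instead of building SQL strings by hand; the filename, connection string, table, and column names are all placeholders you would replace with your own:

import csv
import json
import pyodbc

with open('data.json') as f:
    records = json.load(f)            # assumed: a list of dicts

# write the same records to a CSV file
with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# or insert them into a database table
conn = pyodbc.connect('DSN=mydsn')    # depends entirely on your database setup
cursor = conn.cursor()
cursor.executemany(
    "INSERT INTO MYTABLE (FIELD1, FIELD2) VALUES (?, ?)",
    [(r['field1'], r['field2']) for r in records],
)
conn.commit()
conn.close()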
I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:
#attribute 'Diameter' numeric
#attribute 'Length' real
#attribute 'Qty' integer
Lines containing data using these attributes look like:
{0 0.86, 1 0.98, 2 7}
However, since my data is sparse data, each record from my database may not have each attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set, and then the second time to output my records, but I'm trying to find a more efficient method.
I'd like to try a method like the following pseudo-code:
fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "#attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)
It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?
I tried something like:
fout.seek(0)
for new_attribute in new_attributes:
    fout.write(new_attribute)
fout.seek(0, 2)
but this overwrites both the attribute lines and the data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.
How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.
Why don't you get a list of all the features and their data types; list them first. If a feature is missing, replace it with a known value - NULL seems appropriate.
This way your records will be complete (in length), and you don't have to hop around the file.
The other approach is to write two files: one contains all your features, the other all your rows. Once both files are generated, put the feature file at the top of the data file (a sketch of this follows below).
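A rough sketch of that two-file approach; the record iteration and the format_record serializer are placeholders for your own code:

import shutil

known_features = []          # preserves first-seen order, which fixes the indices
seen = set()

# first pass: stream the data lines to a temporary file, collecting features
with open('data_body.tmp', 'w') as body:
    for record in records:                              # records: your database iterator
        for feature in record:
            if feature not in seen:
                seen.add(feature)
                known_features.append(feature)
        body.write(format_record(record, known_features))   # your own serializer
        body.write('\n')

# second pass: write the header lines, then stream the body after them
with open('output.dat', 'w') as out:
    for feature in known_features:
        out.write("#attribute '%s' numeric\n" % feature)
    with open('data_body.tmp') as body:
        shutil.copyfileobj(body, out)                   # copies in chunks, never all at once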
FWIW, word processors load files into memory for editing and then write the entire file out. This is why you can't load a file larger than the addressable/available memory in a word processor, or in any other program that is not implemented as a stream reader.
Why don't you build the output in memory first (e.g. as a dict) and write it to a file after all data is known?
I'm sure there is an easy way to do this, so here goes. I'm trying to export my lists into CSV in columns. (Basically, it's how another program will be able to use the data I've generated.) I have the group called [frames] which contains [frame001], [frame002], [frame003], etc. I would like the CSV file that's generated to have all the values for [frame001] in the first column, [frame002] in the second column, and so on. I thought if I could save the file as CSV I could manipulate it in Excel, however, I figure there is a solution that I can program to skip that step.
This is the code that I have tried using so far:
import csv
data = [frames]
out = csv.writer(open(filename,"w"), delimiter=',',quoting=csv.QUOTE_ALL)
out.writerow(data)
I have also tried:
import csv
myfile = open(..., 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(mylist)
If there's a way to do this so that all the values are space separated, that would be ideal, but at this point I've been trying this for hours and can't get my head around the right solution.
What you're describing is that you want to transpose a 2-dimensional array of data. In Python you can achieve this easily with the zip function, as long as the inner lists are all the same length.
out.writerows(zip(*data))
If they are not all the same length, you can use itertools.izip_longest (itertools.zip_longest on Python 3) to fill the remaining fields with some default value (even '').
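A small self-contained example of the transposed write; the frame values here are made up, and the space-separated output matches what you asked for:

import csv
from itertools import zip_longest   # izip_longest on Python 2

frame001 = [0.1, 0.2, 0.3]
frame002 = [1.1, 1.2]
frame003 = [2.1, 2.2, 2.3, 2.4]
frames = [frame001, frame002, frame003]

with open('frames.csv', 'w', newline='') as f:
    out = csv.writer(f, delimiter=' ', quoting=csv.QUOTE_ALL)
    # zip_longest pads shorter lists so each column stays aligned with its frame
    out.writerows(zip_longest(*frames, fillvalue=''))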