sparkRDD and large file comparison - python

Scenario: I have 85789142 JSON documents and a text file with 32227957 items.
The textfile would look like:
url1
url2
url3
And a sample JSON document:
{"key1":"value1","key2":"value2","url":"some_url"}
I want to find the JSON documents corresponding to the items in the textfile.
What I've done:
import json
textfile_rdd = sc.textFile("path/to/textfile.txt")
urls = set(textfile_rdd.collect())
json_files_rdd = sc.textFile("path/to/the/directory/of/json/files")
json_rdd = json_files_rdd.filter(lambda x: json.loads(x).get("url") in urls)
This code works when the text file is small (I've tried with 500000 entries).
Currently I'm splitting my 32227957-line text file into smaller files; is there a better approach?

I would suggest parsing your JSON files with Spark SQL and loading the text file as a one-column DataFrame as well. Then you can simply join the two datasets, with no need to collect the url list to the driver, which is your scalability issue right now.
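A minimal sketch of that approach (assuming a SparkSession named spark and reusing the placeholder paths from the question; not a tested pipeline):
from pyspark.sql import functions as F

# Let Spark parse each line as a JSON record; the sample documents carry an "url" field.
json_df = spark.read.json("path/to/the/directory/of/json/files")

# Load the url list as a one-column DataFrame instead of collecting it to the driver.
urls_df = spark.read.text("path/to/textfile.txt").withColumnRenamed("value", "url")

# An inner join keeps only the JSON documents whose url appears in the text file.
# The broadcast hint helps only if the url list fits in executor memory;
# otherwise drop it and let Spark do a regular shuffle join.
matched = json_df.join(F.broadcast(urls_df), on="url", how="inner")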

Related

How to extract very nested json without pattern

I've been trying to normalize a JSON file and want a Python (pandas) or PySpark script, as generic as possible, that can extract data from a very nested MongoDB JSON (it comes from a third-party API and is saved in MongoDB) and return it as a relational dataset so we can consume it from the data lake.
There are a lot of records and fields, so we can't do it in only one dataframe. Also, the layout does not follow a pattern.
Could you please help us?
What is the best-practice way to do this and, if possible, to do it recursively?
Below is a chunk of the json file
https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json
We expect multiple dataframes that link to each other so we can consume the data like a relational database. Also, each file must behave like a database table.
Thanks a lot for your help!
A way to approach this problem would be using the json module to deserialize the data into a python dictionary.
# Get the data
import urllib.request as urllib
link = "https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json"
f = urllib.urlopen(link)
myfile = f.read()
# Deserialize
import json
data = json.loads(myfile)
data
From there you access the data with ordinary Python dictionary syntax.
E.g. to get eventos, which sits under nfe, which in turn sits under dados:
data["dados"]["nfe"]["eventos"]

How to extract information from multiple json files using python

I have multiple JSON files in a folder and have already implemented a way to pick up only the .json files in it. My problem is that I need to extract some information contained in each of those files, but it didn't work the way I expected. I need to find a way to get this information and convert it all into a pandas dataframe.
The variable jsons_data contains all .json files
jsons_data = json_import(path_to_json)
for index, js in enumerate(jsons_data):
    with open(os.path.join(path_to_json, js)) as json_file:
        data = json.loads(json_file)
        print(data)
The problem in your code is that on every iteration you overwrite the content of data (and json.loads expects a string, not a file object, so it would raise a TypeError anyway).
I assume you want to create one big dataframe from all the files; in that case you can do:
import os
import pandas as pd

dataframes = []
for js in jsons_data:
    dataframes.append(pd.read_json(os.path.join(path_to_json, js)))
df = pd.concat(dataframes)
See documentation about read_json -
https://pandas.pydata.org/pandas-docs/version/1.3.0rc1/reference/api/pandas.read_json.html

How do I export JSON data to CSV using python?

I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to give it the functionality that, once their table is created, they can export the data to a CSV/Excel file, so we don't have to store their credentials (logins & schedules) in a database. Is this possible? If so, how can I do it, preferably using Python?
This is not the exact answer, but rather the steps to follow to get a solution (a sketch of them follows below):
1. Read the data from JSON: some_dict = json.loads(json_string)
2. Write the code that pulls what you need out of the dictionary (sorting, conditions, etc.) and shapes it into a 2D list.
3. Save that list as CSV: https://realpython.com/python-csv/
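A minimal sketch of those three steps (the field names course, day and time are made up here; use whatever keys your schedule JSON actually contains):
import csv
import json

json_string = '[{"course": "Math", "day": "Mon", "time": "09:00"}, {"course": "Physics", "day": "Tue", "time": "11:00"}]'

# 1. Read the data from JSON.
records = json.loads(json_string)

# 2. Shape the records into a 2D list: a header row plus one row per record.
rows = [["course", "day", "time"]]
rows.extend([r["course"], r["day"], r["time"]] for r in records)

# 3. Save that list as CSV.
with open("schedule.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)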
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of:
import json
import pandas as pd

file = 'data.json'
with open(file) as j:
    json_data = json.load(j)
df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")

What is an efficient way to dump a json response from polygon api?

I'm downloading data from the Polygon API and, after checking the documentation, I realized that there is a limit on response size: each response holds at most 5000 records per request. Let's say I need to download several months' worth of data; it looks like there is no one-liner that fetches all the data for the specified period at once.
Here's what the response looks like for 4 daily data points, obtained with requests.get('query').json():
{
    "ticker": "AAPL",
    "status": "OK",
    "queryCount": 4,
    "resultsCount": 4,
    "adjusted": True,
    "results": [
        {"v": 152050116.0, "vw": 132.8458, "o": 132.76, "c": 134.18, "h": 134.8, "l": 130.53, "t": 1598932800000, "n": 1},
        {"v": 200117202.0, "vw": 131.6134, "o": 137.59, "c": 131.4, "h": 137.98, "l": 127, "t": 1599019200000, "n": 1},
        {"v": 257589206.0, "vw": 123.526, "o": 126.91, "c": 120.88, "h": 128.84, "l": 120.5, "t": 1599105600000, "n": 1},
        {"v": 336546289.0, "vw": 117.9427, "o": 120.07, "c": 120.96, "h": 123.7, "l": 110.89, "t": 1599192000000, "n": 1}
    ],
    "request_id": "bf5f3d5baa930697621b97269f9ccaeb"
}
I thought the fastest way is to write the content as is and process it later
with open(out_file, 'a') as out:
    out.write(f'{response.json()["results"][0]}\n')
And later, after I've downloaded what I need, I read the file and convert the data to a JSON file using pandas:
pd.DataFrame([eval(item) for item in open('out_file.txt')]).to_json('out_file.json')
Is there a better way of achieving the same thing? If anyone is familiar with scrapy feed exports, is there a way of dumping the data to a JSON file during the run without keeping anything in memory, which I think is how scrapy operates?
Instead of writing out the content as text, write it directly as JSON with a unique filename (e.g. derived from the request_id).
import json
# code for fetching data omitted.
data = response.json()
with open(out_file, 'w') as f:
    json.dump(data, f)
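For the unique filename, one possible sketch (assuming response comes from requests.get(...) as in the question and that every response carries a request_id like the sample above):
import json
import os

data = response.json()

# Name the file after the response's request_id so repeated requests never collide.
os.makedirs("dumped", exist_ok=True)
out_file = os.path.join("dumped", f"{data['request_id']}.json")
with open(out_file, "w") as f:
    json.dump(data, f)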
Then you can load all of them into Dataframes, e.g. similar to here: How to read multiple json files into pandas dataframe?:
from pathlib import Path # Python 3.5+
import pandas as pd
dfs = []
for path in Path('dumped').rglob('*.json'):
    tmp = pd.read_json(path)
    dfs.append(tmp)
df = pd.concat(dfs, ignore_index=True)

I have a few JSON files that are empty and are giving an exception when I try to loop through them. How do I make this work?

I'm doing some research on Cambridge Analytica and wanted to collect as many news articles as I can from some news outlets.
I was able to scrape them and now have a bunch of JSON files in a folder.
Some of them contain only [] while others have the data I need.
Using pandas I used the following and got every webTitle in the file.
df = pd.read_json(json_file)
df['webTitle']
The thing is that whenever there's an empty file it won't even let me assign df['webTitle'] to a variable.
Is there a way for me to check whether a file is empty and, if it is, just move on to the next one?
I want to make this into a spreadsheet with a few of the keys as columns and the values as rows, one row per news article.
My files are organized by day and I've used TheGuardian API to get the data.
I haven't written much yet, but just in case, here's the code as it is:
import os

import pandas as pd

def makePathToFile(path):
    pathToJson = []
    for root, sub, filenames in os.walk(path):
        for i in filenames:
            pathToJson.append(os.path.join(path, i))
    return pathToJson

def readJsonAndWriteCSV(pathToJson):
    for json_file in pathToJson:
        df = pd.read_json(json_file)
Thanks!
You can set up a Google Alert for the news keywords you want, then scrape the results in Python using https://pypi.org/project/galerts/
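Coming back to the empty-file part of the question, a minimal sketch that skips files containing only [] (this assumes pathToJson is the list built by makePathToFile above and that non-empty files hold article records with a webTitle key):
import pandas as pd

titles = []
for json_file in pathToJson:
    df = pd.read_json(json_file)
    # A file that contains only [] loads as an empty DataFrame; skip it.
    if df.empty or "webTitle" not in df.columns:
        continue
    titles.append(df[["webTitle"]])

# Concatenate whatever survived and write the spreadsheet.
if titles:
    pd.concat(titles, ignore_index=True).to_csv("webtitles.csv", index=False)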
