THE ERROR
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
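The example in the message is Scala. A rough PySpark equivalent of the suggested workaround (a minimal sketch, assuming schema and file are defined as in the message's example) looks like this:

from pyspark.sql.functions import col

# Cache the parsed result first, then query the corrupt-record column.
df = spark.read.schema(schema).csv(file).cache()
df.filter(col("_corrupt_record").isNotNull()).count()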
THE CODE
import itertools
from pyspark.sql.functions import explode, col
# Read the JSON file from Databricks storage
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 8))
d2 = dict(itertools.islice(data.items(), 8, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/new_test_header.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/new_test_detail.json")
A SAMPLE OF THE LARGE JSON FILE
{
  "reporting_entity_name": "launcher",
  "reporting_entity_type": "launcher",
  "plan_name": "launched",
  "plan_id_type": "hios",
  "plan_id": "1111111111",
  "plan_market_type": "individual",
  "last_updated_on": "2020-08-27",
  "version": "1.0.0",
  "in_network": [
    {
      "negotiation_arrangement": "ffs",
      "name": "Boosters",
      "billing_code_type": "CPT",
      "billing_code_type_version": "2020",
      "billing_code": "27447",
      "description": "Boosters On Demand",
      "negotiated_rates": [
        {
          "provider_groups": [
            {
              "npi": [
                0
              ],
              "tin": {
                "type": "ein",
                "value": "11-1111111"
              }
            }
          ],
          "negotiated_prices": [
            {
              "negotiated_type": "negotiated",
              "negotiated_rate": 123.45,
              "expiration_date": "2022-01-01",
              "billing_class": "organizational"
            }
          ]
        }
      ]
    }
  ]
}
Hi, I am trying to split a big JSON file into two files, which is what the above code does, but it fails with the error above. Since the message suggests caching, I used
.cache() at the end of loading the file, but I am still getting this error. Kindly let me know how I can solve it.
I was able to resolve this error by changing this:
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
to this:
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/new_test.json")
This works because Spark's JSON reader expects one JSON object per line by default; a pretty-printed multi-line document therefore ends up entirely in _corrupt_record, which is exactly the query pattern the error forbids. With multiline enabled, the whole file is parsed as a single JSON document.
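To confirm the fix, it helps to check that the parsed schema now shows the real fields instead of a lone _corrupt_record column. A quick check (a sketch, assuming the same mount path as above):

df_json = spark.read.option("multiline", "true").json("/mnt/BigData_JSONFiles/new_test.json")
df_json.printSchema()  # should now list reporting_entity_name, in_network, etc.
df_json.show(1, truncate=False)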
Related
Hello, this is the Python code I developed on my local machine, but now I am trying to run it on Databricks. I am new to Databricks, so I don't know how to do it.
What I am trying to do: I have a sample of a huge JSON file, and I am splitting it into two parts, one containing the headers and a second file containing all the details.
Here is my local-machine Python code.
import json
import itertools

# Load the whole JSON document into a dict.
with open('new_test.json', 'r') as fp:
    data = json.loads(fp.read())

# The first 8 key/value pairs form the header; the rest are the detail.
d1 = dict(itertools.islice(data.items(), 8))
print(d1)
d2 = dict(itertools.islice(data.items(), 8, len(data.items())))
print(d2)

with open("new_test_header.json", "w") as header_file:
    json.dump(d1, header_file)
with open("new_test_detail.json", "w") as detail_file:
    json.dump(d2, detail_file)
Here is the JSON file.
{
  "reporting_entity_name": "launcher",
  "reporting_entity_type": "launcher",
  "plan_name": "launched",
  "plan_id_type": "hios",
  "plan_id": "1111111111",
  "plan_market_type": "individual",
  "last_updated_on": "2020-08-27",
  "version": "1.0.0",
  "in_network": [
    {
      "negotiation_arrangement": "ffs",
      "name": "Boosters",
      "billing_code_type": "CPT",
      "billing_code_type_version": "2020",
      "billing_code": "27447",
      "description": "Boosters On Demand",
      "negotiated_rates": [
        {
          "provider_groups": [
            {
              "npi": [
                0
              ],
              "tin": {
                "type": "ein",
                "value": "11-1111111"
              }
            }
          ],
          "negotiated_prices": [
            {
              "negotiated_type": "negotiated",
              "negotiated_rate": 123.45,
              "expiration_date": "2022-01-01",
              "billing_class": "organizational"
            }
          ]
        }
      ]
    }
  ]
}
Here is what I am trying to write in Databricks:
import json
import itertools
from pyspark.sql.functions import explode, col

df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile.json")
display(df_json)
# These lines fail: a Spark DataFrame has no .items() method.
d1 = dict(itertools.islice(df_json.items(), 4))
d2 = dict(itertools.islice(df_json.items(), 4, len(df_json.items())))
# I am unable to write the WRITE function.
Any help or guidance would be greatly appreciated.
Here is a snippet example:
import itertools
from pyspark.sql.functions import explode, col
# Read the JSON file from Databricks storage
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 8))
d2 = dict(itertools.islice(data.items(), 8, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/new_test_header.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/new_test_detail.json")
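Note that toPandas() collects the entire file onto the driver, which can fail for a genuinely large file. Since the goal is just to separate the scalar header fields from the nested in_network detail, here is a sketch that stays in Spark (column names taken from the sample JSON above; the multiline option is needed for a pretty-printed file):

# Read the multi-line JSON document.
df_json = spark.read.option("multiline", "true").json("/mnt/BigData_JSONFiles/new_test.json")

# Header: every top-level column except the nested detail array.
header_cols = [c for c in df_json.columns if c != "in_network"]
df_json.select(header_cols).write.format("json").save("/mnt/BigData_JSONFiles/new_test_header.json")

# Detail: only the nested array.
df_json.select("in_network").write.format("json").save("/mnt/BigData_JSONFiles/new_test_detail.json")

This splits by column name rather than by taking the first eight dictionary items, but for this schema the result is the same header/detail division.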
I have this pandas DataFrame (DF1).
DF1 = DF1.groupby(['Name', 'Type', 'Metric'])
DF1 = DF1.first()
If I export it with DF1.to_excel("output.xlsx"), the format is correct; see screenshot 1.
But when I upload it to my Google Sheets using Python and gspread:
from gspread_formatting import *

worksheet5.clear()
set_with_dataframe(worksheet=worksheet1, dataframe=DF1, row=1, include_index=True,
                   include_column_header=True, resize=True)
That's the output.
How can I keep the same format in my Google Sheets, like in screenshot 1, using gspread_formatting?
Issue and workaround:
At the current stage, it seems that a data frame including merged cells cannot be directly put into a Spreadsheet with gspread. So, in this answer, I would like to propose a workaround. The flow of this workaround is as follows.
Prepare a data frame including the merged cells.
Convert the data frame to an HTML table.
Put the HTML table with the batchUpdate method of Sheets API.
By this flow, the values can be put into the Spreadsheet with the merged cells. When this is reflected in a script, it becomes the following sample.
Sample script:
# This is from your script.
DF1 = DF1.groupby(["Name", "Type", "Metric"])
DF1 = DF1.first()

# I added the below script.
spreadsheetId = "###"  # Please set your spreadsheet ID.
sheetName = "Sheet1"  # Please set the sheet name where you want to put the values.
spreadsheet = client.open_by_key(spreadsheetId)  # client is your authorized gspread client.
sheet = spreadsheet.worksheet(sheetName)
body = {
    "requests": [
        {
            "pasteData": {
                "coordinate": {"sheetId": sheet.id},
                "data": DF1.to_html(),
                "html": True,
                "type": "PASTE_NORMAL",
            }
        }
    ]
}
spreadsheet.batch_update(body)
When this script is run with your sample values including the merged cells, the values are put into the Spreadsheet with the merged cells reflected.
If you want to clear the cell format, please modify body as follows.
body = {
    "requests": [
        {
            "pasteData": {
                "coordinate": {"sheetId": sheet.id},
                "data": DF1.to_html(),
                "html": True,
                "type": "PASTE_NORMAL",
            }
        },
        {
            "repeatCell": {
                "range": {"sheetId": sheet.id},
                "cell": {},
                "fields": "userEnteredFormat",
            }
        },
    ]
}
References:
Method: spreadsheets.batchUpdate
PasteDataRequest
I want to store key-value JSON data in AWS DynamoDB, where the key is a date string in YYYY-mm-dd format and the value is entries, a Python dictionary. When I used the boto3 client to save the data there, it was saved as a typed DynamoDB object, which I don't want. My purpose is simple: store JSON data against a key which is a date, so that later I can query the data by giving that date. I am struggling with this issue because I did not find any relevant reference that explains how to store JSON data and retrieve it without any conversion.
I need help solving this in Python.
What I am doing now:
import boto3

item = {
    "entries": [
        {
            "path": [
                {
                    "name": "test1",
                    "count": 1
                },
                {
                    "name": "test2",
                    "count": 2
                }
            ],
            "repo": "test3"
        }
    ],
    "date": "2022-10-11"
}

dynamodb_client = boto3.resource('dynamodb')
table = dynamodb_client.Table(table_name)
response = table.put_item(Item=item)
What was actually saved:
[{"M":{"path":{"L":[{"M":{"name":{"S":"test1"},"count":{"N":"1"}}},{"M":{"name":{"S":"test2"},"count":{"N":"2"}}}]},"repo":{"S":"test3"}}}]
But I want to save exactly the same JSON data as it is, without any conversion at all.
When I retrieve it programmatically, you can see the difference: single quotes appear and the count values change.
response = table.get_item(
    Key={
        "date": "2022-10-12"
    }
)
Output
{'Item': {'entries': [{'path': [{'name': 'test1', 'count': Decimal('1')}, {'name': 'test2', 'count': Decimal('2')}], 'repo': 'test3'}], 'date': '2022-10-12'}}
Sample picture:
Why not store it as a single attribute of type string? Then you’ll get out exactly what you put in, byte for byte.
When you store this in DynamoDB, you get exactly what you provided: the key is your date, and you have a list of entries.
If you need it stored in a different format, you need to provide JSON that matches what you need. It's important to note that DynamoDB is a key-value store, not a document store; it is worth looking up the differences between the two.
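If you do keep the native map/list types, note that boto3's resource layer returns DynamoDB numbers as decimal.Decimal, which is why count comes back as Decimal('1'). A small sketch (assuming response is the get_item result from the question) that serializes such a response back to plain JSON:

import json
from decimal import Decimal

def decimal_default(obj):
    # boto3 returns DynamoDB numbers as Decimal; convert them for json.dumps.
    if isinstance(obj, Decimal):
        return int(obj) if obj % 1 == 0 else float(obj)
    raise TypeError(f"Not JSON serializable: {type(obj)}")

print(json.dumps(response["Item"], default=decimal_default))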
I figured out how to solve this issue. I have two columns, date and entries, in my DynamoDB table (also visible in the screenshot in the question).
I convert the entries value from a list to a string, then save it in the DB. At retrieval time I do the reverse: parse the string, build a proper JSON response, and return it.
I am also sharing sample code below so that anybody else dealing with the same situation has at least one option.
# While storing:
import json
import boto3

entries_string = json.dumps([
    {
        "path": [
            {
                "name": "test1",
                "count": 1
            },
            {
                "name": "test2",
                "count": 2
            }
        ],
        "repo": "test3"
    }
])

item = {
    "entries": entries_string,
    "date": "2022-10-12"
}

dynamodb_client = boto3.resource('dynamodb')
table = dynamodb_client.Table(<TABLE-NAME>)
table.put_item(Item=item)
-------------------------
# While fetching:
response = table.get_item(
    Key={
        "date": "2022-10-12"
    }
)['Item']

entries_string = response['entries']
entries_dic = json.loads(entries_string)
response['entries'] = entries_dic
print(json.dumps(response))
I have a config file:
Position,ColumnName
1,TXS_ID
4,TXX_NAME
8,AGE
As per the above positions I have 1, 4, 8: only 3 columns are available. Between 1 and 4 we don't have positions 2 and 3, which I want to fill with null values.
Based on the above config file, I am trying to parse data from a JSON file using Python, but I need to define the columns based on the positions mentioned above. When the script runs, if "TXS_ID" is available it should pick its data from the JSON file, and since positions 2 and 3 don't exist, I want to keep them as nulls.
Sample output file
TSX_ID,,,TXX_NAME,,,,AGE
10000,,,AAAAAAAAA,,,,40
As specified in the config file, data should be extracted from the JSON file, and if a position is missing, as in the example above, it should be filled with nulls. Please help me if there is any way I can achieve this.
Below is the sample Json File.
{
  "entities": [
    {
      "id": "XXXXXXXXXXXXXXX",
      "data": {
        "attributes": {
          "TSX_ID": {
            "values": [
              {
                "value": 10000
              }
            ]
          },
          "TXX_NAME": {
            "values": [
              {
                "value": "AAAAAAAAA"
              }
            ]
          },
          "AGE": {
            "values": [
              {
                "value": "40"
              }
            ]
          }
        }
      }
    }
  ]
}
Assuming that the config file line 1,TXS_ID has a typo and is actually 1,TSX_ID, this program works with your sample data (see explanations in comments):
import json
import pandas

# read the "config file" into a Series of the "ColumnName"s
# (.squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0):
config = pandas.read_csv('config', index_col='Position').squeeze('columns')
maxdex = config.index[-1]  # get the maximum Position
# fill the Positions missing in the "config file" with empty "ColumnName"s:
config = config.reindex(range(1, maxdex + 1), fill_value='')

sample = json.load(open('sample.json'))

# create an empty DataFrame with the desired columns:
output = pandas.DataFrame(columns=config.values)
# now insert the nested JSON data values into the given columns:
for a in config.values:
    if a:  # only if not an empty column name, of course
        output[a] = [av['value'] for e in sample['entities']
                     for av in e['data']['attributes'][a]['values']]
output.to_csv('output.csv', index=False)
I am currently trying to convert a JSON file to a CSV file using pandas.
The code that I'm using now is able to convert the JSON to a CSV file.
import pandas as pd
from pandas.io.json import json_normalize

json_data = pd.read_json("out1.json")
df = json_normalize(json_data["events"])
df.to_csv("out.csv")
This is my JSON file:
{
  "events": [
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190572023,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
      "id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Disabled camera with QR scan on by 80801234 at Area A\n",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    },
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190528619,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Employee number saved successfully.",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    }
  ]
}
But what I wanted was just some of the fields (timestamp, level, message) inside the JSON file, not all of it.
I have tried a variety of ways:
df = json_normalize(json_data["timestamp"])  # gives a KeyError on 'timestamp'
df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']])  # TypeError: string indices must be integers
Where did I go wrong?
I don't think json_normalize is intended to work on this specific orientation. I could be wrong but from the documentation, it appears that normalization means "Deal with lists within each dictionary".
Assume data is
data = json.load(open('out1.json'))['events']
Look at the first entry
data[0]['timestamp']
1537190572023
json_normalize wants this to be a list
[{'timestamp': 1537190572023}]
Create augmented data2
I don't actually recommend this approach.
If we create data2 accordingly:
data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]
We can use json_normalize
json_normalize(
data2, 'timestamp',
[['event', 'json', 'level'], ['event', 'json', 'message']]
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
Comprehension
I think it's simpler to just do
pd.DataFrame([
(d['timestamp'],
d['event']['json']['level'],
d['event']['json']['message'])
for d in data
], columns=['timestamp', 'level', 'message'])
timestamp level message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
json_normalize
But without the fancy arguments
json_normalize(data).pipe(
lambda d: d[['timestamp']].join(
d.filter(like='event.json')
)
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.