I have a JSON file that looks like this:
{
    "Person A": {
        "Company A": {
            "Doctor": {
                "Morning": "2000",
                "Afternoon": "1200"
            },
            "Nurse": {}
        }
    },
    "Person B": {
        "Education": {
            "main": {
                "Primary school": {
                    "2012": "2A",
                    "2013": "3A"
                },
                "Secondary school": {
                    "2016": "1K",
                    "2017": "2K"
                }
            }
        }
    }
}
How do I extract the tables from this JSON (dropping the intermediate "main" level) into Excel files like these?
primary_school.xlsx:
year, class
secondary_school.xlsx:
year, class
PersonA_CompanyA_Doctor.xlsx:
Time, salary
PersonA_CompanyA_Nurse.xlsx:
Time, salary
I have tried json_normalize but still cannot get the result that I want:
pd.json_normalize(file, max_level=1)
Is there a simple way of doing it with a DataFrame?
The JSON data you presented is a nested tree with several levels of connections. The first step, regardless of the format of the leaf values, is to walk that structure and flatten it.
After that step you should have a flat, iterable collection in which each entry corresponds to one of the xlsx file names you specified.
If you are specifically asking about the traversal part, we can find a general solution by simplifying the example.
But if you want to continue:
Examine the detailed example below and, if necessary, install the relevant package with the pip install xlsxwriter CLI command.
With that list in hand, you can create the xlsx files you want, one by one.
import xlsxwriter
# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()
# Some data we want to write to the worksheet.
expenses = (
    ['Rent', 1000],
    ['Gas', 100],
    ['Food', 300],
    ['Gym', 50],
)
# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')
workbook.close()
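Applied to your specific JSON, here is a minimal sketch of the same idea using pandas instead of raw xlsxwriter (my assumptions: the data lives in data.json, the intermediate "main" level should be dropped from the file names, and an Excel writer engine such as openpyxl or xlsxwriter is installed for to_excel). It treats every dict whose values are not dicts (including the empty Nurse dict) as one table and writes each to its own file; rename the generic columns and the file names to match your exact targets.
import json
import pandas as pd

# Assumed input file name.
with open("data.json") as f:
    data = json.load(f)

def tables(node, path):
    """Yield (path, table_dict) for every leaf-level dict in the nested JSON."""
    if isinstance(node, dict) and all(not isinstance(v, dict) for v in node.values()):
        yield path, node  # leaf table; may be empty, like "Nurse"
    elif isinstance(node, dict):
        for key, child in node.items():
            # Drop the intermediate "main" level so it does not appear in file names.
            next_path = path if key == "main" else path + [key]
            yield from tables(child, next_path)

for path, table in tables(data, []):
    # Generic column names; rename per table (year/class or Time/salary).
    df = pd.DataFrame(list(table.items()), columns=["key", "value"])
    filename = "_".join(p.replace(" ", "") for p in path) + ".xlsx"
    df.to_excel(filename, index=False)  # e.g. PersonA_CompanyA_Doctor.xlsx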
I have this pandas DataFrame (DF1):
DF1= DF1.groupby(['Name', 'Type', 'Metric'])
DF1= DF1.first()
If I output it with DF1.to_excel("output.xlsx"), the format is correct, see below:
But when I upload it to my Google Sheets using Python and gspread:
from gspread_formatting import *
worksheet5.clear()
set_with_dataframe(worksheet=worksheet1, dataframe=DF1, row=1, include_index=True,
include_column_header=True, resize=True)
This is the output.
How can I keep the same format in my Google Sheets using gspread_formatting, like in screenshot 1?
Issue and workaround:
At the current stage, it seems that a data frame that includes merged cells cannot be put directly into the Spreadsheet with gspread. So, in this answer, I would like to propose a workaround. The flow of this workaround is as follows.
Prepare a data frame including the merged cells.
Convert the data frame to an HTML table.
Put the HTML table with the batchUpdate method of Sheets API.
By this flow, the values can be put into the Spreadsheet with the merged cells. When this flow is reflected in a sample script, it becomes as follows.
Sample script:
# This is from your script.
DF1 = DF1.groupby(["Name", "Type", "Metric"])
DF1 = DF1.first()
# I added the below script.
spreadsheetId = "###" # Please set your spreadsheet ID.
sheetName = "Sheet1" # Please set your sheet name you want to put the values.
spreadsheet = client.open_by_key(spreadsheetId)
sheet = spreadsheet.worksheet(sheetName)
body = {
    "requests": [
        {
            "pasteData": {
                "coordinate": {"sheetId": sheet.id},
                "data": DF1.to_html(),
                "html": True,
                "type": "PASTE_NORMAL",
            }
        }
    ]
}
spreadsheet.batch_update(body)
When this script is run with your sample values including the merged cells, the values are put into the Spreadsheet with the merged cells reflected.
If you want to clear the cell format, please modify body as follows.
body = {
    "requests": [
        {
            "pasteData": {
                "coordinate": {"sheetId": sheet.id},
                "data": DF1.to_html(),
                "html": True,
                "type": "PASTE_NORMAL",
            }
        },
        {
            "repeatCell": {
                "range": {"sheetId": sheet.id},
                "cell": {},
                "fields": "userEnteredFormat",
            }
        },
    ]
}
References:
Method: spreadsheets.batchUpdate
PasteDataRequest
What I'm basically doing is scraping data from websites -> saving to CSV -> converting to JSON and posting the JSON data to Firebase using firebase-import. This is the doc for firebase-import: https://github.com/FirebaseExtended/firebase-import.
Since Firebase doesn't allow creating unique keys via firebase-import, the data is completely dependent on the index number of the JSON data (as you can see in the picture, no index number is the same, so each entry is considered a unique object).
This is the screenshot of the Firebase database and what I mean by index.
This is the same data (structure) in raw format:
{
    "0": {
        "title": "Title 1",
        "description": "Description 1 here"
    },
    "1": {
        "title": "Title 2",
        "description": "Description 2 here"
    },
    "2": {
        "title": "Title 3",
        "description": "Description 3 here"
    }
}
I'm using Python (the pandas module) to convert the received CSV to JSON. This is the code:
csv_file = pd.DataFrame(pd.read_csv("file_without_dupes.csv", sep = ",", header = 0, index_col = False, encoding='utf-8-sig'))
json_file = csv_file.to_json( orient = "index", date_format = "epoch", double_precision = 10, date_unit = "ms", default_handler = None)
file = open("MiningJson.json", "w")
file.write(json_file)
file.close()
The issue is that every time the next conversion happens, the index starts from 0 again, increments, and overwrites all the previous indexes. This is what happens when new data is inserted:
{
    "0": {
        "title": "Title 4",
        "description": "Description 4 here"
    }
}
{
    "1": {
        "title": "Title 5",
        "description": "Description 5 here"
    }
}
Also, after every iteration, the JSON file is cleared completely.
Is it possible to avoid this, for example by getting the last index of the JSON and saving it to a txt file, so that whenever new data is inserted, it gets the last index from the txt file and increments from there? I think that way it will never overwrite the existing data, and since the numbering never runs out, it will be unique every time.
Please mention a better solution if you think of one. Thanks.
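Here is a minimal sketch of exactly that counter idea (the file name last_index.txt and the offset handling are my assumptions, not part of your original code): read the last used index from a text file, shift the DataFrame index by it before calling to_json, and write the next free index back for the following run.
import os
import pandas as pd

COUNTER_FILE = "last_index.txt"  # hypothetical file holding the next free index

# Read the next free index; start at 0 if the counter file does not exist yet.
offset = 0
if os.path.exists(COUNTER_FILE):
    with open(COUNTER_FILE) as f:
        offset = int(f.read().strip() or 0)

csv_file = pd.read_csv("file_without_dupes.csv", sep=",", header=0,
                       index_col=False, encoding="utf-8-sig")

# Shift the index so new records continue after the previously exported ones.
csv_file.index = csv_file.index + offset

json_file = csv_file.to_json(orient="index", date_format="epoch",
                             double_precision=10, date_unit="ms")

with open("MiningJson.json", "w") as f:
    f.write(json_file)

# Remember the next free index for the following run.
with open(COUNTER_FILE, "w") as f:
    f.write(str(offset + len(csv_file)))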
I have scraped data with which I overwrite a Google Sheet daily.
The problem is that I'm unable to find an option to set the number of rows and columns for the existing Google Sheet.
I noticed that, according to the documentation, this can be done only for a newly created sheet, but I don't know how to do it for an existing sheet!
def api(key):
    myfilt = [list of lists]
    columns = [name of columns]
    gc = gspread.service_account(filename='Auth.json')
    sh = gc.open_by_key(key)
    worksheet = sh.sheet1
    worksheet.clear()
    head = worksheet.insert_row(columns, 1)
    res = worksheet.insert_rows(myfilt, 2)
api("MyAPIHere")
My target here is to predefine the number of rows according to len(myfilt) and the number of columns according to len(cols).
I believe your goal is as follows.
You want to change the max row and column number of the existing sheet in the Google Spreadsheet.
You want to achieve this using gspread with python.
You have already been able to get and put values for Google Spreadsheet using Sheets API.
Points for achieving your goal:
In this case, it is required to use the "spreadsheets.batchUpdate" method of the Sheets API. I would like to propose the following flow.
Insert one row.
Insert one column.
Delete rows from 2 to end.
Delete columns from 2 to end.
Insert rows. In this case, you can set the number of rows you want to insert.
Insert columns. In this case, you can set the number of columns you want to insert.
Steps 1 and 2 are used to avoid an error: when a DeleteDimensionRequest is run on a sheet that has only one row or one column, an error occurs.
When the above flow is reflected in a script using gspread, it becomes as follows.
Sample script:
Please set the Spreadsheet ID and sheet name.
spreadsheetId = "###" # Please set the Spreadsheet ID.
sheetName = "###" # Please set the sheet name.
client = gspread.authorize(credentials)
spreadsheet = client.open_by_key(spreadsheetId)
# worksheet = spreadsheet.worksheet(sheetName)
sheetId = spreadsheet.worksheet(sheetName)._properties['sheetId']
rows = len(myfilt)
columns = len(cols)
req = {
    "requests": [
        {
            "insertDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 0,
                    "endIndex": 1,
                    "dimension": "ROWS"
                }
            }
        },
        {
            "insertDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 0,
                    "endIndex": 1,
                    "dimension": "COLUMNS"
                }
            }
        },
        {
            "deleteDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 1,
                    "dimension": "ROWS"
                }
            }
        },
        {
            "deleteDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 1,
                    "dimension": "COLUMNS"
                }
            }
        },
        {
            "insertDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 0,
                    "endIndex": rows - 1,
                    "dimension": "ROWS"
                }
            }
        },
        {
            "insertDimension": {
                "range": {
                    "sheetId": sheetId,
                    "startIndex": 0,
                    "endIndex": columns - 1,
                    "dimension": "COLUMNS"
                }
            }
        }
    ]
}
res = spreadsheet.batch_update(req)
print(res)
References:
Method: spreadsheets.batchUpdate
DeleteDimensionRequest
InsertDimensionRequest
batch_update(body)
I used the following to solve my issue as well:
worksheet.clear() # to clear the sheet firstly.
head = worksheet.insert_row(header, 1) # inserting the header at first row
res = worksheet.insert_rows(mydata, 2) # inserting my data.
worksheet.resize(rows=len(mydata) + 1, cols=len(header)) # resize according to length of cols and rows.
I'm currently trying to process JSON as a pandas DataFrame. What happens here is that I get a continuous stream of JSON structures. They are simply appended to one another, all on a single line. I extracted a .txt file from it and now want to analyse it with pandas.
Example snippet:
{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}...
As you can see in this snippet, every JSON object starts with {"positionFlightMessage": and ends with "messageSubtype":"ADSB".
After one JSON object ends, the next one is appended right after it.
What I need is a table out of it, like this:
95b3b6ca-5dd2-44b4-918a-baa51022d143 1.0-RC1 1533134514 DLH1601 44.14525 -1.31849 340 24.0 ADSB AFR1601-1532928365-airline-0002 AFR AFR89GA 442.0 34000.0 ADSB
884708c1-2fff-4ebf-b72c-bbc6ed2c3623 1.0-RC1 1533134515 DLH012 37.34542 143.79951 320 54.0 ADSB EVA12-1532928367-airline-0096 DLH EVA012 462.0 32000.0 ADSB
I tried to use pandas read_json but I get an error.
import pandas as pd
df = pd.read_json("tD.txt",orient='columns')
df.head()
ValueError: Trailing data
tD.txt has the snippet given above, without the trailing (...) dots.
I think the problem is that every JSON object is just appended. I could add a new line after every
messageSubtype":"ADSB"}}
and then read it, but maybe you have a solution where I can convert the big txt file directly and easily into a DataFrame.
Try to get the stream of JSON to look like the following.
Notice the starting '[' and the ending ']'.
Also notice the ',' between each JSON object.
data = [{
    "positionFlightMessage": {
        "messageUuid": "95b3b6ca-5dd2-44b4-918a-baa51022d143",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134514,
        "flightNumber": "DLH1601",
        "position": {
            "waypoint": {
                "latitude": 44.14525,
                "longitude": -1.31849
            },
            "flightLevel": 340,
            "heading": 24.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "AFR1601-1532928365-airline-0002",
        "airlineIcaoCode": "AFR",
        "atcCallsign": "AFR89GA",
        "fuel": {},
        "speed": {
            "groundSpeed": 442.0
        },
        "altitude": {
            "altitude": 34000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}, {
    "positionFlightMessage": {
        "messageUuid": "884708c1-2fff-4ebf-b72c-bbc6ed2c3623",
        "schemaVersion": "1.0-RC1",
        "timestamp": 1533134515,
        "flightNumber": "DLH012",
        "position": {
            "waypoint": {
                "latitude": 37.34542,
                "longitude": 143.79951
            },
            "flightLevel": 320,
            "heading": 54.0
        },
        "messageSource": "ADSB",
        "flightUniqueId": "EVA12-1532928367-airline-0096",
        "airlineIcaoCode": "DLH",
        "atcCallsign": "EVA012",
        "fuel": {},
        "speed": {
            "groundSpeed": 462.0
        },
        "altitude": {
            "altitude": 32000.0
        },
        "nextPosition": {
            "waypoint": {}
        },
        "messageSubtype": "ADSB"
    }
}]
Now you should be able to loop over each list element in the JSON and append it to the pandas DataFrame.
print(len(data))
for i in range(0, len(data)):
    # Here we just show messageSource; up to you to extract the rest.
    print(data[i]['positionFlightMessage']['messageSource'])
    # Instead of printing here you should append it to a pandas df.
Hope this helps you out a bit.
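Instead of appending row by row, a minimal sketch (assuming the list above is bound to a variable named data) can flatten everything in one call with pandas.json_normalize; the dotted column names below are what json_normalize produces by default for nested keys.
import pandas as pd

# Flatten each nested record into dotted columns such as
# positionFlightMessage.messageUuid, positionFlightMessage.position.waypoint.latitude, ...
df = pd.json_normalize(data)

print(df[["positionFlightMessage.messageUuid",
          "positionFlightMessage.flightNumber",
          "positionFlightMessage.position.waypoint.latitude",
          "positionFlightMessage.speed.groundSpeed"]])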
Now here's a solution for your JSON as-is, using regex.
s = '{"positionFlightMessage":{"messageUuid":"95b3b6ca-5dd2-44b4-918a-baa51022d143","schemaVersion":"1.0-RC1","timestamp":1533134514,"flightNumber":"DLH1601","position":{"waypoint":{"latitude":44.14525,"longitude":-1.31849},"flightLevel":340,"heading":24.0},"messageSource":"ADSB","flightUniqueId":"AFR1601-1532928365-airline-0002","airlineIcaoCode":"AFR","atcCallsign":"AFR89GA","fuel":{},"speed":{"groundSpeed":442.0},"altitude":{"altitude":34000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}{"positionFlightMessage":{"messageUuid":"884708c1-2fff-4ebf-b72c-bbc6ed2c3623","schemaVersion":"1.0-RC1","timestamp":1533134515,"flightNumber":"DLH012","position":{"waypoint":{"latitude":37.34542,"longitude":143.79951},"flightLevel":320,"heading":54.0},"messageSource":"ADSB","flightUniqueId":"EVA12-1532928367-airline-0096","airlineIcaoCode":"DLH","atcCallsign":"EVA012","fuel":{},"speed":{"groundSpeed":462.0},"altitude":{"altitude":32000.0},"nextPosition":{"waypoint":{}},"messageSubtype":"ADSB"}}'
import re
import json
replaced = json.loads('['+re.sub(r'{\"positionFlightMessage*', ',{\"positionFlightMessage', s)[1:] + ']')
dfTemp = pd.DataFrame(data=replaced)
df = pd.DataFrame()
counter = 0
def newDf(row):
    global df, counter
    counter += 1
    temp = pd.DataFrame([row])
    df = df.append(temp)
dfTemp['positionFlightMessage'] = dfTemp['positionFlightMessage'].apply(newDf)
print(df)
First we replace all occurrences of {"positionFlightMessage with ,{"positionFlightMessage and discard the first separator.
We create a dataframe out of this but we have only one column here. Use the apply function on the column and create a new dataframe out of it.
From this dataframe, you can perform some more cleaning.
I have the following nested JSON file, which I need to convert into a pandas DataFrame. The main problem is that there is only one unique item in the whole JSON, and it is very deeply nested.
I tried to solve this problem with the following code, but it gives repeating output.
[{
    "questions": [{
            "key": "years-age",
            "responseKey": null,
            "responseText": "27",
            "responseKeys": null
        },
        {
            "key": "gender",
            "responseKey": "male",
            "responseText": null,
            "responseKeys": null
        }
    ],
    "transactions": [{
        "accId": "v1BN3o9Qy9izz4Jdz0M6C44Oga0qjohkOV3EJ",
        "tId": "80o4V19Kd9SqqN80qDXZuoov4rDob8crDaE53",
        "catId": "21001000",
        "tType": "80o4V19Kd9SqqN80qDXZuoov4rDob8crDaE53",
        "name": "Online Transfer FROM CHECKING 1200454623",
        "category": [
            "Transfer",
            "Acc Transfer"
        ]
    }],
    "institutions": [{
        "InstName": "Citizens company",
        "InstId": "inst_1",
        "accounts": [{
            "pAccId": "v1BN3o9Qy9izz4Jdz0M6C44Oga0qjohkOV3EJ",
            "pAccType": "depo",
            "pAccSubtype": "check",
            "_id": "5ad38837e806efaa90da4849"
        }]
    }]
}]
I need to convert this to pandas dataframe as follows:
id pAccId tId
5ad38837e806efaa90da4849 v1BN3o9Qy9izz4Jdz0M6C44Oga0qjohkOV3EJ 80o4V19Kd9SqqN80qDXZuoov4rDob8crDaE53
The main problem I am facing is with the "id", as it is very deeply nested and is the only unique key in the JSON.
Here is my code:
import pandas as pd
import json
with open('sub.json') as f:
    data = json.load(f)
csv = ''
for k in data:
    for t in k.get("institutions"):
        csv += k['institutions'][0]['accounts'][0]['_id']
        csv += "\t"
        csv += k['institutions'][0]['accounts'][0]['pAccId']
        csv += "\t"
        csv += k['transactions'][0]['tId']
        csv += "\t"
        csv += "\n"
text_file = open("new_sub.csv", "w")
text_file.write(csv)
text_file.close()
Hope the above code makes sense, as I am new to Python.
Read the JSON file and create a dictionary mapping each account's pAccId to the account.
Build the sequence of transactions as well.
with open('sub.json', 'r') as file:
    records = json.load(file)
accounts = {
    account['pAccId']: account
    for record in records
    for institution in record['institutions']
    for account in institution['accounts']
}
transactions = (
    transaction
    for record in records
    for transaction in record['transactions']
)
Open a CSV file. For each transaction, get its account from the accounts dictionary.
with open('new_sub.csv', 'w') as file:
    file.write('id, pAccId, tId\n')
    for transaction in transactions:
        pAccId = transaction['accId']
        account = accounts[pAccId]
        _id = account['_id']
        tId = transaction['tId']
        file.write(f"{_id}, {pAccId}, {tId}\n")
Finally, read the CSV file into a pandas.DataFrame.
df = pd.read_csv('new_sub.csv')
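If you prefer to skip the intermediate CSV, a minimal sketch (reusing the accounts dictionary from above, and rebuilding transactions as a list because the generator above is consumed by the CSV loop) builds the DataFrame directly:
import pandas as pd

# Rebuild transactions as a list so it can be iterated again.
transactions = [
    transaction
    for record in records
    for transaction in record['transactions']
]

rows = [
    {
        "id": accounts[t["accId"]]["_id"],
        "pAccId": t["accId"],
        "tId": t["tId"],
    }
    for t in transactions
]
df = pd.DataFrame(rows, columns=["id", "pAccId", "tId"])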