I am converting dictionary to dataframe in python and output has changed

I am converting dictionary to dataframe in python and output has changed - python

I am converting dictionary to dataframe in python and output has changed. Did i write anything wrong here.
In dictionary output we have source id for 1 time but after converted to dataframe source id coming 4 times but i want parsed_address should be multivalue. I want dictionary output same in dataframe only.
Dictionary Output:
dic--->
{
"source_id":123,
"parsed_address":[
{
"address_type":"Primary",
"address":[
{
"address_line_1":"18 Atherton",
"address_line_2":"nan"
}
]
},
{
"address_type":"directory",
"address":[
{
"address_line_1":"130 E Chapman Ave Apt 312",
"address_line_2":"nan"
}
]
},
{
"address_type":"home",
"address":[
{
"address_line_1":"9722 Walnut St",
"address_line_2":"nan"
}
]
},
{
"address_type":"work",
"address":[
{
"address_line_1":"1325 S Grand Ave",
"address_line_2":"nan"
}
]
}
]
}
DataFrame Output:
[
{
"parsed_address":{
"address_type":"Primary",
"address":[
{
"address_line_1":"18 Atherton",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"directory",
"address":[
{
"address_line_1":"130 E Chapman Ave Apt 312",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"home",
"address":[
{
"address_line_1":"9722 Walnut St",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"work",
"address":[
{
"address_line_1":"1325 S Grand Ave",
"address_line_2":null
}
]
},
"source_id":"90353002"
}
]
code
print("dic--->",dic)
df4 = pd.DataFrame(dic)
print("df4--->",df4)
Converted dictionary to dataframe and output has changed.

This is recycling of scalar values.
This behavior is also why you can assign a scalar to a new column and the number of rows stays the same.
You can wrap the parsed address into another list to get a single row.

Related

Find the longest group after groupby on normalized json in pandas

My code below groups by values and creates a list of values that were once the length of arrays. But how can I return the id that has the largest sum of each number in the elements:
Original Json read into df (not same data as printed because it was too long)
{
"kind":"admin#reports#activities",
"etag":"\"5g8\"",
"nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
"items":[
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:59:39.421Z",
"uniqueQualifier":"5526793068617678141",
"applicationName":"token",
"customerId":"cds"
},
"etag":"\"jkYcURYoi8\"",
"actor":{
"email":"blah#blah.net",
"profileId":"1323"
},
"ipAddress":"107.178.193.87",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
},
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"df"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"blah.blah#bebe.net",
"profileId":"1324"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
}
]
}
current code:
df = pd.json_normalize(response['items'])
df['test'] = df.groupby('actor.profileId')['events'].apply(lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))])
output:
ID
1002306 [7, 7, 7, 5]
1234444 [3,5,6]
1222222 [1,3,4,5]
desired output
id total
1002306 26
Sorry had to fill up more space, as there was so much code

There’s no need to construct the intermediate df and do groupby on it. You can use pass the record and meta paths to json_normalize to directly flatten the json data. Then your job seems to be about counting the number of rows per actor.profileId and finding the maximum.
df = pd.json_normalize(response['items'], ['events','parameters'], ['actor'])
df['actor.profileId'] = df['actor'].str['profileId']
out = df.value_counts('actor.profileId').pipe(lambda x: x.iloc[[0]])
Output:
actor.profileId
1323 7
dtype: int64

Pandas Nested Array Columns

Pandas Nested Array Columns: Details of my question listed below:
I have a column that is a nested array in pandas, and is as below when you print
How can I get the length of the nested parameters array to be a new column value ?
print(df["arraycolumn"])
Prints:
[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
] }, {
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"C02f6wppb"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"nancy.admin#hyenacapital.net",
"profileId":"100230688039070881323"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"app_name",
"value":"Zapier"
},
{
"name":"client_type",
"value":"WEB"
}
]

You can use .apply to apply an arbitrary Python function to each element of a Series:
def get_parameters(obj):
return obj[0]["parameters"]
df["arraycolumn_parameters_length"] = (
df["arraycolumn"]
.apply(lambda y: len(get_parameters(y)))
)
Of course, you might have to add error checking or other additional logic to the get_parameters function above, as needed.

How about using .str?
df['arraycolumn'].str['parameters'].str.len()

Best way to extract/format data in JSON format using Python?

I am trying to do some data analysis on bulk patent data (data is usually found here but is currently down - https://ped.uspto.gov/peds/).
Here is the first entry in the JSON file:
{
"PatentBulkData":[
{
"patentCaseMetadata":{
"applicationNumberText":{
"value":"15733015",
"electronicText":"15733015"
},
"filingDate":"2020-01-01",
"applicationTypeCategory":"Utility",
"partyBag":{
"applicantBagOrInventorBagOrOwnerBag":[
{
"applicant":[
{
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"personStructuredName":{
"firstName":"Birol",
"middleName":"",
"lastName":"Cimen"
}
}
]
},
"cityName":"Hengelo",
"geographicRegionName":{
"value":"",
"geographicRegionCategory":"STATE"
},
"countryCode":"NL"
}
]
}
]
},
{
"partyIdentifierOrContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"personStructuredName":{
"lastName":"Oppedahl Patent Law Firm LLC (Mink)"
}
}
]
},
"postalAddressBag":{
"postalAddress":[
{
"postalStructuredAddress":{
"addressLineText":[
{
"value":"P O Box 351240"
}
],
"cityName":"Westminster",
"geographicRegionName":[
{
"value":"CO"
}
],
"countryCode":"US",
"postalCode":"80035"
}
}
]
}
},
{
"value":"133517"
}
]
}
]
},
"groupArtUnitNumber":{
"value":"3771",
"electronicText":"3771"
},
"applicationConfirmationNumber":"7897",
"applicantFileReference":"FP01.P035 SST02US",
"priorityClaimBag":{
"priorityClaim":[
{
"ipOfficeName":"NETHERLANDS",
"applicationNumber":{
"applicationNumberText":"2019179"
},
"filingDate":"2017-07-05",
"sequenceNumber":"1"
}
]
},
"patentClassificationBag":{
"cpcClassificationBagOrIPCClassificationOrECLAClassificationBag":[
{
"ipOfficeCode":"US",
"mainNationalClassification":{
"nationalClass":"606",
"nationalSubclass":"133000"
}
}
]
},
"businessEntityStatusCategory":"SMALL",
"firstInventorToFileIndicator":"true",
"inventionTitle":{
"content":[
"Hair removal device for removing body hair on a body surface"
]
},
"applicationStatusCategory":"Application Dispatched from Preexam, Not Yet Docketed",
"applicationStatusDate":"2020-05-08",
"officialFileLocationCategory":"ELECTRONIC",
"patentPublicationIdentification":{
"publicationNumber":"US20200170371A1",
"publicationDate":"2020-06-04"
},
"relatedDocumentData":{
"parentDocumentDataOrChildDocumentData":[
{
"descriptionText":"This application is National Stage Entry of",
"applicationNumberText":"PCT/NL2018/050434",
"filingDate":"2018-07-04",
"parentDocumentStatusCode":"Published",
"patentNumber":""
}
]
}
},
"prosecutionHistoryDataBag":{
"prosecutionHistoryData":[
{
"eventDate":"2020-06-05",
"eventCode":"PG-ISSUE",
"eventDescriptionText":"PG-Pub Issue Notification"
},
{
"eventDate":"2020-05-11",
"eventCode":"M903",
"eventDescriptionText":"Notice of DO/EO Acceptance Mailed"
},
{
"eventDate":"2020-05-11",
"eventCode":"FLRCPT.U",
"eventDescriptionText":"Filing Receipt - Updated"
},
{
"eventDate":"2020-05-11",
"eventCode":"MPEN",
"eventDescriptionText":"Mail Pre-Exam Notice"
},
{
"eventDate":"2020-02-26",
"eventCode":"EML_NTR",
"eventDescriptionText":"Email Notification"
},
{
"eventDate":"2020-02-26",
"eventCode":"EML_NTR",
"eventDescriptionText":"Email Notification"
},
{
"eventDate":"2020-02-26",
"eventCode":"CCRDY",
"eventDescriptionText":"Application ready for PDX access by participating foreign offices"
},
{
"eventDate":"2020-01-05",
"eventCode":"371COMP",
"eventDescriptionText":"371 Completion Date"
},
{
"eventDate":"2020-02-25",
"eventCode":"PGPC",
"eventDescriptionText":"Sent to Classification Contractor"
},
{
"eventDate":"2020-02-25",
"eventCode":"FTFS",
"eventDescriptionText":"FITF set to YES - revise initial setting"
},
{
"eventDate":"2020-01-02",
"eventCode":"PTA.RFE",
"eventDescriptionText":"Patent Term Adjustment - Ready for Examination"
},
{
"eventDate":"2020-02-26",
"eventCode":"FLRCPT.O",
"eventDescriptionText":"Filing Receipt"
},
{
"eventDate":"2020-02-26",
"eventCode":"M903",
"eventDescriptionText":"Notice of DO/EO Acceptance Mailed"
},
{
"eventDate":"2019-12-31",
"eventCode":"SREXR141",
"eventDescriptionText":"PTO/SB/69-Authorize EPO Access to Search Results"
},
{
"eventDate":"2019-12-31",
"eventCode":"APPERMS",
"eventDescriptionText":"Applicants have given acceptable permission for participating foreign "
},
{
"eventDate":"2020-02-25",
"eventCode":"SMAL",
"eventDescriptionText":"Applicant Has Filed a Verified Statement of Small Entity Status in Compliance with 37 CFR 1.27"
},
{
"eventDate":"2019-12-31",
"eventCode":"L194",
"eventDescriptionText":"Cleared by OIPE CSR"
},
{
"eventDate":"2019-12-31",
"eventCode":"WIDS",
"eventDescriptionText":"Information Disclosure Statement (IDS) Filed"
},
{
"eventDate":"2019-12-31",
"eventCode":"WIDS",
"eventDescriptionText":"Information Disclosure Statement (IDS) Filed"
},
{
"eventDate":"2019-12-31",
"eventCode":"BIG.",
"eventDescriptionText":"ENTITY STATUS SET TO UNDISCOUNTED (INITIAL DEFAULT SETTING OR STATUS CHANGE)"
},
{
"eventDate":"2019-12-31",
"eventCode":"IEXX",
"eventDescriptionText":"Initial Exam Team nn"
}
]
},
"st96Version":"V3_1",
"ipoVersion":"US_V8_0"
},
I import the json data as a dictionary. However, what is the best way to obtain the information I would like to retrieve. Should I use json.normalize to flatten it and convert to a Dataframe?
I would like to specifically retrieve information in the "prosecutionHistoryData". For example, with other patent applications, this would provide specific information regarding how many office actions have been issued.
Eventually I would like to cross-reference this office action data by Patent Examiner (which would be found in the "applicantBagOrInventorBagOrOwnerBag" when assigned to an Examiner).
Are there any good resources that explain how to clean json data such I can get break this information into separate columns?
Thank you for the information! Here is an example with an Examiner:
{
"patentCaseMetadata":{
"applicationNumberText":{
"value":"16732312",
"electronicText":"16732312"
},
"filingDate":"2020-01-01",
"applicationTypeCategory":"Utility",
"partyBag":{
"applicantBagOrInventorBagOrOwnerBag":[
{
"primaryExaminerOrAssistantExaminerOrAuthorizedOfficer":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"personFullName":"ORGAD, EDAN"
}
]
}
}
]
},
{
"applicant":[
{
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"organizationStandardName":{
"content":[
"Communication Systems LLC"
]
}
}
]
},
"cityName":"Santa Fe",
"geographicRegionName":{
"value":"NM",
"geographicRegionCategory":"STATE"
},
"countryCode":""
}
]
}
]
}
]
},
"groupArtUnitNumber":{
"value":"2414",
"electronicText":"2414"
},
"applicationConfirmationNumber":"8996",
"applicantFileReference":"CS1003US03",
"patentClassificationBag":{
"cpcClassificationBagOrIPCClassificationOrECLAClassificationBag":[
{
"ipOfficeCode":"US",
"mainNationalClassification":{
"nationalClass":"370",
"nationalSubclass":"329000"
}
}
]
},
"businessEntityStatusCategory":"SMALL",
"firstInventorToFileIndicator":"true",
"inventionTitle":{
"content":[
"APPARATUSES, METHODS, AND COMPUTER-READABLE MEDIUM FOR COMMUNICATION IN A WIRELESS LOCAL AREA NETWORK"
]
},
"applicationStatusCategory":"Docketed New Case - Ready for Examination",
"applicationStatusDate":"2020-02-07",
"officialFileLocationCategory":"ELECTRONIC",
"patentPublicationIdentification":{
"publicationNumber":"US20200154403A1",
"publicationDate":"2020-05-14"
}
},
"prosecutionHistoryDataBag":{
"prosecutionHistoryData":[
{
"eventDate":"2020-05-19",
"eventCode":"PG-ISSUE",
"eventDescriptionText":"PG-Pub Issue Notification"
}
]
},
"assignmentDataBag":{
"assignmentData":[
{
"reelNumber":"52436",
"frameNumber":"295",
"documentReceivedDate":"2020-04-20",
"recordedDate":"2020-04-20",
"mailDate":"2020-04-21",
"pageTotalQuantity":3,
"conveyanceText":"ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).",
"assignorBag":{
"assignor":[
{
"executionDate":"2016-07-14",
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"ATEFI, ALI"
}
]
}
}
]
}
]
},
"assigneeBag":{
"assignee":[
{
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"COMMUNICATION SYSTEMS LLC"
}
]
},
"postalAddressBag":{
"postalAddress":[
{
"postalAddressText":[
{
"sequenceNumber":"1",
"value":"530-B HARKLE ROAD"
},
{
"sequenceNumber":"2",
"value":"STE. 100"
},
{
"sequenceNumber":"3",
"value":"SANTA FE NEW MEXICO 87505"
}
]
}
]
}
}
]
}
]
},
"correspondenceAddress":{
"partyIdentifierOrContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"ALI ATEFI"
}
]
},
"postalAddressBag":{
"postalAddress":[
{
"postalAddressText":[
{
"sequenceNumber":"1",
"value":"530-B HARKLE ROAD"
},
{
"sequenceNumber":"2",
"value":"STE. 100"
},
{
"sequenceNumber":"3",
"value":"SANTA FE, NM 87505"
}
]
}
]
}
}
]
},
"sequenceNumber":"1"
}
],
"assignmentTotalQuantity":1
},
"st96Version":"V3_1",
"ipoVersion":"US_V8_0"
},
My parse will not go past the applicantBagOrInventorBagOrOwnerBag. Here is my example parse for trying to obtain the Examiner name, which returns an empty dataframe:
jsonpath_expression = parse('PatentBulkData[*].patentCaseMetadata.partyBag.applicantBagOrInventorBagOrOwnerBag.primaryExaminerOrAssistantExaminerOrAuthorizedOfficer.name.personNameOrOrganizationNameOrEntityName.personFullName[*]')
If I end at the applicantBagOrInventorBagOrOwnerBag, I return a dataframe with proper information - just with brackets and all the other JSON notation. Am I missing the key structure?
Thanks again!

For parsing more or less complex JSON documents you might wanna take a look at the JSONPath "query language".
There's a nice Python implementation in jsonpath-rw. Since the data you need is nested like this
{
"PatentBulkData": [
{
"prosecutionHistoryDataBag": {
"prosecutionHistoryData": [
{
"eventDate": "2020-06-05",
"eventCode": "PG-ISSUE",
"eventDescriptionText": "PG-Pub Issue Notification"
},
the JSONPath would be
Under key PatentBulkData, get the every element of the array, then the key prosecutionHistoryDataBag, then the key prosecutionHistoryData, and finally all array elements under that.
Or
PatentBulkData[*].prosecutionHistoryDataBag.prosecutionHistoryData[*]
This is what you'd do in Python
import json
from jsonpath_rw import jsonpath, parse
import pandas as pd
# Parse the string containing the whole JSON document
data = json.loads(<YOUR_JSON_STRING>)
jsonpath_expr = parse('PatentBulkData[*].prosecutionHistoryDataBag.prosecutionHistoryData[*]')
# Extract the raw value from each matching element,
# i.e. every element of the JSON array
matches = [match.value for match in jsonpath_expr.find(data)]
# Create dataframe from the list of dictionaries
df = pd.DataFrame.from_records(matches)
Result:
| eventDate | eventCode | eventDescriptionText |
|-------------|:------------|:----------------------------------|
| 2020-06-05 | PG-ISSUE | PG-Pub Issue Notification |
| 2020-05-11 | M903 | Notice of DO/EO Acceptance Mailed |
| 2020-05-11 | FLRCPT.U | Filing Receipt - Updated |
| 2020-05-11 | MPEN | Mail Pre-Exam Notice |
| 2020-02-26 | EML_NTR | Email Notification |
EDIT
For the examiner query, you need to look out for nested arrays. Every time you get to an array in the tree, you need to either get one ([0], [1], etc.) or all the elements in the array ([*]):
examiner_expr = parse(
"PatentBulkData[*].patentCaseMetadata.partyBag"
".applicantBagOrInventorBagOrOwnerBag[*]"
".primaryExaminerOrAssistantExaminerOrAuthorizedOfficer[*]"
".name.personNameOrOrganizationNameOrEntityName[*]"
".personFullName"
)
[match.value for match in examiner_expr.find(data)]
# ['ORGAD, EDAN']

Best method to import Nested JSON data into Dataframes or Dictionary?

I am wondering if anyone would have a suggestion on how to import nested JSON data in as a Dataframe or dictionary? The data in question is normally available here - https://ped.uspto.gov/peds/.
Here is an example of the format of the data:
{
"PatentBulkData":[
{
"patentCaseMetadata":{
"applicationNumberText":{
"value":"16732342",
"electronicText":"16732342"
},
"filingDate":"2020-01-01",
"applicationTypeCategory":"Utility",
"partyBag":{
"applicantBagOrInventorBagOrOwnerBag":[
{
"primaryExaminerOrAssistantExaminerOrAuthorizedOfficer":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"personFullName":"VO, PETER DUNG BA"
}
]
}
}
]
},
{
"applicant":[
{
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"organizationStandardName":{
"content":[
"CYNTEC CO., LTD."
]
}
}
]
},
"cityName":"Hsinchu",
"geographicRegionName":{
"value":"",
"geographicRegionCategory":"STATE"
},
"countryCode":"TW"
}
]
}
]
}
]
},
"groupArtUnitNumber":{
"value":"3729",
"electronicText":"3729"
},
"applicationConfirmationNumber":"1040",
"applicantFileReference":"6101.179US",
"patentClassificationBag":{
"cpcClassificationBagOrIPCClassificationOrECLAClassificationBag":[
{
"ipOfficeCode":"US",
"mainNationalClassification":{
"nationalClass":"029",
"nationalSubclass":"602100"
}
}
]
},
"businessEntityStatusCategory":"UNDISCOUNTED",
"firstInventorToFileIndicator":"true",
"inventionTitle":{
"content":[
"INDUCTOR WITH AN ELECTRODE STRUCTURE"
]
},
"applicationStatusCategory":"Docketed New Case - Ready for Examination",
"applicationStatusDate":"2020-04-27",
"officialFileLocationCategory":"ELECTRONIC",
"patentPublicationIdentification":{
"publicationNumber":"US20200135386A1",
"publicationDate":"2020-04-30"
}
},
"assignmentDataBag":{
"assignmentData":[
{
"reelNumber":"51406",
"frameNumber":"55",
"documentReceivedDate":"2020-01-03",
"recordedDate":"2020-01-03",
"mailDate":"2020-01-06",
"pageTotalQuantity":3,
"conveyanceText":"ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).",
"assignorBag":{
"assignor":[
{
"executionDate":"2020-01-02",
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"LEE, CHI-HSUN"
}
]
}
}
]
},
{
"executionDate":"2020-01-02",
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"HSIEH, HSIEH-SHEN"
}
]
}
}
]
},
{
"executionDate":"2020-01-02",
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"CHEN, SEN-HUEI"
}
]
}
}
]
}
]
},
"assigneeBag":{
"assignee":[
{
"contactOrPublicationContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"CYNTEC CO., LTD."
}
]
},
"postalAddressBag":{
"postalAddress":[
{
"postalAddressText":[
{
"sequenceNumber":"1",
"value":"NO. 2, RESEARCH & DEVELOPMENT 2ND RD."
},
{
"sequenceNumber":"2",
"value":"SCIENCE PARK"
},
{
"sequenceNumber":"3",
"value":"HSINCHU TAIWAN"
}
]
}
]
}
}
]
}
]
},
"correspondenceAddress":{
"partyIdentifierOrContact":[
{
"name":{
"personNameOrOrganizationNameOrEntityName":[
{
"value":"LITRON INTERNATIONAL PATENT & TRADEMARK OFFICE"
}
]
},
"postalAddressBag":{
"postalAddress":[
{
"postalAddressText":[
{
"sequenceNumber":"1",
"value":"11F.-2, NO.248, SEC. 3, NANJING E. RD."
},
{
"sequenceNumber":"2",
"value":"TAIPEI CITY, TAIWAN"
}
]
}
]
}
}
]
},
"sequenceNumber":"1"
}
],
"assignmentTotalQuantity":1
},
"st96Version":"V3_1",
"ipoVersion":"US_V8_0"
},
I bring in the data with the following:
import json
import pandas as pd
with open('/content/drive/My Drive/2020.json') as json_file:
data = json.load(json_file)
While this does create a dictionary, it is keyed on 'PatentBulkData'. Thus, the remaining portion of the data is a list. In other words, when I run
print(type(data['PatentBulkData']))
The type is a 'list'.
Ideally, I would like to go down one more level in order to create a dictionary based on the application number text, examiner name, and prosecution history bag (an example with prosecution history bag can be found here - Best way to extract/format data in JSON format using Python?).
The purpose of this is to get the data into a format such that I can conduct analytics based on examiner, applicant, etc.
I believe the data is also available in XML format - would XML be easier?
Any suggestions would greatly be appreciated. Thanks!

The command you're looking for is json_normalize. The Pandas documentation for it is pretty good- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
So for example, you would want to do something like:
import json
import pandas as pd
from pandas import json_normalize
with open('/content/drive/My Drive/2020.json') as json_file:
data = json.load(json_file)
df = json_normalize(data, max_level=1)

MongoDB - Get SUM of values INSIDE of the array

I have JSON document recorded to MongoDB with structure like so:
[{ "SessionKey": "172e3b6b-509e-4ef3-950c-0c1dc5c83bab",
"Query": {"Date": "2020-03-04"},
"Flights": [
{"LegId":"13235",
"PricingOptions": [
{"Agents": [1963108],
"Price": 61763.64 },
{"Agents": [4035868],
"Price": 62395.83 }]},
{"LegId": "13236",
"PricingOptions": [{
"Agents": [2915951],
"Price": 37188.0}]}
...
The result I'm trying to get is "LegId":"sum_per_flight", in this case -> {'13235': (61763.64+62395.83), '13236': 37188.0} and then get flights with price < N
I've tried to run this pipeline for aggregation step (but it returns list of ALL prices - I don't know how to sum them up properly):
result = collection.aggregate([
{'$match': {'Query.Date': '2020-03-01'}},
{'$group': {'_id': {'Flight':'$Flights.LegId', 'Price':'$Flights.PricingOptions.Price'}}} ])
Also I've tried this pipeline, but it returns 0 for 'total_price_per_flight':
result = collection.aggregate({'$project': {
'Flights.LegId':1,
'total_price_per_flight': {'$sum': '$Flights.PricingOptions.Price'}
}})

You need to use $unwind to flatten Flights array to able iterate individually.
With $reduce operator, we iterate PricingOptions array and sum Price fields (accumulate prices).
The last step we return your documents into original structure. Before that, you may apply "get flights with price < N"
db.collection.aggregate([
{
"$match": {
"Query.Date": "2020-03-04"
}
},
{
$unwind: "$Flights"
},
{
$addFields: {
"Flights.LegId": {
$arrayToObject: [
[
{
k: "$Flights.LegId",
v: {
$reduce: {
input: "$Flights.PricingOptions",
initialValue: 0,
in: {
$add: [
"$$value",
"$$this.Price"
]
}
}
}
}
]
]
}
}
},
{
$group: {
_id: "$_id",
SessionKey: {
$first: "$SessionKey"
},
Query: {
$first: "$Query"
},
Flights: {
$push: "$Flights"
}
}
}
])
MongoPlayground

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

I am converting dictionary to dataframe in python and output has changed - python

This is recycling of scalar values. This behavior is also why you can assign a scalar to a new column and the number of rows stays the same. You can wrap the parsed address into another list to get a single row.

Related

Find the longest group after groupby on normalized json in pandas

Pandas Nested Array Columns

Best way to extract/format data in JSON format using Python?

Best method to import Nested JSON data into Dataframes or Dictionary?

MongoDB - Get SUM of values INSIDE of the array

Categories

Resources