How to convert JSON data from a single key to CSV - Python

I'm trying to convert some data from a JSON file to CSV. The data I need exists in a single key.
I have separated the data from that key using the code below, which gives me the data in the following format:
[['/s/case/50034000013ZPEoAAO$##$00192169', 'Unable to add authentication', 'Anypoint Studio', 'Other', '7.1.3', '/s/contact/00334000023cIUYAA2$##$Paul S', '05-31-2018 22:07', '09-27-2018 05:46', 'S4'], ['/s/case/50034000014dk7mAAA$##$00195409', 'Connect Virtual Private Circuit - VPC-Pre-Prod 198.18.12.0/23', 'Anypoint Platform', 'CloudHub', '', '/s/contact/00334000023ZzOSAA0$##$James G', '07-16-2018 15:59', '07-22-2018 14:42', 'S4'], ...]
I want to separate the data so that everything contained in one set of square brackets is returned as a single row in my CSV file (the data is much longer than above, with many more bracketed groups).
import json

json_data = json.load(open('sample_response.txt'))
for x in json_data['actions']:
    data = x['returnValue']

You need writerows(data) to save it:
import csv

data = [
    ['/s/case/50034000013ZPEoAAO$##$00192169', 'Unable to add authentication', 'Anypoint Studio', 'Other', '7.1.3', '/s/contact/00334000023cIUYAA2$##$Paul S', '05-31-2018 22:07', '09-27-2018 05:46', 'S4'],
    ['/s/case/50034000014dk7mAAA$##$00195409', 'Connect Virtual Private Circuit - VPC-Pre-Prod 198.18.12.0/23', 'Anypoint Platform', 'CloudHub', '', '/s/contact/00334000023ZzOSAA0$##$James G', '07-16-2018 15:59', '07-22-2018 14:42', 'S4'],
    # more rows
]

# newline='' prevents blank lines between rows on Windows
with open('test.csv', 'w', newline='') as fh:
    csvwriter = csv.writer(fh)
    csvwriter.writerows(data)
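Combining this with the extraction loop from the question, a minimal end-to-end sketch (assuming each action's returnValue holds a list of rows like the sample above):

import csv
import json

json_data = json.load(open('sample_response.txt'))

with open('test.csv', 'w', newline='') as fh:
    csvwriter = csv.writer(fh)
    for x in json_data['actions']:
        # each returnValue is assumed to be a list of row lists
        csvwriter.writerows(x['returnValue'])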

Related

Issue importing JSON into pandas and exporting the data to CSV

I'm trying to import data from various JSON files (with different structures) into pandas and then out to a CSV file. The issue I'm having with this file in particular is that the data doesn't parse correctly when loaded into pandas, so the output (CSV) looks like a JSON string. How do I correct this?
import json
import pandas as pd

with open(jsonfile, 'rb') as json_data:
    data = json.load(json_data)
    # df = pd.DataFrame(data)   # throws array length error
    # df = pd.read_json(data)   # throws: invalid file path or buffer object type: <class 'dict'>
    df = pd.json_normalize(data, record_path=['data'], sep=',')  # , meta=['arrayname', ['arrayname', 'nestlevel1name'], ['arrayname', 'nestlevel1name', 'nestlevel2name']]
    # df1 = df
    # df2 = df.iloc[1:, :]
    # df3 = pd.json_normalize(df2)
    df.to_csv(CSVout)
Example of output (to_csv)
[{'file_create_date': '2022-02-18', 'run_id': 'a82ba22a-85f0-11ec-a670-d1a382a9bda1', 'name': 'Piedmont Medical Center', 'tax_id': '95-3561198', 'code': '0360', 'code type': 'revCode', 'code description': 'Operating Room Services General', 'payer': 'administrative concepts', 'patient_class': 'O', 'gross charge': '7193.33', 'de-identified minimum negotiated charge': '1312.00', 'payer-specific negotiated charge': '1312.00', 'de-identified maximum negotiated charge': '3738.71', 'discounted cash price': '5395.00'}]
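The thread shows no confirmed fix, but one hedged approach: if json_normalize leaves a column whose cells are still lists of dicts (as the output above suggests), normalize that inner content again before writing the CSV. Selecting the first column with iloc[:, 0] is an assumption for illustration:

import pandas as pd

inner = df.iloc[:, 0].explode()           # one dict per row, assuming cells hold lists of dicts
flat = pd.json_normalize(inner.tolist())  # one column per key
flat.to_csv(CSVout, index=False)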

Identifying partial character encoding/compression in text content

I have a CSV (extracted from BZ2) where only some values are encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
The |, 0 and 1 characters are definitely appearing as intended, but the other values are clearly encoded. In fact, they look like text-compression replacements, which could mean the CSV had its values compressed and then was also compressed as a whole to BZ2.
I get the same results whether extracting the BZ2 with 7zip and then opening the CSV in a text editor, opening it with Python's bz2 module, or with pandas and read_csv:
import bz2
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
How can I identify which type of encoding to decode with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
Source file: test-balanced.csv.bz2
First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset but they didn't respond which is understandable.
From readme.txt:
File Guide:
raw/key.csv: column key for raw/sarc.csv
raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
*/comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
*/*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
post_id comment_id … comment_id|response_id … response_id|label … label
where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response. Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces. The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
The readme makes clear that the values are not compressed at all: they are post/comment/response IDs that key into */comments.json. Converting the above to a Python code snippet:
import pandas as pd
import json
from pprint import pprint

file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

print(f'{chr(0x20)*30} First csv line decoded:')
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])
Note that the files were downloaded (manually) from the pol directory because of their manageable size (pol contains the subset of the main dataset corresponding to comments in /r/politics).
Result of running D:\bat\SO\71596864.py:
First csv line decoded:
post_id: hqa1x
{'author': 'joshlamb619',
'created_utc': 1307053256,
'date': '2011-06',
'downs': 359,
'score': 274,
'subreddit': 'politics',
'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
'candidates during recall elections.',
'ups': 633}
response_id: c1xiujs
{'author': 'Artisane',
'created_utc': 1307077221,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "And we're upset since the Democrats would *never* try something as "
'sneaky as this, right?',
'ups': -2}
response_id: c1xj4e2
{'author': 'stellarfury',
'created_utc': 1307080843,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
"Picture this we were makin' up candidates Being huge election whores",
'ups': -2}

File contains dictionaries that load back as <class 'str'> with no commas separating them; need to load into pandas to easily create a CSV file

I've created a generator object and want to write it out into a CSV file so I can upload it to an external tool. At the moment the generator returns records as separate dictionaries with no commas separating them, and when I write them out to a txt file and reload it back into the script it comes back as a <class 'str'>.
The generator yields records like:
matches =
{'type_of_reference': 'JOUR', 'title': 'Ranking evidence in substance use and addiction', 'secondary_title': 'International Journal of Drug Policy', 'alternate_title1': 'Int. J. Drug Policy', 'volume': '83', 'year': '2020', 'doi': '10.1016/j.drugpo.2020.102840'}
{'type_of_reference': 'JOUR', 'title': 'Methods used in the selection of instruments for outcomes included in core outcome sets have improved since the publication of the COSMIN/COMET guideline', 'secondary_title': 'Journal of Clinical Epidemiology', 'alternate_title1': 'J. Clin. Epidemiol.', 'volume': '125', 'start_page': '64', 'end_page': '75', 'year': '2020', 'doi': '10.1016/j.jclinepi.2020.05.021',}
These records come from the following generator function, which compares each record's "doi" key against a set of DOIs from another file.
def match_record():
    with open(filename_ris) as f:
        ris_records = readris(f)
        for entry in ris_records:
            if entry['doi'] in doi_match:
                yield entry
I've written the generator matches out with the following code, to check in a txt file that the correct records have been kept.
with open('output.txt', 'w') as f:
    for x in matches:
        f.write(str(x))
What I have is neither a list of dictionaries nor dictionaries separated by commas, so I'm a bit confused about how to read/load it into pandas effectively. I want to load it into pandas to drop certain series [keys] and then write it out as a CSV once completed.
Reading it in with pd.read_csv just returns the key: value pairs of all the separate records as column headers, which is no surprise, but I don't know what to do before this step.
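Since the records are already dictionaries inside the generator, one hedged option is to skip the text file entirely and build the DataFrame straight from the generator; the dropped column name below is just an example key taken from the sample records:

import pandas as pd

df = pd.DataFrame(list(match_record()))      # one row per yielded dict
df = df.drop(columns=['alternate_title1'])   # drop whichever series/keys you don't need
df.to_csv('matches.csv', index=False)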

how to retrieve particular table data in multiple tables using python-docx?

I am using python-docx to extract data from a particular table in a Word file.
I have a Word file with multiple tables. This is the particular table among them (screenshot omitted),
and the retrieved data needs to be arranged like this (screenshot omitted).
Challenges:
Can I find a particular table in a Word file using python-docx?
Can I achieve my requirement using python-docx?
This is not a complete answer, but it should point you in the right direction; it is based on a similar task I have been working on.
I ran the following code in Python 3.6 in a Jupyter notebook, but it should work in plain Python as well.
First we start by importing the docx Document module and pointing to the document we want to work with.
from docx.api import Document
document = Document(<your path to doc>)
We create a list of the tables and print how many there are, then create a list to hold all the tabular data.
tables = document.tables
print (len(tables))
big_data = []
Next we loop through the tables:
for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    # print(data)
    big_data.append(data)
print(big_data)
By looping through all the tables, we read the data into a list of lists. Each inner list represents a table, and within that we have one dictionary per row. Each dictionary holds key/value pairs: the key is the column heading from the table and the value is that row's cell contents for that column.
So, that is half of your problem. The next part would be to use python-docx to create a new table in your output document and to fill it with the appropriate content from the list / list / dictionary data (a sketch follows the sample output below).
In the example I have been working on this is the final table in the document.
When I run the routine above, this is my output:
[{'Version': '1', 'Changes': 'Local Outcome Improvement Plan ', 'Page Number': '1-34 and 42-61', 'Approved By': 'CPA Board\n', 'Date ': '22 August 2016'},
{'Version': '2', 'Changes': 'People are resilient, included and supported when in need section added ', 'Page Number': '35-41', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'},
{'Version': '2', 'Changes': 'Updated governance and accountability structure following approval of the Final Report for the Review of CPA Infrastructure', 'Page Number': '59', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}]]
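For that second half, a minimal sketch (not part of the original answer) could write one collected table back out with python-docx; picking big_data[-1] and the output filename are assumptions for illustration:

from docx import Document

out_doc = Document()
table_data = big_data[-1]                      # list of {heading: cell} dicts for one table
headings = list(table_data[0].keys())
table = out_doc.add_table(rows=1, cols=len(headings))
for j, heading in enumerate(headings):
    table.rows[0].cells[j].text = heading      # header row from the dict keys
for row_data in table_data:
    cells = table.add_row().cells
    for j, heading in enumerate(headings):
        cells[j].text = row_data.get(heading, '')
out_doc.save('output.docx')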

Parsing information from MapQuest Reverse Geocoded Data

Hi everyone, I am having problems parsing information from a query I made to the MapQuest API. I am trying to parse data out of my geocode_data column and place it into separate columns, specifically the address components flagged by the numbered markers (4)-(10) in the geocode data below.
{'providedLocation': {'latLng': {'lat': 52.38330319, 'lng': 4.7959011}},
 'locations': [{'adminArea6Type': 'Neighborhood',
                'street': '25 Philip Vingboonsstraat',        # (4)
                'adminArea4Type': 'County',
                'adminArea3Type': 'State',
                'displayLatLng': {'lat': 52.383324,           # (9)
                                  'lng': 4.795784},           # (10)
                'adminArea3': 'Noord-Holland',                # (7)
                'adminArea1Type': 'Country',
                'linkId': '0',
                'adminArea4': 'MRA',
                'dragPoint': False,
                'mapUrl': 'http://www.mapquestapi.com/staticmap/v4/getmap?key=Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp&type=map&size=225,160&pois=purple-1,52.3833236,4.7957837,0,0,|&center=52.3833236,4.7957837&zoom=15&rand=-152222465',
                'type': 's',
                'postalCode': '1067BG',                       # (5)
                'latLng': {'lat': 52.383324, 'lng': 4.795784},
                'adminArea5': 'Amsterdam',                    # (6)
                'adminArea6': 'Amsterdam',
                'geocodeQuality': 'ADDRESS',
                'unknownInput': '',
                'adminArea5Type': 'City',
                'geocodeQualityCode': 'L1AAA',
                'adminArea1': 'NL',                           # (8)
                'sideOfStreet': 'N'}]}
I have tried building my code but I keep getting KeyErrors. Can anyone fix my code so that I am able to extract the different address components for my study? Thanks! My code is correct until the locations part towards the end; then I get a KeyError.
import pandas as pd
import json
import requests

df = pd.read_csv('/Users/albertgonzalobautista/Desktop/Testing11.csv')
df['geocode_data'] = ''
df['address'] = ''
df['st_pr_mn'] = ' '

def reverseGeocode(latlng):
    result = {}
    url = 'http://www.mapquestapi.com/geocoding/v1/reverse?key={1}&location={0}'
    apikey = 'Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp'
    request = url.format(latlng, apikey)
    data = json.loads(requests.get(request).text)
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

for i, row in df.iterrows():
    df['geocode_data'][i] = reverseGeocode(df['lat'][i].astype(str) + ',' + df['lon'][i].astype(str))

for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['locations']:
            print(row['locations'])
            df['st_pr_mn'][i] = row['adminArea3']
First of all, according to your if condition, locations is a key in row['geocode_data'], so you should use row['geocode_data']['locations'], not row['locations']; this is most probably the reason you are getting the KeyError.
Then, according to the JSON you have given in the OP, it seems the locations key stores a list, so iterate over each element (as you are doing now) and get the required element from component, not row. Example:
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['geocode_data']['locations']:
            print(row['geocode_data']['locations'])
            df['st_pr_mn'][i] = component['adminArea3']
Though this would overwrite df['st_pr_mn'][i] with a new value of component['adminArea3'] for every dictionary in the list at row['geocode_data']['locations']. If there is only one element in the list then it's fine; otherwise you would have to decide how to store the multiple values, maybe using a list for that.
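For instance, a hedged sketch of the list idea, joining the values into a single cell so the frame still writes cleanly to CSV (the join separator is an arbitrary choice):

# collect every adminArea3 instead of overwriting, then store one joined cell
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        values = [component['adminArea3']
                  for component in row['geocode_data']['locations']]
        df.at[i, 'st_pr_mn'] = ', '.join(values)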
