Find string in text file and print strings near it - python

I'll try to explain my goal. I have to write reports based on a document sent to me that contains common strings. For example, the document contains data like:
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
"firstName": "Google",
"lastName": "Reviewer",
"addresses": {
"address": [
{
"street1": "1600 Ampitheater Parkway",
"street2": null,
"city": "Mountainview",
"postalCode": "94043",
"state": "CA",
"nonUsaState": null,
"country": "US",
"type": "BUSINESS"
This is an example of the 'raw' data. It is also presented in a PDF. I have tried scraping the PDF using tabula, but there seems to be some issue with fonts, so I only get about 10% of the text. I'm thinking that going after the raw data will be more accurate and easier (if you think scraping the PDF would be easier, please let me know).
So I used this code:
with open('filetobesearched.txt', 'r') as searchfile:
    for line in searchfile:
        if 'reportId' in line:
            print(line)
        if 'dateReceived' in line:
            print(line)
        if 'firstName' in line:
            print(line)
and this is where trouble starts... there are multiple occurrences of the string 'firstName' in the file, so my code as it exists prints each of them one after the other. In the raw file those fields live in different sections, each preceded by a section header like 'reportingEsp' in the example above. I'd like my code to somehow know that one 'firstName' belongs to a given section and the next occurrence belongs to another section, and print it with that section... (make sense?)
Eventually I'd like to parse out the address information but omit any fields with a null.
And ULTIMATELY I'd like the data output to a file I could then import into my report template and fill those fields as applicable. Which seems like a huge thing to me... so I'll be happy with help simply parsing the raw data and writing the results to a file in the proper order.
Thanks in advance for any help!

Thanks, yes TIL - it's JSON data. So I accomplished my goal like this:
JSON Data
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
"firstName": "Google",
"lastName": "Reviewer",
"addresses": {
"address": [
{
"street1": "1600 Ampitheater Parkway",
"street2": null,
"city": "Mountainview",
"postalCode": "94043",
"state": "CA",
"nonUsaState": null,
"country": "US",
"type": "BUSINESS"
My code:
import json

# Read the file
with open('file.json', 'r') as myjsonfile:
    jsondata = myjsonfile.read()

# Parse
obj = json.loads(jsondata)

# Parse the JSON data to populate report variables
rptid = str(obj['reportId'])
dateReceived = str(obj['dateReceived'])
print('Report ID: ', rptid)
print('Date Received: ', dateReceived)
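Extending the same approach to the address parsing mentioned in the question, here is a rough sketch that skips null fields; it assumes the nesting shown in the sample above and that nulls come through as None after parsing:

# Grab the first address under reportingEsp and drop any null fields
address = obj['reportingEsp']['addresses']['address'][0]
for field, value in address.items():
    if value is not None:
        print(field, ': ', value)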
So now that I have those as variables I am trying to use them to fill a docx template... but that's another question, I think.
Consider this one answered. Thanks again!

Related

PDF long text extraction to JSON in Python

I'm trying to create a python script that extracts text from a PDF then converts it to a correctly formatted JSON file (see below).
The text extraction is not a problem. I'm using PyPDF2 to extract the text from a user-provided PDF - which will often result in a LONG text string. I would like to add this text as a 'value' to a JSON 'key' (see the 2nd example below).
My code:
# Writing all data to JSON file
# Data to be written
dictionary = {
    "company": str(company),
    "document": str(document),
    "text": str(text)  # This is what would be a LONG string of text
}
# Serializing json
json_object = json.dumps(dictionary, indent=4)
print(json_object)
with open('company_document.json', 'w') as fp:
    json.dump(json_object, fp)
The ideal output would be a JSON file that is structured like this:
[
    {
        "company": 1,
        "document-name": "Orlando",
        "text": " **LONG_TEXT_HERE** "
    }
]
I'm not getting the right JSON structure as output. Also, the long text string most likely contains some punctuation or special characters that can affect the JSON - such as closing the string too early. I could strip these out beforehand, but is there a way to keep them in the JSON file so I can address them in the next step (in Neo4j)?
This is my output at the moment:
"{\n \"company\": \"Stack\",\n \"document\": \"Overflow Report\",\n \"text\": \"Long text 2020\\nSharing relevant and accountable information about our development "quotes and things...
Does anyone have an idea on how this can be achieved?
Like many people, you are confusing the CONTENT of your data with the REPRESENTATION of your data. The code you have works just fine. Notice:
import json

# Data to be written
dictionary = {
    "company": 1,
    "document": "Orlando",
    "text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}

# Serializing json
json_object = json.dumps([dictionary], indent=4)
print(json_object)

with open('company_document.json', 'w') as fp:
    json.dump([dictionary], fp)
When executed, this produces the following on stdout:
[
    {
        "company": 1,
        "document": "Orlando",
        "text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"
    }
]
Notice that the embedded quotes are escaped. That's what the standard requires. The file does not have the indentation, because you didn't ask for it, but it's still quite valid JSON.
[{"company": 1, "document": "Orlando", "text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"}]
FOLLOWUP
This version reads in whatever was in the file before, adds a new record to the list, and saves the whole thing out.
import os
import json

# Read existing data.
MASTER = "company_document.json"
if os.path.exists(MASTER):
    database = json.load(open(MASTER, 'r'))
else:
    database = []

# Data to be written
dictionary = {
    "company": 1,
    "document": "Orlando",
    "text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}

# Serializing json
json_object = json.dumps([dictionary], indent=4)
print(json_object)

database.append(dictionary)
with open(MASTER, 'w') as fp:
    json.dump(database, fp)

Scrapy Field Names that are not as per Python variable restrictions

Is it possible to have field names that do not conform to Python variable naming rules? To elaborate, is it possible to have the field name "Job Title" instead of "job_title" in the export file? While it may not be useful in JSON or XML exports, such functionality might be useful when exporting in CSV format. For instance, I might need to import this data into another system which is already configured to accept CSVs with certain field names.
I tried reading the Item Pipelines documentation, but it appears to deal with what happens after "an item has been scraped by a spider", not with the field names themselves (could be totally wrong though).
Any help in this direction would be really helpful!
I would suggest using a third-party lib called scrapy-jsonschema. With that you can define your Items like this:
from scrapy_jsonschema.item import JsonSchemaItem


class MyItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "MyItem",
        "description": "My Item with spaces",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for the employee",
                "type": "integer"
            },
            "name": {
                "description": "Name of the employee",
                "type": "string"
            },
            "job title": {
                "description": "The title of employee's job.",
                "type": "string"
            }
        },
        "required": ["id", "name", "job title"]
    }
And populate it like this:
item = MyItem()
item['job title'] = 'Boss'
You can read more about it in the scrapy-jsonschema documentation.
This solution addresses the Item definition as you asked, but you can achieve similar results without defining an Item. For example, you could just scrape the data into a dict and yield it back to Scrapy:
yield {
    "id": response.xpath('...').get(),
    "name": response.xpath('...').get(),
    "job title": response.xpath('...').get(),
}
With scrapy crawl myspider -o file.csv, that would scrape into a CSV whose columns have the names you chose.
You could also have the spider or one of its item pipelines write the CSV directly; there are several ways to do it without an Item definition. A sketch of the pipeline route follows below.
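For completeness, here is a minimal sketch of that pipeline route, assuming dict items with the keys shown above; the class name, output filename and header names are illustrative, and the pipeline still has to be enabled in ITEM_PIPELINES:

import csv


class FriendlyHeaderCsvPipeline:
    """Write each scraped item as a CSV row with human-readable column names."""

    def open_spider(self, spider):
        self.file = open('export.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=['id', 'name', 'Job Title'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Map the scraped keys onto the CSV header names
        self.writer.writerow({
            'id': item.get('id'),
            'name': item.get('name'),
            'Job Title': item.get('job title'),
        })
        return item

    def close_spider(self, spider):
        self.file.close()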

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/veoibd_schema.json",
    "title": "clinical data manifest schema",
    "description": "Validates clinical data manifests",
    "type": "object",
    "properties": {
        "individualID": {
            "type": "string",
            "pattern": "^CENTER-"
        },
        "medicationAtDx": {
            "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
        }
    },
    "required": [
        "individualID",
        "medicationAtDx"
    ]
}
The schema referenced by the $ref looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/clinicalData.json",
    "definitions": {
        "ageDxYears": {
            "description": "Age in years at diagnosis",
            "type": "number",
            "minimum": 0,
            "maximum": 90
        },
        "ageOnset": {
            "description": "Age in years of first symptoms",
            "type": "number",
            "exclusiveMinimum": 0
        },
        "medicationAtDx": {
            "description": "Medication prescribed at diagnosis",
            "type": "string"
        }
    }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to figure out the "type" of "medicationAtDx" and am trying to work out how to use jsonschema.RefResolver to de-reference it, but I'm a little lost in the terminology used in the documentation and can't find a good example that explains what the parameters are and what it returns "in small words", i.e. something that a beginning JSON schema user would easily understand.
I created a RefResolver from the clinical manifest schema:
import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)
I fed it the url in the "$ref":
meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment as follows:
ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)
This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back into the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding the exact parameters that resolve_fragment is expecting.
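For what it's worth, here is a sketch of what the TypeError suggests, based on how the jsonschema RefResolver API appears to work: resolve_fragment wants the already-loaded document (a dict), not the URL string, while resolve() follows both the URL and the fragment in one call. Assuming the same testref and meddx_url as above:

# Option 1: resolve() handles the URL and the "#/definitions/..." fragment together
resolved_url, meddx_def = testref.resolve(meddx_url)
print(meddx_def['type'])  # should print 'string' for medicationAtDx

# Option 2: fetch the remote document first, then resolve the fragment against it
ref_doc, ref_frag = meddx_url.split('#')
remote_doc = testref.resolve_remote(ref_doc)               # a dict, not a string
meddx_def = testref.resolve_fragment(remote_doc, ref_frag)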

Python Firebase replace the auto-generated Key

I was adding some data to Firebase with Python and I want to use the MD5 strings I generated as the unique index for each record. The auto-generated key in Firebase looks like "-KgMvzKKxVgj4RKN-3x5". Is it possible to replace its value with Python? I know how to do it with Javascript though. Please help... Thanks in advance!
f = firebase.FirebaseApplication('https://xxxxx.firebaseio.com')
f.post('meeting/',
       {
           "MD5index": MD5String,
           "title": title,
           "date": date,
           "time": time,
           "location": location
       })
It sure is. Just use put instead of post:
f = firebase.FirebaseApplication('https://xxxxx.firebaseio.com')
f.put('meeting/mymeetingkey',
      {
          "MD5index": MD5String,
          "title": title,
          "date": date,
          "time": time,
          "location": location
      })
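As a usage sketch, the MD5 key can be generated with hashlib and used as the path segment; hashing the meeting title here is just an illustration of where the key might come from:

import hashlib

# Derive the key from whatever uniquely identifies the record
MD5String = hashlib.md5(title.encode('utf-8')).hexdigest()

f.put('meeting/' + MD5String,
      {
          "MD5index": MD5String,
          "title": title,
          "date": date,
          "time": time,
          "location": location
      })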

Python: Access a dictionary that is located inside of a text file

I am working on a quick and dirty script to get Chromium's bookmarks and turn them into a pipe menu for Openbox. Chromium stores its bookmarks in a file called Bookmarks that holds the information in a dictionary-like form like this:
{
    "checksum": "99999999999999999999",
    "roots": {
        "bookmark_bar": {
            "children": [ {
                "date_added": "9999999999999999999",
                "id": "9",
                "name": "Facebook",
                "type": "url",
                "url": "http://www.facebook.com/"
            }, {
                "date_added": "999999999999",
                "id": "9",
                "name": "Twitter",
                "type": "url",
                "url": "http://twitter.com/"
How would I open this dictionary in this file in Python and assign it to a variable? I know you open a file with open(), but I don't really know where to go from there. In the end, I want to be able to access the info in the dictionary from a variable like bookmarks[bookmarks_bar][children][0][name] and have it return 'Facebook'.
Do you know if this is a JSON file? If so, Python provides a json library.
JSON can be used as a data serialization/interchange format. It's nice because it's cross-platform. Importing this like you ask seems fairly easy; an example from the docs:
>>> import json
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
[u'foo', {u'bar': [u'baz', None, 1.0, 2]}]
So in your case it would look something like:
import json

# Read the bookmarks file and parse it
with open('file.txt') as f:
    text = f.read()
bookmarks = json.loads(text)

# The key path mirrors the structure shown above (note the top-level "roots" key)
print(bookmarks['roots']['bookmark_bar']['children'][0]['name'])
JSON is definitely the "right" way to do this, but for a quick-and-dirty script eval() might suffice:
with open('file.txt') as f:
    bookmarks = eval(f.read())
