PDF long text extraction to JSON in Python

PDF long text extraction to JSON in Python - python

I'm trying to create a python script that extracts text from a PDF then converts it to a correctly formatted JSON file (see below).
The text extraction is not a problem. I'm using PyPDF2 to extract the text from user inputted pdf - which will often result in a LONG text string. I would like to add this text as a 'value' to a json 'key' (see 2nd example below).
My code:
# Writing all data to JSON file
# Data to be written
dictionary ={
"company": str(company),
"document": str(document),
"text": str(text) # This is what would be a LONG string of text
}
# Serializing json
json_object = json.dumps(dictionary, indent = 4)
print(json_object)
with open('company_document.json', 'w') as fp:
json.dump(json_object, fp)
The ideal output would be a JSON file that is structured like this:
[
{
"company": 1,
"document-name": "Orlando",
"text": " **LONG_TEXT_HERE** "
}
]
I'm not getting the right json structure as an output. Also, the long text string most likely contains some punctuation or special characters that can affect the json - such as closing the string too early. I could take this out before, but is there a way to keep it in for the json file so I can address it in the next step (in Neo4j) ?
This is my output at the moment:
"{\n \"company\": \"Stack\",\n \"document\": \"Overflow Report\",\n \"text\": \"Long text 2020\\nSharing relevant and accountable information about our development "quotes and things...
Current:
Current situation
Goal:
Ideal situation
Does anyone have an idea on how this can be achieved?

Like many people, you are confusing the CONTENT of your data with the REPRESENTATION of your data. The code you have works just fine. Notice:
import json
# Data to be written
dictionary ={
"company": 1,
"document": "Orlando",
"text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}
# Serializing json
json_object = json.dumps([dictionary], indent = 4)
print(json_object)
with open('company_document.json', 'w') as fp:
json.dump([dictionary], fp)
When executed, this produces the following on stdout:
[
{
"company": 1,
"document": "Orlando",
"text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"
}
]
Notice that the embedded quotes are escaped. That's what the standard requires. The file does not have the indentation, because you didn't ask for it, but it's still quite valid JSON.
[{"company": 1, "document": "Orlando", "text": "Long text 2020\nSharing relevant and accountable information about our development.\nThis is a complicated text string with \"quotes and things\".\n"}]
FOLLOWUP
This version reads in whatever was in the file before, adds a new record to the list, and saves the whole thing out.
import os
import json
# Read existing data.
MASTER = "company_document.json"
if os.path.exists( MASTER ):
database = json.load( open(MASTER,'r') )
else:
database = []
# Data to be written
dictionary ={
"company": 1,
"document": "Orlando",
"text": """Long text 2020
Sharing relevant and accountable information about our development.
This is a complicated text string with "quotes and things".
"""
}
# Serializing json
json_object = json.dumps([dictionary], indent = 4)
print(json_object)
database.append(dictionary)
with open(MASTER, 'w') as fp:
json.dump(database, fp)

Related

Find string in text file and print strings near it

I'll try to explain my goal. I have to write reports based on a document sent to me that has common strings in it. For example, the document sent to me contains data like:
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
"firstName": "Google",
"lastName": "Reviewer",
"addresses": {
"address": [
{
"street1": "1600 Ampitheater Parkway",
"street2": null,
"city": "Mountainview",
"postalCode": "94043",
"state": "CA",
"nonUsaState": null,
"country": "US",
"type": "BUSINESS"
This is an example of the 'raw' data. It is also presented in a PDF. I have tried scraping the PDF using tabula, but there seems to be some issue with fonts?? So I only get about 10% of the text. And I am wondering/thinking going after the raw data will be more accurate/easier...(if you think scraping the PDF would be easier, please let me know)
So I used this code:
with open('filetobesearched.txt', 'r') as searchfile:
for line in searchfile:
if 'reportId' in line:
print (line)
if 'dateReceived' in line:
print (line)
if 'firstName' in line:
print (line)
and this is where trouble starts... there are multiple occurrences of the string 'firstName' in the file. So my code as exists prints each of those one after the other. In the raw file those fields exist in different sections each are preceded by a section header like in the example above 'reportingESP'. So I'd like my code to somehow know the 'firstName' string belongs to a given section and the next occurrence belongs to another section to be printed with it... (make sense?)
Eventually I'd like to parse out the address information but omit any fields with a null.
And ULTIMATELY I'd like the data outputted into a file I could then in turn import into my report template and fill those fields as applicable. Which seems like a huge thing to me... so I'll be happy with help simply parsing through the raw data and outputting the results to a file in the proper order.
Thanks in advance for any help!

Thanks, yes TIL - it's json data. So I accomplished my goal like this:
JSON Data
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
"firstName": "Google",
"lastName": "Reviewer",
"addresses": {
"address": [
{
"street1": "1600 Ampitheater Parkway",
"street2": null,
"city": "Mountainview",
"postalCode": "94043",
"state": "CA",
"nonUsaState": null,
"country": "US",
"type": "BUSINESS"
My code:
import json
# read files
myjsonfile=open('file.json', 'r')
jsondata=myjsonfile.read()
# Parse
obj=json.loads(jsondata)
#parse through the json data to populate report variables
rptid = str(str(obj['reportId']))
dateReceived = str(str(obj['dateReceived']))
print('Report ID: ', rptid)
print('Date Received: ', dateReceived)
So now that I have those as variables I am trying to using them to fill a docx template... but that's another question I think.
Consider this one answered. Thanks again!

How to load a JSON file in Python keeping some escaped Unicode characters

I'm trying to parse a json file that have some escaped unicode characters on it's values for some processing.
It comes that when I open the file to work on it, python converts the escaped characters to the actual ascii character, but I need to keep the characters escaped, due to a requirement of the tool that will later process this file.
I tried opening the file with different encodings and also tried both options (True and False) for the ensure_ascii option on the json.dump function.
But the problem really seems to be when I load the file, as the example bellow:
This is an simple json file to be used as input:
{
"id": "my_current_id",
"name": "dockeradmin \u003e aclpolicy",
"content": "description: Read \u0026 Execute permissions."
}
This is a model that I'm using to simplify the test of different solutions
import json
with open('./test.json', encoding='utf-8') as tfstateFile:
tfstateData = json.load(tfstateFile)
tfstateData['id'] = 'my_new_id'
print(tfstateData['id'])
print(tfstateData['name'])
print(tfstateData['content'])
resultFilePath = '/result/test_result.json'
with open(resultFilePath, 'w', encoding='utf-8') as resultStateFile:
json.dump(tfstateData, resultStateFile, indent=4, sort_keys=False, ensure_ascii=False)
When I use the provided json as input, this is the output I'm receiving:
>>> my_new_id
>>> dockeradmin > aclpolicy
>>> description: Read & Execute permissions.
The output file is also decoding the escaped characters:
{
"id": "my_new_id",
"name": "dockeradmin > aclpolicy",
"content": "description: Read & Execute permissions."
}
Is it possible to load the json file and keep the escaped characters on the strings values?
The resultant output should be:
{
"id": "my_new_id",
"name": "dockeradmin \u003e aclpolicy",
"content": "description: Read \u0026 Execute permissions."
}

How to handle \n in json files in python

so I am trying to read a json file like this:
{
"Name": "hello",
"Source": " import json \n x= 10, .... "
}
the way I am trying to read it by using the json library in python
thus my code looks something like this:
import json
input =''' {
"Name": "python code",
"Source": " import json \n x= 10, .... "
}'''
output = json.load(input)
print(output)
the problem that source has the invalid character "\n". I don't want to replace \n with \n as this is code will be later run in another program.
I know that json.JSONDecoder is able to handle \n but I am not sure how to use.

You need to escape the backslash in the input string, so that it will be taken literally.
import json
input =''' {
"Name": "python code",
"Source": " import json \\n x= 10, .... "
}'''
output = json.loads(input)
print output
Also, you should be using json.loads to parse JSON in a string, json.load is for getting it from a file.
Note that if you're actually getting the JSON from a file or URL, you don't need to worry about this. Backslash only has special meaning to Python when it's in a string literal in the program, not when it's read from somewhere else.

Alternatively, you can define the string in raw format using r:
import json
# note the r
input = r''' {
"Name": "python code",
"Source": " import json \n x= 10, .... "
}'''
# loads (not load) is used to parse a string
output = json.loads(input)
print(output)

How can I improve this script to make it more pythonic?

I'm fairly new to Python programming, and have thus far been reverse engineering code that previous developers have made, or have cobbled together some functions on my own.
The script itself works; to cut a long story short, its designed to parse a CSV and to (a) create and or update the contacts found in the CSV, and (b) to correctly assign the contact to their associated company. All using the HubSpot API. To achieve this i've also imported requests and csvmapper.
I had the following questions:
How can I improve this script to make it more pythonic?
What is the best way to make this script run on a remote server,
keeping in mind that Requests and CSVMapper probably aren't
installed on that server, and that I most likely won't have
permission to install them - what is the best way to "package" this
script, or to upload Requests and CSVMapper to the server?
Any advice much appreciated.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys, os.path, requests, json, csv, csvmapper, glob, shutil
from time import sleep
major, minor, micro, release_level, serial = sys.version_info
# Client Portal ID
portal = "XXXXXX"
# Client API Key
hapikey = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
# This attempts to find any file in the directory that starts with "note" and ends with ".CSV"
# Server Version
# findCSV = glob.glob('/home/accountName/public_html/clientFolder/contact*.CSV')
# Local Testing Version
findCSV = glob.glob('contact*.CSV')
for i in findCSV:
theCSV = i
csvfileexists = os.path.isfile(theCSV)
# Prints a confirmation if file exists, prints instructions if it doesn't.
if csvfileexists:
print ("\nThe \"{csvPath}\" file was found ({csvSize} bytes); proceeding with sync ...\n".format(csvSize=os.path.getsize(theCSV), csvPath=os.path.basename(theCSV)))
else:
print ("File not found; check the file name to make sure it is in the same directory as this script. Exiting ...")
sys.exit()
# Begin the CSVmapper mapping... This creates a virtual "header" row - the CSV therefore does not need a header row.
mapper = csvmapper.DictMapper([
[
{'name':'account'}, #"Org. Code"
{'name':'id'}, #"Hubspot Ref"
{'name':'company'}, #"Company Name"
{'name':'firstname'}, #"Contact First Name"
{'name':'lastname'}, #"Contact Last Name"
{'name':'job_title'}, #"Job Title"
{'name':'address'}, #"Address"
{'name':'city'}, #"City"
{'name':'phone'}, #"Phone"
{'name':'email'}, #"Email"
{'name':'date_added'} #"Last Update"
]
])
# Parse the CSV using the mapper
parser = csvmapper.CSVParser(os.path.basename(theCSV), mapper)
# Build the parsed object
obj = parser.buildObject()
def contactCompanyUpdate():
# Open the CSV, use commas as delimiters, store it in a list called "data", then find the length of that list.
with open(os.path.basename(theCSV),"r") as f:
reader = csv.reader(f, delimiter = ",", quotechar="\"")
data = list(reader)
# For every row in the CSV ...
for row in range(0, len(data)):
# Set up the JSON payload ...
payload = {
"properties": [
{
"name": "account",
"value": obj[row].account
},
{
"name": "id",
"value": obj[row].id
},
{
"name": "company",
"value": obj[row].company
},
{
"property": "firstname",
"value": obj[row].firstname
},
{
"property": "lastname",
"value": obj[row].lastname
},
{
"property": "job_title",
"value": obj[row].job_title
},
{
"property": "address",
"value": obj[row].address
},
{
"property": "city",
"value": obj[row].city
},
{
"property": "phone",
"value": obj[row].phone
},
{
"property": "email",
"value": obj[row].email
},
{
"property": "date_added",
"value": obj[row].date_added
}
]
}
nameQuery = "{first} {last}".format(first=obj[row].firstname, last=obj[row].lastname)
# Get a list of all contacts for a certain company.
contactCheck = "https://api.hubapi.com/contacts/v1/search/query?q={query}&hapikey={hapikey}".format(hapikey=hapikey, query=nameQuery)
# Convert the payload to JSON and assign it to a variable called "data"
data = json.dumps(payload)
# Defined the headers content-type as 'application/json'
headers = {'content-type': 'application/json'}
contactExistCheck = requests.get(contactCheck, headers=headers)
for i in contactExistCheck.json()[u'contacts']:
# ... Get the canonical VIDs
canonicalVid = i[u'canonical-vid']
if canonicalVid:
print ("{theContact} exists! Their VID is \"{vid}\"".format(theContact=obj[row].firstname, vid=canonicalVid))
print ("Attempting to update their company...")
contactCompanyUpdate = "https://api.hubapi.com/companies/v2/companies/{companyID}/contacts/{vid}?hapikey={hapikey}".format(hapikey=hapikey, vid=canonicalVid, companyID=obj[row].id)
doTheUpdate = requests.put(contactCompanyUpdate, headers=headers)
if doTheUpdate.status_code == 200:
print ("Attempt Successful! {theContact}'s has an updated company.\n".format(theContact=obj[row].firstname))
break
else:
print ("Attempt Failed. Status Code: {status}. Company or Contact not found.\n".format(status=doTheUpdate.status_code))
def createOrUpdateClient():
# Open the CSV, use commas as delimiters, store it in a list called "data", then find the length of that list.
with open(os.path.basename(theCSV),"r") as f:
reader = csv.reader(f, delimiter = ",", quotechar="\"")
data = list(reader)
# For every row in the CSV ...
for row in range(0, len(data)):
# Set up the JSON payload ...
payloadTest = {
"properties": [
{
"property": "email",
"value": obj[row].email
},
{
"property": "firstname",
"value": obj[row].firstname
},
{
"property": "lastname",
"value": obj[row].lastname
},
{
"property": "website",
"value": None
},
{
"property": "company",
"value": obj[row].company
},
{
"property": "phone",
"value": obj[row].phone
},
{
"property": "address",
"value": obj[row].address
},
{
"property": "city",
"value": obj[row].city
},
{
"property": "state",
"value": None
},
{
"property": "zip",
"value": None
}
]
}
# Convert the payload to JSON and assign it to a variable called "data"
dataTest = json.dumps(payloadTest)
# Defined the headers content-type as 'application/json'
headers = {'content-type': 'application/json'}
#print ("{theContact} does not exist!".format(theContact=obj[row].firstname))
print ("Attempting to add {theContact} as a contact...".format(theContact=obj[row].firstname))
createOrUpdateURL = 'http://api.hubapi.com/contacts/v1/contact/createOrUpdate/email/{email}/?hapikey={hapikey}'.format(email=obj[row].email,hapikey=hapikey)
r = requests.post(createOrUpdateURL, data=dataTest, headers=headers)
if r.status_code == 409:
print ("This contact already exists.\n")
elif (r.status_code == 200) or (r.status_code == 202):
print ("Success! {firstName} {lastName} has been added.\n".format(firstName=obj[row].firstname,lastName=obj[row].lastname, response=r.status_code))
elif r.status_code == 204:
print ("Success! {firstName} {lastName} has been updated.\n".format(firstName=obj[row].firstname,lastName=obj[row].lastname, response=r.status_code))
elif r.status_code == 400:
print ("Bad request. You might get this response if you pass an invalid email address, if a property in your request doesn't exist, or if you pass an invalid property value.\n")
else:
print ("Contact Marko for assistance.\n")
if __name__ == "__main__":
# Run the Create or Update function
createOrUpdateClient()
# Give the previous function 5 seconds to take effect.
sleep(5.0)
# Run the Company Update function
contactCompanyUpdate()
print("Sync complete.")
print("Moving \"{something}\" to the archive folder...".format(something=theCSV))
# Cron version
#shutil.move( i, "/home/accountName/public_html/clientFolder/archive/" + os.path.basename(i))
# Local version
movePath = "archive/{thefile}".format(thefile=theCSV)
shutil.move( i, movePath )
print("Move successful! Exiting...\n")
sys.exit()

I'll just go from top to bottom. The first rule is, do what's in PEP 8. It's not the ultimate style guide, but it's certainly a reference baseline for Python coders, and that's more important, especially when you're getting started. The second rule is, make it maintainable. A couple of years from now, when some other new kid comes through, it should be easy for her to figure out what you were doing. Sometimes that means doing things the long way, to reduce errors. Sometimes it means doing things the short way, to reduce errors. :-)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Two things: you got the encoding right, per PEP 8. And
Conventions for writing good documentation strings (a.k.a. "docstrings") are immortalized in PEP 257.
You've got a program that does something. But you don't document what.
from __future__ import print_function
import sys, os.path, requests, json, csv, csvmapper, glob, shutil
from time import sleep
major, minor, micro, release_level, serial = sys.version_info
Per PEP 8: put your import module statements one per line.
Per Austin: make your paragraphs have separate subjects. You've got some imports right next to some version info stuff. Insert a blank line. Also, DO SOMETHING with the data! Or you didn't need it to be right here, did you?
# Client Portal ID
portal = "XXXXXX"
# Client API Key
hapikey = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
You've obscured these in more ways than one. WTF is a hapikey? I think you mean Hubspot_API_key. And what does portal do?
One piece of advice: the more "global" a thing is, the more "formal" it should be. If you have a for loop, it's okay to call one of the variables i. If you have a piece of data that is used throughout a function, call it obj or portal. But if you have a piece of data that is used globally, or is a class variable, make it put on a tie and a jacket so everyone can recognize it: make it Hubspot_api_key instead of client_api_key. Maybe even Hubspot_client_api_key if there are more than one API. Do the same with portal.
# This attempts to find any file in the directory that starts with "note" and ends with ".CSV"
# Server Version
# findCSV = glob.glob('/home/accountName/public_html/clientFolder/contact*.CSV')
It didn't take long for the comments to become lies. Just delete them if they aren't true.
# Local Testing Version
findCSV = glob.glob('contact*.CSV')
This is the kind of thing that you should create a function for. Just create a simple function called "get_csv_files" or whatever, and have it return a list of filenames. That decouples you from glob, and it means you can make your test code data driven (pass a list of filenames into a function, or pass a single file into a function, instead of asking it to search for them). Also, those glob patterns are exactly the kind of thing that go in a config file, or a global variable, or get passed as command line arguments.
for i in findCSV:
I'll bet typing CSV in upper case all the time is a pain. And what does findCSV mean? Read that line, and figure out what that variable should be called. Maybe csv_files? Or new_contact_files? Something that demonstrates that there is a collection of things.
theCSV = i
csvfileexists = os.path.isfile(theCSV)
Now what does i do? You had this nice small variable name, in a BiiiiiiG loop. That was a mistake, since if you can't see a variable's entire scope all on one page, it probably needs a somewhat longer name. But then you created an alias for it. Both i and theCSV refer to the same thing. And ... I don't see you using i again. So maybe your loop variable should be theCSV. Or maybe it should be the_csv to make it easier to type. Or just csvname.
# Prints a confirmation if file exists, prints instructions if it doesn't.
This seems a little needless. If you're using glob to get filenames, they pretty much are going to exist. (If they don't, it's because they were deleted between the time you called glob and the time you tried to open them. That's possible, but rare. Just continue or raise an exception, depending.)
if csvfileexists:
print ("\nThe \"{csvPath}\" file was found ({csvSize} bytes); proceeding with sync ...\n".format(csvSize=os.path.getsize(theCSV), csvPath=os.path.basename(theCSV)))
In this code, you use the value of csvfileexists. But that's the only place you use it. In this case, you can probably move the call to os.path.isfile() into the if statement and get rid of the variable.
else:
print ("File not found; check the file name to make sure it is in the same directory as this script. Exiting ...")
sys.exit()
Notice that in this case, when there is an actual problem, you didn't print the file name? How helpful was that?
Also, remember the part where you're on a remote server? You should consider using Python's logging module to record these messages in a useful manner.
# Begin the CSVmapper mapping... This creates a virtual "header" row - the CSV therefore does not need a header row.
mapper = csvmapper.DictMapper([
[
{'name':'account'}, #"Org. Code"
{'name':'id'}, #"Hubspot Ref"
{'name':'company'}, #"Company Name"
{'name':'firstname'}, #"Contact First Name"
{'name':'lastname'}, #"Contact Last Name"
{'name':'job_title'}, #"Job Title"
{'name':'address'}, #"Address"
{'name':'city'}, #"City"
{'name':'phone'}, #"Phone"
{'name':'email'}, #"Email"
{'name':'date_added'} #"Last Update"
]
])
You're creating an object with a bunch of data. This would be a good place for a function. Define a make_csvmapper() function to do all this for you, and move it out of line.
Also, note that the standard csv module has most of the functionality you are using. I don't think you actually need csvmapper.
# Parse the CSV using the mapper
parser = csvmapper.CSVParser(os.path.basename(theCSV), mapper)
# Build the parsed object
obj = parser.buildObject()
Here's another chance for a function. Maybe instead of making a csv mapper, you could just return the obj?
def contactCompanyUpdate():
At this point, things get fishy. You have these function definitions indented, but I don't think you need them. Is that a stackoverflow problem, or does your code really look like this?
# Open the CSV, use commas as delimiters, store it in a list called "data", then find the length of that list.
with open(os.path.basename(theCSV),"r") as f:
No, apparently it really looks like this. Because you're using theCSV inside this function when you don't really need to. Please consider using formal function parameters instead of just grabbing outer-scope objects. Also, why are you using basename on the csv file? If you obtained it using glob, doesn't it already have the path you want?
reader = csv.reader(f, delimiter = ",", quotechar="\"")
data = list(reader)
# For every row in the CSV ...
for row in range(0, len(data)):
Here you forced data to be a list of rows obtained from reader, and then started iterating over them. Just iterate over reader directly, like: for row in reader: BUT WAIT! You're actually iterating over a CSV file that you have already opened, in your obj variable. Just pick one, and iterate over it. You don't need to open the file twice for this.
# Set up the JSON payload ...
payload = {
"properties": [
{
"name": "account",
"value": obj[row].account
},
{
"name": "id",
"value": obj[row].id
},
{
"name": "company",
"value": obj[row].company
},
{
"property": "firstname",
"value": obj[row].firstname
},
{
"property": "lastname",
"value": obj[row].lastname
},
{
"property": "job_title",
"value": obj[row].job_title
},
{
"property": "address",
"value": obj[row].address
},
{
"property": "city",
"value": obj[row].city
},
{
"property": "phone",
"value": obj[row].phone
},
{
"property": "email",
"value": obj[row].email
},
{
"property": "date_added",
"value": obj[row].date_added
}
]
}
Okay, that was a LOOOONG span of code that didn't do much. At the least, tighten those inner dicts up to one line each. But better still, write a function to create your dictionary in the format you want. You can use getattr to pull the data by name from obj.
nameQuery = "{first} {last}".format(first=obj[row].firstname, last=obj[row].lastname)
# Get a list of all contacts for a certain company.
contactCheck = "https://api.hubapi.com/contacts/v1/search/query?q={query}&hapikey={hapikey}".format(hapikey=hapikey, query=nameQuery)
# Convert the payload to JSON and assign it to a variable called "data"
data = json.dumps(payload)
# Defined the headers content-type as 'application/json'
headers = {'content-type': 'application/json'}
contactExistCheck = requests.get(contactCheck, headers=headers)
Here you're encoding details of the API into your code. Consider pulling them out into functions. (That way, you can come back later and build a module of them, to re-use in your next program.) Also, beware of comments that don't actually tell you anything. And feel free to pull that together as a single paragraph, since it's all in service of the same key thing - making an API call.
for i in contactExistCheck.json()[u'contacts']:
# ... Get the canonical VIDs
canonicalVid = i[u'canonical-vid']
if canonicalVid:
print ("{theContact} exists! Their VID is \"{vid}\"".format(theContact=obj[row].firstname, vid=canonicalVid))
print ("Attempting to update their company...")
contactCompanyUpdate = "https://api.hubapi.com/companies/v2/companies/{companyID}/contacts/{vid}?hapikey={hapikey}".format(hapikey=hapikey, vid=canonicalVid, companyID=obj[row].id)
doTheUpdate = requests.put(contactCompanyUpdate, headers=headers)
if doTheUpdate.status_code == 200:
print ("Attempt Successful! {theContact}'s has an updated company.\n".format(theContact=obj[row].firstname))
break
else:
print ("Attempt Failed. Status Code: {status}. Company or Contact not found.\n".format(status=doTheUpdate.status_code))
I'm not sure if this last bit should be an exception or not. Is an "Attempt Failed" normal behavior, or does it mean that something is broken?
At any rate, please look into the API you are using. I'd bet there is some more information available for minor failures. (Major failures would be the internet is broken or their server is offline.) They might provide an "errors" or "error" field in their return JSON, for example. Those should be logged or printed with your failure message.
def createOrUpdateClient():
Mostly this function has the same issues as the previous one.
else:
print ("Contact Marko for assistance.\n")
Except here. Never put your name in someplace like this. Or you'll still be getting calls on this code 10 years from now. Put your department name ("IT Operations") or a support number. The people who need to know will already know. And the people who don't need to know can just notify the people that already know.
if __name__ == "__main__":
# Run the Create or Update function
createOrUpdateClient()
# Give the previous function 5 seconds to take effect.
sleep(5.0)
# Run the Company Update function
contactCompanyUpdate()
print("Sync complete.")
print("Moving \"{something}\" to the archive folder...".format(something=theCSV))
# Cron version
#shutil.move( i, "/home/accountName/public_html/clientFolder/archive/" + os.path.basename(i))
# Local version
movePath = "archive/{thefile}".format(thefile=theCSV)
shutil.move( i, movePath )
print("Move successful! Exiting...\n")
This was awkward. You might consider taking some command line arguments and using them to determine your behavior.
sys.exit()
And don't do this. Never put an exit() at module scope, because it means you can't possibly import this code. Maybe someone wants to import it to parse the docstrings. Or maybe they want to borrow some of those API functions you wrote. Too bad! sys.exit() means always having to say "Oh, sorry, I'll have to do that for you." Put it at the bottom of your actual __name__ == "__main__" code. Or, since you aren't actually passing a value, just remove it entirely.

JSON Dump in Python CGI starts with trailing dots

The data returned from run_report returns a python dictionary, where it is then parsed into JSON String and printed so it can be accessed by JSON. The run_report function also creates a .json file in which I can access later:
print "Content-type: application/json\n"
json_data = run_report(sites_list, tierOne, dateFrom, dateTo, filename, file_extension)
print json.dumps(json_data, indent=4, sort_keys=True)
However, when it prints, I receive this output:
..{
"data": {
"FR": 1424068
},
"tierone": {
"countries": [
"US",
"BR",
...
],
"ratio": 100.0,
"total": 1424068,
"total_countries": 1
},
"total": 1424068,
"total_countries": 1
}
What I don't understand is how those trailing dots even show up. The dots, however, do not show up if I were to open one of the .json files I created with the run_report function and print the read data file.
def open_file(file_extension, json_file):
with open(file_extension + json_file) as data_file:
data = json.load(data_file)
return json.dumps(data)
json_data = open_file(file_extension, filename)
print json_data

Something else is producing those . characters; the json.dumps() function never adds those.
Make sure nothing else is writing to sys.stdout; everything you send to sys.stdout is sent to the browser (print writes to sys.stdout by default).
From your comments I understand you wanted to write additional information to the server logs; do not use sys.stdout for that; write to sys.stderr instead.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

PDF long text extraction to JSON in Python - python

Related

Find string in text file and print strings near it

How to load a JSON file in Python keeping some escaped Unicode characters

How to handle \n in json files in python

How can I improve this script to make it more pythonic?

JSON Dump in Python CGI starts with trailing dots

Categories

Resources