Scrapy Field Names that are not as per Python variable restrictions

Is it possible to have field names that do not conform to Python variable naming rules? To elaborate, is it possible to have the field name appear as "Job Title" instead of "job_title" in the export file? While this may not be useful in JSON or XML exports, such functionality might be useful when exporting in CSV format, for instance if I need to import this data into another system that is already configured to accept CSVs with certain field names.
I tried reading the Item Pipelines documentation, but it appears to cover what happens after "an item has been scraped by a spider", not the field names themselves (I could be totally wrong, though).
Any help in this direction would be much appreciated!

I would suggest using a third-party library called scrapy-jsonschema. With it you can define your Items like this:
from scrapy_jsonschema.item import JsonSchemaItem

class MyItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "MyItem",
        "description": "My Item with spaces",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for the employee",
                "type": "integer"
            },
            "name": {
                "description": "Name of the employee",
                "type": "string"
            },
            "job title": {
                "description": "The title of employee's job.",
                "type": "string",
            }
        },
        "required": ["id", "name", "job title"]
    }
And populate it like this:
item = MyItem()
item['job title'] = 'Boss'
You can read more about it in the scrapy-jsonschema documentation.
This solution addresses the Item definition as you asked, but you can achieve similar results without defining an Item. For example, you could just scrape the data into a dict and yield it back to Scrapy:
yield {
    "id": response.xpath('...').get(),
    "name": response.xpath('...').get(),
    "job title": response.xpath('...').get(),
}
Running scrapy crawl myspider -o file.csv would then export to a CSV whose columns have the names you chose.
You could also have the spider write directly into a CSV, or do it in its pipeline, etc. There are several ways to do it without an Item definition.
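As a rough illustration of the "write the CSV yourself" route, here is a minimal sketch using only the standard library. It assumes the scraped items are plain dicts with Python-style keys; HEADER_MAP, rename_fields and items_to_csv are hypothetical names made up for this example, not part of Scrapy.

```python
import csv
import io

# Hypothetical mapping from Python-style keys to the headers
# the target system expects (an assumption for this example).
HEADER_MAP = {"id": "ID", "name": "Name", "job_title": "Job Title"}


def rename_fields(item, header_map=HEADER_MAP):
    """Return a new dict whose keys are the export headers."""
    return {header_map.get(key, key): value for key, value in item.items()}


def items_to_csv(items, header_map=HEADER_MAP):
    """Write the items to CSV text, using the mapped names as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(header_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        writer.writerow(rename_fields(item, header_map))
    return buffer.getvalue()


csv_text = items_to_csv([{"id": 1, "name": "Ann", "job_title": "Boss"}])
print(csv_text)
```

The same rename_fields step could presumably be called from an item pipeline's process_item method before the item reaches the exporter, which keeps the spider code free of the export naming.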

Related

How to search for a value in two different type of field or index or heading of mongodb using python?

I am new to any kind of programming. This is an issue I encountered when using mongodb. Below is the collection structure of the document I imported from two different csv files.
{
    "_id": {
        "$oid": "61bc4217ed94f9d5fe6a350c"
    },
    "Telephone Number": "8429950810",
    "Date of Birth": "01/01/1945"
}
{
    "_id": {
        "$oid": "61bc4217ed94f9d5fe6a350c"
    },
    "Telephone Number": "8129437810",
    "Date of Birth": "01/01/1998"
}
{
    "_id": {
        "$oid": "61bd98d36cc90a9109ab253c"
    },
    "TELEPHONE_NUMBER": "9767022829",
    "DATE_OF_BIRTH": "16-Jun-98"
}
{
    "_id": {
        "$oid": "61bd98d36cc9090109ab253c"
    },
    "TELEPHONE_NUMBER": "9567085829",
    "DATE_OF_BIRTH": "16-Jan-91"
}
The first two entries are from one CSV file and the next two are from another. I am now creating a user interface where users can search for a telephone number. How do I write a query that searches for the telephone number value in both fields (Telephone Number and TELEPHONE_NUMBER) using find() in the above case? If that is not possible, is there a way to change the field names to a desired format while importing the CSV into the database? Or could I create two different collections, import each CSV into its own collection, and then search both collections together? Or can we create a compound index and search that instead? I am using pymongo for all the operations.
Thank you.

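You can use an $or query when different keys are used to store the same type of data. Note that the value should be a string in both clauses, since that is how the numbers are stored in the documents above:
yourmongocoll.find({"$or": [{"Telephone Number": "8429950810"}, {"TELEPHONE_NUMBER": "8429950810"}]})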
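If more header variants turn up later, the $or filter can be built from a list of field names rather than written out by hand. This is only a sketch: build_any_field_query and PHONE_FIELDS are hypothetical names, and the resulting dict is meant to be passed to find() on your own collection.

```python
def build_any_field_query(value, field_names):
    """Build a MongoDB filter that matches `value` under any of the given keys."""
    return {"$or": [{name: value} for name in field_names]}


# The two header spellings from the imported CSV files.
PHONE_FIELDS = ["Telephone Number", "TELEPHONE_NUMBER"]

# The value is a string because the imported documents store it as a string.
query = build_any_field_query("8429950810", PHONE_FIELDS)
print(query)
```

With a pymongo collection in hand, the call would then look like collection.find(query).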

Assuming you have your connection string for connecting via pymongo, the following is an example of how to query for the telephone number "8429950810":
from pymongo import MongoClient

client = MongoClient("connection_string")
db = client["db"]
collection = db["collection"]
results = collection.find({"Telephone Number": "8429950810"})
Please note that this returns a cursor. If you would like your documents in a list, consider wrapping the query in list(), like so:
results = list(collection.find({"Telephone Number": "8429950810"}))
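On the question of changing the field names while importing: one option is to normalize the CSV headers before the rows are inserted, so every document ends up with the same keys. A minimal sketch, assuming the upper-case-with-underscores style of the second file is the desired format; normalize_key and normalized_rows are hypothetical helpers, and the in-memory sample stands in for a real CSV file.

```python
import csv
import io


def normalize_key(key):
    """Turn a header such as 'Telephone Number' into 'TELEPHONE_NUMBER'."""
    return "_".join(key.strip().upper().split())


def normalized_rows(csv_file):
    """Yield one dict per CSV row, with every header normalized."""
    for row in csv.DictReader(csv_file):
        yield {normalize_key(key): value for key, value in row.items()}


# In-memory stand-in for the first CSV file.
sample = io.StringIO("Telephone Number,Date of Birth\n8429950810,01/01/1945\n")
docs = list(normalized_rows(sample))
print(docs)
```

The resulting list could then be handed to collection.insert_many(docs), after which a single-field find() on TELEPHONE_NUMBER would cover both sources.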

Find string in text file and print strings near it

I'll try to explain my goal. I have to write reports based on a document sent to me that has common strings in it. For example, the document sent to me contains data like:
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
    "firstName": "Google",
    "lastName": "Reviewer",
    "addresses": {
        "address": [
            {
                "street1": "1600 Ampitheater Parkway",
                "street2": null,
                "city": "Mountainview",
                "postalCode": "94043",
                "state": "CA",
                "nonUsaState": null,
                "country": "US",
                "type": "BUSINESS"
This is an example of the 'raw' data. It is also presented in a PDF. I have tried scraping the PDF using tabula, but there seems to be some issue with fonts, so I only get about 10% of the text. I am thinking that going after the raw data will be more accurate and easier (if you think scraping the PDF would be easier, please let me know).
So I used this code:
with open('filetobesearched.txt', 'r') as searchfile:
    for line in searchfile:
        if 'reportId' in line:
            print(line)
        if 'dateReceived' in line:
            print(line)
        if 'firstName' in line:
            print(line)
This is where the trouble starts: there are multiple occurrences of the string 'firstName' in the file, so my code as it stands prints each of them one after the other. In the raw file those fields exist in different sections, each preceded by a section header such as 'reportingEsp' in the example above. So I'd like my code to somehow know that one 'firstName' string belongs to a given section and the next occurrence belongs to another section, and print each with its section (make sense?).
Eventually I'd like to parse out the address information but omit any fields with a null value.
And ultimately I'd like the data written to a file that I could in turn import into my report template to fill those fields as applicable. That seems like a huge thing to me, so I'll be happy with help simply parsing through the raw data and outputting the results to a file in the proper order.
Thanks in advance for any help!

Thanks, yes TIL - it's json data. So I accomplished my goal like this:
JSON data:
"reportId": 84561234,
"dateReceived": "2020-01-19T17:54:31.000+0000",
"reportingEsp": {
    "firstName": "Google",
    "lastName": "Reviewer",
    "addresses": {
        "address": [
            {
                "street1": "1600 Ampitheater Parkway",
                "street2": null,
                "city": "Mountainview",
                "postalCode": "94043",
                "state": "CA",
                "nonUsaState": null,
                "country": "US",
                "type": "BUSINESS"
My code:
import json

# Read and parse the JSON file (the with block closes it afterwards)
with open('file.json', 'r') as myjsonfile:
    obj = json.load(myjsonfile)

# Pull values out of the parsed data to populate the report variables
rptid = str(obj['reportId'])
dateReceived = str(obj['dateReceived'])
print('Report ID: ', rptid)
print('Date Received: ', dateReceived)
So now that I have those as variables I am trying to use them to fill a docx template... but that's another question, I think.
Consider this one answered. Thanks again!
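For the two follow-up goals in the question (keeping each 'firstName' tied to its section, and omitting null fields), something along these lines might work once the data is parsed as JSON. It is a sketch under assumptions: SAMPLE is the fragment from the question with the outer and closing braces filled in by me so that it parses, and drop_nulls is a hypothetical helper.

```python
import json

# Assumption: the question's fragment, completed with an opening brace and the
# closing brackets so it is valid JSON. The real file may hold more sections.
SAMPLE = """
{
  "reportId": 84561234,
  "dateReceived": "2020-01-19T17:54:31.000+0000",
  "reportingEsp": {
    "firstName": "Google",
    "lastName": "Reviewer",
    "addresses": {
      "address": [
        {
          "street1": "1600 Ampitheater Parkway",
          "street2": null,
          "city": "Mountainview",
          "postalCode": "94043",
          "state": "CA",
          "nonUsaState": null,
          "country": "US",
          "type": "BUSINESS"
        }
      ]
    }
  }
}
"""


def drop_nulls(value):
    """Recursively remove dict entries and list elements that are None."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value


report = json.loads(SAMPLE)

# Indexing by section name keeps this firstName separate from any other one.
esp = drop_nulls(report["reportingEsp"])
address = esp["addresses"]["address"][0]

print("Reporting ESP first name:", esp["firstName"])
for field_name, field_value in address.items():
    print(field_name, "=", field_value)
```

Because each section is reached through its own key, a 'firstName' in another section would be read the same way under that section's name.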

What type specs are allowed when reading json into a pandas dataframe?

I want to use the pandas.read_json() function to import data into a pandas dataframe and I'm going to use table for the orient parameter in order to be able to provide data type information. In this case the input json has a schema property which can be used to specify input metadata, like this:
{
    "schema": {
        "fields": [
            {
                "name": "col1",
                "type": "integer"
            },
            {
                "name": "col2",
                "type": "string"
            },
            {
                "name": "col3",
                "type": "integer"
            }
        ],
        "primaryKey": [
            "col1"
        ]
    },
    "data": [...]
}
However, the pandas documentation (section "Encoding with Table Schema") does not elaborate on what kind of type specifications I can use. The "schema" property name suggests that maybe the type specs from json schema are supported? Can anyone confirm or otherwise provide info on the supported type specs?

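The schema that applies for orient='table' in both pandas.read_json() and DataFrame.to_json() is the Table Schema specification from Frictionless Data (not JSON Schema), and that specification is where the supported field types are documented.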

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/veoibd_schema.json",
    "title": "clinical data manifest schema",
    "description": "Validates clinical data manifests",
    "type": "object",
    "properties": {
        "individualID": {
            "type": "string",
            "pattern": "^CENTER-"
        },
        "medicationAtDx": {
            "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
        }
    },
    "required": [
        "individualID",
        "medicationAtDx"
    ]
}
The schema referenced by the $ref looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/clinicalData.json",
    "definitions": {
        "ageDxYears": {
            "description": "Age in years at diagnosis",
            "type": "number",
            "minimum": 0,
            "maximum": 90
        },
        "ageOnset": {
            "description": "Age in years of first symptoms",
            "type": "number",
            "exclusiveMinimum": 0
        },
        "medicationAtDx": {
            "description": "Medication prescribed at diagnosis",
            "type": "string"
        }
    }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to be able to figure out the "type" of "medicationAtDx" and am trying to figure out how to use jsonschema.RefResolver to de-reference it, but am a little lost in the terminology used in the documentation and can't find a good example that explains what the parameters are and what it returns "in small words", i.e. something that a beginning JSON schema user would easily understand.
I created a RefResolver from the clinical manifest schema:
import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)
I fed it the url in the "$ref":
meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment as follows:
ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)
This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back into the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding the exact parameters that resolve_fragment is expecting.
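One way to see what a fragment is without going through RefResolver at all: split the $ref at the "#", load the document part, and walk the JSON pointer by hand. The sketch below uses only the standard library and an inline copy of the referenced schema instead of fetching it. resolve_pointer is a hypothetical helper, not part of jsonschema. As far as I can tell from the jsonschema documentation, resolve_fragment expects the already-loaded document (a dict) as its first argument rather than the URL string, which would explain the TypeError above, but treat that as an assumption to check.

```python
from urllib.parse import unquote, urldefrag

# Inline copy of the (abbreviated) referenced schema from the question,
# standing in for the document that would normally be downloaded.
CLINICAL_DATA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/clinicalData.json",
    "definitions": {
        "ageDxYears": {"type": "number", "minimum": 0, "maximum": 90},
        "ageOnset": {"type": "number", "exclusiveMinimum": 0},
        "medicationAtDx": {
            "description": "Medication prescribed at diagnosis",
            "type": "string",
        },
    },
}


def resolve_pointer(document, fragment):
    """Follow a JSON pointer such as '/definitions/medicationAtDx' into a document."""
    node = document
    parts = fragment.lstrip("/").split("/") if fragment else []
    for part in parts:
        # Undo URL escaping and the JSON pointer escapes (~1 is '/', ~0 is '~').
        part = unquote(part).replace("~1", "/").replace("~0", "~")
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


meddx_url = (
    "https://raw.githubusercontent.com/not-my-username/validation_schemas/"
    "reference_definitions/clinicalData.json#/definitions/medicationAtDx"
)

# urldefrag splits the reference into the document URL and the fragment.
doc_url, fragment = urldefrag(meddx_url)
resolved = resolve_pointer(CLINICAL_DATA, fragment)
print(fragment)          # /definitions/medicationAtDx
print(resolved["type"])  # string
```

Seen this way, getting all three definitions back from resolve_remote looks consistent with it returning the whole remote document; narrowing down to one definition is the separate step of applying the fragment to that document.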

Convert a JSON schema to a python class

Is there a python library for converting a JSON schema to a python class definition, similar to jsonschema2pojo -- https://github.com/joelittlejohn/jsonschema2pojo -- for Java?

So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
    'name': 'Country',
    'properties': {
        'name': {'type': 'string'},
        'abbreviation': {'type': 'string'},
    },
    'additionalProperties': False,
}
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that Warlock produces offer little in the way of introspection. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark; the only downside is that metaclasses in Python can't really be turned into 'code'.
Other useful modules to look at:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - An interesting structure builder that's halfway between a dotdict and construct
If you end up finding a good one-stop solution for this, please follow up on your question; I'd love to find one. I pored through GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really compelling.
For lack of any pre-made solutions, I'll probably cobble something together with Warlock myself. So if I beat you to it, I'll update my answer. :p

python-jsonschema-objects is an alternative to warlock, built on top of jsonschema.
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in Python.
Usage:
Sample JSON schema:
schema = '''{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        },
        "dogs": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 4
        },
        "gender": {
            "type": "string",
            "enum": ["male", "female"]
        },
        "deceased": {
            "enum": ["yes", "no", 1, 0, "true", "false"]
        }
    },
    "required": ["firstName", "lastName"]
}'''
Converting the schema to classes:
import python_jsonschema_objects as pjs
import json

schema = json.loads(schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
Person = ns.ExampleSchema
james = Person(firstName="James", lastName="Bond")
james.lastName
# u'Bond'
james
# <example_schema lastName=Bond age=None firstName=James>
Validation:
james.age = -2
# raises python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
The problem is that it still uses Draft 4 validation, while jsonschema has since moved beyond Draft 4. I filed an issue on the repo regarding this.
Unless you are using an old version of jsonschema, the above package will work as shown.

I just created this small project to generate code classes from a JSON schema. Even when working in Python, I think it can be useful in business projects:
pip install jsonschema2popo
Running the following command will generate a Python module containing the classes defined by the JSON schema (it uses jinja2 templating):
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
More info at: https://github.com/frx08/jsonschema2popo
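For very simple flat schemas, a class can also be built at runtime with nothing but the standard library. The following is a minimal sketch, not how any of the libraries above work: class_from_schema and TYPE_MAP are hypothetical names, only top-level properties are handled, and no validation (minimum, enum, maxItems, etc.) is performed.

```python
from dataclasses import field, make_dataclass
from typing import Optional

# Rough mapping from JSON Schema type names to Python types (an assumption
# for this sketch; nested objects and arrays are not expanded).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def class_from_schema(schema):
    """Build a dataclass from the top-level properties of a JSON schema."""
    required = set(schema.get("required", []))
    properties = schema.get("properties", {})
    fields = []
    # Required properties come first because they have no default value.
    for name, spec in properties.items():
        if name in required:
            fields.append((name, TYPE_MAP.get(spec.get("type"), object)))
    # Optional properties default to None.
    for name, spec in properties.items():
        if name not in required:
            py_type = TYPE_MAP.get(spec.get("type"), object)
            fields.append((name, Optional[py_type], field(default=None)))
    class_name = "".join(schema.get("title", "Model").split())
    return make_dataclass(class_name, fields)


schema = {
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["firstName", "lastName"],
}

Person = class_from_schema(schema)
james = Person(firstName="James", lastName="Bond")
print(james)
```

This only works when the property names are valid Python identifiers. Anything beyond flat schemas is where the dedicated packages earn their keep.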
