Building a solr index using large text file - python

I have a large text file in following format:
00001,234234|234|235|7345
00005,788|298|234|735
You can treat the values before the comma as keys. What I want to do is take a quick-and-dirty approach to query these keys and find the result sets for each key. After reading a bit I found out that Solr provides a good framework to do this.
What would be the starting point?
Can I use Python to read the file and build this index (search engine) using Solr?
Is there a different mechanism to do this?

You can definitely do that using pysolr, which is a Python library. If the data is in key-value form you can read it in Python and index it as shown here:
https://pypi.python.org/pypi/pysolr/3.1.0
To have more control over search, you need to modify the schema.xml file so that it defines fields for the keys you have in your text file.
Once you have the data ingested in Solr you can follow the above link to perform searches.
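A minimal sketch of that flow, assuming a local Solr instance with a core named "mycore" and a schema that defines an "id" field plus a multivalued "values" field (all of those names are placeholders, not from the question):

import pysolr

# connect to a local Solr core; the URL and core name "mycore" are assumptions
solr = pysolr.Solr('http://localhost:8983/solr/mycore', timeout=10)

docs = []
with open('data.txt') as f:
    for line in f:
        key, _, values = line.strip().partition(',')
        # store the pipe-separated numbers in a multivalued field
        docs.append({'id': key, 'values': values.split('|')})

solr.add(docs, commit=True)

# query back one of the keys
for result in solr.search('id:00001'):
    print(result)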

You can index your data directly in Solr using the UpdateCSV handler: you just need to specify the destination field names in the fieldnames parameter of your curl call (or add them as the first line of your file if that is easier). No custom code needed.
Do remember to check that the destination field for the |-separated values splits them into tokens on that character.
See https://wiki.apache.org/solr/UpdateCSV for details.
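For reference, here is a hedged sketch of driving that same handler from Python with the requests library instead of curl (the core name, field names, and file path are assumptions; the wiki page above has the authoritative parameter list):

import requests

# UpdateCSV handler URL; the core name "mycore" is an assumption
url = 'http://localhost:8983/solr/mycore/update/csv'

params = {
    'commit': 'true',
    'fieldnames': 'id,values',    # destination fields for the two columns
    'f.values.split': 'true',     # split the second column into tokens
    'f.values.separator': '|',    # ...on the pipe character
}

with open('data.txt', 'rb') as f:
    resp = requests.post(url, params=params, data=f,
                         headers={'Content-Type': 'text/csv; charset=utf-8'})
print(resp.status_code)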

Related

Pyspark: Read in only certain fields from nested json data

I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, and then write out to file (S3) again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that are present in all files, and these are the only ones I want.
Is it possible to write a partial schema that reads only the specific fields I want into Spark using PySpark?
First, it is recommended to use a JSONL file in which each line contains a single JSON input record. In general, you can read just a specific set of fields from a big JSON document, but that should not be considered Spark's job. You should have an initial method that converts your JSON input into an object of a serializable datatype, and feed that object into your Spark pipeline.
Passing the schema is not an appropriate design and just makes the problem more severe. Instead, define a single method and extract the specific fields after loading the data from the files, as in the sketch below. You can use the following link to see how to extract specific fields and values from a JSON string in Python: How to extract specific fields and values from a JSON with python?
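A rough sketch of that approach, assuming the files can be read as JSON lines and the "core" field names are known up front (the field names, S3 paths, and helper function are illustrative only):

import json
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# the ~100 "core" columns you actually want (names are placeholders)
CORE_FIELDS = ['id', 'name', 'timestamp']

def extract_core(line):
    # parse one JSON record and keep only the core fields
    record = json.loads(line)
    return Row(**{field: record.get(field) for field in CORE_FIELDS})

# read the raw JSON lines, keep only the core fields, then build a DataFrame
raw = spark.sparkContext.textFile('s3://my-bucket/input/*.json')
df = raw.map(extract_core).toDF()
df.write.json('s3://my-bucket/output/')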

Writing CSV from Elasticsearch result using python with records exceeding 10000 ?

I'm able to create the CSV using the solution provided here:
Export Elasticsearch results into a CSV file
but a problem arises when the number of records exceeds 10,000 (size=10000). Is there any way to write all the records?
The method you gave in your question uses Elasticsearch's Python API, and es.search does have a 10,000-document retrieval limit.
If you want to retrieve more than 10,000 documents, then, as suggested by dshockley in the comment, you can try the scroll API. Or you can try Elasticsearch's scan helper, which automates a lot of the work of using the scroll API. For example, you won't need to get a scroll_id and pass it to the API, which would be necessary if you used scroll directly.
When using helpers.scan, you need to specify index and doc_type as parameters when calling the function, or write them in the query body. Note that the parameter name is 'query' rather than 'body'.
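A hedged sketch of the scan-to-CSV flow (the index name, doc type, query, and CSV columns are assumptions):

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['http://localhost:9200'])

# the keyword is 'query', not 'body'; index and doc_type names are assumptions
hits = helpers.scan(es,
                    query={'query': {'match_all': {}}},
                    index='myindex',
                    doc_type='mytype')

fieldnames = ['field1', 'field2']   # CSV columns you want (placeholders)

with open('export.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit['_source'])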

Can I query a YAML dataset in Python?

Similar to Is there a query language for JSON? and the more specific How can I filter a YAML dataset with an attribute value? - I would like to:
hand-edit small amounts of data in YAML files
perform arbitrary queries on the complete dataset (probably in Python, open to other ideas)
work with the resulting subset in Python
It doesn't appear that PyYAML has a feature like this, and today I can't find the link I had to the YQuery language, which wasn't a mature project anyway (or maybe I dreamt it).
Is there a (Python) library that offers YAML queries? If not, is there a Pythonic way to "query" a set of objects other than just iterating over them?
I don't think there is a direct way to do it. But PyYAML reads YAML files into a dict representing everything in the file, and afterwards you can perform all the usual dict operations on it. The question python query keys in a dictionary based on values mentions some Pythonic "query" styles.
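For example, a minimal sketch of that approach (the YAML layout and field names are made up for illustration):

import yaml

# assume a YAML file containing a list of mappings, e.g.
#   - {name: alice, role: admin}
#   - {name: bob, role: user}
with open('data.yaml') as f:
    records = yaml.safe_load(f)

# "query" the parsed data with ordinary Python
admins = [r for r in records if r.get('role') == 'admin']
print(admins)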
bootalchemy provides a means to do this via SQLAlchemy. First, define your schema in a SQLAlchemy model. Then load your YAML into a SQLAlchemy session using bootalchemy. Finally, perform queries on that session. (You don't have to commit the session to an actual database.)
Example from the PyPI page (assume model is already defined):
from bootalchemy.loader import Loader
# (this simulates loading the data from YAML)
data = [{'Genre': [{'name': 'action',
                    'description': 'Car chases, guns and violence.'}]}]
# load YAML data into session using pre-defined model
loader = Loader(model)
loader.from_list(session, data)
# query SQLAlchemy session
genres = session.query(Genre).all()
# print results
print [(genre.name, genre.description) for genre in genres]
Output:
[('action', 'Car chases, guns and violence.')]
You could try to use jsonpath. Yes, that's meant for JSON, not YAML, but as long as you have JSON-compatible data structures this should work, because you're operating on the parsed data, not on the JSON or YAML representation. (It seems to work with the Python libraries jsonpath and jsonpath-rw.)
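A small sketch of that combination, assuming the jsonpath-rw package and an illustrative YAML structure:

import yaml
from jsonpath_rw import parse

document = yaml.safe_load("""
books:
  - {title: Dune, year: 1965}
  - {title: Neuromancer, year: 1984}
""")

# jsonpath-rw operates on the parsed Python structures, not on the YAML text
expr = parse('books[*].title')
print([match.value for match in expr.find(document)])   # ['Dune', 'Neuromancer']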
You can check the following tools:
yq for CLI queries, like with jq,
yaml-query, another CLI query tool written in Python.

Python - Validation with multiple schemas using lxml

I'm working with a schema that was built by a third party and I'd like to validate documents against it with lxml. The problem is that the schema is split across several xsd files, which reference each other.
For example, a file called "extension.xsd" (which builds upon the "master" schema) has a line like:
<redefine schemaLocation="master.xsd">
If I try to validate with lxml (parsing the schema, constructing an XMLSchema, then validating another document that I know is valid), I only get validation against "extension" and not "master": in other words, the validation fails (because the XML file contains elements that are only present in "master" and not in "extension").
How can I solve (or work around) this issue? Thanks!
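For context, the validation flow described above looks roughly like this (the file names extension.xsd and master.xsd come from the question; the rest is an assumption):

from lxml import etree

# parse the extension schema, which redefines parts of master.xsd
schema_doc = etree.parse('extension.xsd')
schema = etree.XMLSchema(schema_doc)

doc = etree.parse('document.xml')   # a document known to be valid
if not schema.validate(doc):
    # elements defined only in master.xsd show up as errors here
    print(schema.error_log)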
If lxml doesn't support "redefine", the best option would be to fix it and submit a patch. :)
Failing that, the workaround would be to parse the master.xsd file yourself, apply the changes from extension.xsd, and output a single xsd file with the combined schema.

Using upload_data on Google AppEngine doesn't let me update entities with id based keys

This seems so basic - I must be missing something.
I am trying to download my entities, update a few properties, and upload the entities again. I'm using the Django-nonrel and App Engine projects, so all the entities are stored with id-based rather than name-based keys.
I can download the entities to csv fine, but when I upload (via appcfg.py upload_data ...), the keys come in as name=... rather than id=...
In the config file, I added -
import_transform: transform.create_foreign_key('auth_user', key_is_id=True)
to see if this would, as the documentation for transform states, "convert the key into an integer to be used as an id." With this import_transform, I get this error -
ErrorOnTransform: Numeric keys are not supported on input at this time.
Any ideas?
As the error message indicates, overwriting entities with numeric IDs isn't currently supported. You may be able to work around it by providing a post-upload function that recreates the entity with the relevant key, but I'd suggest stepping back and analyzing why you're doing this - why not just update the entities in-place on App Engine, or use remote_api to do this? Doing a bulk download and upload seems a cumbersome way to handle it.
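For instance, a hedged sketch of an in-place update using the old db API (the model and property names are placeholders; with Django-nonrel your actual models are Django models), which could be run via remote_api or a task:

from google.appengine.ext import db

class AuthUser(db.Model):
    # placeholder model/property; your real model lives in the Django app
    is_active = db.BooleanProperty()

# update the entities in place so their numeric ids are preserved
updated = []
for entity in AuthUser.all():
    entity.is_active = True        # whatever property change you need
    updated.append(entity)
db.put(updated)                    # batch write keeps the existing keys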
