Determine read or write query neo4j - python

Given a string containing a Neo4j Cypher query, how can I quickly determine in Python whether it is a database read or a database write?
Currently I have thought of two approaches:
Check for keywords like CREATE, DELETE, etc. to tag write queries,
and
MATCH, START, etc. to tag read queries.
Alternatively, check for patterns according to the Neo4j refcard and write a parser for them accordingly.
Method 1 fails on a query like this, where create appears as a property name rather than a keyword:
MATCH (n:Person {id:1, create:3}) return n
And method 2 looks too heavyweight for a seemingly small task.
Any other/better ideas for doing this?

You can prefix the query with EXPLAIN and inspect the operatorType values of its execution plan:
EXPLAIN MERGE (n:Person)
and check the operators found against the known values for creating, updating and deleting, something like:
"operatorType": "MergeCreateNode"
"operatorType": "CreateNode"
"operatorType": "MergeCreateRelationship"

Related

How to create a SQL table from several SQL files?

All of this is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with two columns, name and query, so that I can access each file later on using a SQL query instead of loading it from its file path. How can I do this? I am free to use any tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would take less time than simply accessing the pulled file on the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in using something along the lines of the following (you did not mention which database you are using):
CREATE TABLE queries (
    name  TEXT PRIMARY KEY,
    query TEXT
);
After creating the table, you can use something like os.walk to iterate through the files in your repository, and insert both the contents (e.g. file.read()) and the name of each file into the table you created previously.
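A minimal sketch of that idea using Python's built-in sqlite3 module (the database engine, repository path and table layout are assumptions, since the question doesn't specify them):

import os
import sqlite3

conn = sqlite3.connect("queries.db")
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

repo_path = "/path/to/sql/repo"  # placeholder: your checked-out git repository
for root, _dirs, files in os.walk(repo_path):
    for filename in files:
        if filename.endswith(".sql"):
            with open(os.path.join(root, filename)) as f:
                # INSERT OR REPLACE so re-running after a new git pull
                # refreshes the stored contents instead of failing.
                conn.execute(
                    "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                    (filename, f.read()),
                )
conn.commit()
conn.close()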
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up using profiling tools. After that, you can see whether IO is your bottleneck. Otherwise, you may do all of this work without any benefit.
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.

Storing and querying a large amount of data

I have a large amount of data, around 50 GB worth, in a CSV which I want to analyse for ML purposes. It is, however, far too large to fit in memory in Python. I would ideally like to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place. I realise I probably can't load it all in at once, so would I do it iteratively? If so, what should I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data and still allow me to query and do feature engineering quickly? Whatever I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before my data set is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'd need to create a table with the same fields (and appropriate data types).
One of the fields may be unique per line and can be used later to find that line. That's your candidate for a PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to search for data later. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together.
In order to read in the data, you have two ways:
Use LOAD DATA INFILE (see the Load Data Infile documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables and execute the prepared statement with that line's values (a sketch of this approach is shown below, after these tips).
You may also benefit from a web page designed to search the data; that depends on who needs to use it.
Hope this gives you some ideas
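
A minimal sketch of the second approach with the mysql-connector-python package; the table, column names, file name and connection credentials are placeholders, and executemany stands in for the prepared statement described above:

import csv
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="me", password="secret", database="mydb"
)
cursor = conn.cursor()

# Placeholder table/columns: create them first to match your CSV header.
insert_stmt = "INSERT INTO measurements (id, col_a, col_b) VALUES (%s, %s, %s)"

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row, if present
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 10_000:      # insert in chunks so memory stays bounded
            cursor.executemany(insert_stmt, batch)
            conn.commit()
            batch = []
    if batch:
        cursor.executemany(insert_stmt, batch)
        conn.commit()

cursor.close()
conn.close()

LOAD DATA INFILE is generally faster for a one-off bulk load; the script approach wins when you need to clean or transform fields on the way in.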
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL lets you write SQL queries against your dataset. For best performance, though, you need a distributed mode (you can use it on a local machine, but the result is limited) and a powerful machine. You can write your code in Python, Scala or Java.
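A minimal PySpark sketch in local mode along those lines; the file name, view name, columns and the query itself are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-analysis").master("local[*]").getOrCreate()

# Spark reads the 50 GB CSV in partitions, so it never has to fit in memory at once.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("measurements")

result = spark.sql("SELECT col_a, AVG(col_b) AS avg_b FROM measurements GROUP BY col_a")
result.show()

# toPandas() pulls the (much smaller) aggregate back into Python for the ML step.
features = result.toPandas()

spark.stop()

Running in local[*] mode on a single personal computer still works because of the partitioned processing, but it will not match the performance of an actual cluster.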

add if no duplicate in collection mongodb python

I'm writing a script that puts a large number of XML files into MongoDB, so when I execute the script multiple times the same object is added many times to the same collection.
I looked for a way to stop this behaviour by checking for the existence of the object before adding it, but can't find one.
Help!
The term for the operation you're describing is "upsert".
In mongodb, the way to upsert is to use the update functionality with upsert=True.
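A minimal pymongo sketch of that pattern; the database/collection names and the choice of file_id as the natural key are assumptions, and replace_one is one of the update-family calls that accept upsert=True:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client.mydb.xml_docs

def save_document(doc):
    # replace_one with upsert=True inserts the document if no match exists,
    # and replaces it otherwise, so re-running the script never duplicates it.
    collection.replace_one({"file_id": doc["file_id"]}, doc, upsert=True)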
You can index on one or more fields (other than _id) of the document/XML structure. Then use a find query to check whether a document with that indexed_field:value is already present in the collection. If it returns nothing, you can insert the new document. This ensures only new docs are inserted when you re-run the script.
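A minimal sketch of that check-then-insert approach (field and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client.mydb.xml_docs

# The index makes the existence check cheap on a large collection.
collection.create_index("file_id")

def insert_if_new(doc):
    if collection.find_one({"file_id": doc["file_id"]}) is None:
        collection.insert_one(doc)

Note that check-then-insert is not atomic; if the script might ever run concurrently, a unique index (create_index("file_id", unique=True)) or the upsert shown above is the safer choice.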

How can I specify the current ResultObject in Geomagics with a python script

I am trying to automate the report creation in Geomagics, using the create_report() function.
However, we have several sets of results, which need to be reviewed by a human operator (within the Geomagics interface) before the various reports can be created if the results are considered acceptable. Since create_report() works on the current ResultObject, I'd like to be able to set this to all my results in a loop.
(Alternatively, there might be a way to write a report for a specific object, not just the current result?)
Can you break the problem down further? How should the operator see the results, as a spreadsheet or in some other way?
For example, could you script outside of GeoMagic to fetch the result sets and display them to the operator, write the approved results back to another dataset,
and then, at the end, create the report within GeoMagic from the "approved" dataset?

50 million+ Rows of Data - CSV or MySQL

I have a CSV file which is about 1 GB big and contains about 50 million rows of data, and I am wondering whether it is better to keep it as a CSV file or store it in some form of database. I don't know enough about MySQL to argue for why I should use it, or another database framework, over just keeping it as a CSV file. I am basically doing a breadth-first search with this dataset, so once I get the initial "seed" set from the 50 million rows, I use it as the first values in my queue.
Thanks,
I would say that there are a wide variety of benefits to using a database over a CSV for such large structured data, so I would suggest that you learn enough to do so. However, based on your description you might want to check out non-server/lighter-weight databases, such as SQLite or something similar to JavaDB/Derby, or, depending on the structure of your data, a non-relational (NoSQL) database; obviously you will need one with some kind of Python support.
If you want to search on something graph-ish (since you mention breadth-first search), then a graph database might prove useful.
Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.
If you need to do lookups, then something that lets you index the data, like MySQL, would be better.
From your previous questions, it looks like you are doing social-network searches against Facebook friend data; so I presume your data is a set of 'A is-friend-of B' statements, and you are looking for the shortest connection between two individuals?
If you have enough memory, I would suggest parsing your csv file into a dictionary of lists. See Can this breadth-first search be made faster?
If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
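A minimal sketch of that in-memory approach, assuming each CSV row is person_a,person_b meaning "A is-friend-of B" (the file name is a placeholder):

import csv
from collections import defaultdict, deque

# Build an adjacency list: person -> list of friends.
graph = defaultdict(list)
with open("friends.csv", newline="") as f:
    for a, b in csv.reader(f):
        graph[a].append(b)
        graph[b].append(a)          # drop this line if the relation is one-way

def shortest_path(start, goal):
    """Breadth-first search returning one shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for friend in graph[node]:
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None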
There are also some python modules which might help:
graph-tool http://projects.skewed.de/graph-tool/
python-graph http://pypi.python.org/pypi/python-graph/1.8.0
networkx http://networkx.lanl.gov/
igraph http://igraph.sourceforge.net/
How about a key-value store like MongoDB?
