I have a few million documents. What I am trying to do is simple: process the documents to extract the information I need and load it into a database. I am doing it in Python with SQLAlchemy, and I am using multiprocessing to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relational schema defined.
However, it runs very slowly and loads only about 50k documents in 6-7 hours.
Is there any way that I can speed this task up?
Sometimes an RDBMS is not the answer. One sign of such a situation is when your data items have no relations to one another, for example, when every document stands by itself.
If you'd like to make unstructured data searchable, consider building a search index using PyLucene.
Or put the data in a non-relational database like MongoDB.
In any case, try to identify which part of your system is slowing the process down. My guess would be the database or the file system; if it is MySQL, about all you can do is throw more hardware at it.
Another way to optimize an I/O-heavy system is to switch to asynchronous programming using a library like Twisted, but it has a learning curve, so make 100% sure it is actually needed first.
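If you do profile it, one cheap first step is to time the parsing and the inserting separately, and to batch inserts per transaction instead of flushing one document at a time. The following is only a rough sketch, assuming a recent SQLAlchemy, an already existing `documents` table, and a hypothetical `parse_document()` helper that turns one XML file into a row dict:

```python
# Rough sketch: time the two stages separately so you know what to optimize.
# `parse_document()` and the `documents` table are assumptions, not your code.
import time
from sqlalchemy import create_engine, MetaData, Table

engine = create_engine("mysql://user:password@localhost/mydb")
metadata = MetaData()
documents = Table("documents", metadata, autoload_with=engine)  # reflect existing table

def load_batch(paths):
    t0 = time.perf_counter()
    rows = [parse_document(p) for p in paths]      # one dict per XML file
    t1 = time.perf_counter()
    with engine.begin() as conn:                   # single transaction per batch
        conn.execute(documents.insert(), rows)     # executemany-style bulk insert
    t2 = time.perf_counter()
    print(f"parse: {t1 - t0:.1f}s  insert: {t2 - t1:.1f}s  ({len(rows)} docs)")
```

If parsing dominates, your multiprocessing already attacks the right problem; if the insert step dominates, larger batches per transaction (or MySQL's LOAD DATA INFILE) usually help far more than extra worker processes.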
I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms that I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run in a sequential manner. Each script modifies some columns of the data and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the BigQuery table to a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
Does somebody know which of these would be the better way to accomplish this (in terms of efficiency/cost), or could you suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is no single correct answer.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside the billing of storage and focusing on the cost of the queries themselves (which is more relevant to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1 TB free quota per month, and beyond that the cost is $5 per TB of processed data, with 10 MB being the minimum measurable unit.
Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), and these are maintained for approximately 24 hours, with some exceptions that you can find in the same documentation link (they are also best-effort, so earlier deletion may happen too). Results returned from a cached table are not billed, because the cost is based on the number of bytes processed and reading a cached table involves no processing, as long as you run the exact same query. I think it is worth having a look at this feature: from your sentence "Since there are multiple scripts that use subsets of the daily data", it may be (just guessing here) that your use case allows you to run a single query once and then retrieve the results multiple times from the cached version without having to store them anywhere else (see the sketch at the end of this answer).
Partitions: BigQuery offers partitioned tables, which are individual tables divided into smaller segments by date; this makes it easier to query data on a daily basis as you require.
Speed: BigQuery is a real-time analytics platform, so you will be able to run fast queries that retrieve the information you need and apply some initial processing before feeding it into your custom Python algorithms.
So, in general, I would say that there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without having to deal with huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; but in general terms, I would just go with BigQuery on its own.
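As a quick illustration of the cache point above, here is a minimal sketch using the google-cloud-bigquery client library (the project, dataset and table names are made up); running the identical query a second time should normally come back as a cache hit with zero bytes billed:

```python
# Minimal sketch: run the same query twice and inspect the job statistics.
# Project/dataset/table names are placeholders, not from the question.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT customer_id, visit_date, visits
    FROM `my_project.my_dataset.daily_visits`
    WHERE visit_date = '2018-01-01'
"""

for attempt in (1, 2):
    job = client.query(sql)          # identical SQL is eligible for the query cache
    rows = list(job.result())        # materialize locally for the Python scripts
    print(attempt, "cache_hit:", job.cache_hit,
          "bytes billed:", job.total_bytes_billed)
```

Note that only byte-identical queries hit the cache, so if each script needs a different subset it may be cheaper to pull one superset once and slice it locally.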
I am playing around with Python and some NoSQL databases to create a file store (mainly because of the built-in replication). I tried it with MongoDB and it works, but due to the "write greedy" nature of MongoDB I moved to Cassandra and implemented the same thing. While it works, I want to know (pointing me to the docs would be fine) how to defragment the data in Cassandra. I will explain with an example: say I upload a 200 MB file, then a 20 MB file; the data size in Cassandra is now ~220 MB. If I then delete the 200 MB file, I see that the data size is still ~200 MB, so the space is not reclaimed. In MongoDB there is a command to reclaim the space (reuse the same space for new files), and I want to know how the same can be achieved in Cassandra. I am getting confused between compression and compaction.
To store the data, I am splitting each file into parts and then storing them as "blob" columns in a table.
Cassandra cleans up deleted and expired data using a process called compaction.
While you can force compactions yourself using nodetool compact, I would not recommend this as it is better to tune compaction and let it happen in the background.
That may not completely do the trick, as Cassandra has a configuration property named 'gc_grace_seconds' which prevents data marked as deleted (with a tombstone) from actually being removed until gc_grace_seconds has passed. The default is 10 days, but you can configure this to a smaller value or even set it to 0 to disable tombstones altogether.
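For reference, here is a rough sketch of lowering gc_grace_seconds from Python using the DataStax driver (keyspace and table names are made up); the usual caveat is that a very low value only makes sense if you run repairs regularly, since tombstones are also what lets deletes propagate to nodes that were down:

```python
# Sketch with the DataStax cassandra-driver; 'filestore' and 'file_chunks'
# are hypothetical names. Lowering gc_grace_seconds lets compaction purge
# tombstoned (deleted) blobs sooner than the 10-day default.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("filestore")

# Deleted data becomes purgeable after 1 hour instead of 10 days.
session.execute("ALTER TABLE file_chunks WITH gc_grace_seconds = 3600")

# Compaction reclaims the space in the background; to trigger it once by hand,
# run `nodetool compact filestore file_chunks` from a shell on the node.
cluster.shutdown()
```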
I am using Flask to make a small webapp to manage a group project. In this webapp I need to manage attendance and meeting reports. I don't have the time to get into SQLAlchemy, so I want to know what the downsides of using CSV as a database would be.
Just don't do it.
The problem with CSV is …
a) Concurrency is not possible: when two people access your app at the same time, there is no way to make sure that they don't interfere with each other and overwrite each other's changes. There is no way to solve this when using a CSV file as a backend.
b) Speed: whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree, however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM (object-relational mapper), which translates Python code into SQL and manages the database for you.
Two good options are Peewee and PonyORM. Both are easy to use and translate between Python objects and SQL in both directions. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend, and if your app grows larger you can plug in MySQL or PostgreSQL easily.
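As a rough illustration of what Peewee looks like for an app like this (the models and fields below are made up, not taken from your project):

```python
# Illustrative Peewee sketch: define models in Python, start on SQLite,
# swap the database object later for MySQL/PostgreSQL if needed.
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

db = SqliteDatabase("project.db")

class BaseModel(Model):
    class Meta:
        database = db

class Attendance(BaseModel):
    member = CharField()
    date = DateField()
    present = BooleanField(default=True)

db.connect()
db.create_tables([Attendance])

# Usage is plain Python, no hand-written SQL:
Attendance.create(member="Alice", date="2019-03-01", present=True)
absent = Attendance.select().where(Attendance.present == False)
```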
Don't do it, CSV that is.
There are many other possibilities, for instance an SQLite database, Python's shelve module, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
That is, pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use "initPersistence()". Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits: for example, you don't have to use a mocking library in tests to make them faster; you just provide another persister.
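A tiny sketch of that separation (all names are made up; the Flask routes would only ever see the persister interface):

```python
# Illustrative only: the app talks to this interface, not to CSV or SQL directly.
import csv

class ReportPersister:
    def save_new_report(self, report):
        raise NotImplementedError

    def all_reports(self):
        raise NotImplementedError

class CsvPersister(ReportPersister):
    """CSV-backed implementation; a SqlitePersister could replace it later."""

    def __init__(self, path):
        self.path = path

    def save_new_report(self, report):
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([report["date"], report["text"]])

    def all_reports(self):
        with open(self.path, newline="") as f:
            return [{"date": d, "text": t} for d, t in csv.reader(f)]

# The Flask app receives a persister at startup; tests can pass an in-memory fake.
persister = CsvPersister("reports.csv")
```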
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: there is no reason why CSV cannot be used with concurrency. Just as a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same file, a database can do exactly the same thing with CSV files. And just as a journal is used to maintain the atomic nature of individual transactions, the exact same thing can be done with CSV.
Speed: why on earth would a database read and write a whole file at a time, when it can do what it does for all other storage formats: look up the starting byte of a record in an index file, seek to it in constant time, overwrite the data, comment out anything left over, and record the free space for later use in a separate index file, just as it could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index or table gets corrupted, deleted, or out of sync.
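For what it's worth, the seek-based access described above is easy to demonstrate with a toy byte-offset index (free-space tracking and in-place rewrites are left out, and the naive split ignores quoted commas):

```python
# Toy illustration: index the byte offset of every CSV row once, then read
# any single record with one seek instead of parsing the whole file.
def build_offset_index(path):
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():       # empty bytes means EOF
                break
            offsets.append(pos)
    return offsets

def read_record(path, offsets, n):
    with open(path, "rb") as f:
        f.seek(offsets[n])
        line = f.readline().decode().rstrip("\r\n")
        return line.split(",")         # naive: does not handle quoted commas

offsets = build_offset_index("people.csv")
print(read_record("people.csv", offsets, 2))   # third row, no full-file parse
```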
I wrote a Python program that handles a very large amount of data. As it processes the data, it puts the processed data into an array, which easily grows to hundreds of megabytes or even over a gigabyte.
The reason I set it up like that is that the program needs to continuously access the data in the array. Because the array gets larger and larger, the process becomes error-prone and very slow.
Is there a way to keep the array-like data in a separate file or database module and access it on an as-needed basis?
Perhaps this is a very basic task, but I have no clue.
You can use sqlite3 if you want. It is part of the Python standard library and is simple for basic usage (see the sketch below).
MySQL for Python
Postgres for Python
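A minimal sqlite3 sketch of the idea (the table and columns are placeholders): processed rows are streamed to disk as they are produced and read back only when needed, so the big in-memory array goes away.

```python
# Minimal sketch: keep processed results in an on-disk SQLite table instead
# of one huge in-memory list. Table and column names are placeholders.
import sqlite3

conn = sqlite3.connect("processed.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (key TEXT, value REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_results_key ON results (key)")

def store_chunk(rows):
    """rows: iterable of (key, value) tuples produced by one processing step."""
    with conn:                                    # commits the whole batch at once
        conn.executemany("INSERT INTO results VALUES (?, ?)", rows)

def lookup(key):
    cur = conn.execute("SELECT value FROM results WHERE key = ?", (key,))
    return [v for (v,) in cur]                    # only matching rows are loaded
```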
Is it possible to set up MySQL tables in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating the tables manually in phpMyAdmin, is it possible to do the following things all in Python?
Create the tables, including the data type definitions.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads a single file.
Many thanks
Yes, it is possible. You'll need to read the data from the text files using the csv module.
http://docs.python.org/library/csv.html
And then inject the data using a Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know Python, it will be easy.
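A rough sketch of both steps with the csv module and MySQLdb (the schema and the tab delimiter are assumptions; adjust them to whatever is in your .txt files):

```python
# Sketch: create a table and bulk-insert rows from several delimited .txt files.
# Column names, types and the tab delimiter are assumptions about your data.
import csv
import glob
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100),
        value DOUBLE
    )
""")

for path in glob.glob("data/*.txt"):
    with open(path, newline="") as f:
        rows = [(r[0], float(r[1])) for r in csv.reader(f, delimiter="\t")]
    cur.executemany("INSERT INTO measurements (name, value) VALUES (%s, %s)", rows)

conn.commit()
conn.close()
```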
It is. Typically what you want to do is use an object-relational mapping (ORM) library.
Probably the most widely used in the Python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep tighter control of your DB schema, or if you are learning about relational DBs and want to follow along with what the code does, you might be better off with something lighter like Canonical's Storm.
EDIT: Just thought to add: the reason to use an ORM is that it provides a very handy way to manipulate data and interface with the DB. But if all you will ever do is write a script to convert textual data to MySQL tables, then you might get along with something even simpler. Check the tutorial linked from the official MySQL website, for example.
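For comparison, defining and creating a table with SQLAlchemy (Core, without the ORM layer) would look roughly like this; the table and column names are again only illustrative:

```python
# Rough SQLAlchemy Core sketch: the schema lives in Python and
# create_all() issues the CREATE TABLE statements for you.
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Float)

engine = create_engine("mysql://user:password@localhost/mydb")
metadata = MetaData()

measurements = Table(
    "measurements", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(100)),
    Column("value", Float),
)

metadata.create_all(engine)    # creates the table if it does not exist

with engine.begin() as conn:
    conn.execute(measurements.insert(), [{"name": "a", "value": 1.0}])
```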
HTH!