50 million+ Rows of Data - CSV or MySQL - python

I have a CSV file which is about 1GB big and contains about 50million rows of data, I am wondering is it better to keep it as a CSV file or store it as some form of a database. I don't know a great deal about MySQL to argue for why I should use it or another database framework over just keeping it as a CSV file. I am basically doing a Breadth-First Search with this dataset, so once I get the initial "seed" set the 50million I use this as the first values in my queue.
Thanks,

I would say that there are a wide variety of benefits to using a database over a CSV for such large structured data so I would suggest that you learn enough to do so. However, based on your description you might want to check out non-server/lighter weight databases. Such as SQLite, or something similar to JavaDB/Derby... or depending on the structure of your data a non-relational (Nosql) database- obviously you will need one with some type of python support though.

If you want to search on something graph-ish (since you mention Breadth-First Search) then a graph database might prove useful.

Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.
If you need to do lookups, then something that lets you index the data, like MySQL, would be better.

From your previous questions, it looks like you are doing social-network searches against facebook friend data; so I presume your data is a set of 'A is-friend-of B' statements, and you are looking for a shortest connection between two individuals?
If you have enough memory, I would suggest parsing your csv file into a dictionary of lists. See Can this breadth-first search be made faster?
If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
There are also some python modules which might help:
graph-tool http://projects.skewed.de/graph-tool/
python-graph http://pypi.python.org/pypi/python-graph/1.8.0
networkx http://networkx.lanl.gov/
igraph http://igraph.sourceforge.net/

How about some key-value storages like MongoDB

Related

What's the best strategy for dumping very large python dictionaries to a database?

I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have your considered using Pandas? Yes Pandas does not natively accept xmls but if you use ElementTree from xml (standard library) you should be able to read it into a Pandas data frame and do what you need with it including refining strings and adding more data to the data frame as you get it.
So I've decided that this problem is more of a data design problem than a python situation. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined xml dictionary as a pickle on a shared filesystem for cool storage and use mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
Thanks for chatting with me about this :)

Storing and querying a large amount of data

I have a large amount of data around 50GB worth in a csv which i want to analyse purposes of ML. It is however way to large to fit in Python. I ideally want to use mySQL because querying is easier. Can anyone offer a host of tips for me to look into. This can be anything from:
How to store it in the first place, i realise i probably can't load it in all at once, would i do it iteratively? If so what things can i look into for this? In addition i've heard about indexing, would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data and still be able to query and do feature engineering quickly. What i eventually feed into my algorithm should be able to be done in Python but i need query and do some feature engineering before i get my data set that is ready to be analysed.
I'd really appreciate any advice this all needs to be done on personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file first line to see if there is a header. You'd need to create a table with the same fields (and type of data)
One of the fields might seem unique per line and can be used later to find the line. That's your candidate for PRIMARY KEY. Otherwise add an AUTO-INCREMENT field as PRIMARY KEY
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
In order to read in the data, you have 2 ways:
Use LOAD DATA INFILE Load Data Infile Documentation
Write your own script: The best technique is to create a prepared statement for the
INSERT command. Then read your CSV line by line (in a loop), split the fields
into variables and execute the prepared statement with this line's
values
You will benefit from a web page designed to search the data. Depends on who needs to use it.
Hope this gives you some ideas
That's depend on what you have, you can use Apache spark and then use their SQL feature, spark SQL gives you the possibility to write SQL queries in your dataset, but for best performance you need a distributed mode(you can use it in a local machine but the result is limited) and high machine performance. you can use python, scala, java to write your code.

python unusual textfile to database: strategy

I am pretty new to Python, so I like to ask you for some advice about the right strategy.
I've a textfile with fixed positions for the data, like this.
It can have more than 10000 rows. At the end the database (SQL) table should look like this. File & Table
The important col is nr. 42. It defines the kind of data in this row.
(2-> Titel, 3->Text 6->Amount and Price). So the data comes from different rows.
QUESTIONS:
Reading the Data: Since there are always more than 4 rows
containing the data, process them line by line, as soon as one sql
statement is complete, send it OR:read all the lines into a list of
lists, and then iterate over these lists? OR: read all the lines in
one list and iterate?
Would it be better to convert the data into a csv or json instead of preparing sql statements, and then use the database software to import to db? (Or use NoSQL DB)
I hope I made my problems clear, if not, I will try.....
Every advice is really appreciated.
The problem is pretty simple, so perhaps you are overthinking it a bit. My suggestion is to use the simplest solution: read a line, parse it, prepare an SQL statement and execute it. If the database is around 10000 records, anything would work, e.g. SQLLite would do just fine. The problem is in the form of a table already so translation to a relational database like SQLLite or MySQL is a pretty obvious and straightforward choice. If you need a different type of organization in your data then you can look at other types of databases: don't do it only because it is "fashionable".

quickest method of accessing key - value pairs?

I hope the question is not too unspecific: I have a huge database-like list (~ 900,000 entries) which I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in one separate file and include it in the python code, either directly or by dumping it to some format that python can use. I was wondering if you can advice on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a python dictionary in the form
my_list = { "key": "value" }
directly into my python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL as #b4hand and #DeepSpace indicated. What you should do though is not include the list in the rest of the rest of the python code you are developing, as you indicated, but make a separate .py file with just the that dictionary definition.
That way the list in that file is more easy to write from a program (or extend by a program). And, on first import, a .pyc will be created which speeds up re-reading on further runs of your program. This is actually very similar in performance
to using the pickle module and pickling the dictionary to file and reading it back from there, while keeping the dictionary in an easy human readable and editable form.
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest because the other 2 will need to repeatedly access the HD which will be the bottleneck.
I would use a caching mechanism to hold this data or maybe a data structure storage like redis. Loading all of this in memory might become too expensive.

Use Python to load data into Mysql

is it possible to set up tables for Mysql in Python?
Here's my problem, I have bunch of .txt files which I want to load into Mysql database. Instead of creating tables in phpmyadmin manually, is it possible to do the following things all in Python?
Create table, including data type definition.
Load many files one by one. I only know this LOAD DATA LOCAL INFILE command to load one file.
Many thanks
Yes, it is possible, you'll need to read the data from the CSV files using CSV module.
http://docs.python.org/library/csv.html
And the inject the data using Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know python it will be easy
It is. Typically what you want to do is use an Object-Retlational Mapping library.
Probably the most widely used in the python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep a tighter control on your DB schema, or if you are learning about relational DB's and want to follow along what the code does, you might be better off with something lighter like Canonical's storm.
EDIT: Just thought to add. The reason to use ORM's is that they provide a very handy way to manipulate data / interface to the DB. But if all you will ever want to do is to do a script to convert textual data to MySQL tables, than you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
HTH!

Categories