I am pretty new to Python, so I'd like to ask you for some advice about the right strategy.
I have a text file with fixed positions for the data, like this.
It can have more than 10000 rows. At the end, the database (SQL) table should look like this. File & Table
The important column is no. 42. It defines the kind of data in this row
(2 -> title, 3 -> text, 6 -> amount and price). So the data for one record comes from different rows.
QUESTIONS:
Reading the data: since the data for one record is always spread over more than 4 rows, should I process the file line by line and send each SQL statement as soon as it is complete? OR: read all the lines into a list of lists, and then iterate over these lists? OR: read all the lines into one list and iterate?
Would it be better to convert the data into CSV or JSON instead of preparing SQL statements, and then use the database software to import it into the DB? (Or use a NoSQL DB?)
I hope I made my problems clear, if not, I will try.....
Any advice is really appreciated.
The problem is pretty simple, so perhaps you are overthinking it a bit. My suggestion is to use the simplest solution: read a line, parse it, prepare an SQL statement and execute it. If the database is around 10000 records, anything would work, e.g. SQLite would do just fine. The problem is in the form of a table already, so translation to a relational database like SQLite or MySQL is a pretty obvious and straightforward choice. If you need a different type of organization in your data then you can look at other types of databases: don't do it only because it is "fashionable".
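To make the "read a line, parse it, execute a statement" idea concrete, here is a minimal sketch using the standard sqlite3 module. The question does not give the real field positions, so the layout below is an assumption: the record type sits in column 42, the payload follows it, and a type 6 row holds amount and price separated by whitespace. The table name items is hypothetical as well.

```python
import sqlite3

# Assumed layout: column 42 (1-based) holds the record type and the
# payload starts right after it. Adjust both to the real file.
TYPE_POS = 41               # 0-based index of column 42
PAYLOAD = slice(42, None)


def parse_records(lines):
    """Yield (title, text, amount, price) each time a type 6 row completes a record."""
    current = {"title": "", "text": ""}
    for line in lines:
        line = line.rstrip("\n")
        if len(line) <= TYPE_POS:
            continue  # too short to carry a type marker
        kind = line[TYPE_POS]
        payload = line[PAYLOAD].strip()
        if kind == "2":        # title row starts a new record
            current = {"title": payload, "text": ""}
        elif kind == "3":      # text rows are appended to the current record
            current["text"] = (current["text"] + " " + payload).strip()
        elif kind == "6":      # amount and price (assumed whitespace separated)
            amount, price = payload.split()
            yield current["title"], current["text"], float(amount), float(price)


def load(lines, conn):
    """Insert every parsed record inside a single transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(title TEXT, text TEXT, amount REAL, price REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", parse_records(lines))
```

Because parse_records is a generator, you can pass it an open file object and the file is never held in memory as a whole.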
So I have this database (3.1 GB in total), but the size is due to one specific table containing A LOT of console output text from some test runs. The table itself is 2.7 GB, and I was wondering if there could be another solution for this table, so the database would get a lot smaller? It's getting a bit annoying to back up the database or even copy it to a playground, because this table is so big.
The Table is this one
Would it be better to delete this table and store all the LogTextData (LongText) in a PDF instead of the database? (Then I can't back up this data, though...)
Does anyone have an idea on how to make this table smaller, or another solution? I'm open to suggestions.
The console log data gets imported into the database by Python scripts, so I have full access to other Python solutions, if there are any.
You could try enabling either the Storage-Engine Independent Column Compression or InnoDB page compression. Both provide ways to have a smaller on-disk database, which is especially useful for large text fields.
Since there's only one table with one particular field that's taking up space, trying out individual column compression seems like the easiest first step.
In my opinion you should just store the paths of the log files in the database instead of the complete logs. Using those paths you can access the files anytime you want.
It will decrease the size of the database too.
Your new table would look like this:
LogID, BuildID, JenkinsJobName, LogTextData (now holding the file path instead of the log text).
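Since your import is already done by Python scripts, the path approach could be sketched roughly like this. It is only an illustration under assumptions: sqlite3 stands in for your real database, the table name logs and the file naming scheme are made up, and the column names are taken from the list above.

```python
import sqlite3
from pathlib import Path


def create_table(conn):
    # Stand-in schema; LogTextData holds a file path here, not the log text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        "LogID INTEGER PRIMARY KEY, BuildID INTEGER, "
        "JenkinsJobName TEXT, LogTextData TEXT)"
    )


def store_log(conn, log_dir, build_id, job_name, log_text):
    """Write the log text to a file and record only its path in the database."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    cur = conn.execute(
        "INSERT INTO logs (BuildID, JenkinsJobName, LogTextData) VALUES (?, ?, '')",
        (build_id, job_name),
    )
    log_id = cur.lastrowid
    path = log_dir / f"{log_id}.log"   # hypothetical naming scheme
    path.write_text(log_text, encoding="utf-8")
    conn.execute("UPDATE logs SET LogTextData = ? WHERE LogID = ?", (str(path), log_id))
    conn.commit()
    return log_id


def read_log(conn, log_id):
    """Look up the stored path and read the log text back from disk."""
    (path,) = conn.execute(
        "SELECT LogTextData FROM logs WHERE LogID = ?", (log_id,)
    ).fetchone()
    return Path(path).read_text(encoding="utf-8")
```

Keep in mind that the log directory then has to be backed up separately from the database.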
I have a large amount of data, around 50 GB, in a CSV which I want to analyse for ML purposes. It is however way too large to fit in memory in Python. Ideally I want to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place. I realise I probably can't load it all at once, so would I do it iteratively? If so, what can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data and still be able to query and do feature engineering quickly? What I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice. This all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the first line of the CSV file to see if there is a header. You need to create a table with the same fields (and types of data).
One of the fields might be unique per line and can be used later to find the line. That's your candidate for PRIMARY KEY. Otherwise add an AUTO_INCREMENT field as PRIMARY KEY.
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
In order to read in the data, you have 2 ways:
Use LOAD DATA INFILE (see the Load Data Infile documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command. Then read your CSV line by line (in a loop), split the fields into variables and execute the prepared statement with this line's values.
You will benefit from a web page designed to search the data. Depends on who needs to use it.
Hope this gives you some ideas
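As a rough sketch of the "write your own script" route: the function below streams the CSV in batches through one parameterized INSERT, so the whole file is never in memory at once. It uses sqlite3 from the standard library as a stand-in (with a MySQL driver the placeholder is typically %s instead of ?), and the table name, the untyped columns and the batch size are all assumptions to adapt.

```python
import csv
import itertools
import sqlite3


def load_csv(path, conn, table="data", batch_size=10_000):
    """Stream a CSV that has a header row into a table, batch by batch."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                       # first line: column names
        cols = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" '
            f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {cols})"
        )
        insert = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'
        total = 0
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            with conn:                              # one transaction per batch
                conn.executemany(insert, batch)
            total += len(batch)
    return total
```

An index can then be added afterwards on whatever you filter by, for example conn.execute('CREATE INDEX idx_data_city ON data ("city")') for a hypothetical city column.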
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL lets you write SQL queries against your dataset. For best performance, though, you need a distributed mode (you can use it on a local machine, but the result is limited) and a powerful machine. You can write your code in Python, Scala or Java.
I am new to Python, so this might have wrong syntax as well. I want to bulk update my SQL database. I have already created a column in the database with null values. My aim is to compute the moving average in Python and then use those values to update the database. The problem is that the moving averages I have computed are in a table format where the columns are time blocks and the rows are dates, but in the database dates and time blocks are separate columns.
MA30 = pd.rolling_mean(df, 30)
cur.execute(""" UPDATE feeder_ndmc_copy
SET time_block = CASE date, (MA30)""")
db.commit()
This is the error I am getting.
self.errorhandler(self, exc, value)
I have seen a lot of other questions and answers, but there is no example of how to use the result of a Python command to update the database. Any suggestions?
Ok, so it is very hard to give you a complete answer with what little information your question contains, but I'll try my best to explain how I would deal with this.
The easiest way is probably to automatically write a separate UPDATE query for each row you want to update. If I'm not mistaken, this will be relatively efficient on the database side of things, but it will produce some overhead in your Python program. I'm not a database guy, but since you didn't mention performance optimality in your question, I will assume that any solution that works will do for now.
I will be using sqlalchemy to handle interactions with the database. Take care that if you want to copy my code, you will need to install sqlalchemy and import the module in your code.
First, we will need to create a sqlalchemy engine. I will assume that you use a local database, if not you will need to edit this part.
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://localhost/yourdatabase')
Now let's create a string containing all our queries (I don't know the names of the columns you want to update, so I'll use placeholders; I also do not know the format of your time index, so I'll have to guess):
queries = ''
for index, value in MA30.iterrows():
    # value is the whole row here; pick the cell you need before formatting
    queries += "UPDATE feeder_ndmc_copy SET column_name = {} WHERE index_column_name = '{}';\n".format(value, index.strftime('%Y-%m-%d %H:%M:%S'))
You will need to adapt this part heavily to conform with your requirements. I simply can't do any better without you supplying the proper schema of your database.
With the query string complete, we proceed to put the data into SQL:
with engine.connect() as connection:
    with connection.begin():
        # text() is needed because SQLAlchemy 2.0 no longer accepts plain strings
        connection.execute(sqlalchemy.text(queries))
Edit:
Obviously my solution does not deal in any way with things like if your pandas operations create datapoints for timestamps that are not in mysql, etc. You need to keep that in mind. If that is a problem, you will need to use queries of the form
INSERT INTO table (id,Col1,Col2) VALUES (1,1,1),(2,2,3),(3,9,3),(4,10,12)
ON DUPLICATE KEY UPDATE Col1=VALUES(Col1),Col2=VALUES(Col2);
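For the specific problem in the question (a wide table with dates as rows and time blocks as columns, versus separate date and time_block columns in the database), one option is to flatten the table first and then run a single parameterized UPDATE with executemany. The sketch below sticks to the standard library: a plain dict of dicts stands in for the DataFrame, sqlite3 stands in for MySQL, and the column name ma30 is hypothetical. With pandas, DataFrame.stack() should give a similar long shape.

```python
import sqlite3


def wide_to_long(wide):
    """Turn {date: {time_block: value}} into (value, date, time_block) tuples."""
    for date, blocks in wide.items():
        for time_block, value in blocks.items():
            if value is not None:   # skip cells that have no moving average yet
                yield value, date, time_block


def bulk_update(conn, wide):
    """Update every (date, time_block) row in one transaction."""
    with conn:
        conn.executemany(
            # ma30 is a hypothetical name for the new, initially NULL column
            "UPDATE feeder_ndmc_copy SET ma30 = ? WHERE date = ? AND time_block = ?",
            wide_to_long(wide),
        )
```

Passing the values as parameters also avoids the quoting problems you get when building the SQL text with string formatting.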
I have a CSV file which is about 1 GB in size and contains about 50 million rows of data. I am wondering whether it is better to keep it as a CSV file or store it in some form of database. I don't know enough about MySQL to argue for why I should use it, or another database framework, over just keeping the data as a CSV file. I am basically doing a Breadth-First Search with this dataset, so once I get the initial "seed" set from the 50 million rows, I use it as the first values in my queue.
Thanks,
I would say that there are a wide variety of benefits to using a database over a CSV for such large structured data, so I would suggest that you learn enough to do so. However, based on your description you might want to check out non-server, lighter-weight databases such as SQLite, or something similar to JavaDB/Derby, or, depending on the structure of your data, a non-relational (NoSQL) database; obviously you will need one with some type of Python support though.
If you want to search on something graph-ish (since you mention Breadth-First Search) then a graph database might prove useful.
Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.
If you need to do lookups, then something that lets you index the data, like MySQL, would be better.
From your previous questions, it looks like you are doing social-network searches against facebook friend data; so I presume your data is a set of 'A is-friend-of B' statements, and you are looking for a shortest connection between two individuals?
If you have enough memory, I would suggest parsing your csv file into a dictionary of lists. See Can this breadth-first search be made faster?
If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
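Here is a minimal sketch of the dictionary-of-lists approach, assuming each CSV row is a pair "A,B" meaning A is a friend of B and that friendship is symmetric (both are guesses about your data). The function names are made up for the example.

```python
import csv
from collections import defaultdict, deque


def load_graph(csv_file):
    """Build an adjacency dict {person: [friends]} from rows of the form A,B."""
    graph = defaultdict(list)
    for a, b in csv.reader(csv_file):
        graph[a].append(b)
        graph[b].append(a)   # assumption: friendship goes both ways
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    parents = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start via parents
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None
```

Usage would be along the lines of opening the file with open(path, newline="") and passing it to load_graph; the deque keeps popping from the front of the queue cheap.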
There are also some python modules which might help:
graph-tool http://projects.skewed.de/graph-tool/
python-graph http://pypi.python.org/pypi/python-graph/1.8.0
networkx http://networkx.lanl.gov/
igraph http://igraph.sourceforge.net/
How about a document store like MongoDB?
I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes directly to the file; it's not limited by available memory. You can process any number of records with these without memory limitations.
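To illustrate "iterate correctly", here is a small sketch that reads one CSV and writes a cleaned copy row by row, so only a single row is in memory at any time. The cleaning step (stripping whitespace) and the function name are just placeholders for whatever munging you need.

```python
import csv


def clean_csv(src_path, dst_path):
    """Copy src to dst one row at a time, stripping whitespace from each field."""
    count = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):      # the reader yields one row per iteration
            writer.writerow([field.strip() for field in row])
            count += 1
    return count
```

The thing to avoid is calling list(reader) or f.readlines() first, since that pulls the whole file into memory.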
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.
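Writing "a few records at a time" could look roughly like the sketch below, which relies only on the DB-API calls fetchmany and executemany. To keep it runnable, two sqlite3 connections stand in for the Access (pyodbc) and MySQL connections; the table and field names are the placeholder names from the query in the earlier answer, and the batch size is arbitrary.

```python
import sqlite3


def copy_in_batches(src_conn, dst_conn, batch_size=1000):
    """Copy rows from the source to the destination, batch_size rows at a time."""
    cur = src_conn.cursor()
    cur.execute("SELECT field1, field2, field3 FROM AccessTable")
    copied = 0
    while True:
        rows = cur.fetchmany(batch_size)   # never more than batch_size rows in memory
        if not rows:
            break
        dst_cur = dst_conn.cursor()
        # sqlite3 and pyodbc use ? placeholders; MySQL drivers typically use %s
        dst_cur.executemany(
            "INSERT INTO MySqlTable (field1, field2, field3) VALUES (?, ?, ?)",
            rows,
        )
        dst_conn.commit()                  # commit after every batch
        copied += len(rows)
    return copied
```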