I know this has been discussed various times on the world wide web, but as a newbie, it's really hard for me to translate these answers into practical execution steps.
Basically, I am looking to have a few list variables in my Lambda script persist across multiple invocations and be emptied at the end of the day. When the script runs, certain values are generated, and I want to append those values to the lists for use in subsequent Lambda invocations.
Based on my research, we can persist data via S3, /tmp, or EFS. But for something like a list, how can I achieve that?
UPDATE
Decided to create Parameter Store parameters, which are free at my small scale of work. With Parameter Store, I store my JSON data as a string and parse it into a dict for processing. At the end of the function, I serialize the value back into a string before overwriting the existing parameter. Thanks everyone!
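Roughly, the read/modify/write flow looks like this (a minimal sketch with boto3; the parameter name and the appended value are placeholders, not my real ones):

import json
import boto3

ssm = boto3.client("ssm")
PARAM_NAME = "/my-app/daily-values"   # placeholder parameter name

def lambda_handler(event, context):
    # Load the previously stored list (kept as a JSON string in Parameter Store).
    try:
        raw = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
        values = json.loads(raw)
    except ssm.exceptions.ParameterNotFound:
        values = []

    # ... values generated during this invocation ...
    values.append("new-value")        # placeholder for the real generated value

    # Serialize the list back to a string and overwrite the parameter.
    ssm.put_parameter(Name=PARAM_NAME, Value=json.dumps(values),
                      Type="String", Overwrite=True)
    return values

Keep in mind that standard parameters hold at most 4 KB of data, and a scheduled rule (e.g. an EventBridge cron) can overwrite the parameter with "[]" to empty the lists at the end of the day.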
Related
I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
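A minimal sketch of that, assuming a flat XML layout with repeating <record> elements (the tag and column names are just placeholders to adapt to your file):

import xml.etree.ElementTree as ET
import pandas as pd

# Placeholder layout: <records><record><name>...</name><value>...</value></record>...</records>
tree = ET.parse("source.xml")
rows = []
for record in tree.getroot().iter("record"):               # adjust the tag to your XML
    rows.append({child.tag: (child.text or "").strip() for child in record})

df = pd.DataFrame(rows)                                     # one row per <record>, one column per child tag
print(df.head())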
So I've decided that this is more of a data design problem than a Python problem. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage and use Mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
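Roughly, the split looks like this (a minimal sketch; the paths, database and collection names, and pymongo on localhost are assumptions):

import pickle
import pymongo

big_dict = {"placeholder": "the huge refined dictionary"}     # stands in for the real ~22MB dict
refined_results = {"strings_of_interest": ["a", "b"]}         # the ~100K actually worth querying

# Cold storage: pickle the full refined dictionary to the shared filesystem.
with open("/shared/refined/source_file.pkl", "wb") as f:      # placeholder path
    pickle.dump(big_dict, f)

# Hot storage: keep only the small refined results in MongoDB for later comparisons.
client = pymongo.MongoClient("mongodb://localhost:27017")
client["reports"]["refined_queries"].insert_one(
    {"source": "source_file", "results": refined_results})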
Thanks for chatting with me about this :)
I have a large amount of data, around 50GB worth in a CSV, which I want to analyse for ML purposes. It is, however, way too large to fit in memory in Python. I would ideally like to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place. I realise I probably can't load it all in at once; would I do it iteratively? If so, what things can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data and still be able to query and do feature engineering quickly? What I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'll need to create a table with the same fields (and data types).
One of the fields may be unique per line and can be used later to find that line; that's your candidate for PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
In order to read in the data, you have two options:
Use LOAD DATA INFILE (see the MySQL LOAD DATA INFILE documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables, and execute the prepared statement with that line's values. A sketch of this approach follows below.
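Something like this for the second option, assuming mysql-connector-python and a placeholder table/column layout (adjust the names and types to match your CSV header):

import csv
import mysql.connector   # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="me", password="secret", database="mldata")   # placeholder credentials
ddl = conn.cursor()
ddl.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        id BIGINT AUTO_INCREMENT PRIMARY KEY,
        feature_a DOUBLE,
        feature_b DOUBLE,
        label VARCHAR(32),
        INDEX idx_label (label)
    )""")                                        # index the fields you will filter on
ddl.close()

cur = conn.cursor(prepared=True)
insert = "INSERT INTO samples (feature_a, feature_b, label) VALUES (%s, %s, %s)"

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                 # skip the header row
    for i, row in enumerate(reader, 1):
        cur.execute(insert, row)                 # one prepared INSERT per CSV line
        if i % 10000 == 0:
            conn.commit()                        # commit in chunks to keep transactions small
conn.commit()
cur.close()
conn.close()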
Depending on who needs to use it, you may also benefit from a web page designed to search the data.
Hope this gives you some ideas
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL lets you write SQL queries against your dataset. For best performance you need a distributed mode (you can run it on a local machine, but the results are limited) and a powerful machine. You can write your code in Python, Scala, or Java.
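A minimal PySpark sketch of that idea, with placeholder column names (Spark reads the CSV in partitions, so the 50GB never has to fit in memory at once):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-analysis").getOrCreate()

# header=True uses the first line as column names; inferSchema makes an extra pass to guess types.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data")

# Plain SQL for querying / feature engineering (placeholder column names).
features = spark.sql("""
    SELECT label,
           AVG(feature_a) AS mean_a,
           COUNT(*)       AS n
    FROM data
    GROUP BY label
""")
features.show()

# A small aggregated result can be pulled back into pandas for the ML step.
pdf = features.toPandas()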
I'm rather new to Python and coding in general.
I'm writing my own chat statistics bot for a Russian social network (vk.com).
My question is: can I store a dictionary in a file and work with it?
For example:
Userlist = open('userlist.txt', 'r+')
if lastmessage['uid'] not in Userlist.read():
    Userlist.read()[lastmessage['uid']] = 1
Userlist.close()
Or do I have to use some additional module like JSON?
Thank you
(Amended answer in light of a clarifying comment: in the while True loop I want to check if a user's id is in the 'userlist' dictionary (as a key) and, if not, add it to this dictionary with value 1. Then I want to rewrite the file with the new dictionary. The file is opened as soon as the program is launched, before the loop):
For robustly using data on disk as though it were a dictionary you should consider either one of the dbm modules or just using the SQLite3 support.
A dbm file is simply a set of keys and values stored with transparently maintained and used indexing. Once you've opened your dbm file you simply use it exactly like you would any other Python dictionary (with strings as keys). Any changes can simply be flushed and written before closing the file. This is very simple though it offers no special features for locking (or managing consistency in the case where you might have multiple processes writing to the file concurrently) and so on.
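A minimal sketch of that, with a placeholder message dictionary standing in for whatever you get back from the VK API:

import dbm

lastmessage = {"uid": 12345}                 # placeholder for the message fetched from the API

# "c" opens the file read/write and creates it if it does not exist yet.
with dbm.open("userlist", "c") as db:
    uid = str(lastmessage["uid"])            # dbm keys and values must be str or bytes
    if uid not in db:
        db[uid] = "1"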
On the other hand, the incredibly powerful SQLite subsystem, which has been included in the Python standard library for many years, allows you to easily treat a local file as an SQL database management system ... with all of the features you'd expect from a client/server based system (foreign keys, data type and referential integrity constraint management, views and triggers, indexes, etc).
In your case you could simply have a single table containing a single column. Binding to that database (by its filename) would allow you to query for a user's name with SELECT and add the user's name with INSERT. As your application grows and changes you could add other columns to track when the account was created and when it was most recently used or checked (a couple of time/date stamp columns) and you could create other tables with related data (selected using JOINs, for example).
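A minimal sqlite3 sketch of that single-table approach (the table and column names are just placeholders):

import sqlite3

conn = sqlite3.connect("userlist.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (uid TEXT PRIMARY KEY)")

lastmessage = {"uid": 12345}                               # placeholder for the real message
uid = str(lastmessage["uid"])

if conn.execute("SELECT 1 FROM users WHERE uid = ?", (uid,)).fetchone() is None:
    conn.execute("INSERT INTO users (uid) VALUES (?)", (uid,))
    conn.commit()

conn.close()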
(Original answer):
In general, the process of storing any internal data structure as a file, or transmitting it over a network connection, is referred to as "serialization." The complementary process of loading or receiving such data and instantiating its contents into a new data structure is referred to (unsurprisingly) as "deserialization."
That's true of all programming languages.
There are many ways to serialize and deserialize data in Python. In particular we have the native (standard library) pickle module, which produces files (or strings) that are only intended for use with other processes running Python, or we can, as you said, use JSON ... the JavaScript Object Notation, which has become the de facto cross-language data structure serialization standard. (There are others such as YAML and XML ... but JSON has come to predominate.)
The caveat about using JSON vs. pickle is that JavaScript (and a number of other programming and scripting languages) uses different semantics for some sorts of "dictionary" (associative array) keys than Python. In particular, Python (and Ruby and Lua) treats keys such as "1" (a string containing the digit one) and 1 or 1.0 (numeric values equal to one) as distinct keys. JavaScript, Perl and some others treat the keys as "scalar" values, in which strings like "1" and the number 1 will evaluate to the same key.
There are some other nuances which can affect the fidelity of your serialization. But that's the easiest to understand. Dictionaries with strings as keys are fine ... mixtures of numeric and string keys are the most likely cause of any troubles you'll encounter using JSON serialization/deserialization in lieu of pickling.
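A quick demonstration of that caveat (everything here is the standard library, nothing specific to your bot):

import json
import pickle

data = {1: "numeric key", "1": "string key"}

# pickle round-trips the dictionary exactly: both keys survive.
print(pickle.loads(pickle.dumps(data)))     # {1: 'numeric key', '1': 'string key'}

# JSON forces every key to a string, so the two keys collide and one entry is lost.
print(json.dumps(data))                     # {"1": "numeric key", "1": "string key"}
print(json.loads(json.dumps(data)))         # {'1': 'string key'}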
I have a data set which I do multiple mappings on.
Assuming that I have 3 key-values pair for the reduce function, how do I modify the output such that I have 3 blobfiles - one for each of the key value pair?
Do let me know if I can clarify further.
I don't think such functionality exists (yet?) in the GAE Mapreduce library.
Depending on the size of your dataset and the type of output required, you can hack your way around it with a small time investment by co-opting the reducer as another output writer. For example, if one of the reducer outputs should go straight back to the datastore and another output should go to a file, you could open a file yourself and write those outputs to it. Alternatively, you could serialize and explicitly store the intermediate map results in a temporary datastore using operation.db.Put, and perform separate Map or Reduce jobs on that datastore. Of course, that will end up being more expensive than the first workaround.
In your specific key-value example, I'd suggest writing to a Google Cloud Storage File, and postprocessing it to split it into three files as required. That'll also give you more control over final file names.
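The postprocessing step can be as simple as this sketch, assuming the combined output has been downloaded locally and each line looks like "key<TAB>value" (the real format depends on your output writer):

outputs = {}
with open("combined_output.txt") as src:                      # placeholder local copy of the GCS file
    for line in src:
        key, value = line.rstrip("\n").split("\t", 1)
        if key not in outputs:
            outputs[key] = open("output_%s.txt" % key, "w")   # one file per key
        outputs[key].write(value + "\n")

for f in outputs.values():
    f.close()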
I would like to get suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each of them. So I need to merge all the CSVs by iterating over 5 million rows, and I went with a Python dictionary to merge the files based on the common id field. But the bottleneck is that you can't store 5 million keys in memory (< 1 gig) with a Python dictionary.
So I decided to use NoSQL. I think it might be helpful for handling the 5 million key-value pairs, but I still don't have clear thoughts on this.
Either way, we can't reduce the iteration, since we have five CSVs and each one has to be iterated over to update the values.
Are there simple steps to go about this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: the values can be lists as well.
If the CSV is already sorted by id you can use the merge-join algorithm. It allows you to iterate over the single lines, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though. (But probably faster than learning something new like Hadoop)
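A sketch of the single-pass merge, assuming every CSV is already sorted by an id in its first column and has a header row (the file names are placeholders):

import csv
import heapq
from itertools import groupby
from operator import itemgetter

filenames = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]   # placeholder file names

def rows(filename):
    # Yield (id, row) pairs from one CSV, assumed to be sorted by the id column.
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)                       # skip header
        for row in reader:
            yield row[0], row              # id is assumed to be the first column

# heapq.merge keeps only one row per file in memory at a time.
merged = heapq.merge(*(rows(name) for name in filenames), key=itemgetter(0))

with open("merged.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for key, group in groupby(merged, key=itemgetter(0)):
        combined = [key]
        for _, row in group:
            combined.extend(row[1:])       # append the non-id fields from each file
        writer.writerow(combined)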
If this is just a one-time process, you might want to just set up an EC2 node with more than 1G of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then running over the 5 files in a synchronized way using iterators so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 500,000 items from 5 input files. If you do this on one machine, it might take a long time to process 1G of data. So I suggest checking the possibility of using Hadoop. Hadoop is a batch-processing tool. Usually Hadoop programs are written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You could use HBase (a column datastore) to store your data. It's just an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem, but you need to find which one solves it best (in terms of time or resources), weighed against your willingness to use/learn a new tool/DB.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/