I have a data set which I do multiple mappings on.
Assuming I have 3 key-value pairs for the reduce function, how do I modify the output so that I get 3 blobfiles, one for each key-value pair?
Do let me know if I can clarify further.
I don't think such functionality exists (yet?) in the GAE Mapreduce library.
Depending on the size of your dataset and the type of output required, you can hack your way around it with a small time investment by co-opting the reducer as another output writer. For example, if one of the reducer outputs should go straight back to the datastore and another should go to a file, you could open a file yourself and write the outputs to it. Alternatively, you could serialize and explicitly store the intermediate map results in a temporary datastore using operation.db.Put, and perform separate Map or Reduce jobs on that datastore. Of course, that will end up being more expensive than the first workaround.
In your specific key-value example, I'd suggest writing to a Google Cloud Storage File, and postprocessing it to split it into three files as required. That'll also give you more control over final file names.
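For illustration, here is a rough post-processing sketch, assuming the output lines are tab-separated key/value pairs and using the GAE cloudstorage client library; the bucket path and naming scheme are hypothetical:

import cloudstorage

SOURCE = '/my-bucket/mapreduce-output'  # hypothetical output file

def split_by_key(source=SOURCE):
    writers = {}
    with cloudstorage.open(source, 'r') as src:
        line = src.readline()
        while line:
            key, _, value = line.rstrip('\n').partition('\t')
            if key not in writers:
                # one output file per key, named after the key
                writers[key] = cloudstorage.open('%s-%s' % (source, key), 'w')
            writers[key].write(value + '\n')
            line = src.readline()
    for writer in writers.values():
        writer.close()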
I know this has been discussed various times on the world wide web, but as a newbie, it's really hard for me to translate these answers into practical execution steps.
Basically, I am looking to have a few list variables in my Lambda script persist across multiple invocations and empty themselves at the end of the day. When the script runs, certain values will be generated, and I want to append those values to the lists for use in subsequent Lambda invocations.
Based on my research, we can persist data via S3, /tmp, or EFS. But for something like a list, how can I achieve that?
UPDATE
Decided to use Parameter Store, which is free at my small scale of work. I store my JSON data as a string, parse it into a dict for processing, and at the end of the function serialize it back into a string before overwriting the existing parameter. Thanks everyone!
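A minimal sketch of that approach with boto3, assuming a parameter named "/my-app/state" (hypothetical) that holds a JSON-encoded list:

import json
import boto3

ssm = boto3.client('ssm')
PARAM_NAME = '/my-app/state'  # hypothetical parameter name

def lambda_handler(event, context):
    # Load the persisted list (stored as a JSON string).
    current = json.loads(ssm.get_parameter(Name=PARAM_NAME)['Parameter']['Value'])

    # ... generate new values during this invocation ...
    current.append(event.get('new_value'))

    # Serialize back to a string and overwrite the existing parameter.
    ssm.put_parameter(Name=PARAM_NAME, Value=json.dumps(current),
                      Type='String', Overwrite=True)
    return current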
We are running logfile parsing jobs in Google Dataflow using the Python SDK. Data is spread over several hundred daily logs, which we read via a file pattern from Cloud Storage. The data volume for all files is about 5-8 GB (gz files) with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping CSV that maps logfile entries to human-readable text. It has about 400 lines and is 5 kB in size.
For example, a logfile entry with [param=testing2] should be mapped to "Customer requested 14-day free product trial" in the final output.
We do this in a simple beam.Map with a side input, like so:
customerActions = loglines | beam.Map(map_logentries,mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
However, this only works if we read the mapping table in native Python via open()/read(). If we do the same using the Beam pipeline via ReadFromText() and pass the resulting PCollection as a side input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,beam.pvalue.AsIter(mappingTable))
performance breaks down completely to about 2-3 items per second.
Now, my questions:
1. Why would performance break down so badly? What is wrong with passing a PCollection as a side input?
2. If using PCollections as side inputs is not recommended, how is one supposed to build such a pipeline, which needs mappings that cannot or should not be hard-coded into the mapping function?
For us, the mapping changes frequently, and I need to find a way to let "normal" users provide it. The idea was to make the mapping CSV available in Cloud Storage and simply incorporate it into the pipeline via ReadFromText(). Reading it locally means shipping the mapping to the workers ourselves, so only the tech team can update it.
I am aware that there are caching issues with side inputs, but surely they should not apply to a 5 kB input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
For more efficient side inputs (of small to medium size) you can use
beam.pvalue.AsList(mappingTable)
since AsList causes Beam to materialize the data, so you can be sure you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
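For example, keeping the rest of the pipeline from the question unchanged, only the wrapper around the side input changes:

mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,
                                      beam.pvalue.AsList(mappingTable))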
The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
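A possible sketch of that variant, assuming the CSV has two columns ("key,description") so the lines can first be parsed into (key, value) tuples, which is what beam.pvalue.AsDict expects:

mappingTable = (p
                | ReadFromText('gs://side-inputs/category-mapping.csv')
                | beam.Map(lambda line: tuple(line.split(',', 1))))
customerActions = loglines | beam.Map(map_logentries,
                                      beam.pvalue.AsDict(mappingTable))

Inside map_logentries the mapping then arrives as a plain dict, so lookups become simple mapping.get(param) calls.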
Your mappingTable is small enough so side input is a good use case here.
Given that mappingTable is also static, you can load it from GCS in the start_bundle function of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in the future, you can also consider converting loglines and mappingTable into PCollections of key-value pairs and joining them using CoGroupByKey.
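A rough sketch of the start_bundle approach, assuming the mapping file lives at a fixed GCS path and contains "key,description" lines; GcsIO from the Beam GCP extras is used here, but any GCS client would do:

import apache_beam as beam
from apache_beam.io.gcp import gcsio

class MapLogEntries(beam.DoFn):
    def __init__(self, mapping_path='gs://side-inputs/category-mapping.csv'):
        self.mapping_path = mapping_path
        self.mapping = None

    def start_bundle(self):
        # Load the (small) mapping table once per bundle instead of per element.
        with gcsio.GcsIO().open(self.mapping_path) as f:
            lines = f.read().decode('utf-8').splitlines()
        self.mapping = dict(line.split(',', 1) for line in lines if line)

    def process(self, line):
        # The map_logentries logic from the question would go here.
        yield self.mapping.get(line, line)

customerActions = loglines | beam.ParDo(MapLogEntries())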
I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) that I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in a separate file and include it in the Python code, either directly or by dumping it to some format that Python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a Python dictionary in the form
my_list = { "key": "value" }
directly in my Python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL, as #b4hand and #DeepSpace indicated. What you should do, though, is not include the list in the rest of the Python code you are developing, as you indicated, but make a separate .py file with just that dictionary definition.
That way the list in that file is easier to write from a program (or extend by a program). And, on first import, a .pyc will be created, which speeds up re-reading on subsequent runs of your program. This is actually very similar in performance to using the pickle module to pickle the dictionary to a file and read it back from there, while keeping the dictionary in an easily human-readable and editable form.
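As an illustration (file and variable names are made up), the generated module and its use could look like this:

# greek_forms.py -- written/extended by a program, holds only the mapping
FORMS = {
    "anthropou": "anthropos",  # illustrative entries only
    "logoi": "logos",
}

# processing script
from greek_forms import FORMS

def head_word(form):
    # Fall back to the form itself if it is not in the mapping.
    return FORMS.get(form, form)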
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest, because the other two will need to repeatedly access the disk, which will be the bottleneck.
I would use a caching mechanism to hold this data, or maybe a data structure store like Redis. Loading all of this into memory might become too expensive.
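A minimal sketch of that idea, assuming a local Redis server and the redis-py client; the hash name and entries are made up:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# one-time load of the mapping, e.g. from the existing file
r.hset('greek_forms', 'anthropou', 'anthropos')  # illustrative entry

# lookup during text processing; returns bytes, e.g. b'anthropos'
head = r.hget('greek_forms', 'anthropou')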
I would like to get suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each one. So I need to merge all the CSVs by iterating over 5 million rows. I went with a Python dictionary to merge all files based on the common id field, but the bottleneck is that you can't store 5 million keys in memory (under 1 GiB) with a Python dictionary.
So I decided to use NoSQL. I think it might be helpful for handling the 5 million key-value pairs, but I still don't have clear thoughts on this.
Anyway, we can't reduce the iteration, since we have five CSVs and each has to be iterated over to update the values.
Are there simple steps to go about this?
If so, could you recommend a NoSQL datastore suited to processing these key-value pairs?
Note: we also have values that are lists.
If the CSV is already sorted by id you can use the merge-join algorithm. It allows you to iterate over the single lines, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though. (But probably faster than learning something new like Hadoop)
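A minimal single-pass sketch of that merge-join, assuming every CSV is sorted by the id in column 0 (file names are hypothetical):

import csv
import heapq
from itertools import groupby
from operator import itemgetter

FILES = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']

def rows(path):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row

# Stream the five sorted files as one sorted sequence, then group by id.
merged = heapq.merge(*(rows(p) for p in FILES), key=itemgetter(0))
for common_id, group in groupby(merged, key=itemgetter(0)):
    record = [common_id]
    for row in group:
        record.extend(row[1:])  # combine the values from all files
    # write `record` to the output here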
If this is just a one-time process, you might want to just set up an EC2 node with more than 1 GB of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then running over the 5 files synchronized using iterators, so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from 5 input files. If you do this on one machine, it might take a long time to process 1 GB of data. So I suggest checking the possibility of using Hadoop. Hadoop is a batch-processing tool. Usually Hadoop programs are written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You may use HBase (a column datastore) to store your data. It's just an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem, but you need to find out which one solves it best (in terms of time or resources) given your willingness to use or learn a new tool/DB.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/
I want to create a very very large dictionary, and I'd like to store it on disk so as not to kill my memory. Basically, my needs are a cross between cPickle and the dict class, in that it's a class that Python treats like a dictionary, but happens to live on the disk.
My first thought was to create some sort of wrapper around a simple MySQL table, but I have to store types in the entries of the structure that MySQL can't even hope to support out of the box.
The simplest way is the shelve module, which works almost exactly like a dictionary:
import shelve
myshelf = shelve.open("filename") # Might turn into filename.db
myshelf["A"] = "First letter of alphabet"
print(myshelf["A"])
# ...
myshelf.close() # You should do this explicitly when you're finished
Note the caveats in the module documentation about changing mutable values (lists, dicts, etc.) stored on a shelf (you can, but it takes a bit more fiddling). It uses (c)pickle and dbm under the hood, so it will cheerfully store anything you can pickle.
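For example (sketch only), you can either re-assign the value after modifying it, or open the shelf with writeback=True:

import shelve

# Without writeback: fetch, modify, then re-assign so the change is persisted.
myshelf = shelve.open("filename")
words = myshelf.get("words", [])
words.append("beta")
myshelf["words"] = words
myshelf.close()

# With writeback=True: in-place changes are cached and written out on close/sync.
myshelf = shelve.open("filename", writeback=True)
myshelf["words"].append("gamma")
myshelf.close()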
I don't know how well it performs relative to other solutions, but it doesn't require any custom code or third party libraries.
Look at dbm specifically, and more generally the entire Data Persistence chapter in the manual. Most key/value-store databases (gdbm, bdb, metakit, etc.) have a dict-like API which would probably serve your needs (and they are fully embeddable, so there's no need to manage an external database process).
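A small illustration of the dbm module (keys and values end up as bytes, so richer values would still need pickling first):

import dbm

with dbm.open('wordstore', 'c') as db:  # 'c' creates the file if needed
    db['alpha'] = 'first letter of the alphabet'
    print(db['alpha'])                  # b'first letter of the alphabet'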
File I/O is expensive in terms of CPU cycles, so my first thought would be in favor of a database.
However, you could also split your "English dictionary" across multiple files so that (say) each file holds words that start with a specific letter of the alphabet (therefore, you'll have 26 files).
Now, when you say you want to create a very, very large dictionary, do you mean a Python dict, or an English dictionary with words and their definitions stored in a dict (with words as keys and definitions as values)? The second can easily be implemented with cPickle, as you pointed out.
Again, if memory is your main concern, then you'll need to reconsider the number of files you want to use, because if you're pickling dicts into each file, then you want the dicts to not get too big.
Perhaps a usable solution for you would be to do this (I am going to assume that all the English words are sorted):
1. Get all the words in the English language into one file.
2. Count how many such words there are and split them into as many files as you see fit, depending on how large the files get.
3. Now, these smaller files contain the words and their meanings.
This is how this solution is useful:
Say your problem is to look up the definition of a particular word. At runtime, you can read the first word in each file and determine whether the word you are looking for is in the previous file you read (you will need a loop counter to check whether you are at the last file). Once you have determined which file the word you are looking for is in, you can open that file and load its contents into a dict.
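A hedged sketch of that lookup scheme, assuming the smaller files are listed in sorted order and hold "word<TAB>definition" lines in alphabetical order (all names here are made up):

WORD_FILES = ['words_00.txt', 'words_01.txt', 'words_02.txt']

def first_word(path):
    with open(path) as f:
        return f.readline().split('\t', 1)[0]

def load_definitions(path):
    with open(path) as f:
        return dict(line.rstrip('\n').split('\t', 1) for line in f)

def lookup(word):
    # Pick the last file whose first word is <= the target word, then load only that file.
    candidate = WORD_FILES[0]
    for path in WORD_FILES:
        if first_word(path) <= word:
            candidate = path
        else:
            break
    return load_definitions(candidate).get(word)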
It's a little difficult to offer a solution without knowing more details about the problem at hand.