Adding additional data to each row in an H2OFrame - python

I am working with a huge H2OFrame (~150 GB, ~200 million rows) which I need to manipulate a little. To be more specific: I have to use the frame's ip column to look up the location/city name for each IP and add this information to each of the frame's rows.
Converting the frame to a plain python object and manipulating it locally is not an option, due to the huge size of the frame. So what I was hoping I could do is to use my H2O cluster to create a new H2OFrame city_names using the original frame's ip column and then merge both frames.
My question is kind of similar to the question posed here, and what I gathered from that question's answer is that there is no way in H2O to do complex manipulations of each of the frame's rows. Is that really the case? After all, H2OFrame's apply function only accepts a lambda, not a custom function.
One option I thought of was to use Spark/Sparkling Water for this kind of data manipulation and then convert the spark frame to an H2OFrame to do the machine learning operations. However, if possible I would prefer to avoid that and only use H2O, not least due to the overhead that such a conversion creates.
So I guess it comes down to this: Is there any way to do this kind of manipulation using only H2O? And if not is there another option to do this without having to change my cluster architecture (i.e. without having to turn my H2O cluster into a sparkling water cluster?)

Yes, when using apply with an H2OFrame, you cannot pass a named function; only a lambda is accepted. For example, if you try passing a function called tryit you will get the following error, which shows the limitation:
H2OValueError: Argument `fun` (= <function tryit at 0x108d66410>) does not satisfy the condition fun.__name__ == "<lambda>"
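As a minimal sketch of the restriction (the frame and column here are made up, and a running H2O cluster is assumed):
import h2o
h2o.init()                               # connect to (or start) an H2O cluster
fr = h2o.H2OFrame({"x": [1, 2, 3]})
doubled = fr.apply(lambda col: col * 2)  # a lambda is accepted
def tryit(col):
    return col * 2
# fr.apply(tryit)                        # raises the H2OValueError shown above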
As you already know, Sparkling Water is another option: perform all the data munging in Spark first and then push your data into H2O for ML.
If you want to stick with H2O as it is, then your option is to loop through the frame and process the elements your way. The following approach could be a little time consuming depending on your data; however, it does not require you to change your environment (a rough sketch follows the steps below).
1. Create a new H2OFrame by selecting your "ip" column only, and add empty (NA) location, city, and other columns to it.
2. Loop through all the ip values and, based on "ip", find the location/city and fill in the location, city and other column values.
3. Finally, cbind the new H2OFrame with the original H2OFrame.
4. Check the "ip" and "ip0" columns to verify that the merge matched 100%, then remove the duplicate "ip0" column.
5. Remove the extra intermediate H2OFrame to save memory.
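One way to realize those steps from the Python client is to work in batches rather than cell by cell (the batch size and the lookup_city helper below are assumptions, and this is only a sketch, not something tested at 200 million rows):
import h2o
batch_size = 1000000
city_batches = []
# steps 1-2: pull the ip column down in batches and resolve location/city on the client
for start in range(0, hf.nrows, batch_size):
    stop = min(start + batch_size, hf.nrows)
    ips = hf[start:stop, "ip"].as_data_frame()     # small pandas frame for this batch
    ips["city"] = ips["ip"].apply(lookup_city)     # hypothetical ip -> city helper
    city_batches.append(h2o.H2OFrame(ips))         # push the answers back to the cluster
# step 3: stack the batches and bind them column-wise to the original frame
cities = city_batches[0].rbind(city_batches[1:]) if len(city_batches) > 1 else city_batches[0]
hf = hf.cbind(cities)
# step 4: cbind renames the duplicated column to "ip0"; verify it matches "ip", then drop it
hf = hf.drop("ip0")
# step 5: remove the intermediate frames from the cluster to save memory
h2o.remove(city_batches)
h2o.remove(cities)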

If your ip --> city algorithm is a lookup table, you could create that as a data frame, then use h2o.merge. For an example, this video (starting at around the 59min mark) shows how to merge weather data into the airlines data.
For IP addresses, I imagine you might want to first truncate to the first two or three octets.
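A rough sketch of the merge route, assuming the big frame is called big_frame and you can build a pandas lookup table ip_to_city with 'ip' and 'city' columns (truncated to a prefix if you go that route):
import h2o
lookup = h2o.H2OFrame(ip_to_city)             # columns: ip, city
merged = big_frame.merge(lookup, all_x=True,  # left join: keep every row of big_frame
                         by_x=["ip"], by_y=["ip"])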
If you don't have a lookup table, it becomes interesting as to whether it is quicker to turn a complex algorithm into such a lookup table and do the h2o.merge, or to stick with downloading your huge data in batches, running the lookup locally in the client, uploading a batch of answers, and doing h2o.cbind at the end.
BTW, the cool and trendy approach would be to sample 1 million of your ip addresses, look up the correct answer on the client to make a training data set, then use H2O to build a machine learning model. You can then use h2o.predict() to create the new city column in your real data. (You will want to at least split the ip address into 4 columns first, though.) (My hunch is a deep random forest would work best... but I would definitely experiment a bit.)
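A hedged sketch of that idea (the labeled_sample and big_frame names, the column names, and the model settings are assumptions to experiment with, not a recipe):
from h2o.estimators.random_forest import H2ORandomForestEstimator
def split_ip(fr):
    # split "a.b.c.d" into four numeric columns and bind them onto the frame
    parts = fr["ip"].strsplit("\\.").asnumeric()
    parts.set_names(["ip1", "ip2", "ip3", "ip4"])
    return fr.cbind(parts)
train = split_ip(labeled_sample)                 # ~1 million rows labelled on the client
train["city"] = train["city"].asfactor()
model = H2ORandomForestEstimator(ntrees=200, max_depth=30)
model.train(x=["ip1", "ip2", "ip3", "ip4"], y="city", training_frame=train)
full = split_ip(big_frame)
preds = model.predict(full)["predict"]           # predicted city per row
preds.set_names(["city"])
full = full.cbind(preds)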

Related

How do I perform deduplication with the python record linkage toolkit with large data sets?

I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from record linkage toolkit. I pulled out the Index of the dataframe I was using, and then converted it to a list, and passed it through the itertools combinations function.
from itertools import combinations, islice
candidates = fl.index                       # pull the Index out of the dataframe
candidates = candidates.tolist()
candidates = combinations(candidates, 2)    # lazy iterator over every unique record pair
This then gave me an iterator full of tuples, without having to load everything into memory. I then consumed it in chunks by passing it through an islice grouper in a for loop:
# pull one million candidate pairs at a time until the iterator is exhausted
for x in iter(lambda: list(islice(candidates, 1000000)), []):
I then performed all of the necessary comparisons inside the for loop and added the resultant dataframe to a dictionary, which I then concatenated at the end to get the full result. Python's memory usage hasn't risen above 3 GB the entire time. (A rough sketch of the loop body is below.)
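The loop body looked roughly like this (a sketch only: the comparison rules and column names are placeholders, and the per-chunk results are collected in a dict as described):
import pandas as pd
import recordlinkage
from itertools import islice
compare = recordlinkage.Compare()
compare.exact('zip', 'zip', label='zip')      # placeholder comparison rules
compare.string('name', 'name', method='jarowinkler', threshold=0.85, label='name')
results = {}
for i, x in enumerate(iter(lambda: list(islice(candidates, 1000000)), [])):
    pairs = pd.MultiIndex.from_tuples(x)      # this chunk of candidate pairs only
    results[i] = compare.compute(pairs, fl)   # feature vectors for the chunk
features = pd.concat(list(results.values()))  # stitch the chunks back together at the end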
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).

Is it possible to manually create Dask data frames? (i.e., not by a fixed partition count)

I would like to define the way in which a Dask dataframe is created (e.g., a particular criterion for splitting), or be able to manually create one.
The situation:
I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key. So I need to ensure that this key is not split over several partitions.
Currently, I am splitting the input data frame (Pandas) manually and use multiprocessing to process each partition separately.
I would love to use Dask, which I also use for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions.
Or am I on a completely wrong path here and should I use other methods of Dask?
You might find dask.delayed useful; you can use it to create a custom Dask dataframe (a minimal sketch is below): https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed
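A minimal sketch of that idea, assuming a pandas dataframe df, a grouping column 'key', and an existing per-subset function process_subset (all names here are placeholders):
import dask
import dask.dataframe as dd
# one delayed pandas partition per key, so a key is never split across partitions
parts = [dask.delayed(grp) for _, grp in df.groupby('key')]
ddf = dd.from_delayed(parts, meta=df.head(0))
# meta must describe what process_subset returns; df.head(0) assumes the same columns
result = ddf.map_partitions(process_subset, meta=df.head(0)).compute()
If there are very many distinct keys, you would probably want to bucket several keys into each partition rather than use one partition per key, since Dask overhead grows with the number of partitions.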

Slow Python loop to search for data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
Edit:
My file info looks like this (about 500 observations, one for each station).
My file data looks like this (there are other variables not shown here; about 15 million observations, one for each trip).
What I am looking to get is that, when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add the 2 new columns to data and perform the equivalent mapping (a sketch of the merge version follows the code below).
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
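For completeness, a rough sketch of the pandas.merge alternative mentioned above (same column names as in the question; treat it as untested against the real data):
# left-join the station coordinates onto the big frame
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))
# only overwrite the 2018 rows, keeping the original values elsewhere
mask = merged['year'] == '2018'
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info'].fillna(merged.loc[mask, 'latitude'])
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info'].fillna(merged.loc[mask, 'longitude'])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])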
This is a very common and important issue that anyone who starts dealing with large datasets runs into. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of your data that are optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting the exact same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should not use for loops that you break out of. Whenever you don't know precisely how many iterations you will have to go through in the first place, you should use while or do...while loops instead.
3. Consider using distributed storage and computing
This is a subject in itself that is far too big to be fully explained here.
Storing, accessing and processing data serially is fine for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks, whose aim is to do everything in parallel. They rely on a concept named MapReduce.
One of the first widely used distributed data storage frameworks was Hadoop (e.g., the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to run MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well be suited to your situation. Feel free to read up on all of these concepts and frameworks, and leave your questions in the comments if needed.

HDF5 Links to Events in Dataset

I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns -- the first column is a string and contains a label for the event ("REM1"), and the second/third columns contain the start/end index respectively. The only reason I don't like this solution is that HDF5 datasets are pretty fixed in size -- if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound this with the fact that I may have MANY events (imagine marking eye-blink events), and this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference — essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and your start and end indexes of your REM period, you can do:
rem_ref = ds.regionref[start:end]   # region reference covering the REM slice
ds.attrs['REM1'] = rem_ref          # stored as an attribute on the dataset
ds[ds.attrs['REM1']]                # dereferencing gives a 1-d set of values from that region
You can store regionrefs pretty naturally — they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the whole "datasets tend not to be variable-length very well" thing.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
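A small sketch of the group idea (the file name, event labels and indices are made up; ds is the EEG dataset from above):
import h5py
with h5py.File('sleep.h5', 'a') as f:
    ds = f['EEG']
    events = f.require_group('REM_periods')
    # hypothetical event table: label -> (start, end) sample indices
    for label, (start, end) in {'REM1': (10000, 45000), 'REM2': (90000, 120000)}.items():
        events.attrs[label] = ds.regionref[start:end]   # one region reference per event
    rem1 = ds[events.attrs['REM1']]   # dereference to read back just that slice
    # del events.attrs['REM2']        # a mis-identified period is just an attribute delete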

How can I ensure unique rows in a large HDF5

I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates on it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, and getting a unique-valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates(). A sketch of a chunked drop_duplicates pass is below.
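If the table was written via pandas/HDFStore in table format, a rough chunked pass might look like this: a sketch that assumes the two key columns are 'timestamp' and 'symbol', the table key is 'timeseries', the set of key pairs fits in memory, and that going through pandas rather than native PyTables calls is acceptable.
import pandas as pd
seen = set()
with pd.HDFStore('data.h5') as src, pd.HDFStore('deduped.h5') as dst:
    # stream the table in chunks instead of loading all 5,000,000+ rows at once
    for chunk in src.select('timeseries', chunksize=500000):
        chunk = chunk.drop_duplicates(subset=['timestamp', 'symbol'])   # in-chunk dups
        keys = list(zip(chunk['timestamp'], chunk['symbol']))
        fresh = [k not in seen for k in keys]                           # cross-chunk dups
        seen.update(keys)
        dst.append('timeseries', chunk[fresh], data_columns=['timestamp', 'symbol'])
The seen set holds only the two key columns, so it stays far smaller than the table itself.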
