Dask DataFrame groupby/apply efficiency - python

I am struggling to come up with an efficient way of solving what seems to be a typical use case of dask.dataframe groupby+apply and am wondering if I'm missing something obvious (various docs speak to this issue but I haven't been able to fully resolve it).
In short, I'm trying to load a medium-sized (say 10GB) dataframe, group by some column, train a machine learning model for each subset (a few seconds per model, ~100k subsets), and save that model to disk. My best attempt so far looks like:
from dask.distributed import Client
import dask.dataframe as dd

c = Client()
df = dd.read_parquet('data.parquet')
df = c.persist(df.set_index('key'))  # data is already sorted by key
result = c.compute(df.groupby(df.index).apply(train_and_save_model))
No matter how I try to repartition the data, I seem to spend an enormous amount of time on serialization/IO compared to actual computation. A naive workaround of writing 100k separate Parquet files up front and then passing filenames to the workers to load and train on seems to be much more efficient, and I'm struggling to see why that would perform any differently. Isn't the idea of setting the index and partitioning that each worker knows which parts of the file it should read from? I assume I'm missing something obvious here, so any guidance would be appreciated.

It looks like there's a big difference between data that is read in with index='key' and data that is re-indexed with set_index; I thought that if the new index was already sorted then there'd be no shuffling cost but I guess that was wrong.
Changing to dd.read_parquet('data.parquet', index='key') seems to give performance like what I was hoping for so I'm happy (though still curious why set_index seems to shuffle unnecessarily in this case).
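For completeness, a minimal sketch of the variant that worked for me, assuming the Parquet data is already sorted by 'key' and train_and_save_model is the per-group training function from above:
from dask.distributed import Client
import dask.dataframe as dd

c = Client()
# declaring the index at read time avoids the shuffle that set_index triggered
df = dd.read_parquet('data.parquet', index='key')
df = c.persist(df)
result = c.compute(df.groupby(df.index).apply(train_and_save_model))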

Related

Data Quality check with Python Dask

I am currently trying to write code to check the data quality of a 7 GB data file. I tried googling for exactly this, but to no avail. Initially the purpose of the code is to check how many values are nulls/NaNs; later on I want to join the file with another data file and compare the quality of the two. We expect the second file to be more reliable, but I would like to eventually automate the whole process. I was wondering if someone here is willing to share their data-quality Python code using Dask. Thank you.
I would suggest the following approach:
try to define how you would check quality on a small dataset and implement it in Pandas;
try to generalize the process in such a way that if each "part of the file" (partition) is of good quality, then the whole dataset can be considered of good quality;
use Dask's map_partitions to parallelize this processing over your dataset's partitions (see the sketch below).
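A minimal sketch of that approach, with a hypothetical file name and a simple null/NaN count standing in for the quality check:
import dask.dataframe as dd

df = dd.read_csv('big_file_*.csv')  # hypothetical source; read_parquet works the same way

def null_counts(partition):
    # the per-partition check is written in plain pandas
    return partition.isna().sum().to_frame().T

# run the pandas check on every partition, then aggregate the partial results
nulls_per_column = df.map_partitions(null_counts).sum().compute()
print(nulls_per_column)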

Is there a proper time to clean data when importing into a Pandas dataframe?

I am scraping the CIA World Factbook for country data as a learning exercise. I scrape the data and clean it up during import, and later convert it to a Pandas dataframe.
I have two choices - clean the data as it is being read in, as I am doing now, or just read everything into the dataframe and clean it up after the fact.
Here are two examples of what I am doing now:
raw data:
info = "$2,000 note: data are in 2017 dollars (2020 est.)"
cleaning on import:
int(info[:info.find(' ')].replace(',', '').replace('$', ''))
result: 2000
raw data:
info = "36.08 births/1,000 population (2021 est.)"
cleaning on import:
float(info[:info.find(' ')].replace(',', ''))
result: 36.08
I suspect that cleaning the data in the dataframe after downloading would be a better solution, but the only way I can think of to do that is with regular expressions, which at the moment I am not too well versed in. Would that be the "correct" way to do it, or does it even matter? If cleaning up the dataframe is the solution, what might those expressions look like?
Thanks
There are some things that are important depending on your case:
Do you want it to be highly reproducible or extendable?
Should it be highly performant?
Is readability more important than performance/extendability?
I've found that in the vast majority of cases, performance doesn't matter that much. As long as you're not dealing with enormous amounts of data or working on low-performing infrastructure, it should run sufficiently fast. Again, this depends on your use case.
What I find far more annoying and time-consuming are over-complex functions whose workings you can no longer follow afterwards, or severely nested functionality. Those can take an enormous amount of time to fix once your data format changes or you need to alter some small part of the code.
I would therefore agree that the ideal workflow is to first download and store the raw data for reproducibility. Then write a processing function that makes it 'DataFrame'-ready. Whenever your raw data changes, you only have to rewrite this single function and assert that the processed data comes out in the same format as before.
Moreover, whenever you decide that you don't want to use pandas anymore (because you want to use plain numpy arrays, for example), it is much easier to remove pandas from your code than when it has been completely knitted into your workflow from the very beginning.
This would be my motivation to do the processing before reading into a DataFrame.
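As a rough illustration of that workflow (the regex, helper name and column names here are just an example, not the only way to do it):
import re
import pandas as pd

def parse_leading_number(raw):
    """Pull the first numeric value out of a scraped field,
    e.g. "$2,000 note: ..." -> 2000.0 and "36.08 births/1,000 ..." -> 36.08."""
    match = re.search(r'[\d,.]+', raw)
    return float(match.group(0).replace(',', '')) if match else float('nan')

# 1. keep the raw scraped strings untouched, for reproducibility
raw_df = pd.DataFrame([{
    "gdp": "$2,000 note: data are in 2017 dollars (2020 est.)",
    "birth_rate": "36.08 births/1,000 population (2021 est.)",
}])

# 2. a single processing step turns the raw frame into an analysis-ready one
clean_df = raw_df.apply(lambda col: col.map(parse_leading_number))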

Slow Python loop to look up data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame where the latitude and longitude appear next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, this takes a very long time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
edit:
My info file has about 500 rows, one per station, with the station number and its latitude and longitude.
My data file has about 15 million rows, one per trip, with the station number plus other variables not shown here.
What I am looking for is that, wherever the station numbers match, the latitude and longitude from info end up next to the station in data.
This is one solution. You can also use pandas.merge to add the 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']

# calculate a Boolean mask on year
mask = data['year'] == '2018'

# apply the mappings; if no mapping is found, use fillna to keep the original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
                                 .fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
                                  .fillna(data.loc[mask, 'longitude'])
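For reference, a rough sketch of the pandas.merge alternative mentioned above; it assumes data already has latitude/longitude columns to fall back on, as in the loop from the question:
# bring the station coordinates in as extra columns via a left join
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))

# only overwrite coordinates for 2018 rows, keeping existing values elsewhere
mask = merged['year'] == '2018'
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info'].fillna(merged.loc[mask, 'latitude'])
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info'].fillna(merged.loc[mask, 'longitude'])
data = merged.drop(columns=['latitude_info', 'longitude_info'])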
This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of the data that are optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, avoid for loops that you break out of early. Whenever you don't know in advance precisely how many iterations you will need, a while (or do...while) loop is usually the better fit.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data serially is fine for small amounts of data but inappropriate for large datasets. Instead, we use distributed storage and computing frameworks, which aim to do everything in parallel and rely on a concept called MapReduce.
The best-known distributed data storage framework is Hadoop (with the Hadoop Distributed File System, HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this ecosystem, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, engine such as Spark or Apache Ignite on top of HDFS. Depending on your needs, also have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole world of its own, but it might very well be suited to your situation. Feel free to read up on all these concepts and frameworks, and leave questions in the comments if needed.
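To make the MapReduce idea a bit more concrete, here is a toy illustration in plain Python; multiprocessing only stands in for a real distributed framework, and the data is made up:
from multiprocessing import Pool

def count_2018_rows(chunk):
    # "map" step: each worker counts matches in its own chunk independently
    return sum(1 for row in chunk if row['year'] == '2018')

if __name__ == '__main__':
    chunks = [[{'year': '2018'}, {'year': '2017'}],
              [{'year': '2018'}, {'year': '2018'}]]
    with Pool() as pool:
        partial_counts = pool.map(count_2018_rows, chunks)
    total = sum(partial_counts)  # "reduce" step: combine the partial results
    print(total)                 # 3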

Nested data in Pandas

First of all: I know this is a dangerous question. There are a lot of similar questions about storing and accessing nested data in pandas but I think my question is different (more general) so hang on. :)
I have a medium-sized dataset of workouts for one athlete. Each workout has a date and time, ~200 properties (e.g. average speed and heart rate) and some raw data (3-10 lists of e.g. per-second speed and heart rate values). I have about 300 workouts and each workout contains on average ~4000 seconds.
So far I tried 3 solutions to store this data with pandas to be able to analyze it:
1. I could use a MultiIndex and store all data in one DataFrame, but this DataFrame would get quite large (which doesn't have to be a problem, but visually inspecting it will be hard) and slicing the data is cumbersome.
2. Another way would be to store the date and properties in a DataFrame df_1 and to store the raw data in a separate DataFrame df_2 that I would store in a separate column raw_data in df_1.
3. Or (similar to (2)) I could store the raw data in separate DataFrames that I store in a dict with keys identical to the index of the DataFrame df_1.
All of these solutions work, and for this use case there are no major performance differences between them. To me, (1) feels the most 'Pandorable' (really like that word :) ), but slicing the data is difficult and visual inspection of the DataFrame (printing it) is of no use. (2) feels a bit 'hackish' and in-place modifications can be unreliable, but this solution is very nice to work with. And (3) is ugly and a bit difficult to work with, but also the most Pythonic in my opinion.
Question: What would be the benefits of each method and what is the most Pandorable solution in your opinion?
By the way: Of course I am open to alternative solutions.
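For concreteness, a rough sketch of what option (1) could look like (the values and column names are made up):
import pandas as pd

# per-second raw data for two workouts, indexed by (workout_id, second)
index = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (2, 0), (2, 1)],
                                  names=['workout_id', 'second'])
raw = pd.DataFrame({'speed': [3.1, 3.3, 2.8, 2.9],
                    'heart_rate': [120, 125, 110, 112]}, index=index)

workout_1 = raw.loc[1]  # slice one workout back out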

Is there a faster way to append many XLS files into a single CSV file?

After the recommendation in Jeff's answer to check out this Google Forum, I still didn't feel satisfied with the conclusion regarding the appendCSV method. Below you can see my implementation for reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
import glob
import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    data = pd.read_excel(a_file, sheetname=0, skiprows=range(1, 2), header=1)
    data.rename(columns={'Alphabeta': 'AlphaBeta'}, inplace=True)
    frame = frame.append(data)

# Save to CSV
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
The very first important point
Optimize only code that is required to be optimized.
If you need to convert all your files just once, then you have already done a great job, congrats! If, however, you need to reuse it really often (and by "really" I mean that there is a source producing your Excel files at a rate of at least 900K rows per 10 minutes and you need to parse them in real time), then what you need to do is analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime', which is the cumulative execution time of a function including its subcalls, you will discover that out of ~2000 seconds of runtime, ~800 seconds are taken by the 'read_excel' method and ~1200 seconds by the 'to_csv' method.
If you then sort the profile by 'tottime', which is the total execution time of the functions themselves, you will find that the top time consumers are functions connected with reading and writing lines and converting between formats. So the real problem is that either the libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first reason, please keep in mind that parsing Excel rows and converting them can be a really complex task. It is hard to advise you without an example of your input data. But there could be a real time loss simply because the library you are using is general-purpose and does the hard work of parsing rows several times when you don't actually need it, because your rows have a very simple structure. In that case you may try switching to a different library that does not perform complex parsing of the input data, for example xlrd for reading data from Excel. But the title also mentions CSV, so if your input files are actually plain CSV then you can load lines with just:
line.strip().split(sep)
instead of complex Excel format parsing. And of course, if your rows are simple, then you can always use
','.join(list_of_rows)
to write CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on then these libraries are probably the best choice.
For the second reason: 900K rows could amount to anything from under a megabyte to many gigabytes, so without an example it is hard to tell whether your input is really that big. If you do have a lot of data then there probably isn't much you can do and you just have to wait. Remember that a disk is actually a very slow device: a typical disk gives you ~100 MB/s at best, so if you are copying (because ultimately that is what you are doing) 10 GB of data, then at least 3-4 minutes are needed just to physically read the raw data and write the result. If you are not using your disk bandwidth at 100% (for example, if parsing one row with the library you are using takes time comparable to just reading that row from disk), you could also try to speed up your code by reading the data asynchronously with multiprocessing's map_async instead of a sequential loop.
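As a rough sketch of that last suggestion (the file location and read options are assumptions; map_async can replace map if you want a non-blocking call):
import glob
from multiprocessing import Pool
import pandas as pd

def read_one(path):
    # each worker parses one workbook independently
    return pd.read_excel(path, sheet_name=0, header=1)

if __name__ == '__main__':
    files = glob.glob('input/*.xls')   # hypothetical location
    with Pool() as pool:
        frames = pool.map(read_one, files)
    pd.concat(frames, ignore_index=True).to_csv('combined.csv', index=False)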
If you are using pandas, you could do this:
import os
from os import path
import pandas as pd

# read every matching workbook, then concatenate once at the end
dfs = [pd.read_excel(path.join(dir, name), sheet_name=0)
       for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use engine='c' to speed up the import.
Skip rows on error
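A small illustration of those tips (the column name and dtype are made up; on_bad_lines requires a recent pandas, older versions use error_bad_lines=False instead):
import pandas as pd

df = pd.read_csv('data.csv',
                 engine='c',                      # the fast C parser
                 dtype={'AlphaBeta': 'float64'},  # hypothetical column
                 on_bad_lines='skip')             # skip malformed rows instead of failing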
