After some extensive research I have figured that
Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
However, I am unable to understand why parquet writes multiple files when I run df.write.parquet("/tmp/output/my_parquet.parquet") despite supporting flexible compression options and efficient encoding.
Is this directly related to parallel processing or similar concepts?
Lots of frameworks make use of this multi-file layout feature of the parquet format. So I’d say that it’s a standard option which is part of the parquet specification, and spark uses it by default.
This does have benefits for parallel processing, but also other use cases, such as processing (in parallel or series) on the cloud or networked file systems, where data transfer times may be a significant portion of total IO. in these cases, the parquet “hive” format, which uses small metadata files which provide statistics and information about which data files to read, offers significant performance benefits when reading small subsets of the data. This is true whether a single-threaded application is reading a subset of the data or if each worker in a parallel process is reading a portion of the whole.
It's not just for parquet but rather a spark feature where to avoid network io it writes each shuffle partition as a 'part...' file on disk and each file as you said will have compression and efficient encoding by default.
So Yes it is directly related to parallel processing
I have nearly 60-70 timing log files(all are .csv files, with a total size of nearly 100MB). I need to analyse these files at a single go. Till now, I've tried the following methods :
Merged all these files into a single file and stored it in a DataFrame (Pandas Python) and analysed them.
Stored all the csv files in a database table and analysed them.
My doubt is, which of these two methods is better? Or is there any other way to process and analyse these files?
Thanks.
For me I usually merge the file into a DataFrame and save it as a pickle but if you merge it the file will pretty big and used up a lot of ram when you used it but it is the fastest way if your machine have a lot of ram.
Storing the database is better in the long term but you will waste your time uploading the csv to the database and then waste even more of your time retrieving it from my experience you use the database if you want to query specific things from the table such as you want a log from date A to date B however if you use pandas to query all of that than this method is not very good.
Sometime for me depending on your use case you might not even need to merge it use the filename as a way to query and get the right log to process (using the filesystem) then merge the log files you are concern with your analysis only and don't save it you can save that as pickle for further processing in the future.
What exactly means analysis on a single go?
I think your problem(s) might be solved using dask and particularly the dask dataframe
However, note that the dask documentation recommends to work with one big dataframe, if it fits comfortably in the RAM of you machine.
Nevertheless, an advantage of dask might be to have a better parallelized or distributed computing support than pandas.
I'm trying to apply machine learning (Python with scikit-learn) to a large data stored in a CSV file which is about 2.2 gigabytes.
As this is a partially empirical process I need to run the script numerous times which results in the pandas.read_csv() function being called over and over again and it takes a lot of time.
Obviously, this is very time consuming so I guess there is must be a way to make the process of reading the data faster - like storing it in a different format or caching it in some way.
Code example in the solution would be great!
I would store already parsed DFs in one of the following formats:
HDF5 (fast, supports conditional reading / querying, supports various compression methods, supported by different tools/languages)
Feather (extremely fast - makes sense to use on SSD drives)
Pickle (fast)
All of them are very fast
PS it's important to know what kind of data (what dtypes) you are going to store, because it might affect the speed dramatically
I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.
I have tried entering several competitions to improve my machine learning skills, but have been unable to find a good way to access the data in my code. Kaggle provides one large data file, 50-200mb, per competition in csv format.
What is the best way to load and use these tables in my code? My first instinct was to use databases, so I tried loading the csv into sqlite a single database, but this put a tremendous load on my computer and during the commits, it was common for my computer to crash. Next, I tried using a mysql server on a shared host, but doing queries on it took forever, and it made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.
In my classes so far, my instructors usually clean up the data and give us managable datasets that can be completely loaded into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a 4 year old macbook with 4gb ram and a dualcore 2.1Ghz cpu.
By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I'd like a solution that allows me to do all or nearly all coding in this language.
Prototype--that's the most important thing when working with big data. Sensibly carve it up so that you can load it in memory to access it with an interpreter--e.g., python, R. That's the best way to create and refine your analytics process flow at scale.
In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.
Here's the workflow i use to do that--surely not the best way to do it, but it is one way, and it works:
I. Use lazy loading methods (hopefully) available in your language of
choice to read in large data files, particularly those exceeding about 1 GB. I
would then recommend processing this data stream according to the
techniques i discuss below, then finally storing this fully
pre-processed data in a Data Mart, or intermediate staging container.
One example using Python to lazy load a large data file:
# 'filename' is the full path name for a data file whose size
# exceeds the memory on the box it resides. #
import tokenize
data_reader = open(some_filename, 'r')
tokens = tokenize.generate_tokens(reader)
tokens.next() # returns a single line from the large data file.
II. Whiten and Recast:
Recast your columns storing categorical
variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain
a
look-up table (the same hash as you used for this conversion
except
the keys and values are swapped out) to convert these integers
back
to human-readable string labels as the last step in your analytic
workflow;
whiten your data--i.e., "normalize" the columns that
hold continuous data. Both of these steps will substantially
reduce
the size of your data set--without introducing any noise. A
concomitant benefit from whitening is prevention of analytics
error
caused by over-weighting.
III. Sampling: Trim your data length-wise.
IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.
Principal Component Analysis (PCA) is a simple and reliable technique to do this:
import numpy as NP
from scipy import linalg as LA
D = NP.random.randn(8, 5) # a simulated data set
# calculate the covariance matrix: #
R = NP.corrcoef(D, rowvar=1)
# calculate the eigenvalues of the covariance matrix: #
eigval, eigvec = NP.eig(R)
# sort them in descending order: #
egval = NP.sort(egval)[::-1]
# make a value-proportion table #
cs = NP.cumsum(egval)/NP.sum(egval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(egval)) :
print("{0:.2f}\t\t{1:.2f}".format(egval[i], cs[i]))
eigenvalue var proportion
2.22 0.44
1.81 0.81
0.67 0.94
0.23 0.99
0.06 1.00
So as you can see, the first three eigenvalues account for 94% of the variance observed in original data. Depending on your purpose, you can often trim the original data matrix, D, by removing the last two columns:
D = D[:,:-2]
V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes--a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data table' (from the CRAN Package of the same name) are good candidates. I also strongly recommend redis--blazing fast reads, terse semantics, and zero configuration, make it an excellent choice for this use case. redis will easily handle datasets of the size you mentioned in your Question. Using the hash data structure in redis, for instance, you can have the same structure and the same relational flexibility as MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, redis is in fact a database server. I am actually a big fan of SQLite, but i believe redis just works better here for the reasons i just gave.
from redis import Redis
r0 = Redis(db=0)
r0.hmset(user_id : "100143321, {sex : 'M', status : 'registered_user',
traffic_source : 'affiliate', page_views_per_session : 17,
total_purchases : 28.15})
200 megabytes isn't a particularly large file for loading into a database. You might want to try to split the input file into smaller files.
split -l 50000 your-input-filename
The split utility will split your input file into multiple files of whatever size you like. I used 50000 lines per file above. It's a common Unix and Linux command-line utility; don't know whether it ships with Macs, though.
Local installs of PostgreSQL or even MySQL might be a better choice than SQLite for what you're doing.
If you don't want to load the data into a database, you can take subsets of it using command-line utilities like grep, awk, and sed. (Or scripting languages like python, ruby, and perl.) Pipe the subsets into your program.
You need 'Pandas' for it. I think you must have got it by now. But still if someone else is facing the issue can be benefited by the answer.
So you don't have to load the data into any RDBMS. keep it in file and use it by simple Pandas
load, dataframes. Here is the link for pandas lib-->
http://pandas.pydata.org/
If data is too large you need any cluster for that. Apache Spark or Mahout which can run on Amazon EC2 cloud. Buy some space there and it will be easy to use. Spark also has an API for Python.
I loaded a 2 GB Kaggle dataset in R using H2O.
H2O is build on java and it creates a virtual environment, the data will be available in memory and you will have much faster access since H2O is java.
You will just have to get used to the syntax of H2O.
It has many beautifully built ml algorithms and provides good support for distributed computing.
You can also use all your cpu cores easily. Check out h2o.ai for how to use it. It can handle 200 MB easily, given than you only have 4 GB ram. You should upgrade to 8 G or 16 G
General technic is to "divide and conquere". If you can split you data into parts and process them separetly then it can be handled by one machine. Some task can be solved that way(PageRank, NaiveBayes, HMM etc) and some were not,(one requred global optimisation) like LogisticeRegression, CRF, many dimension reduction technicues
I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB etc) need to hold data in memory. SAS continues to have a strong selling point here because while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
HDF5 advantages:
data can be inspected conveniently using the h5view application, h5py/ipython and the h5* commandline tools
APIs are available for different platforms and languages
structure data using groups
annotating data using attributes
worry-free built-in data compression
io on single datasets is fast
HDF5 pitfalls:
Performance breaks down, if a h5 file contains too many datasets/groups (> 1000), because traversing them is very slow. On the other side, io is fast for a few big datasets.
Advanced data queries (SQL like) are clumsy to implement and slow (consider SQLite in that case)
HDF5 is not thread-safe in all cases: one has to ensure, that the library was compiled with the correct options
changing h5 datasets (resize, delete etc.) blows up the file size (in the best case) or is impossible (in the worst case) (the whole h5 file has to be copied to flatten it again)
I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:
you can mine the source code of any of those packages for ideas that might help you generally
you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
under some circumstances, it might prove convenient to just link to R for a particular analysis using one of the above-linked packages and then draw the results back into Python
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.