I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame with the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0,15557580):
    for j in range(0,542):
        if data.year[i] == '2018' and data.station[i]==info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, this takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
edit :
My file info looks like this (about 500 observations, one for each station)
My file data looks like this (there are other variables not shown here) (about 15 million observations, one for each trip)
What I am looking to get is that, when the station numbers match, the resulting data looks like this:
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
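The pandas.merge alternative mentioned above can be sketched as follows. This is a minimal illustration on toy frames, assuming that data does not yet have latitude and longitude columns and that station is the join key in both frames. The sample values are made up.

```python
import pandas as pd

# Toy stand-ins for the 'info' and 'data' frames described in the question.
info = pd.DataFrame({
    "station": [1, 2],
    "latitude": [45.50, 45.52],
    "longitude": [-73.57, -73.60],
})
data = pd.DataFrame({
    "year": ["2018", "2018", "2017"],
    "station": [2, 1, 2],
})

# A left merge keeps every row of 'data' and looks each station up once,
# instead of scanning 'info' for every observation.
merged = data.merge(info, on="station", how="left")

# Blank out the coordinates for rows outside 2018, mirroring the year filter.
not_2018 = merged["year"] != "2018"
merged.loc[not_2018, ["latitude", "longitude"]] = float("nan")

print(merged)
```

Stations missing from info simply end up with NaN coordinates, so no explicit loop or break is needed.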
This is a common and important issue for anyone starting to work with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of data, making them optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it. You will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, avoid for loops that you exit early with break. Whenever you don't know in advance how many iterations you will need, a while loop expresses the intent more clearly.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data serially is faster for small amounts of data but ill-suited to large datasets. Instead, we use distributed storage and computing frameworks.
These aim at doing everything in parallel and rely on a concept named MapReduce.
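To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It only illustrates the pattern (map each chunk independently, then reduce the partial results) and is not how Hadoop or Spark are actually invoked. The station labels and the chunking are made up.

```python
from collections import Counter
from functools import reduce

# Made-up records: the station id of each trip.
records = ["s1", "s2", "s1", "s3", "s1", "s2"]

# Split the work into chunks that could be processed independently.
chunks = [records[0:3], records[3:6]]

def map_chunk(chunk):
    """Map step: count the stations inside one chunk."""
    return Counter(chunk)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

partials = map(map_chunk, chunks)
totals = reduce(reduce_counts, partials, Counter())

print(totals)
```

In a distributed framework the map step runs on different machines, and only the small partial counts travel over the network.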
Hadoop, with its Hadoop Distributed File System (HDFS), was one of the first widely adopted distributed storage frameworks. It has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but a higher-level framework, preferably in-memory, such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well suit your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
for fname in ids['fnames']:
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
    list_of_ds.append(aq)
    aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am using xarray to read netCDF files (around 1000) and save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed this up?
I also tried loading the data directly, but it is still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
aq.isel({'lon':irlon,'lat':irlat}).to_netcdf('tmp.nc')
Unfortunately, concatenating ~1000 files in xarray will be slow. There's no great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
Use xr.open_mfdataset. Your second code block looks great. Dask will generally be faster and more efficient at managing tasks than you will be with a for loop.
Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility about how to read the data into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data such that you're only reading in a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: make sure the chunk sizes are manageable and not too small (which leads to too many tasks). Shoot for chunks in the ~100-500 MB range as a general rule, and try to keep the number of tasks below 1 million (or the number of chunks below ~10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if you're concatenating files which are structured logically along one or more dimensions, it may help to pass the files in that same order (a list, or nested lists for several dims) together with concat_dim.
Skip the pre-processing. If you can, add new dimensions on concatenation rather than as an ingestion step. This allows dask to plan the computation more fully, rather than treating your preprocess function as a black box. What's worse, because you're using combine='by_coords', where the coords are the result of an earlier dask operation, the preprocess step becomes a prerequisite to scheduling the final array construction. If you need to attach a time dim to each file, with 1 element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience.
If this is all taking too long, you could try pre-processing the data. Using a compiled utility like nco or even just subsetting the data and grouping smaller subsets of the data into larger files using dask.distributed's client.map might help cut down on the complexity of the final dataset join.
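As a small illustration of adding the time dimension at concatenation time rather than in a preprocess step, here is a sketch that uses in-memory datasets as stand-ins for the files, since the real files are not available. The variable name aq, the grid sizes and the dates are assumptions. With real files, the same pd.Index would be passed as concat_dim to xr.open_mfdataset, as suggested above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# In-memory stand-ins for three netCDF files without a time dimension.
datasets = [
    xr.Dataset(
        {"aq": (("lat", "lon"), np.full((4, 5), float(i)))},
        coords={"lat": np.arange(4), "lon": np.arange(5)},
    )
    for i in range(3)
]

# Subset each dataset first, so only the region of interest is concatenated.
subsets = [ds.isel(lat=slice(1, 3), lon=slice(0, 2)) for ds in datasets]

# Build the new dimension explicitly: one time stamp per dataset.
time = pd.Index(pd.date_range("2020-01-01", freq="D", periods=3), name="time")
combined = xr.concat(subsets, dim=time)

print(combined["aq"].dims, combined["aq"].shape)
```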
I am scraping the CIA World Factbook for country data as a learning exercise. I scrape the data, clean it up during import, and later convert it to a pandas dataframe.
I have two choices - clean the data as it is being read in, as I am doing now, or just read everything into the dataframe and clean it up after the fact.
Here are two examples of what I am doing now:
raw data
info = "$2,000 note: data are in 2017 dollars (2020 est.)"
int(info.text[:info.text.find(' ')].replace(',', '').replace('$', ''))
result 2000
raw data
info = "36.08 births/1,000 population (2021 est.)"
float(info.text[:info.text.find(' ')].replace(',', ''))
result 36.08
I suspect that cleaning in the dataframe after downloading would be a better solution but the only way I can think to do that is using Regular Expressions - which at the moment I am not too well versed in. Would that be the "correct" way to do it, or does it even matter? If cleaning up the dataframe is the solution, what might these look like?
Thanks
There are some things that are important depending on your case:
Do you want it to be highly reproducible or extendable?
Should it be highly performant?
Is readability more important than performance/extendability?
I've found that in the vast majority of cases, performance doesn't matter that much. As long as you're not processing an enormous amount of data or working on low-performing infrastructure, it should run sufficiently fast. Again, this depends on your use case.
What I find far more annoying and time-consuming is over-complex functions whose workings you won't remember later, or severely nested functionality. Those can take an enormous amount of time to fix once your data format changes or you need to alter some small part of the code.
I would therefore agree that the ideal workflow is to first download and store the raw data for reproducibility. Then you should write a processing function that makes the data 'DataFrame' ready. Whenever your raw data changes, you only have to rewrite this single function and assert that the processed data comes out in the same format as before.
Moreover, if you ever decide that you don't want to use pandas anymore (because you want to use regular numpy arrays, for example), it is easier to remove pandas from your code than when it has been woven into your workflow from the very beginning.
This would be my motivation to do the processing before reading into a DataFrame.
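Such a processing function could look like the sketch below, which handles the two sample strings from the question with one regular expression. The function name parse_leading_number and the pattern are hypothetical. It assumes each raw value starts with an optional dollar sign followed by a non-negative number, so negative values or other currencies would need a different pattern.

```python
import re

# Optional '$', then digits with optional thousands separators and decimals.
_LEADING_NUMBER = re.compile(r"^\$?(\d[\d,]*(?:\.\d+)?)")

def parse_leading_number(raw):
    """Return the number at the start of a raw string as a float, or None."""
    match = _LEADING_NUMBER.match(raw.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

gdp = parse_leading_number("$2,000 note: data are in 2017 dollars (2020 est.)")
birth_rate = parse_leading_number("36.08 births/1,000 population (2021 est.)")

print(gdp, birth_rate)
```

Because the function takes and returns plain Python values, it can be applied while scraping or later through a dataframe column's map method, whichever turns out to be more convenient.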
I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple of billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from record linkage toolkit. I pulled out the Index of the dataframe I was using, and then converted it to a list, and passed it through the itertools combinations function.
from itertools import combinations

# 'fl' is the dataframe being deduplicated
candidates = fl.index.tolist()
candidates = combinations(candidates, 2)
This gave me an iterator of tuples, without having to load everything into memory. I then passed it through an islice grouper in a for loop.
for x in iter(lambda: list(islice(candidates,1000000)),[]):
I then proceeded to perform all of the necessary comparisons in the for loop, and added the resultant dataframe to a dictionary, which I then concatenate at the end for the full list. Python's memory usage hasn't risen above 3GB the entire time.
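The chunking pattern described above can be written as a small generator. This is a sketch on made-up record ids. The helper name pair_chunks and the chunk size are hypothetical, and the comparison step is left out because it depends on your data.

```python
from itertools import combinations, islice

def pair_chunks(ids, chunk_size):
    """Yield lists of at most chunk_size candidate pairs, built lazily."""
    pairs = combinations(ids, 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk

record_ids = list(range(6))  # 6 records give 15 candidate pairs

# Only one chunk of pairs is held in memory at a time.
sizes = [len(chunk) for chunk in pair_chunks(record_ids, 4)]

print(sizes)  # [4, 4, 4, 3]
```

Inside the loop, each chunk could be turned into a pandas MultiIndex with pd.MultiIndex.from_tuples(chunk) before it is handed to the comparison step.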
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).
Just started out with Tinkerpop and Janusgraph, and I'm trying to figure this out based on the documentation.
I have three datasets, each containing about 20 million rows (csv files)
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.
But first I need a way to get the data into Janusgraph.
Possibly there exist scripts for this.
But otherwise, is it perhaps something to be written in python, to open a csv file, get each row of a variable X, and add this as a vertex/edge/etc. ...?
Or am I completely misinterpreting Janusgraph/Tinkerpop?
Thanks for any help in advance.
EDIT:
Say I have a few files, each of which contains a few million rows, representing people, and several variables, representing different metrics. A first example could look like this:
          metric_1  metric_2  metric_3  ..
person_1  a         e         i
person_2  b         f         j
person_3  c         g         k
person_4  d         h         l
..
Should I translate this into files with nodes that are initially made up of just the values [a,..., l]
(and later perhaps more elaborate sets of properties)?
And are [a,..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1,...,12 for all the nodes and edges, independent of their overlapping label/category). Should each measurement be indexed separately and then linked to the person_x to which it belongs?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version about 2 years ago and it's still a pain to bulk load data. Much of this is not necessarily down to JanusGraph: different users have very different data, different formats and different graph models (i.e. some mostly need one vertex with one edge (e.g. child-mother), others deal with one vertex with many edges (e.g. user followers)). Last but definitely not least, the very nature of the tool is to deal with large data sets, and the underlying storage and index databases mostly come preconfigured to replicate massively (i.e. you might be thinking 20m rows but you actually end up inserting 60m or 80m entries).
All that said, I've had moderate success bulk loading some tens of millions of rows in decent timeframes (again, it will be painful, but here are the general steps).
Provide IDs when creating graph elements. If importing from e.g. MySQL, consider combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
Don't specify the schema up front. This is because JanusGraph will need to ensure the data conforms on each insert.
Don't specify indexes up front. This is related to the above but really deserves its own entry: bulk insert first, index later.
Please, please, please be aware of the underlying database's features for bulk inserts and activate them, i.e. read up on the Cassandra, ScyllaDB or Bigtable docs, especially on replication and indexing.
After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate ids) and consider some form of parallelizing the insert requests, e.g. some kind of map-reduce system.
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error, for example on the bulk insert rate: too low is bad (e.g. 10 per second) while too high is equally bad (e.g. 10k per second). It almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.
All said and done, give it a real go, bulk load is the hardest part in my opinion and the struggles are well worth the new dimension it gives your application.
All the best!
JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a gremlin-server, but we won't use it).
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id and target_id and all the links attributes
This might require some data preparation steps.
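The data preparation could be sketched in Python with pandas as below. This is only one possible model, chosen for illustration: each person becomes a node, each measurement becomes a node, and a link connects the two. The column names, the has_measurement label and the id scheme are assumptions, not something JanusGraph requires.

```python
import pandas as pd

# Hypothetical wide table shaped like the example in the question.
wide = pd.DataFrame(
    {"metric_1": ["a", "b"], "metric_2": ["e", "f"]},
    index=pd.Index(["person_1", "person_2"], name="person"),
)

# One row per (person, metric) measurement.
long = wide.reset_index().melt(
    id_vars="person", var_name="metric", value_name="value"
)
long["measurement_id"] = long["person"] + ":" + long["metric"]

# nodes: one line per node, with its attributes.
person_nodes = pd.DataFrame(
    {"id": list(wide.index), "label": "person", "value": ""}
)
measurement_nodes = pd.DataFrame(
    {"id": long["measurement_id"], "label": "measurement", "value": long["value"]}
)
nodes = pd.concat([person_nodes, measurement_nodes], ignore_index=True)

# links: one line per link, with source_id and target_id.
links = pd.DataFrame(
    {
        "source_id": long["person"],
        "target_id": long["measurement_id"],
        "label": "has_measurement",
    }
)

# nodes.to_csv("nodes.csv", index=False) and links.to_csv("links.csv", index=False)
# would then write the two files for the loading script.
print(nodes)
print(links)
```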
Here is an example script
The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
Even though it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script
I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.
I have tried entering several competitions to improve my machine learning skills, but have been unable to find a good way to access the data in my code. Kaggle provides one large data file, 50-200mb, per competition in csv format.
What is the best way to load and use these tables in my code? My first instinct was to use databases, so I tried loading the csv into a single sqlite database, but this put a tremendous load on my computer, and it was common for my computer to crash during the commits. Next, I tried using a mysql server on a shared host, but queries took forever and made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.
In my classes so far, my instructors usually clean up the data and give us manageable datasets that can be completely loaded into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a 4-year-old MacBook with 4 GB of RAM and a dual-core 2.1 GHz CPU.
By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I'd like a solution that allows me to do all or nearly all coding in this language.
Prototype--that's the most important thing when working with big data. Sensibly carve it up so that you can load it in memory to access it with an interpreter--e.g., python, R. That's the best way to create and refine your analytics process flow at scale.
In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.
Here's the workflow I use to do that--surely not the best way to do it, but it is one way, and it works:
I. Use lazy loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, then finally storing this fully pre-processed data in a Data Mart, or intermediate staging container.
One example using Python to lazy load a large data file:
# 'filename' is the full path name for a data file whose size
# exceeds the memory on the box where it resides. #
data_reader = open(filename, 'r')
# a file object is a lazy iterator, so only one line is read at a time
line = next(data_reader) # returns a single line from the large data file.
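For CSV files specifically, pandas offers the same kind of lazy loading through the chunksize argument of read_csv. The sketch below aggregates a file chunk by chunk. The in-memory text stands in for a large file path, and the column names station and duration are made up.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk; a file path works the same way.
csv_text = "station,duration\n1,10\n2,20\n1,30\n2,40\n1,50\n"
source = io.StringIO(csv_text)

# Read two rows at a time and keep only the running totals in memory.
totals = {}
for chunk in pd.read_csv(source, chunksize=2):
    partial = chunk.groupby("station")["duration"].sum()
    for station, total in partial.items():
        totals[int(station)] = totals.get(int(station), 0) + int(total)

print(totals)
```

With a real file the chunk size would be much larger, for example a few hundred thousand rows, chosen so that one chunk fits comfortably in RAM.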
II. Whiten and Recast:
Recast your columns storing categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash as you used for this conversion except the keys and values are swapped out) to convert these integers back to human-readable string labels as the last step in your analytic workflow;
whiten your data--i.e., "normalize" the columns that hold continuous data. Both of these steps will substantially reduce the size of your data set--without introducing any noise. A concomitant benefit from whitening is prevention of analytics error caused by over-weighting.
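A minimal pandas sketch of both steps, on a made-up two-column frame. The column names and the -1/1 coding are assumptions, and "normalize" is taken here to mean a z-score (zero mean, unit standard deviation).

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [20.0, 30.0, 40.0, 50.0],
})

# Recast: map the labels to small integers and keep the inverse look-up table.
encode = {"Male": -1, "Female": 1}
decode = {code: label for label, code in encode.items()}
df["sex"] = df["sex"].map(encode).astype("int8")

# Normalize: centre the continuous column and scale it to unit variance.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Last step of the workflow: convert the integers back to readable labels.
labels = df["sex"].map(decode)

print(df)
print(labels.tolist())
```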
III. Sampling: Trim your data length-wise.
IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.
Principal Component Analysis (PCA) is a simple and reliable technique to do this:
import numpy as NP
from scipy import linalg as LA
D = NP.random.randn(8, 5) # a simulated data set
# calculate the correlation matrix of the 5 columns: #
R = NP.corrcoef(D, rowvar=False)
# calculate the eigenvalues of the (symmetric) correlation matrix: #
eigval, eigvec = LA.eigh(R)
# sort them in descending order: #
eigval = NP.sort(eigval)[::-1]
# make a value-proportion table #
cs = NP.cumsum(eigval)/NP.sum(eigval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(eigval)) :
    print("{0:.2f}\t\t{1:.2f}".format(eigval[i], cs[i]))
eigenvalue var proportion
2.22 0.44
1.81 0.81
0.67 0.94
0.23 0.99
0.06 1.00
So as you can see, the first three eigenvalues account for 94% of the variance observed in original data. Depending on your purpose, you can often trim the original data matrix, D, by removing the last two columns:
D = D[:,:-2]
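Note that dropping the last columns of D only matches the PCA result if the data has already been rotated onto its principal axes. A minimal sketch of that projection step, assuming standardized columns and keeping k = 3 components (the seed and the value of k are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))  # a simulated data set

# Standardize the columns so their covariance equals the correlation matrix.
Z = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)

# Eigen-decomposition of the symmetric correlation matrix.
R = np.corrcoef(D, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)

# Sort eigenvalues and their eigenvectors together, largest first.
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Project the data onto the first k principal components.
k = 3
D_reduced = Z @ eigvec[:, :k]

print(D_reduced.shape)  # (8, 3)
```

Each column of D_reduced is a linear combination of all five original columns, and its variance equals the corresponding eigenvalue.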
V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes--a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data table' (from the CRAN package of the same name) are good candidates. I also strongly recommend redis--blazing fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. redis will easily handle datasets of the size you mentioned in your question. Using the hash data structure in redis, for instance, you can have the same structure and the same relational flexibility as MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, redis is in fact a database server. I am actually a big fan of SQLite, but I believe redis just works better here for the reasons I just gave.
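from redis import Redis
r0 = Redis(db=0)
# store one user record as a redis hash (hset with mapping= replaces the deprecated hmset)
r0.hset("user_id:100143321", mapping={"sex": "M", "status": "registered_user",
        "traffic_source": "affiliate", "page_views_per_session": 17,
        "total_purchases": 28.15})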
200 megabytes isn't a particularly large file for loading into a database. You might want to try to split the input file into smaller files.
split -l 50000 your-input-filename
The split utility will split your input file into multiple files of whatever size you like. I used 50000 lines per file above. It's a common Unix and Linux command-line utility, and it also ships with macOS.
Local installs of PostgreSQL or even MySQL might be a better choice than SQLite for what you're doing.
If you don't want to load the data into a database, you can take subsets of it using command-line utilities like grep, awk, and sed. (Or scripting languages like python, ruby, and perl.) Pipe the subsets into your program.
You need pandas for this. I think you must have found it by now, but someone else facing the same issue may still benefit from this answer.
You don't have to load the data into any RDBMS. Keep it in a file and load it into pandas dataframes. Here is the link for the pandas library:
http://pandas.pydata.org/
If the data is too large, you need a cluster: Apache Spark or Mahout, which can run on the Amazon EC2 cloud. Buy some space there and it will be easy to use. Spark also has an API for Python.
I loaded a 2 GB Kaggle dataset in R using H2O.
H2O is built on Java and runs in its own Java virtual machine. The data is held in memory, so you get much faster access.
You will just have to get used to the syntax of H2O.
It has many beautifully built ML algorithms and provides good support for distributed computing.
You can also use all your CPU cores easily. Check out h2o.ai for how to use it. It can handle 200 MB easily. Given that you only have 4 GB of RAM, you should upgrade to 8 GB or 16 GB.
The general technique is "divide and conquer". If you can split your data into parts and process them separately, then it can be handled by one machine. Some tasks can be solved that way (PageRank, Naive Bayes, HMM, etc.) and some cannot, namely those that require global optimisation, like logistic regression, CRF and many dimension reduction techniques.