I want to create a large dataset (that conforms to a given schema) using Python. Is there a nice way to specify the schema (datatype & length of each of the fields), and let Python create about 100,000 observations for me? Any nice tools already there?
I am familiar with Python...hence would like to stick with it. if there is one using Bash or any other way, please let me know as well.
Thanks!
PD.
You should probably check out the fake-factory package.
Please have a look at this:-
https://github.com/sanju51/Generate-large-Dataset-dynamically-in-Python
SPEED:- 100000 records in 5 seconds(10 columns)
USAGE:- python generate_dataset.py -i Metadata.csv -f sample.csv -nrec 100000 -d ',' -hdr Y
Related
So I want to create 1 million random records that will distribute normally.
I have 12 time spaces:
hours_dic = {1:"00:00-02:00",2:"02:00-05:00",3:"05:00-07:00",4:"07:00-08:00",5:"08:00-10:00",
6:"10:00-12:00",7:"12:00-16:00" ,8:"15:00-17:00",9:"17:00-19:00",10:"19:00-21:00",11:"21:00-22:00",12:"22:00-00:00"}
My problem is that I want the randomized records would distribute normally. Lets say that the mean is "17:00-19:00", right now the std is not important as those records are "dummy data" for a research. I would like to generate the records and later to put them in EXCEL.
To clarify, those hours represent using time, and I want to generate the data with the assumption that the majority use the app at the afternoon.
I thought about using numpy:
x = np.random.normal(loc=9,scale=1,size=1000000)
And then maybe using .map with the hours_dic.
However, I can't find a way to make the generated number integers only (as I want to use the dictionary) and to ensure the distribution is as I wanted.
Thanks for any help, if any elaboration is required please ask.
(If there's an Excel solution I'd like to know it too)
Quick Aside So, I'm a bit of a rookie with Python; therefore forgive my incorrect ways of describing things AND ask me questions if I don't provide enough information.
Ask my title indicates, I'm attempting to bring in a data set that is Lisp data structure. I'm trying to start small and work with a smaller data set (as I'm going to be dealing with much larger eventually) however, I'm unclear as to how I should set up my separators for my pandas
So, I'm bringing in a .dat file from a lisp data structure, and reading it with pandas (or attempting to).
My goal, is to try and have it be a normal data set, where I can separate a given, say function, with its' respected outputs.
My Lisp Data set looks like the following:
(setf nameoffile?'
((function-1 output1) (function-2 output2 output3 output4) (function-3 output5 output 6 output7...)
(function-4 output)
...
(function-N outputN outputM ... )) )
Hopefully this is not too cryptic. Please, let me know if I'm not providing enough information.
Lastly, my goal is to have all of the functions, lets say in a row and have the outputs read across the row in a pandas dataframe (since I'm used to that); for example:
function-1: output1
function-2: output2 and so on and so forth...
Again, please let me know if I'm a bit confusing, or did not provide enough information.
Thank you so much in advance!
EDIT:
My specific question is how can I insert this somewhat ambiguous lisp data structure into a pandas dataframe? Additionally, I dont know how to modify what I want into their desired rows and on how to separate them (delimiter/sep = ?). When I insert this via pandas, I get a very mumble jumbled dataframe. I think a key issue is how do I separate them appropriately?
As noted by #molbdnilo and #sds, it's probably easier to export data from lisp in a common format and then import them in Python using an existing parser.
For example you can save them to CSV file from Lisp, using the cl-csv library that is also available on quicklisp.
As you can see from cl-csv tests, you can get a csv string from you data using the write-csv function:
(write-csv *your-data-rows* :always-quote t)
Or, if you want to proceed line-by-line, you can use write-csv-row function.
Then will be easy to save the resulting string into a file and read this CSV from Python.
If your Lisp program isn't already too large, consider rewriting it in Hy. Hy is a Lisp dialect, so you can continue writing in Lisp. But also,
Hy maintains, over everything else, 100% compatibility in both directions with Python itself.
This means you can use Python libraries when writing Hy, and you can write a module in Hy to use in Python.
I don't know how your project is setup (and I don't know Pandas), but perhaps you can use this to communicate directly with Pandas?
I produce a very large data file with Python, mostly consisting of 0 (false) and only a few 1 (true). It has about 700.000 columns and 15.000 rows and thus a size of 10.5GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
As I have a very wide table, I assume, HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot.
Also, what do yo want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5GB?
Perhaps rethinking your approach to how you're storing the data (eg: store the coordinates of the 1's if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
SciPy has scipy.io.mmwrite which makes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.
I am having a few big files sets of HDF5 files and I am looking for an efficient way of converting the data in these files into XML, TXT or some other easily readable format. I tried working with the Python package (www.h5py.org), but I was not able to figure out any methods with which I can get this stuff done fast enough. I am not restricted to Python and can also code in Java, Scala or Matlab. Can someone give me some suggestions on how to proceed with this?
Thanks,
TM
Mathias711's method is the best direct way. If you want to do it within python, then use pandas.HDFStore:
from pandas import HDFStore
store = HDFStore('inputFile.hd5')
store['table1Name'].to_csv('outputFileForTable1.csv')
You can use h5dump -o dset.asci -y -w 400 dset.h5
-o dset.asci specifies the output file
-y -w 400 specifies the dimension size multiplied by the number of positions and spaces needed to print each value. You should take a very large number here.
dset.h5 is of course the hdf5 file you want to convert
I think this is the easiest way to convert it to an ascii file, which you can import to excel or whatever you want. I did it a couple of times, and it worked for me. I got his information from this website.
In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data are stored in a text file. Loading into Excel is not possible
2) I need to sum a certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally scipy and Rpy can handle the large files when even when the files are so large that they cannot be fitted into memory
Neither Rpy or Scipy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
Python's File I/O doesn't have bad performance, so you can just use the file module directly. You can see what functions are available in it by typing help (file) in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you to import file.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data, is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. for example:
import numpy as np
with file("data.csv", "rb") as f:
title = f.readline() # if your data have a title line.
data = np.loadtxt(f, delimiter=",") # if your data splitted by ","
print np.sum(data, axis=0) # sum along 0 axis to get the sum of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)