Quick aside: I'm a bit of a rookie with Python, so forgive my incorrect ways of describing things and ask me questions if I don't provide enough information.
As my title indicates, I'm attempting to bring in a data set that is a Lisp data structure. I'm trying to start small and work with a smaller data set (as I'm going to be dealing with much larger ones eventually); however, I'm unclear as to how I should set up my separators for pandas.
So, I'm bringing in a .dat file from a lisp data structure, and reading it with pandas (or attempting to).
My goal is to have it be a normal data set, where I can separate a given function, say, from its respective outputs.
My Lisp Data set looks like the following:
(setf nameoffile?'
((function-1 output1) (function-2 output2 output3 output4) (function-3 output5 output6 output7...)
(function-4 output)
...
(function-N outputN outputM ... )) )
Hopefully this is not too cryptic. Please, let me know if I'm not providing enough information.
Lastly, my goal is to have each function in its own row, say, and have its outputs read across that row in a pandas dataframe (since I'm used to that); for example:
function-1: output1
function-2: output2 and so on and so forth...
Again, please let me know if I'm a bit confusing, or did not provide enough information.
Thank you so much in advance!
EDIT:
My specific question is: how can I insert this somewhat ambiguous Lisp data structure into a pandas dataframe? Additionally, I don't know how to arrange the values into their desired rows or how to separate them (delimiter/sep = ?). When I read this in via pandas, I get a very jumbled dataframe. I think a key issue is how to separate them appropriately.
As noted by @molbdnilo and @sds, it's probably easier to export the data from Lisp in a common format and then import it in Python using an existing parser.
For example you can save them to CSV file from Lisp, using the cl-csv library that is also available on quicklisp.
As you can see from the cl-csv tests, you can get a CSV string from your data using the write-csv function:
(write-csv *your-data-rows* :always-quote t)
Or, if you want to proceed line-by-line, you can use write-csv-row function.
Then it will be easy to save the resulting string into a file and read this CSV from Python.
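On the Python side, reading such a file could look roughly like the sketch below. It assumes each CSV row holds a function name followed by its outputs, so rows may have different lengths. The sample text and the helper name read_function_rows are made up for illustration and are not part of cl-csv or pandas.

```python
import csv
import io

def read_function_rows(handle):
    """Map the first field of every CSV row to the list of remaining fields."""
    rows = {}
    for record in csv.reader(handle):
        if not record:          # skip blank lines
            continue
        name, *outputs = record
        rows[name] = outputs
    return rows

# Hypothetical export, with every field quoted as :always-quote t would do.
sample = io.StringIO(
    '"function-1","output1"\n'
    '"function-2","output2","output3","output4"\n'
)
table = read_function_rows(sample)
print(table["function-2"])
```

If pandas is installed, pandas.DataFrame.from_dict(table, orient="index") should then give one row per function, with the shorter rows padded with missing values.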
If your Lisp program isn't already too large, consider rewriting it in Hy. Hy is a Lisp dialect, so you can continue writing in Lisp. But also,
Hy maintains, over everything else, 100% compatibility in both directions with Python itself.
This means you can use Python libraries when writing Hy, and you can write a module in Hy to use in Python.
I don't know how your project is setup (and I don't know Pandas), but perhaps you can use this to communicate directly with Pandas?
I am scraping the CIA Worldbook for country data as a learning exercise. I scrape the data and clean it up during import and then later convert to Pandas dataframe.
I have two choices - clean the data as it is being read in, as I am doing now, or just read everything into the dataframe and clean it up after the fact.
Here are two examples of what I am doing now:
raw data
info = "$2,000 note: data are in 2017 dollars (2020 est.)"
int(info[:info.find(' ')].replace(',', '').replace('$', ''))
result: 2000
raw data
info = "36.08 births/1,000 population (2021 est.)"
float(info[:info.find(' ')].replace(',', ''))
result: 36.08
I suspect that cleaning in the dataframe after downloading would be a better solution but the only way I can think to do that is using Regular Expressions - which at the moment I am not too well versed in. Would that be the "correct" way to do it, or does it even matter? If cleaning up the dataframe is the solution, what might these look like?
Thanks
There are some things that are important depending on your case:
Do you want it to be highly reproducible or extendable?
Should it be highly performant?
Is readability more important than performance/extendability?
I've found that in the vast majority of cases, performance doesn't matter that much. As long as you're not processing an enormous amount of data or working on low-performing infrastructure, it should run sufficiently fast. Again, this depends on your use case.
What I find far more annoying and time-consuming is over-complex functions whose workings you won't remember afterwards, or severely nested functionality. Those can take an enormous amount of time to fix once your data format changes or you need to alter some small part of the code.
I would therefore agree that the ideal workflow would be to first download and store the raw data for reproducibility. Then you should write a processing function that makes it 'DataFrame' ready. Whenever your raw data then changes, you only have to rewrite this single function and assert that the processed data comes out in the same format it used to.
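Such a processing function might look something like this for the two raw strings in the question. It is only a sketch: the name leading_number and the regular expression are my own, and it assumes the number you want always sits at the start of the string.

```python
import re

# Optional "$", then digits with optional thousands separators and an
# optional decimal part, anchored at the start of the string.
LEADING_NUMBER = re.compile(r"^\$?(\d[\d,]*(?:\.\d+)?)")

def leading_number(raw):
    """Return the number at the start of a raw string as a float, or None."""
    match = LEADING_NUMBER.match(raw.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(leading_number("$2,000 note: data are in 2017 dollars (2020 est.)"))
print(leading_number("36.08 births/1,000 population (2021 est.)"))
```

If the raw format changes later, only this one function (and its pattern) needs to be touched.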
Moreover, whenever you decide that you don't want to use pandas anymore (because you want to use regular numpy arrays for example), it is an easier fix to exclude pandas from your code than when it is completely knitted in your workflow since the very beginning.
This would be my motivation to do the processing before reading into a DataFrame.
Totally new in this forum and new in python so I would appreciate it if anybody can help me.
I am trying to build a script in Python based on data that I have in an Excel spreadsheet. I'd like to create an app/script where I can estimate the pregnancy due date and the conception date (for animals) based on measurements that I have taken during ultrasounds. I am able to estimate it with a calculator, but it takes some conversions (from cm to mm, and from days to months). To do that in Python, I figured I would create a variable for each measurement and set each variable equal to its value in days (an integer).
Here is the problem: the main column of my data set is the actual measurements of the babies in mm (known as BPD), but the BPD can be an integer like 5 mm or a decimal like 6.4 mm. Since I can't name a variable with a period or a dot in it, what would be the best way to handle my data and assign variables to it? I have tried BPD_4.8= 77days, but Python tells me there's a syntax error (I'm sure lol), but if I type BPD_5= 78 it seems to work. I haven't mastered lists and tuples, nor do I really know how to use them properly, so I'll keep looking online and see what happens.
I'm sure it's something super silly for you guys, but I'm really pulling my hair out and I have nothing but 2 inches of hair lol
This is what my current screen looks like..HELP :(
Howdy and welcome to StackOverflow. The short answer is:
Use a better data structure
You really shouldn't be encoding valuable information into variable names like that. What's going to happen if you want to calculate something with your BPD measurements? Or when you have duplicate BPD's?
This is bad practise. It might seem like a lot of effort to take the time to figure out how to do this properly - but it will be more than worth it if you intend to continue to use Python :)
I'll give you a couple options...
Option 1: Use a dictionary
Dictionaries are common data structures in any language.. so it can pay to know how to use them.
Dictionaries hold information about an object using key/value pairs. For example you might have:
measurements = {
'animal_1' : {'bpd': 4.6, 'due_date_days': 55},
'animal_2' : {'bpd': 5.2, 'due_date_days': 77},
}
An advantage of dictionaries is that they are explicit, ie values have keys which explicitly identify what the information is assigned to. E.g. measurements['animal_1']['due_date_days'] would return the due date for animal 1.
A disadvantage is that it will be harder to compute information / examine relationships than you'll be used to in Excel.
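For the lookup described in the question, the dictionary could also be keyed by the BPD value itself. The sketch below uses only the two pairs quoted in the question (4.8 mm = 77 days, 5 mm = 78 days). The scan date is made up, and I'm assuming the day count means days since conception, so treat it as an illustration rather than a working calculator.

```python
from datetime import date, timedelta

# BPD in mm -> days, using the two pairs quoted in the question.
days_by_bpd_mm = {
    4.8: 77,
    5.0: 78,
}

def days_for_bpd(bpd_cm):
    """Convert a BPD reading from cm to mm and look up the matching days."""
    bpd_mm = round(bpd_cm * 10, 1)   # round so 0.48 cm becomes exactly 4.8
    return days_by_bpd_mm[bpd_mm]

scan_date = date(2024, 3, 1)         # made-up ultrasound date
days = days_for_bpd(0.48)
# Assumption: "days" counts from conception, so subtract it from the scan date.
conception = scan_date - timedelta(days=days)
print(days, conception)
```

A decimal point is no problem in a dictionary key, which is exactly what a variable name like BPD_4.8 cannot offer.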
Option 2: Use Pandas
Pandas is a data science library for Python. It's fast, has similar functionality to Excel and is probably well suited to your use case.
I'd recommend you take the time to do a tutorial or two. If you're planning to use Python for data analysis then it's worth using the language and any suitable libraries properly.
You can check out some Pandas tutorials here: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
Good luck!
You can use Python to automate things in SPSS or to shorten the work, but I need to know whether it is possible to replace SPSS syntax with Python, for example to aggregate data in loops, etc.
Or another example: I have 2 datasets with the following variables: id, begin, end and type. Is it possible to put them into different arrays/lists and then compare the arrays/lists, so that at the end I have a new table/dataset with the non-matching entries and a dataset with the matching entries in SPSS?
My idea is to extend the context of matching files in SPSS.
Normally programming languages like Python or PHP can handle this.
Excuse me. I hope someone will understand what I mean.
There are many ways to do this sort of thing with Python. The SPSS module Dataset class allows you to read and write the case data. The spssdata module provides a somewhat simpler way to do this. These are included when you install the Python Essentials. There are also utility modules available from the SPSS Community website. In particular, the extended Transforms module provides a standard lookup function and an interval-based lookup.
I'm not sure, though, that the standard MATCH FILES won't do what you need here. Mismatches will generate missing data in the variables, and you can select subsets based on that criterion.
This question explains several ways to import an SPSS dataset in Python code: Importing SPSS dataset into Python
Afterwards, you can use the standard Python tools to analyze them.
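As a rough illustration of the comparison itself, assume the two datasets have already been read into Python as lists of (id, begin, end, type) tuples, by whichever import route you pick. The values below are invented, and "matching" here means all four fields are identical.

```python
# Each record is (id, begin, end, type); the values are invented.
dataset_a = [
    (1, "2020-01-01", "2020-01-31", "A"),
    (2, "2020-02-01", "2020-02-15", "B"),
]
dataset_b = [
    (1, "2020-01-01", "2020-01-31", "A"),
    (3, "2020-03-01", "2020-03-10", "C"),
]

set_a, set_b = set(dataset_a), set(dataset_b)
matching = sorted(set_a & set_b)        # records present in both datasets
non_matching = sorted(set_a ^ set_b)    # records present in only one of them

print(matching)
print(non_matching)
```

Set operations ignore duplicates and row order, so this only fits if neither matters for your data.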
Note: I've had some success with simply formatting the data in a text file. I can then use any diff tool to compare the files.
The advantage of this approach is that it's usually very easy to write text exporters which sort the data to make it easier for the diff tool to see what is similar.
The drawback is that text only works for simple cases. When your data has a recursive structure, then text is not ideal. In that case, try an XML diff tool.
I have a speed/efficiency related question about python:
I need to write a large number of very large R dataframe-ish files, about 0.5-2 GB sizes. This is basically a large tab-separated table, where each line can contain floats, integers and strings.
Normally, I would just put all my data in a numpy array and use np.savetxt to save it, but since there are different data types it can't really be put into one array.
Therefore I have resorted to simply assembling the lines as strings manually, but this is a tad slow. So far I'm doing:
1) Assemble each line as a string
2) Concatenate all lines as single huge string
3) Write string to file
I have several problems with this:
1) The large number of string-concatenations ends up taking a lot of time
2) I run out of RAM keeping the strings in memory
3) ...which in turn leads to more separate file.write commands, which are very slow as well.
So my question is: What is a good routine for this kind of problem? One that balances out speed vs memory-consumption for most efficient string-concatenation and writing to disk.
... or maybe this strategy is simply just bad and I should do something completely different?
Thanks in advance!
Seems like Pandas might be a good tool for this problem. It's pretty easy to get started with pandas, and it deals well with most ways you might need to get data into python. Pandas deals well with mixed data (floats, ints, strings), and usually can detect the types on its own.
Once you have an (R-like) data frame in pandas, it's pretty straightforward to output the frame to csv.
DataFrame.to_csv(path_or_buf, sep='\t')
There's a bunch of other configuration things you can do to make your tab separated file just right.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
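A small sketch of what that looks like with mixed column types. The frame contents are made up, and a StringIO buffer stands in for the real output path so the result can be printed.

```python
import io
import pandas as pd

# Small made-up frame with a float, an integer and a string column.
frame = pd.DataFrame({
    "score": [0.5, 1.25],
    "count": [3, 7],
    "label": ["a", "b"],
})

buffer = io.StringIO()                        # stands in for a file path
frame.to_csv(buffer, sep="\t", index=False)   # index=False drops the row labels
print(buffer.getvalue())
```

For files in the 0.5-2 GB range you would pass a real path instead of the buffer, and whether the whole frame fits in memory is something to check for your data.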
Unless you are running into a performance issue, you can probably write to the file line by line. Python internally uses buffering and will likely give you a nice compromise between performance and memory efficiency.
Python buffering is different from OS buffering and you can specify how you want things buffered by setting the buffering argument to open.
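Writing line by line with an explicit buffer size could be sketched like this. The rows, the temporary path and the 1 MiB buffer are arbitrary choices for the example, not tuned values.

```python
import os
import tempfile

rows = [(1, 2.5, "abc"), (2, 3.75, "def")]   # made-up mixed-type rows

path = os.path.join(tempfile.mkdtemp(), "table.tsv")
# buffering is a size in bytes here; 1 MiB is an arbitrary choice.
with open(path, "w", buffering=1024 * 1024) as out:
    for row in rows:
        # Build and write one line at a time instead of one huge string.
        out.write("\t".join(str(value) for value in row) + "\n")

with open(path) as check:
    print(check.read())
```

Only one line is held as a string at any time, so memory use stays flat no matter how many rows are written.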
I think what you might want to do is create a memory mapped file. Take a look at the following documentation to see how you can do this with numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data are stored in a text file. Loading into Excel is not possible
2) I need to sum a certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally Scipy and Rpy can handle the large files even when the files are so large that they cannot fit into memory.
Neither Rpy nor Scipy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
Python's file I/O doesn't have bad performance, so you can just use the built-in file type directly. You can see what methods are available on it by typing help(file) in the interactive interpreter. Opening a file is part of the core language functionality and doesn't require you to import anything.
Something like:
f = open(r"C:\BigScaryFinancialData.txt", "r")
for line in f:  # reads one line at a time instead of loading the whole file
    # line is a string
    # do whatever you want to do on a per-line basis here, for example:
    print(len(line))
f.close()
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
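For the running sums, something along these lines might do. It assumes a comma-separated file with a header line; the field names price and quantity are invented, and a StringIO object stands in for the real file.

```python
import csv
import io

# Stand-in for the big file: a header line and two numeric fields per row.
source = io.StringIO("id,price,quantity\na,10.5,2\nb,4.0,3\n")

totals = {"price": 0.0, "quantity": 0.0}
for row in csv.DictReader(source):       # reads one row at a time
    for field in totals:
        totals[field] += float(row[field])

print(totals)
```

Because only the current row and the totals are kept, the file size does not matter for memory.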
As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. For example:
import numpy as np
with open("data.csv", "r") as f:
    title = f.readline()  # skip the header line, if your data has one
    data = np.loadtxt(f, delimiter=",")  # values separated by ","
print(np.sum(data, axis=0))  # sum along axis 0 to get the sum of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
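Gathering the best rows on the fly could be sketched as below. Note that this swaps the dictionary idea for a small min-heap from the standard heapq module, and it ranks rows by the sum of their values, which is only my guess at the metric. The function name top_rows and the sample rows are made up.

```python
import heapq

def top_rows(rows, n=10):
    """Keep the n rows with the largest total while scanning rows once."""
    best = []                                  # min-heap of (total, index, row)
    for index, row in enumerate(rows):
        entry = (sum(row), index, row)
        if len(best) < n:
            heapq.heappush(best, entry)
        elif entry > best[0]:
            heapq.heapreplace(best, entry)     # drop the current smallest
    return [row for _, _, row in sorted(best, reverse=True)]

rows = [[1, 2], [10, 5], [3, 3], [7, 1]]       # made-up numeric rows
print(top_rows(rows, n=2))
```

At most n rows are held at any time, so rows can be any iterable, including a generator that parses the file line by line.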
Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)