Smallest learning curve language to work with CSV files - python

VBA is not cutting it for me anymore. I have lots of huge Excel files on which I need to perform lots of calculations and break them down into other Excel/CSV files.
I need a language that I can pick up within the next couple of days to do what I need, because it is kind of an emergency. Python has been suggested to me, but I would like to check with you whether there is anything else that handles CSV files quickly and easily.

Python is an excellent choice. The csv module makes reading and writing CSV files easy (even Microsoft's, uh, "idiosyncratic" version) and Python syntax is a breeze to pick up.
I'd actually recommend against Perl, if you're coming to it fresh. While Perl is certainly powerful and fast, it's often cryptic to the point of incomprehensible to the uninitiated.
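For instance, a minimal read-calculate-write loop with the csv module might look like the sketch below (Python 3 syntax; the file names and the "amount" column are made up for illustration):
# A minimal sketch, assuming an input file with a header row and a
# numeric "amount" column; adjust names and the calculation to your data.
import csv

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["amount_with_tax"])
    writer.writeheader()
    for row in reader:
        row["amount_with_tax"] = float(row["amount"]) * 1.2
        writer.writerow(row)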

What kind of calculations do you have to do? Maybe R would be an alternative?
EDIT: just to give a few basic examples:
# Basic usage
data <- read.csv("myfile.csv")
# Pipe-separated values
data <- read.csv("myfile.csv", sep="|")
# File with header (columns will be named as header)
data <- read.csv("myfile.csv", header=TRUE)
# Skip the first 5 lines of the file
data <- read.csv("myfile.csv", skip=5)
# Read only 100 lines
data <- read.csv("myfile.csv", nrows=100)

There are many tools for the job, but yes, Python is perhaps the best these days. There is a special module for dealing with csv files. Check the official docs.

Python definitely has a small learning curve, and it works well with CSV files.

You say you have "excel files to which i need to make lots of calculations and break them down into other excel/csv files" but all the answers so far talk about CSV only ...
Python has a csv read/write module, as others have mentioned. There are also the third-party modules xlrd (reads) and xlwt (writes) for XLS files. See the tutorial on this site.
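As a rough illustration of the xlrd/xlwt combination (file names, sheet names and the "double column B" calculation are placeholders, not part of any real workflow):
# A minimal sketch: read an XLS with xlrd, write the result with xlwt.
import xlrd
import xlwt

book = xlrd.open_workbook("source.xls")
sheet = book.sheet_by_index(0)

out_book = xlwt.Workbook()
out_sheet = out_book.add_sheet("Results")

for r in range(sheet.nrows):
    value = sheet.cell_value(r, 1)  # column B of the source
    out_sheet.write(r, 0, sheet.cell_value(r, 0))
    out_sheet.write(r, 1, value * 2 if isinstance(value, float) else value)

out_book.save("results.xls")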

You know VBA? Why not Visual Basic 2008/2010, or perhaps C#? I'm sure languages like Python and Ruby would make the job relatively easier, but you're already accustomed to the ".NET way" of doing things, so it makes sense to keep working with them instead of learning a whole new thing just for this job.
Using C#:
var csvlines = File.ReadAllLines("file.csv");
var query = from csvline in csvlines
            let data = csvline.Split(',')
            select new
            {
                ID = data[0],
                FirstName = data[1],
                LastName = data[2],
                Email = data[3]
            };
.NET: Linq to CSV library.
.NET: Read CSV with LINQ
Python: Read CSV file

Perl is surprisingly efficient at processing text for a scripting language. cpan.org has a tremendous number of modules for dealing with CSV data. I've also both read and written data in XLS format with another Perl module. If you were able to use VBA, you can certainly learn Perl (the basics of Perl are easy, though it's just as easy for you or others to write terse yet cryptic code).

That depends on what you want to do with the files.
Python's learning curve is less steep than R's. However, R has a bunch of built-in functions that make it very well suited for manipulating .csv files easily, particularly for statistical purposes.
Edit: I'd recommend R over Python for this purpose alone, if only because the basic operations (reading files, dropping rows, dropping columns, etc.) are slightly faster to write in R than in Python.

I'd give awk a try. If you're running Windows, you can get awk via the Cygwin utilities.

This may not be anybody's popular language du jour, but since CSV files are line-oriented and split into fields, dealing with them is just about the perfect application for awk. It was built for processing line-oriented text data that can be split into fields.
Most of the other languages folks are going to recommend will be much more general-purpose, so there's going to be a lot more in them that isn't necessarily applicable to processing line-oriented text data.

PowerShell has CSV import built in.
The syntax is ugly as death, but it's designed to be useful for administrators more than for programmers -- so who knows, you might like it.
It's supposed to be a quick get-up-and-go language, for better and worse.

I'm surprised nobody's suggested Power Query; it's perfect for consolidating and importing files into Excel, does column calculations nicely, and has a good graphical editor built in. It works for CSVs and Excel files but also SQL databases and most other things you'd expect. I managed to get some basic cleaning and formatting up and running in a day, and it took maybe a few days to start writing my own functions (breaking free from the GUI).
And since it only really does database-style work, there are barely any functions to learn (the actual language is called "M").

PHP has a couple of csv functions that are easy to use:
http://www.php.net/manual-lookup.php?pattern=csv&lang=en

Related

Save Data in C++, Load from Python - Recommended Data Formats

I have a ROS/C++ simulator that saves large amounts of data to a rosbag (around 90 MB). I want to read this data frequently from Python, and since reading rosbags is slow and cumbersome, I currently have another Python script that reads the rosbag and saves the relevant contents to an HDF5 file.
It would be nice though to be able to just save the data from the simulator directly (in C++) and then read it from my scripts (in Python). So I was wondering which data format I should use.
It should be:
Fast to load from Python
Be compact (so ideally a binary of some sort)
Be easy to use
You might be wondering why I don't just save to HDF5 from my C++ simulator, but it just doesn't seem to be easy. There is basically nothing on forums such as Stack Overflow, and the HDF5 Group website is opaque, seems to have some complicated licensing, and has very poor examples. I just want something quick and dirty that I can get running this afternoon.
You may want to have a look at HDFql as it is a high-level language (similar to SQL) to manage HDF5 files. Amongst others, HDFql supports C++ and Python. There are some examples that illustrate how to use HDFql in these languages here.
I see two solutions that can be useful for your problem:
LV (length-value): records that you can store directly in binary form in a file (see the sketch after this list).
JSON: this does not add much beyond the data you need, and there are many libraries in Python and C++ that can simplify the work for you.
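For the length-value route, the Python reading side could look roughly like this, assuming the C++ simulator writes each record as a little-endian uint32 length followed by that many bytes of payload (the framing is an assumed convention, not a standard):
# A minimal sketch of reading length-prefixed ("LV") records; the C++
# writer must use the same little-endian uint32 framing.
import struct

def read_records(path):
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # end of file
            (length,) = struct.unpack("<I", header)
            records.append(f.read(length))  # raw payload bytes
    return records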
Protocol Buffers is an option with language bindings in C++ and Python, though it might be more of a time investment than something quick and dirty you can get running this afternoon.

Automating excel reporting and graphs - Python xlsxWriter/xlswings or Ruby axlsx/win32ole

I want to create a program, which automates excel reporting including various graphs in colours. The program needs to be able to read an excel dataset. Based on this dataset, the program then has to create report pages and graphs and then export to an excel file as well as pdf file.
I have done some research and it seems this is possible using Python with pandas plus XlsxWriter or xlwings, as well as the Ruby gems axlsx or win32ole.
Which is the more user-friendly and easier-to-learn alternative? What are the advantages and disadvantages? Are there other options I should consider (I would like to avoid VBA, as this is how the reports are currently produced)?
Any responses and comments are appreciated. Thank you!
If you already have VBA that works for your project, then translating it to Ruby + WIN32OLE is probably your quickest path to working code. Anything you can do in VBA is doable in Ruby (if you find something you can't do, post here to ask for help).
I prefer working with Excel via OLE since I know the file produced by Excel will work anywhere I open it. I haven't used axlsx but I'm sure it's a fine project; I just wouldn't trust that it would produce working Excel files every time.
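On the Python side, a rough sketch of a pandas + XlsxWriter report with a chart could look like this (the DataFrame, sheet name and cell ranges are placeholders):
# A minimal sketch; adapt the data, ranges and chart type to the real report.
import pandas as pd

df = pd.DataFrame({"Region": ["North", "South"], "Sales": [120, 95]})

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Report", index=False)
    workbook = writer.book
    worksheet = writer.sheets["Report"]
    chart = workbook.add_chart({"type": "column"})
    chart.add_series({
        "categories": ["Report", 1, 0, len(df), 0],  # Region labels
        "values": ["Report", 1, 1, len(df), 1],      # Sales figures
    })
    worksheet.insert_chart("D2", chart)
Note that XlsxWriter only writes .xlsx, so the PDF export would still have to go through Excel itself (e.g. via xlwings or win32ole).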

Efficiency: openpyxl or VBA?

I'm trying to figure out which one is generally faster for a similar task: using VBA or openpyxl.
I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 columns wide and 150,000 rows tall and I want to copy it from workbook A to workbook B.
Any thoughts on whether python will do better or if Excel is better in dealing with itself?
My guts tell me that python should be fairly faster for some reasons:
In order for a VBA sub to copy from one workbook to another, both have to be open and running, whereas with Python I can simply load both;
VBA has to deal with a lot of clutter for most tasks, and it takes A LOT of system resources.
Besides that, I'd like to know if I can make some further improvements to an openpyxl script, like multithreading or perhaps using NumPy along with it.
Thanks for the help!
TBH the fastest approach would probably be remote controlling Excel using xlwings, because this can take advantage of Excel's optimisation. VBA might be able to hook into that as well but I've never found VBA to be fast.
Python will have to convert from XML to Python and back to XML. You've got around 7,500,000 cells, so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this to keep memory use low.
If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.
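For reference, combining the two modes for a plain copy might look roughly like this (file and sheet names are placeholders):
# A minimal sketch of a low-memory copy: stream rows out of a read-only
# workbook and append them into a write-only one.
from openpyxl import load_workbook, Workbook

src = load_workbook("workbook_a.xlsx", read_only=True, data_only=True)
ws_in = src.active

dst = Workbook(write_only=True)
ws_out = dst.create_sheet("Copy")

for row in ws_in.iter_rows(values_only=True):
    ws_out.append(row)

dst.save("workbook_b.xlsx")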
TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.
Long version
If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.
Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not be only reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.
It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.
Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL -- joins, sorting, grouping, aggregates.
See here for an example using ADO and VBA.
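In Python, a direct-connection sketch with pyodbc could look like this; the driver name in the connection string depends on which Excel ODBC driver (Access Database Engine) is installed, and the sheet and column names are assumptions:
# A minimal sketch; driver, sheet and column names depend on the local setup.
import pyodbc

conn_str = (
    r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\data\workbook_a.xlsx;"
)
with pyodbc.connect(conn_str, autocommit=True) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM [Sheet1$] WHERE Amount > 100")
    for row in cursor.fetchall():
        print(row)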
With openpyxl ...
This link was really helpful for me:
https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/
1. Use read_only when opening the file if all you're doing is reading.
2. Use the built-in iterators! I cannot stress this enough - the iterators are fast, crazy fast.
3. Call functions as infrequently as possible and store intermediate data in variables. It may bulk the code up a bit, but it tends to be more efficient and also allows your code to be more readable (but this is icing on the cake compared to points 1 and 2). Python can also be ambiguous as to what is a variable and what is a function, but as a general rule intermediate variables are good for multiple function calls.
I was doing some reading of values in a particular workbook, and I did this initially:
from openpyxl import load_workbook
wb = load_workbook(filename)
And that would take nearly 80 seconds. Caching the workbook between actions with it was helpful but still painful every time I reloaded my script.
I switched to reading only.
wb = load_workbook(filename, data_only=True, read_only=True)
Now it only takes < 0.1 seconds.

Strings, CSV/Excel, eventually DB, but statistics needed - which tool(s) (Ch, Python, Ceemple?)

I am currently working on aligning text data, mostly hidden in CSV or Excel files from multiple sources. I've done this easily enough with Python (even on a Raspberry Pi) and OpenOffice. The issues are:
transforming disparate names to unique names (easy)
storing the data in CSV or Excel files (because my collabs use Excel)
Eventually building a real DB (SQL-based: MariaDB, Postgres) from the Excel files
Doing statistics on the data; mostly enumeration from different CSV files and comparison between samples - nice to generate graphs
for debugging purposes it would be nice to quickly generate bar charts and such of groups of the data
Nothing super fancy, except it gets slow in Python (no doubt generously helped by my "I am not a programmer" 'code'). The data sets will get 'large' (tens of thousands of lines of data times multiple dozens of data sets). I would like a programming tool which facilitates this.
I looked into Ch (& cling, cint) because I still remember a bit of C, interpreted, but Ch seems to offer a good set of libs. Python is OK for much of it, but I dislike the syntax. I try to work on Linux as much as I can, but eventually I have to hand it off to Windows users in a country not known for having fast computers. I was looking at Ceemple (ceemple.com) and was wondering if anyone has used it for a project and what their experience has been. Does it help with cross-platform issues (e.g., line termination)? Should I just forget Linux (with that wonderful shell, easy Python, and text editors which can load large files w/o bogging down) and move to Windows? If so, then compiled is just about the only way to go for me, likely precluding Ch and probably Python.
Please keep in mind that this is my 'side job' - I'm not a professional programmer. Low learning curve and least amount of tools required is important.

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can be opened in Excel and contains the information from the feature class's table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.) all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use Python to read such a file?"
The answer is that this link contains a description of the file format. Write a dissector based on the technical specification.
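As a starting point, the fixed 100-byte file header can be unpacked with nothing but struct; the offsets below follow the published ESRI shapefile technical description, and this is only a sketch, not a full parser (record parsing for the actual polyline vertices would follow the same pattern):
# A minimal sketch that reads only the 100-byte .shp header.
import struct

with open("lines.shp", "rb") as f:
    header = f.read(100)

file_code = struct.unpack(">i", header[0:4])[0]      # big-endian, always 9994
file_length = struct.unpack(">i", header[24:28])[0]  # big-endian, in 16-bit words
version = struct.unpack("<i", header[28:32])[0]      # little-endian, always 1000
shape_type = struct.unpack("<i", header[32:36])[0]   # 3 = polyline
xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])

print(file_code, version, shape_type, (xmin, ymin, xmax, ymax))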
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a Python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
Might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program (if one even exists) that can read that type of file. In my experience, things get weird when you start digging around in .dlls.
