Opening NSFG data (from ThinkStats book) with Pandas - python

I'm going through the book ThinkStats. http://greenteapress.com/thinkstats/nsfg_data.html
I'd prefer to work with pandas because I'd like to strengthen my skill in that, but I'm having a hard time making out how to open this file.
The usual pd.read_csv(filename) does not seem to work.
I'm also reading the code provided with the book, but it's a bit difficult to make out for me.

The pandas read_csv function will not work on this data set without some thought about the data set itself. It is neither a comma-separated nor a space-separated format.
Instead, it is a kind of home-made format in which the number of fields per line is not constant. The number of spaces between values is not constant either, which is another issue.
To better understand the format of the data files, I would recommend getting the code from the author (the link is provided in the book, but it is also here: http://greenteapress.com/thinkstats/) and playing with it to figure out the format being used.
Provided you have the data file, you can use the survey module:
import survey
preg = survey.Pregnancies()
preg.ReadRecords(".")
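If, after reading the author's code, the fields turn out to sit at fixed character positions on each line, pandas may be able to read the file directly with read_fwf. The following is only a sketch under that assumption: the sample lines, column names, and character positions are made up for illustration and are not the real NSFG layout, which would have to be taken from the field definitions in the book's code.

```python
import io

import pandas as pd

# Hypothetical fixed-width sample; NOT the real NSFG layout.
sample = (
    "    1 39 8\n"
    "    2 41 7\n"
)

# Half-open (start, end) character positions for each field (assumed values).
colspecs = [(0, 5), (5, 8), (8, 10)]
names = ["caseid", "prglength", "birthwgt_lb"]

# header=None tells pandas that the first line is data, not column names.
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=names, header=None)
print(df)
```

With the real file you would pass its path instead of the StringIO object and replace colspecs and names with the positions found in the author's code.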

Related

How can I format google sheets so I can export my data properly?

I plan to make an educational web game. I have thousands of trivia questions I need to write down in a way that can be easily transferred out and automatically organized based on their column, at a later date.
I was advised to use Google Sheets so I can later export a .csv, which should be easy for a developer to work with. When I exported a .csv and opened it in pandas (Python), a column was cut off and one column was used as a 'header' instead of a normal entry: https://imgur.com/a/olcpVO8. This obviously won't work and seems to be an issue.
Should I just leave the first row and column empty and work around the issue? I don't want to write thousands of sets only to find out I did this the wrong way. Can anyone give any insight into whether this is my best option and how I should best format it?
I have to write Questions(1), Answers(4), Explanations(1) per entry
I hope this makes sense, thanks for your time.
I tried doing this and have no issue at all using the exported CSV from Google Sheets, using the same data as in your example.
In my opinion, whatever software you're using in your second screenshot is the issue. It seems to be removing the values from the first row because it treats that row as your header row. Check your software for options like "First column contains headers" or "Use row 1 as Header" and make sure they aren't enabled.
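If the file is being read with pandas, the same effect can be reproduced and avoided as sketched below. The trivia rows and column names are invented for illustration. By default read_csv treats the first row as column names; passing header=None (here together with explicit names) keeps that row as a normal entry.

```python
import io

import pandas as pd

# Two made-up trivia rows: question, four answers, explanation.
csv_text = (
    "What is 2+2?,3,4,5,6,Basic addition\n"
    "Capital of France?,Paris,Rome,Oslo,Bern,Geography\n"
)

# Default behaviour: the first row is consumed as the header.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names supplies the column labels.
names = ["question", "answer_1", "answer_2", "answer_3", "answer_4", "explanation"]
no_header = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

print(len(with_header), len(no_header))
```

An alternative is to put real column titles in row 1 of the sheet, so that the default header handling does what you want.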

CSV Standard - Multiple Tables

I am working on a Python project that does some analysis on CSV files. I know there is no well-defined standard for CSV files, but as far as I understand the definition (https://www.rfc-editor.org/rfc/rfc4180#page-2), a CSV file should not contain more than one table. Is this thinking correct, or did I misunderstand the definition?
How often do you see more than one table in CSV files?
You are correct. There is no universally accepted standard. The definition is written to suggest that each file contains one table, and this is by far the most common practice.
There's technically nothing stopping you from having more than one table, using a format you decide on and implement and keep consistent. For instance, you could parse the file yourself and use a line with 5 hyphens to designate a separate table.
However I wouldn't recommend this. It goes against the common practice, and you will eliminate the possibility of using existing CSV libraries to help you.
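For completeness, here is a minimal sketch of what such a home-grown convention could look like, assuming a line of exactly five hyphens separates tables. The function name split_tables and the sample text are made up for illustration. Note that the splitting step works line by line, so it would break on quoted fields that contain embedded newlines, which is one of the costs of leaving the common practice.

```python
import csv


def split_tables(text, separator="-----"):
    """Split text into tables at separator lines and parse each block as CSV."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == separator:
            if current:
                blocks.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        blocks.append(current)
    # csv.reader accepts any iterable of lines, so each block is parsed normally.
    return [list(csv.reader(block)) for block in blocks]


text = "id,name\n1,Ann\n-----\ncity,pop\nOslo,700000\n"
tables = split_tables(text)
print(tables)
```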

Export data from Python into Tableau using JSON?

How do I get 4 million rows and 28 columns from Python to Tableau in a table form?
I assume (based on searching) that I should use a JSON format. This format can handle a lot of data and is fast enough.
I have made a subset of 12 rows of the data and tried to get it working. The good news is: it's working. The bad news: not the way I want to.
My issue is that when I import it in Tableau it doesn't look like a table. I have tried the variants which are displayed here.
This is the statement in Python (pandas):
jsonfile = pbg.to_json("//vsv1f40/Pricing_Management$/Z-DataScience/01_Requests/Marketing/Campaign_Dashboard/Bronbestanden/pbg.json",orient='values')
Maybe I select too many schemas in Tableau (I select them all), but I think my problem is in Python. Do I need to use another library instead of Pandas? Or do I need to change the variables?
Other ways are also welcome. I have no preference for JSON, but I thought that was the best way, based on the search results.
Note: I am new to python and tableau :) I use python 3.5.2 and work in Jupyter. From Tableau I only have the free trial desktop version.
JSON is good for certain types of data, but if your DataFrame is purely tabular (no MultiIndexes, complex objects, etc.) and contains simple data types (strings, integers, floats), then a comma-separated values (CSV) text file is probably the best format to use, as it would take up the least space. A DataFrame can easily be saved as a CSV using the to_csv() method, and there are a number of customization options available. I'm not terribly familiar with Tableau, but according to their website CSV files are a supported input format.
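A minimal sketch of the to_csv() route, using a small made-up DataFrame in place of pbg. Here index=False leaves out the row index so the file contains only the data columns; with no path argument to_csv() returns the text, while passing a path writes the file instead.

```python
import io

import pandas as pd

# Small stand-in for the real DataFrame (hypothetical columns).
df = pd.DataFrame({"campaign": ["A", "B"], "revenue": [10.5, 20.0]})

# No path given: the CSV text is returned as a string.
csv_text = df.to_csv(index=False)

# Reading it back shows the result is a plain table with a header row.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```

For the real data, something like pbg.to_csv(path, index=False) would write the file for Tableau to open as a text file data source.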

How do I tell python what my data structure (that is in binary) looks like so I can plot it?

I have a data set that looks like this.
b'\xa3\x95\x80\x80YFMT\x00BBnNZ\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Type,Length,Name,Format,Columns\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa3\x95\x80\x81\x17PARMNf\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Name,Value\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa3\x95\x80\x82-GPS\x00BIHBcLLeeEefI\x00\x00\x00Status,TimeMS,Week,NSats,HDop,Lat,Lng,RelAlt,Alt,Spd,GCrs,VZ,T\x00\x00\xa3\x95\x80\x83\x1fIMU\x00Iffffff\x00\x00\x00\x00\x00\x00\x00\x00\x00TimeMS,GyrX,GyrY,G
I have been reading around to find out how to write Python code that parses this data so that I can plot some of the columns against each other (mostly against time).
Some things I found that may help in doing this:
There is a script that converts this data into a CSV file. I know how to use it to produce a CSV file and plot from there, but as a learning experience I want to be able to do this without converting to CSV. I tried reading that code, but I am clueless since I am very new to Python. Here is the link to the code:
https://github.com/PX4/Firmware/blob/master/Tools/sdlog2/sdlog2_dump.py
Also, someone posted this, saying it might be the log format, but again I couldn't understand or run any code on that page.
http://dev.px4.io/advanced-ulog-file-format.html
A good starting point for parsing binary data is the struct module https://docs.python.org/3/library/struct.html and its unpack function. That's what the CSV dump routine you linked to is doing as well. If you walk through the process method, it's doing the following:
Read a chunk of binary data
Figure out if it has a valid header
Check the message type - if it's a FORMAT message, parse that. If it's a description message, parse that.
Dump out a CSV row
You could modify this code to essentially replace the __printCSVRow method with something that captures the data into a pandas dataframe (or other handy data structure) so that when the main routine is all done you can grab all the data from the dataframe and plot it.
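As a rough illustration of the struct approach, the sketch below reads only the FMT (format definition) records. The layout is an assumption inferred from the sample bytes in the question: a two-byte marker \xa3\x95, a one-byte message type (0x80 for FMT), then the fields Type, Length, Name, Format, Columns as 1 + 1 + 4 + 16 + 64 bytes, with Length counting the whole message including the 3-byte header. The function name read_formats and the sample bytes are made up for the example; check the details against sdlog2_dump.py before relying on them.

```python
import struct

HEADER = b"\xa3\x95"  # two-byte marker seen at the start of each message
FMT_TYPE = 0x80       # message type of a FMT record (assumed from the sample)
# Type, Length, Name, Format, Columns; "<" disables alignment padding.
FMT_STRUCT = struct.Struct("<BB4s16s64s")


def read_formats(buf):
    """Walk the messages in buf and collect FMT records keyed by message type."""
    formats = {}
    pos = 0
    while pos + 3 <= len(buf) and buf[pos:pos + 2] == HEADER:
        msg_type = buf[pos + 2]
        if msg_type == FMT_TYPE:
            if pos + 3 + FMT_STRUCT.size > len(buf):
                break  # truncated record
            mtype, length, name, fmt, cols = FMT_STRUCT.unpack_from(buf, pos + 3)
            formats[mtype] = {
                "length": length,
                "name": name.rstrip(b"\0").decode("ascii"),
                "format": fmt.rstrip(b"\0").decode("ascii"),
                "columns": cols.rstrip(b"\0").decode("ascii").split(","),
            }
            pos += 3 + FMT_STRUCT.size
        elif msg_type in formats:
            pos += formats[msg_type]["length"]  # skip a data message
        else:
            break  # unknown message type: stop rather than guess
    return formats


# Two hand-built FMT records mirroring the start of the sample in the question.
sample = (
    HEADER + bytes([FMT_TYPE])
    + FMT_STRUCT.pack(0x80, 89, b"FMT", b"BBnNZ", b"Type,Length,Name,Format,Columns")
    + HEADER + bytes([FMT_TYPE])
    + FMT_STRUCT.pack(0x81, 23, b"PARM", b"Nf", b"Name,Value")
)
formats = read_formats(sample)
print(formats[0x81])
```

Decoding the data messages would follow the same pattern: translate each record's format string into a struct format, unpack the payload, and append the values to per-message lists that can later be turned into a DataFrame for plotting.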

openpyxl and stdev.p name error

I have a script to format a bunch of data and then push it into excel, where I can easily scrub the broken data, and do a bit more analysis.
As part of this I'm pushing quite a lot of data to excel, and want excel to do some of the legwork, so I'm putting a certain number of formulae into the sheet.
Most of these ("=AVERAGE(...)", "=A1+3", etc.) work absolutely fine, but when I add the standard deviation ("=STDEV.P(...)") I get a name error when I open the file in Excel 2013.
If I click in the cell within Excel and hit Enter (i.e. don't change anything within the cell), the cell recalculates without the name error, so I'm a bit confused.
Is there anything extra that needs to be done to get this to work?
Has anyone else had any experience of this?
Thanks,
Will
--
I've investigated further and this is the issue:
When saving the formula "STDEV.P" openpyxl saves it as:
"=_xludf.STDEV.P(...)"
which is correct for many formulae, but not this one.
The result should be:
"=_xlfn.STDEV.P(...)"
When I explicitly change the function to the latter, it works as expected.
I'll file a bug report, so hopefully this is done automatically in the future.
I suspect that there might be a subtle difference in what you think you need to write as the formula and what is actually required. openpyxl itself does nothing with the formula, not even check it. You can investigate this by comparing two files (one from openpyxl, one from Excel) with ostensibly the same formula. The difference might be simple – using "." for decimals and "," as a separator between values even if English isn't the language – or it could be that an additional feature is required: Microsoft has continued to extend the specification over the years.
Once you have some pointers please submit a bug report on the openpyxl issue tracker.
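In the meantime, the workaround described in the question's follow-up can be applied by writing the prefixed name yourself. This is a minimal sketch: the cell layout and values are made up, and the workbook is saved to an in-memory buffer only to keep the example self-contained.

```python
import io

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Some made-up values in A1:A3.
for row, value in enumerate([2, 4, 6], start=1):
    ws.cell(row=row, column=1, value=value)

# A formula from the original specification needs no prefix.
ws["B1"] = "=AVERAGE(A1:A3)"
# STDEV.P is written with the _xlfn. prefix, as found in the follow-up above.
ws["B2"] = "=_xlfn.STDEV.P(A1:A3)"

# Save to memory here; pass a filename such as "stats.xlsx" to write to disk.
buffer = io.BytesIO()
wb.save(buffer)
```

openpyxl stores the formula string as given, so the prefix ends up in the file exactly as written.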
