Reading survey data CSV with multiple selection sub-columns? - python

I would like to import the data from the Navigraph survey results.
https://navigraph.com/blog/survey2022
The dataset is here:
https://download.navigraph.com/docs/flightsim-community-survey-by-navigraph-2022-data.zip
However, I noticed the structure is something I'm not quite used to, and perhaps this is how a lot of polling data is shared. The semicolon separators are not an issue. The issue is that the "select multiple" responses are mixed in as sub-columns. The tidiest part is that, starting at the third row, each row is a single respondent.
How can I clean up this data so it is as "tidy" as possible? How would I melt() these columns into rows? How do I handle the multiple selection responses in the sub-columns?
I'd like the questions and responses to simply be two columns respectively.

Hello, how are you? I don't have full knowledge of this type of work, but I believe you will have to:
1- Read the file as is
2- Concatenate the columns of questions and answers
3- Create the dataset that will be used
I believe that pandas has some commands that will help you; you just need to find the patterns that define what counts as "questions" and "answers" in this dataset.
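The three steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the tiny inline CSV is a made-up stand-in for the layout described in the question (first row = question, second row = sub-option, then one respondent per row), not the real Navigraph file, and the names "question", "option", "respondent" and "response" are chosen here for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the layout described in the question:
# row 1 = question text, row 2 = sub-option (for "select multiple" questions),
# row 3 onward = one respondent per row, semicolon-separated.
raw = io.StringIO(
    "Age;Simulators used;Simulators used;Simulators used\n"
    "Response;MSFS;X-Plane;Prepar3D\n"
    "25-34;MSFS;;Prepar3D\n"
    "35-44;;X-Plane;\n"
)

# 1- Read the file as is (no header, everything as text).
rows = pd.read_csv(raw, sep=";", header=None, dtype=str)

# 2- Combine the two header rows into a two-level column index.
columns = pd.MultiIndex.from_arrays(
    [rows.iloc[0], rows.iloc[1]], names=["question", "option"]
)
data = rows.iloc[2:].reset_index(drop=True)
data.columns = columns

# 3- Melt to long form: one row per (respondent, question, option) answer.
tidy = (
    data.melt(ignore_index=False, value_name="response")
    .dropna(subset=["response"])      # unticked options are empty cells
    .rename_axis("respondent")
    .reset_index()
    .sort_values("respondent", kind="stable")
    .reset_index(drop=True)
)
print(tidy)
```

Each ticked option of a "select multiple" question becomes its own row, so a respondent who picked two simulators shows up twice for that question. If you only want two columns, you can keep just "question" and "response" from the result.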

Related

Manipulating data for network analysis

I am trying to manipulate my dataframe before I conduct network analysis using networkx.
Here is a sample of the data I have:
sample data
I am trying to use the title and cast columns and turn them into something like this:
ideal format
The ideal result is to have one column for each individual actor and the movie/show that he/she is in. If the actor has more than 1 show/movie, I want to have different rows for that actor as well.
Could someone please advise me on how to make it happen? Thank you!!
To use pandas, you first load the CSV into a DataFrame. Let's call it "f".
import pandas
f = pandas.read_csv('path/to/csv')
After that you can access individual columns by doing:
f['title']
This works like a dictionary lookup. If you want both columns in the same DataFrame, pass in a list of columns like so:
f[['title', 'cast']]
That is as much as I can provide without knowing the extent of the project.
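Going one step further toward the actor/title pairs the question asks for, here is a minimal sketch. It assumes the cast column holds a comma-separated string of names per title; the sample data is only shown as an image, so the small frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the data in the question: one row per title,
# with the cast stored as a single comma-separated string (an assumption).
f = pd.DataFrame({
    "title": ["Show A", "Movie B"],
    "cast": ["Actor One, Actor Two", "Actor Two, Actor Three"],
})

pairs = (
    f[["title", "cast"]]
    .dropna(subset=["cast"])                              # titles without a cast list
    .assign(cast=lambda d: d["cast"].str.split(","))      # string -> list of names
    .explode("cast")                                      # one row per (title, actor)
    .assign(cast=lambda d: d["cast"].str.strip())         # drop stray spaces
    .rename(columns={"cast": "actor"})
    .reset_index(drop=True)
)
print(pairs)
```

An actor who appears in more than one show or movie ends up with one row per title, which is the edge-list shape that network libraries usually expect.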

Merging two tables in Pandas, and adding in multiple Indices too

I have data that's to be analysed for a project I'm working on, mostly done using pandas at the moment as the data comes in from Excel.
I'm trying to merge some of these tables based on a column, which isn't the issue. The issue is that the tables have column names that are the same, looking kind of like below:
The columns that get reused are 10-30, 30-50 etc.
I want to put a higher-level index on the numbered columns, called something like "Percentages", "Real miles" etc., so that when I'm completing calculations later on it's easier to link up the relevant cells, and so that the result is more presentable at the end.
Right now I'm having difficulty producing this. The only examples I've seen that come close to what I want create DataFrames from tuples/dictionaries, but considering how large the final inputs will be in this project, I wouldn't know how to go about writing them in.
I'm basically looking to have it look like below:
I hope I understood your issue.
Using the merge function, you can set a suffix for the overlapping columns from each of the DataFrames, e.g.:
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
This way you can tell apart the columns coming from each of your DataFrames.
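If you would rather have the higher-level column index the question describes instead of suffixes, one possible approach is pd.concat with a dictionary, whose keys become the top column level. This is a sketch under assumptions: the two small tables, the "route" key column and the numbers are invented, since the original tables are not shown.

```python
import pandas as pd

# Hypothetical tables that share the bucket column names "10-30" and "30-50".
percentages = pd.DataFrame(
    {"route": ["A", "B"], "10-30": [0.4, 0.7], "30-50": [0.6, 0.3]}
)
real_miles = pd.DataFrame(
    {"route": ["A", "B"], "10-30": [12.0, 21.0], "30-50": [18.0, 9.0]}
)

# Align both tables on the shared key, then concatenate side by side.
# The dictionary keys become the top level of a two-level column index.
combined = pd.concat(
    {
        "Percentages": percentages.set_index("route"),
        "Real miles": real_miles.set_index("route"),
    },
    axis=1,
)
print(combined)

# A whole group, or a single cell, can then be addressed by its labels.
print(combined["Percentages"])
print(combined.loc["A", ("Real miles", "30-50")])
```

Since the frames are read from Excel anyway, nothing has to be typed in by hand: only the group names in the dictionary are written out.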

How can I create a formatted and annotated excel with embedded pandas DataFrames

I want to create a "presentation ready" excel document with embedded pandas DataFrames and additional data and formatting
A typical document will include some titles and metadata, and several DataFrames with a sum row/column for each DataFrame.
The DataFrame itself should be formatted.
The best thing I found was this, which explains how to use pandas with XlsxWriter.
The main problem is that there's no apparent method to get the exact location of the embedded DataFrame in order to add the summary row below it (the shape of the DataFrame is a good estimate, but it might not be exact when rendering complex DataFrames).
If there's a solution that relies on some kind of template rather than hard-coding, that would be even better.

How can I ensure unique rows in a large HDF5

I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates from it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than to ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
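If the whole table will not fit into memory, one option is to deduplicate chunk by chunk while remembering only the key columns already seen. The sketch below is an assumption-laden illustration, not PyTables advice: drop_duplicates_chunked is a hypothetical helper, the column names are invented, and the chunks are plain in-memory frames. With a table-format store, the chunks could instead come from something like HDFStore.select(..., chunksize=...).

```python
import numpy as np
import pandas as pd


def drop_duplicates_chunked(chunks, key_columns):
    """Yield each chunk without rows whose key was seen in an earlier chunk."""
    seen = set()  # only the key tuples are kept in memory, not the full rows
    for chunk in chunks:
        # Remove duplicates inside the chunk first.
        chunk = chunk.drop_duplicates(subset=key_columns)
        keys = list(chunk[key_columns].itertuples(index=False, name=None))
        is_new = np.array([key not in seen for key in keys], dtype=bool)
        seen.update(keys)
        yield chunk.loc[is_new]


# Hypothetical data: two columns ("symbol", "ts") define a unique record.
chunks = [
    pd.DataFrame({"symbol": ["X", "X", "Y"], "ts": [1, 1, 1], "value": [1.0, 1.0, 2.0]}),
    pd.DataFrame({"symbol": ["Y", "Z"], "ts": [1, 2], "value": [2.0, 3.0]}),
]
deduped = pd.concat(
    drop_duplicates_chunked(chunks, ["symbol", "ts"]), ignore_index=True
)
print(deduped)
```

The set of keys still has to fit in memory, but it holds only the two key columns rather than every column of every row. The first occurrence of each key is the one that is kept.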

Using Pandas to create, read, and update hdf5 file structure

We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but are by no means really proficient. The part that is tripping us up is creating an empty data structure and reading that structure from a file that you will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure, which are to read the entire frame into memory or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want to have a great deal of flexibility in moving data around and want to be able to define a target dataframe structure prior to data writing, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as @jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types is important because we want to enable a user to then select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block so...blah. I feel like I'm talking myself into maintaining a peer structure definition but I REALLY would love some suggestions to not have to do that.
