I have data to analyse for a project I'm working on, mostly using pandas at the moment, as the data comes in from Excel.
I'm trying to merge some of these tables based on a column. The merge itself isn't the issue; the issue is that the tables share column names, looking something like below:
The columns that get reused are 10-30, 30-50, etc.
I want a higher-level index over the numbered columns, called something like "Percentages", "Real miles", etc., so that it's easier to link up the relevant cells when I do calculations later on, and so that the result is more presentable at the end.
Right now I'm having difficulty producing this. The only examples I've seen that come close to what I want create dataframes from tuples or dictionaries, but given how large the final inputs will be in this project, I wouldn't know how to go about writing them out by hand.
I'm basically looking to have it look like below:
I hope I've understood your issue.
With the merge function you can set a suffix for the overlapping columns coming from each of the dataframes, e.g.:
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
This way you will differentiate between columns coming from each of your dataframes.
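To make that concrete, here is a minimal sketch of both options: the suffix approach, and a two-level column header built with pd.concat. The frames, the key column "route" and all the values are made up for illustration; only the reused column names 10-30 and 30-50 and the group labels come from the question.

```python
import pandas as pd

# Hypothetical inputs: two tables that share a key column and reuse column names.
percentages = pd.DataFrame({"route": ["A", "B"], "10-30": [0.2, 0.5], "30-50": [0.8, 0.5]})
real_miles = pd.DataFrame({"route": ["A", "B"], "10-30": [12.0, 20.0], "30-50": [40.0, 35.0]})

# Option 1: flat columns, told apart by suffixes.
flat = percentages.merge(real_miles, on="route", suffixes=("_pct", "_miles"))

# Option 2: a higher-level column index. The dict keys become the top level,
# and rows are aligned on the shared key, which is moved into the index first.
grouped = pd.concat(
    {
        "Percentages": percentages.set_index("route"),
        "Real miles": real_miles.set_index("route"),
    },
    axis=1,
)

print(flat)
print(grouped)
print(grouped["Real miles"]["10-30"])  # select one group, then one column
```

Since the top level comes from the dict keys, nothing has to be typed out per column, so this should scale to large inputs as long as the key column is unique in each table.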
Related
I would like to import the data from the Navigraph survey results.
https://navigraph.com/blog/survey2022
The dataset is here:
https://download.navigraph.com/docs/flightsim-community-survey-by-navigraph-2022-data.zip
However, I noticed the structure is something I'm not quite used to, and perhaps this is how a lot of polling data is shared. The semicolon separators are not an issue. The problem is that "select multiple" responses are spread across several columns. The tidiest part is that, starting at the third row, each row is a single respondent.
How can I clean up this data so it is as "tidy" as possible? How would I melt() these columns into rows? How do I handle the multiple selection responses in the sub-columns?
I'd simply like the questions and the responses to end up as two columns.
Hello, how are you? I don't have much experience with this type of work, but I believe you will have to:
1- Read the file as is
2- Concatenate the columns of questions and answers
3- Create the dataset that will be used
I believe pandas has some functions that will help you; you just need to find the patterns that define what counts as "questions" and "answers" in this dataset.
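The three steps above could be sketched roughly as follows. This is not tested against the actual Navigraph file: the tiny inline sample, its question texts and the assumption that row 1 holds the question (blank for the extra "select multiple" columns) and row 2 holds the option label are all mine.

```python
import io
import pandas as pd

# Made-up sample in the assumed layout: two header rows, then one row per respondent.
raw = io.StringIO(
    "id;Which sims do you use?;;Age\n"
    ";MSFS;X-Plane;\n"
    "1;Yes;No;25-34\n"
    "2;Yes;Yes;35-44\n"
)

# 1- Read the file as is (no header parsing, everything as text).
table = pd.read_csv(raw, sep=";", header=None, dtype=str)

# 2- Line up questions and options per column; blank question cells are
#    assumed to belong to the question on their left.
questions = table.iloc[0].ffill()
options = table.iloc[1].fillna("")

# 3- Melt the respondent rows into long form and attach the labels.
body = table.iloc[2:]
tidy = body.melt(id_vars=[0], var_name="col", value_name="response")
tidy = tidy.rename(columns={0: "respondent"})
tidy["question"] = tidy["col"].map(questions)
tidy["option"] = tidy["col"].map(options)
tidy = tidy[["respondent", "question", "option", "response"]]

print(tidy)
```

Each "select multiple" option ends up as its own row, so filtering on response keeps only the options a respondent actually picked.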
I have a seemingly complicated problem and a general idea of how I should solve it, but I am not sure it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new to Pandas, so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The number of rows varies but the columns are fixed. I have a set of 23 lookup tables (also CSV files) named per year and quarter (ranging from the 1st quarter of 2015 to the 3rd quarter of 2020). The lookup files are named like PPRRVU203.csv, which contains the values for the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values from the matching lookup row.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to put the results back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge against each individual file, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code, but I am looking more for high-level thoughts.
I'm not sure exactly what your DataFrame looks like, but the DataFrame.query() method may prove useful for selecting the data.
name = df.query('columnname == "something"')
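At a high level, one alternative to 23 separate filter-and-merge passes is to stack all the lookup tables into one frame tagged with their period and do a single merge. Below is a sketch of that idea. It assumes each row of the main dataframe can be given a "period" key such as "203"; that column, the "value" column and every value here are hypothetical, and in practice the lookup frames would come from the PPRRVU files rather than being built inline.

```python
import pandas as pd

# Hypothetical main data: every row already knows which period it belongs to.
data = pd.DataFrame({
    "Code": ["A1", "A1", "B2"],
    "Mod": ["", "X", ""],
    "period": ["203", "203", "151"],
})

# Hypothetical lookup tables keyed by period (stand-ins for the per-quarter CSVs).
lookups = {
    "203": pd.DataFrame({"Code": ["A1", "A1"], "Mod": ["", "X"], "value": [1.3, 0.9]}),
    "151": pd.DataFrame({"Code": ["B2"], "Mod": [""], "value": [1.5]}),
}

# Stack all lookups into one table, tagging each row with its period.
lookup_all = pd.concat(
    [tbl.assign(period=period) for period, tbl in lookups.items()],
    ignore_index=True,
)

# One left merge keeps every original row, in the original order.
merged = data.merge(
    lookup_all,
    on=["period", "Code", "Mod"],
    how="left",
    validate="many_to_one",  # raises if a lookup key is duplicated
)

print(merged)
```

Because the merge is a left join on the full dataframe, there is no need to split it up and piece it back together afterwards; merged can be written out directly with to_csv.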
I have a historic csv that needs to be updated daily (concatenated) with a freshly pulled csv. The issue is that the new csv may have a different number of columns from the historic one. If both files were small, I could just read them in and concatenate with pandas. If the number of columns were the same, I could use cat in a command-line call. Unfortunately, neither is true.
So, I am wondering if there is a way to do out-of-memory concatenation/join with pandas for something like above, or using one of the command line tools.
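One possible approach, sketched below under a few assumptions: read only the header line of each file to work out the union of columns, then stream both files in chunks, reindex each chunk to that union, and append it to a new output file. The function name concat_csv_streaming, the file names and the sample values are made up for illustration, and the output path must differ from the inputs.

```python
import os
import tempfile
import pandas as pd

def concat_csv_streaming(paths, out_path, chunksize=100_000):
    """Concatenate CSVs with differing columns without loading them whole."""
    # Pass 1: read only the headers to build the union of column names.
    columns = []
    for path in paths:
        for name in pd.read_csv(path, nrows=0).columns:
            if name not in columns:
                columns.append(name)

    # Pass 2: stream chunks, align them to the union, and append to the output.
    first = True
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunk.reindex(columns=columns).to_csv(
                out_path, mode="w" if first else "a", header=first, index=False
            )
            first = False
    return columns

# Small demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    historic = os.path.join(tmp, "historic.csv")
    fresh = os.path.join(tmp, "fresh.csv")
    combined = os.path.join(tmp, "combined.csv")
    pd.DataFrame({"a": [1, 3], "b": [2, 4]}).to_csv(historic, index=False)
    pd.DataFrame({"a": [5], "c": [6]}).to_csv(fresh, index=False)

    concat_csv_streaming([historic, fresh], combined, chunksize=1)
    result = pd.read_csv(combined)

print(result)
```

Only one chunk is in memory at a time, at the cost of rewriting the historic file on every run. If the fresh file never introduces new columns, the first pass could be skipped and the reindexed chunks appended straight to the historic file.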
Thanks!
I have put ~100 dataframes containing data into a list, tables, along with a list of names (so I can look one up by name or just iterate over the whole bunch without needing names).
This data will need to be stored, appended to and later queried. So I want to store it as a pandas hdf5 store.
There are ~100 DFs but I can group them into pairs (two different observers).
In the end I want to iterate over all the list of tables but also
I've thought about Panels (but that will leave annoying NaN values, since the tables aren't the same length), hierarchical HDF5 (but that doesn't really solve anything, it just groups by observer), and one continuous dataframe (seeing as they all have the same number of columns), but that will just make things harder because I'll have to piece the DFs back together afterwards.
Is there anything blatantly obvious I'm missing, or am I just going to have to grin and bear it with one of these? (If so, which one would you go for to give the greatest flexibility?)
Thanks
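On the "one continuous dataframe" option: piecing the tables back together need not be painful if the name and observer are kept as index levels when concatenating. A small sketch, with made-up names, observers and data standing in for the ~100 tables:

```python
import pandas as pd

# Hypothetical stand-ins for the list of tables: keyed by (name, observer),
# same columns, different lengths.
tables = {
    ("run1", "observer_a"): pd.DataFrame({"x": [1, 2, 3]}),
    ("run1", "observer_b"): pd.DataFrame({"x": [4, 5]}),
    ("run2", "observer_a"): pd.DataFrame({"x": [6]}),
}

# The tuple keys become the two outer index levels; no NaN padding is needed.
combined = pd.concat(tables, names=["name", "observer", "row"]).sort_index()

# Pull a single original table back out by its key...
one_table = combined.loc[("run1", "observer_b")]

# ...or everything recorded by one observer, across all names.
observer_a = combined.xs("observer_a", level="observer")

# Iterating over the whole bunch still works via groupby on the outer levels.
for (name, observer), frame in combined.groupby(level=["name", "observer"]):
    print(name, observer, len(frame))
```

Appending a new table later is another concat with the same three index levels, so the pairing by observer stays queryable without keeping separate lists in sync.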
We would like to allow the HDF5 files themselves to define their columns, indexes, and column types, instead of maintaining a separate file that defines the structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 file is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but are by no means proficient. The part that is tripping us up is creating an empty data structure and reading that structure from a file that you will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure: read the entire frame into memory, or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want a great deal of flexibility in moving data around and want to be able to define a target dataframe structure prior to writing the data, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information, it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as #jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types is important because we want to enable a user to then select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block so...blah. I feel like I'm talking myself into maintaining a peer structure definition but I REALLY would love some suggestions to not have to do that.
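For the in-memory half of this, an empty frame that still carries the column names and types can be built from typed empty Series. The sketch below covers only that part, using the four columns listed above; it does not show how the structure would be persisted in or read back from HDF5, and the mapping of Int/Str to int64/object is an assumption.

```python
import pandas as pd

# Target structure from the question, expressed as column -> dtype.
schema = {
    "id": "int64",
    "name": "object",
    "update_date": "datetime64[ns]",
    "some_float": "float64",
}

# An empty frame whose columns already have the requested dtypes.
empty = pd.DataFrame({col: pd.Series(dtype=dtype) for col, dtype in schema.items()})

# The structure can be inspected without any rows being present.
structure = {col: str(dtype) for col, dtype in empty.dtypes.items()}

print(len(empty))
print(structure)
```

A view for selecting result columns could be driven from structure alone, so downstream operations such as summing V by A and B can be defined before any data arrives.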