I have a DataFrame like this: DATASET
I want to split my dataset into four parts.
Each of the four parts corresponds to a title.
The difficulty is that the four parts of the table have different sizes. Moreover, I don't have just one table but many others (each also made of four parts, again of varying sizes).
The goal is to find a way to split each table by referring to the TITLE each time. Do you have an idea of how to do it?
Sincerely,
Etienne
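A sketch of one approach, assuming each part starts with a row whose value begins with "TITLE" (the column name and title marker below are made up; adapt them to the real table): flag the title rows, turn a cumulative sum of those flags into a group id, then `groupby` that id. This works regardless of how many rows each part has.

```python
import pandas as pd

# Hypothetical single-column layout: rows are either a title row
# (e.g. "TITLE 1") or data rows belonging to the title above them.
df = pd.DataFrame({"value": ["TITLE 1", "a", "b",
                             "TITLE 2", "c",
                             "TITLE 3", "d", "e", "f",
                             "TITLE 4", "g"]})

# Each title row bumps the group id by one, so every data row
# inherits the id of the title above it.
is_title = df["value"].str.startswith("TITLE")
group_id = is_title.cumsum()

# One sub-dataframe per title; the title row itself is dropped.
parts = {df.loc[grp.index[0], "value"]: grp.iloc[1:].reset_index(drop=True)
         for _, grp in df.groupby(group_id)}
```

The same idea applies to every table, however large each part is, since nothing here hard-codes sizes.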
I have two datasets that partially overlap. The overlapping part should have identical values in two columns. However, I suspect that's not always the case. I want to check this using pandas, but I run into a problem: since the dataframes are structured differently, their row indexes do not correspond. Moreover, corresponding rows have a different "Name" or "ID". Therefore, I wanted to match rows by matching values from three other columns that I am confident are the same: latitude, longitude and number of samples (I need all three because some rows are collected at the same location and some rows may have the same number of samples).
In short, I want to formulate a condition that requires three columns in a row from either dataframe to be equal, and then check the values of the columns that I suspect are different. Unfortunately, I have not been able to phrase this problem well enough for Google to find me the correct function.
Many thanks!
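One way to sketch this, with hypothetical column names (`latitude`, `longitude`, `n_samples` as the three key columns and `depth` as a suspect column): an inner `pd.merge` on the three keys lines the rows up regardless of index or ID, after which the suspect columns can be compared directly.

```python
import pandas as pd

# Made-up data standing in for the two overlapping datasets.
df1 = pd.DataFrame({"latitude": [1.0, 1.0], "longitude": [2.0, 3.0],
                    "n_samples": [10, 5], "depth": [100, 200]})
df2 = pd.DataFrame({"latitude": [1.0, 1.0], "longitude": [2.0, 3.0],
                    "n_samples": [10, 5], "depth": [100, 250]})

# Inner merge keeps only rows where all three key columns match,
# pairing each row of df1 with its counterpart in df2.
keys = ["latitude", "longitude", "n_samples"]
merged = pd.merge(df1, df2, on=keys, suffixes=("_a", "_b"))

# Now the suspect column exists twice and can be compared row-wise.
mismatches = merged[merged["depth_a"] != merged["depth_b"]]
```

Rows present in only one dataframe simply drop out of the inner merge; use `how="outer", indicator=True` instead if you also want to see those.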
I'm manually comparing two or three very similar rows using pandas. Is there a more automated way to do this? I would like a better method than using `==`.
See if `DataFrame.diff` will satisfy your needs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
df['sales_diff'] = df['sales'].diff()
The above code snippet creates a new column in your dataframe containing each row's difference from the previous row (the default). You can play around with the `axis` parameter to compare across rows or columns, and change `periods` to compare against a row or column further away.
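A minimal runnable illustration of those parameters, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 15, 11]})

df["sales_diff"] = df["sales"].diff()             # each row minus the previous row
df["sales_diff_2"] = df["sales"].diff(periods=2)  # each row minus the row two back
```

The first `periods` rows have no earlier row to subtract, so they come out as NaN.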
I have relatively complex csv files which contain multiple matrices representing several types of data and I would like to be able to parse these into multiple dataframes.
The complication is that these files are quite variable in size and content, as seen in this example containing two types of data, in this case a Median and Count metric for each sample.
There are some commonalities that all of these files share. Each metric will be stored in a matrix structured essentially like the two in the above example. In particular, the DataType field and subsequent entry will always be there, and the feature space (columns) and sample space (rows) will be consistent within a file (the row space may vary between files).
Essentially, the end result should be a dataframe of the data for just one metric, with the feature ids as the column names (Analyte 1, Analyte 2, etc in this example) and the sample ids as the row names (Location column in this case).
So far I've attempted this using the pandas read_csv function without much success.
In theory I could do something like this, but only if I know (1) the size and (2) the location of the particular matrix for the metric that I am after. In this case the headers for my particular metric of interest would be in row 46 and I happen to know that the number of samples is 384.
import pandas as pd
df = pd.read_csv('/path/to/file.csv', sep=',', header=46, nrows=385, index_col='Location')
I am at a complete loss how to do this in a dynamic fashion with files and matrices that change dimensions. Any input on overall strategy here would be greatly appreciated!
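One possible strategy, under assumptions guessed from the description (each block begins with a line starting with `DataType`, the next line is a header row containing a `Location` column, and a blank line or the next `DataType` line ends the block): scan the file once for block boundaries instead of hard-coding `header` and `nrows`.

```python
import io
import pandas as pd

def read_metric(path, metric):
    """Extract the block for one metric from a mixed CSV.

    Assumes each matrix starts with a line like 'DataType:,Median' and
    runs until a blank line or the next 'DataType' line. The marker and
    column names are guesses from the question; adapt them as needed.
    """
    with open(path) as fh:
        lines = fh.read().splitlines()

    block = []
    capturing = False
    for line in lines:
        if line.startswith("DataType"):
            if capturing:       # next matrix begins: our block is done
                break
            capturing = metric in line
            continue
        if capturing:
            if not line.strip():  # blank line ends the block
                break
            block.append(line)

    # The captured lines form a self-contained CSV (header + rows).
    return pd.read_csv(io.StringIO("\n".join(block)), index_col="Location")
```

Since nothing here depends on where the matrix sits in the file or how many samples it has, the same call should work across files of varying layout, as long as the `DataType` convention holds.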
I've tried searching for an example of how I might solve this but can't seem to find a specific example that matches my needs.
I'm trying to create multiple separate dataframes (up to 80, depending on the data that I have) from one large dataframe, using the value in a column as the "grouper". I have records for multiple patient "types" (where patient type is a column variable) and want to create a separate dataframe for each of these types.
The reason I want to do this is that I want to plot a separate Kaplan-Meier survival curve for each of these dataframes. I've tried doing this using subplots, but there are too many different patient types to fit in a grid of subplots (they end up looking too small).
I'm new to Python so apologies if this is a silly question...and thanks in advance for any suggestions.
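A sketch with made-up column names: `groupby` on the type column already yields the sub-dataframes, so a dict comprehension gives one dataframe per patient type without declaring 80 separate variables.

```python
import pandas as pd

# Hypothetical data; 'patient_type' stands in for the real grouping column.
df = pd.DataFrame({"patient_type": ["A", "A", "B", "C"],
                   "survival_time": [5, 7, 3, 9]})

# One sub-dataframe per patient type, keyed by the type value.
frames = {ptype: sub.reset_index(drop=True)
          for ptype, sub in df.groupby("patient_type")}
```

Each curve can then be plotted in its own figure by looping over `frames.items()`, sidestepping the cramped-subplot problem.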
I have put ~100 dataframes containing data into a list `tables`, along with a list of names (so I can call them by name or just iterate over the whole bunch without needing names).
This data will need to be stored, appended to and later queried. So I want to store it as a pandas hdf5 store.
There are ~100 DFs but I can group them into pairs (two different observers).
In the end I want to iterate over the whole list of tables, but also work with each pair individually.
I've thought about Panels (but those will have annoying NaN values since the tables aren't the same length), hierarchical HDF5 (but that doesn't really solve anything, it just groups by observer), and one continuous dataframe (seeing as they have the same number of columns) (but that will just make things harder because I'll have to piece the DFs back together afterwards).
Is there anything blatantly obvious I'm missing, or am I just going to have to grin and bear it with one of these? (If so, which one would you go for to give the greatest flexibility?)
Thanks
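For what it's worth, a sketch of the "one continuous dataframe" option that avoids the piecing-back-together problem (the names here are made up): passing `pd.concat` a dict of frames builds a MultiIndex from the keys, so each original table can be recovered with a single `.loc`. The same `(observer, table)` keys would also translate naturally into hierarchical paths in an `HDFStore`.

```python
import pandas as pd

# Hypothetical tables keyed by (observer, table name).
frames = {("obs1", "t0"): pd.DataFrame({"x": [1, 2]}),
          ("obs2", "t0"): pd.DataFrame({"x": [3]})}

# Tuple keys become two MultiIndex levels; the original row index
# becomes a third, so nothing about the tables is lost.
combined = pd.concat(frames, names=["observer", "table", "row"])

# Recover one original table with a partial index lookup.
t = combined.loc[("obs1", "t0")]
```

Unequal table lengths are no problem here (unlike Panels), and selecting by the `observer` level gives you the pairs back for iteration.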