python upset plot data type unclear - python

I am trying to make an upset plot using gene-disease association lists. I assume that I simply do not understand which data type is required as an input as most examples use artificially created datasets that are of the data type "int64".
Upsetplot: https://buildmedia.readthedocs.org/media/pdf/upsetplot/latest/upsetplot.pdf and https://pydigger.com/pypi/UpSetPlot
I copied the examples given in the links above and they work just fine. When I try my own dataset I get the error message: AttributeError: 'Index' object has no attribute 'levels'
The data I use as input is a data frame with boolean information (see attachment "mydata.png"). So I have the diseases as columns, the genes as rows, and boolean values stating whether each gene is associated with that disease or not (I can make this sound more computational if required).
An example data set that works can be found in the documentation or in the screenshot "upsetplot_data_example.png". The documentation says something about "category membership", but I do not quite understand what data type that is.
I assume it is a basic issue of not understanding what "format" is required. If anyone has an idea of what I need to do, please let me know. I welcome all feedback. I do not expect anyone to actually do the coding for me, however some pointers would be so helpful.
Thanks everyone!

The recently released Data Format Guide might prove helpful. Perhaps you need to set those boolean columns as the index of your data frame before passing it in, although ultimately, it may be easier to use from_contents or from_memberships to describe your data.
However, upsetplot will hopefully make the input format easier in a future version.
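One way to read the error: upsetplot seems to expect data whose index is a MultiIndex with one boolean level per category, and a plain Index has no levels attribute. Here is a minimal sketch using pandas only, assuming made-up gene and disease names. The commented-out plot call reflects my understanding of the upsetplot API and is not run here.

```python
import pandas as pd

# Hypothetical gene-by-disease boolean table (made-up names and values).
df = pd.DataFrame(
    {
        "disease_a": [True, True, False, True],
        "disease_b": [False, True, True, True],
        "disease_c": [False, False, True, False],
    },
    index=["gene_1", "gene_2", "gene_3", "gene_4"],
)

# Count how many genes fall into each combination of disease memberships.
# Grouping by all boolean columns turns them into a boolean MultiIndex.
counts = df.groupby(list(df.columns)).size()
print(counts)

# Assumption: a Series like this is what upsetplot wants as input, e.g.
# from upsetplot import plot
# plot(counts)
```

The same index could also be produced with df.set_index(list(df.columns)), which keeps one row per gene instead of one row per combination.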

Related

Pandas Styles removing default table format

I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using the Styler's format method (df.style.format). It does the job, but it also "breaks" the original design of my table.
Here is an example of what is going on:
Is there anything I can do to avoid doing it? I want to keep the original table format, only changing the format of the value.
PS: Don't know if it makes any difference, but I am using Google Colab.
In case anyone is having the same problem as I was using Colab, I have found a solution:
.set_table_attributes('class="dataframe"') seems to solve the problem
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687
For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
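A small sketch of that last suggestion, assuming a made-up frame named pdf with a numeric column a. Note that the formatted column holds strings afterwards, so it is only suitable for display.

```python
import pandas as pd

# Hypothetical data; only column "a" needs a thousands separator.
pdf = pd.DataFrame({"a": [1234567.0, 89012.4], "b": ["x", "y"]})

# map applies the format string to each value and returns strings.
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))
print(formatted)
```

The original pdf is left untouched, so the numeric values stay available for further calculations.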

No numeric types to aggregate?

I am trying to calculate the mean of different columns using groupby.
Here is my code.
However, as soon as I try to calculate mean, the error 'no numeric types to aggregate' appears. What is wrong with my code? Please help me!!! Thank you so much.
Can you please post your code as text, along with some example data?
What are the contents of data['low_stress'] and data['high_stress']?
My guess is that you use pd.Series([low_stress]) and thereby create a Series whose single element is the whole array, which gives it a non-numeric (object) dtype. Using pd.Series(low_stress) will probably fix your problem.
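To illustrate that guess, here is a minimal sketch with a made-up list called low_stress. The original code was only posted as an image, so this is an assumption about what it did.

```python
import pandas as pd

# Hypothetical measurements standing in for the asker's data.
low_stress = [1.5, 2.0, 3.5]

# Wrapping the list in another list gives ONE element that is itself a list.
wrapped = pd.Series([low_stress])
print(len(wrapped), wrapped.dtype)  # one element, object dtype

# Passing the list directly gives one numeric element per measurement.
flat = pd.Series(low_stress)
print(len(flat), flat.dtype)
print(flat.mean())
```

An object-dtype column like wrapped is the kind of input that groupby cannot average.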

What is the best way to handle variable names in Python if all the variables need to have a float (decimal) in them?

Totally new in this forum and new in python so I would appreciate it if anybody can help me.
I am trying to build a script in Python based on data that I have in an Excel spreadsheet. I'd like to create an app/script where I can estimate the pregnancy due date and the conception date (for animals) based on measurements that I have taken during ultrasounds. I am able to estimate it with a calculator, but it takes some conversion (from cm to mm, and from days to months). In order to do that in Python, I figured I would create a variable for each measurement and set each variable equal to its value in days (an integer).
Here is the problem: the main column of my data set is the actual measurement of the babies in mm (known as BPD), but the BPD can be a whole number like 5 mm or a decimal like 6.4 mm. Since I can't name a variable with a period or a dot in it, what would be the best way to handle my data and assign variables to it? I have tried BPD_4.8 = 77days, but Python tells me there's a syntax error (I'm sure lol), whereas if I type BPD_5 = 78 it seems to work. I haven't mastered lists and tuples, nor do I really know how to use them properly, so I'll keep looking online and see what happens.
I'm sure it's something super silly for you guys, but I'm really pulling my hair out and I have nothing but 2 inches of hair lol
This is what my current screen looks like..HELP :(
Howdy and welcome to StackOverflow. The short answer is:
Use a better data structure
You really shouldn't be encoding valuable information into variable names like that. What's going to happen if you want to calculate something with your BPD measurements? Or when you have duplicate BPD's?
This is bad practice. It might seem like a lot of effort to take the time to figure out how to do this properly, but it will be more than worth it if you intend to continue to use Python :)
I'll give you a couple options...
Option 1: Use a dictionary
Dictionaries are common data structures in any language, so it pays to know how to use them.
Dictionaries hold information about an object using key/value pairs. For example you might have:
measurements = {
    'animal_1': {'bpd': 4.6, 'due_date_days': 55},
    'animal_2': {'bpd': 5.2, 'due_date_days': 77},
}
An advantage of dictionaries is that they are explicit, i.e. values have keys which explicitly identify what the information is assigned to. E.g. measurements['animal_1']['due_date_days'] would return the due date for animal 1.
A disadvantage is that it will be harder to compute information / examine relationships than you'll be used to in Excel.
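For the specific problem in the question, a dictionary can also use the BPD value itself as the key, so no variable name ever needs a dot in it. Here is a minimal sketch using the two values from the question (4.8 mm is 77 days, 5 mm is 78 days). The names bpd_to_days and days_for_bpd are hypothetical, and the rounding to one decimal is my assumption about the measurement precision.

```python
# Lookup table: BPD in mm -> gestation length in days (values from the question).
bpd_to_days = {4.8: 77, 5.0: 78}


def days_for_bpd(bpd_mm):
    """Return the days for a BPD given in mm, rounded to one decimal."""
    return bpd_to_days[round(bpd_mm, 1)]


def cm_to_mm(length_cm):
    """Convert a measurement from centimetres to millimetres."""
    return length_cm * 10


print(days_for_bpd(4.8))
print(days_for_bpd(cm_to_mm(0.5)))
```

A measurement that is not in the table raises a KeyError, which at least makes missing data obvious.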
Option 2: Use Pandas
Pandas is a data science library for Python. It's fast, has similar functionality to Excel and is probably well suited to your use case.
I'd recommend you take the time to do a tutorial or two. If you're planning to use Python for data analysis then it's worth using the language and any suitable libraries properly.
You can check out some Pandas tutorials here: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
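As a rough idea of what the Pandas route could look like, here is a sketch that stores the measurements as a table, much like the spreadsheet. The column names bpd_mm and days are hypothetical, and the two rows are the values mentioned in the question.

```python
import pandas as pd

# One row per BPD measurement, like rows in the spreadsheet.
table = pd.DataFrame({"bpd_mm": [4.8, 5.0], "days": [77, 78]})

# Index the table by BPD so a measurement can be looked up directly.
days_by_bpd = table.set_index("bpd_mm")["days"]
print(days_by_bpd.loc[4.8])
```

If the spreadsheet is saved as a file, pandas.read_excel or pandas.read_csv could load it instead of typing the values in.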
Good luck!

Take dates and times from multiple columns to one datetime object with Python

I've got a dataset with multiple time values as below.
Area,Year,Month,Day of Week,Time of Day,Hour of Day
x,2016,1,6.0,108,1.0
z,2016,1,6.0,140,1.0
n,2016,1,6.0,113,1.0
p,2016,1,6.0,150,1.0
r,2016,1,6.0,158,1.0
I have been trying to transform this into a single datetime object to simplify the dataset and be able to do proper time series analysis against it.
For some reason I have been unable to get the right outcome using the datetime library from Python. Would anyone be able to point me in the right direction?
Update - Example of stats here.
https://data.pa.gov/Public-Safety/Crash-Incident-Details-CY-1997-Current-Annual-Coun/dc5b-gebx/data
I don't think there is a week column. Hmm. I wonder if I've missed something?
Any suggestions would be great. Really just looking to simplify this dataset. Maybe even create another table / sheet for the causes of crash, as there are a lot of superfluous columns that are taking up a lot of data, which could be labeled with simple ints.
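A possible starting point, under two loud assumptions: that "Time of Day" is a 24-hour clock value written as HHMM (so 108 means 01:08, which matches "Hour of Day" being 1.0), and that a placeholder day of 1 is acceptable because the sample has no day-of-month column. This is only a sketch on two of the sample rows.

```python
import pandas as pd

# Two rows from the sample data in the question.
df = pd.DataFrame(
    {
        "Area": ["x", "z"],
        "Year": [2016, 2016],
        "Month": [1, 1],
        "Time of Day": [108, 140],
    }
)

# Assumption: Time of Day is HHMM, so split it into hour and minute.
tod = df["Time of Day"]
parts = pd.DataFrame(
    {
        "year": df["Year"],
        "month": df["Month"],
        "day": 1,  # placeholder: the data has no day-of-month column
        "hour": tod // 100,
        "minute": tod % 100,
    }
)

# to_datetime assembles a timestamp from columns named year, month, day, ...
df["timestamp"] = pd.to_datetime(parts)
print(df)
```

Because the day is a placeholder, these timestamps are only meaningful down to the month plus the time of day.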

unpack dictionary values in dataframe python

I have this dataframe from which I need to extract the act1omschr from the column adresactiviteit. However, since it is an object with a list and a dict, I don't know how to extract these values.
Can someone help me out?
It looks like that's not a dictionary, but 'JSON' (JavaScript Object Notation). It's a bit like a CSV but with nested values, and it's pretty common, especially for web data.
Pandas has a function called 'json_normalize' which should help. For specifically using it on one column, this was answered pretty well over here. You should more or less be able to use the exact code given.
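Since the dataframe was only shown as an image, here is a sketch on made-up data, assuming each cell of adresactiviteit holds a list of dicts that contain an act1omschr key. The id and other fields are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "adresactiviteit": [
            [{"act1omschr": "activity A", "other": 1}],
            [
                {"act1omschr": "activity B", "other": 2},
                {"act1omschr": "activity C", "other": 3},
            ],
        ],
    }
)

# One row per list element, so every cell holds a single dict.
exploded = df.explode("adresactiviteit").reset_index(drop=True)

# Flatten the dicts into columns, then keep the field of interest.
details = pd.json_normalize(exploded["adresactiviteit"].tolist())
result = exploded[["id"]].join(details["act1omschr"])
print(result)
```

If the column holds JSON strings rather than Python objects, each cell would first need json.loads applied to it.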
