I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using the Styler's format method (df.style.format). It does the job, but it also breaks my table's original design.
Here is an example of what is going on:
Is there anything I can do to avoid this? I want to keep the original table format and only change how the values are formatted.
PS: Don't know if it makes any difference, but I am using Google Colab.
In case anyone has the same problem in Colab, I have found a solution:
Chaining .set_table_attributes('class="dataframe"') onto the Styler seems to solve the problem.
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687
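A minimal sketch of how that fix might be chained onto the Styler. The helper name, the column names and the sample values are illustrative assumptions, not taken from the original post. Styler needs jinja2 installed, so the styling call is wrapped in a function here and only invoked from a notebook cell:

```python
import pandas as pd

# Format string for a thousands separator with no decimals.
THOUSANDS = "{:,.0f}"

def style_with_thousands(df, columns):
    """Hypothetical helper: format the given columns and keep the table class."""
    return (
        df.style
        .format({col: THOUSANDS for col in columns})
        # Restore the class attribute, as suggested in the linked Colab issue.
        .set_table_attributes('class="dataframe"')
    )

# Toy data; "a" stands in for the numeric column being formatted.
pdf = pd.DataFrame({"a": [1234567.0, 89012.0], "b": ["x", "y"]})
# In a notebook cell: style_with_thousands(pdf, ["a"])
```

The underlying data stays numeric with this approach; only the rendered output changes.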
For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
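As a small runnable sketch of that one-liner, assuming a toy frame with a single numeric column a (the values are made up):

```python
import pandas as pd

# Toy data; the column name "a" matches the snippet above.
pdf = pd.DataFrame({"a": [1234567.0, 89012.0]})

# assign returns a new frame; the values in "a" become formatted strings.
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))

print(formatted["a"].tolist())
```

Note that the formatted column now holds strings, so it can no longer be summed or sorted numerically. The original pdf is left unchanged.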
I'm new to coding and having a hard time finding the correct terms to search for to help me with this task. In my work I get some pretty large Excel files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some of these tasks.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on Python, NumPy, pandas and also Matplotlib, but I am still clueless about how to set this up.
Any help appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I am having trouble dividing the rows into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are sub-categories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders are AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, DataFrame.groupby seems to be a good fit for the problem. The groupby function does what its name says: it groups the rows of the DataFrame by the key you give it.
Since you mentioned grouping by locality and summing the values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0, 4, 1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result:
Table
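The same idea can be sketched without overwriting the Locality column, by passing the stripped IDs straight to groupby as a Series. The species columns and counts below are made-up stand-ins for the real spreadsheet, and splitting on the underscore is an assumption about how the suffixes are written:

```python
import pandas as pd

# Hypothetical sample; the real file has more localities and species.
eng = pd.DataFrame({
    "Locality": ["AA3_f", "AA3_lf", "AA3_s", "AA4_f", "AA4_s"],
    "Herring gull": [1, 2, 3, 4, 5],
    "Common gull": [10, 0, 5, 2, 2],
})

# Keep only the part before the underscore: "AA3_f" -> "AA3".
main_locality = eng["Locality"].str.split("_").str[0]

# Group the numeric columns by the main locality and sum each species.
totals = eng.drop(columns="Locality").groupby(main_locality).sum()

print(totals)
```

The earlier attempts failed because ["Locality"] is a plain Python list, which has no .str accessor. groupby accepts a Series of keys, so the slicing has to happen on eng["Locality"] first. Splitting on the underscore also avoids depending on every ID being the same length.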
I am trying to make an upset plot using gene-disease association lists. I assume that I simply do not understand which data type is required as an input as most examples use artificially created datasets that are of the data type "int64".
Upsetplot: https://buildmedia.readthedocs.org/media/pdf/upsetplot/latest/upsetplot.pdf and https://pydigger.com/pypi/UpSetPlot
I copied the examples given in the links above and they work just fine. When I try my own dataset I get the error message: AttributeError: 'Index' object has no attribute 'levels'
The data I use as input is a data frame with boolean information (see attachment "mydata.png"). So I have the diseases as columns, the genes as rows and then boolean values stating whether the specific gene is associated with that disease or not (I can make this sound more computational if required).
An example data set that works can be found in the documentation or in the screenshot "upsetplot_data_example.png". The documentation says something about "category membership", but I do not quite understand what data type that is.
I assume it is a basic issue of not understanding what "format" is required. If anyone has an idea of what I need to do, please let me know. I welcome all feedback. I do not expect anyone to actually do the coding for me, however some pointers would be so helpful.
Thanks everyone!
The recently released Data Format Guide might prove helpful. Perhaps you need to set those boolean columns as the index of your data frame before passing it in, although ultimately, it may be easier to use from_contents or from_memberships to describe your data.
However, upsetplot will hopefully make the input format easier in a future version.
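A rough sketch of the "set those boolean columns as the index" suggestion, using pandas only. The gene and disease names are invented, and the final plotting call is left as a comment because it is an assumption about how the result would be passed on, not something run here:

```python
import pandas as pd

# Hypothetical gene-by-disease table of booleans.
genes = pd.DataFrame(
    {"disease_a": [True, True, False], "disease_b": [False, True, True]},
    index=["GENE1", "GENE2", "GENE3"],
)
genes.index.name = "gene"

# Move the gene names into a column and the boolean columns into the index.
# The index is now a MultiIndex, which has the .levels attribute that the
# error message says a plain Index lacks.
indexed = genes.reset_index().set_index(["disease_a", "disease_b"])

# Count how many genes fall into each combination of memberships.
counts = indexed.groupby(level=["disease_a", "disease_b"]).size()

print(counts)
# Assumption: a Series like counts is the kind of input upsetplot expects,
# e.g. upsetplot.plot(counts)
```

Each row of counts is one combination of category memberships, which appears to be what the documentation means by that term.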
I have this dataframe from which I need to extract the act1omschr from the column adresactiviteit. However, since it is an object holding a list and a dict, I don't know how to extract these values.
Can someone help me out?
It looks like that's not a dictionary, but JSON (JavaScript Object Notation). It's a bit like a CSV but with nested values, and it is pretty common, especially for web data.
Pandas has a function called 'json_normalize' which should help. For specifically using it on one column, this was answered pretty well over here. You should more or less be able to use the exact code given.
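A hedged sketch of what that could look like, assuming each cell of adresactiviteit holds a list of dicts that are already parsed Python objects. The sample values and the act1code key are invented for illustration:

```python
import pandas as pd

# Hypothetical data shaped like the description in the question.
df = pd.DataFrame({
    "id": [1, 2],
    "adresactiviteit": [
        [{"act1omschr": "Bakery", "act1code": "1071"}],
        [{"act1omschr": "Cafe", "act1code": "5630"}],
    ],
})

# One row per list element, then flatten the dicts into columns.
exploded = df.explode("adresactiviteit").reset_index(drop=True)
flat = pd.json_normalize(exploded["adresactiviteit"].tolist())

# Put the extracted column next to the original identifiers.
result = exploded[["id"]].join(flat[["act1omschr"]])

print(result)
```

If the cells are JSON strings rather than lists, they would need to go through json.loads first.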
I would like to know if there's a technique to simply undo a change that was done using Pandas.
For example, I did a string replacement on a few thousand rows of a pandas DataFrame, where every occurrence of "&" in a string was replaced with "and". However, after performing the replacement, I found out that I had made a mistake and wanted to revert the DataFrame to its state just before the string replacement was done.
Is there a way to do this?
Yes, there is a way to do this. If you're using a recent version of Python and pandas you could do it this way:
df.replace(to_replace='and', value='&', inplace=True)
This is the way I learned it!
If your notebook is structured in steps, and the mess comes from running a couple of cells that changed the dataset, you can restart the kernel and run all the cells from the beginning.
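A small sketch of why a reverse replacement is not a true undo, and of keeping a copy before a risky change instead. The column name and strings are made up:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom & Jerry", "Sand and Sea"]})

# Keep a snapshot before changing anything.
backup = df.copy()

# The change that later turns out to be a mistake.
df["name"] = df["name"].str.replace("&", "and", regex=False)

# Reversing the replacement also hits every "and" that was there originally,
# including the one inside "Sand".
naive = df["name"].str.replace("and", "&", regex=False)
print(naive.tolist())

# Restoring from the snapshot brings back the exact original values.
df = backup.copy()
```

pandas keeps no history of operations, so a snapshot like backup (or re-reading the source file) is what makes the revert exact.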
I am trying to read a csv file in Python3 using the numpy genfromtxt function. In my csv file I have a field which is a string that looks like the following: "0x30375107333f3333".
I need to use the "dtype=None" option because I need this section of code to work with many different csv files, only some of which have such a field. Unfortunately numpy interprets this as a float128, which is a pain because 1) it is not a float and 2) I cannot find a way to convert it to an int after it has been read as a float128 (without losing precision).
What I would like to do instead is interpret this as a string, because that is enough for me. I found in the NumPy documentation that there is a way of getting around this, but the instructions are cryptic:
This behavior may be changed by modifying the default mapper of the StringConverter class.
Unfortunately whenever I Google something related to this I fall back to this documentation page.
I would greatly appreciate either an explanation of what they mean in the above quoted text or a solution to my above stated problem.