Python (Pandas) : When to use replace vs. map vs. transform? - python

I'm trying to clearly understand for which type of data transformation the following functions in pandas should be used:
replace
map
transform
Can anybody provide some clear examples so I can better understand them?
Many thanks :)

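As far as I understand, replace substitutes specific values with others (for example, turning a placeholder into NaN or the other way round), transform is mostly used with groupby operations and returns a result aligned with the original rows, and map changes the values of a Series or Index element by element.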
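To make the distinction concrete, here is a minimal sketch on a made-up DataFrame. The column names (team, grade, score) and the values are assumptions for illustration only, not taken from the question.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "grade": ["low", "high", "n/a", "low"],
    "score": [10, 20, 30, 50],
})

# replace: swap specific values and leave everything else untouched
replaced = df["grade"].replace({"n/a": "unknown"})

# map: element-wise lookup; values missing from the mapping become NaN
mapped = df["grade"].map({"low": 0, "high": 1})

# transform: per-group computation broadcast back onto the original rows
group_mean = df.groupby("team")["score"].transform("mean")

print(replaced.tolist())
print(mapped.tolist())
print(group_mean.tolist())
```

The main practical difference between replace and map is what happens to values that are not listed: replace keeps them, while map with a dict turns them into NaN.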

Related

Pandas Styles removing default table format

I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using df.style.format. It does the job, but it also "breaks" the original design of my table.
Here is an example of what is going on:
Is there anything I can do to avoid doing it? I want to keep the original table format, only changing the format of the value.
PS: Don't know if it makes any difference, but I am using Google Colab.
In case anyone using Colab runs into the same problem as I did, I have found a solution:
adding .set_table_attributes('class="dataframe"') to the Styler seems to solve the problem.
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687
For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
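A small self-contained version of that one-liner, assuming a DataFrame named pdf with a numeric column a (the values are made up):

```python
import pandas as pd

# Hypothetical numeric data
pdf = pd.DataFrame({"a": [1234567.0, 1000.0, 42.0]})

# map applies the format function to every value; assign returns a new
# DataFrame, so pdf itself keeps its numeric column
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))

print(formatted)
```

Keep in mind that the formatted column holds strings, so it is only suitable for display, not for further arithmetic.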

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large excel-files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some tasks for this.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from the sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on python, numpy, pandas and also matplotlib, and I am still clueless on how to set this up.
Any help appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I am having trouble dividing the rows into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are subcategories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders are AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, DataFrame.groupby seems to be a good fit for the problem. It does what the name says: it splits the dataframe into groups based on the values you choose.
Since you want to group by locality and sum the values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0, 4, 1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result:
Table
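For reference, the same result can be sketched without overwriting the Locality column, by grouping on a derived key. The species column names and counts below are invented, and splitting on "_" is an assumption that the prefix always ends at the first underscore.

```python
import pandas as pd

# Hypothetical field data; only the Locality naming scheme comes from the question
eng = pd.DataFrame({
    "Locality": ["AA3_f", "AA3_lf", "AA3_s", "AA4_f", "AA4_s"],
    "herring_gull": [1, 2, 3, 10, 20],
    "common_gull": [0, 5, 1, 4, 4],
})

# Derive the main locality: everything before the first underscore
main_locality = eng["Locality"].str.split("_").str[0]

# Group the count columns by the derived key and sum within each group
totals = eng.groupby(main_locality)[["herring_gull", "common_gull"]].sum()

print(totals)
```

Splitting on the separator instead of slicing a fixed number of characters also works when prefixes have different lengths, and it avoids shadowing the built-in sum with a variable of the same name.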

what is the significance of coerce in python/pandas?

As a newbie to python/pandas, I'm curious about the significance/use of "coerce" in pandas syntax.
Is it only used for conversions (forceful conversions, I would say), or does it have use cases elsewhere?
Thanks in advance. (An example would be highly valuable.)
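One common place where "coerce" appears is the errors parameter of the conversion functions pd.to_numeric and pd.to_datetime. The sketch below uses made-up input values and only illustrates that usage; it does not claim to cover every place the word shows up in pandas.

```python
import pandas as pd

# Hypothetical messy input
raw = pd.Series(["1", "2.5", "abc", None])

# errors="coerce": values that cannot be parsed become NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")

# The same option on to_datetime yields NaT for unparseable entries
dates = pd.to_datetime(pd.Series(["2021-01-31", "not a date"]), errors="coerce")

print(numbers)
print(dates)
```

With the default errors="raise", the same to_numeric call would stop with a ValueError at "abc", so "coerce" is the option to reach for when bad values should be turned into missing values rather than aborting the conversion.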

What is the data structure in python that can contain multiple pandas data frames?

I want to write a function to return several data frames (different dims) and put them into a larger "container" and then select each from the "container" using indexing. I think I want to find some data structure like list in R, which can have different kinds of objects.
What can I use to do this?
I haven't done much with Panels, but what exactly is the functionality that you need? Is there a reason a simple python list wouldn't work? Or, if you want to refer by name and not just by list position, a dictionary?
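A minimal sketch of the list/dict idea, with a hypothetical function make_frames and invented frame names and contents:

```python
import pandas as pd


def make_frames():
    """Return several DataFrames of different shapes, keyed by name."""
    small = pd.DataFrame({"x": [1, 2]})
    wide = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [3.0]})
    return {"small": small, "wide": wide}


frames = make_frames()

# Select by name from the dict ...
print(frames["wide"].shape)

# ... or by position, after turning the values into a list
frames_list = list(frames.values())
print(frames_list[0].shape)
```

A dict behaves much like a named list in R: the frames can have any dimensions, and each one is retrieved by its key.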
It depends a bit on what you want to achieve. People used to work a lot with MultiIndex, using an identifier of the dataframe as one level of the index (documentation).
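But recently there have been a lot of improvements to the Panel class, which is most likely the optimal solution for you (api, documentation). Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so this only applies to older versions.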
I agree with @foobar. I have used MultiColumns and MultiIndexes before for this type of data. However, I believe the best data type for this would be a pandas Panel. Here is the documentation...
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.html
You can add the frames on just like you would add elements to a dict

Finding median with pandas transform

I needed to find the median for a pandas dataframe and used a piece of code from this previous SO answer: How I do find median using pandas on a dataset?.
I used the following code from that answer:
data['metric_median'] = data.groupby('Segment')['Metric'].transform('median')
It seemed to work well, so I'm happy about that, but I had a question: how is it that the transform method took the argument 'median' without any prior specification? I've been reading the documentation for transform but didn't find any mention of using it to find a median.
Basically, the fact that .transform('median') worked seems like magic to me, and while I have no problem with magic and fancy myself a young Tony Wonder, I'm curious about how it works.
I'd recommend diving into the source code to see exactly why this works (and I'm mobile so I'll be terse).
When you pass the string 'median' to transform, pandas resolves it behind the scenes via getattr to the method of that name, then behaves as if you had passed it a function.
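The equivalence can be checked with a small sketch. The data is made up, and the getattr line only mimics the idea of a name lookup; the exact internal dispatch differs between pandas versions.

```python
import pandas as pd

# Hypothetical data using the column names from the question
data = pd.DataFrame({
    "Segment": ["a", "a", "a", "b", "b"],
    "Metric": [1, 2, 9, 4, 6],
})

grouped = data.groupby("Segment")["Metric"]

# Passing the method name as a string ...
by_name = grouped.transform("median")

# ... gives the same values as passing a function that computes the median
by_function = grouped.transform(lambda s: s.median())

# Looking the method up by name, as a string-to-method lookup would do
by_getattr = getattr(grouped, "median")()

print(by_name.tolist())
print(by_getattr)
```

The getattr call returns one median per group, while transform broadcasts that per-group value back onto every row of the group, which is why the result can be assigned directly as a new column.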
