Being new to Python/pandas, I'm curious about the significance and use of "coerce" in pandas syntax.
Is it used only for conversion (forced conversion, I would say), or does it have use cases elsewhere?
Thanks in advance. (An example would be highly valuable.)
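A minimal sketch of the most common place "coerce" shows up, assuming the question refers to the errors="coerce" argument accepted by conversion helpers such as pd.to_numeric and pd.to_datetime. With that setting, values that cannot be converted become NaN (or NaT for dates) instead of raising an error. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample data: one entry cannot be parsed as a number.
raw = pd.Series(["1", "2.5", "abc"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
numbers = pd.to_numeric(raw, errors="coerce")

# The same argument exists on pd.to_datetime; bad entries become NaT.
dates = pd.to_datetime(pd.Series(["2017-09-01", "not a date"]), errors="coerce")

print(numbers)
print(dates)
```

Without errors="coerce", both calls would raise on the bad entry, so the option is mainly useful for cleaning messy input before further processing.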
I'm totally new to this forum and new to Python, so I would appreciate it if anybody could help me.
I am trying to build a script in Python based on data that I have in an Excel spreadsheet. I'd like to create an app/script where I can estimate the pregnancy due date and the conception date (for animals) based on measurements that I have taken during ultrasounds. I am able to estimate it with a calculator, but it takes some conversions (from cm to mm, and from days to months). To do that in Python, I figured I would create a variable for each measurement and set each variable equal to its value in days (an integer).
Here is the problem: the main column of my data set is the actual measurement of the babies in mm (known as BPD), but the BPD can be a whole number like 5 mm or a decimal like 6.4 mm. Since I can't name a variable with a period or a dot in it, what would be the best way to handle my data and assign variables to it? I have tried BPD_4.8= 77days, but Python tells me there's a syntax error (I'm sure lol), but if I type BPD_5= 78 it seems to work. I haven't mastered lists and tuples, nor do I really know how to use them properly, so I'll keep looking online and see what happens.
I'm sure it's something super silly for you guys, but I'm really pulling my hair out and I have nothing but 2 inches of hair lol
This is what my current screen looks like..HELP :(
Howdy and welcome to StackOverflow. The short answer is:
Use a better data structure
You really shouldn't be encoding valuable information into variable names like that. What's going to happen if you want to calculate something with your BPD measurements? Or when you have duplicate BPDs?
This is bad practice. It might seem like a lot of effort to take the time to figure out how to do this properly, but it will be more than worth it if you intend to continue to use Python :)
I'll give you a couple options...
Option 1: Use a dictionary
Dictionaries are common data structures in any language, so it pays to know how to use them.
Dictionaries hold information about an object using key/value pairs. For example you might have:
measurements = {
    'animal_1': {'bpd': 4.6, 'due_date_days': 55},
    'animal_2': {'bpd': 5.2, 'due_date_days': 77},
}
An advantage of dictionaries is that they are explicit, i.e. values have keys which explicitly identify what the information is assigned to. E.g. measurements['animal_1']['due_date_days'] would return the due date (in days) for animal 1.
A disadvantage is that it will be harder to compute information / examine relationships than you'll be used to in Excel.
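As a rough sketch of how a dictionary also solves the original naming problem: the BPD value itself can be the key, so no dot ever needs to appear in a variable name. The two entries below are the ones mentioned in the question (4.8 mm and 77 days, 5 mm and 78 days); the names bpd_to_days and days_for_bpd are made up for illustration.

```python
# Lookup table keyed by BPD in mm; values are days.
# Only the two pairs quoted in the question are filled in here.
bpd_to_days = {4.8: 77, 5.0: 78}


def days_for_bpd(bpd_mm):
    """Return the days for a BPD in mm, or None if it is not in the table."""
    # Rounding to one decimal guards against float noise such as 0.48 * 10.
    return bpd_to_days.get(round(bpd_mm, 1))


print(days_for_bpd(4.8))        # measurement already in mm
print(days_for_bpd(0.48 * 10))  # measurement taken in cm, converted to mm
print(days_for_bpd(6.4))        # not in the table yet
```

Returning None for unknown measurements is just one possible design; raising an error or interpolating between neighbouring entries would be equally reasonable.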
Option 2: Use Pandas
Pandas is a data science library for Python. It's fast, has similar functionality to Excel and is probably well suited to your use case.
I'd recommend you take the time to do a tutorial or two. If you're planning to use Python for data analysis then it's worth using the language and any suitable libraries properly.
You can check out some Pandas tutorials here: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
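A minimal sketch of what Option 2 could look like, assuming the spreadsheet boils down to one column of BPD values in mm and one column of days. The column names are invented here, and the two rows are the values quoted in the question; in practice the table would more likely be loaded from the Excel file with pd.read_excel.

```python
import pandas as pd

# Hypothetical table; column names are assumptions, values are from the question.
table = pd.DataFrame({"bpd_mm": [4.8, 5.0], "days": [77, 78]})

# Unit conversions become simple column arithmetic.
table["bpd_cm"] = table["bpd_mm"] / 10

# Index the days by BPD so a measurement can be looked up directly.
days_by_bpd = table.set_index("bpd_mm")["days"]

print(table)
print(days_by_bpd.loc[4.8])
```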
Good luck!
I am looking to truncate a pandas Series but am surprised that the function does not take a pd.Period as an argument.
The following code does not work.
import pandas as pd

pdi = pd.PeriodIndex([pd.Period('2017-09-01'), pd.Period('2017-09-02'), pd.Period('2017-09-03')])
ser = pd.Series([1, 2, 3], index=pdi)
ser.truncate(before=pd.Period("2017-09-03"))
TypeError: <class 'pandas._libs.tslibs.period.Period'> is not convertible to datetime
This suggests to me that PeriodIndex is just an overlay on an underlying DatetimeIndex, as opposed to being implemented directly. I'm still trying to wrap my head around pandas (a great library!), but this seems like a design problem, and I would appreciate some insight from someone more knowledgeable. As a user, I'm just hoping for a better understanding, as it will help me with how I approach future problems.
Anyways, the workaround is simple but conceptually messy in my opinion.
ser.truncate(before=str(pd.Period("2017-09-03")))
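For comparison, here is a sketch of two alternatives that avoid truncate altogether, assuming the goal is simply to keep everything from a given period onwards. Neither is claimed to be the intended pandas idiom; they are just other ways to get the same slice. The index is built with pd.period_range instead of a list of Period objects, which is an equivalent construction for this daily data.

```python
import pandas as pd

# Same three daily periods as in the question.
pdi = pd.period_range("2017-09-01", periods=3, freq="D")
ser = pd.Series([1, 2, 3], index=pdi)

cutoff = pd.Period("2017-09-03")

# Alternative 1: boolean mask, comparing the PeriodIndex to a Period directly.
by_mask = ser[ser.index >= cutoff]

# Alternative 2: label slicing with a string, which PeriodIndex understands.
by_slice = ser.loc["2017-09-03":]

print(by_mask)
print(by_slice)
```

The boolean mask keeps the Period object in the code, which may feel less messy than converting it to a string first.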
Thanks in advance!
I'm using xlwings with numpy and unittest in Python to test an Excel spreadsheet. However, when xlwings imports a cell that contains #N/A, the resulting value is -2146826246.
I understand that this may have something to do with xlwings importing values as float, and there may not be a good float representation of #N/A.
I want to compare #N/A with nan. Any advice on how to accomplish this?
For anyone who may stumble across the same problem in the future: I used a very crude method of building a dictionary that maps the error numbers to the values I wanted to return.
error_dict = {-2146826281: np.inf, -2146826246: np.nan}
If anyone has a more elegant solution, please let me know!
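A small sketch of how such a dictionary might be applied to a single value read from a sheet, assuming the value arrives as a plain scalar. The helper name clean_cell is made up. Since nan never compares equal to itself, the check uses math.isnan rather than ==.

```python
import math

import numpy as np

# Error codes and replacements taken from the answer above.
error_dict = {-2146826281: np.inf, -2146826246: np.nan}


def clean_cell(value):
    """Map a known Excel error code to its replacement; pass other scalars through."""
    return error_dict.get(value, value)


cleaned = clean_cell(-2146826246)
print(math.isnan(cleaned))      # compare against nan with isnan, not ==
print(clean_cell(-2146826281))
print(clean_cell(3.5))
```

In a unittest, numpy.testing.assert_equal is another option, since it treats nan as equal to nan.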
Just to add another option: if you brought in (or converted) your Excel data as a pandas DataFrame, you can always use replace to convert the old #N/A values to NaNs, which are a little easier to deal with in Python/pandas. Note that replace returns a new DataFrame, so assign the result:
df = df.replace(-2146826246, float('nan'))
When converting a range to a DataFrame, use this option so that empty cells come in as NaN:
.options(empty=np.nan)
Then treating the different error codes as NaN is very easy:
df[df == -2146826246] = np.nan
This is very useful for avoiding errors when applying functions and calculations to the data frame.
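The two DataFrame approaches from this thread (replace and boolean-mask assignment) can be sketched side by side as follows. The frame is built by hand here instead of being read through xlwings, so the column names and values are made up; only the error code is taken from the question.

```python
import numpy as np
import pandas as pd

NA_CODE = -2146826246  # the value xlwings returned for #N/A in the question

# Hand-built stand-in for data read from Excel.
df = pd.DataFrame({
    "x": [1.0, float(NA_CODE), 3.0],
    "y": [float(NA_CODE), 5.0, 6.0],
})

# Approach 1: replace returns a new frame and leaves df untouched.
replaced = df.replace(NA_CODE, np.nan)

# Approach 2: boolean-mask assignment modifies the frame in place,
# so it is applied to a copy here.
masked = df.copy()
masked[masked == NA_CODE] = np.nan

print(replaced)
print(masked)
```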
I am trying to figure out the right way to plot pandas DataFrames, as there seem to be multiple working syntaxes coexisting. I know pandas is still developing, so my question is: which of the methods below is the most future-proof?
Let's say I have a DataFrame df. I could plot it as a histogram using the following pandas API calls.
df.plot(kind='hist')
df.plot.hist()
df.hist()
Looking at the documentation, options 1 and 2 seem to be pretty much the same thing, in which case I prefer df.plot.hist() as I get auto-complete with the plot name. 'hist' is still pretty easy to spell as a string, but 'candlestick_ohlc', for example, is pretty easy to typo...
What confuses me is the third option. It does not have all the options of the first two, and its API is different. Is that one some legacy thing, or the actual right way of doing things?
The recommended method is df.plot.<plot_type>() (for example df.plot.hist()). This avoids ambiguity in the keyword arguments and aids tab-completion; see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0170-plot.
The .hist method still works for legacy support. I don't believe there are plans to remove it, but it's recommended to use plot.hist for future compatibility.
Additionally, it simplifies the API somewhat, as it was a bit problematic to use kind=graph_type to specify the graph type and ensure the params were correct for each graph type. The kwargs for df.plot.<plot_type>() are specified here: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting which should cover all the args in hist.
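The three calls from the question can be compared side by side in a short sketch. The random data and column names are made up, and the Agg backend is selected only so the code runs without a display. One visible difference: the two df.plot variants draw all columns on a single Axes, while df.hist() returns an array of Axes with one subplot per column.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

# Made-up data: two numeric columns of random values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})

ax_kind = df.plot(kind="hist")  # option 1: kind passed as a string
ax_accessor = df.plot.hist()    # option 2: accessor method, tab-completable
axes_grid = df.hist()           # option 3: one subplot per column

print(type(ax_kind).__name__, type(ax_accessor).__name__)
print(axes_grid.shape)
```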
I've always considered df.hist() to be the graphical equivalent to df.describe(): a quick way of getting an overview over the distribution of numeric data in a data frame. As this is indeed useful, and also used by a few people as far as I know, I'd be surprised if it became deprecated in a future version.
In contrast, I understand the df.plot method to be intended for actual data visualization, i.e. the preferred method if you want to tease a specific bit of information out of your data. Consequently, there are more arguments that you can use to modify the plot so that it fits your purpose, whereas with df.hist(), you can get useful distributional plots even with the default settings.
Thus, to answer your question: as I see it, both functions serve different purposes, both can be useful depending on your needs, and both should be future-safe.
I'm trying to clearly understand for which type of data transformation the following functions in pandas should be used:
replace
map
transform
Can anybody provide some clear examples so I can better understand them?
Many thanks :)
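As far as I understand, replace substitutes specific values with other values (often used to recode placeholders or missing-value markers), transform is typically used with groupby operations and returns a result with the same shape as its input, and map applies an element-wise mapping to a Series or Index.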
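A small sketch that contrasts the three on a toy frame (the data and column names are made up). The main practical difference between replace and map with a dictionary is what happens to values that are not in the dictionary: replace leaves them alone, while map turns them into NaN.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 10]})

# replace: swap listed values, leave everything else unchanged.
replaced = df["group"].replace({"a": "A"})

# map: look every element up in the mapping; missing keys become NaN.
mapped = df["group"].map({"a": "A"})

# transform: per-group computation broadcast back to the original rows.
group_mean = df.groupby("group")["value"].transform("mean")

print(replaced.tolist())
print(mapped.tolist())
print(group_mean.tolist())
```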