I'm completely new to Pandas, but I was given some code that uses it, and I came across a line that looked like the following:
df.loc[bool_array]['column_name']=0
Looking at that code, I would expect that if I ran df.loc[bool_array]['column_name'] to retrieve the values for the column after running said code, the resulting values would all be zero. In fact, however, this was not the case: the values remained unchanged. To actually change the values, I had to remove the .loc, like so:
df[bool_array]['column_name']=0
whereupon displaying the values did, in fact, show 0 as expected, regardless of whether I used .loc or not.
Everything I can find about .loc seems to indicate that it should behave essentially the same as simply doing [] when given a boolean array like this. What difference am I missing that explains this behavioral discrepancy?
EDIT: As pointed out by @QuangHoang in the comments, the following recommended code DOES work:
df.loc[bool_array, 'column_name']=0
which simplifies the question to "why doesn't the original code work"?
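For reference, here is a minimal sketch (with made-up data and an arbitrary mask) contrasting the chained form with the recommended single-call form:

import pandas as pd

# toy data just to illustrate; any frame and boolean mask will do
df = pd.DataFrame({'column_name': [1, 2, 3, 4]})
bool_array = df['column_name'] > 2

# chained form: df.loc[bool_array] typically returns a copy, so the assignment
# lands on that copy and df itself is usually left unchanged
# (pandas emits a SettingWithCopyWarning here)
df.loc[bool_array]['column_name'] = 0

# recommended single-call form: row selector and column label together,
# so the assignment writes directly into df
df.loc[bool_array, 'column_name'] = 0
print(df)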
Related
I am creating two dataframes that I set equal to each other based on an index field, so each frame has the same indices on both sides, and I sort them as well. I want to return the differences between these fields so as to catch any rows that have 'updated' since the last run, but I am getting a weird result.
df1.compare(df2)
I fail to see any differences here, and when I manually look at the IDs involved I do not see any changes at all. What could be causing this?
If you look at the below code, it's working. Can you please share both of your dfs so that we can assist you better?
I solved it and will post this in case someone else gets stuck. Apparently, nulls were being read into one dataframe as None, while the other dataframe actually contained the string 'None' rather than a null. You would never know this by pulling them into dataframes, as they look identical to the eye.
It took me a while to realize this, and hopefully this saves someone else some time.
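For anyone hitting the same thing, here is a small sketch (column names and values invented) of how a genuine null and the string 'None' can print identically yet still register as a difference:

import pandas as pd

df1 = pd.DataFrame({'value': [None, 'a']})    # a genuine null, read in as None
df2 = pd.DataFrame({'value': ['None', 'a']})  # the literal string 'None'

print(df1)               # both frames display the text "None" in row 0
print(df2)
print(df1.equals(df2))   # False: the null and the string are different values
print(df1.compare(df2))  # flags row 0 even though the printed output looks identical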
I've used TERR to make calculated columns and other types of data functions in Spotfire, and I'm happy to hear you can now use Python. I did a simple test to ensure things are working (x3 = x2*2) - that's literally the script I wrote in the Python data function window. I then set up the input parameters (x2 as a column) and the output parameter (x3) to be a new column. The values come out fine, but the newly calculated column comes out named x2(2). I looked at the input/output parameters and all the names are correct, yet the column still comes out named that way. My concern is that this is uber simple, yet the column is not being named what is in the script even though everything is set up correctly. There is even a YouTube video by a Spotfire employee where the same thing happens to them and they don't mention it at all.
Has anybody else run across this?
It does seem to differ from how the equivalent TERR data function works. I consulted with the Spotfire engineering team, and here is what they suggest. It has to do with how a Column input is handled internally in Python vs TERR. In both Python and TERR, inputs (and outputs) are passed over as a table: in TERR's case a data.frame, and in Python's case a pandas.DataFrame. In TERR's case, though, if the Data Function says the input is a Column, it is actually converted from a 1-column data.frame to a vector of the equivalent type; similarly, a Value is converted from its 1x1 data.frame to a scalar type. In Python, Value inputs are treated the same, but Column inputs are left as a pandas.Series, which retains the column name from the original input column.
Maybe you can try something different. You wouldn't want to convert it to a standard Python list, because in that case x2*2 would actually make the column twice as long rather than performing a vectorised arithmetic operation. But you could make it a plain NumPy array instead: try adding "x2 = x2.to_numpy()" at the top of your example and see if the result matches what you expected.
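Something along these lines is roughly what the script might end up looking like (only a sketch; x2 below is a stand-in for the Series Spotfire would inject, and the rename variant is just another option to consider, not something I have verified in Spotfire):

import pandas as pd

# stand-in for the Column input that Spotfire passes in as a pandas.Series
x2 = pd.Series([1.0, 2.0, 3.0], name='x2')

# suggestion from above: drop down to a plain NumPy array, so the result
# no longer carries the original column name
x3 = x2.to_numpy() * 2

# alternative (untested assumption): keep the Series but rename it explicitly
# x3 = (x2 * 2).rename('x3')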
I have the following dataframe, where I want to assign the bottom 1% value to a new column. When I do this calculation using the ".loc" notation, it takes around 10 seconds, whereas the alternative direct assignment takes only about 2 seconds.
import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100000000, 1), columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc searches along the entirety of the indices and columns of your df (in this case, only 1 column) across the whole axes, which is time consuming and largely redundant here, in addition to computing the quantile of df_temp['A'] (which is negligible as far as calculation time goes). Your direct assignment method, on the other hand, just evaluates df_temp['A'].quantile(0.01) and assigns it to df_temp['q']; it doesn't need to exhaustively search the indices/columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe: you're not trying to set anything on a copy of your data; it's being set directly on the data itself. It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df_temp = df_temp.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think) or the result, but it makes it explicit that you're assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).
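A rough sketch of the three variants side by side (a smaller frame here so it runs quickly; the names follow the question's example):

import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(1_000_000, 1), columns=list('A'))
q = df_temp['A'].quantile(0.01)

df_temp['q'] = q                        # direct assignment: fast, and fine here
df_temp.loc[:, 'q1_loc'] = q            # .loc over the full index: same result, slower
df_temp = df_temp.assign(q_assign=q)    # explicit: returns a new frame with the extra column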
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe, which can also take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish at the end of the script.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can avoid saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It basically does what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do it since it's implemented by pandas itself.
Notice that you need to use .copy() when assigning old_df, before making changes to your current df; otherwise you might pass the dataframe by reference and not by value.
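A short sketch of the pattern (the frame contents and the save call are placeholders for whatever you actually load and write):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})   # placeholder for the large frame you read in
old_df = df.copy()                    # snapshot by value, not by reference

# ... interpolation may or may not add columns here ...
df['b'] = df['a'] * 2                 # example modification

if not df.equals(old_df):             # skip the expensive save when nothing changed
    df.to_pickle('data.pkl')          # to_pickle is just a stand-in for however you save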
In Python, find and index are very similar methods used to look up values in a sequence type. find is used for strings, while index is for lists and tuples. Both return the lowest index (the index furthest to the left) at which the supplied argument is found.
For example, both of the following would return 1:
"abc".find("b")
[1,2,3].index(2)
However, one thing I'm somewhat confused about is that, even though the two methods are very similar and fill nearly the same role, just for different data types, they react very differently to attempting to find something that is not in the sequence.
"abc".find("d")
returns -1 to signify 'not found', while
[1,2,3].index(4)
raises an exception.
Basically, why do they have different behaviors? Is there a particular reason, or is it just a weird inconsistency for no particular reason?
Now, I'm not asking how to deal with this – obviously, a try/except block or a conditional 'in' check would work. I'm simply asking what the rationale was for making the behavior different in just that particular case. To me, it would make more sense to have one consistent way of saying "not found", for consistency's sake.
Also, I'm not asking for opinions on whether the reason is a good reason or not – I'm simply curious about what the reason is.
Edit: Some have pointed out that strings also have an index method that works like the index method for lists, which I'll admit I didn't know. But that just makes me wonder why, if strings have both, lists only have index.
This has always been annoying ;-) Contrary to one answer, there's nothing special about -1 with respect to strings; e.g.,
>>> "abc"[-1]
'c'
>>> [2, 3, 42][-1]
42
The problem with find() in practice is that -1 is in fact not special as an index. So code using find() is prone to surprises when the thing being searched for is not found - it was noted even before Python 1.0.0 was released that such code often went on to do a wrong thing.
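For example (the classic slicing trap):

>>> s = "abc"
>>> s[s.find("d"):]   # "d" is missing, so find() returns -1 and this is s[-1:]
'c'

The intent was "everything from the first 'd' onward", but instead of an error or an empty string you silently get the last character.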
No such surprises occur when index() is used instead - an exception can't be ignored silently. But setting up try/except for such a simple operation is not only annoying, it adds major overhead (extra time) for what "should be" a fast operation. Because of that, string.find() was added in Python 0.9.9 (before then, only string.index() could be used).
So we have both, and that persists even into Python 3. Pick your poison :-)