Why is pandas compare not working when comparing two dataframes? - python

I am creating two dataframes that I align with each other on an index field, so each frame has the same indices on both sides, and I sort them as well. I want to return the differences between them, so as to catch any rows that have 'updated' since the last run. But I am getting a weird result.
df1.compare(df2)
I fail to see any differences here, and when I manually look at the IDs involved I do not see any changes at all. What could be causing this?

If you look at the code below, it's working. Can you please share both of your dfs so that we can assist you better?

I solved it and will post this in case someone else gets stuck. Apparently nulls were being read into one dataframe as None, while the other dataframe actually contained the string 'None' rather than a null. You would never know this by pulling these into dataframes, as they look identical to the eye.
It took me a while to realize this, and hopefully it saves someone else some time.
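A minimal reproduction of that trap (the column names here are made up for illustration):
import pandas as pd

# one frame holds a real null, the other holds the literal string 'None'
df1 = pd.DataFrame({'id': [1, 2], 'status': ['open', None]})
df2 = pd.DataFrame({'id': [1, 2], 'status': ['open', 'None']})

# printed side by side the frames look identical to the eye...
print(df1)
print(df2)

# ...but compare() still flags the second row, because a null is not equal to the string 'None'
print(df1.compare(df2))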


Logic behind loc in pandas

I have some simple code:
for x in range(df2.shape[0]):
    df1.loc[df1['df1_columnA'] == df2.iloc[x]['df2_columnB']]['df1_columnB']
This code goes through the cells of df2 located at (iloc[x], 'df2_columnB'), and whenever that cell's value matches a value in df1['df1_columnA'], it accesses that row's value in ['df1_columnB'] via .loc.
My question is, where can I find out how this works internally (or would someone be willing to explain)? Before I knew about this way of comparing I had a couple of nested for loops and other logic to find the values. I've tried searching through GitHub and other online resources but I can't find anything relevant. I'm simply curious to understand how it compares to my own initial code and/or whether vectorization is used, etc.
Thanks
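For comparison, the same lookup can be expressed without the Python-level loop; this is only a sketch using the made-up column names from the question, merging the two frames on the matching columns so the comparison happens in vectorized code rather than once per row:
import pandas as pd

# hypothetical frames with the column names used above
df1 = pd.DataFrame({'df1_columnA': ['a', 'b', 'c'], 'df1_columnB': [1, 2, 3]})
df2 = pd.DataFrame({'df2_columnB': ['b', 'c']})

# one merge replaces the per-row boolean masks built inside the loop
matched = df2.merge(df1, left_on='df2_columnB', right_on='df1_columnA', how='left')
print(matched[['df2_columnB', 'df1_columnB']])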

Weird df.sort_values result - randomly leaves multiple rows empty but missing values are misplaced a row down in the wrong columns

I've run into a very weird problem when using pandas to sort my data frame using df.sort_values on my date-time column.
In some sense the code works: it does sort the rows by date. However, in doing so it cuts some of the rows in half, leaving one half of the row where it should be and the other half empty; that other half ends up in the next row down, in the wrong columns.
It's easier to understand from the screenshots.
Pre-Sorted Data
Post-Sorted Data
It's difficult to see from the Post-Sorted data picture, but not all rows are treated this way (only one isn't in the pic, second from the bottom). Some rows turn out just fine and others don't; it seems very random.
The problem code is shown below.
df = pd.read_csv('Pre-Sort.csv', low_memory=False, skiprows=0, lineterminator='\n')
df = df.sort_values(by="created")
df.to_csv('Post-Sort.csv')
I've tried using inplace=True and ignore_index=True as well but both returned the same results.
I'm worried this is a problem with the data as I've used df.sort_values before with virtually the same data and it worked fine.
Originally, with the data that worked, I realized that for some reason two months' worth of data (roughly 15% of the total) vanished. The culprit was merging two data frames, both containing all months, with df.append. For some reason, df.append would consistently skip these two months, so in the end I merged the files through the Mac terminal, which kept the missing months.
This is, as far as I know, the only major difference between the two times I tried the code. It could be a red herring, though; I had to redo many of the operations from scratch, so I might have done something differently than before.
Also, this changes the file size from 192MB to 210.3MB, which probably shouldn't happen with a sort.
I just need to sort the data so I can resample it into daily variables, so if anyone knows a way to resample without needing to sort that would work for me just as well.
I think you're having problems with indexes (both in merging and sorting).
You can try the code below:
df = pd.read_csv('Pre-Sort.csv', low_memory=False, skiprows=0, lineterminator='\n')
# no assignment back here: sort_values with inplace=True returns None, which is what caused the NoneType issue
df.sort_values(by='created', inplace=True, ignore_index=True)
df.to_csv('Post-Sort.csv')
Mostly solved.
The problem wasn't the sort itself but a pre-existing problem with the dataset. The displacement issue was already there but was only made apparent when the data was mixed together. I'm still trying to figure out exactly where in the data handling it went wrong, but I think it's a delimiter issue.
So far it's an equally confusing problem, which I'll likely have to create another post to fix.
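If it is a delimiter or quoting problem, one quick way to confirm that is to count the fields the csv module sees on each raw line before handing the file to pandas; a sketch along those lines (only the file name is taken from the question, the rest is an assumption):
import csv
from collections import Counter

# count how many fields each line of the raw file contains;
# rows that were split or merged show up as outlying field counts
with open('Pre-Sort.csv', newline='') as f:
    field_counts = Counter(len(row) for row in csv.reader(f))

print(field_counts)  # a healthy file should show a single field count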

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large excel-files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some tasks for this.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on python, numpy, pandas and also matplotlib, and I'm still clueless on how to set this up.
Any help appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are subcategories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders are AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby call. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, DataFrame.groupby seems to be a good fit for the problem. The groupby function does exactly what its name suggests: it groups the parts of the dataframe you ask it to.
Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix:
eng["Locality"] = eng["Locality"].str.slice(0,4,1)
Used the groupby function:
sum = eng.groupby(["Locality"]).sum()
End result:
Table
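For reference, the same totals can be produced without overwriting the Locality column, by grouping on a derived key; a small sketch, assuming IDs shaped like AA3_f where everything before the underscore is the main locality (the species columns and counts are invented):
import pandas as pd

eng = pd.DataFrame({
    'Locality': ['AA3_f', 'AA3_lf', 'AA3_s', 'AA4_f'],
    'HerringGull': [2, 0, 5, 1],
    'CommonGull': [1, 3, 0, 4],
})

# derive the main locality instead of editing the column in place
main_locality = eng['Locality'].str.split('_').str[0]
totals = eng.groupby(main_locality).sum(numeric_only=True)
print(totals)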

Python Pandas: Dataframe.loc[] vs dataframe[] behavior difference?

I'm completely new to Pandas, but I was given some code that uses it, and I came across a line that looked like the following:
df.loc[bool_array]['column_name']=0
Looking at that code, I would expect that if I ran df.loc[bool_array]['column_name'] to retrieve the values for the column after running said code, the resulting values would all be zero. In fact, however, this was not the case: the values remained unchanged. To actually change the values, I had to remove the .loc, like so:
df[bool_array]['column_name']=0
whereupon displaying the values did, in fact, show 0 as expected, regardless of whether I used .loc or not.
Everything I can find about .loc seems to indicate that it should behave essentially the same as simply doing [] when given a boolean array like this. What difference am I missing that explains this behavioral discrepancy?
EDIT: As pointed out by @QuangHoang in the comments, the following recommended code DOES work:
df.loc[bool_array, 'column_name']=0
which simplifies the question to "why doesn't the original code work"?
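For what it's worth, the original line is an example of chained indexing: df.loc[bool_array] returns a new object, and the following ['column_name'] = 0 can end up writing into a temporary copy rather than into df, which is exactly why pandas recommends the single-step form from the edit. A small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'column_name': [1, 2, 3, 4]})
bool_array = df['column_name'] > 2

# chained assignment: the boolean selection returns a copy, so the write
# goes to that copy and df stays unchanged (pandas usually raises a
# SettingWithCopyWarning here)
df.loc[bool_array]['column_name'] = 0
print(df)

# recommended single-step form: one .loc call does selection and assignment
df.loc[bool_array, 'column_name'] = 0
print(df)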

Pandas DataFrame - Test for change/modification

Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process a couple of hours in length. Hence, I save the dataframe, which also can take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish when the script terminates.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can avoid saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It does basically what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do it since it's implemented by pandas itself.
Note that you need to use .copy() when assigning old_df before making changes to your current df; otherwise old_df may just be another reference to the same object rather than an independent copy.
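Put together, the guard might look something like this sketch (the file name and CSV round-trip are placeholders for whatever save path is actually used):
import pandas as pd

df = pd.read_csv('large_frame.csv')   # placeholder file name
old_df = df.copy()                    # snapshot before any processing

# ... row-by-row interpolation / new columns may be added here ...

# only pay the save cost when something actually changed
if not df.equals(old_df):
    df.to_csv('large_frame.csv', index=False)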
