I have some simple code:
for x in range(df2.shape[0]):
    df1.loc[df1['df1_columnA'] == df2.iloc[x]['df2_columnB']]['df1_columnB']
This code goes through the cells located at (iloc[x], 'df2_columnB') of df2 and, whenever that cell's value matches a value in df1['df1_columnA'], it selects those rows with .loc and accesses their 'df1_columnB' values.
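To make that concrete, here is a tiny made-up example of what I believe each step produces (the data is invented purely for illustration):

import pandas as pd

df1 = pd.DataFrame({'df1_columnA': [1, 2, 2], 'df1_columnB': ['a', 'b', 'c']})
value = 2  # plays the role of df2.iloc[x]['df2_columnB']

mask = df1['df1_columnA'] == value  # vectorized comparison -> boolean Series
print(mask.tolist())                # [False, True, True]

matches = df1.loc[mask]             # keeps only the rows where the mask is True
print(matches['df1_columnB'])       # the df1_columnB values of those rows

So the comparison itself is vectorized over the whole column at once, while the outer for loop over df2 still runs row by row in plain Python.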
My question is: where can I find out how this works internally (or would someone be willing to explain)? Before I knew about this way of comparing, I had a couple of nested for loops and other logic to find the values. I've tried searching through GitHub and other online resources but I can't find anything relevant. I'm simply curious to understand how it compares to my own initial code and/or whether vectorization is used, etc.
Thanks
Related
So I'm fairly new to coding, having only written relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Sometimes the number under ID is an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like below:
I'm using pandas to read the data and do the manipulations I need, so my question isn't so much how to do it, but rather how it should best be done.
The way my code is written is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. It's very basic if statements like:
if (the ID column has "Note" in it) then it's a basic dimension
if (the ID column has Roughness in it) then it's an Ra value
if (the ID column has Position in it) then it's a true position, etc.
The problem is I'm not really sure what the "correct" way to do it would be in terms of making it more efficient and simpler. I currently have a series of 30+ if statements and ways to handle the different situations I've run into so far. Virtually all the code I've written follows this overly specific and not very general approach; even though it works, I personally find it overcomplicated, but I'm not really sure which capabilities of Python/pandas I'm missing and not utilizing to simplify my code.
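In rough outline, the current pattern looks something like this (the dataframe, column name and category labels are made up for illustration):

import pandas as pd

# a tiny made-up frame standing in for the real document
df = pd.DataFrame({"ID": ["1 Note 2", "3.5 Roughness 4", "5 Position 0.1"]})

def classify(id_text):
    # each branch mirrors one of the 30+ cases described above
    if "Note" in id_text:
        return "basic dimension"
    if "Roughness" in id_text:
        return "Ra value"
    if "Position" in id_text:
        return "true position"
    # ... and so on for the remaining cases ...
    return "unknown"

df["category"] = df["ID"].apply(classify)
print(df)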
Since you need to test what the variable in ID is and do some stuff accordingly, you most probably can't avoid the if statements. What I suggest you do, since you have already written the code, is to reform the database. Unless there is a very specific reason for having a database with a structure like this, you should change it ASAP.
To be specific: add an (auto)increment unique number as the ID, and break the 3 data points of the ID column into 3 separate columns.
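A sketch of that suggestion, assuming the three parts of the ID are whitespace-separated (the dataframe, column names and category labels below are invented for illustration):

import pandas as pd

df = pd.DataFrame({"ID": ["1 Note 2", "3.5 Roughness 4", "5 Position 0.1"]})

# split "Number Word Number" into three separate columns
parts = df["ID"].str.extract(r"^\s*(\S+)\s+(.+?)\s+(\S+)\s*$")
parts.columns = ["id_num1", "id_word", "id_num2"]
df = pd.concat([df, parts], axis=1)

# the word column can then drive a lookup table instead of a long chain of ifs
category_map = {"Note": "basic dimension",
                "Roughness": "Ra value",
                "Position": "true position"}
df["category"] = df["id_word"].map(category_map)
print(df)

The (auto)increment unique number could then simply be the dataframe's index (e.g. via df.reset_index()).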
I am creating two dataframes that I set equal to each other based on an index field, so each frame has the same indices on both sides, and I sort them as well. I want to return the differences between these fields, so as to catch any rows that have 'updated' since the last run. But I am getting a weird result.
df1.compare(df2)
I fail to see any differences here, and when I manually look at the IDs involved I do not see any changes at all. What could be causing this?
If you look at the code below, it's working.
Can you please share both of your dfs so that we can assist you better?
I solved it and will post this in case someone else gets stuck. Apparently nulls were being interpreted as None when read into one dataframe, but the other dataframe actually had the string 'None' rather than a null. You would never know this by pulling them into dataframes, as they look identical to the eye.
It took me a while to realize this, and hopefully this saves someone else some time.
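For anyone hitting the same thing, here is a minimal sketch of the symptom and one way to normalize it before comparing (the frames below are invented just to reproduce the issue):

import numpy as np
import pandas as pd

# one frame has a real null, the other the literal string 'None';
# printed side by side they look identical
df1 = pd.DataFrame({"val": [None, "a"]}, index=[1, 2])
df2 = pd.DataFrame({"val": ["None", "a"]}, index=[1, 2])
print(df1.compare(df2))       # flags the first row as a difference

# spot literal 'None' strings hiding in object columns
print((df2 == "None").sum())

# turn them into real missing values, then compare again
df2 = df2.replace("None", np.nan)
print(df1.compare(df2))       # should now come back empty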
I have a database with 1 million rows, and based on a user's input I need to find them the most relevant matches.
The way the code was written in the past was by using the fuzzywuzzy library. A ratio between 2 strings was calculated in order to show how similar the strings were.
The problem with that is that we had to run the ratio function for each row of the database, meaning 1 million function calls, and the performance is really bad. We never thought we'd get to the point of having this much data.
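Roughly, the existing approach looks like this (the function and variable names are invented; only fuzz.ratio is from the actual library):

from fuzzywuzzy import fuzz

def search(query, names):
    # one ratio call per database row -> about 1 million calls per search
    scores = [(name, fuzz.ratio(query, name)) for name in names]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:5]

print(search("aple inc", ["Apple Inc", "Alphabet Inc", "Amazon.com Inc"]))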
I am looking for a better algorithm or solution for handling the search in this case. I've stumbled upon something called TF-IDF (Term Frequency-Inverse Document Frequency). It was described as a solution for "fuzzy matching at scale" and as being way faster.
Unfortunately I couldn't wrap my mind around it and completely understand how it works, and the more I read about it, the more I think it is not what I need, since all the examples I've seen try to find similar matches between 2 lists, not between 1 string and 1 list.
So, am I on the wrong path? And if so, could you please give me some ideas on how I can handle this scenario? Unfortunately, Full Text Search only works with exact matches, so in our case fuzzy is definitely the way we want to go.
And if you're going to propose using a separate search engine: we don't want to add a new tool to our infrastructure just for this.
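For what it's worth, here is a minimal sketch of the TF-IDF idea for this exact scenario, assuming scikit-learn is available (all names below are invented). The "two lists" in the examples are just the database strings and the queries; a single user input is simply a one-element list transformed against an index built once from the database:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# in practice this would be the 1 million strings from the database
names = ["Acme Corporation", "Acme Corp", "Globex LLC", "Initech Inc"]

# character n-grams make TF-IDF behave like a fuzzy matcher on short strings
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
name_matrix = vectorizer.fit_transform(names)      # built once, offline

def top_matches(query, k=3):
    query_vec = vectorizer.transform([query])      # one cheap transform per search
    scores = linear_kernel(query_vec, name_matrix).ravel()
    best = scores.argsort()[::-1][:k]
    return [(names[i], float(scores[i])) for i in best]

print(top_matches("Acme Copr"))

The expensive fit_transform happens once up front; each search is then a single sparse transform plus one matrix product instead of a million ratio calls.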
I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large excel-files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some tasks for this.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on Python, numpy, pandas and also matplotlib, and I'm still clueless on how to set this up.
Any help appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are sub-categories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders are AA3_, AA4_ and so on.
Then I tried to do this slicing in the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, the pandas groupby option seems to be a good fit for the problem. The groupby function does exactly what its name says: it groups the parts of the dataframe you tell it to.
Since you mentioned grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
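Given the suffix issue mentioned in the update, one possible variation (a sketch, assuming the sub-locality suffix always follows an underscore as in AA3_f; the species columns below are invented) is to group on a derived key rather than on the raw column:

import pandas as pd

# a tiny invented stand-in for the real sheet
eng = pd.DataFrame({"Locality": ["AA3_f", "AA3_lf", "AA3_s", "AA4_f"],
                    "herring_gull": [2, 1, 0, 5],
                    "common_gull": [3, 0, 4, 1]})

main_locality = eng["Locality"].str.split("_").str[0]   # "AA3_f" -> "AA3"
totals = eng.groupby(main_locality).sum(numeric_only=True)
print(totals)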
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0, 4, 1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result:
Table
I am fairly new to Python. I wrote a script and I am surprised by the time it takes to go through one particular loop compared to the rest of my code.
Can someone tell me what is inefficient in the code I wrote, and maybe how to improve its speed?
Here is the loop in question (BT_Histos and Histos_Last_Rebal are dataframes with dates as the index and columns of floats; Portfolio and Portfolio_Last_Rebal are dataframes with the same index as the previous two that I am filling through the loop; weights is just a list):
Udl_Perf = BT_Histos / Histos_Last_Rebal - 1
for i in range(1, len(BT_Histos.index)):
    # tricky because isin doesn't work with timestamp
    test_date = pd.Series(Portfolio.index[i-1])
    if test_date.isin(Rebalancing_Dates)[0]:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = Portfolio.loc[Portfolio.index[i-1], 'PortSeries']
    else:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i-1], 'PortSeries']
    Portfolio.loc[Portfolio.index[i], 'PortSeries'] = Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] * (1 + sum(Udl_Perf.iloc[i] * weights))
Thanks!
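For reference, here is a sketch of one way the same recurrence could be computed with everything that doesn't depend on the previous step precomputed outside the loop. This is only an illustration, not the original code; the dataframes below are invented stand-ins with the question's variable names, and it assumes weights lines up with the columns of the histories:

import numpy as np
import pandas as pd

# tiny invented data just so the sketch runs; shapes and values are arbitrary
dates = pd.date_range("2023-01-02", periods=6, freq="B")
BT_Histos = pd.DataFrame(np.random.rand(6, 2) + 1, index=dates, columns=["A", "B"])
Histos_Last_Rebal = BT_Histos.shift(1).bfill()
Portfolio = pd.DataFrame({"PortSeries": [100.0] + [np.nan] * 5}, index=dates)
Portfolio_Last_Rebal = Portfolio.copy()
Rebalancing_Dates = [dates[2]]
weights = [0.5, 0.5]

Udl_Perf = BT_Histos / Histos_Last_Rebal - 1

# precompute everything that does not depend on the previous iteration
is_rebal = Portfolio.index.isin(Rebalancing_Dates)          # DatetimeIndex.isin accepts timestamps
weighted_ret = (Udl_Perf * weights).sum(axis=1).to_numpy()  # sum(Udl_Perf.iloc[i] * weights) for every i

port = Portfolio["PortSeries"].to_numpy().copy()
last_rebal = Portfolio_Last_Rebal["PortSeries"].to_numpy().copy()

# the recurrence still needs a loop (each value depends on the previous one),
# but it now works on plain NumPy values instead of repeated .loc lookups
for i in range(1, len(port)):
    last_rebal[i] = port[i - 1] if is_rebal[i - 1] else last_rebal[i - 1]
    port[i] = last_rebal[i] * (1 + weighted_ret[i])

Portfolio["PortSeries"] = port
Portfolio_Last_Rebal["PortSeries"] = last_rebal
print(Portfolio)

Most of the time in the original loop likely goes into the per-iteration pd.Series construction, the isin call and the .loc label lookups, all of which can be hoisted out as above.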
If you really want it to be fast, then first implement it with a while loop.
Second, for the length variable you will use, define its type in advance using the Mypy library; for this you need Python 3.5+ installed.
Also, if every iteration is independent of the others, you can use multithreading with the threading library; there is an example in this git repo.