Compare two PySpark dataframes and modify one of them? - python

I can't find a Sparkified way to do this, and was hoping some of you data experts out there might be able to help:
I have two dataframes:
DF 1:
item_list
[1,2,3,4,5,6,7,0,0]
[1,2,3,4,5,6,7,8,0]

DF 2:
item_list
[3,0,0,4,2,6,1,0,0]
I want to return a new dataframe built from DF 2: wherever DF 2 is zero but DF 1 is non-zero at that index, put a 1; otherwise keep DF 2's value.
Result:
item_list
[3,1,1,4,2,6,1,1,0]
This is fairly easy to do in standard python. How can I do this in Spark?

Even though you are using Spark, it doesn't necessarily mean you have to solve everything with Spark methods alone.
I would suggest analysing the problem and choosing the most approachable solution. Since you are using PySpark and you essentially have two lists, you can achieve this easily in plain Python (as you mentioned) alongside Spark, and in the current scenario that may well be the more practical way to do it.
Spark comes into play when plain Python or Scala cannot handle the job on its own, or when Spark's libraries make your life noticeably easier.
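For illustration, here is a minimal plain-Python sketch of the rule from the question; the function and variable names are my own, and the rule is interpreted from the example output:

def merge_row(df2_row, df1_rows):
    result = []
    for i, value in enumerate(df2_row):
        if value != 0:
            # keep DF 2's non-zero value as-is
            result.append(value)
        elif any(row[i] != 0 for row in df1_rows):
            # DF 2 is zero here but DF 1 is non-zero at this index
            result.append(1)
        else:
            result.append(0)
    return result

df1_rows = [[1, 2, 3, 4, 5, 6, 7, 0, 0],
            [1, 2, 3, 4, 5, 6, 7, 8, 0]]
df2_row = [3, 0, 0, 4, 2, 6, 1, 0, 0]
print(merge_row(df2_row, df1_rows))  # [3, 1, 1, 4, 2, 6, 1, 1, 0]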

Related

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large Excel files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some of these tasks.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on python, numpy, pandas and also matplotlib... still clueless on how to set this up.
Any help appreciated. Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are sub-categories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders were AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, pandas' groupby seems to be a good fit for the problem. The groupby function does exactly what its name suggests: it groups the parts of the DataFrame you want it to.
Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0,4,1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result: a table with the summed counts for each main locality.
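Putting the two steps together, here is a minimal sketch that groups by the locality prefix without overwriting the original column; the data and column names are made up to mirror the question, only "eng" and "Locality" come from it:

import pandas as pd

eng = pd.DataFrame({
    "Locality": ["AA3_f", "AA3_lf", "AA3_s", "AA4_f"],
    "Herring gull": [2, 0, 3, 1],
    "Common gull": [1, 4, 0, 2],
})

# Group by the first three characters of the ID (the main locality)
# and sum only the numeric count columns.
totals = eng.groupby(eng["Locality"].str.slice(0, 3)).sum(numeric_only=True)
print(totals)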

Convert GraphFrame output to a pandas DataFrame

I checked multiple sources but couldn't pinpoint this particular problem although it probably has a very easy fix.
Let's say I have some graph, g.
I am able to print the vertices using g.vertices.show()
But I'm having a lot of trouble figuring out how to load all the vertices into a dataframe of some sort. I want to do a variety of tasks that are well supported on Pandas. Does anyone have a way to do this?
Just as .show() will display the results of any query, you can call .toPandas(), which converts the output to a pandas DataFrame. As far as I can tell, .toPandas() can be chained onto anything that you can chain .show() onto.
So for my specific question:
g.vertices.toPandas() solves the problem.
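For example, a minimal sketch, assuming a GraphFrame g already exists in the session:

# Convert the vertex and edge Spark DataFrames to pandas DataFrames.
vertices_pdf = g.vertices.toPandas()
edges_pdf = g.edges.toPandas()
print(vertices_pdf.head())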

Speed-up/Improve loop construction/performance

I am fairly new to Python. I wrote a script and I am surprised by the time it takes to go through one particular loop compared to the rest of my code.
Can someone tell me what is inefficient in the code I wrote and maybe how to improve the speed?
Here is the loop in question (BT_Histos and Histos_Last_Rebal are dataframes with dates as index and columns of floats; Portfolio and Portfolio_Last_Rebal are dataframes with the same index as the previous two, which I am filling in through the loop; weights is just a list):
Udl_Perf = BT_Histos / Histos_Last_Rebal - 1
for i in range(1, len(BT_Histos.index)):
    # tricky because isin doesn't work with timestamp
    test_date = pd.Series(Portfolio.index[i-1])
    if test_date.isin(Rebalancing_Dates)[0]:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = \
            Portfolio.loc[Portfolio.index[i-1], 'PortSeries']
    else:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] = \
            Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i-1], 'PortSeries']
    Portfolio.loc[Portfolio.index[i], 'PortSeries'] = \
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i], 'PortSeries'] * (1 + sum(Udl_Perf.iloc[i] * weights))
Thanks!
If you really want it to be fast, then first implement it with a while loop.
Second, for the length variable you will use, define its type in advance using the Mypy library; for this you need Python 3.5+ installed.
Also, if every iteration is independent, you can use multithreading via the threading library. There is an example in this git repo.

Should the DataFrame function groupBy be avoided?

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
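For comparison, here is a minimal RDD sketch of the "largest value per key" idea using reduceByKey; the toy data is made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([("a", 10), ("a", 25), ("b", 7), ("b", 13)])

# reduceByKey merges the values per key locally on each partition first,
# then shuffles only the partial results.
max_per_key = rdd.reduceByKey(lambda x, y: x if x > y else y)
print(max_per_key.collect())  # e.g. [('a', 25), ('b', 13)]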
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group; this can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max there are a multitude of functions that can be used in combination with agg, see here for all operations.
The documentation is pretty all over the place.
There has been a lot of optimization work for DataFrames. DataFrames carry additional information about the structure of your data, which helps with this. I often find that many people recommend DataFrames over RDDs because of this "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try "groupBy" on both RDDs and DataFrames on large datasets and compare the results. Sometimes the only way to know is to benchmark it yourself.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the Spark configuration settings (Doc)
spark.sql.shuffle.partitions (Doc)
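For example, a minimal sketch of changing the shuffle-partition setting on an existing session (the value 200 is just the default, not a recommendation):

# Number of partitions used when shuffling data for joins or aggregations.
spark.conf.set("spark.sql.shuffle.partitions", 200)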

What's the purpose of Series instead of lists in Pandas and Python?

Why doesn't Pandas build DataFrames directly from lists? Why was such a thing as a series created in the first place?
Or: If the data in a DataFrame is actually stored in memory as a collection of Series, why not just use a collection of lists?
Yet another way to ask the same question: what's the purpose of Series over lists?
This isn't going to be a very complete answer, but hopefully is an intuitive "general" answer.
Pandas doesn't use a list as the "core" unit that makes up a DataFrame because Series objects make assumptions that lists do not. A list in Python makes very few assumptions about what is inside; it could be pretty much anything, which makes it great as a core component of Python.
However, if you want to build a more specialized package that gives you extra functionality, like Pandas, then you want to create your own "core" data object and start building extra functionality on top of that. Compared with lists, you can do a lot more with a custom Series object (as witnessed by pulling a single column from a DataFrame and seeing what methods are available on the output).
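As a small illustration of that last point, a minimal sketch (the data is made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.5, 6.1]})

col = df["a"]        # a single column comes back as a Series, not a list
print(type(col))     # <class 'pandas.core.series.Series'>
print(col.dtype)     # a Series assumes a single dtype; a list does not
print(col.mean())    # vectorized methods a plain list does not have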
