update multiple columns based on two columns in pyspark data frames - python

I have a data frame like below in pyspark.
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| serial_number | rest_id | value | body | legs | face | idle |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11 | rs1 | N | Y | N | N | acde |
| sn1 | rs1 | N | Y | N | N | den |
| sn1 | null | Y | N | Y | N | can |
| sn2 | rs2 | Y | Y | N | N | aeg |
| null | rs2 | N | Y | N | Y | ueg |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
Now I want to update some of the columns based on checks against other columns.
Whenever any row with a given serial_number or rest_id has the value Y in a column, that column should be updated to Y for all rows sharing that serial_number or rest_id; otherwise the rows keep whatever values they already have.
I have done like below.
df.alias('a').join(
    df.filter(col('value') == 'Y').alias('b'),
    on=(col('a.serial_number') == col('b.serial_number')) | (col('a.rest_id') == col('b.rest_id')),
    how='left'
).withColumn(
    'final_value',
    when(col('b.value').isNull(), col('a.value')).otherwise(col('b.value'))
).select('a.serial_number', 'a.rest_id', 'a.body', 'a.legs', 'a.face', 'a.idle', 'final_value')
This gives the result I want for the value column.
Now I want to repeat the same for the columns body, legs and face as well.
I could do the above for each column individually, i.e. three more join statements, but I would like to update all four columns in a single statement.
How can I do that?
Expected result
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| serial_number | rest_id | value | body | legs | face | idle |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11 | rs1 | N | Y | N | N | acde |
| sn1 | rs1 | Y | Y | Y | N | den |
| sn1 | null | Y | Y | Y | N | can |
| sn2 | rs2 | Y | Y | N | Y | aeg |
| null | rs2 | Y | Y | N | Y | ueg |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+

You should be using window functions over both the serial_number and rest_id columns to check whether Y is present within each group (explanatory comments are provided inline below).
# column names to loop over for the updates
columns = ["value", "body", "legs", "face"]

from pyspark.sql import window as w
# window for grouping by serial_number
windowSpec1 = w.Window.partitionBy('serial_number').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)
# window for grouping by rest_id
windowSpec2 = w.Window.partitionBy('rest_id').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)

from pyspark.sql import functions as f
from pyspark.sql import types as t

# udf that checks whether Y appears in the list collected over the windows defined above
def containsUdf(x):
    return "Y" in x

containsUdfCall = f.udf(containsUdf, t.BooleanType())

# loop over the columns, collect each column's values within both windows,
# and set the column to Y when either group contains a Y
for column in columns:
    df = df.withColumn(column, f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)) | containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y").otherwise(df[column]))

df.show(truncate=False)
which should give you
+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn2 |rs2 |Y |Y |N |Y |aeg |
|null |rs2 |Y |Y |N |Y |ueg |
|sn11 |rs1 |N |Y |N |N |acde|
|sn1 |rs1 |Y |Y |Y |N |den |
|sn1 |null |Y |Y |Y |N |can |
+-------------+-------+-----+----+----+----+----+
I would recommend applying the two window functions in two separate loops rather than both at once, since evaluating both windows at the same time for every row may cause memory exceptions on big data.
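A minimal sketch of that two-pass variant, reusing the columns list, windows and UDF defined above; note that the second pass sees values already updated by the first, so a Y can propagate through both keys, which gives the same result for the sample data:
# first pass: propagate Y within each serial_number group
for column in columns:
    df = df.withColumn(column, f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)), "Y").otherwise(df[column]))
# second pass: propagate Y within each rest_id group
for column in columns:
    df = df.withColumn(column, f.when(containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y").otherwise(df[column]))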

Related

Using regex expresion to create a new Dataframe Column

I have the following Python DataFrame:
| ColumnA | File |
| -------- | -------------- |
| First | aasdkh.xls |
| Second | sadkhZ.xls |
| Third | asdasdPH.xls |
| Fourth | adsjklahsd.xls |
and so on.
I'm trying to get the following DataFrame:
| ColumnA | File | Category|
| -------- | ---------------- | ------- |
| First | aasdkh.xls | N |
| Second | sadkhZ.xls | Z |
| Third | asdasdPH.xls | PH |
| Fourth | adsjklahsdPH.xls | PH |
I'm trying to use regular expressions, but I'm not sure how to use them. I need a new column that "extracts" the category of the file: N if it is a "normal" file (no category), Z if the file contains a "Z" just before the extension, and PH if the file contains a "PH" before the extension.
I defined the following regular expressions that I think I could use, but I don't know how to use them:
regex_Z = re.compile(r'Z\.xls$')
regex_PH = re.compile(r'PH\.xls$')
PD: Could you recommend me any website to learn how to use regular expressions?
Let's try
df['Category'] = df['File'].str.extract(r'(Z|PH)\.xls$', expand=False).fillna('N')
print(df)
ColumnA File Category
0 First aasdkh.xls N
1 Second sadkhZ.xls Z
2 Third asdasdPH.xls PH
3 Fourth adsjklahsd.xls N
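If you would rather use the compiled patterns from the question directly, here is a small sketch of one way to do it (the categorize helper is just an illustrative name):
import re

regex_Z = re.compile(r'Z\.xls$')
regex_PH = re.compile(r'PH\.xls$')

def categorize(filename):
    # check the more specific PH suffix first, then Z, otherwise fall back to N
    if regex_PH.search(filename):
        return 'PH'
    if regex_Z.search(filename):
        return 'Z'
    return 'N'

df['Category'] = df['File'].apply(categorize)
For learning regular expressions, the official Python re documentation and its Regular Expression HOWTO are a good place to start.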

How to run TA-Lib on multiple tickers in a single dataframe

I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
ind = pd.DataFrame()
tind = pd.DataFrame()
for ticker in idf['symbol'].unique():
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]
idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way to do this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0,drop=True)
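A sketch extending the same groupby idea to all four indicators at once (assuming idf is sorted by symbol and date, talib is imported as ta like in the question, and close is a float column):
def add_indicators(g):
    # g is the slice of idf for a single symbol, in date order
    g = g.copy()
    g['EMA8'] = ta.EMA(g['close'], 8).round(2)
    g['EMA21'] = ta.EMA(g['close'], 21).round(2)
    g['RSI3'] = ta.RSI(g['close'], 3).round(2)
    g['RSI14'] = ta.RSI(g['close'], 14).round(2)
    return g

idf = idf.groupby('symbol', group_keys=False).apply(add_indicators)
This avoids the manual append/merge bookkeeping, since each group keeps its original index.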

Pyspark - filter out multiple rows based on a condition in one row

I have a table like so:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 0 | 5 |
| 0 | 4 |
| 0 | 0 |
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
| 3 | -4 |
--------------------------------------------
I would like to remove all IDs which have any Value <= 0, so the result would be:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
--------------------------------------------
I tried doing this by filtering to only rows with Value<=0, selecting the distinct IDs from this, converting that to a list, and then removing any rows in the original table that have an ID in that list using df.filter(~df.Id.isin(mylist))
However, I have a huge amount of data, and this ran out of memory making the list, so I need to come up with a pure pyspark solution.
As Gordon mentions, you may need a window for this; here is a pyspark version:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("Id")
(df.withColumn("flag",F.when(F.col("Value")<=0,0).otherwise(1))
.withColumn("Min",F.min("flag").over(w)).filter(F.col("Min")!=0)
.drop("flag","Min")).show()
+---+-----+
| Id|Value|
+---+-----+
| 1| 3|
| 2| 1|
| 2| 8|
+---+-----+
Brief summary of the approach taken:
- set a flag to 0 when Value <= 0, else 1
- take the min of the flag over a partition of Id (this returns 0 if any row in the group meets the condition)
- keep only the rows where this Min value is not 0
You can use window functions:
select t.*
from (select t.*, min(value) over (partition by id) as min_value
from t
) t
where min_value > 0
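The same idea can be written directly in the DataFrame API without the intermediate flag column; a sketch, assuming Value is numeric:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# keep only the Ids whose minimum Value across the whole group is positive
w = Window.partitionBy("Id")
result = (df.withColumn("min_value", F.min("Value").over(w))
            .filter(F.col("min_value") > 0)
            .drop("min_value"))
result.show()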

Merge two datasets based on specific column data

I have two pandas datasets
old:
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------| ------ | ---- | ----|
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
new:
| alpha | beta | zeta | id | numb |
| ------ | ------------------ | ---------------| ------| -----|
| 1 | LA | Hwood | Q | Q400 |
| 2 | NY | queens | B | B200 |
| 3 | Chic | lincpark | D | D300 |
(Columns and data don't mean anything in particular, just an example).
I want to merge datasets in a way such that
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.id = new.numb, keep only the entry from the old table (in this case row 2 in old with queens is kept, as opposed to row 2 in new with queens).
Note that rows 3 and 4 on old are the same, but we still keep both. If there were 2 duplicates of these rows in new we consider them as 1-1 corresponding. If maybe there were 3 duplicates on new of rows 3 and 4 on old, then 2 are considered copies (and we don't add them, but we would add the third when we merge them)
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.numb is contained inside new.numb, keep only the entry from the old table (in this case row 5 in old with lincpark is kept, as opposed to row 3 in new with lincpark, because 300 is contained in new.numb).
Otherwise, add the new row as new data, keeping the new table's id and numb and leaving null in any extra columns that only the old table has (new's row 1 with Hwood).
I have tried various merging methods along with the drop_duplicates method. The problem with the latter is that when I dropped duplicates sharing the same alpha, beta, and zeta, rows were often deleted from the same data source because they were exactly identical.
This is what ultimately needs to be shown after merging; two of the rows in new were duplicates and one needed to be added:
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------| ------ | ---- | ----|
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
| 1 | LA | Hwood | Q | | Q400|
We can merge two DataFrames in several ways; the most common way in Python is the merge operation in pandas.
Assuming df1 is your new table and df2 is the old one, start with a merge and then apply your IF conditions:
import pandas
dfinal = df1.merge(df2, on="alpha", how='inner')
For merging on columns of two different DataFrames, you can name the left and right columns explicitly, which is especially useful when the same column has two different names in the two frames, let's say 'idold' versus 'idnew':
dfinal = df1.merge(df2, how='inner', left_on='alpha', right_on='id')
If you want to be even more specific, you can read the documentation of the pandas merge operation.
You can also express your IF conditions row by row after the merge, drop the leftover columns into a temporary dataframe, and add values to that dataframe according to the conditions.
I understand the answer is a little complex, but so is your question. Cheers :)
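A rough sketch of that row-by-row idea, assuming the frames are named old and new as in the question, that "contained" means str(old.numb) appears inside new.numb, and ignoring the duplicate-counting subtlety mentioned for rows 3 and 4; pd.concat leaves null in old-only columns such as rand:
import pandas as pd

# match each new row against old rows on alpha/beta/zeta
candidates = (new.reset_index().rename(columns={'index': 'new_idx'})
                 .merge(old[['alpha', 'beta', 'zeta', 'id', 'numb']],
                        on=['alpha', 'beta', 'zeta'], how='left',
                        suffixes=('', '_old')))

# a new row duplicates an old row when new.numb equals old.id
# or when old.numb (as a string) is contained in new.numb
is_dup = (candidates['numb'] == candidates['id_old']) | candidates.apply(
    lambda r: pd.notna(r['numb_old']) and str(int(r['numb_old'])) in str(r['numb']),
    axis=1)

# keep every old row, and append only the new rows that duplicate nothing in old
dup_idx = candidates.loc[is_dup, 'new_idx'].unique()
to_add = new.loc[~new.index.isin(dup_idx)]
result = pd.concat([old, to_add], ignore_index=True)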

Python flags in object constructor

I have come across the concept of flags in Python on a few occasions, for example in wxPython. An example is the initialization of a frame object, where several attributes are passed to "style":
frame = wx.Frame(None, style=wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX)
I don't really understand the concept of flags. I haven't even found a solid explanation of what exactly the term "flag" means in Python. How are all these attributes passed to one variable?
The only thing I can think of is that the "|" character is used as a boolean operator, but in that case wouldn't all the attributes passed to style just evaluate to a single boolean expression?
What is usually meant by flags in this sense are bits in a single integer value. | is the usual bitwise OR operator.
Let's say wx.MAXIMIZE_BOX = 8 and wx.RESIZE_BORDER = 4; if you OR them together you get 12. In this case you could actually use the + operator instead of |.
Try printing the constants print(wx.MAXIMIZE_BOX) etc. and you may get a better understanding.
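A tiny sketch of that idea with plain integers (8 and 4 are just the example numbers above, not the real wx constant values):
MAXIMIZE_BOX = 8    # 0b1000
RESIZE_BORDER = 4   # 0b0100

style = MAXIMIZE_BOX | RESIZE_BORDER   # 12, i.e. 0b1100

# test a flag with bitwise AND: non-zero means the flag is set
print(bool(style & MAXIMIZE_BOX))   # True
print(bool(style & 2))              # False, the 0b0010 bit was never set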
Flags are not unique to Python; they are a concept used in many languages. They build on the concepts of bits and bytes, where computer memory stores information using, essentially, a huge number of flags. Those flags are bits: they are either off (value 0) or on (value 1), even though you usually access computer memory in groups of at least 8 such flags (bytes, and for larger groups, words of a multiple of 8, specific to the computer architecture).
Integer numbers are an easy and common representation of the information stored in bytes; a single byte can store any integer number between 0 and 255, and with more bytes you can represent bigger integers. But those integers still consist of bits that are either on or off, and so you can use those as switches to enable or disable features. You pass in specific integer values with specific bits enabled or disabled to switch features on and off.
So a byte consists of 8 flags (bits), and enabling one of these means you have 8 different integers; 1, 2, 4, 8, 16, 32, 64 and 128, and you can pass a combination of those numbers to a library like wxPython to set different options. For multi-byte integers, the numbers just go up by doubling.
But you a) don't want to remember what each number means, and b) need a method of combining them into a single integer number to pass on.
The | operator does the latter, and the wx.MAXIMIZE_BOX, wx.RESIZE_BORDER, etc names are just symbolic constants for the integer values, set by the wxWidget project in various C header files, and summarised in wx/toplevel.h and wx/defs.h:
/*
Summary of the bits used (some of them are defined in wx/frame.h and
wx/dialog.h and not here):
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxCENTRE
| | | | | | | | | | | | | | \____ wxFRAME_NO_TASKBAR
| | | | | | | | | | | | | \_______ wxFRAME_TOOL_WINDOW
| | | | | | | | | | | | \__________ wxFRAME_FLOAT_ON_PARENT
| | | | | | | | | | | \_____________ wxFRAME_SHAPED
| | | | | | | | | | \________________ wxDIALOG_NO_PARENT
| | | | | | | | | \___________________ wxRESIZE_BORDER
| | | | | | | | \______________________ wxTINY_CAPTION_VERT
| | | | | | | \_________________________
| | | | | | \____________________________ wxMAXIMIZE_BOX
| | | | | \_______________________________ wxMINIMIZE_BOX
| | | | \__________________________________ wxSYSTEM_MENU
| | | \_____________________________________ wxCLOSE_BOX
| | \________________________________________ wxMAXIMIZE
| \___________________________________________ wxMINIMIZE
\______________________________________________ wxSTAY_ON_TOP
...
*/
and
/*
Summary of the bits used by various styles.
High word, containing styles which can be used with many windows:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxFULL_REPAINT_ON_RESIZE
| | | | | | | | | | | | | | \____ wxPOPUP_WINDOW
| | | | | | | | | | | | | \_______ wxWANTS_CHARS
| | | | | | | | | | | | \__________ wxTAB_TRAVERSAL
| | | | | | | | | | | \_____________ wxTRANSPARENT_WINDOW
| | | | | | | | | | \________________ wxBORDER_NONE
| | | | | | | | | \___________________ wxCLIP_CHILDREN
| | | | | | | | \______________________ wxALWAYS_SHOW_SB
| | | | | | | \_________________________ wxBORDER_STATIC
| | | | | | \____________________________ wxBORDER_SIMPLE
| | | | | \_______________________________ wxBORDER_RAISED
| | | | \__________________________________ wxBORDER_SUNKEN
| | | \_____________________________________ wxBORDER_{DOUBLE,THEME}
| | \________________________________________ wxCAPTION/wxCLIP_SIBLINGS
| \___________________________________________ wxHSCROLL
\______________________________________________ wxVSCROLL
...
*/
The | operator is the bitwise OR operator; it combines the bits of two integers, each matching bit is paired up and turned into an output bit according to the boolean rules for OR. When you do this for those integer constants, you get a new integer number with multiple flags enabled.
So the expression
wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX
gives you an integer number with bit numbers 9, 6, 11, 29, and 12 set; here I used '0' and '1' strings to represent the bits and int(..., 2) to interpret a sequence of those strings as a single integer number in binary notation:
>>> fourbytes = ['0'] * 32
>>> fourbytes[9] = '1'
>>> fourbytes[6] = '1'
>>> fourbytes[11] = '1'
>>> fourbytes[29] = '1'
>>> fourbytes[12] = '1'
>>> ''.join(fourbytes)
'00000010010110000000000000000100'
>>> int(''.join(fourbytes), 2)
39321604
On the receiving end, you can use the & bitwise AND operator to test whether a specific flag is set; it returns 0 if the flag is not set, or the integer assigned to the flag constant if the flag bit is set. In both C and Python, a non-zero number is true in a boolean test, so testing for a specific flag is usually done with:
if ( style & wxMAXIMIZE_BOX ) {
for determining that a specific flag is set, or
if ( !(style & wxBORDER_NONE) )
to test for the opposite.
It is a boolean operator - not a logical one, but a bitwise one. wx.MAXIMIZE_BOX and the rest are typically integers that are powers of two - 1, 2, 4, 8, 16... - which means only one bit in them is 1 and all the rest are 0. When you apply bitwise OR (x | y) to such integers, the end effect is that they combine: 2 | 8 (0b00000010 | 0b00001000) becomes 10 (0b00001010). They can be pried apart later using the bitwise AND (x & y) operator, also called a masking operator: 10 & 8 > 0 will be true because the bit corresponding to 8 is turned on.
