PySpark: Evaluating specific columns together - python

I have a Spark dataframe that looks like the following:
+----+----+----+---+
| a  | b  | c  | d |
+----+----+----+---+
| 13 | 43 | 67 | 3 |
+----+----+----+---+
Is it possible to choose specific columns to evaluate together to produce the following?
+----+----+----+---+-----+-----+-----------+
| a  | b  | c  | d | a+b | c-b | a+b / c-b |
+----+----+----+---+-----+-----+-----------+
| 13 | 43 | 67 | 3 | 56  | 24  | 2.33      |
+----+----+----+---+-----+-----+-----------+

Yes, it's possible. You can use selectExpr or withColumn to add extra columns:
from pyspark.sql.functions import expr
(
    df.withColumn("a+b", expr("a + b"))
      .withColumn("c-b", expr("c - b"))
      .withColumn("a+b / c-b", expr("(a + b) / (c - b)"))
      .show()
)
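Since selectExpr is mentioned as well, a minimal equivalent using it could look like this (the aliases are backtick-quoted because they contain operators and spaces):
df.selectExpr(
    "*",
    "a + b AS `a+b`",
    "c - b AS `c-b`",
    "(a + b) / (c - b) AS `a+b / c-b`",
).show()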

Related

How do you generate a date range and then have that date range appended to the dataframe?

I know how to generate a date range using this code:
pd.date_range(start='2022-10-16', end='2022-10-19')
How do I take the date range result above and apply it to every location in the dataframe below?
+----------+
| Location |
+----------+
| A |
| B |
| C |
+----------+
This is the result I want.
+----------+------------+
| Location | Date |
+----------+------------+
| A | 2022/10/16 |
| A | 2022/10/17 |
| A | 2022/10/18 |
| A | 2022/10/19 |
| B | 2022/10/16 |
| B | 2022/10/17 |
| B | 2022/10/18 |
| B | 2022/10/19 |
| C | 2022/10/16 |
| C | 2022/10/17 |
| C | 2022/10/18 |
| C | 2022/10/19 |
+----------+------------+
I have spent the whole day figuring this out. Any help would be appreciated!
You can cross join your date range and dataframe to get your desired result:
date_range = (pd.date_range(start='2022-10-16', end='2022-10-19')
                .rename('Date')
                .to_series())

df = df.merge(date_range, how='cross')
print(df)
Output:
Location Date
0 A 2022-10-16
1 A 2022-10-17
2 A 2022-10-18
3 A 2022-10-19
4 B 2022-10-16
5 B 2022-10-17
6 B 2022-10-18
7 B 2022-10-19
8 C 2022-10-16
9 C 2022-10-17
10 C 2022-10-18
11 C 2022-10-19
You seem to be looking for the Cartesian product of two iterables, which is exactly what itertools.product produces.
In your case, you can try:
import pandas as pd
from itertools import product
# Test data:
df = pd.DataFrame(['A', 'B', 'C'], columns=['Location'])
dr = pd.date_range(start='2022-10-16', end='2022-10-19')
# Create the cartesian product:
res_df = pd.DataFrame(product(df['Location'], dr), columns=['Location', 'Date'])
print(res_df)
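An alternative sketch that stays entirely in pandas (reusing the same df and dr test data defined above): give every row the full date range as a list, then explode it into one row per (Location, Date) pair.
res_df = (df.assign(Date=[dr.tolist()] * len(df))   # every row gets the whole date range
            .explode('Date', ignore_index=True))    # one row per Location/Date pair
print(res_df)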

Pyspark: Reorder only a subset of rows among themselves

my data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I want to re-rank only the rows where reRnk == 'yes'; the ranking will be done based on "val".
I don't want to move the rows where reRnk = 'no'; for example, at id = b we have reRnk = 'no', and I want to keep that row in row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, PySpark DataFrames do not have an index by default, so you might need to add one.
I do not know the exact syntax for PySpark, but since it has many similarities with pandas, this might point you in the right direction:
mask = df.reRnk == 'yes'
df.loc[mask, ['val', 'id']] = (df.loc[mask, ['val', 'id']]
                                 .sort_values('val', ascending=False)
                                 .set_index(df.loc[mask, ['val', 'id']].index))
Basically, what we do here is isolate the rows with reRnk == 'yes' and sort them by 'val', while resetting the index to its original values. Then we assign these reordered values back to the original rows of the df.
for .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
for .sort_values see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
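A pure-PySpark sketch of the same idea (illustrative only; it assumes the current row order can be captured with monotonically_increasing_id and that the data is small enough for unpartitioned windows):
from pyspark.sql import functions as F, Window

# Remember each row's original position.
df_pos = df.withColumn("pos", F.monotonically_increasing_id())
yes = df_pos.filter(F.col("reRnk") == "yes")
no = df_pos.filter(F.col("reRnk") == "no")

# The positions originally occupied by 'yes' rows, numbered in order.
slots = yes.select("pos").withColumn("slot", F.row_number().over(Window.orderBy("pos")))
# Re-rank the 'yes' rows by val (descending) and map each rank to a slot.
ranked = yes.drop("pos").withColumn("slot", F.row_number().over(Window.orderBy(F.desc("val"))))
reordered_yes = ranked.join(slots, "slot").drop("slot")

# Put everything back together in the original positions.
result = reordered_yes.unionByName(no).orderBy("pos").drop("pos")
result.show()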

Use different dataframe inside PySpark UDF

I got a dataframe (df1), where I have listed some time frames:
| start | end | event name |
|-------|-----|------------|
| 1 | 3 | name_1 |
| 3 | 5 | name_2 |
| 2 | 6 | name_3 |
In these time frames, I would like to extract some data from another dataframe (df2). For example, I want to extend df1 with the average measurement from df2 inside the specified time range.
| timestamp | measurement |
|-----------|-------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 5 |
| 4 | 9 |
| 5 | 2 |
| 6 | 7 |
| 7 | 8 |
I was thinking about a UDF which filters df2 by timestamp and evaluates the average, but inside a UDF I cannot reference a second dataframe:
import pyspark.sql.functions as f

def get_avg(start, end):
    return df2.filter((df2.timestamp > start) & (df2.timestamp < end)).agg({"measurement": "avg"})

udf_1 = f.udf(get_avg)
df1.select(udf_1('start', 'end')).show()
This throws TypeError: cannot pickle '_thread.RLock' object.
How would I solve this issue efficiently?
In this case there is no need for a UDF; you can simply use a join over a range condition determined by the timestamps:
import pyspark.sql.functions as F
df1.join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
   .groupby('start', 'end', 'event_name') \
   .agg(F.mean('measurement').alias('avg')) \
   .show()
+-----+---+----------+-----------------+
|start|end|event_name| avg|
+-----+---+----------+-----------------+
| 1| 3| name_1| 7.0|
| 3| 5| name_2| 9.0|
| 2| 6| name_3|5.333333333333333|
+-----+---+----------+-----------------+
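If df1 (the table of time frames) is small, a broadcast hint may keep Spark from shuffling the large df2 for this non-equi join; a sketch of the same aggregation with the hint:
import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast

df2.join(broadcast(df1), on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
   .groupby('start', 'end', 'event_name') \
   .agg(F.mean('measurement').alias('avg')) \
   .show()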

List combinations from a compatibility matrix

I have a table in the shape of a symmetric matrix that tells me which components are compatible. Here is an example:
Components | A | B | C | D | E | F | G |
-----------+---+---+---+---+---+---+---+
A | | | 1 | 1 | 1 | 1 | |
-----------+---+---+---+---+---+---+---+
B | | | | | 1 | | 1 |
-----------+---+---+---+---+---+---+---+
C | 1 | | | | | 1 | |
-----------+---+---+---+---+---+---+---+
D | 1 | | | | | 1 | 1 |
-----------+---+---+---+---+---+---+---+
E | 1 | 1 | | | | | 1 |
-----------+---+---+---+---+---+---+---+
F | 1 | | 1 | 1 | | | 1 |
-----------+---+---+---+---+---+---+---+
G | | 1 | | 1 | 1 | 1 | |
-----------+---+---+---+---+---+---+---+
The 1s show which pairs are compatible and the blanks show which are not. The actual table has many more components. Currently the real table is in an Excel spreadsheet, but it could easily be converted to CSV or text for convenience.
What I need to do is create a list of possible combinations. I know there are things like itertools, but I need it to list only the compatible combinations and ignore the incompatible ones. This is the .dat file I pull in when I run Pyomo:
set NODES := A B C D E F G;
param: ARCS:=
A
B
C
...
A C
A D
B E
...
A C F
B E G
...
Everything listed together must be mutually compatible. So A C F can be together because they are all compatible with each other, but not A D G, because G is not compatible with A.
Long Term Plan:
Eventually I plan to use Pyomo to find the best combination of components that minimizes the resources associated with each component. Therefore, the .dat file will eventually also include an additional cost associated with each combination.
import pandas as pd
import networkx as nx

# Read the compatibility matrix; blank cells come in as NaN, so replace them with 0.
df = pd.read_excel(r"/path/to/file.xlsx", sheet_name="Sheet4", index_col=0, usecols="A:H").fillna(0)

# Build a graph from the adjacency matrix and enumerate all cliques;
# every clique is a set of mutually compatible components.
G = nx.from_pandas_adjacency(df)
print(list(nx.enumerate_all_cliques(G)))
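A self-contained sketch of the same approach, with the example matrix from the question typed in by hand so it can be run without the spreadsheet:
import networkx as nx

# Compatibility pairs read off the example matrix above.
edges = [("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
         ("B", "E"), ("B", "G"),
         ("C", "F"),
         ("D", "F"), ("D", "G"),
         ("E", "G"),
         ("F", "G")]

G = nx.Graph()
G.add_nodes_from("ABCDEFG")
G.add_edges_from(edges)

# Every clique is a group of mutually compatible components:
# ('A', 'C', 'F') appears, while ('A', 'D', 'G') does not because A and G are incompatible.
print(list(nx.enumerate_all_cliques(G)))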

Interval intersection in pandas

Update 5:
This feature has been released as part of pandas 0.20.1 (on my birthday :] )
Update 4:
PR has been merged!
Update 3:
The PR has moved here
Update 2:
It seems like this question may have contributed to re-opening the PR for IntervalIndex in pandas.
Update:
I no longer have this problem, since I'm actually now querying for overlapping ranges from A and B, not points from B which fall within ranges in A, which is a full interval tree problem. I won't delete the question though, because I think it's still a valid question, and I don't have a good answer.
Problem statement
I have two dataframes.
In dataframe A, two of the integer columns taken together represent an interval.
In dataframe B, one integer column represents a position.
I'd like to do a sort of join, such that points are assigned to each interval they fall within.
Intervals are rarely but occasionally overlapping. If a point falls within that overlap, it should be assigned to both intervals. About half of points won't fall within an interval, but nearly every interval will have at least one point within its range.
What I've been thinking
I was initially going to dump my data out of pandas, and use intervaltree or banyan or maybe bx-python but then I came across this gist. It turns out that the ideas shoyer has in there never made it into pandas, but it got me thinking -- it might be possible to do this within pandas, and since I want this code to be as fast as python can possibly go, I'd rather not dump my data out of pandas until the very end. I also get the feeling that this is possible with bins and pandas cut function, but I'm a total newbie to pandas, so I could use some guidance! Thanks!
Notes
Potentially related? Pandas DataFrame groupby overlapping intervals of variable length
This feature was released as part of pandas 0.20.1.
Answer using pyranges, which is basically pandas sprinkled with bioinformatics sugar.
Setup:
import numpy as np
np.random.seed(0)
import pyranges as pr
a = pr.random(int(1e6))
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 8830650 | 8830750 | + |
# | chr1 | 9564361 | 9564461 | + |
# | chr1 | 44977425 | 44977525 | + |
# | chr1 | 239741543 | 239741643 | + |
# | ... | ... | ... | ... |
# | chrY | 29437476 | 29437576 | - |
# | chrY | 49995298 | 49995398 | - |
# | chrY | 50840129 | 50840229 | - |
# | chrY | 38069647 | 38069747 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
b = pr.random(int(1e6), length=1)
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 52110394 | 52110395 | + |
# | chr1 | 122640219 | 122640220 | + |
# | chr1 | 162690565 | 162690566 | + |
# | chr1 | 117198743 | 117198744 | + |
# | ... | ... | ... | ... |
# | chrY | 45169886 | 45169887 | - |
# | chrY | 38863683 | 38863684 | - |
# | chrY | 28592193 | 28592194 | - |
# | chrY | 29441949 | 29441950 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
result = a.join(b, strandedness="same")
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand | Start_b | End_b | Strand_b |
# | (category) | (int32) | (int32) | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------+-----------+-----------+--------------|
# | chr1 | 227348436 | 227348536 | + | 227348516 | 227348517 | + |
# | chr1 | 18901135 | 18901235 | + | 18901191 | 18901192 | + |
# | chr1 | 230131576 | 230131676 | + | 230131636 | 230131637 | + |
# | chr1 | 84829850 | 84829950 | + | 84829903 | 84829904 | + |
# | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 44139791 | 44139891 | - | 44139821 | 44139822 | - |
# | chrY | 51689785 | 51689885 | - | 51689859 | 51689860 | - |
# | chrY | 45379140 | 45379240 | - | 45379215 | 45379216 | - |
# | chrY | 37469479 | 37469579 | - | 37469576 | 37469577 | - |
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 16,153 rows and 7 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
df = result.df
# Chromosome Start End Strand Start_b End_b Strand_b
# 0 chr1 227348436 227348536 + 227348516 227348517 +
# 1 chr1 18901135 18901235 + 18901191 18901192 +
# 2 chr1 230131576 230131676 + 230131636 230131637 +
# 3 chr1 84829850 84829950 + 84829903 84829904 +
# 4 chr1 189088140 189088240 + 189088163 189088164 +
# ... ... ... ... ... ... ... ...
# 16148 chrY 38968068 38968168 - 38968124 38968125 -
# 16149 chrY 44139791 44139891 - 44139821 44139822 -
# 16150 chrY 51689785 51689885 - 51689859 51689860 -
# 16151 chrY 45379140 45379240 - 45379215 45379216 -
# 16152 chrY 37469479 37469579 - 37469576 37469577 -
#
# [16153 rows x 7 columns]
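To stay in plain pandas instead, here is a naive sketch of the original point-in-interval assignment using the IntervalIndex mentioned in the updates (the frame and column names are made up, and the double loop is illustrative rather than fast):
import pandas as pd

A = pd.DataFrame({"start": [1, 5, 8], "end": [4, 9, 12]})  # intervals
B = pd.DataFrame({"pos": [2, 6, 8, 15]})                   # points

intervals = pd.IntervalIndex.from_arrays(A["start"], A["end"], closed="both")

# Assign every point to every interval containing it (overlaps give multiple hits);
# points outside all intervals are simply dropped.
hits = [(p, i) for p in B["pos"] for i, iv in enumerate(intervals) if p in iv]
print(hits)  # [(2, 0), (6, 1), (8, 1), (8, 2)]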
