PySpark Dataframe transformation for coordinate data

I have the following dataframe in PySpark, where each entry is a location for a journey with "constant" values Id, Start and Stop, and varying coordinates.
| Id | Lat  | Lon | Start | Stop |
| 1  | 40.5 | 40  | A     | B    |
| 1  | 41.0 | 45  | A     | B    |
| 1  | 40.5 | 40  | A     | B    |
| 2  | 31.4 | 59  | A     | C    |
| 2  | 34.5 | 60  | A     | C    |
| 2  | 37.0 | 61  | A     | C    |
| ... |
I want to transform this dataframe into
| Id | Start | Stop | Trajectory            |
| 1  | A     | B    | Vector of Coordinates |
| 2  | A     | C    | Vector of Coordinates |
| ... |
My first thought was to iterate over the rows of the dataframe, create a "trip object" (Id, Start, Stop and a trajectory list) for each new Id, and append each Lat/Lon pair to that trajectory list as a Coordinate object. Since the dataset is enormous, however, this would probably be computationally expensive.
I have never used Spark before, so there are probably some smart tricks that would make this much more efficient.

I was not sure exactly what you need as output, but take this code as a starting point for improvements or discussion. You can change the columns used in groupBy, add ordering, or change the way the points are created (I am just joining Lat and Lon with a colon).
import pyspark.sql.functions as F
inputData = [
    ("1", "40.5", "40", "A", "B"),
    ("1", "41.0", "45", "A", "B"),
    ("1", "40.5", "40", "A", "B"),
    ("2", "31.4", "59", "A", "C"),
    ("2", "34.5", "60", "A", "C"),
    ("2", "37.0", "61", "A", "C"),
]
df = spark.createDataFrame(inputData, schema=["id", "Lat", "Lon", "Start", "Stop"])
aggregatedDf = (
    df.withColumn("Point", F.concat_ws(':', F.col("Lat"), F.col("Lon")))
    .groupBy("id", "Start", "Stop")
    .agg(F.collect_list("Point").alias('Trajectory'))
)
aggregatedDf.show(truncate=False)
sample output:
+---+-----+----+---------------------------+
|id |Start|Stop|Trajectory                 |
+---+-----+----+---------------------------+
|1  |A    |B   |[40.5:40, 41.0:45, 40.5:40]|
|2  |A    |C   |[31.4:59, 34.5:60, 37.0:61]|
+---+-----+----+---------------------------+

Related

Applying a function to a list of columns in a dataframe

I have a set of data in a dataframe (df), generated from another programme, which outputs to csv. It looks something like this:
| |Molecule |Conversion factor |Condition1 |Condition2 |Condition3 |...|
|0 |A |-0.5 |-5.5 |-5.7 |-5.9||
|1 |B |-0.1 |-10.3 |-10.6 |-11.0||
|2 |C |-0.3 |-6.5 |-6.6 |-6.7||
Because the conditions are the result of another calculation, the condition header names are variable, so I have generated a list of headers from the data called "condition_list".
I want to apply a function which does: conversion factor / condition, for every value in the columns listed in condition_list, but not the other columns (they are strings, so this gives errors).
The code I have so far is this:
condition_list = list(df.columns)
del condition_list[:2]
def conversion(x):
    return [1] * x ** -1
df = apply(lambda x: conversion(x) if x.name in [condition_list] else x, axis=1)
print(df)
But it doesn't seem to read the list correctly, and so I get the original table back.
What am I doing wrong?

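It's a simple case of using your filter condition in a comprehension. Two things stop your attempt from working: [condition_list] wraps the list in another list, so x.name in [condition_list] is never true, and with axis=1 the function receives rows, so x.name is a row label rather than a column name.
Your conversion function is so simple (a division) that I have elected to use a lambda function.
Your source data has a structural issue (stray separators create extra columns), so the example also demonstrates that the other columns are not included in the calculation.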
import io
import pandas as pd
# read the table exactly as posted; note that the header names keep their trailing spaces
df = pd.read_csv(io.StringIO(""" | |Molecule |Conversion factor |Condition1 |Condition2 |Condition3 |...|
|0 |A |-0.5 |-5.5 |-5.7 |-5.9||
|1 |B |-0.1 |-10.3 |-10.6 |-11.0||
|2 |C |-0.3 |-6.5 |-6.6 |-6.7||"""), sep="|")
# keep only the conversion factor and the condition columns, divide row by row,
# and join the results back as new "... res" columns
df.join(
    df.loc[:, ["Conversion factor "] + [c for c in df.columns if "Condition" in c]].apply(
        lambda s: pd.Series({f"{n}res": s["Conversion factor "] / s[n]
                             for n in s.index if n != "Conversion factor "}),
        axis=1,
    )
)
output
|   |     |   | Molecule | Conversion factor | Condition1 | Condition2 | Condition3 | ... | Unnamed: 8 | Condition1 res | Condition2 res | Condition3 res |
| 0 | nan | 0 | A        | -0.5              | -5.5       | -5.7       | -5.9       | nan | nan        | 0.0909091      | 0.0877193      | 0.0847458      |
| 1 | nan | 1 | B        | -0.1              | -10.3      | -10.6      | -11        | nan | nan        | 0.00970874     | 0.00943396     | 0.00909091     |
| 2 | nan | 2 | C        | -0.3              | -6.5       | -6.6       | -6.7       | nan | nan        | 0.0461538      | 0.0454545      | 0.0447761      |
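
If the data can be read into clean columns, the same division can probably be written without a row-wise apply. The sketch below assumes a tidy frame rebuilt by hand from the question's table (no stray separator columns, no trailing spaces in the names) and uses DataFrame.rdiv with axis=0, which computes "other / frame" aligned on the row index. The name converted is only illustrative.

```python
import pandas as pd

# Assumption: a clean version of the question's table, typed in by hand.
df = pd.DataFrame({
    "Molecule": ["A", "B", "C"],
    "Conversion factor": [-0.5, -0.1, -0.3],
    "Condition1": [-5.5, -10.3, -6.5],
    "Condition2": [-5.7, -10.6, -6.6],
    "Condition3": [-5.9, -11.0, -6.7],
})

# Same idea as in the question: everything after the first two columns is a condition.
condition_list = list(df.columns)[2:]

converted = df.copy()
# rdiv(other, axis=0) computes other / df, matching rows by index,
# i.e. conversion factor / condition for every condition column.
converted[condition_list] = df[condition_list].rdiv(df["Conversion factor"], axis=0)
print(converted)
```

The string column is never touched because only the columns named in condition_list are selected and overwritten.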

Python - Replicate rows in Pandas Dataframe based on condition

I have a Pandas DataFrame on which I need to replicate some of the rows based on the presence of a given list of values in certain columns. If a row contains one of these values in the specified columns, then I need to replicate that row.
df = pd.DataFrame({"User": [1, 2], "col_01": ["C", "A"], "col_02": ["A", "C"], "col_03": ["B", "B"], "Block": ["01", "03"]})
   User col_01 col_02 col_03 Block
0     1      C      A      B    01
1     2      A      C      B    03
values = ["C", "D"]
columns = ["col_01", "col_02", "col_03"]
rep_times = 3
Given these two lists of values and columns, each row that contains either 'C' or 'D' in the columns named 'col_01', 'col_02' or 'col_03' has to be repeated rep_times times, therefore the output table has to be like this:
   User col_01 col_02 col_03 Block
0     1      C      A      B    01
1     1      C      A      B    01
2     1      C      A      B    01
3     2      A      A      B    03
I tried something like the following, but it doesn't work and I don't know how to create this final table. The preferred way would be a one-line operation that does the work.
df2 = pd.DataFrame((pd.concat([row] * rep_times, axis=0, ignore_index=True)
if any(x in values for x in list(row[columns])) else row for index, row in df.iterrows()), columns=df.columns)

import pandas as pd
First, create a boolean mask that checks your condition, using the isin() method:
mask = df[columns].isin(values).any(axis=1)
Then use reindex() to repeat the matching rows rep_times times, and pd.concat() to add back the rows that do not satisfy the condition (DataFrame.append() was removed in pandas 2.0):
df = pd.concat([df.reindex(df[mask].index.repeat(rep_times)), df[~mask]])
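
A variation on the same mask idea, as a sketch: instead of splitting the frame and gluing it back together, give every row its own repeat count (rep_times for matching rows, 1 otherwise). This keeps the original row order. The second row here is taken from the expected-output table in the question (col_02 = "A"), which is an assumption so that only the first row matches; the names counts and out are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: second row uses col_02 = "A", as in the question's expected output.
df = pd.DataFrame({"User": [1, 2], "col_01": ["C", "A"], "col_02": ["A", "A"],
                   "col_03": ["B", "B"], "Block": ["01", "03"]})
values = ["C", "D"]
columns = ["col_01", "col_02", "col_03"]
rep_times = 3

# True for rows that contain any of the values in the selected columns.
mask = df[columns].isin(values).any(axis=1)

# Per-row repeat count: rep_times where the mask is True, otherwise 1.
counts = np.where(mask, rep_times, 1)

# Repeat the index labels accordingly and select the rows in that order.
out = df.loc[df.index.repeat(counts)].reset_index(drop=True)
print(out)
```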

Convert data into a DataFrame with format as a mix of dictionary and tuple

The following is an example of the result of my code. I want to convert the part that is in a dictionary into a DataFrame, but I don't know how to do it. Can anyone help me, please?
x = ([('f', 66)], {('f', 66): ([('ft', 88, "#", '592063472')])},
[('x', 12)], {('x', 12): ([('uuuu', 9, "dd", '592063472')])})
x
The data frame that I want:
0   1   2    3   4   5
f   66  ft   88  #   592063472
x   12  uuu  9   dd  592063472

Approach:
find the dictionary instances
build a list of tuples
import pandas as pd
x_data = [d for d in x if isinstance(d, dict)]  # first pass: keep only the dictionaries
x_data = [e1 + e[e1][0] for e in x_data for e1 in e]  # second pass: join each key tuple with its first value tuple
df = pd.DataFrame(x_data)  # create the dataframe
df
output:
   0   1     2   3   4          5
0  f  66    ft  88   #  592063472
1  x  12  uuuu   9  dd  592063472
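
The approach above only reads the first tuple stored under each key. If a key could hold several tuples (an assumption, since the sample data has exactly one per key), a single comprehension over the dictionary items would emit one row per tuple. A minimal sketch, with rows as an illustrative name:

```python
import pandas as pd

x = ([('f', 66)], {('f', 66): [('ft', 88, "#", '592063472')]},
     [('x', 12)], {('x', 12): [('uuuu', 9, "dd", '592063472')]})

# For every dict in x, join the key tuple with each value tuple stored under it.
rows = [key + value
        for item in x if isinstance(item, dict)
        for key, values in item.items()
        for value in values]

df = pd.DataFrame(rows)
print(df)
```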

Splitting dataframe into multiple dataframes by using groupby

I have a question about splitting a dataframe into multiple dataframes using groupby, such that each iteration covers more than one group. I looked through the forum and found the example below, which is very close to my problem. However, I was wondering whether it is possible to print all the rows of more than one group per iteration of the loop. In the example below, is it possible to print all the rows of regions A, B and C in the first iteration, and then iterate again for the next 3 regions?
for region, df_region in df.groupby('Region'):
    print(df_region)
  Competitor Region ProductA ProductB
0      Comp1      A      £10      £15
3      Comp2      A       £9      £16
6      Comp3      A      £11      £16
  Competitor Region ProductA ProductB
1      Comp1      B      £11      £16
4      Comp2      B      £12      £14
7      Comp3      B      £10      £15
  Competitor Region ProductA ProductB
2      Comp1      C      £11      £15
5      Comp2      C      £14      £17
8      Comp3      C      £12      £15
I am learning Python/pandas and am still a beginner with the language. Any help would be really appreciated. Thanks.

The technical terminology for this is batching: you're returning values from your dataframe in batches of some size in order to avoid getting everything at once (batch of size equal to the length of your dataframe) or one item at a time (batch size equal to 1). As you said, under certain conditions this can improve performance. Here's one way you might go about this:
import pandas as pd
df = pd.DataFrame({"Region": ["A", "B", "C", "C", "D", "E", "E", "E", "F", "F"],
                   "Product A": [1, 2, 1, 2, 2, 1, 1, 1, 1, 3]})
You don't need the lines above; they are only there so that I can replicate your dataframe. Here's the approach; feel free to change batch_size as you wish:
batch_size = 3
regions = df["Region"].unique()
b = 0
while True:
    r = regions[b*batch_size:(b+1)*batch_size]
    if len(r) == 0:
        break  # already went through all the regions
    else:
        b += 1  # increment so that the next iteration gets the next set of regions
        print(f"batch: {b}")
        sub_df = df.loc[df["Region"].isin(r)]
        print(sub_df)
sub_df holds the batched result: on each iteration it contains the rows for at most batch_size regions.
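
The same batching can also be wrapped in a small generator, which avoids the manual counter and the while True loop. This is only a sketch: region_batches is a hypothetical helper name, and the data is the toy frame from this answer rather than the asker's real one.

```python
import pandas as pd

df = pd.DataFrame({"Region": ["A", "B", "C", "C", "D", "E", "E", "E", "F", "F"],
                   "Product A": [1, 2, 1, 2, 2, 1, 1, 1, 1, 3]})

def region_batches(frame, batch_size, key="Region"):
    """Yield (regions, sub-frame) pairs covering batch_size regions at a time."""
    regions = frame[key].unique()  # unique values in order of first appearance
    for start in range(0, len(regions), batch_size):
        batch = regions[start:start + batch_size]
        yield list(batch), frame[frame[key].isin(batch)]

for regions, sub_df in region_batches(df, batch_size=3):
    print(f"regions: {regions}")
    print(sub_df)
```

The last batch is simply shorter when the number of regions is not a multiple of batch_size.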

How to return the index (or column) label of a value if column (or index) is known in pandas

How can I return the label of a column if the row is known as well as the value?
I have a pandas dataframe with rows called "A", "B", "C" and columns called "X", "Y", "Z". Knowing that a value is in a given row (e.g. A), I want to have the column label returned. In the example below, I want "X" returned when I know that the value 1 is in row "A". How can this be achieved?
data=[[1,2,3],[4,5,6],[7,8,9]]
d=pd.DataFrame(data, ["A", "B", "C"], ["X", "Y", "Z"])
   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9

If you know 1 is in row A, use loc and take the index of the matching entries:
s = d.loc['A'].eq(1)
s[s].index
which returns
Index(['X'], dtype='object')
If you know there is only one cell with value 1 in that row, use .item():
>>> s[s].index.item()
'X'

You can use dot:
d.eq(1).dot(d.columns).loc[lambda x: x != '']
A    X
dtype: object
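
Since the title asks about both directions, here is a short sketch of the same boolean-mask idea applied both ways: the column label when the row is known, and the row label when the column is known. The variable names are illustrative; the frame is the one from the question.

```python
import pandas as pd

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d = pd.DataFrame(data, index=["A", "B", "C"], columns=["X", "Y", "Z"])

# Row known ("A"), value known (1): which column(s) hold it?
cols_in_row_a = d.columns[d.loc["A"].eq(1).to_numpy()].tolist()

# Column known ("X"), value known (1): which row(s) hold it?
rows_in_col_x = d.index[d["X"].eq(1).to_numpy()].tolist()

print(cols_in_row_a, rows_in_col_x)
```

Both results are lists, so a value that occurs several times (or not at all) is handled without raising.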
