I'm using Spark SQL DataFrames.
df = sql.read.parquet("toy_data")
df.show()
+-----------+----------+
| x| y|
+-----------+----------+
| -4.5707927| -5.282721|
| -5.762503| -4.832158|
| 7.907721| 6.793022|
| 7.4408655| -6.601918|
| -4.2428184| -4.162871|
I have a list of tuples with the following structure:
(Row(x=-8.45811653137207, y=-5.179722309112549), ((-1819.748514533043, 47.745243303477764), 333))
where the first element is a point and the second element is a (sum_of_points, number_of_points) tuple.
When I divide the sum_of_points by the num_of_points, like this:
new_centers = center_sum_num.map(lambda tup: np.asarray(tup[1][0])/tup[1][1]).collect()
I get the following, which is an array of numpy arrays.
[array([-0.10006594, -6.7719144 ]), array([-0.25844196, 5.28381418]), array([-5.12591623, -4.5685448 ]), array([ 5.40192709, -4.35950824])]
However, I want to keep them as points in the original format, like this:
[Row(x=-5.659833908081055, y=7.705344200134277), Row(x=3.17942214012146, y=-9.446121215820312), Row(x=9.128270149230957, y=4.5666022300720215), Row(x=-6.432034969329834, y=-4.432190895080566)]
Meaning I don't want a list of numpy arrays - I want a list of Row(x=..., y=...) objects.
How can I do this?
My full code is below for reference:
new_centers = [Row(x=-5.659833908081055, y=7.705344200134277), Row(x=3.17942214012146, y=-9.446121215820312), Row(x=9.128270149230957, y=4.5666022300720215), Row(x=-6.432034969329834, y=-4.432190895080566)]
while old_centers is None or not has_converged(old_centers, new_centers, epsilon) and iteration < max_iterations:
    # update centers
    old_centers = new_centers
    # note: nearest_center()[0] is the index of the closest center
    center_pt_1 = points.rdd.map(lambda point: (old_centers[nearest_center(old_centers, point)[0]], (point, 1)))
    center_sum_num = center_pt_1.reduceByKey(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1]), a[1] + b[1]))
    new_centers = center_sum_num.map(lambda tup: np.asarray(tup[1][0]) / tup[1][1]).collect()
    iteration += 1
return new_centers
Define the structure
from pyspark.sql import Row
row = Row("x", "y")
and unpack results:
x = (
    Row(x=-8.45811653137207, y=-5.179722309112549),
    ((-1819.748514533043, 47.745243303477764), 333)
)
f = lambda tup: row(*np.asarray(tup[1][0]) / tup[1][1])
f(x)
## Row(x=-5.4647102538529815, y=0.14337910901945275)
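Plugged back into the loop from the question, the same idea returns Rows directly instead of numpy arrays (a minimal sketch, reusing the row template above and the RDD names from the question):
new_centers = center_sum_num.map(
    lambda tup: row(*(np.asarray(tup[1][0]) / tup[1][1]))
).collect()
# new_centers is now a list of Row(x=..., y=...) objects rather than numpy arrays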
I have two dataframes (attached image). For each given row in Table-1:
Part 1 - I need to find the row in Table-2 which gives the minimum Euclidean distance. Output-1 is the expected answer.
Part 2 - I need to find the row in Table-2 which gives the minimum Euclidean distance. Output-2 is the expected answer. The only difference here is that a row from Table-2 cannot be selected twice.
I tried this code to get the distances, but I'm not sure how to add the other fields:
import numpy as np
from scipy.spatial import distance
s1 = np.array([(2,2), (3,0), (4,1)])
s2 = np.array([(1,3), (2,2),(3,0),(0,1)])
print(distance.cdist(s1,s2).min(axis=1))
Two dataframes and the expected output:
The code below gives the desired output, and there is a commented-out print statement for extra output.
It's also flexible to different list lengths.
Credit also to: How can the Euclidean distance be calculated with NumPy?
Hope it helps:
import numpy
from numpy import linalg as LA

list1 = [(2,2), (3,0), (4,1)]
list2 = [(1,3), (2,2), (3,0), (0,1)]

names = range(0, len(list1) + len(list2))
names = [chr(ord('`') + number + 1) for number in names]

i = -1
j = len(list1)  # start of Table-2 names
for tup1 in list1:
    collector = {}  # collect the distances for each minimum check
    j = len(list1)
    i += 1
    name1 = names[i]
    for tup2 in list2:
        name2 = names[j]
        a = numpy.array(tup1)
        b = numpy.array(tup2)
        # print("{} | {} -->".format(name1, name2), tup1, tup2, " ", numpy.around(LA.norm(a - b), 2))
        j += 1
        collector["{} | {}".format(name1, name2)] = numpy.around(LA.norm(a - b), 2)
        if j == len(names):
            min_key = min(collector, key=collector.get)
            print(min_key, "-->", collector[min_key])
Output:
a | e --> 0.0
b | f --> 0.0
c | f --> 1.41
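For Part 1, the cdist matrix from the question can also tell you which Table-2 row is closest, not just how far away it is; a minimal sketch using argmin (here index 1 corresponds to 'e' and index 2 to 'f' in the naming scheme above):
import numpy as np
from scipy.spatial import distance

s1 = np.array([(2,2), (3,0), (4,1)])         # Table-1 points
s2 = np.array([(1,3), (2,2), (3,0), (0,1)])  # Table-2 points

d = distance.cdist(s1, s2)     # pairwise Euclidean distances, shape (3, 4)
nearest = d.argmin(axis=1)     # index of the closest Table-2 row for each Table-1 row
print(nearest.tolist())        # [1, 2, 2]
print(d.min(axis=1).tolist())  # [0.0, 0.0, 1.4142135623730951]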
I tried taking one schema as the common schema via df.schema and loading all the CSV files against it, but that fails because the headers of the other CSV files don't match the assigned schema.
Any suggestions, such as a function or a Spark script, would be appreciated.
As I understand it, you want to union/merge files with different schemas (though each is a subset of one master schema).
I wrote this function unionPro, which I think suits your requirement:
EDIT - Added a Pyspark version
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{StructField, StructType}

def unionPro(DFList: List[DataFrame], caseDiff: String = "Y"): DataFrame = {

  val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active

  /**
   * This function accepts DataFrames with the same or different schema/column order, with some or no common columns,
   * and creates a unioned DataFrame.
   */

  //"This doesn't preserve order------------------------------------"
  //val MasterColList2 = DFList.map(_.columns.toSet).flatMap(x => x).toSet

  val inputDFList = if (caseDiff == "N")
    DFList
  else {
    DFList.map(df => {
      val cols = df.columns
      val selector = cols.map(x => col(x).alias(x.toLowerCase))
      df.select(selector: _*)
    })
  }

  //"This preserves order------------------------------------"
  val masterColStrList: Array[String] = inputDFList.map(x => x.columns).reduce((x, y) => (x.union(y))).distinct

  //val masterColList = ???

  //Create masterSchema ignoring different DataType & nullable in StructField, treating fields as the same based on name, ignoring case
  val ignoreNullable: StructField => StructField = x => StructField(x.name, x.dataType, true)

  val masterSchema = StructType(inputDFList.map(_.schema.fields.map(ignoreNullable)).reduce((x, y) => (x.union(y))).groupBy(_.name.toLowerCase).map(_._2.head).toArray)

  def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
    allCols.toList.map(x => x match {
      case x if myCols.contains(x) => col(x)
      case _ => lit(null).as(x)
    })
  }

  // Create an empty DataFrame with the master schema and column order
  val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(masterColStrList.head, masterColStrList.tail: _*)

  /*
  val df1 = DFList(0)
  val df1cols = df1.columns
  val masterEmptyDF = df1.select(unionExpr(df1cols, MasterColList): _*).where(lit(1) === lit(2))
  val DFColumns: List[Array[Column]] = DFList.map(_.columns).map(unionExpr(_, MasterColList).toArray)
  val unioned_Data = DFList.zip(DFColumns).map(x => x._1.select(x._2: _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))*/

  //For union/unionAll the sequence of columns needs to be the same; use unionByName otherwise.
  //Passing masterColStrList to ensure the columns are in the correct order.
  inputDFList.map(df => df.select(unionExpr(df.columns, masterColStrList): _*)).foldLeft(masterEmptyDF)((x, y) => x.unionByName(y))
  //inputDFList.map(df => df.select(unionExpr(df.columns, masterColStrList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is a sample test for it (the function picks up the active SparkSession itself, so it doesn't need spark as an argument):
import spark.implicits._

val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1), ("D", 2)).toDF("Name", "Sal")

unionPro(List(aDF, bDF)).show
This gives the following output:
+----+----+----+
|Name| ID| Sal|
+----+----+----+
| A| 1|null|
| B| 2|null|
| C|null| 1|
| D|null| 2|
+----+----+----+
Here's the PySpark version of it:
from functools import reduce
from typing import List

from pyspark.sql import Column, DataFrame, functions as F
from pyspark.sql.types import StructField, StructType

def unionPro(DFList: List[DataFrame], caseDiff: str = "N") -> DataFrame:
    """
    :param DFList: DataFrames with the same or different schema/column order, with some or no common columns
    :param caseDiff: "N" keeps column names as-is; anything else lowercases them so names match case-insensitively
    :return: a unioned DataFrame
    """
    inputDFList = DFList if caseDiff == "N" else [
        df.select([F.col(x).alias(x.lower()) for x in df.columns]) for df in DFList
    ]

    # "This preserves order (OrderedDict) -----------------------------------"
    from collections import OrderedDict
    ## column names (strings) are hashable
    masterColStrList = list(OrderedDict.fromkeys(reduce(lambda x, y: x + y, [df.columns for df in inputDFList])))

    # Create masterSchema ignoring different DataType & nullable in StructField, treating fields as the same based on name, ignoring case
    ignoreNullable = lambda x: StructField(x.name, x.dataType, True)

    import itertools
    # to get reliable results from groupby, the iterable must be sorted by the grouping key;
    # in sorted() the key function (lambda) must be passed as a keyword argument.
    # Sorting loses the original column order, hence masterColStrList is used for the final select.
    masterSchema = StructType([list(y)[0] for x, y in itertools.groupby(
        sorted(reduce(lambda x, y: x + y, [[ignoreNullable(x) for x in df.schema.fields] for df in inputDFList]),
               key=lambda x: x.name),
        lambda x: x.name)])

    def unionExpr(myCols: List[str], allCols: List[str]) -> List[Column]:
        return [F.col(x) if x in myCols else F.lit(None).alias(x) for x in allCols]

    # Create an empty DataFrame (assumes an active SparkSession named `spark` in scope)
    masterEmptyDF = spark.createDataFrame([], masterSchema)

    return reduce(lambda x, y: x.unionByName(y),
                  [df.select(unionExpr(df.columns, masterColStrList)) for df in inputDFList],
                  masterEmptyDF).select(masterColStrList)
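A quick usage sketch for the PySpark version, assuming an active SparkSession named spark; it mirrors the Scala test above and should produce the same Name/ID/Sal table:
aDF = spark.createDataFrame([("A", 1), ("B", 2)], ["Name", "ID"])
bDF = spark.createDataFrame([("C", 1), ("D", 2)], ["Name", "Sal"])
unionPro([aDF, bDF]).show()
If you are on Spark 3.1+ and only have two DataFrames, aDF.unionByName(bDF, allowMissingColumns=True) gives the same null-filled union without a helper function.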
I have some data in a pandas dataframe like so:
| Data           |
|----------------|
| 10-9 8-6 100-2 |
| 1-2 3-4        |
| 55-45          |
Now my question is, using pandas, what is the best way to do the following:
Calculate the average of the first numbers before the hyphen, and the average of the numbers after the hyphen.
Subtract the second from the first, and place the result in a new column.
For example, for the first row, the value in the new column will be: average(10, 8, 100) - average(9, 6, 2)
I am guessing I will need to use some sort of lambda function, but I am not sure how to go about it.
Any help is appreciated. Thank you!
Make a function to contain the string parsing logic:
import pandas as pd
import numpy as np

def string_handling(string):
    values = [it for it in string.strip().split(' ') if it]
    values = [v.split('-') for v in values]
    first_values = [int(v[0]) for v in values]
    second_values = [int(v[1]) for v in values]
    return pd.Series([np.mean(first_values), np.mean(second_values)])
Apply the function:
df[['first_value','second_value']] = df['Data'].apply(string_handling)
df['diff'] = df['first_value'] - df['second_value']
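As a quick check, the diff column should come out to roughly 33.67, -1.0 and 10.0 for the three sample rows (for example, row one is mean(10, 8, 100) - mean(9, 6, 2)):
print(df['diff'].round(2).tolist())
# [33.67, -1.0, 10.0]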
This might do the trick. split() will get rid of all the whitespace, and a list comprehension goes through all the tokens created by split() (e.g. ['10-9', '8-6', '100-2']).
In [37]: df = DataFrame({'Data': [" 10-9 8-6 100-2 ",
    ...:                          " 1-2 3-4 ",
    ...:                          " 55-45 "]})
In [38]: def process(cell):
...: avg = []
...: for i in range(2):
...: l = [int(x.split("-")[i]) for x in cell.split()]
...: avg.append(sum(l) * 1. / len(l))
...: return avg[0] - avg[1]
...:
In [39]: df['Data'].apply(process)
Out[39]:
0 33.666667
1 -1.000000
2 10.000000
Name: Data, dtype: float64
Hope this helps!
I have an array ar = [2,2,2,1,1,2,2,3,3,3,3].
For this array, I want to find the lengths of runs of consecutive identical numbers, like:
values: 2, 1, 2, 3
lengths: 3, 2, 2, 4
In R, this is obtained with the rle() function. Is there an existing function in Python which provides the required output?
You can do this with itertools.groupby:
from itertools import groupby
ar = [2,2,2,1,1,2,2,3,3,3,3]
print([(k, sum(1 for i in g)) for k,g in groupby(ar)])
# [(2, 3), (1, 2), (2, 2), (3, 4)]
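If you want the values and lengths as two separate sequences, the way rle() reports them, you can unzip the same result (a small sketch building on the snippet above):
from itertools import groupby

ar = [2,2,2,1,1,2,2,3,3,3,3]
values, lengths = zip(*[(k, sum(1 for _ in g)) for k, g in groupby(ar)])
print(values)   # (2, 1, 2, 3)
print(lengths)  # (3, 2, 2, 4)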
Here is an answer for pure numpy:
import numpy as np

def find_runs(x):
    """Find runs of consecutive items in an array."""

    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]

    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])

    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]

        # find run values
        run_values = x[loc_run_start]

        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))

        return run_values, run_starts, run_lengths
Credit goes to https://github.com/alimanfoo
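A quick usage sketch on the array from the question (the three returned arrays are the run values, the run start indices, and the run lengths):
values, starts, lengths = find_runs([2,2,2,1,1,2,2,3,3,3,3])
print(values)   # [2 1 2 3]
print(starts)   # [0 3 5 7]
print(lengths)  # [3 2 2 4]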
Here is an answer using the high-performance pyrle library for run length arithmetic:
# pip install pyrle
# (pyrle >= 0.0.25)
from pyrle import Rle
v = [2,2,2,1,1,2,2,3,3,3,3]
r = Rle(v)
print(r)
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 2 | 1 | 2 | 3 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
print(r[4])
# 1.0
print(r[4:7])
# +--------+-----+-----+
# | Runs | 1 | 2 |
# |--------+-----+-----|
# | Values | 1.0 | 2.0 |
# +--------+-----+-----+
# Rle of length 3 containing 2 elements
r + r + 0.5
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 4.5 | 2.5 | 4.5 | 6.5 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
Here is an optimized answer using numpy arrays which runs quickly if the run lengths are long. In this case I want to encode an array of uint16 values whose length can be much larger than 2**16 using 16 bit unsigned integer run length encoding. To allow this, the array is "chunked" so the indices never exceed 2**16:
import numpy as np

def run_length_encode(array, chunksize=((1 << 16) - 1), dtype=np.int16):
    "Chunked run length encoding for very large arrays containing smallish values."
    shape = array.shape
    ravelled = array.ravel()
    length = len(ravelled)
    chunk_cursor = 0
    runlength_chunks = []
    while chunk_cursor < length:
        chunk_end = chunk_cursor + chunksize
        chunk = ravelled[chunk_cursor : chunk_end]
        chunk_length = len(chunk)
        change = (chunk[:-1] != chunk[1:])
        change_indices = np.nonzero(change)[0]
        nchanges = len(change_indices)
        cursor = 0
        runlengths = np.zeros((nchanges + 1, 2), dtype=dtype)
        for (count, index) in enumerate(change_indices):
            next_cursor = index + 1
            runlengths[count, 0] = chunk[cursor]        # value
            runlengths[count, 1] = next_cursor - cursor # run length
            cursor = next_cursor
        # last run
        runlengths[nchanges, 0] = chunk[cursor]
        runlengths[nchanges, 1] = chunk_length - cursor
        runlength_chunks.append(runlengths)
        chunk_cursor = chunk_end
    all_runlengths = np.vstack(runlength_chunks).astype(dtype)
    description = dict(
        shape=shape,
        runlengths=all_runlengths,
        dtype=dtype,
    )
    return description

def run_length_decode(description):
    dtype = description["dtype"]
    runlengths = description["runlengths"]
    shape = description["shape"]
    array = np.zeros(shape, dtype=dtype)
    ravelled = array.ravel()
    cursor = 0
    for (value, size) in runlengths:
        run_end = cursor + size
        ravelled[cursor : run_end] = value
        cursor = run_end
    array = ravelled.reshape(shape)  # redundant?
    return array

def testing():
    A = np.zeros((50,), dtype=np.uint16)
    A[20:30] = 10
    A[30:35] = 6
    A[40:] = 3
    test = run_length_encode(A, chunksize=17)
    B = run_length_decode(test)
    assert np.alltrue(A == B)
    print("ok!")

if __name__ == "__main__":
    testing()
I built this for a project having to do with classifying microscopy images of mouse embryos.
https://github.com/flatironinstitute/mouse_embryo_labeller
Note: I edited the entry after I found I had to cast the type in this line for it to work for large arrays:
all_runlengths = np.vstack(runlength_chunks).astype(dtype)
I have the following dataset in numpy
indices | real data (X) |targets (y)
| |
0 0 | 43.25 665.32 ... |2.4 } 1st block
0 0 | 11.234 |-4.5 }
0 1 ... ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
These are my variables:
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
I also have a variable W which is a 3D array.
What I need to do is loop through all the blocks in the dataset, compute a scalar for each block, sum up all the scalars, and store the total in a variable called cost. The problem is that the looping implementation is very slow, so I'm trying to vectorize it if possible. This is my current code. Is it possible to do this without for loops in numpy?
IDX1 = 0
IDX2 = 1

# get unique indices
idx1s = np.arange(len(np.unique(data[:,IDX1])))
idx2s = np.arange(len(np.unique(data[:,IDX2])))

# initialize global sum variable to 0
cost = 0
for i1 in idx1s:
    for i2 in idx2s:
        # for each block in the dataset
        mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))

        # get variables for that block
        curr_X = X[mask,:]
        curr_y = y[mask]
        curr_W = W[:,i2,i1]

        # calculate a scalar
        pred = np.dot(curr_X, curr_W)
        sigm = 1.0 / (1.0 + np.exp(-pred))
        loss = np.sum((sigm - 0.5) * curr_y)

        # add result to global cost
        cost += loss
Here is some sample data
data = np.array([[0,0,5,5,7],
                 [0,0,5,5,7],
                 [0,1,5,5,7],
                 [0,1,5,5,7],
                 [1,0,5,5,7],
                 [1,1,5,5,7]])
W = np.zeros((2,2,2))
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
That W was tricky... Actually, your blocks are pretty irrelevant, apart from getting the right slice of W to do the np.dot with the corresponding X, so I went the easy route of creating an aligned_W array as follows:
aligned_W = W[:, idx2, idx1]
This is an array of shape (2, rows) where rows is the number of rows of your data set. You can now proceed to do your whole calculation without any for loops as:
from numpy.core.umath_tests import inner1d

pred = inner1d(X, aligned_W.T)
sigm = 1.0 / (1.0 + np.exp(-pred))
loss = (sigm - 0.5) * y
cost = np.sum(loss)
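One caveat: numpy.core.umath_tests is a private module and inner1d is deprecated in newer NumPy releases; the same row-wise dot product can be written with einsum instead (a drop-in sketch keeping the names above):
# pred[i] is the dot product of row i of X with row i of aligned_W.T
pred = np.einsum('ij,ij->i', X, aligned_W.T)
sigm = 1.0 / (1.0 + np.exp(-pred))
cost = np.sum((sigm - 0.5) * y)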
My guess is the major reason your code is slow is the following line:
mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
This is because you repeatedly scan your input arrays for a small number of rows of interest. So you need to do the following:
ni1 = len(np.unique(data[:,IDX1]))
ni2 = len(np.unique(data[:,IDX2]))
idx1s = np.arange(ni1)
idx2s = np.arange(ni2)
key = data[:,IDX1] * ni2 + data[:,IDX2] # 1D key to the rows
sortids = np.argsort(key) #indices to the sorted key
Then inside the loop instead of
mask=np.nonzero(...)
you need to do
curid = i1 * ni2 + i2
left = np.searchsorted(key, curid, 'left', sorter=sortids)
right = np.searchsorted(key, curid, 'right', sorter=sortids)
mask = sortids[left:right]
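Putting the pieces together, a sketch of the reworked loop on the sample data from the question (IDX1, IDX2, X, y and W as defined there) might look like this:
ni1 = len(np.unique(data[:,IDX1]))
ni2 = len(np.unique(data[:,IDX2]))
key = data[:,IDX1] * ni2 + data[:,IDX2]   # 1D key to the rows
sortids = np.argsort(key)                 # indices to the sorted key

cost = 0
for i1 in range(ni1):
    for i2 in range(ni2):
        curid = i1 * ni2 + i2
        left = np.searchsorted(key, curid, 'left', sorter=sortids)
        right = np.searchsorted(key, curid, 'right', sorter=sortids)
        mask = sortids[left:right]        # row indices for this block, no full scan
        curr_X = X[mask,:]
        curr_y = y[mask]
        curr_W = W[:,i2,i1]
        pred = np.dot(curr_X, curr_W)
        sigm = 1.0 / (1.0 + np.exp(-pred))
        cost += np.sum((sigm - 0.5) * curr_y)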
I don't think there is a way to compare numpy arrays of different sizes without using for loops. It would be hard to decide what the meaning and shape of the output would be for something like
[0,1,2,3,4] == [3,4,2]
The only suggestion that I can give you is to get rid of one of the for loops using itertools.product:
import itertools as it

[...]

idx1s = np.unique(data[:,IDX1])
idx2s = np.unique(data[:,IDX2])

# initialize global sum variable to 0
cost = 0
for i1, i2 in it.product(idx1s, idx2s):
    # for each block in the dataset
    mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
    # get variables for that block
    curr_X = X[mask,:]
    curr_y = y[mask]
    [...]
You can also keep mask as a bool array
mask = (data[:,IDX1] == i1) & (data[:,IDX2] == i2)
The output is the same, and you have to use the memory to create the bool array anyway; doing it this way saves you some memory and a function call.
EDIT
If you know that the indices have no holes, or only a few, it might be worth removing the part where you define idx1s and idx2s and changing the for loop to:
max1, max2 = data[:,[IDX1, IDX2]].max(axis=0)
for i1, i2 in it.product(xrange(max1 + 1), xrange(max2 + 1)):
[...]
Both xrange and it.product are iterators, so they only create i1 and i2 when you need them.
PS: if you are on Python 3.x, use range instead of xrange.