Using map in Spark Python to manipulate Resilient Distributed Datasets

I created an array using numpy's arange, and want to convert that array into an RDD using spark.sparkContext.parallelize.
np_array = [np.arange(0,300)]
rdd_numbers = spark.sparkContext.parallelize(np_array)
times_twelve = rdd_numbers.map(lambda rdd_numbers: rdd_numbers * 12)
I would now like to make an RDD called times_twelve, which is basically every number in rdd_numbers multiplied by twelve. For some reason times_twelve does not print properly; any ideas where I could have gone wrong?

Reading the comments, I can say that Shagun Sodhani is right when he says:
print(anyrdd) will not print the content of the RDD
If you want to see the content of the RDD on the screen, you can use the following command (recommended only for small RDDs):
print(times_twelve.take(times_twelve.count()))
You can check the docs here for the actions supported by Spark.
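For completeness, here is a minimal sketch (assuming Python 3 and a local SparkSession named spark) that builds the RDD and inspects a few elements. Note that passing np.arange(0, 300) directly, rather than wrapped in a list, gives an RDD of 300 numbers instead of an RDD holding a single array:
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# parallelize the numbers themselves, not a one-element list holding the array
rdd_numbers = spark.sparkContext.parallelize(np.arange(0, 300))
times_twelve = rdd_numbers.map(lambda n: n * 12)

print(times_twelve.take(10))    # first 10 elements
print(times_twelve.collect())   # whole RDD - only do this for small RDDs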

Related

Is there another way to convert ee.Number to float except getInfo()?

Hello friends!
Summarization:
I have an ee.FeatureCollection containing around 8500 ee.Point objects. I would like to calculate the distance of these points to a given coordinate, let's say (0.0, 0.0).
For this I use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input, the function takes two coordinates in the form of two tuples, each containing two floats.
Problem: When I try to convert the coordinates in the form of an ee.List to floats, I always use the getInfo() function. I know this is a callback and it is very time intensive, but I don't know another way to extract them. Long story short: extracting the data as ee.Number takes less than a second, but if I want it as floats it takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') #ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) #ee.List
fc_containing_points_length = fc_containing_points.size() #ee.Number
for index in range(fc_containing_points_length.getInfo()): # I need to convert ee.Number to int
    point_tmp = list_containing_points.get(index) #ee.ComputedObject
    point = ee.Feature(point_tmp) # transform ee.ComputedObject to ee.Feature
    coords = point.geometry().coordinates() #ee.List containing 2 ee.Numbers
    # when I run the loop without the next part,
    # I get all the data I want as ee.Number in under 1 sec
    coords_as_tuple_of_ints = (coords.getInfo()[1], coords.getInfo()[0]) # tuple containing 2 floats
    # when I add this part, the loop takes hours
PS: This is my first question, please be patient with me.
I would use .map instead of your loop. This stays server-side until you export the table (or possibly call .getInfo on the whole thing):
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
fc_containing_points = fc_containing_points.map(
    lambda feature: feature.set("distance_to_point",
                                feature.distance(ee.Feature(ee.Geometry.Point([0.0, 0.0])))))
# Then export using ee.batch.Export.table.toXXX or call getInfo
(An alternative might be to use ee.Image.paint to convert the target point to an image, then use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all points. But 1) you can only calculate distance up to a certain range and 2) I don't think it would be any faster.)
To comment on your code: you are probably aware that loops (especially client-side loops) are frowned upon in GEE (primarily for the performance reasons you've run into), but also note that any time you call .getInfo on a server-side object it incurs a performance cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
would take roughly twice as long as this:
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.table.... as above) and do all the operations locally using geopy.
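As a rough, untested sketch of that server-side approach (the export description is a placeholder name I made up), computing the distance as a property and exporting the whole table avoids any per-feature getInfo() calls:
import ee
ee.Initialize()

target = ee.Geometry.Point([0.0, 0.0])
fc = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')

# distance to the target point, computed server-side for every feature
fc_with_distance = fc.map(
    lambda feature: feature.set('distance_to_point',
                                feature.geometry().distance(target)))

task = ee.batch.Export.table.toDrive(
    collection=fc_with_distance,
    description='flensburg_distance_export',  # placeholder name
    fileFormat='CSV')
task.start()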

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json(path_to_your/file.json)
The output will be a DataFrame, which is a matrix in which the JSON attributes are the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
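If the pledge data is available as plain JSON rather than a wrapped object, here is a hedged sketch of the equivalent dictionary lookup; the file name and the exact key path are assumptions about the payload's structure, following the path given above:
import json

with open('pledge.json') as f:       # hypothetical file holding the raw JSON
    data = json.load(f)

# follow the same path as above, but with dictionary indexing
user_id = data['attributes']['social_connections']['discord']['user_id']
print(user_id)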

(Python3.x)Splitting arrays and saving them into new arrays

I'm writing a Python script intended to split a big array of numbers into equal sub-arrays. For that purpose, I use Numpy's split method as follows:
test=numpy.array_split(raw,nslices)
where raw is the complete array containing all the values, which are float64-type by the way.
nslices is the number of sub-arrays I want to create from the raw array.
In the script, nslices may vary depending on the size of the raw array, so I would like to "automatically" save each created sub-array in a particular array, such as resultsarray(i), in a similar way to how it can be done in MATLAB/Octave.
I tried to use a for ... in range loop in Python, but I am only able to save the last sub-array in a variable.
What is the correct way to save the sub-array for each increment from 1 to nslices?
Here is the complete code as it is now (I am a Python beginner, so please bear with the low level of the script).
import numpy as np
file = open("results.txt", "r")
raw = np.loadtxt(fname=file, delimiter="/n", dtype='float64')
nslices = 3
rawslice = np.array_split(raw,nslices)
for i in range(0,len(rawslice)):
resultsarray=(rawslice[i])
print(rawslice[i])
Thank you very much for your help solving this problem!
First - you screwed up the delimiter :)
It should be backslash+n \n instead of /n.
Second - as Serge already mentioned in a comment, you can just access the split parts by index (rawslice[0] to rawslice[2]). But if you really want to assign each part to a separate variable, you can do it in the following way:
result_1_of_3, result_2_of_3, result_3_of_3 = rawslice
print(result_1_of_3, result_2_of_3, result_3_of_3)
But it probably isn't the way you should go.
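Here is a small sketch of that indexing approach, assuming the same results.txt input: numpy already returns the slices in a list, so you can keep them there or map them into a dict keyed by slice number instead of overwriting one variable.
import numpy as np

raw = np.loadtxt('results.txt', delimiter='\n', dtype='float64')
nslices = 3
rawslice = np.array_split(raw, nslices)

print(rawslice[0])          # access any slice directly by index

# or collect the slices under explicit keys if named access is preferred
resultsarray = {i: part for i, part in enumerate(rawslice)}
print(resultsarray[nslices - 1])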

Converting python Dataframe to Matlab file

I am trying to convert a pandas DataFrame to a MATLAB (.mat) file.
I initially have a .txt file (an EEG signal) that I import using pandas.read_csv:
MyDataFrame = pd.read_csv("data.txt",sep=';',decimal='.'), data.txt being a 2D array with labels. This creates a dataframe which looks like this.
In order to convert it to .mat, I tried this solution where the idea is to convert the dataframe into a dictionary of lists but after trying every aspect of this solution it's still unsuccessful.
scipy.io.savemat('EEG_data.mat', {'struct':MyDataFrame.to_dict("list")})
It did create a .mat file, but it did not save my dataframe properly. The file I obtain afterwards looks like this, so all the values are basically gone, and the remaining labels you see are empty when you look into them.
I also tried using mat4py which is designed to export python structures into Matlab files, but it did not work either. I don't understand why, because converting my dataframe to a dictionary of lists is exactly what should be done according to the mat4py documentation.
I believe that the reason the previous solutions haven't worked for you is that your DataFrame column names are not valid MATLAB struct field names, because they contain spaces and/or start with digit characters.
When I do:
import pandas as pd
import scipy.io
MyDataFrame = pd.read_csv('eeg.txt',sep=';',decimal='.')
truncDataFrame = MyDataFrame[0:1000] # reduce data size for test purposes
scipy.io.savemat('EEGdata1.mat', {'struct1':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with the 4 fields reltime, datetime, iSensor and quality. Each of these has 1000 elements, so the data from these columns has been converted, but the rest of your data is missing.
However if I first rename the DataFrame columns:
truncDataFrame.rename(columns=lambda x:'col_' + x.replace(' ', '_'), inplace=True)
scipy.io.savemat('EEGdata2.mat', {'struct2':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with 36 fields. This is not the same format as your mat4py solution but it does contain (as far as I can see) all the data from the source DataFrame.
(Note that in your question, you are creating a .mat file that contains a variable called struct and when this is loaded into MATLAB it masks the builtin struct datatype - that might also cause issues with subsequent MATLAB code.)
I finally found a solution thanks to this post. There, the poster did not create a dictionary of lists but a dictionary of integers, which worked on my side. It is a small example, easily reproducible. Then I tried to manually add lists by entering values like [1, 2], and it did not work. But what worked was when I manually added tuples!
MyDataFrame needs to be converted to a dictionary, and if a dictionary of lists doesn't work, try tuples.
For beginners: lists are delimited by [] and tuples by ().
This worked for me:
import mat4py as mp
EEGdata = MyDataFrame.apply(tuple).to_dict()
mp.savemat('EEGdata.mat',{'structs': EEGdata})
EEGdata.mat should now be readable by Matlab, as it is on my side.
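As a quick sanity check before opening the file in MATLAB, here is a sketch (assuming scipy is available and the file was written as above) that reloads it from Python:
import scipy.io

check = scipy.io.loadmat('EEGdata.mat')
print(check.keys())   # 'structs' should appear among the top-level entries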

How to find pyspark dataframe memory usage?

For a pandas DataFrame, the info() function provides memory usage.
Is there any equivalent in PySpark?
Thanks
Try to use the _to_java_object_rdd() function:
import py4j.protocol
from py4j.protocol import Py4JJavaError
from py4j.java_gateway import JavaObject
from py4j.java_collections import JavaArray, JavaList
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
# the DataFrame whose size you'd like to estimate
df
# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.
    It will convert each Python object into a Java object via Pyrolite, whether
    the RDD is serialized in batches or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
# First you have to convert it to an RDD
JavaObj = _to_java_object_rdd(df.rdd)
# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
I have something in mind; it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get DataFrame memory usage, but a pandas DataFrame does. So what you can do is:
Select a 1% sample of the data: sample = df.sample(fraction = 0.01)
pdf = sample.toPandas()
Get the pandas DataFrame memory usage with pdf.info()
Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage.
Correct me if I am wrong :|
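A consolidated sketch of that heuristic (assuming an existing DataFrame df); it uses pandas' memory_usage(deep=True) instead of info() so the number can be multiplied programmatically:
sample_pdf = df.sample(fraction=0.01).toPandas()

# deep=True also counts the payload of object columns, not just the buffers
sample_bytes = sample_pdf.memory_usage(deep=True).sum()
estimated_total_bytes = sample_bytes * 100   # extrapolate the 1% sample

print(f"Estimated size: {estimated_total_bytes / 1024 ** 2:.1f} MB")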
As per the documentation:
The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
For the dataframe df you can do this:
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
You can persist the dataframe in memory and take an action such as df.count(). You would then be able to check the size under the Storage tab on the Spark web UI. Let me know if it works for you.
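A minimal sketch of that approach, assuming an existing DataFrame df:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)   # keep the data in memory
df.count()                             # action to materialize the cache
# the occupied size now shows up under the Storage tab of the Spark web UI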
How about the below? The cached sample's size shows up in the Storage tab (in KB); multiply by 100 to get the estimated real size.
df.sample(fraction = 0.01).cache().count()
