PySpark how to convert an rdd to string - python

I need to pass coordinates in a URL, so I need to convert the RDD to a string with the coordinate pairs separated by semicolons.
all_coord_iso_rdd.take(4)
[(-73.57534790039062, 45.5311393737793),
(-73.574951171875, 45.529457092285156),
(-73.5749282836914, 45.52922821044922),
(-73.57501220703125, 45.52901077270508)]
type(all_coord_iso_rdd)
pyspark.rdd.PipelinedRDD
The result I am looking for:
"-73.57534790039062,45.5311393737793;-73.574951171875,45.529457092285156,
-73.5749282836914,45.52922821044922;-73.57501220703125,45.52901077270508"
The form of my URL should be as follows:
http://127.0.0.1/match/v1/driving/-73.57534790039062,45.5311393737793; -73.574951171875,45.529457092285156,-73.5749282836914,45.52922821044922;-73.57501220703125,45.52901077270508

From the snippet you posted, all_coord_iso_rdd is an RDD in which each row is a tuple(float, float). Calling take(n) returns n records from the RDD as a plain Python list.
x = all_coord_iso_rdd.take(4)
print(x)
#[(-73.57534790039062, 45.5311393737793),
# (-73.574951171875, 45.529457092285156),
# (-73.5749282836914, 45.52922821044922),
# (-73.57501220703125, 45.52901077270508)]
The value returned is simply a list of tuples of floating point numbers. To convert it into the desired format, we can use str.join inside of a list comprehension.
First, you need to convert the floats to str and then we can join the values in each tuple using a ",". We use map(str, ...) to map each value to a str.
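This yields (the output below is from Python 2, where str() shortens floats to 12 significant digits; Python 3 prints every digit):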
print([",".join(map(str, item)) for item in x])
#['-73.5753479004,45.5311393738',
# '-73.5749511719,45.5294570923',
# '-73.5749282837,45.5292282104',
# '-73.575012207,45.5290107727']
Finally join the resultant list using ";" to get your desired output.
print(";".join([",".join(map(str, item)) for item in x]))
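Put together, the steps above can be sketched as a small helper that also builds the URL. The function name coords_to_path and the BASE_URL constant are made up for this illustration; the base URL itself is the one from the question. repr() is used instead of str() so that no digits are dropped on Python 2.

```python
# Sample coordinates copied from the question (longitude, latitude).
coords = [
    (-73.57534790039062, 45.5311393737793),
    (-73.574951171875, 45.529457092285156),
    (-73.5749282836914, 45.52922821044922),
    (-73.57501220703125, 45.52901077270508),
]


def coords_to_path(pairs):
    """Join the values of each pair with ',' and the pairs with ';'."""
    # repr() keeps the full float precision on both Python 2.7 and 3.
    return ";".join(",".join(repr(v) for v in pair) for pair in pairs)


# Base URL taken from the question; adjust it to your own server.
BASE_URL = "http://127.0.0.1/match/v1/driving/"
url = BASE_URL + coords_to_path(coords)
print(url)
```

In a Spark job, the list passed in would be the result of all_coord_iso_rdd.take(n) or collect().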

Here is a pure Spark way of doing the same (may be useful for larger RDDs or different use cases):
# sc is an existing SparkContext; the list is named coords so it does not shadow the built-in list
coords = [(-73.57534790039062, 45.5311393737793), (-73.574951171875, 45.529457092285156),
          (-73.5749282836914, 45.52922821044922), (-73.57501220703125, 45.52901077270508)]
rdd = sc.parallelize(coords)
# Turn each tuple into "lon,lat", then concatenate the rows with ";"
rdd.map(lambda row: ",".join(str(elt) for elt in row)) \
   .reduce(lambda x, y: ";".join([x, y]))

Related

How can I define a function that reads and returns all values in the list using Python?

I have this string delimited by commas.
'1.0,5.0,6.0,7.0,8.0,9.0'
def var():
    for i in listnumbers:
        return i +'.0'
When I do
var()
I only get
1.0
How do I get the result to include all the numbers in a loop?
1.0,5.0,6.0,7.0,8.0,9.0
def myfun(mycsv):
    return [i+'.0' for i in mycsv.split(',')]
print(myfun('1.0,5.0,6.0,7.0,8.0,9.0'))
#['1.0.0', '5.0.0', '6.0.0', '7.0.0', '8.0.0', '9.0.0']
If you want a string, then just use join:
print(','.join(myfun('1.0,5.0,6.0,7.0,8.0,9.0')))
Or change the function to return a string:
    return ','.join([i+'.0' for i in mycsv.split(',')])
You are returning inside the for loop, before the loop has completed, so the function exits on the first iteration.
If I understood your question correctly, what you are looking for is a list comprehension.
If your input is a list:
def var(l):
    return [i + '.0' for i in l]
If your input is a string, as it seems from your description, you have to split it first:
def var(l):
    return [i + '.0' for i in l.split(',')]
This is equivalent to mapping in other languages.
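As a minimal sketch of the difference, the three functions below (their names are invented for this example) show the early return, the list comprehension, and the map() form side by side. str.strip is only a placeholder transformation, chosen because it leaves these values unchanged.

```python
def first_only(csv_text):
    # Bug pattern from the question: return sits inside the loop,
    # so the function exits during the first iteration.
    for item in csv_text.split(','):
        return item


def all_items(csv_text):
    # The comprehension visits every element before anything is returned.
    return [item for item in csv_text.split(',')]


def all_items_mapped(csv_text):
    # Same result with map(); list() is needed on Python 3,
    # where map() returns a lazy iterator.
    return list(map(str.strip, csv_text.split(',')))


numbers = '1.0,5.0,6.0,7.0,8.0,9.0'
print(first_only(numbers))
print(','.join(all_items(numbers)))
```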
You can split your string into a list using string.split(','), then iterate over the freshly created list and print each element. The code can be arranged like this:
for s in string.split(','):
    print(s+'.0')

Mix list comprehension on character split and int() conversion...?

Given following variable:
fsw="M543x620S30006482x483S14c10520x483S14c51498x537S14c39492x593S20500496x582S22a04494x564"
if I do this:
z=[sub.split('x') for sub in re.findall("\d{3}x\d{3}",fsw[8:])]
it returns :
[['482', '483'], ['520', '483'], ['498', '537'], ['492', '593'], ['496', '582'], ['494', '564']]
but I'd like to get a list of pairs of integers ([[482,483],[520,483],...]). Is there a one-liner that would do this?
Thanks.
z=[list(map(int, sub.split('x'))) for sub in re.findall(r"\d{3}x\d{3}",fsw[8:])]
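For a self-contained check, here is the same idea written as a nested list comprehension, using the fsw string from the question. The variable name pairs is chosen only for this sketch.

```python
import re

fsw = ("M543x620S30006482x483S14c10520x483S14c51498x537"
       "S14c39492x593S20500496x582S22a04494x564")

# Outer comprehension: one entry per "NNNxNNN" match after the first 8 characters.
# Inner comprehension: convert both halves of the match to int.
pairs = [[int(n) for n in sub.split('x')]
         for sub in re.findall(r"\d{3}x\d{3}", fsw[8:])]
print(pairs)
```

The raw string (r"...") keeps Python from interpreting \d as a string escape.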

How to filter a list without converting to string or loop it

I've got an object of type list and second object of type string.
I would like to filter for all values in the list-object which do not match the value of the string-object.
I have created a loop which splits the list into string and with regex found all those not matching and added these results to a new list.
This example uses hostnames "ma-tsp-a01", "ma-tsp-a02" and "ma-tsp-a03".
Currently I do further work on this new list to create a clean list of hostnames.
import re
local_hostname = 'ma-tsp-a01'
profile_files = ['/path/to/file/TSP_D01_ma-tsp-a01\n', \
'/path/to/file/TSP_D02_ma-tsp-a02\n', \
'/path/to/file/TSP_ASCS00_ma-tsp-a01\n', \
'/path/to/file/TSP_DVEBMGS03_ma-tsp-a03\n', \
'/path/to/file/TSP_DVEBMGS01_ma-tsp-a01\n']
result_list = [local_hostname]
for list_obj in profile_files:
    if re.search(".*\w{3}\_\w{1,7}\d{2}\_(?!"+local_hostname+").*", list_obj):
        result_list.append(list_obj.split("/")[-1].splitlines()[0].\
            split("_")[-1])
print(result_list)
At the end I get the following output
['ma-tsp-a01', 'ma-tsp-a02', 'ma-tsp-a03']. This looks like exactly what I am looking for, but is there a more Pythonic way to do this without the "for" loop?
You can create a filter object:
filtered = filter(lambda x: re.search(".*\w{3}\_\w{1,7}\d{2}\_(?!"+local_hostname+").*", x), profile_files)
Or use a generator comprehension:
filtered = (x for x in profile_files if re.search(".*\w{3}\_\w{1,7}\d{2}\_(?!"+local_hostname+").*", x))
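Both behave the same: in Python 3 each is a lazy iterable that yields the matching entries of profile_files when you loop over it or pass it to list().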
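If the goal is the final hostname list rather than the matching paths, a regex-free variant is possible. This is only a sketch under one assumption: the hostname is always the text after the last underscore of each entry. Unlike the original loop, it also drops repeated hostnames.

```python
local_hostname = 'ma-tsp-a01'
profile_files = [
    '/path/to/file/TSP_D01_ma-tsp-a01\n',
    '/path/to/file/TSP_D02_ma-tsp-a02\n',
    '/path/to/file/TSP_ASCS00_ma-tsp-a01\n',
    '/path/to/file/TSP_DVEBMGS03_ma-tsp-a03\n',
    '/path/to/file/TSP_DVEBMGS01_ma-tsp-a01\n',
]

# Assumption: the hostname is whatever follows the last underscore.
hostnames = (path.strip().rsplit('_', 1)[-1] for path in profile_files)

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
others = dict.fromkeys(h for h in hostnames if h != local_hostname)

result_list = [local_hostname] + list(others)
print(result_list)
```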

byte literal array to string

I am new to Python. I am calling an external service and printing the data, which is basically an array of byte literals.
results = q.sync('([] string 2#.z.d; `a`b)')
print(results)
[(b'2018.06.15', b'a') (b'2018.06.15', b'b')]
To display it without the b, I am looping through the elements and decoding them, but this loses the whole structure.
for x in results:
    for y in x:
        print(y.decode())
2018.06.15
a
2018.06.15
b
Is there a way to convert the full byte literal array to a string array (either of the following), or do I need to write a concatenate function to stitch it back?
('2018.06.15', 'a') ('2018.06.15', 'b')
(2018.06.15,a) (2018.06.15,b)
something like the following (though I want to avoid this approach )
for x in results:
    s=""
    for y in x:
        s+="," +y.decode()
    print(s)
,2018.06.15,a
,2018.06.15,b
Following the previous answer, and using the results variable from the question, your command should be as follows.
This code will result in a list of decoded tuples:
[tuple(x.decode() for x in item) for item in results]
The following loop builds and prints one decoded tuple per row:
for item in results:
    t = ()
    for x in item:
        t = t + (x.decode(),)
    print(t)
You can also do it in one line with map, which again gives you back a list of decoded tuples:
[tuple(map(bytes.decode, row)) for row in results]
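A runnable sketch of the decoding step follows. It assumes the service result behaves like a sequence of tuples of bytes, so a plain list stands in for the value returned by q.sync(...). The names decoded and formatted are invented for the example.

```python
# Stand-in for the data returned by q.sync(...): plain tuples of bytes.
results = [(b'2018.06.15', b'a'), (b'2018.06.15', b'b')]

# Decode field by field while keeping the row structure intact.
decoded = [tuple(field.decode() for field in row) for row in results]
print(decoded)

# Render each row as "(date,value)" when the quotes are not wanted.
formatted = ' '.join('(' + ','.join(row) + ')' for row in decoded)
print(formatted)
```

The first print gives the quoted form asked for in the question, the second the unquoted one.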

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command','f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place to use a list comprehension, but that's still in my grey basket of understanding fully (The Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int( res[0] ) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers. Either change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
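A short sketch of the set-based version is given below, run on a small excerpt of the question's data so it stays readable. The variable naive is added only to show why the original comparison finds nothing.

```python
# Small excerpt of the question's data, enough to show the idea.
resource_types = [('0', 'Group', '0'), ('1', 'User', '1'),
                  ('2', 'Filter', '2'), ('3', 'Agent', '3'),
                  ('4', 'Asset', '4'), ('10', 'ArchivedReport', ':')]

# A set gives constant-time membership tests on average.
unwanted_resource_types = {0, 1, 3, 10}

# Convert the string ID to int before testing membership.
result = [name for res_id, name, _prefix in resource_types
          if int(res_id) in unwanted_resource_types]
print(result)

# Without int(), the string '0' is compared with the int 0 and nothing matches.
naive = [res[1] for res in resource_types if res[0] in unwanted_resource_types]
print(naive)
```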
The one-liner:
result = list(map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types))
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
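The same dictionary idea can be written so that the lookup table is built only once instead of once per unwanted ID. This is a sketch on an excerpt of the question's data; the name names_by_id is made up for the example.

```python
# Excerpt of the question's data.
resource_types = [('0', 'Group', '0'), ('1', 'User', '1'),
                  ('2', 'Filter', '2'), ('3', 'Agent', '3'),
                  ('10', 'ArchivedReport', ':')]
unwanted_resource_types = [0, 1, 3, 10]

# Build the ID -> name table a single time.
names_by_id = {int(res[0]): res[1] for res in resource_types}

# Look up each unwanted ID, skipping IDs that are not in the table.
result = [names_by_id[i] for i in unwanted_resource_types if i in names_by_id]
print(result)
```

Here the result follows the order of unwanted_resource_types rather than the order of resource_types.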
