Convert List to a String - Python

I am having problems keeping the data in a string format. The data converts to a list once I perform a split on each row (x.split). What do I need to do to keep the data in a string format?
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
document = sc.textFile("/content/sample_data/dr_csv")
print(type(document))
print(document.count())
document.take(5)
document.takeSample(True, 5, 3)
record = document.map(lambda x: x.split(','))
record.take(3)

Splitting does not modify the original string: x.split(',') returns a new list and leaves each row of document untouched, so you can keep the original RDD of strings and build the split records alongside it:
record = document.map(lambda x: x.split(','))
document.take(3)  # the rows here are still plain strings

You can use the .join method if you want to get a string with all of the elements of the list. Suppose you have lst = ['cat', 'dog', 'pet']. Performing " ".join(lst) would return a string with all the elements of lst separated by a space: "cat dog pet".
''.join(str(i) for i in document.flatMap(lambda x: x.split(',')).collect())  # one big string of all fields
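If the goal is to keep each row available both as a plain string and as its split fields, one option is to pair them up. A minimal sketch, assuming the same document RDD of CSV lines as above:
pairs = document.map(lambda x: (x, x.split(',')))  # (original row string, list of fields)
first_string, first_fields = pairs.first()
print(type(first_string), type(first_fields))      # <class 'str'> <class 'list'>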

Related

How to turn a list containing strings into a list containing integers (Python)

I am optimizing PyRay (https://github.com/oscr/PyRay) to be a usable Python ray-casting engine, and I am working on a feature that takes a text file and turns it into a list (which PyRay uses as a map). But when I load the file into a list, the contents are strings, and therefore not usable by PyRay. So my question is: how do I convert a list of strings into a list of integers? Here is my code so far. (I commented out the actual code so I can test this.)
print("What map file to open?")
mapopen = input(">")
mapload = open(mapopen, "r")
worldMap = [line.split(',') for line in mapload.readlines()]
print(worldMap)
The map file:
1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,2,0,0,3,0,0,0,0,0,0,0,2,3,2,3,0,0,2,
2,0,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,2,
2,3,1,0,0,2,0,0,0,2,3,2,0,0,0,0,0,0,0,1,
1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,2,0,0,0,2,
2,0,0,0,0,0,0,0,0,2,0,2,0,0,2,1,0,0,0,1,
1,0,0,0,0,0,0,0,0,1,3,1,0,0,0,0,0,0,0,2,
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,2,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,2,
2,0,3,0,0,2,0,0,0,0,0,0,0,2,3,2,1,2,0,1,
1,0,0,0,0,3,0,0,0,0,0,0,0,1,0,0,2,0,0,2,
2,3,1,0,0,2,0,0,2,1,3,2,0,2,0,0,3,0,3,1,
1,0,0,0,0,0,0,0,0,3,0,0,0,1,0,0,2,0,0,2,
2,0,0,0,0,0,0,0,0,2,0,0,0,2,3,0,1,2,0,1,
1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,3,0,2,
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,1,
2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,
Please help me; I have been searching all over and I can't find anything.
Try this. Did you want a list of lists, or just one big list? Here is the list-of-lists version:
with open(filename, "r") as txtr:
    data = txtr.read()
data = data.split("\n")  # split the file contents into a list of line strings
data = [list(map(int, x.strip(",").split(","))) for x in data if x]  # each non-empty line -> list of ints
The fourth line splits each line string into a list on the commas (dropping the empty piece left by the trailing comma), then applies int() to each element and turns the result into a list. It does this for every non-empty line in data. I hope it helps.
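As a quick illustration of that conversion on a single row of the map file (note the trailing comma, which leaves an empty string that has to be dropped before calling int()):
line = "1,2,1,2,"                           # one row from the map file
fields = [s for s in line.split(",") if s]  # drop the empty string left by the trailing comma
row = list(map(int, fields))                # apply int() to each remaining element
print(row)                                  # [1, 2, 1, 2]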
Here is the version for just one large list:
with open(filename, "r") as txtr:
    data = txtr.readlines()                     # remove empty lines in your file!
data = ",".join(data)                           # turns it into one large string
data = data.split(",")                          # now you have a list of strings
data = [int(x) for x in data if x.strip()]      # applies int() to every non-empty element
Look into the map built-in function in Python.
L = ['1', '2', '3']
ints = map(int, L)  # avoid shadowing the built-in name map
for el in ints:
    print(el)
Output:
1
2
3
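Note that in Python 3, map() returns a lazy iterator rather than a list, so wrap it in list() if you need an actual list of integers:
L = ['1', '2', '3']
ints = list(map(int, L))  # materialize the map object into a list
print(ints)               # [1, 2, 3]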
As per your question, please find below a way you can change a list of strings into a list of integers (or a single integer if you use a list index to get the value). Hope this helps.
myStrList = ["1", "2", "\n", "3"]
myNewIntList = []
for x in myStrList:
    if x != "\n":
        y = int(x)
        myNewIntList.append(y)
print(myNewIntList)
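Putting this together with the original loader, here is a minimal sketch, assuming the map file format shown above (comma-separated rows, each ending with a trailing comma):
print("What map file to open?")
mapopen = input(">")
with open(mapopen, "r") as mapload:
    # Convert each non-blank comma-separated row into a list of ints
    worldMap = [[int(cell) for cell in line.strip().strip(",").split(",")]
                for line in mapload if line.strip()]
print(worldMap[0])  # e.g. [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]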

Extract numeric values from a string for python

I have a string which contains numeric values inside quotes. I need to extract the numeric values and also remove the [ and ].
sample string: texts = ['13007807', '13007779']
texts = ['13007807', '13007779']
texts.replace("'", "")
texts.strip("'")
print(texts)
# this still returns ['13007807', '13007779']
So what I need to extract from the string is:
13007807
13007779
If your texts variable is a string as I understood from your reply, then you can use Regular expressions:
import re
text = "['13007807', '13007779']"
regex = r"\['(\d+)', '(\d+)'\]"
values = re.search(regex, text)
if values:
    value1 = int(values.group(1))
    value2 = int(values.group(2))
output:
value1=13007807
value2=13007779
You can use * unpack operator:
texts = ['13007807', '13007779']
print (*texts)
output:
13007807 13007779
If you have:
data = "['13007807', '13007779']"
print (*eval(data))
output:
13007807 13007779
The easiest way is to use map and wrap it in list:
list(map(int,texts))
Output
[13007807, 13007779]
If your input data is of format data = "['13007807', '13007779']" then
import re
data = "['13007807', '13007779']"
list(map(int, re.findall(r'\d+', data)))
or
list(map(int, eval(data)))

Python: How to split string in dictionary

I have the following JSON Data:
json_data = {"window_string": "X=-10 H=30 Y=20 W=40"}
How would I split the values into a list similar to this:
window_string = ["X = -10", "Y = 20", "W=40"]
json_data = {"window_string": "X=-10 H=30 Y=20 W=40"}
print(json_data["window_string"].split()) #Use str.split()
Output:
['X=-10', 'H=30', 'Y=20', 'W=40']
The split() function takes a string and splits it into a list of strings, where every item in the resulting list is a word from the original string. Since json_data['window_string'] contains 4 space-separated words, each word becomes one item in the output list, so it works just fine:
json_data = {'window_string': 'X=-10 H=30 Y=20 W=40'}
window_string = json_data['window_string'].split()
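If the numeric values themselves are needed afterwards, a small follow-up sketch that splits each 'X=-10' pair on '=' and builds a dict (an extra step beyond the original answer):
json_data = {"window_string": "X=-10 H=30 Y=20 W=40"}
pairs = json_data["window_string"].split()                      # ['X=-10', 'H=30', 'Y=20', 'W=40']
window = {k: int(v) for k, v in (p.split("=") for p in pairs)}  # {'X': -10, 'H': 30, 'Y': 20, 'W': 40}
print(window)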

Creating RDD from input data with repeated delimiters - Spark

I have input data as key:value pairs with pipe delimiters, as below; some of the values contain delimiters within their fields.
key1:value1|key2:val:ue2|key3:valu||e3
key1:value4|key2:value5|key3:value6
Expected output is below.
value1|val:ue2|valu||e3
value4|value5|value6
I tried the following to create an RDD:
rdd=sc.textFile("path").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
The above mapping works when the input value fields don't contain these delimiters, as below:
key1:value1|key2:value2|key3:value3
key1:value4|key2:value5|key3:value6
I also tried a regex as below:
rdd=sc.textFile("path").map(lambda l: [x.split(":")[1] for x in l.split("((?<!\|)\|(?!\|))")]).map(tuple)
Input data without delimiters
key1:value1|key2:value2|key3:value3
key1:value4|key2:value5|key3:value6
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
>>> rdd.collect()
[(u'value1', u'value2', u'value3'), (u'value4', u'value5', u'value6')]
Input data with delimiters
key1:value1|key2:val:ue2|key3:valu||e3
key1:value4|key2:value5|key3:value6
Without regex
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
>>> rdd.collect()
Error: IndexError: list index out of range
with regex
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("((?<!\|)\|(?!\|))")]).map(tuple)
>>> rdd.collect()
[(u'value1|key2',), (u'value4|key2',)]
How can I achieve the result below from the input?
[(u'value1', u'val:ue2', u'valu||e3'), (u'value4', u'value5', u'value6')]
From this I will create a dataframe and do some processing.
Any suggestions using pure Python are also welcome. Thanks in advance!
Here is the solution:
The main issue is that l.split() works with a fixed delimiter only; to split on a regex, use re.split() instead.
import re
rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1:] for x in re.split(r"(?<!\|)\|(?!\|)", l)]).map(tuple)
>>> rdd.collect()
[([u'value1'], [u'val', u'ue2'], [u'valu||e3']), ([u'value4'], [u'value5'], [u'value6'])]
The following RDD concatenates the pieces inside each inner list back together, joining with ':' since that is the character that was split away:
>>> rdd2=rdd.map(lambda l: [':'.join(x) for x in l]).map(tuple)
>>> rdd2.collect()
[(u'value1', u'val:ue2', u'valu||e3'), (u'value4', u'value5', u'value6')]
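Since pure-Python suggestions were also welcome, here is a sketch of a one-pass variant (assuming the same input lines) that avoids the rejoin step by splitting each pair on the first ':' only; the same function could be passed to sc.textFile("path").map(...):
import re

def parse_line(line):
    # Split on single '|' characters only (a '||' inside a value is left alone),
    # then keep everything after the first ':' of each key:value pair.
    return tuple(field.split(":", 1)[1] for field in re.split(r"(?<!\|)\|(?!\|)", line))

lines = ["key1:value1|key2:val:ue2|key3:valu||e3",
         "key1:value4|key2:value5|key3:value6"]
print([parse_line(l) for l in lines])
# [('value1', 'val:ue2', 'valu||e3'), ('value4', 'value5', 'value6')]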

Remove string quotes from array in Python

I'm trying to get rid of some characters in my array so that I'm just left with the x and y coordinates, separated by a comma, as follows:
[[316705.77017187304,790526.7469308273]
[321731.20991025254,790958.3493565321]]
I have used zip() to create a tuple of the x and y values (as pairs from a list of strings), which I've then converted to an array using numpy. The array currently looks like this:
[['316705.77017187304,' '790526.7469308273,']
['321731.20991025254,' '790958.3493565321,']]
I need the output to be an array.
I'm pretty stumped about how to get rid of the single quotes and the second comma. I have read that map() can convert strings to numbers, but I can't get it to work.
Thanks in advance
Using the ast module (Abstract Syntax Trees):
import ast
xll = [['321731.20991025254,' '790958.3493565321,'], ['321731.20991025254,' '790958.3493565321,']]
>>> [ast.literal_eval(xl[0]) for xl in xll]
[(321731.20991025254, 790958.3493565321), (321731.20991025254, 790958.3493565321)]
The above gives a list of tuples for the list of lists; to get a list of lists instead, type the following:
>>> [list(ast.literal_eval(xl[0])) for xl in xll]
[[321731.20991025254, 790958.3493565321], [321731.20991025254, 790958.3493565321]]
OLD: I think this works if the strings are already clean (no trailing commas):
>>> sll
[['316705.770172', '790526.746931'], ['321731.20991', '790958.349357']]
>>> fll = [[float(i) for i in l] for l in sll]
>>> fll
[[316705.770172, 790526.746931], [321731.20991, 790958.349357]]
>>>
Old edit, splitting on the commas and dropping the empty pieces:
>>> xll = [['321731.20991025254,' '790958.3493565321,'], ['321731.20991025254,' '790958.3493565321,']]
>>> [[float(s) for s in xl[0].split(',') if s.strip() != ''] for xl in xll]
[[321731.20991025254, 790958.3493565321], [321731.20991025254, 790958.3493565321]]
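If a NumPy array of floats is the end goal, as the question suggests, here is a minimal sketch building on the literal_eval approach above:
import ast
import numpy as np

xll = [['316705.77017187304,' '790526.7469308273,'],
       ['321731.20991025254,' '790958.3493565321,']]
coords = np.array([ast.literal_eval(xl[0]) for xl in xll], dtype=float)
print(coords.shape)  # (2, 2): rows of x/y coordinates as floats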
