Creating an RDD from input data with repeated delimiters - Spark - Python

I have input data as pipe-delimited key:value pairs, as below; some of the values contain the delimiters within their fields.
key1:value1|key2:val:ue2|key3:valu||e3
key1:value4|key2:value5|key3:value6
Expected output is below.
value1|val:ue2|valu||e3
value4|value5|value6
I tried the following to create the RDD:
rdd=sc.textFile("path").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
The above mapping works when the input value fields do not contain these delimiters, as below.
key1:value1|key2:value2|key3:value3
key1:value4|key2:value5|key3:value6
I also tried a regex, as below:
rdd=sc.textFile("path").map(lambda l: [x.split(":")[1] for x in l.split("((?<!\|)\|(?!\|))")]).map(tuple)
Input data without delimiters in the values
key1:value1|key2:value2|key3:value3
key1:value4|key2:value5|key3:value6
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
>>> rdd.collect()
[(u'value1', u'value2', u'value3'), (u'value4', u'value5', u'value6')]
Input data with delimiters in the values
key1:value1|key2:val:ue2|key3:valu||e3
key1:value4|key2:value5|key3:value6
Without regex
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("|")]).map(tuple)
>>> rdd.collect()
Error: IndexError: list index out of range
With regex
>>> rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1] for x in l.split("((?<!\|)\|(?!\|))")]).map(tuple)
>>> rdd.collect()
[(u'value1|key2',), (u'value4|key2',)]
How can I achieve the below result from the input?
[(u'value1', u'val:ue2', u'valu||e3'), (u'value4', u'value5', u'value6')]
From this I will create a DataFrame and do some processing. Suggestions in pure Python are also welcome. Thanks in advance!

Here is the solution:
The main issue is that l.split() treats its argument as a literal delimiter, not a regular expression; for a regex split you need re.split(). The capturing parentheses must also go: with a group, re.split() returns the matched '|' separators in the result as well.
import re
rdd=sc.textFile("testcwp").map(lambda l: [x.split(":")[1:] for x in re.split(r"(?<!\|)\|(?!\|)", l)]).map(tuple)
>>> rdd.collect()
[([u'value1'], [u'val', u'ue2'], [u'valu||e3']), ([u'value4'], [u'value5'], [u'value6'])]
The following RDD joins the colon-split pieces of each value back together. Note the join must use ':' (not '|'), otherwise values like val:ue2 come out wrong:
>>> rdd2=rdd.map(lambda l: [':'.join(x) for x in l]).map(tuple)
>>> rdd2.collect()
[(u'value1', u'val:ue2', u'valu||e3'), (u'value4', u'value5', u'value6')]
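Since pure-Python suggestions were welcome, here is a minimal sketch of the same logic without Spark (the input lines are inlined for illustration). Splitting each pair with split(':', 1) keeps everything after the first colon, which avoids the join step entirely:
import re

# Split only on single pipes: a '|' neither preceded nor followed by another '|'.
PAIR_SEP = re.compile(r'(?<!\|)\|(?!\|)')

def parse_line(line):
    # 'key:val:ue' -> keep everything after the first ':'
    return tuple(pair.split(':', 1)[1] for pair in PAIR_SEP.split(line))

lines = [
    'key1:value1|key2:val:ue2|key3:valu||e3',
    'key1:value4|key2:value5|key3:value6',
]
print([parse_line(l) for l in lines])
# [('value1', 'val:ue2', 'valu||e3'), ('value4', 'value5', 'value6')]
The same split(':', 1)[1] trick also works inside the Spark lambda, in place of the [1:] slice and the later join.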

Related

Convert List to a String

I am having problems keeping the data in string format. The data converts to a list once I perform a split on each row (x.split). What do I need to do to keep the data in string format?
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
document = sc.textFile("/content/sample_data/dr_csv")
print type(document)
print document.count()
document.take(5)
document.takeSample(True, 5, 3)
record = document.map(lambda x: x.split(','))
record.take(3)
Note that split does not modify the string it is called on; it returns a new list, so the original lines in document are unaffected and there is no need to copy x:
record = document.map(lambda x: x.split(','))
You can use the .join method if you want to get a string with all of the elements of the list. Suppose you have lst = ['cat', 'dog', 'pet']. Performing " ".join(lst) would return a string with all the elements of lst separated by a space: "cat dog pet".
record.map(lambda fields: ' '.join(fields)).take(3)
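For intuition, here is a minimal stand-alone round trip between a string and a list (no Spark involved):
line = '1239,2249.00,1,3,2011-02-20'
fields = line.split(',')       # a new list; line itself is unchanged
restored = ','.join(fields)    # back to a single string
assert restored == line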

List of List to Key-Value Pairs

I have a string which is semicolon delimited and then space delimited:
'gene_id EFNB2; Gene_type cDNA_supported; transcript_id EFNB2.aAug10; product_id EFNB2.aAug10;'
I want to create a dictionary in one line by splitting based on the delimiters but so far I can only get to a list of lists:
filter(None,[x.split() for x in atts.split(';')])
Which gives me:
[['gene_id', 'EFNB2'], ['Gene_type', 'cDNA_supported'], ['transcript_id', 'EFNB2.aAug10'], ['product_id', 'EFNB2.aAug10']]
When what I want is:
{'gene_id': 'EFNB2', 'Gene_type': 'cDNA_supported', 'transcript_id': 'EFNB2.aAug10', 'product_id': 'EFNB2.aAug10'}
I have tried:
filter(None,{k:v for k,v in x.split() for x in atts.split(';')})
but it gives me nothing. Anybody know how to accomplish this?
You are very close; you can just call dict on your list of lists:
>>> lst = [['gene_id', 'EFNB2'], ['Gene_type', 'cDNA_supported'], ['transcript_id', 'EFNB2.aAug10'], ['product_id', 'EFNB2.aAug10']]
>>> dict(lst)
{'Gene_type': 'cDNA_supported',
'gene_id': 'EFNB2',
'product_id': 'EFNB2.aAug10',
'transcript_id': 'EFNB2.aAug10'}
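And to get it in one line as originally asked, dict can consume the split pairs directly; a sketch using the atts string from the question:
atts = 'gene_id EFNB2; Gene_type cDNA_supported; transcript_id EFNB2.aAug10; product_id EFNB2.aAug10;'
result = dict(x.split() for x in atts.split(';') if x.strip())
# {'gene_id': 'EFNB2', 'Gene_type': 'cDNA_supported', 'transcript_id': 'EFNB2.aAug10', 'product_id': 'EFNB2.aAug10'}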

How to convert column data from text file into list in Python?

I have a text file that contains data in two columns, each of which I want to read into Python as a list.
For example :
2500.3410 -0.60960758
2505.5803 -1.3031826
2510.8197 -0.64067196
2516.0593 -1.0230898
2521.2991 -0.20078891
I want to create two lists, one containing the data from column 1 and the other column 2, but I don't know how to tell Python to do this.
E.g.
list1 = [2500.3410, 2505.5803, 2510.8197, 2516.0593, 2521.2991]
I have opened the file in the shell and can read in the data, as above, but I'm stuck when it comes to creating the lists.
First of all, you have to read the text file in Python. For instance, if you have a file named record.txt containing the dataset:
file = open('record.txt')
Now read the file line by line, storing the whole file as a list of lists in which each inner list represents one row:
lst = []
for line in file:
    lst.append([float(x) for x in line.split()])
Now you can extract column 1 and column 2 as lists with the following comprehensions:
column1 = [ x[0] for x in lst]
column2 = [ x[1] for x in lst]
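A minor variant of the same idea, using a with block so the file is closed automatically (assuming the same record.txt):
column1, column2 = [], []
with open('record.txt') as f:
    for line in f:
        a, b = (float(x) for x in line.split())
        column1.append(a)
        column2.append(b)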
You can use the zip function with float inside map:
zip(*[map(float,line.split()) for line in open('in_file')])
Demo:
>>> s=""" 2500.3410 -0.60960758
... 2505.5803 -1.3031826
... 2510.8197 -0.64067196
... 2516.0593 -1.0230898
... 2521.2991 -0.20078891"""
>>>
>>> [i.split() for i in s.split('\n')]
[['2500.3410', '-0.60960758'], ['2505.5803', '-1.3031826'], ['2510.8197', '-0.64067196'], ['2516.0593', '-1.0230898'], ['2521.2991', '-0.20078891']]
>>> zip(*[map(float,i.split()) for i in s.split('\n')])
[(2500.341, 2505.5803, 2510.8197, 2516.0593, 2521.2991), (-0.60960758, -1.3031826, -0.64067196, -1.0230898, -0.20078891)]
But note that since zip returns a list of tuples, you can use map to convert the result to lists:
>>> map(list,zip(*[map(float,i.split()) for i in s.split('\n')]))
[[2500.341, 2505.5803, 2510.8197, 2516.0593, 2521.2991], [-0.60960758, -1.3031826, -0.64067196, -1.0230898, -0.20078891]]
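Note that on Python 3, zip and map return lazy iterators rather than lists, so the same one-liner needs explicit list calls; a sketch with the same in_file placeholder:
with open('in_file') as f:
    column1, column2 = [list(col) for col in zip(*(map(float, line.split()) for line in f))]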

Remove string quotes from array in Python

I'm trying to get rid of some characters in my array so I'm just left with the x and y coordinates, separated by a comma as follows:
[[316705.77017187304,790526.7469308273]
[321731.20991025254,790958.3493565321]]
I have used zip() to create a tuple of the x and y values (as pairs from a list of strings), which I've then converted to an array using numpy. The array currently looks like this:
[['316705.77017187304,' '790526.7469308273,']
['321731.20991025254,' '790958.3493565321,']]
I need the output to be an array.
I'm pretty stumped about how to get rid of the single quotes and the second comma. I have read that map() can convert strings to numeric values but I can't get it to work.
Thanks in advance
Using the ast module (Abstract Syntax Trees):
import ast
xll = [['321731.20991025254,' '790958.3493565321,'], ['321731.20991025254,' '790958.3493565321,']]
>>> [ast.literal_eval(xl[0]) for xl in xll]
[(321731.20991025254, 790958.3493565321), (321731.20991025254, 790958.3493565321)]
The above gives a list of tuples for the list of lists; for lists instead, type the following:
>>> [list(ast.literal_eval(xl[0])) for xl in xll]
[[321731.20991025254, 790958.3493565321], [321731.20991025254, 790958.3493565321]]
An older approach, for when the strings are already clean:
>>> sll
[['316705.770172', '790526.746931'], ['321731.20991', '790958.349357']]
>>> fll = [[float(i) for i in l] for l in sll]
>>> fll
[[316705.770172, 790526.746931], [321731.20991, 790958.349357]]
>>>
An earlier variant, splitting on the commas and filtering out the empty strings:
>>> xll = [['321731.20991025254,' '790958.3493565321,'], ['321731.20991025254,' '790958.3493565321,']]
>>> [[float(s) for s in xl[0].split(',') if s.strip() != ''] for xl in xll]
[[321731.20991025254, 790958.3493565321], [321731.20991025254, 790958.3493565321]]
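Since the asker ultimately wants a numeric array, here is a sketch of going straight from the quoted strings to a numpy array (assuming numpy is installed); the if s test drops the empty string left by each trailing comma:
import numpy as np

xll = [['316705.77017187304,' '790526.7469308273,'],
       ['321731.20991025254,' '790958.3493565321,']]
arr = np.array([[float(s) for s in row[0].split(',') if s] for row in xll])
# array([[316705.77017187, 790526.74693083],
#        [321731.20991025, 790958.34935653]])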

Parse CSV to lists in Python

I have a CSV file whose data I want to parse into lists, so I am using the Python csv module to read it. Basically the following:
import csv
fin = csv.reader(open(path,'rb'), delimiter=' ', quotechar='|')
row = next(fin)
print row
#gives the following
['"1239","2249.00","1","3","2011-02-20"']
#let's say I do the following
ele = str(row)
ele = ele.strip().split(',')
print ele
#gives me the following
['[\'"1239"', '"2249.00"', '"1"', '"3"', '"2011-02-20"\']']
Now ele[0] gives me the output ['"1239"
How do I get rid of that [' ?
In the end, I want to get 1239 and convert it to an integer. Any clues why this is happening?
Thanks
Edit: never mind, resolved thanks to the first comment.
Change your delimiter to ',' and you will get a list of those values from the csv reader.
It's because you are converting a list to a string; there is no need to do this. Grab the first element of the list (in this case it is a string) and parse that:
>>> a = ['"1239","2249.00","1","3","2011-02-20"']
>>> a
['"1239","2249.00","1","3","2011-02-20"']
>>> a[0]
'"1239","2249.00","1","3","2011-02-20"'
>>> b = a[0].replace('"', '').split(',')
>>> b[-1]
'2011-02-20'
Of course, before you call the replace and split string methods you should check that the value is a string, or handle the exception if it isn't.
Also, Blahdiblah is correct: your delimiter is probably wrong.
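Putting both fixes together, a minimal sketch of reading this file with the right delimiter and quote character and converting the first field to an integer (data.csv is a placeholder name):
import csv

with open('data.csv') as f:               # on Python 2, use open('data.csv', 'rb')
    fin = csv.reader(f, delimiter=',', quotechar='"')
    for row in fin:
        print(int(row[0]))                # '1239' -> 1239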
