I have a spark dataframe which has a column 'X'.The column contains elements which are in the form:
u'[23,4,77,890,455,................]'
. How can I convert this unicode to list.That is my output should be
[23,4,77,890,455...................]
. I have apply it for each element in the 'X' column.
I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error
"Malformed String"
I also tried
df.withColumn("X_new", json.loads(x)) and got the error "Expected
String or Buffer"
and
df.withColumn("X_new", json.dumps(x)) which says JSON not
serialisable.
and also
df_2 = df.rdd.map(lambda x: x.encode('utf-8')) which says rdd has no
attribute encode.
I dont want to use collect and toPandas() because its memory consuming.(But if thats the only way please do tell).I am using Pyspark
Update: cph_sto gave the answer using UDF.Though it worked well,I find that it is Slow.Can Somebody suggest any other method?
import ast
from pyspark.sql.functions import udf
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|[23,4,77,890.455]| 10|
| [11,2,50,1.11]| 20|
| [10.05,1,22.04]| 30|
+-----------------+---+
# Creating a UDF to convert the string list to proper list
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
| list| A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
| [11, 2, 50, 1.11]| 20|
| [10.05, 1, 22.04]| 30|
+--------------------+---+
Extension of the Q, as asked by OP -
# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
| list| A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10| 4|
| [11, 2, 50, 1.11]| 20| 4|
| [10.05, 1, 22.04]| 30| 3|
+--------------------+---+-----------+
Since it's a string, you could remove the first and last characters:
From '[23,4,77,890,455]' to '23,4,77,890,455'
Then apply the split() function to generate an array, taking , as the delimiter.
Please use the below code to ignore unicode
df.rdd.map(lambda x: x.encode("ascii","ignore"))
Related
I have this dataframe :
cSchema = StructType([StructField("id1", StringType()), StructField("id2", StringType()), StructField("params", StringType())\
,StructField("Col2", IntegerType())])
test_list = [[1, 2, '{"param1": "val1", "param2": "val2"}', 1], [1, 3, '{"param1": "val4", "param2": "val5"}', 3]]
df = spark.createDataFrame(test_list,schema=cSchema)
+---+---+--------------------+----+
|id1|id2| params|Col2|
+---+---+--------------------+----+
| 1| 2|{"param1": "val1"...| 1|
| 1| 3|{"param1": "val4"...| 3|
+---+---+--------------------+----+
I want to explode params into columns :
+---+---+----+------+------+
|id1|id2|Col2|param1|param2|
+---+---+----+------+------+
| 1| 2| 1| val1| val2|
| 1| 3| 3| val4| val5|
+---+---+----+------+------+
So I coded this :
schema2 = StructType([StructField("param1", StringType()), StructField("param2", StringType())])
df.withColumn(
"params", from_json("params", schema2)
).select(
col('id1'), col('id2'),col('Col2'), col('params.*')
).show()
The problem is that params schema is dynamic (variable schema2), he may change from one execution to another, so I need to infer the schema dynamically (It's ok to have all columns with String Type)... And I can't figure out of a way to do this..
Can anyone help me up with that please ?
In Pyspark the syntax should be:
import pyspark.sql.functions as F
schema = F.schema_of_json(df.select('params').head()[0])
df2 = df.withColumn(
"params", F.from_json("params", schema)
).select(
'id1', 'id2', 'Col2', 'params.*'
)
df2.show()
+---+---+----+------+------+
|id1|id2|Col2|param1|param2|
+---+---+----+------+------+
| 1| 2| 1| val1| val2|
| 1| 3| 3| val4| val5|
+---+---+----+------+------+
Here is how you can do it, hope you can change it to python
Get the schema dynamically with schema_of_json from the value and use from_json to read.
val schema = schema_of_json(df.first().getAs[String]("params"))
df.withColumn("params", from_json($"params", schema))
.select("id1", "id2", "Col2", "params.*")
.show(false)
If you want to get a larger sample of data to compare, you can read the params field into a list, convert that to an RDD, then read using "spark.read.json()"
params_list = df.select("params").rdd.flatMap(lambda x: x).collect()
params_rdd = sc.parallelize(params_list)
spark.read.json(params_rdd).schema
Caveat here being that you probably don't want to load too much data, as it's all being stuffed into local variables. Try taking the top 1000 or whatever an appropriate sample size may be.
I am trying to extract regex patterns from a column using PySpark. I have a data frame which contains the regex patterns and then a table which contains the strings I'd like to match.
columns = ['id', 'text']
vals = [
(1, 'here is a Match1'),
(2, 'Do not match'),
(3, 'Match2 is another example'),
(4, 'Do not match'),
(5, 'here is a Match1')
]
df_to_extract = sql.createDataFrame(vals, columns)
columns = ['id', 'Regex', 'Replacement']
vals = [
(1, 'Match1', 'Found1'),
(2, 'Match2', 'Found2'),
]
df_regex = sql.createDataFrame(vals, columns)
I'd like to match the 'Regex' column within the 'text' column of 'df_to_extract'. I'd like to extract the terms against each id with the resulting table containing the id and 'replacement' which corresponds to the 'Regex'. For example:
+---+------------+
| id| replacement|
+---+------------+
| 1| Found1|
| 3| Found2|
| 5| Found1|
+---+------------+
Thanks!
One way is to use a pyspark.sql.functions.expr, which allows you to use a column value as a parameter, in your join condition.
For example:
from pyspark.sql.functions import expr
df_to_extract.alias("e")\
.join(
df_regex.alias("r"),
on=expr(r"e.text LIKE concat('%', r.Regex, '%')"),
how="inner"
)\
.select("e.id", "r.Replacement")\
.show()
#+---+-----------+
#| id|Replacement|
#+---+-----------+
#| 1| Found1|
#| 3| Found2|
#| 5| Found1|
#+---+-----------+
Here I used the sql expression:
e.text LIKE concat('%', r.Regex, '%')
Which will join all rows where the text column is like the Regex column with the % acting as wildcards to capture anything before and after.
I want to convert my list of dictionaries into DataFrame. This is the list:
mylist =
[
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
This is my code:
from pyspark.sql.types import StringType
df = spark.createDataFrame(mylist, StringType())
df.show(2,False)
+-----------------------------------------+
| value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_id=xxx}|
|{type_activity_id=2,type_activity_id=yyy}|
|{type_activity_id=3,type_activity_id=zzz}|
+-----------------------------------------+
I assume that I should provide some mapping and types for each column, but I don't know how to do it.
Update:
I also tried this:
schema = ArrayType(
StructType([StructField("type_activity_id", IntegerType()),
StructField("type_activity_name", StringType())
]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))
But then I get null values:
+-----+
|value|
+-----+
| null|
| null|
+-----+
In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
# warnings.warn("inferring schema from dict is deprecated,"
As this warning message says, you should use pyspark.sql.Row instead.
from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1 |xxx |
#|2 |yyy |
#|3 |zzz |
#+----------------+------------------+
Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.
You can do it like this. You will get a dataframe with 2 columns.
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)
Output :
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
in Spark Version 2.4 it is possible to do directly with
df=spark.createDataFrame(mylist)
>>> mylist = [
... {"type_activity_id":1,"type_activity_name":"xxx"},
... {"type_activity_id":2,"type_activity_name":"yyy"},
... {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
I was also facing the same issue when creating dataframe from list of dictionaries.
I have resolved this using namedtuple.
Below is my code using data provided.
from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])
for my_dict in mylist:
namedtupleobj = ExampleTuple(**my_dict)
final_list.append(namedtupleobj)
sqlContext.createDataFrame(final_list).show(truncate=False)
output
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1 |xxx |
|2 |yyy |
|3 |zzz |
+----------------+------------------+
my version informations are as follows
spark: 2.4.0
python: 3.6
It is not necessary to have my_list variable. since it was available I have used it to create namedtuple object otherwise directly namedtuple object can be created.
This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 4 years ago.
I have edited this question to provide an example -
I have a list of columns names :
colnames = ['col1','col2','col3']
I need to pass these to a Dataframe function one after another to return values for each. I would not use the groupBy function, so this is not a duplicate of the other question. I just need the max, min, sum of all values of each column in my Dataframe.
Code example -
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local[2]", "Count App")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
[(1, 100, 200), (100, 200, 100), (100, 200, 100), (-100, 50, 200)],
("col1", "col2", "col3"))
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 100| 200|
| 100| 200| 100|
| 100| 200| 100|
|-100| 50| 200|
+----+----+----+
colnames = ['col1','col2','col3']
maxval = map(lambda x: df.agg(sparkMax(df[x]).alias('max_of_{}'.format(x))), colnames)
## This gives me a list of Dataframes, NOT a single Dataframe as required
for x in maxval:
print (x.show())
+-----------+
|max_of_col1|
+-----------+
| 100|
+-----------+
None
+-----------+
|max_of_col2|
+-----------+
| 200|
+-----------+
None
+-----------+
|max_of_col3|
+-----------+
| 200|
+-----------+
How do I get a single Dataframe back from my lambda function, instead of a List of Dataframes. Looking like this -
+----------------+
|Column_name| Max|
+-----------+----+
|max_of_col1| 100|
+-----------+----+
|max_of_col2| 200|
+-----------+----+
|max_of_col3| 300|
+-----------+----+
I'm guessing something like a flatMap?
Appreciated.
Map function in Python takes 2 args, first one being a function and second being a iterable.
newdf = map(lambda x: len(x), colnames)
This might be helpful - http://book.pythontips.com/en/latest/map_filter.html
df.x will not work. df is an object and you're accessing an attribute of the df object called x.
Have a look at the following example.
obj = type("MyObj", (object,), {'name':1})
a = obj()
print a.name
Above example will print the value of the attribute name as 1.
However if I try to do,
obj = type("MyObj", (object,), {'name':1})
a = obj()
var = 'name'
print a.var
this is going to give me a AttributeError as the object a does not have an attribute called var.
I have a Dataframe with two columns: BrandWatchErwaehnungID and word_counts.
The word_counts column is the output of `CountVectorizer (a sparse vector). After dropped the empty rows I have created two new columns one with the indices of the sparse vector and one with their values.
help0 = countedwords_text['BrandWatchErwaehnungID','word_counts'].rdd\
.filter(lambda x : x[1].indices.size!=0)\
.map(lambda x : (x[0],x[1],DenseVector(x[1].indices) , DenseVector(x[1].values))).toDF()\
.withColumnRenamed("_1", "BrandWatchErwaenungID").withColumnRenamed("_2", "word_counts")\
.withColumnRenamed("_3", "word_indices").withColumnRenamed("_4", "single_word_counts")
I needed to convert them to dense vectors before adding to my Dataframe due to spark did not accept numpy.ndarray. My problem is that I now want to explode that Dataframeon the word_indices column but the explode method from pyspark.sql.functions does only support arrays or map as input.
I have tried:
help1 = help0.withColumn('b' , explode(help0.word_indices))
and get the following error:
cannot resolve 'explode(`word_indices')' due to data type mismatch: input to function explode should be array or map type
Afterwards I tried:
help1 = help0.withColumn('b' , explode(help0.word_indices.toArray()))
Which also did not worked...
Any suggestions?
You have to use udf:
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import *
from pyspark.ml.linalg import *
#udf("array<integer>")
def indices(v):
if isinstance(v, DenseVector):
return list(range(len(v)))
if isinstance(v, SparseVector):
return v.indices.tolist()
df = spark.createDataFrame([
(1, DenseVector([1, 2, 3])), (2, SparseVector(5, {4: 42}))],
("id", "v"))
df.select("id", explode(indices("v"))).show()
# +---+---+
# | id|col|
# +---+---+
# | 1| 0|
# | 1| 1|
# | 1| 2|
# | 2| 4|
# +---+---+