I have a PySpark dataframe which looks like this:

RowNumber | value
1         | [{mail=abc#xyz.com, Name=abc}, {mail=mnc#xyz.com, Name=mnc}]
2         | [{mail=klo#xyz.com, Name=klo}, {mail=mmm#xyz.com, Name=mmm}]
The column "value" is of string type.
root
 |-- value: string (nullable = false)
 |-- rowNumber: integer (nullable = false)
Step 1: I need to explode the dictionaries inside the list on each row of column "value", so that each row holds a single dictionary.
Step 2: Then further explode the column so that each key becomes a separate column in the resultant table.
However, when I try to get to Step 1 using:
df.select(explode(col('value')).alias('value'))
it shows me this error:
Analysis Exception: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string
How do I convert the string in column 'value' to a compatible data type, so that I can explode the dictionary elements as a valid array/json (step 1) and then into separate columns (step 2)? Please help.
EDIT: There may be a simpler way to do this with the from_json function and StructTypes; for that you can check out this link. A sketch of that route also appears at the end of this answer.
The ast way
To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we wrap this function in a PySpark UDF, the following code will suffice:
from pyspark.sql.functions import udf, col, explode, map_values
from pyspark.sql.types import ArrayType, MapType, StringType
from ast import literal_eval

literal_eval_udf = udf(literal_eval, ArrayType(MapType(StringType(), StringType())))
table = table.withColumn("value", literal_eval_udf(col("value"))) # Make strings into ArrayTypes of MapTypes
table = table.withColumn("value", explode(col("value"))) # Explode ArrayTypes such that each row contains a MapType
After applying these functions to the table, what should remain is what you originally referred to as the start of "step 2." From here, we want to split each "value" column key into a column with entries from the corresponding value. This is accomplished with another function application which gives us the dict values:
table = table.withColumn("value", map_values(col("value")))
Now the values column contains an ArrayType of the values contained in each dictionary. To make a separate column for each of these, we simply add them in a loop:
keys = ['mail', 'Name']
for k in range(len(keys)):
    table = table.withColumn(keys[k], table.value[k])
Then you can drop the original value column, since you won't need it anymore: you will now have the columns mail and Name with the information from the corresponding maps.
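For the from_json route mentioned in the edit above, here is a minimal sketch, assuming the strings are valid JSON (the mail=... form shown in the question would first need its keys quoted and the = separators rewritten as :):

from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Schema for a list of {mail, Name} objects
schema = ArrayType(StructType([
    StructField("mail", StringType()),
    StructField("Name", StringType()),
]))

table = table.withColumn("value", from_json(col("value"), schema))       # string -> array of structs
table = table.select("rowNumber", explode(col("value")).alias("value"))  # one struct per row
table = table.select("rowNumber", col("value.mail").alias("mail"), col("value.Name").alias("Name"))

This keeps the parsing inside the JVM instead of going through a Python UDF, which is usually faster.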
Related
I am using PySpark and I need to process log files that are appended into a single data frame. Most of the columns look normal, but one of the columns contains a JSON string in {}. Basically, each row is an individual event, and to the JSON string I can apply an individual schema. But I don't know the best way to process the data here.
Example:
This table will later help me aggregate the events the way I need.
I tried to use function withColumn and use from_json. It worked successfully for a single column:
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = (df
    .withColumn("nested_json",
        F.when(F.col("event_name") == "EventStart",
               F.from_json("json_string", "Name String, Version Int, Id Int"))))
It did what I want for my first row when I query nested_json. But it applied the schema to the whole column, and I would like to process each row depending on its event_name.
I was naive and tried to do this:
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = (df
    .withColumn("nested_json",
        F.when(F.col("event_name") == "EventStart",
               F.from_json("json_string", "Name String, Version Int, Id Int"))
        F.when(F.col("event_name") == "Action1",
               F.from_json("json_string", "Name String, Version Int, UserName String, PosX int, PosY int"))
    )
)
And this failed to run with: when() can only be applied on a Column previously generated by when() function.
I assume my first withColumn applied the schema to the whole column.
What other options do I have to apply a JSON schema based on the event_name value and flatten the values?
What if you chain your when statements?
For example,
df.withColumn("nested_json", F.when(F.col("event_name") =="EventStart",F.from_json(...)).when(F.col("event_name") == "Action1", F. from_json(...)))
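One caveat: Spark requires all branches of a single when() chain to share one data type, and the two schemas here produce different structs. If the chained version trips over that, a variant that parses each event type into its own column should work; a minimal sketch, reusing the schemas from the question:

import pyspark.sql.functions as F

df = (df
    .withColumn("start_json",
        F.when(F.col("event_name") == "EventStart",
               F.from_json("json_string", "Name String, Version Int, Id Int")))
    .withColumn("action1_json",
        F.when(F.col("event_name") == "Action1",
               F.from_json("json_string", "Name String, Version Int, UserName String, PosX int, PosY int"))))

Rows that don't match a branch get null in that column, so each event type can be flattened from its own column afterwards.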
I tried to select a subset of the object-type column cells with str.split(pat="'"):
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing JSON strings (or any other strings that need to be parsed into multiple columns)
E.g.
import pandas as pd

df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json

def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use the pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
       0           1
0  40092  2017-11-06
1  39097  2017-10-31
If you need just one column, you can return a single element from parse_row (instead of the two-element tuple in the example above) and just use df.apply(parse_row, axis=1).
If the values are not in JSON format, just modify parse_row accordingly (split, convert strings to numbers, etc.)
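To get the ID and DATE column names the question asks for (assuming col1 is the ID), you can then rename the expanded result:

df1.columns = ['ID', 'DATE']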
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
And afterwards creating a dataframe from that list and concatenating it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
I want to create a dictionary from a dataframe in python.
In this dataframe, one column contains all the keys and another column contains multiple values for each key.
DATAKEY DATAKEYVALUE
name mayank,deepak,naveen,rajni
empid 1,2,3,4
city delhi,mumbai,pune,noida
I tried this code to first convert it into a simple data frame, but the values are not separating row-wise:
columnnames = finaldata['DATAKEY']
collist = list(columnnames)
dfObj = pd.DataFrame(columns=collist)
collen = len(finaldata['DATAKEY'])
for i in range(collen):
    colname = collist[i]
    keyvalue = finaldata.DATAKEYVALUE[i]
    valuelist2 = keyvalue.split(",")
    dfObj = dfObj.append({colname: valuelist2}, ignore_index=True)
You should modify your question title; it is misleading, because pandas dataframes are "kind of" dictionaries in themselves, which is why the first comment you got pointed to the .to_dict() pandas built-in method.
What you want to do is actually iterate over your pandas dataframe row-wise and, for each row, generate a dictionary key from the first column and a list of values from the second column.
For that you will have to use:
an empty dictionary: dict()
the method for iterating over dataframe rows: dataframe.iterrows()
a method to split a single string of values on a separator, such as the split() method you suggested: str.split().
With all these tools all you have to do is:
output = dict()
for index, row in finaldata.iterrows():
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')
Note that this generates a dictionary whose values are lists of strings, and it will not work if the contents of the 'DATAKEYVALUE' column are not single strings.
Also note that this may not be the most efficient solution if you have a very large dataframe.
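If the dataframe is large, a vectorized sketch using the .to_dict() method mentioned above (assuming the same column names) avoids the explicit Python loop:

output = finaldata.set_index('DATAKEY')['DATAKEYVALUE'].str.split(',').to_dict()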
I have a big dataframe consisting of 144005 rows. One of the columns of the dataframe is a string of dictionaries like
'{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"}'
I want to convert this string into separate dictionaries. I have been using json.loads() for this purpose; however, I have had to iterate over the string of dictionaries one at a time, convert each to a dictionary using json.loads(), turn that into a new dataframe, and keep appending to this dataframe while iterating over the entire original dataframe.
I wanted to know whether there is a more efficient way to do this, as it takes a long time to iterate over an entire dataframe of 144005 rows.
Here is a snippet of what I have been doing:
d1 = df1['attributes'].values
d2 = df1['ID'].values
for i, j in zip(d1, d2):
    data = json.loads(i)
    temp = pd.DataFrame(data, index=[j])
    temp['ID'] = j
    df2 = df2.append(temp, sort=False)
My 'attributes' column consists of a string of dictionaries per row, and the 'ID' column contains the corresponding ID.
Did it myself.
I used map along with a lambda function to apply json.loads() to each row efficiently, then converted the result to a dataframe and stored the output.
Here it is.
l1 = df1['attributes'].values
data = map(lambda x: json.loads(x), l1)
df2 = pd.DataFrame(data)
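If you still need the ID association from the original attempt, re-attaching it afterwards should work, since map preserves row order:

df2['ID'] = df1['ID'].values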
Just check the type of your column by using type()
If the type is Series:
data['your column name'].apply(pd.Series)
then you will see all keys as separate columns in a dataframe, with their corresponding values.
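A minimal sketch of that route on the 'attributes' column from the question (the strings must be parsed first, since apply(pd.Series) expects dicts):

import json
df2 = df1['attributes'].map(json.loads).apply(pd.Series)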
I have just started using Databricks/PySpark. I'm using Python / Spark 2.1. I have uploaded data to a table. This table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could see, from what others were saying, was to convert it to an RDD to apply the mapping function and then back to a dataframe to show the data. But this throws a job aborted stage failure:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table.
For example, append something to each string in the column, or split on a character, and then put that back into a dataframe so I can .show() or display it.
You cannot:
Use flatMap, because it will flatten the Row.
Use append, because:
tuple and Row have no append method
append (if present on a collection) is executed for side effects and returns None
I would use withColumn:
from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
The default return type is StringType, so if you want something else you should adjust it.
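For the concrete examples in the question (appending to each string, splitting on a character), built-in column functions avoid the round trip through an RDD entirely; a minimal sketch, assuming the column is still named _c0:

from pyspark.sql.functions import concat, lit, split, col

df.withColumn("appended", concat(col("_c0"), lit("anything"))).show()  # append to each string
df.withColumn("parts", split(col("_c0"), ",")).show()  # split on a character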