Extracting geo coordinates from a complex nested Twitter JSON, using Python

I am reading multiple complex JSON files and trying to extract geo coordinates.
I cannot attach a file itself right now, but I can show its structure here.
The file has several hundred fields and some objects repeat.
When I read the JSON with Spark in Python, the coordinates show up in the coordinates column, so the data is definitely there.
I am trying to reduce the number of columns and select only the ones I need.
The last two columns are my geo coordinates. I tried both coordinates and geo, and also coordinates.coordinates with geo.coordinates. Neither option works.
df_tweets = tweets.select(['text',
                           'user.name',
                           'user.screen_name',
                           'user.id',
                           'user.location',
                           'place.country',
                           'place.full_name',
                           'place.name',
                           'user.followers_count',
                           'retweet_count',
                           'retweeted',
                           'user.friends_count',
                           'entities.hashtags.text',
                           'created_at',
                           'timestamp_ms',
                           'lang',
                           'coordinates.coordinates',  # or just `coordinates`
                           'geo.coordinates'           # or just `geo`
                           ])
In the first case with coordinates and geo I get the following, printing the schema:
df_tweets.printSchema()
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- geo: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
When I do coordinates.coordinates and geo.coordinates, I get
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
When I convert both DataFrames to pandas, neither of them gives me coordinates; I still get None.
How do I extract the geo coordinates properly?

If I look at my dataframe with tweet data, I see it like this
In [44]: df[df.coordinates.notnull()]['coordinates']
Out[44]:
98 {'type': 'Point', 'coordinates': [-122.32111, ...
99 {'type': 'Point', 'coordinates': [-122.32111, ...
Name: coordinates, dtype: object
So it's a dictionary that has to be parsed:
tweets_coords = df[df.coordinates.notnull()]['coordinates'].tolist()
for coords in tweets_coords:
    print(coords)
    print(coords['coordinates'])
    print(coords['coordinates'][0])
    print(coords['coordinates'][1])
Output:
{'type': 'Point', 'coordinates': [-122.32111, 47.62366]}
[-122.32111, 47.62366]
-122.32111
47.62362
{'type': 'Point', 'coordinates': [-122.32111, 47.62362]}
[-122.32111, 47.62362]
-122.32111
47.62362
You can set up a lambda function in apply() to parse these out row by row; otherwise you can use a list comprehension, using what I've provided as the basis for your analysis.
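For example, a minimal sketch of the apply() idea (the lng/lat column names are just my choice, and Twitter stores points as [longitude, latitude]):
# Pull longitude/latitude into their own columns; rows without a dict stay None.
df['lng'] = df['coordinates'].apply(lambda c: c['coordinates'][0] if isinstance(c, dict) else None)
df['lat'] = df['coordinates'].apply(lambda c: c['coordinates'][1] if isinstance(c, dict) else None)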
All that said, maybe check this first...
Where you are using coordinates.coordinates and geo.coordinates, try coordinates['coordinates'] and geo['coordinates']

Related

How to rename the first level keys of struct with PySpark in Azure Databricks?

I would like to rename the keys of the first level objects inside my payload.
from pyspark.sql.functions import *
ds = {'Fruits': {'apple': {'color': 'red'},'mango': {'color': 'green'}}, 'Vegetables': None}
df = spark.read.json(sc.parallelize([ds]))
df.printSchema()
"""
root
|-- Fruits: struct (nullable = true)
| |-- apple: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- mango: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
"""
Desired output:
root
|-- Fruits: struct (nullable = true)
| |-- APPLE: struct (nullable = true)
| | |-- color: string (nullable = true)
| | |-- shape: string (nullable = true)
| |-- MANGO: struct (nullable = true)
| | |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
In this case I would like to rename the keys in the first level to uppercase.
If I had a map type I could use transform keys:
df.select(transform_keys("Fruits", lambda k, _: upper(k)).alias("data_upper")).display()
Unfortunately, I have a struct type.
AnalysisException: cannot resolve 'transform_keys(Fruits,
lambdafunction(upper(x_18), x_18, y_19))' due to argument data type
mismatch: argument 1 requires map type, however, 'Fruits' is of
struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>
type.;
I'm using Databricks runtime 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).
The function you tried to use (transform_keys) is for map type columns. Your column type is struct.
You could use withField.
from pyspark.sql import functions as F
ds = spark.createDataFrame([], 'Fruits struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>, Vegetables string')
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- apple: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- mango: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
ds = ds.withColumn('Fruits', F.col('Fruits').withField('APPLE', F.col('Fruits.apple')))
ds = ds.withColumn('Fruits', F.col('Fruits').withField('MANGO', F.col('Fruits.mango')))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
You can also recreate the structure, but you will need to include all of the struct fields when recreating.
ds = ds.withColumn('Fruits', F.struct(
F.col('Fruits.apple').alias('APPLE'),
F.col('Fruits.mango').alias('MANGO'),
))
ds.printSchema()
# root
# |-- Fruits: struct (nullable = true)
# | |-- APPLE: struct (nullable = true)
# | | |-- color: string (nullable = true)
# | | |-- shape: string (nullable = true)
# | |-- MANGO: struct (nullable = true)
# | | |-- color: string (nullable = true)
# |-- Vegetables: string (nullable = true)
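If you do not want to list the fields by hand, the struct can also be rebuilt dynamically from its own schema. A minimal sketch, assuming the same ds DataFrame and functions alias F as above:
# Read the first-level field names from the schema and re-alias each one in uppercase.
fruit_fields = ds.schema['Fruits'].dataType.fields  # list of StructField
ds = ds.withColumn('Fruits', F.struct(*[
    F.col(f'Fruits.{f.name}').alias(f.name.upper()) for f in fruit_fields
]))
ds.printSchema()  # same nested fields, first-level keys now uppercase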

How do I get column and nested field names in pyspark?

I have a pyspark df.
df.printSchema()
root
|-- bio: string (nullable = true)
|-- city: string (nullable = true)
|-- company: string (nullable = true)
|-- custom_fields: struct (nullable = true)
| |-- nested_field1: string (nullable = true)
|-- email: string (nullable = true)
|-- first_conversion: struct (nullable = true)
| |-- nested_field2: struct (nullable = true)
| | |-- number: string (nullable = true)
| | |-- state: string (nullable = true)
I would like to iterate over the columns and nested fields in order to get their names (just the names). I should be able to print them and get the following result:
bio
city
company
custom_fields
nested_field1
email
first_conversion
nested_field2
number
state
I can easily print the first level with:
for st in df.schema:
    print(st.name)
But how do I descend into the deeper levels recursively at runtime?
dtypes will give you more details of the schema, although you will have to parse it yourself:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
df.dtypes
# [('id', 'int'), ('rec', 'array<struct<a:int,b:float>>')]
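If you need the names at every depth, a small recursive sketch over the schema (my own illustration, not part of the answer above) could look like this:
from pyspark.sql.types import StructType, ArrayType

def field_names(schema):
    # Yield every field name, descending into structs and arrays of structs.
    for field in schema.fields:
        yield field.name
        dtype = field.dataType
        while isinstance(dtype, ArrayType):  # unwrap array<...> wrappers
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            yield from field_names(dtype)

for name in field_names(df.schema):
    print(name)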

Transform array to column dynamically using pyspark

I'm having trouble with a JSON dataframe:
{
  "keys": [
    {
      "id": 1,
      "start": "2019-05-10",
      "end": "2019-05-11",
      "property": [
        { "key": "home",   "value": "1000" },
        { "key": "office", "value": "exit" },
        { "key": "car",    "value": "ford" }
      ]
    },
    {
      "id": 2,
      "start": "2019-05-11",
      "end": "2019-05-12",
      "property": [
        { "key": "home",   "value": "2000" },
        { "key": "office", "value": "out" },
        { "key": "car",    "value": "fiat" }
      ]
    }
  ]
}
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I need key and value as columns, where key is the column name and value is the value in the dataframe.
At first I used getItem with an alias:
df.select("id", "start", "end",
          col("property.value").getItem(0).alias("home"),
          col("property.value").getItem(1).alias("office"),
          col("property.value").getItem(2).alias("car"))
But the number and position of the elements can change, so I thought about providing a new schema with all the possible values for key and setting the values from my dataframe without depending on position, but I think that would be a low-performance solution.
I also tried using pivot, but I don't get the correct result, shown below; in fact I need separate columns, without a comma in the column name and value:
+---+----------+----------+-------------------+
|id |start     |end       |[home, office, car]|
+---+----------+----------+-------------------+
|1  |2019-05-10|2019-05-11|[1000,exit,ford]   |
|2  |2019-05-11|2019-05-12|[2000,out,fiat]    |
+---+----------+----------+-------------------+
I need this schema, with the fields updated dynamically (their number can vary):
root
|-- id: long (nullable = true)
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- home: string (nullable = true)
|-- office: string (nullable = true)
|-- car: string (nullable = true)
|-- cycle: string (nullable = true)
Can anyone help me, please?
Please find my attempt below. I deliberately expanded it into a few steps so that you can see how the final df is created (feel free to combine these steps; that would not have any impact on performance).
inputJSON = "/tmp/my_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
from pyspark.sql import functions as F
df = dfJSON.select(F.explode(dfJSON["keys"]).alias("x"))
df2 = df.select(F.col("x.start").alias("start"),F.col("x.end").alias("end"),F.col("x.id").alias("id"),F.col("x.property").alias("property"))
df3 = df2.select(F.col("start"),F.col("end"),F.col("id"), F.explode(df2["property"]).alias("properties"))
df4 = df3.select(F.col("start"),F.col("end"),F.col("id"), F.col("properties.key").alias("key"), F.col("properties.value").alias("value"))
df4.groupBy("start","end","id").pivot('key').agg(F.last('value', True)).show()
Output:
+----------+----------+---+----+----+------+
| start| end| id| car|home|office|
+----------+----------+---+----+----+------+
|2019-05-11|2019-05-12| 2|fiat|2000| out|
|2019-05-10|2019-05-11| 1|ford|1000| exit|
+----------+----------+---+----+----+------+
Schemas:
dfJSON.printSchema()
root
|-- keys: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- end: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- start: string (nullable = true)
df2.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- property: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
df3.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- properties: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
df4.printSchema()
root
|-- start: string (nullable = true)
|-- end: string (nullable = true)
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Try with groupBy and pivot.
from pyspark.sql.functions import *
cols=['home','office','car']
spark.read.option("multiline","true").\
json("<path>").\
selectExpr("explode(keys)").\
selectExpr("col.id","col.start","col.end","explode(col.property)").\
select("id","start","end","col.*").\
groupBy("id","start","end").\
pivot("key").\
agg(first("value")).\
withColumn("[home,office,car]",array(*cols)).\
drop(*cols).\
show()
#+---+----------+----------+------------------+
#| id| start| end| [home,office,car]|
#+---+----------+----------+------------------+
#| 1|2019-05-10|2019-05-11|[1000, exit, ford]|
#| 2|2019-05-11|2019-05-12| [2000, out, fiat]|
#+---+----------+----------+------------------+
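If the set of keys is not known up front, the pivot columns can also be derived from the data instead of being hard-coded. A minimal sketch, assuming df4 and the functions alias F from the first answer above:
# Collect the distinct property keys (a small list, so collect() is fine here)
keys = [row['key'] for row in df4.select('key').distinct().collect()]
df4.groupBy('start', 'end', 'id') \
   .pivot('key', keys) \
   .agg(F.first('value')) \
   .show()
Passing the values explicitly to pivot() also avoids the extra pass Spark would otherwise make to discover them.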

PySpark - Json explode nested with Struct and array of struct

I am trying to parse nested JSON with some sample data. Below is the printed schema:
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
I am trying to explode batters and topping separately and combine them.
df_batter = df_json.select("batters.*")
df_explode1= df_batter.withColumn("batter", explode("batter")).select("batter.*")
df_explode2= df_json.withColumn("topping", explode("topping")).select("id",
"type","name","ppu","topping.*")
I am unable to combine the two data frames.
I also tried using a single query:
exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter",
explode("batter"))).withColumn("topping", explode("topping")).select("id",
"type","name","ppu","topping.*","batter.*")
But I am getting an error. Kindly help me solve it. Thanks.
You basically have to explode the arrays together using arrays_zip, which returns a merged array of structs. Try this. I haven't tested it, but it should work.
from pyspark.sql import functions as F
df_json.select("id","type","name","ppu","topping","batters.*")\
.withColumn("zipped", F.explode(F.arrays_zip("batter","topping")))\
.select("id","type","name","ppu","zipped.*").show()
You could also do it one by one:
from pyspark.sql import functions as F
df1=df_json.select("id","type","name","ppu","topping","batters.*")\
.withColumn("batter", F.explode("batter"))\
.select("id","type","name","ppu","topping","batter")
df1.withColumn("topping", F.explode("topping")).select("id","type","name","ppu","topping.*","batter.*")

How to efficiently process records in rdd and maintain the structure of a record

I have been working with Google Analytics data that I have in S3. I am loading the file as follows:
df = sc.textFile('s3n://BUCKET_NAME/2017/1/2/')
After this I get an RDD. To see the schema, I have loaded the data into Spark SQL, and the schema looks like this:
root
|-- channelGrouping: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
| |-- language: string (nullable = true)
| |-- mobileDeviceBranding: string (nullable = true)
| |-- mobileDeviceInfo: string (nullable = true)
| |-- mobileDeviceMarketingName: string (nullable = true)
| |-- mobileDeviceModel: string (nullable = true)
| |-- mobileInputSelector: string (nullable = true)
| |-- operatingSystem: string (nullable = true)
| |-- operatingSystemVersion: string (nullable = true)
| |-- screenColors: string (nullable = true)
| |-- screenResolution: string (nullable = true)
|-- fullVisitorId: string (nullable = true)
|-- geoNetwork: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- cityId: string (nullable = true)
| |-- continent: string (nullable = true)
| |-- country: string (nullable = true)
| |-- latitude: string (nullable = true)
| |-- longitude: string (nullable = true)
| |-- metro: string (nullable = true)
| |-- networkDomain: string (nullable = true)
| |-- networkLocation: string (nullable = true)
| |-- region: string (nullable = true)
| |-- subContinent: string (nullable = true)
What I tried:
import json

def remove_null_device(val):
    _ori = json.loads(val)
    # _ori = val
    _dic_val = _ori['device']
    for key, _value in _dic_val.items():
        if _value == "null":
            _dic_val[key] = "Hello There I am Testing this"
    _ori["device"] = _dic_val
    return _ori

device_data = df_rdd.map(remove_null_device)
Problem statement: I want to iterate over every record. Since this is a nested structure, I am thinking of passing one main key at a time (device, geoNetwork, and so on) and checking whether the values are empty or null.
But this seems to change the structure of the whole record, and the items are not getting updated; I don't know why. Please suggest a better approach.
Thanks!
To be clear: I want to check every field in device for empty, null, or "(not set)" values, update those values, and return the updated row while keeping the schema intact.
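For what it's worth, a minimal sketch of one way to keep the per-record structure: parse the JSON string, fix placeholder values in a single top-level section, and serialize it back so every record stays a JSON string (clean_section and the replacement value are my own illustration, not from the original post):
import json

def clean_section(record_str, section):
    # Parse one record, replace placeholder values in the given top-level
    # section (e.g. 'device' or 'geoNetwork'), and re-serialize it.
    record = json.loads(record_str)
    section_dict = record.get(section) or {}
    for key, value in section_dict.items():
        if value is None or value in ("", "null", "(not set)"):
            section_dict[key] = None  # replacement value is up to you
    record[section] = section_dict
    return json.dumps(record)

device_data = df.map(lambda rec: clean_section(rec, 'device'))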
