I have a nested JSON dict that I need to convert to a Spark DataFrame. This JSON dict is present in a dataframe column. I have been trying to parse the dict in that column using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:
{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}
I need to extract the nested dict value. I used the code below to clean the data and read it into a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context",StringType(),True)]) #Schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value",regexp_replace(raw_df.value,'/','')).select('cleansed_value') #Removed extra '/' in the data
cleansed_df.select(from_json('cleansed_value',schema=schema1)).show(1, truncate=False)
I get a null dataframe each time I run the above code. Please help.
I tried the posts below and they didn't work:
PySpark: Read nested JSON from a String Type Column and create columns
I also tried writing it to a JSON file and reading it back; that didn't work either:
reading a nested JSON file in pyspark
The null characters (\u0000) break the JSON parsing. You can replace them as well, along with the slash:
from pyspark.sql import functions as F

df = spark.read.json('path')
df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')
)
df2.show(20, 0)
+-------------------+------------------+------+
|value              |cleansed_value    |parsed|
+-------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-------------------+------------------+------+
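From there you can pull the nested value straight out of the struct (a small follow-up sketch using the same column names as in the output above):
# select the nested "context" field from the parsed struct
df2.select(F.col('parsed.context').alias('context')).show()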
[{"Apertura":35,"Apertura_Homogeneo":35,"Cantidad_Operaciones":1,"Cierre":35,"Cierre_Homogeneo":35,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"02\/02\/2018","Maximo":35,"Maximo_Homogeneo":35,"Minimo":35,"Minimo_Homogeneo":35,"Monto_Operado_Pesos":175,"Promedio":35,"Promedio_Homogeneo":35,"Simbolo":"INAG","Variacion":-5.15,"Variacion_Homogeneo":0,"Vencimiento":"48hs","Volumen_Nominal":5},
{"Apertura":34.95,"Apertura_Homogeneo":34.95,"Cantidad_Operaciones":2,"Cierre":34.95,"Cierre_Homogeneo":34.95,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"05\/02\/2018","Maximo":34.95,"Maximo_Homogeneo":34.95,"Minimo":34.95,"Minimo_Homogeneo":34.95,"Monto_Operado_Pesos":5243,"Promedio":-79228162514264337593543950335,"Promedio_Homogeneo":-79228162514264337593543950335,"Simbolo":"INAG","Variacion":-0.14,"Variacion_Homogeneo":-0.14,"Vencimiento":"48hs","Volumen_Nominal":150},
{"Apertura":32.10,"Apertura_Homogeneo":32.10,"Cantidad_Operaciones":2,"Cierre":32.10,"Cierre_Homogeneo":32.10,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"07\/02\/2018","Maximo":32.10,"Maximo_Homogeneo":32.10,"Minimo":32.10,"Minimo_Homogeneo":32.10,"Monto_Operado_Pesos":98756,"Promedio":32.10,"Promedio_Homogeneo":32.10,"Simbolo":"INAG","Variacion":-8.16,"Variacion_Homogeneo":-8.88,"Vencimiento":"48hs","Volumen_Nominal":3076}]
Hi,
using the same example as above, if I get a CSV file with that data (Arpertura.csv), how can I import and parse it into a pandas DataFrame? The real file is a few gigabytes large. I want to get the sum of Volumen_Nominal across all records (3076+150+5) and do some other slicing and dicing.
Thanks.
Chibi
I tried importing the CSV with
df = pd.read_csv(r'filename')
df_json = df.to_json()
pd.read_json(df_json, orient='split')
but it did not work. I think the list structure at the front has to be removed.
The result I now get is:
Header = [{"Apertura":35,"Apertura_Homogeneo":35,"Cantidad_Operaciones":1,"Cierre":35,"Cierre_Homogeneo":35,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"02\/02\/2018","Maximo":35,"Maximo_Homogeneo":35,"Minimo":35,"Minimo_Homogeneo":35,"Monto_Operado_Pesos":175,"Promedio":35,"Promedio_Homogeneo":35,"Simbolo":"INAG","Variacion":-5.15,"Variacion_Homogeneo":0,"Vencimiento":"48hs","Volumen_Nominal":5}
The body starts with NaN and is followed by the rest:
{"Apertura":34.95,"Apertura_Homogeneo":34.95,"Cantidad_Operaciones":2,"Cierre":34.95,"Cierre_Homogeneo":34.95,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"05\/02\/2018","Maximo":34.95,"Maximo_Homogeneo":34.95,"Minimo":34.95,"Minimo_Homogeneo":34.95,"Monto_Operado_Pesos":5243,"Promedio":-79228162514264337593543950335,"Promedio_Homogeneo":-79228162514264337593543950335,"Simbolo":"INAG","Variacion":-0.14,"Variacion_Homogeneo":-0.14,"Vencimiento":"48hs","Volumen_Nominal":150}
You do not need to convert the dataframe to json and back.
If you want the sum of a column, you can use:
df = pd.read_csv(r'filename')
df["Volumen_Nominal"].sum()
Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now let's say the type column doesn't even exist in the csv file, and I want to pass a literal string as the type value instead of values from a column in the csv file. So for example, I want to create a nested object called org using the values from the location column as before, but I want to just use the string foo for all values of a key called type. How do I accomplish this?
You could just build it by hand:
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x,
                                                        'type': 'foo'})
Use ChainMap. This allows you to use multiple columns (columns_to_use) and even override existing ones (if type is among these columns, it will be overridden):
from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
    lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
BTW, without adding constant values it could be done with df.to_dict():
csv_file['org'] = csv_file[['location', 'type']].to_dict('records')
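If you prefer plain dicts over ChainMap objects, one alternative (just a sketch with the same column names and the constant foo from the question) is to add the constant column first and then call to_dict('records'):
# add the constant 'type' value, then build one dict per row
csv_file['org'] = csv_file.assign(type='foo')[['location', 'type']].to_dict('records')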
I am writing a PySpark program that takes a txt file and then adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to do this and optimize my script using PySpark.
import pandas as pd

df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY', value="Sean")
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to PySpark code:
from pyspark.sql import functions as sf
import datetime

# read in using the dataframe reader
# if you store your csv locally, use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS
spdf = (spark.read.format("csv").option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))

spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
)

spdf2.write.csv("file:///home/path/new/new_file.txt")
This code assumes you are adding Creation_DT and Create_BY with the same value for every row.
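If you want Spark to stamp an actual timestamp rather than a fixed date string (as your desired output suggests), sf.current_timestamp() is one option; this is just a small variation on the code above:
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.current_timestamp())  # Spark fills in the timestamp when the query runs
    .withColumn("Create_BY", sf.lit("Sean"))
)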
I don't see you using any PySpark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] = pd.to_datetime('today')
df['Create_BY'] = "Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
df = df[cols]  # reorder the columns rather than renaming them
df.to_csv("/home/path/new/new_file.txt", index=False)
I am not able to differentiate the data types while profiling a CSV file; every field comes back as a string.
I have tried the code below:
import csv

rdd = sc.textFile(file)
header = rdd.first()
rdd = rdd.filter(lambda x: x != header)
rdd1 = rdd.mapPartitions(lambda x: csv.reader(x))
spark_df = rdd1.toDF(header.split(','))
After profiling the CSV file, all the fields come back as strings only; they are not identified as numeric or date.
The function textFile() does not support schema inference.
If you are reading from a structured source (such as csv), use spark.read.csv instead, which supports schema inference.
Your code would be:
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file)
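You can then check what was inferred (a quick sanity check):
# numeric columns should no longer show up as string
df.printSchema()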
I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This has output in this format:
[(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....]
What I want is to create a CSV file with one column for the labels (the first part of each tuple in the output above) and one for the predictions (the second part). But I don't know how to write to a CSV file in Spark using Python.
How can I create a CSV file with the above output?
Just map the lines of the RDD (labelsAndPredictions) into strings (the lines of the CSV), then use rdd.saveAsTextFile():
def toCSVLine(data):
    return ','.join(str(d) for d in data)
lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')
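For the tuples in the question, each line comes out as a simple label,prediction pair, e.g. (Python 3 float formatting shown):
>>> toCSVLine((0.0, 0.08482142857142858))
'0.0,0.08482142857142858'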
I know this is an old post, but to help someone searching for the same thing, here's how I write a two-column RDD to a single CSV file in PySpark 1.6.2.
The RDD:
>>> rdd.take(5)
[(73342, u'cells'), (62861, u'cell'), (61714, u'studies'), (61377, u'aim'), (60168, u'clinical')]
Now the code:
# First I convert the RDD to a dataframe
# (sqlContext is created for you in the PySpark 1.6 shell)
df = sqlContext.createDataFrame(rdd, ['count', 'word'])
The DF:
>>> df.show()
+-----+-----------+
|count| word|
+-----+-----------+
|73342| cells|
|62861| cell|
|61714| studies|
|61377| aim|
|60168| clinical|
|59275| 2|
|59221| 1|
|58274| data|
|58087|development|
|56579| cancer|
|50243| disease|
|49817| provided|
|49216| specific|
|48857| health|
|48536| study|
|47827| project|
|45573|description|
|45455| applicant|
|44739| program|
|44522| patients|
+-----+-----------+
only showing top 20 rows
Now write to CSV
# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')
P.S.: I am just a beginner learning from posts here on Stack Overflow, so I don't know whether this is the best way. But it worked for me and I hope it will help someone!
It's not good to just join by commas because if fields contain commas, they won't be properly quoted, e.g. ','.join(['a', 'b', '1,2,3', 'c']) gives you a,b,1,2,3,c when you'd want a,b,"1,2,3",c. Instead, you should use Python's csv module to convert each list in the RDD to a properly-formatted csv string:
# python 3
import csv, io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline

# ... do stuff with your rdd ...
rdd = rdd.map(list_to_csv_str)
rdd.saveAsTextFile("output_directory")
Since the csv module only writes to file objects, we have to create an empty "file" with io.StringIO("") and tell the csv.writer to write the csv-formatted string into it. Then, we use output.getvalue() to get the string we just wrote to the "file". To make this code work with Python 2, just replace io with the StringIO module.
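For reference, a Python 2 variant of the same helper might look like this (only the import and the StringIO constructor change):
# python 2
import csv
from StringIO import StringIO

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline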
If you're using the Spark DataFrames API, you can also look into the DataBricks save function, which has a csv format.
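And on Spark 2.0+ the csv format is built into the DataFrame writer, so the external DataBricks package is no longer needed (a minimal sketch, writing to the same kind of output directory as above):
# write the DataFrame as CSV with a header row
df.coalesce(1).write.csv('output_directory', header=True)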
def toCSV(row):
    # each record handed in by map() is a tuple; join its fields with commas
    return ','.join(str(field) for field in row)

rows_of_csv = RDD.map(toCSV)
rows_of_csv.saveAsTextFile('/FileStore/tables/name_of_csv_file.csv')
# choose your path based on your distributed file system