I ran a spark job that culminated in saving a Parquet file, and the job completed successfully. However, I only specified the name of the file, and did not specify the HDFS path. Is there a way to print out the default HDFS path that spark wrote the file to? I looked at sc._conf.getAll(), but there doesn't seem to be anything useful there.
AFAIK this is one way to do it (apart from the simple command-line route: hadoop fs -ls -R | grep -i yourfile).
Below is an example Scala code snippet (if you want to do it in Python or Java you can emulate the same API calls).
Get the list of Parquet files and filter them like below:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
//other imports here
lazy val sparkConf = new SparkConf()
lazy val sc = SparkContext.getOrCreate(sparkConf)
lazy val fileSystem = FileSystem.get(sc.hadoopConfiguration)
val allStatuses = listChildStatuses(fileSystem, new Path("yourbasepathofHDFS")) // normally something like hdfs://server/user
val allparquet = allStatuses.filter(_.getPath.getName.endsWith(".parquet"))
// now you can print these Parquet files; yours will be among them, which tells you the base path
The supporting methods are below.
/**
 * Get [[org.apache.hadoop.fs.FileStatus]] objects for all children (files) under the given base path. If the
 * given path points to a file, return a single-element collection containing the [[org.apache.hadoop.fs.FileStatus]] of
 * that file.
 */
def listChildStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
  listChildStatuses(fs, fs.getFileStatus(basePath))
}
/**
 * Get [[FileStatus]] objects for all children (files) under the given base path. If the
 * given path points to a file, return a single-element collection containing the [[FileStatus]] of
 * that file.
 */
def listChildStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    val (directories, leaves) = fs.listStatus(status.getPath).partition(_.isDirectory)
    leaves ++ directories.flatMap(f => listChildStatuses(fs, f))
  }
  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}
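Since the original question was asked from PySpark, here is a rough Python equivalent of the Scala snippet above, going through Spark's JVM gateway (an untested sketch; replace the base path with whatever your cluster uses, typically something like hdfs://server/user):
# Rough PySpark sketch: walk an HDFS base path via the Hadoop FileSystem API
# exposed through the JVM gateway and print any *.parquet paths found.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

def list_child_statuses(fs, path):
    """Recursively collect FileStatus objects for all files under path."""
    statuses = []
    for status in fs.listStatus(path):
        if status.isDirectory():
            statuses.extend(list_child_statuses(fs, status.getPath()))
        else:
            statuses.append(status)
    return statuses

base = hadoop.fs.Path("hdfs://yourserver/user")  # your base path here
for status in list_child_statuses(fs, base):
    if status.getPath().getName().endswith(".parquet"):
        print(status.getPath())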
Ok, so I am having a weird one. I am running python in a SideFX Hython (their custom build) implementation that is using PDG. The only real difference between Hython and vanilla Python is some internal functions for handling geometry data and compiled nodes, which shouldn't be an issue even though they are being used.
The way the code runs, I am generating a list of files from the disk which creates PDG work items. Those work items are then processed in parallel by PDG. Here is the code for that:
import importlib.util
import pdg
import os
from pdg.processor import PyProcessor
import json
class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self, node)
        self.extractor_module = 'GeoExtractor'

    def onGenerate(self, item_holder, upstream_items, generation_type):
        for upstream_item in upstream_items:
            new_item = item_holder.addWorkItem(parent=upstream_item, inProcess=True)
        return pdg.result.Success

    def onCookTask(self, work_item):
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(GE)
        GE.convert(f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}", work_item.index, f'FRAME { work_item.index }', self.extractor_module)
        return pdg.result.Success
def bulk_convert(path_pattern, extractor_module='GeoExtractor'):
    type_registry = pdg.TypeRegistry.types()
    try:
        type_registry.registerNode(CustomProcessor, pdg.nodeType.Processor, name="customprocessor", label="Custom Processor", category="Custom")
    except Exception:
        pass
    whereItWorks = pdg.GraphContext("testBed")
    whatWorks = whereItWorks.addScheduler("localscheduler")
    whatWorks.setWorkingDir(os.getcwd(), '$HIP')
    whereItWorks.setValues(f'{whatWorks.name}', {'maxprocsmenu': -1, 'tempdirmenu': 0, 'verbose': 1})
    findem = whereItWorks.addNode("filepattern")
    whereItWorks.setValue(f'{findem.name}', 'pattern', path_pattern, 0)
    generic = whereItWorks.addNode("genericgenerator")
    whereItWorks.setValue(generic.name, 'itemcount', 4, 0)
    custom = whereItWorks.addNode("customprocessor")
    custom.extractor_module = extractor_module
    node1 = [findem]
    node2 = [custom] * len(node1)
    for n1, n2 in zip(node1, node2):
        whereItWorks.connect(f'{n1.name}.output', f'{n2.name}.input')
        n2.cook(True)
        for node in whereItWorks.graph.nodes():
            node.dirty(False)
        whereItWorks.disconnect(f'{n1.name}.output', f'{n2.name}.input')
    print("FULLY DONE")
import os
import hou
import traceback
import CustomWriter
import importlib
def convert(filename, frame_id, marker, extractor_module='GeoExtractor'):
    Extractor = importlib.__import__(extractor_module)
    base, ext = os.path.splitext(filename)
    if ext == '.sc':
        base = os.path.splitext(base)[0]
    dest_file = base + ".custom"
    geo = hou.Geometry()
    geo.loadFromFile(filename)
    try:
        frame = Extractor.extract_geometry(geo, frame_id)
    except Exception as e:
        print(f'F{ frame_id } Geometry extraction failed: { traceback.format_exc() }.')
        return None
    print(f'F{ frame_id } Geometry extracted. Writing file { dest_file }.')
    try:
        CustomWriter.write_frame(frame, dest_file)
    except Exception as e:
        print(f'F{ frame_id } writing failed: { e }.')
    print(marker + " SUCCESS")
The onCookTask code is run when the work item is processed.
Inside the GeoExtractor.py program I import the geometry file defined by the work item, then convert it into a couple of Pandas dataframes so I can collate and process the large volume of data quickly; the result is then passed to a custom set of functions that write binary files to disk from the Pandas data.
Everything appears to run flawlessly until I check my output binaries and see that their file sizes grow far more than they should. This indicates that something is either being shared between instances or not being cleared from memory, and that subsequent loads of the extractor code are appending to dataframes that share the same names.
I have run the GeoExtractor code sequentially, with the Python instance closing between each file conversion, using the exact same code, and the files are fine: they grow only slowly as the geometry data volume grows. So the issue has to lie somewhere in the parallelization with PDG and in calling the GeoExtractor.py code over and over for each work item.
I have contemplated moving the importlib calls into the class's __init__(), leaving only the call to the member function in onCookTask(), and maybe even going so far as to pass a unique variable for each work item that GeoExtractor uses to create a closure of its internal functions so they are unique instances in memory.
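To make that last idea concrete, something along these lines is what I have in mind (an untested sketch, names are illustrative): load the callback under a unique module name per work item so that nothing held at module level, such as accumulating dataframes, is shared between items.
# Untested sketch: import the callback under a per-item module name so that
# module-level state is not shared between work items.
import importlib.util

def load_private_module(path, unique_suffix):
    spec = importlib.util.spec_from_file_location(f"callback_{unique_suffix}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# e.g. inside onCookTask:
# GE = load_private_module("Geo2Custom.py", work_item.index)
# GE.convert(file_path, work_item.index, f'FRAME {work_item.index}', self.extractor_module)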
I tried to do a stripped down version of GeoExtractor and since I'm not sure where the leak is, I just ended up pulling out comments with proprietary or superfluous information and changing some custom library names, but the file ended up kinda long so I am including a pastebin: https://pastebin.com/4HHS8D2W
As for CustomGeometry and CustomWriter, there is no working form of either of those libraries that will be NDA safe, so unfortunately they have to stay blackboxed. The CustomGeometry is a handful of container classes which organize all of the data coming out of the geometry, and the writer is a formatter/writer for the binary format we are utilizing. I am hoping the issue wouldn't be in either of them.
Edit 1: I fixed an issue in the example code.
Edit 2: Added larger examples.
I have a pretty simple script running; it works in one folder but doesn't work in another. I have been going at this for a couple of hours. Permissions for both files are the same, the script is the same, and nothing differs on the Python side. I have tried allowing full control on the files I am attempting to interact with, as well as on the duplicated Python script. I tested the duplicate script in C:/ and in Program Files (x86) with no resolution. The script only seems to work from one folder, yet nothing is different between the two scripts.
Script I am attempting to copy into a new folder and use:
import os
import sys
import shutil
def parse(p):
    q = p
    return q

# per line in textbox, create element in list
zz = ((parse(sys.argv[1]).replace("'", "")).split("\n"))
zz = list(filter(("").__ne__, zz))
last_element = zz[-1:]
last_element = (last_element[0]).split("[[")
zz = zz[:-1]
zz.append(last_element[0])
last_element = last_element[1]
if last_element == "product_url.txt":
    os.chdir(r"C:\Cactus (2022)\supported_websites\XXX")
else:
    os.chdir(r'C:\Program Files (x86)\Cactus (2022)\supported_websites\XXX\XXX')
a_file = open('%s' % last_element, "w")
for x in zz:
    if x == "":
        pass
    else:
        a_file.write("%s\n" % x)
Calling from C#:
MessageBox.Show(richTextBox4.Text);
panel1.Visible = false;
string task_information;
task_information = richTextBox4.Text + @"[[product_url.txt";
ProcessStartInfo rtInfo = new ProcessStartInfo(@"C://Program Files (x86)//Cactus (2022)//repo//python.exe");
rtInfo.FileName = "C://Program Files (x86)//Cactus (2022)//repo//python.exe";
rtInfo.Arguments = "C://Cactus (2022)//modifytextfilelines.py '" + task_information + "'";
rtInfo.UseShellExecute = false;
rtInfo.CreateNoWindow = true;
Process.Start(rtInfo);
The only way it works:
MessageBox.Show(richTextBox4.Text);
panel1.Visible = false;
string task_information;
task_information = richTextBox4.Text + @"[[product_url.txt";
ProcessStartInfo rtInfo = new ProcessStartInfo(@"C://Program Files (x86)//Cactus (2022)//repo//python.exe");
rtInfo.FileName = "C://Program Files (x86)//Cactus (2022)//repo//python.exe";
rtInfo.Arguments = "C://Cogs//modifytextfilelines.py '" + task_information + "'";
rtInfo.UseShellExecute = false;
rtInfo.CreateNoWindow = true;
Process.Start(rtInfo);
Let me emphasize: the script is exactly the same, copied and pasted from my "Cogs" folder. It doesn't work at all if I copy and paste it into a new folder and modify the Arguments line in C# to point at that directory.
Edit:
Did more testing; it seems to be the space in "Cactus (2022)" (I replaced parts of the folder names with XXX in the code pasted above). I copied my Cogs folder into C:/ and it worked fine. I renamed it to "REPO" and it still worked, then changed it to "RE PO" and it stopped working. So the issue is how the Arguments string is parsed when the path contains a space.
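To see what I mean, here is a small, hypothetical Python illustration of the quoting difference (paths and argument text are made up); it suggests the fix is to wrap the script path in escaped quotes when building the Arguments string in C#.
# Hypothetical illustration: when the command line is built without quotes
# around a path containing a space, python.exe receives the path split into
# two arguments and is asked to run "C://RE", which does not exist.
import subprocess

python_exe = r"C:\Program Files (x86)\Cactus (2022)\repo\python.exe"

# Split at the space: fails because python.exe cannot open the file 'C://RE'.
bad_args = f'"{python_exe}" C://RE PO//modifytextfilelines.py some_text'

# Quoted: the script path survives as a single argument and the script runs.
good_args = f'"{python_exe}" "C://RE PO//modifytextfilelines.py" some_text'

# subprocess.run(bad_args)
# subprocess.run(good_args)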
We will receive up to 10k JSON files in a separate directory that must be parsed and converted to separate .csv files. Then the file at the URL in each must be downloaded to another directory. I was planning on doing this in Automator on the Mac and calling a Python script to download the files. I have the portion of the shell script done to convert to CSV, but have no idea where to start with Python to download the URLs.
Here's what I have so far for Automator:
- Shell = /bin/bash
- Pass input = as arguments
- Code = as follows
#!/bin/bash
/usr/bin/perl -CSDA -w <<'EOF' - "$@" > ~/Desktop/out_"$(date '+%F_%H%M%S')".csv
use strict;
use JSON::Syck;
$JSON::Syck::ImplicitUnicode = 1;
# json node paths to extract
my @paths = ('/upload_date', '/title', '/webpage_url');
for (@ARGV) {
my $json;
open(IN, "<", $_) or die "$!";
{
local $/;
$json = <IN>;
}
close IN;
my $data = JSON::Syck::Load($json) or next;
my @values = map { &json_node_at_path($data, $_) } @paths;
{
# output CSV spec
# - field separator = SPACE
# - record separator = LF
# - every field is quoted
local $, = qq( );
local $\ = qq(\n);
print map { s/"/""/og; q(").$_.q("); } @values;
}
}
sub json_node_at_path ($$) {
# $ : (reference) json object
# $ : (string) node path
#
# E.g. Given node path = '/abc/0/def', it returns either
# $obj->{'abc'}->[0]->{'def'} if $obj->{'abc'} is ARRAY; or
# $obj->{'abc'}->{'0'}->{'def'} if $obj->{'abc'} is HASH.
my ($obj, $path) = @_;
my $r = $obj;
for ( map { /(^.+$)/ } split /\//, $path ) {
if ( /^[0-9]+$/ && ref($r) eq 'ARRAY' ) {
$r = $r->[$_];
}
else {
$r = $r->{$_};
}
}
return $r;
}
EOF
I'm unfamiliar with Automator, so perhaps someone else can address that, but as far as the Python portion goes, it is fairly simple to download a file from a URL. It would go something like this:
import requests
r = requests.get(url) # assuming you don't need to do any authentication
with open("my_file_name", "wb") as f:
f.write(r.content)
Requests is a great library for handling HTTP(S), and since the content attribute of the Response is a byte string, we can open a file for writing bytes (the "wb") and write it directly. This works for executable payloads too, so be sure you know what you are downloading. If you don't already have requests installed, run pip install requests or the Mac equivalent.
If you were inclined to do your whole process in Python, I would suggest you look at the json and csv packages. Both are part of the standard library and provide high-level interfaces for exactly what you are doing.
Edit:
Here's an example if you were using the json module on a file like this:
[
{
"url": <some url>,
"name": <the name of the file>
}
]
Your Python code might look similar to this:
import requests
import json
with open("my_json_file.json", "r") as json_f:
for item in json.load(json_f)
r = requests.get(item["url"])
with open(item["name"], "wb") as f:
f.write(r.content)
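And if you end up doing the CSV conversion from Python as well, the standard csv module pairs naturally with this; a small sketch, reusing the hypothetical url/name fields from the example above:
import csv
import json

# Write one CSV row per JSON item (hypothetical fields: "name" and "url").
with open("my_json_file.json", "r") as json_f, \
        open("out.csv", "w", newline="") as csv_f:
    writer = csv.writer(csv_f)
    writer.writerow(["name", "url"])  # header row
    for item in json.load(json_f):
        writer.writerow([item["name"], item["url"]])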
Hi, I'm trying to set up a toolchain for the Fn project. The approach is to set up a toolchain per binary available on GitHub and then, in theory, use it in a rule.
I have a common package which has the available binaries:
default_version = "0.5.44"
os_list = [
"linux",
"mac",
"windows"
]
def get_bin_name(os):
    return "fn_cli_%s_bin" % os
The download part looks like this:
load(":common.bzl", "get_bin_name", "os_list", "default_version")
_url = "https://github.com/fnproject/cli/releases/download/{version}/{file}"
_os_to_file = {
"linux" : "fn_linux",
"mac" : "fn_mac",
"windows" : "fn.exe",
}
def _fn_binary(os):
    name = get_bin_name(os)
    file = _os_to_file.get(os)
    url = _url.format(
        file = file,
        version = default_version
    )
    native.http_file(
        name = name,
        urls = [url],
        executable = True
    )

def fn_binaries():
    """
    Installs the hermetic binary for Fn.
    """
    for os in os_list:
        _fn_binary(os)
Then I set up the toolchain like this:
load(":common.bzl", "get_bin_name", "os_list")
_toolchain_type = "toolchain_type"
FnInfo = provider(
    doc = "Information about the Fn Framework CLI.",
    fields = {
        "bin" : "The Fn Framework binary."
    }
)

def _fn_cli_toolchain(ctx):
    toolchain_info = platform_common.ToolchainInfo(
        fn_info = FnInfo(
            bin = ctx.attr.bin
        )
    )
    return [toolchain_info]

fn_toolchain = rule(
    implementation = _fn_cli_toolchain,
    attrs = {
        "bin" : attr.label(mandatory = True)
    }
)
def _add_toolchain(os):
    toolchain_name = "fn_cli_%s" % os
    native_toolchain_name = "fn_cli_%s_toolchain" % os
    bin_name = get_bin_name(os)
    compatibility = ["@bazel_tools//platforms:%s" % os]
    fn_toolchain(
        name = toolchain_name,
        bin = ":%s" % bin_name,
        visibility = ["//visibility:public"]
    )
    native.toolchain(
        name = native_toolchain_name,
        toolchain = ":%s" % toolchain_name,
        toolchain_type = ":%s" % _toolchain_type,
        target_compatible_with = compatibility
    )
def setup_toolchains():
    """
    Macro to set up the toolchains for the different platforms.
    """
    native.toolchain_type(name = _toolchain_type)
    for os in os_list:
        _add_toolchain(os)
def fn_register():
    """
    Registers the Fn toolchains.
    """
    path = "//tools/bazel_rules/fn/internal/cli:fn_cli_%s_toolchain"
    for os in os_list:
        native.register_toolchains(path % os)
In my BUILD file I call setup_toolchains:
load(":toolchain.bzl", "setup_toolchains")
setup_toolchains()
With this set up I have a rule which looks like this:
_toolchain = "//tools/bazel_rules/fn/cli:toolchain_type"
def _fn(ctx):
    print("HEY")
    bin = ctx.toolchains[_toolchain].fn_info.bin
    print(bin)

# TEST RULE
fn = rule(
    implementation = _fn,
    toolchains = [_toolchain]
)
Workspace:
workspace(name = "basicwindow")
load("//tools/bazel_rules/fn:defs.bzl", "fn_binaries", "fn_register")
fn_binaries()
fn_register()
When I query for the different binaries with bazel query //tools/bazel_rules/fn/internal/cli:fn_cli_linux_bin they are there, but calling bazel build //... results in this error:
ERROR: /Users/marcguilera/Code/Marc/basicwindow/tools/bazel_rules/fn/internal/cli/BUILD.bazel:2:1: in bin attribute of fn_toolchain rule //tools/bazel_rules/fn/internal/cli:fn_cli_windows: rule '//tools/bazel_rules/fn/internal/cli:fn_cli_windows_bin' does not exist. Since this rule was created by the macro 'setup_toolchains', the error might have been caused by the macro implementation in /Users/marcguilera/Code/Marc/basicwindow/tools/bazel_rules/fn/internal/cli/toolchain.bzl:35:15
ERROR: Analysis of target '//tools/bazel_rules/fn/internal/cli:fn_cli_windows' failed; build aborted: Analysis of target '//tools/bazel_rules/fn/internal/cli:fn_cli_windows' failed; build aborted
INFO: Elapsed time: 0.079s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
I tried to follow the toolchain tutorial in the documentation, but I can't get it to work. Another interesting thing is that I'm actually using a Mac, so the toolchain compatibility also seems to be wrong.
I'm using this toolchain in a repo, so the paths vary, but here's a repo containing only the fn stuff for ease of reading.
Two things:
One, I suspect this is your actual issue: https://github.com/bazelbuild/bazel/issues/6828
The core of the problem is that, if the toolchain_type target is in an external repository, it always needs to be referred to by its fully-qualified name, never by a locally-qualified name.
The second is a little more fundamental: you have a lot of Starlark macros here that are generating other targets, and it's very hard to read. It would actually be a lot simpler to remove a lot of the macros, such as _fn_binary, fn_binaries, and _add_toolchains. Just have setup_toolchains directly create the needed native.toolchain targets, and have a repository macro that calls http_archive three times to declare the three different sets of binaries. This will make the code much easier to read and thus easier to debug.
For debugging toolchains, I follow two steps: first, I verify that the tool repositories exist and can be accessed directly, and then I check the toolchain registration and resolution.
After going several levels deep, it looks like you're calling http_archive, naming the new repository @linux, and downloading a specific binary file. This isn't how http_archive works: it expects to fetch a zip file (or tar.gz file), extract that, and find a WORKSPACE and at least one BUILD file inside.
My suggestions: simplify your macros, get the external repositories clearly defined, and then explore using toolchain resolution to choose the right one.
I'm happy to help answer further questions as needed.
Scenario:
My input will be multiple small XMLs, and I am supposed to read these XMLs as RDDs, perform a join with another dataset, form an RDD, and send the output as XML.
Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?
Sample XML:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
How will this be loaded into the RDD?
Yes, it is possible, but the details will differ depending on the approach you take.
If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as you would in local mode; a sketch of that follows below.
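For example, a minimal PySpark sketch of that approach (assuming well-formed XML with the user records from the question; the input path is made up):
import xml.etree.ElementTree as ET
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def parse_users(path_and_content):
    # Each element of wholeTextFiles is a (path, content) pair.
    path, content = path_and_content
    root = ET.fromstring(content)
    return [(u.findtext("account"), u.findtext("name"), u.findtext("number"))
            for u in root.iter("user")]

users = sc.wholeTextFiles("hdfs:///path/to/xml/dir").flatMap(parse_users)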
For larger files you can use Hadoop input formats.
If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed.
Otherwise, Mahout provides an XmlInputFormat.
Finally, it is possible to read the file using SparkContext.textFile and adjust later for records spanning partition boundaries. Conceptually, this means something similar to creating a sliding window or partitioning records into groups of a fixed size:
use a first mapPartitionsWithIndex pass to identify records broken between partitions and collect the broken fragments
use a second mapPartitionsWithIndex pass to repair the broken records (a rough sketch of both passes follows below)
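A very rough PySpark sketch of those two passes, assuming line-based input where each record opens with a <user> tag on its own line (everything here, including the path, is illustrative):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.textFile("hdfs:///path/to/users.xml")

# Pass 1: for every partition, grab the lines that precede the first "<user>";
# they belong to a record that started in the previous partition.
def head_fragment(idx, it):
    frag = []
    for line in it:
        if "<user>" in line:
            break
        frag.append(line)
    yield (idx, frag)

fragments = dict(lines.mapPartitionsWithIndex(head_fragment).collect())

# Pass 2: drop that leading fragment from each partition and glue the next
# partition's fragment onto the end, so every record is whole again.
def repair(idx, it):
    part = list(it)
    if idx > 0:
        part = part[len(fragments.get(idx, [])):]
    part += fragments.get(idx + 1, [])
    for line in part:
        yield line

whole_lines = lines.mapPartitionsWithIndex(repair)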
Edit:
There is also the relatively new spark-xml package, which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's a way to do it using Hadoop input formats to read XML data in Spark, as explained by @zero323.
Input data:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
Code for reading XML Input:
You can get the required jars at this link.
Imports:
//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml_loader_import
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
case class User(account: String, name: String, number: String)
def main(args: Array[String]): Unit = {
val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
val sparkMasterUrl = "spark://SYSTEMX:7077"
var jars = new Array[String](2)
jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"
val conf = new SparkConf().setAppName("XML Reading")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setSparkHome(sparkHome)
.set("spark.executor.memory", "512m")
.set("spark.default.deployCores", "12")
.set("spark.cores.max", "12")
.setJars(jars)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// ---- loading user from XML
// calling function 1.1
val pages = readFile("src/input_data", "<user>", "<\\user>", sc)
val xmlUserDF = pages.map { tuple =>
{
val account = extractField(tuple, "account")
val name = extractField(tuple, "name")
val number = extractField(tuple, "number")
User(account, name, number)
}
}.toDF()
println(xmlUserDF.count())
xmlUserDF.show()
}
Functions:
def readFile(path: String, start_tag: String, end_tag: String,
sc: SparkContext) = {
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
val rawXmls = sc.newAPIHadoopFile(
path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
rawXmls.map(p => p._2.toString)
}
def extractField(tuple: String, tag: String) = {
var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
if (value.contains("<" + tag + ">") &&
value.contains("</" + tag + ">")) {
value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
}
value
}
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result is obtained as a DataFrame; you can convert it to an RDD as per your requirement, like this:
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
(x.get(0).toString(),x.get(1).toString(),x.get(2).toString()) }
Please evaluate it and see whether it helps you in some way.
This will help you.
package packagename;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import com.databricks.spark.xml.XmlReader;
public class XmlreaderSpark {
public static void main(String arr[]){
String localxml="file path";
String booksFileTag = "user";
String warehouseLocation = "file:" + System.getProperty("user.dir") + "/spark-warehouse";
System.out.println("warehouseLocation" + warehouseLocation);
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("Java Spark SQL Example")
.config("spark.some.config.option", "some-value").config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().config("spark.sql.crossJoin.enabled", "true")
.getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
df.show();
}
}
You need to add this dependency in your POM.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
Also, your input file is not in the proper format: the closing tags use "<\" instead of "</".
Thanks.
There are two good options for simple cases:
wholeTextFiles. Use the map method with your XML parser, which could be the Scala XML pull parser (quicker to code) or the SAX pull parser (better performance).
Hadoop streaming XMLInputFormat, for which you must define the start and end tags <user> </user> to process the input; however, it creates one partition per user tag.
The spark-xml package is a good option too.
With all of these options you are limited to processing simple XMLs that can be interpreted as a dataset with rows and columns.
However, if the structure gets a little more complex, those options won't be useful.
For example, if you have one more entity there:
<root>
  <users>
    <user>...</user>
  </users>
  <companies>
    <company>...</company>
  </companies>
</root>
Now you need to generate 2 RDDs and change your parser to recognise the <company> tag.
This is just a simple case, but the XML could be much more complex and you would need to include more and more changes.
To solve this complexity we've built Flexter on top of Apache Spark to take the pain out of processing XML files on Spark. I also recommend reading about converting XML on Spark to Parquet; the latter post also includes some code samples that show how the output can be queried with SparkSQL.
Disclaimer: I work for Sonra