Reading in Spark data frame from multiple files - python

Suppose you have two s3 buckets that you want to read a Spark data frame from. For one location, reading in a Spark data frame would look like this:
file_1 = ("s3://loc1/")
df = spark.read.option("MergeSchema","True").load(file_1)
If we have two files:
file_1 = ("s3://loc1/")
file_2 = ("s3://loc2/")
how would we read them into a single Spark data frame? Is there a way to merge those two file locations?

As the previous comment states, you could read each location individually and then union the resulting data frames.
Another option is to use the Spark RDD API and then convert the result into a data frame. For example (note that textFile takes a single comma-separated string of paths):
sc = spark.sparkContext
raw_data_RDD = sc.textFile("<dir1>,<dir2>,...")
For nested directories, you can use the wildcard symbol (*). One thing you have to consider is whether the schemas of both locations are equal; you may have to do some pre-processing before converting to the data frame. Once your schema is set up, you can just do:
raw_df = spark.createDataFrame(raw_data_RDD, schema=<schema>)
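For completeness, here is a rough sketch of the data-frame-level alternatives described above, using the placeholder bucket paths from the question (a starting point, not tested against your data, and assuming the two locations share compatible columns):
file_1 = "s3://loc1/"
file_2 = "s3://loc2/"

# Option 1: pass both locations to a single load call (the reader accepts a list of paths)
df = spark.read.option("mergeSchema", "true").load([file_1, file_2])

# Option 2: read each location separately and union the results (unionByName needs Spark 2.3+)
df_1 = spark.read.option("mergeSchema", "true").load(file_1)
df_2 = spark.read.option("mergeSchema", "true").load(file_2)
df_union = df_1.unionByName(df_2)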

Related

Passing multiple JSON filenames to pandas and adding to a dataframe

I'm working on the following code, where I pass filenames from a variable "filelist" and then perform some pandas dataframe operations, including extracting data and filtering columns.
filelist = filestring
li = []
merged_df = []
for file in filelist:
    # temp df
    df = pd.read_json(file)
    # filter columns
    df2 = pd.DataFrame(df[['a', 'e']])
    # find data from each JSON
    epf = df['s']['d']['Pf']
    ep = pd.json_normalize(epf, record_path=['u', 'D'], meta=['l', 'dn'])
    # append data from each JSON and add to li_dframe
    li.append(ep)
    frame = pd.concat(li, axis=0, ignore_index=True)
    # merge filtered columns and li_dframe
    merged_df = pd.concat([df2, frame.reset_index(drop=True)], axis=1).ffill()
    print(merged_df)
merged_df.head(n=40)
Where I am having trouble is in the final merged data frame: I can only see data from one of the underlying JSON files (the second in the list). This is strange, as the print function shows that the code has correctly handled both JSON files and extracted the required values. Can anyone advise what I am missing here to ensure merged_df has all the necessary info?
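One likely cause, sketched below under the assumption that every file contains the 'a' and 'e' columns: df2 (and therefore merged_df) is rebuilt from scratch on each pass through the loop, so after the loop it only reflects the last file. Accumulating the filtered columns per file, alongside li, keeps data from every file:
import pandas as pd

li = []
li_cols = []
for file in filelist:
    df = pd.read_json(file)
    # keep this file's filtered columns as well, instead of overwriting df2 each time
    li_cols.append(df[['a', 'e']])
    epf = df['s']['d']['Pf']
    li.append(pd.json_normalize(epf, record_path=['u', 'D'], meta=['l', 'dn']))
# concatenate across all files once, after the loop
frame = pd.concat(li, ignore_index=True)
cols = pd.concat(li_cols, ignore_index=True)
merged_df = pd.concat([cols, frame], axis=1).ffill()
merged_df.head(n=40)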

How to perform operations on a Dask dataframe and export the results to a csv?

I have a large input csv file (several GBs) that I import in Dask with a blocksize of 5e6. The input csv contains two columns: "ID" and "Text".
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
I need to add a third column to ddf1, "Hash", by parsing the existing "Text" column for a string between "Hash=" and ";". In Pandas, I can simply do this:
ddf1['Hash'] = ddf1['Text'].str.extract(r'Hash=(.*?);')
When I do this in Dask, I get an error saying that the "column assignment doesn't support dask.dataframe.core.DataFrame". I tried to use assign but had no luck.
I also need to read multiple large csv files (each several GBs in size) from a directory and concatenate them into another Dask dataframe, ddf2. Each of these csv files has hundreds of columns, but I only need two: "Hash" and "Name". Here is the code to create ddf2:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Then, I need to merge the two dataframes on the "Hash" columns--something like this:
ddf3 = ddf1[['ID', 'ddf1_Hash']].merge(ddf2[['ddf2_Hash', 'Name']], left_on='ddf1_Hash', right_on='ddf2_Hash', how='left')
Finally, I need to export ddf3 as a csv:
df3.to_csv('Output.csv')
I looked around and it seems I could create the column for ddf1 and perform the merge by converting both ddf1 and ddf2 to pandas dataframes using compute. However, that's not an option for me due to the sheer size of these dataframes. I also tried the chunked approach in Pandas, but that fails with an "out of memory" error.
Is there a good way to tackle this problem? I'm still learning Python so any help would be appreciated.
UPDATE:
I am able to create the third column and merge the two dataframes. Though, now the issue is that I can't export the merged dataframe as a csv.
Running regex on a string column. The following snippet uses assign:
import dask.dataframe as dd
import pandas as pd
# this step is just to setup a minimum reproducible example
df = pd.DataFrame(list("abcdefghi"), columns=['A'])
ddf = dd.from_pandas(df, npartitions=3)
# this uses assign to extract the relevant content
ddf = ddf.assign(check_c = lambda x: x['A'].str.extract(r'([a-z])'))
# you can see that the computation was done correctly
ddf.compute()
Concatenating csv files. Do the csv files have the same structure/columns? If so, you can just use dd.read_csv("path_to_csv_files/*.csv"), but if the files have different structures, then your approach is correct:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Merging the dataframes. This is going to be an expensive operation; here are a couple of options to potentially reduce its cost:
if either of the dataframes can fit into memory, then it would help to run .compute() to get a pandas dataframe before the merge;
setting the key variable as the index on one or both dataframes:
ddf1 = ddf1.set_index('Hash')
ddf2 = ddf2.set_index('Hash')
ddf3 = ddf1.merge(ddf2, left_index=True, right_index=True)
Saving the csv. By default, Dask saves each partition to its own csv file, so your path needs to contain an asterisk, e.g.:
df3.to_csv('Output_*.csv', index=False)
There are other options possible (explicit paths, custom name function, see https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single file, you can use
df3.to_csv('Output.csv', index=False, single_file=True)
However, this option is not supported on all systems, so you might want to check that it works using a small sample first (see documentation).
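Putting those pieces together for the original question, a rough end-to-end sketch (file names, column names, and the Hash pattern are taken from the question; treat this as a starting point rather than tested code):
import glob
import dask.dataframe as dd

# read the first file and extract the Hash between "Hash=" and ";"
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
ddf1 = ddf1.assign(Hash=lambda x: x['Text'].str.extract(r'Hash=(.*?);', expand=False))

# concatenate the other files, keeping only the two needed columns
ddf2 = dd.concat(
    [dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')],
    ignore_index=True, axis=0)

# merge on Hash and write one csv per partition
ddf3 = ddf1[['ID', 'Hash']].merge(ddf2, on='Hash', how='left')
ddf3.to_csv('Output_*.csv', index=False)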

write spark dataframe as array of json (pyspark)

I would like to write my Spark dataframe as a set of JSON files, and in particular to have each file be an array of JSON objects.
Let me explain with a simple (reproducible) example.
We have:
import numpy as np
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'x': np.random.rand(100), 'y': np.random.rand(100)}))
Saving the dataframe as:
df.write.json('s3://path/to/json')
each file just created has one JSON object per line, something like:
{"x":0.9953802385540144,"y":0.476027611419198}
{"x":0.929599290575914,"y":0.72878523939521}
{"x":0.951701684432855,"y":0.8008064729546504}
but I would like to have an array of those JSON per file:
[
{"x":0.9953802385540144,"y":0.476027611419198},
{"x":0.929599290575914,"y":0.72878523939521},
{"x":0.951701684432855,"y":0.8008064729546504}
]
It is not currently possible to have spark "natively" write a single file in your desired format because spark works in a distributed (parallel) fashion, with each executor writing its part of the data independently.
However, since you are okay with having each file be an array of JSON (rather than insisting on a single file overall), here is one workaround that you can use to achieve your desired output:
from pyspark.sql.functions import to_json, spark_partition_id, collect_list, col, struct
df.select(to_json(struct(*df.columns)).alias("json"))\
.groupBy(spark_partition_id())\
.agg(collect_list("json").alias("json_list"))\
.select(col("json_list").cast("string"))\
.write.text("s3://path/to/json")
First you create a json from all of the columns in df. Then group by the spark partition ID and aggregate using collect_list. This will put all the jsons on that partition into a list. Since you're aggregating within the partition, there should be no shuffling of data required.
Now select the list column, convert to a string, and write it as a text file.
Here's an example of how one file looks:
[{"x":0.1420523746714616,"y":0.30876114874052263}, ... ]
Note you may get some empty files.
Presumably you can force Spark to write the data to ONE file if you specify an empty groupBy(), but this would force all of the data into a single partition, which could result in an out of memory error; see the sketch below.
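For what it's worth, a sketch of that single-file variant, with the same caveat that everything ends up in one partition:
from pyspark.sql.functions import to_json, collect_list, col, struct

# empty groupBy() puts all rows into a single group (and so a single output file);
# this only works if the data fits comfortably on one executor
df.select(to_json(struct(*df.columns)).alias("json"))\
    .groupBy()\
    .agg(collect_list("json").alias("json_list"))\
    .select(col("json_list").cast("string"))\
    .write.text("s3://path/to/json")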
If the data is not super huge and it's okay to have the list as one JSON file, the following workaround is also valid. First, convert the Pyspark data frame to Pandas and then to a list of dicts. Then, the list can be dumped as JSON.
import json

list_of_dicts = df.toPandas().to_dict('records')
with open('path/to/file.json', 'w') as json_file:
    json_file.write(json.dumps(list_of_dicts))

Drop partition columns when writing parquet in pyspark

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month',f.month(f.col('date_col'))).withColumn('day',f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write the year, month, day columns to your parquet files, because they are already present in the partitionBy clause.
Example:
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify which columns are actually included in your parquet file.
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back, Spark uses the special path segment year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly; year will not be a column in this case:
df = spark.read.csv("data/year=2019/")
df.printSchema()
Also, #Shu's answer could be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you simply don't want to see the columns, you could put a view on top of this table that excludes them, as in the sketch below.
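A minimal sketch of that view idea, using the parquet path from the question and a hypothetical view name:
# the view name below is just an example
df_all = spark.read.parquet('/mnt/test/test.parquet')
df_all.drop('year', 'month', 'day').createOrReplaceTempView('test_no_partition_cols')
spark.table('test_no_partition_cols').printSchema()  # year, month, day no longer appear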

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as orc, partitioned by these dates using Hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So here I am trying to fetch multiple directories from multiple paths, and based on the partitions I have to create two columns, mg_load_date and event_date, for each Spark dataframe. I read these as input and combine them after adding the two columns, deriving the dates for each file respectively.
Is there any other way, since I have many reads (one per file), to read all the files at once while adding the two columns for their specific rows? Or any other way to make the read operation faster?
I guess reading all the files at once like sqlContext.read.format('orc').load(inputpaths) is faster than reading them individually and then merging them.
Any help is appreciated.
dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
As #user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe, as in the sketch below.
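A minimal sketch of that input_file_name approach, assuming the partition layout from the question (mg_load_date appears in each file's path):
from pyspark.sql import functions as F

df = sqlContext.read.format('orc').load(input_paths)
# derive the date from each row's file path, e.g. .../mg_load_date=2018-01-01/part-0000.orc
df = df.withColumn('mg_load_date',
                   F.regexp_extract(F.input_file_name(), r'mg_load_date=([^/]+)', 1))
df = df.withColumn('event_date', F.col('mg_load_date'))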
Spark 2.2.0+: to read from multiple folders using the orc format:
df = spark.read.orc([path1, path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334
