I have to read a file into spark (databricks) as bytes, and convert it to a string.
file_bytes.decode("utf-8")
This is all fine, and I have my data, as a pipe delimited string, including carriage returns etc. It all looks good. Something like:
"Column1"|"Column2"|"Column3"|"Column4"|"Column5"
"This"|"is"|"some"|"data."|
"Shorter"|"line."|||
I want this in a dataframe though so that I can manipulate it, and initially I was attempting to use the following:
df = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", '|')
.load(???)
I appreciate that the load() portion is really meant to be a path to a location on the filesystem ... so have been struggling with that one.
I have therefore reverted to using pandas as it makes life a lot easier:
import io
import pandas
temp = io.StringIO(file_bytes.decode("utf-8"))
df = pandas.read_csv(temp, sep="|")
This is a pandas dataframe, and not a spark dataframe, which as far as I am aware (and it's a very loose awareness) has pros and cons relating to where it lives (in memory) which relates to scaleability/ cluster-usage etc.
Initially, is there a way for me to get my string into a spark dataframe using sqlContext? Maybe I am missing some parameter or switch etc., or should I just stick with pandas?
The main thing I am worried about is that right now files are quite small (200 kb or so), but they might not be forever, and I'd like to reuse a pattern that will allow me to work with larger things (which is why I am marginally concerned about using pandas).
You can actually load an RDD of strings using the CSV reader.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
So, assuming lines is an RDD of strings that you parsed as you described, you can run:
df = spark.read.csv(lines, sep='|', header=True, inferSchema=True)
The CSV source will then scan the RDD instead of trying to load files. This lets you perform custom pre-processing prior to parsing.
Related
Coming from here, I'm trying to read the correct values from this dataset in Pyspark. I made a good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lins):
Do you know how could I get rid of them? Or else, how can I read the CSV with format using another program? It's very hard for me to use a text editor like Vim or Nano and try to guess where are the errors. Thank you!
Spark seems to have difficulty in reading this line:
2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...
because there are three double quotes. However pandas seem to understand that well, so as a workaround, you can use pandas to read the csv file first, and convert to a Spark dataframe. Normally this is not recommended because of the large overhead involved, but for this small csv file the performance hit should be acceptable.
df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))
The replace is for replacing nan with None in the pandas dataframe. Spark thinks nan is a float, and it gets confused when there is nan in string type columns.
If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED':
df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')
I have significantly large volume of "Raw data" stored in pickle files. At first I have to read/load them in pandas dataframe. Afterwards I do some analysis, update something, analyze again and so on.
Every time I run the code, it reads the raw data from the pickle files. This takes quite some time that I want to avoid. One dirty solution is, I load the files once and then comment out the reading section. Of course I remain careful not to %reset the namespace.
import pandas as pd
df_1 = pd.read_pickle('MyFile_1.pkl')
df_2 = pd.read_pickle('MyFile_2.pkl')
df_3 = pd.read_pickle('MyFile_3.pkl')
Do some work on the loaded data.....
Is there some smart way to do it? Something like
if Myfile_1.pkl is NOT loaded:
df_1 = pd.read_pickle('MyFile_1.pkl')
I have a large tabular data, which needs to be merged and splitted by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd;
from functools import reduce;
large_df = pd.read_table('large_file.csv', sep=',')
This, basically load the whole data in memory th
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
pd.Dataframe.to_csv(df_by_block, "df_" + str(block_id), sep="\t", index=False)
The only problem with above code is memory allocation, which freezes my desktop. I tried to transfer this code to dask but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach but, it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())
# groupby the dataframe
large_df_grouped = large_df.set_index('contig')
# now, split dataframes
for items in contig_list:
my_df = large_df_grouped.loc[items].compute().reset_index()
pd.DataFrame.to_csv(my_df, 'dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
but dask doesn't have a neat groupb
Actually, dask does have groupby + user defined functions with OOM reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
i have started the shell with databrick csv package
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then i read a csv file did some groupby op and dump that to a csv.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(path.csv') ####it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
#but in both cases no columns information(How to add column names to that csv file???)
# again i am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
#i am not getting any columns in that..1st row becomes column names
Please don't answer like add a schema to dataframe after read_csv or while reading mention the column names.
Question1- while giving csv dump is there any way i can add column name with that???
Question2-is there a way to create single csv file(not directory again) which can be opened by ms office or notepad++???
note: I am currently not using cluster, As it is too complex for spark beginner like me. If any one can provide a link for how to deal with to_csv into single file in clustered environment , that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
Just in case,
on spark 2.1 you can create a single csv file with the following lines
dataframe.coalesce(1) //So just a single part- file will be created
.write.mode(SaveMode.Overwrite)
.option("mapreduce.fileoutputcommitter.marksuccessfuljobs","false") //Avoid creating of crc files
.option("header","true") //Write the header
.csv("csvFullPath")
with spark >= 2.o, we can do something like
df = spark.read.csv('path+filename.csv', sep = 'ifany',header='true')
df.write.csv('path_filename of csv',header=True) ###yes still in partitions
df.toPandas().to_csv('path_filename of csv',index=False) ###single csv(Pandas Style)
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
got answer for 1st question, it was a matter of passing one extra parameter header = 'true' along with to csv statement
df.write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
#Alternative for 2nd question
Using topandas.to_csv , But again i don't want to use pandas here, so please suggest if any other way around is there.
I have this simple code
data = pd.read_csv(file_path + 'PSI_TS_clean.csv', nrows=None,
names=None, usecols=None)
data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table')
but my data is too big and I run into memory issues.
What is a clean way to do this chunk by chunk?
If the csv is really big split the file using a method such as detailed here : chunking-data-from-a-large-file-for-multiprocessing
then iterate through the files and use pd.read_csv on each then use the pd.to_hdf method
for to_hdf check the parameters here: DataFrame.to_hdf you need to ensure mode 'a' and consider append.
Without knowing further detail about the dataframe structure its difficult to comment further.
also for read_csv there is the param: low_memory=False