I am just getting started with Pyspark and would like to save a file as a csv instead of a text file. I tried using a couple of answers I found on Stack Overflow such as
def toCSVLine(data):
return ','.join(str(d) for d in data)
and then
rdd = lines.map(toCSVLine)
rdd.saveAsTextFile("file.csv")
It works in that I can open it in excel, however all the information is put into column A in the spreadsheet. I would like to be able to put each column in the rdd (an example would be ("ID", "rating") into a separate column in excel so ID would be in column A and rating would be in column B. Is there a way to do this?
If you're on Spark >= 2.0 and assuming your RDD has a tabular format (which it should, given you want to save it as CSV) one way might be to first create a Dataframe from the RDD and then use DataFrameWriter to export to CSV.
from pyspark.sql import SparkSession
spark = SparkSession(sc).getOrCreate()
df = spark.createDataframe(rdd)
df.write.csv("/path/to/file.csv", sep=',', header=True)
Have a look at the pyspark.sql docs for additional options and further information.
In excel are you splitting the file on the ','?
In excel go to the Data tab and select text to columns under data tools then select delimited and hit next. Then select comma as the delimiter and hit finish.
Edit
Generally it would be best practice to create a csv with a different separator character than comma if commas will be in your data. Per your comment if you are creating the csv, just use a different separator (e.g. ';', '|', '^', or tabs). Another option, which I prefer less, is to wrap the field in question in "" like so:
field0,field1,"field,2",field3
Excel should leave what is in quotes alone and only split on commas outside of the quotes. But again this is not my preferred solution.
One option is to convert RDD to dataframe and then save as CSV.
from pyspark import SparkContext
df = sqlContext.createDataFrame(rdd, ['count', 'word'])
# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')
Please see this post I just made:
How to write the resulting RDD to a csv file in Spark python
Related
I am trying to parse this CSV data which has quotes in between in unusual pattern and semicolon in the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link of data (The pastebin was for some reason not recognizing as text / csv so picked up any random formatting please ignore that)
https://paste.gnome.org/pr1pmw4w2
I have tried using the "," as delimiter, and normal call of pandas dataframe object construction by only giving file name as parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
First, you can just read the data normally. Now all data would be in the first column. You can use pyparsing module to split based on ',' and assign it back. I hope this solves your query. You just need to do this for all the rows.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df #(since there are 42 columns, pasting just a snipped)
I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month',f.month(f.col('date_col'))).withColumn('day',f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write year,month,day columns in your parquet files as they are already in partitionBy clause.
Example:
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see there is no id value included in the csv file, in the same way if you write parquet file partition columns are not included in the part-*.parquet file.
To check schema of parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
You can also verify what are all the columns included in your parquet file.
If you use df.write.partitionBy('year','month', 'day').
These columns are not actually physically stored in file data. They simply are rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back it uses the special path year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly.
Ex. year will not be a column in this case.
df = spark.read.csv("data/year=2019/")
df.printSchema()
Also #Shu's answer could be used to investigate.
You can sleep safely that these columns are not taking up storage space.
If you really don't want to simply see the columns, you could put a view on top of this table that excludes these columns.
I have an excel sheet and I am reading the excel sheet using pandas in python.
Now I want to read the excel file based on a column, if the column has some value then do not read that row, if the column is empty than read that and store the values in a list.
Here is a screenshot
Excel Example
Now in the above image when the uniqueidentifier is yes then it should not read that value, but if it is empty then it should start reading from that value.
How to do that using python and how to get index so that after I have performed some function that I am again able to write to that blank unique identifier column saying that row has been read
This is possible for csv files. There you could do
iter_csv = pandas.read_csv('file.csv', iterator=True, chunksize=100000)
df = pd.concat([chunk[chunk['UniqueIdentifier'] == 'True'] for chunk in iter_csv])
But pd.read_excel does not offer to return an iterator object, maybe some other excel-readers can. But I don't no which ones. Nevertheless you could export your excel file as csv and use the solution for csv files.
I have a dataframe x with the size of ( 2000, 3000) . I would like to export it into CSV to use in R. I tried this code:
x.to_csv("ab.csv", sep='\t')
However, when I open in R by the code:
data = read.csv(".data/ab.csv")
The size of data is (2000,1) because the CSV file can not separate into 3000 columns. Is there any solution to keep the same size after exporting ?
By using the parameter sep='\t' you have written a "CSV" which uses tab to separate fields instead of a comma. You could either remove the parameter and write a normal CSV, or use the sep="\t" argument for read.csv in R. If there's no reason to use tab then I would suggest the former option.
try reading the csv file like this
data = read.csv(".data/ab.csv",sep="\t")
Your csv uses \t to separate each value, using the sep pararmeter you have to specify the separator when opening it
i have started the shell with databrick csv package
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then i read a csv file did some groupby op and dump that to a csv.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(path.csv') ####it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
#but in both cases no columns information(How to add column names to that csv file???)
# again i am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
#i am not getting any columns in that..1st row becomes column names
Please don't answer like add a schema to dataframe after read_csv or while reading mention the column names.
Question1- while giving csv dump is there any way i can add column name with that???
Question2-is there a way to create single csv file(not directory again) which can be opened by ms office or notepad++???
note: I am currently not using cluster, As it is too complex for spark beginner like me. If any one can provide a link for how to deal with to_csv into single file in clustered environment , that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
Just in case,
on spark 2.1 you can create a single csv file with the following lines
dataframe.coalesce(1) //So just a single part- file will be created
.write.mode(SaveMode.Overwrite)
.option("mapreduce.fileoutputcommitter.marksuccessfuljobs","false") //Avoid creating of crc files
.option("header","true") //Write the header
.csv("csvFullPath")
with spark >= 2.o, we can do something like
df = spark.read.csv('path+filename.csv', sep = 'ifany',header='true')
df.write.csv('path_filename of csv',header=True) ###yes still in partitions
df.toPandas().to_csv('path_filename of csv',index=False) ###single csv(Pandas Style)
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
got answer for 1st question, it was a matter of passing one extra parameter header = 'true' along with to csv statement
df.write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
#Alternative for 2nd question
Using topandas.to_csv , But again i don't want to use pandas here, so please suggest if any other way around is there.