I am trying to work on a samplecsv.csv file (64 MB) in PySpark.
This code generates an error: AttributeError: 'list' object has no attribute 'saveAsTextFile'
I think I have already converted the list to an RDD using parallelize. If not, how is it done?
file = sc.textFile('/user/project/samplecsv.csv',5)
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
line.split(',')[2], line.split(',')[3],
line.split(',')[4])).collect()
temp = sc.parallelize([rdd], numSlices=50000).collect()
temp.saveAsTextFile("/user/project/newfile.txt")
Your problem is that you called collect() on the parallelized RDD, which turns it back into a normal Python list, and a list has no saveAsTextFile method.
Also, you should not call collect() at every step unless you are testing or debugging; otherwise you are not taking advantage of Spark's lazy, distributed computing model.
# loads the file as an rdd
file = sc.textFile('/user/project/samplecsv.csv',5)
# builds a computation graph
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
                             line.split(',')[2], line.split(',')[3],
                             line.split(',')[4]))
# saves the rdd to the filesystem
rdd.saveAsTextFile("/user/project/newfile.txt")
Also, you can make the code more efficient by splitting each line only once, as in the sketch below.
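For example, a minimal sketch that splits each line a single time and keeps the first five fields, reusing the paths from the question:
# loads the file as an RDD with 5 partitions
file = sc.textFile('/user/project/samplecsv.csv', 5)
# splits each line once and keeps the first five fields as a tuple
rdd = file.map(lambda line: tuple(line.split(',')[:5]))
# saves the RDD to the filesystem
rdd.saveAsTextFile("/user/project/newfile.txt")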
You could also try the code below; it should serve your purpose:
file = sc.textFile("C://Users/Ravi/Desktop/test.csv",5)
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
                             line.split(',')[2], line.split(',')[3]))
rdd.coalesce(1).saveAsTextFile("C://Users/Ravi/Desktop/temp")
If you want partitioned output files, don't use coalesce.
I am new to PySpark and RDDs.
I tried to use RDDs to handle a 12 GB CSV file, but in the end I cannot do anything with it. I tried to convert it to a Spark DataFrame, to run actions like max, min, etc., and to save it as a CSV file, but everything takes a very long time. I think I have not understood well what an RDD does and does not do.
My approach:
First I loaded the CSV into a PySpark DataFrame, because I did not find a good way to load it directly into an RDD with the separator "|". Then I converted it to an RDD:
spark = SparkSession.builder.appName('Practise').getOrCreate()
path1 = 'myFile.csv'
haystack = spark.read.csv(path1, sep="|", header=True, encoding="utf-8")
rdd1 = haystack.rdd
The number of partitions is 75.
Then I tried to apply some functions:
rdd2 = rdd1.map(lambda x:
    (x[0], x[1], myFunction(x[2]), myFunction(x[3]), myFunction(x[4]),
     someFunc(x[2]), someFunc(x[3]), someFunc(x[4])))
And I think it worked; if I try to show it with ...
print(rdd2.take(20))
... I get the values I expected.
It runs in seconds, which seems too fast?
Then I try to get a DataFrame to compute things like avg, min, and max for each column:
df2 = rdd2.toDF(["id", "title", "myFunction_1", "myFunction_2", "myFunction_3", "myFunction_4", ..])
But then I wait far too long. I waited over 2 hours and the terminal still showed this loading "thing":
[Stage 7:> (0 + 24) / 75]
I'm confused why the maps and so on are so fast and now I'm stuck.
I am doing a simulation where I compute some quantities for several time steps. For each time step I want to save a Parquet file where each line corresponds to one simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as NumPy arrays), I transform it into dask.DataFrame objects and then save them to file using the .to_parquet() method as follows:
def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once it works perfectly, but the struggle arises when I try to run it in parallel. Using dask I do:
client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that two processes try to write to the same Parquet file at the same time, and that this is not handled well (as it can be for a text file). I already tried to switch to PySpark / Koalas without much success. Is there a better way to save the results along a simulation (in case of a crash / wall time limit on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) that are not expected to change during processing. Specifically, a file is clearly being written by one task while another one is reading it (not sure which).
What you probably want to do is use concat on the dataframe pieces and then make a single call to to_parquet, as in the sketch after this answer.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when it is needed. You should, instead, load your data inside delayed functions or dask dataframe API calls.
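A minimal sketch of that restructuring, reusing simulation and datafolder from the question (the per-timestep subfolders are dropped for brevity, and from_delayed plays the role of the concat step here):

import pandas as pd
import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

@delayed
def run_one(n):
    # run one simulation and return a plain pandas DataFrame;
    # no dask API calls happen inside this delayed function
    data = simulation()
    return pd.concat(pd.DataFrame(v) for v in data.values())

# build one dask DataFrame from the delayed pieces ...
parts = [run_one(n) for n in range(20)]
ddf = dd.from_delayed(parts)  # in practice you may want to pass meta=

# ... and write everything with a single to_parquet call,
# so no two tasks ever touch the same file
ddf.to_parquet(datafolder, engine="pyarrow")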
I run PySpark 3.1 on a Windows computer in local mode on Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input Pandas DataFrame and trains an SGBT model. Then it serializes the trained model to binary and saves it to an S3 bucket as an object. Finally, it returns the DataFrame. I call this function from a Spark DataFrame grouped by two columns in the last line. I receive no error, and the returned DataFrame has the same length as the input. Data for each group is returned.
The problem is the saved model objects. There are objects saved in S3 for only 2 groups, when there were supposed to be models for each group. There is no missing/wrong data point that would cause model training to fail. (I would receive an error or warning anyway.) What I have tried so far:
Replace S3 and save to local file system: The same result.
Replace "pickle" with "joblib" and "BytesIO": The same result.
Repartition before calling the function: Now I had more objects saved for different groups, but not all. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]
So I suspect this is about parallelism and distribution, but I could not figure it out. Thank you already.
def train_sgbt(pdf):
    ## Some data transformations here ##
    # Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  # Serialize
    # Initiate s3_client
    s3_client = boto3.client(--Params.--)
    # Put file in S3
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_" + str(pdf.latGroup_m[0]) + "_" + str(pdf.lonGroup_m[0]) + ".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
    schema="fcast_error double")
dummy_df.show()
Spark evaluates dummy_df lazily, and therefore train_sgbt will only be called for the groups that are required to complete the Spark action.
The Spark action here is show(). This action prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element in those first 20 rows. Spark may evaluate more groups, but there is no guarantee of it.
One way to solve the problem would be to call another action that consumes every group, for example writing the result to CSV, as sketched below.
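A minimal sketch of that idea, reusing dummy_df from the question (the output path is illustrative):

# writing the full result forces Spark to evaluate every group,
# so train_sgbt runs (and saves a model) for each one
dummy_df.write.mode("overwrite").csv("/tmp/dummy_df_out")

# an aggregate such as count() also has to touch every group
print(dummy_df.count())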
I have a JSON file with over a million rows, so I am trying to minimize the number of times I have to run through it all just to get one field of it into an RDD.
Right now, I load each row into a list:
import json

data = []
with open('in/json-files/sites.json') as f:
    for line in f:
        data.append(json.loads(line))
Then, I make another list and extract that field into it:
data_companies = []
for line in range(1, len(data)):
    data_companies.append(data[line]['company'])
Then, I parallelize this into an RDD so that I can analyze it. I am worried about how much memory this will take up, so is there an easier and faster way to do this? I have tried loading the JSON file like this, but it won't work:
data.append(json.loads(line['company'))
As your data is structured (JSON), you can look into Spark SQL:
https://spark.apache.org/docs/2.4.0/sql-programming-guide.html
https://spark.apache.org/docs/2.4.0/sql-data-sources-json.html
You can load your JSON directly into a DataFrame and select the particular column you need for your analysis, as in the sketch below.
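A minimal sketch, assuming line-delimited JSON and reusing the path and the company field from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sites").getOrCreate()

# each line of the file is parsed as one JSON record
df = spark.read.json("in/json-files/sites.json")

# keep only the column you care about; .rdd gives an RDD if you still need one
companies = df.select("company")
companies_rdd = companies.rdd
companies.show(5)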
In the scikit-learn Python library there are many datasets that can be accessed easily by the following commands.
For example, to load the iris dataset:
from sklearn import datasets
iris = datasets.load_iris()
And we can now assign data and target/label variables as follows:
X=iris.data # assigns feature dataset to X
Y=iris.target # assigns labels to Y
My question is how to turn my own data, in CSV, XML, or any other format, into a data dictionary like the one above, so the data can be loaded easily and the features/labels are easily accessed.
Is this possible? Can someone help me?
By the way, I am using the Spyder (Anaconda) platform by Continuum.
Thanks!
I see at least two (easy) solutions to your problem.
First, you can store your data in whichever structure you like.
# Storing in a list
my_list = []
my_list.append(iris.data)
my_list[0] # your data
# Storing in a dictionary
my_dict = {}
my_dict["data"] = iris.data
my_dict["data"] # your data
Or, you can create your own class:
class MyStructure:
    def __init__(self, data, target):
        self.data = data
        self.target = target
my_class = MyStructure(iris.data, iris.target)
my_class.data # your data
Hope it helps
If ALL you want to do is read data from CSV files and have them organized, I would recommend simply using either pandas or numpy's genfromtxt function.
mydata=numpy.genfromtxt(filepath,*params)
If the CSV is formatted regularly, you can extract for example the names of each column by specifying:
mydata=numpy.genfromtxt(filepath,unpack=True,names=True,delimiter=',')
then you can access any column data you want by simply typing its name/header:
mydata['your header']
(Pandas also has a similarly convenient way of grabbing data in an organized manner from CSV or similar files; see the sketch below.)
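A minimal pandas equivalent (the file name and column header are illustrative):

import pandas as pd

# read_csv infers column names from the header row by default
mydata = pd.read_csv("mydata.csv")

# columns are then accessed by their header, much like the genfromtxt example
print(mydata["your header"].mean())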
However, if you want to do it the long way and learn:
Simply put, you want to write a class for the data that you are using, complete with its own access, modify, read, and #dosomething methods. Instead of full code for this, I think you would benefit more from reading, for example, the iris class, or the introduction to a simple class in any beginner's guide to object-oriented programming.
To do what you want, for an object MyData, you could have, for example:
a read(#file) method that reads from a given file of some expected format and returns some specified structure (for reading from CSV files, you can simply use numpy's loadtxt method),
a modify(#some attribute) method,
etc. A minimal sketch follows below.
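A minimal sketch of such a class, assuming a numeric comma-separated file whose last column holds the labels (the class and method names are illustrative, not from scikit-learn):

import numpy as np

class MyData:
    # small container mimicking the iris-style .data / .target attributes
    def __init__(self, data, target):
        self.data = data      # feature matrix
        self.target = target  # labels

    @classmethod
    def read(cls, filepath):
        # assumes a numeric CSV where the last column holds the labels
        raw = np.loadtxt(filepath, delimiter=",")
        return cls(data=raw[:, :-1], target=raw[:, -1])

# usage (hypothetical file):
# mydata = MyData.read("mydata.csv")
# X, Y = mydata.data, mydata.target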