Outputting pyspark dataframe in capnp (cap'n proto) format - python

I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?
I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.

The easiest way is to convert to an RDD, then use mapPartitions to serialize each partition into a byte array, and finally join the pieces with collect() (or, if the dataframe is big, use toLocalIterator and write to disk incrementally). See this example pseudocode:
create = your_serialization_method  # builds one cap'n proto message from a row
serialize_partition = lambda partition: [b''.join(create(row).to_bytes() for row in partition)]  # one bytes blob per partition
output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
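For a slightly more concrete sketch with pycapnp (hedged: person.capnp, its Person struct and the name/age columns are placeholders for your own schema and dataframe):
import capnp  # pycapnp

def serialize_partition(rows):
    # load the schema on each worker; person.capnp / Person are placeholders
    schema = capnp.load('person.capnp')
    chunks = []
    for row in rows:
        msg = schema.Person.new_message()
        msg.name = row['name']        # map dataframe columns to schema fields
        msg.age = int(row['age'])
        chunks.append(msg.to_bytes())
    yield b''.join(chunks)            # one serialized blob per partition

output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
with open('people.capnp.bin', 'wb') as f:
    f.write(output)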

Related

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing using pandas and now want to convert my preprocessing function from pandas to pyspark. But when I read the CSV file with pyspark, many values of a column that actually has data become null. If I then perform any operation on this dataframe, it swaps the values of some columns with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark:
The values of the column "property_type" come out as null, but the actual data has values there.
CSV file: (screenshot in the original post)
Pyspark does work fine with small datasets, though.
In our case we faced a similar issue. Things you need to check:
Check whether your data has " (double quotes) in it; pyspark can misinterpret these and the data then looks malformed.
Check whether your CSV data has multiline values.
We handled this situation with the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline", 'true').csv(schema_file_location)
Are you limited to using the CSV file format?
Try Parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works really well with this format.
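A minimal sketch of that hand-off (file names are placeholders; writing Parquet from pandas needs pyarrow or fastparquet installed):
import pandas as pd

# pandas side, after your preprocessing
pdf = pd.read_csv('input.csv')
pdf.to_parquet('preprocessed.parquet')

# Spark side: Parquet carries its own schema, so the quoting/multiline pitfalls disappear
sdf = spark.read.parquet('preprocessed.parquet')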

How to write/read a (pickled) array of strings in sql using df.to_sql?

I'm trying to append scrape data while the scrape is running.
Some of the columns contain an array of multiple strings which are to be saved and postprocessed into dummies after the scrape.
e.g. tags = array(['tag1','tag2'])
However, writing the arrays to and reading them from the database doesn't work.
I've tried different storage methods (CSV, pickling, HDF); all of them fail for different reasons
(mainly problems appending to a central database and storing lists as strings).
I also tried different databases (MySQL and Postgres), and I tried using dtype ARRAY, however that requires a fixed-length (known beforehand) array.
From what I gather, I can go the JSON route or the pickle route.
I chose the pickle route since I don't need the db to do anything with the contents of the array.
import pandas as pd
from numpy import array
from sqlalchemy import create_engine
from sqlalchemy.types import PickleType, String

# one example row; the 'Name' value is just a placeholder
df = pd.DataFrame({'Name': ['foo'],
                   'Tags': [array(['tag1', 'tag2'], dtype='<U8')]})
type_dict = {'Name': String, 'Tags': PickleType}
engine = create_engine('sqlite://', echo=False)
df.to_sql('test', con=engine, if_exists='append', index=False, dtype=type_dict)
df2 = pd.read_sql_table('test', con=engine)
expected output:
df2['Tags'].values
array(['tag1','tag2'], dtype='<U8')
actual output:
df2['Tags'].iloc[0]
b'\x80\x04\x95\xa4\x00\x00\x00\x00\x00\x00\x00\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02U8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNK K\x04K\x08t\x94b\x89C \xac \x00\x00\xac \x00\x00 \x00\x00\x00-\x00\x00\x00 \x00\x00\x00\xac \x00\x00\xac \x00\x00\xac \x00\x00\x94t\x94b.'
So something has gone wrong during pickling, and I can't figure out what.
edit:
Okay, so np.loads(df2['Tags'].iloc[0]) gives the original array back. Is there a way to pass this to read_sql_table so that I immediately get the "original" dataframe back?
So the problem occurs during reading: the arrays are pickled, but they are not automatically unpickled when read back. There is no way to pass dtype to read_sql_table, right?
import numpy as np  # np.loads is just pickle.loads; use pickle.loads on NumPy versions that removed it

def unpickle_the_pickled(df):
    df2 = df.copy()
    for col in df.columns:
        if isinstance(df[col].iloc[0], bytes):
            df2[col] = df[col].apply(np.loads)
    return df2
finally solved it, so happy!
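For completeness, a usage sketch with the engine and table from above:
df2 = unpickle_the_pickled(pd.read_sql_table('test', con=engine))
df2['Tags'].iloc[0]   # -> array(['tag1', 'tag2'], dtype='<U8')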

Transforming Python Lambda function without return value to Pyspark

I have a working lambda function in Python that computes the highest similarity between each string in dataset1 and the strings in dataset2. During an iteration, it writes the string, the best match and the similarity, together with some other information, to bigquery. There is no return value, as the purpose of the function is to insert a row into a bigquery dataset. This process takes rather long, which is why I want to use Pyspark and Dataproc to speed it up.
Converting the pandas dataframes to Spark was easy. I am having trouble registering my udf, because it has no return value and pyspark requires one. In addition, I don't understand how to map the pandas 'apply' function to the pyspark variant. So basically my question is how to transform the Python code below to work on a Spark dataframe.
The following code works in a regular Python environment:
def embargomatch(name, code, embargo_names):
    # find best match
    # insert best match and additional information to bigquery
    ...

customer_names.apply(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), axis=1)
Because pyspark requires a return type, I added 'return 1' to the udf and tried the following:
customer_names = spark.createDataFrame(customer_names)
from pyspark.sql.types import IntegerType
embargo_match_udf = udf(lambda x: embargoMatch(x['name'], x['customer_code'],embargo_names), IntegerType())
Now I'm stuck trying to apply the select function, as I don't know what parameters to give it.
I suspect you're stuck on how to pass multiple columns to the udf -- here's a good answer to that question: Pyspark: Pass multiple columns in UDF.
Rather than creating a udf based on a lambda that wraps your function, consider simplifying by creating a udf based on embargomatch directly.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

embargo_names = ...

# The parameters here are the columns passed into the udf
def embargomatch(name, customer_code):
    pass

embargo_match_udf = udf(embargomatch, IntegerType())
customer_names.select(embargo_match_udf('name', 'customer_code').alias('column_name'))
That being said, it's a bit suspect that your udf doesn't return anything -- I generally see udfs as a way to add columns to the dataframe, not as a way to produce side effects. If you want to insert records into bigquery, consider doing something like this:
import os

customer_names.select('column_name').write.parquet('gs://some/path')
os.system("bq load --source_format=PARQUET [DATASET].[TABLE] gs://some/path")

How to clone RDD object [Pyspark]

1) How can I clone an RDD object into another one?
2) For reading a CSV file, I'm using pandas to read it and then sc.parallelize to convert the list to an RDD object. Is that okay, or should I use some RDD method to read directly from the CSV?
3) I understand that I need to convert huge data to RDDs but do I also need to convert single int values into RDDs? If I just declare an int variable, will it be distributed across nodes?
You can use spark-csv to read a csv file in spark.
Here is how you read it in Spark 2.x:
df = (spark.read
      .schema(my_schema)
      .option("header", "true")
      .csv("data.csv"))
In Spark < 2.0
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(my_schema)
      .load("data.csv"))
For more options to suit your requirements, refer to the spark-csv documentation.
You can just assign an RDD or dataframe to another variable to get a clone.
Hope this helps.
I am a bit confused by your question 1). An RDD is an immutable object, so you can load your data into one RDD and then define two different transformations based on your initial RDD. Each transformation will use the original RDD and generate new RDDs.
Something like this:
# load your CSV
loaded_csv_into_rdd = sc.textFile('data.csv').map(lambda x: x.split(','))
# You could even .cache or .persist the data
# Here two new RDDs will be created based on the data that you loaded
one_rdd = loaded_csv_into_rdd.<apply one transformation>
two_rdd = loaded_csv_into_rdd.<apply another transformation>
This is using the low-level API, RDDs. It is probably better to use the DataFrame API (Dataset[Row]), as it can infer the schema and in general is easier to use.
On 2), if you want to use RDDs, what you are looking for is sc.textFile; you then apply a split by commas to generate lists that you can manage.
On 3), in Spark you do not declare variables. You are working with functional programming, so state is not something you need to keep. Accumulators are a special case, but in general you define functions that are applied to the entire dataset, which is called coarse-grained transformations.
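As a small illustrative sketch of that model (the file name and parsing logic are placeholders): a plain Python value is simply captured read-only by the closures shipped to the workers, while an accumulator is the special case for values the workers update.
threshold = 10                  # ordinary variable, just copied into each task
bad_rows = sc.accumulator(0)    # write-only on workers, readable on the driver

def parse_line(line):
    fields = line.split(',')
    if len(fields) < threshold:
        bad_rows.add(1)
    return fields

rows = sc.textFile('data.csv').map(parse_line)
rows.count()                    # forces evaluation
print(bad_rows.value)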

Lazily create dask dataframe from generator

I want to lazily create a Dask dataframe from a generator, which looks something like:
[parser.read(local_file_name) for local_file_name in repo.download_files()]
Where both parser.read and repo.download_files return generators (using yield). parser.read yields a dictionary of key-value pairs, which (if I were just using plain pandas) I would collect into a list, and then use:
df = pd.DataFrame(parsed_rows)
What's the best way to create a dask dataframe from this? The reason I ask is that a) I don't necessarily know the number of results returned, and b) I don't know the memory capacity of the machine it will be deployed on.
Alternatively, what should I be doing differently (e.g. maybe create a bunch of dataframes and then put those into dask instead)?
Thanks.
If you want to use the single-machine Dask scheduler then you'll need to know how many files you have to begin with. This might be something like the following:
from dask import delayed
import dask.dataframe as dd

filenames = repo.download_files()
dataframes = [delayed(load)(filename) for filename in filenames]  # load() reads one file into a pandas DataFrame
df = dd.from_delayed(dataframes)
If you use the distributed scheduler you can add new computations on the fly, but this is a bit more advanced.
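For reference, a hedged sketch of that distributed variant (it assumes a load() helper like the one above and a reachable dask.distributed scheduler):
from dask.distributed import Client
import dask.dataframe as dd

client = Client()                                   # starts or connects to a distributed scheduler
futures = client.map(load, repo.download_files())   # one future per downloaded file
df = dd.from_delayed(futures)                       # futures can stand in for delayed objects here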
