Convert 100k row pyspark df to pandas df - python

I have a PySpark DataFrame with 100k rows. I am converting it with
pandas_df = df.toPandas()
which takes a long time to execute. Is there any other way to do this operation within seconds?
Also, saving the PySpark DataFrame in .csv format takes a long time. Why is that?

Try to repartition your DataFrame first, before converting it to a pandas DataFrame:
df = df.repartition(1)
df = df.toPandas()
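If your Spark version supports it, enabling the Arrow-based conversion path is another common way to speed up toPandas(); a minimal sketch, assuming Spark 3.x (in Spark 2.x the key is spark.sql.execution.arrow.enabled):
# Enable Arrow-based columnar transfers before converting
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandas_df = df.toPandas()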

Related

I have a pandas dataframe and want to pass each column into the kruskal stats function as separate arrays

Right now I can create the pandas DataFrame. I have the stats.kruskal function set up with the first 6 columns of the DataFrame by using df.iloc. This works, but my DataFrame has 33 columns and I do not want to write df.iloc for each one. I am sure there is a more Pythonic way of doing this.
import pandas as pd
from scipy import stats

# Make a pandas DataFrame called df from an Excel file; you should see data printed in the console
df = pd.read_excel(file_path, engine='openpyxl')
# Print a description of the DataFrame
print(df.describe())
# Kruskal-Wallis test comparing the first six columns of the DataFrame
print(stats.kruskal(df.iloc[:3, :1], df.iloc[:3, 1:2], df.iloc[:3, 2:3],
                    df.iloc[:3, 3:4], df.iloc[:3, 4:5], df.iloc[:3, 5:6]))
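A more Pythonic route (a sketch, not from the original question) is to unpack the DataFrame's columns directly into stats.kruskal, which accepts a variable number of sample arrays; this covers all 33 columns without writing df.iloc for each one:
import pandas as pd
from scipy import stats

df = pd.read_excel(file_path, engine='openpyxl')
# Pass every column of the DataFrame as a separate 1-D sample array
print(stats.kruskal(*[df[col] for col in df.columns]))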

How to divide a Spark dataframe into n chunks, convert each to a pandas dataframe, and append them into one?

I have a Spark DataFrame of 100,000 rows. Is there a way to loop through it 1,000 rows at a time, convert each chunk to a pandas DataFrame using toPandas(), and append them into a new DataFrame?
Converting it directly with toPandas() takes a very long time. There is no column by which we can divide the DataFrame into segments.
You can directly use limit:
pd_df = ....
sparkDF_1k = sparkDF.limit(1000)
pd_df = pd.concat([pd_df, sparkDF_1k.toPandas()])
Another option is to collect the first thousand rows and rebuild a DataFrame from them:
firstThousand = train.limit(1000).collect()[:1000]
df = spark.createDataFrame(firstThousand)
df.toPandas()
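If the aim is to convert all 100,000 rows in pieces, one sketch (not from the answers above) is to split the Spark DataFrame into roughly equal parts with randomSplit, convert each part with toPandas(), and concatenate; sparkDF is assumed to be the existing Spark DataFrame:
import pandas as pd

# Split into ~100 non-overlapping chunks of roughly 1,000 rows each
chunks = sparkDF.randomSplit([1.0] * 100, seed=42)
# Convert each chunk and append into a single pandas DataFrame
pd_df = pd.concat([chunk.toPandas() for chunk in chunks], ignore_index=True)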

PySpark : Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a pyspark dataframe.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
Now the underlying data is partitioned by date as
s3://mybucket/daily_data/
dt=2020-06-12
dt=2020-06-13
...
dt=2020-06-22
Is there a way to optimize the read into a DataFrame, given that:
Only a certain date range is needed
Only a subset of columns is needed
The current way I tried is:
df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is column in dataframe of timestamp dtype.
In the above approach, does Spark need to load the whole dataset, filter it based on the date range, and then filter the columns needed? Is there any optimization that can be done in the PySpark read to load only the relevant data, since it is already partitioned?
Something along the lines of:
df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or
df = spark.read.format("delta").load(path_to_data,partitions=[...])
In your case, there is no extra step needed. The optimization is taken care of by Spark: since you already partitioned the dataset on the column dt, when you query it with the partition column dt in the filter condition, Spark loads only the subset of the data that matches the filter, in your case dt > '2020-06-20'.
Internally, Spark does this through partition pruning.
To do this without SQL:
from pyspark.sql import functions as F
df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))
Though for this example you may have some work to do with comparing dates.
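To also prune columns at read time, combining a select with the partition filter is enough; a minimal sketch, assuming the partition column is dt and only col1 and col2 are needed, as in the question:
from pyspark.sql import functions as F

# Partition pruning on dt plus column pruning via an explicit select
df = (spark.read.format("delta")
      .load(path_to_data)
      .filter(F.col("dt") > F.lit("2020-06-20"))
      .select("col1", "col2"))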

Is there a possible way to merge Excel rows for duplicate cells in a column with Python?

I am still new to Python, could you please help me with this?
I have this Excel sheet
and I want it to be like this
You can convert the CSV data to a pandas DataFrame like this:
import pandas as pd
df = pd.read_csv("Input.csv")
Then do the data manipulation as such:
df = df.groupby(['Name'])['Training'].apply(', '.join).reset_index()
Finally, create an output csv file:
df.to_csv('Output.csv', sep='\t')
You could use pandas to create a DataFrame and manipulate the Excel sheet's information. First, load the file with read_excel (this creates a DataFrame), then use groupby and apply to concatenate the strings.
import pandas as pd
# Read the Excel File
df = pd.read_excel('tmp.xlsx')
# Group by the column(s) that you need.
# Finally, use the apply function to arrange the data
df.groupby(['Name'])['Training'].apply(','.join).reset_index()
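If you want the merged result back in an Excel file rather than printed, a small sketch (the output filename here is made up) is:
# Write the grouped rows to a new workbook (requires openpyxl for .xlsx output)
merged = df.groupby(['Name'])['Training'].apply(', '.join).reset_index()
merged.to_excel('merged_output.xlsx', index=False)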

Python Read from SQL to pandas dataframes [duplicate]

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal Spark DataFrame we can do
df = rdd1.toDF()
But I want to convert the RDD to a pandas DataFrame, not a normal DataFrame. How can I do it?
You can use the function toPandas():
Returns the contents of this DataFrame as Pandas pandas.DataFrame.
This is only available if Pandas is installed and available.
>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
For example, let's say I have a text file, flights.csv, that has been read in to an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just use toPandas() on the RDD, it won't work. Depending on the format of the objects in your RDD, some processing may be necessary to go to a Spark DataFrame first. In the case of this example, this code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
# Spark DataFrame to pandas DataFrame
pdsDF = sparkDF.toPandas()
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
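If you want readable column names on the intermediate Spark DataFrame instead of the autogenerated _1, _2, ..., you can pass a list of names to toDF; the names below are hypothetical placeholders for whatever flights.csv actually contains:
# Same pipeline, with explicit (hypothetical) column names
sparkDF = (flights.map(lambda x: str(x))
                  .map(lambda w: w.split(','))
                  .toDF(['origin', 'destination', 'delay']))
pdsDF = sparkDF.toPandas()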
