I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal (Spark) DataFrame we can do
df = rdd1.toDF()
But I want to convert the RDD to a pandas DataFrame, not a normal (Spark) DataFrame. How can I do it?
You can use the toPandas() function:
Returns the contents of this DataFrame as Pandas pandas.DataFrame.
This is only available if Pandas is installed and available.
>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
For example, let's say I have a text file, flights.csv, that has been read into an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just call toPandas() on the RDD, it won't work, because RDDs don't have that method. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In the case of this example, this code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
# Spark DataFrame to pandas DataFrame
pdsDF = sparkDF.toPandas()
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
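If your CSV has a header row (an assumption here, since the file isn't shown) and you have a SparkSession named spark, you can also skip the RDD step entirely and read straight into a Spark DataFrame:
# A sketch, assuming flights.csv has a header row and a SparkSession called spark.
sparkDF = spark.read.csv('flights.csv', header=True, inferSchema=True)
pdsDF = sparkDF.toPandas()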
Related
import pyspark

dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15]

for x in dfs:
    y = x.toPandas()
    y.to_csv("D:/data")
This is what I wrote, but what I actually want is a function that takes this list, converts every DataFrame to a pandas DataFrame, writes each one to CSV, and saves the files to a particular directory with names following the order of the dfs list. Is there a way to write such a function?
PS: D:/data is just an imaginary path used for explanation.
When you write a DataFrame to a CSV, you still need to give each file its own name in df.to_csv. So, try:
for x in dfs:
    y = x.toPandas()
    y.to_csv(f"D:/data/df{dfs.index(x) + 1}.csv")
I set it as df{dfs.index(x) + 1} so that the file names will be df1.csv, df2.csv, and so on.
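A minimal sketch of a reusable function (the name save_all_to_csv is hypothetical): using enumerate gives each file its position-based name directly, avoids repeated dfs.index() lookups, and still works if the same DataFrame object appears twice in the list.
import os

def save_all_to_csv(dfs, output_dir="D:/data"):
    # Write each Spark DataFrame to df1.csv, df2.csv, ... in list order.
    for i, spark_df in enumerate(dfs, start=1):
        spark_df.toPandas().to_csv(os.path.join(output_dir, f"df{i}.csv"), index=False)

save_all_to_csv(dfs)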
Right now I can create the pandas DataFrame. I have the stats.kruskal function set up with the first 6 columns of the DataFrame using df.iloc. This works, but my DataFrame has 33 columns and I don't want to write out df.iloc for each one. I am sure there is a more Pythonic way of doing this.
import pandas as pd
from scipy import stats

# Make a pandas DataFrame called df from an Excel file; you should see data printed in the console
df = pd.read_excel(file_path, engine='openpyxl')

# Print a description of the DataFrame
print(df.describe())

# Kruskal-Wallis test comparing the first six columns of the DataFrame
print(stats.kruskal(df.iloc[:3, :1], df.iloc[:3, 1:2], df.iloc[:3, 2:3],
                    df.iloc[:3, 3:4], df.iloc[:3, 4:5], df.iloc[:3, 5:6]))
How do I filter a PySpark DataFrame but keep the result in DataFrame format?
I used this
datalabel = datalabel.filter(datalabel.subs_no.isNotNull()).collect()
but the result's format changes to a list.
The .collect() at the end is what turns the result into a list of Row objects; filter() by itself already returns a DataFrame. You can also keep just the required column using select, which returns a DataFrame as well:
from pyspark.sql import functions as F

datalabel_subs_no = datalabel.filter(datalabel.subs_no.isNotNull()).select(F.col('subs_no'))
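As a quick sketch of the same idea, dropping .collect() keeps the result as a DataFrame, which you can verify and keep working with:
filtered = datalabel.filter(datalabel.subs_no.isNotNull())
print(type(filtered))  # <class 'pyspark.sql.dataframe.DataFrame'>
filtered.show(5)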
I have a CSV import of data stored in the following fashion:
username;groups
alice;(admin,user)
bob;(user)
I want to do some data analysis on it and import it into a pandas DataFrame so that the first column is stored as a string and the second as a tuple.
I tried mydataframe = pd.read_csv('file.csv', sep=';'), then converting the groups column with the astype method, mydataframe['groups'].astype('tuple'), but it doesn't work.
How do I store objects other than strings/ints/floats in DataFrames?
Thanks.
Untested, but try
mydataframe['groups'] = mydataframe['groups'].apply(lambda text: tuple(text[1:-1].split(',')))
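Alternatively, a sketch that parses the column at read time, assuming every value is wrapped in parentheses like (admin,user), using the converters argument of pd.read_csv:
import pandas as pd

# Parse the groups column into a tuple while reading the file.
mydataframe = pd.read_csv(
    'file.csv',
    sep=';',
    converters={'groups': lambda s: tuple(s.strip('()').split(','))},
)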
I have a PySpark DataFrame with 100k rows. I am using Spark and
df = pandas_df.toPandas()
which takes a lot of time to execute. Is there another way to do this operation within seconds?
Also, saving the PySpark DataFrame in .csv format takes a lot of time. Why is that?
Try repartitioning your DataFrame before converting it to a pandas DataFrame:
df = df.repartition(1)
df = df.toPandas()
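Another option worth trying (a sketch, assuming Spark 3.x with pyarrow installed) is enabling Arrow-based conversion, which usually speeds up toPandas() considerably:
# Arrow-based columnar transfer typically makes toPandas() much faster.
# This config key is for Spark 3.x; Spark 2.x used spark.sql.execution.arrow.enabled.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = df.toPandas()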