Once I have a Dask DataFrame, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the column subset is arbitrary?
from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd
d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)
What I would like to do:
in_memory = ddf.iloc[:,2:4].compute()
What I have been able to do:
ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()
map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.
Although iloc is not implemented for dask-dataframes, you can achieve the indexing easily enough as follows:
cols = list(ddf.columns[2:4])
ddf[cols].compute()
This has the additional benefit that dask immediately knows the types of the selected columns and needs to do no additional work. For the map_partitions variant, dask at least needs to check the data types produced, since the function you call is completely arbitrary.
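If you do want to keep the map_partitions route, here is a hedged sketch (reusing df and ddf from above, and assuming you can build a meta template from the original pandas frame): passing meta tells dask the output schema up front, so it can skip the inference step.

meta = df.iloc[:0, 2:4]  # empty template describing the output columns and dtypes
in_memory = ddf.map_partitions(lambda part: part.iloc[:, 2:4], meta=meta).compute()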
Introduction
I have two dataframes. I would like to apply a function to each row of the first one. This function depends on the row and the entire second dataframe. I would like to do this efficiently.
Reproducible Example
Setting up the dataframes
import pandas as pd
import numpy as np
Let the two dataframes be:
df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
(In the real application, they are much bigger.)
I would like to find which row from df1 is closest to each row in df0, where closest is defined as having the least squared_dist between them:
def squared_dist(x, y):
    return np.sum(np.square(x - y))
What I have tried
What I do is create two numpy arrays from the dataframes:
df0np=df0.to_numpy()
df1np=df1.to_numpy()
Iterate through these arrays:
res = []
for row in df0np:
    distances = [squared_dist(row, df1np[i, ]) for i in range(len(df1np))]
    index = np.argmin(distances)
    res.append(index)
Add the result to df0 as a new column:
df0['res']=res
How fast is it?
The whole code in one piece, including timings for the method described above:
import time
import pandas as pd
import numpy as np

def squared_dist(x, y):
    return np.sum(np.square(x - y))

df0 = pd.DataFrame.from_dict({'a': np.random.normal(0, 1, 5), 'b': np.random.normal(0, 1, 5)})
df1 = pd.DataFrame.from_dict({'c': np.random.normal(0, 1, 10), 'd': np.random.normal(0, 1, 10)})

start = time.time()
df0np = df0.to_numpy()
df1np = df1.to_numpy()
res = []
for row in df0np:
    distances = [squared_dist(row, df1np[i, ]) for i in range(len(df1np))]
    index = np.argmin(distances)
    res.append(index)
df0['res'] = res
end = time.time()
print(end - start)  # prints 0.0014030933380126953
Question
How could I make this more efficient, i.e. how could I achieve lower execution times? This method works fine for the small example above, but in my real-world application, where the dataframes are much bigger, it is unusably slow.
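One possible direction, as a hedged sketch rather than a drop-in answer: scipy.spatial.distance.cdist computes all pairwise distances in a single vectorised call, which removes the Python-level double loop entirely (the column names 'a', 'b', 'c', 'd' are taken from the example above).

from scipy.spatial.distance import cdist

# All pairwise squared Euclidean distances between rows of df0 and rows of df1.
dists = cdist(df0[['a', 'b']].to_numpy(), df1[['c', 'd']].to_numpy(), metric='sqeuclidean')
# For each row of df0, the index of the closest row of df1.
df0['res'] = dists.argmin(axis=1)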
I am loading a big dataframe in Python, with several columns and millions of rows, so it is quite memory-consuming. To exclude some types in a specific column I use:
import glob
import pandas as pd

files = glob.glob("Path/*.csv")
dfs = [pd.read_csv(f, sep='\t', encoding='unicode_escape') for f in files]
df = pd.concat(dfs, ignore_index=True)
df = df.loc[~df['Type'].isin(['A', 'B', ..., 'F'])]
What is a better way to exclude the specific types/characters and drop the rows that contain them? As it is, this keeps crashing.
You can deal with the memory issues using dask:
import dask.dataframe as dd

df = dd.read_csv('Path/*.csv', sep='\t', encoding='unicode_escape')
df = df.loc[~df.Type.isin(['A', 'B', ..., 'F'])]
df = df.compute()  # this will give back the pandas dataframe
This will carry out the operations chunk-wise in the background.
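A small hedged addition, assuming the end goal is a summary rather than the full filtered table: you can aggregate while the frame is still lazy and only compute the small result, so the full data never has to fit in memory as a pandas DataFrame.

ddf = dd.read_csv('Path/*.csv', sep='\t', encoding='unicode_escape')
ddf = ddf.loc[~ddf.Type.isin(['A', 'B', ..., 'F'])]
type_counts = ddf['Type'].value_counts().compute()  # only the small aggregate is materialised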
I am pretty new to pandas and numpy, and I'm trying to figure out the best way to do some things.
Right now I am trying to call a function on every row of a dataframe. If I pass in three numpy arrays to this function, it's very fast, but using apply on the dataframe is very slow.
My guess is that numpy is using vectorized functions in the first case, and not in the second. Is there a way to get pandas to use that optimization? Basically, in pseudocode I think apply is doing something like for row in frame: func(row['a'], row['b'], row['c']) but I want it to do func(col['a'], col['b'], col['c']).
Here is an example of what I am trying to do.
import numpy as np
import pandas as pd
from scipy.stats import beta
count = 100000
# If I start with a given dataframe and use apply, it's very slow
df = pd.DataFrame(np.random.uniform(0, 1, size=(count, 3)), columns=['a', 'b', 'c'])
df.apply(lambda frame: beta.cdf(frame['a'], frame['b'], frame['c']), axis=1)
# However, if I split out each column into a numpy array, this is very fast.
a = df['a'].to_numpy()
b = df['b'].to_numpy()
c = df['c'].to_numpy()
beta.cdf(a, b, c)
# But at this point I've lost the context of the dataframe.
# I would like to keep the results in a new column for further processing
It's not clear why you're trying to use apply. You can just do beta.cdf(df.a, df.b, df.c).
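To keep the result in the original frame, as the last comment in the question asks for, a minimal sketch:

# Vectorised over whole columns; the result lands in a new column for further processing.
df['cdf'] = beta.cdf(df['a'], df['b'], df['c'])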
I have a DataFrame with a single column 'value'. I want to split it by space, remove the first item from the split, and recombine the remaining items into a vector column.
It's very easy to do with a UDF or by converting to and from RDD, but I want to use only DataFrame API for performance and code simplicity reasons.
The best I could do was this:
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
df = sqlContext.createDataFrame([['10 11 12']], ['value'])
df_split = df.select(F.split('value', ' ').alias('split'))
n = df_split.select(F.size(df_split['split'])).collect()[0][0]
df_columns = df_split.select([F.col('split')[i].astype('int').alias(str(i)) for i in range(1, n)])
v = VectorAssembler(inputCols=[str(i) for i in range(1, n)], outputCol='result')
df_result = v.transform(df_columns).select('result')
It works, but requires an extra action (to get the size of the column after split), and a lot of code for such a simple task. Is there a simpler way of doing this?
In addition, VectorAssembler won't work for non-numeric types.
Spark 2.0.0, python 3.5.
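For reference, a hedged sketch that does not apply to the Spark 2.0.0 setup above: on Spark 3.1+, pyspark.ml.functions.array_to_vector builds the vector column directly from a numeric array, so no UDF or VectorAssembler is needed (this assumes the usual spark session variable).

import pyspark.sql.functions as F
from pyspark.ml.functions import array_to_vector  # Spark 3.1+

df = spark.createDataFrame([['10 11 12']], ['value'])
df_result = (
    df.select(F.split('value', ' ').alias('split'))
      # drop the first element and cast the rest to double, since array_to_vector needs numbers
      .select(F.expr("transform(slice(split, 2, size(split) - 1), x -> cast(x as double))").alias('arr'))
      .select(array_to_vector('arr').alias('result'))
)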
I have a DataFrame in pandas that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code in order to make it work with pandas to yield performance increase and/or decrease code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key, tup in df.groupby('attr1'):
    print('Group', key, ' length ', len(tup))
    #generate pairs
    agg = []
    for v1, v2 in itertools.combinations(list(tup['attr2']), 2):
        p1_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v1)]['value'])
        p2_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v2)]['value'])
        agg.append([key, (v1, v2), (p1_val - p2_val) ** 2])
    #insert pairs to dataframe
    p = pd.DataFrame(agg, columns=['group', 'pair', 'value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    #Perform some operation in df based on pair values
    #....
I am afraid that pandas DataFrames cannot provide such sophisticated analysis functionality out of the box.
Do I have to stick to plain Python like in the example?
I'm new to Pandas so any comments/suggestions are welcome.
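One possible pandas-native refactoring, as a sketch under the assumption that the within-group pairwise squared difference is the part worth vectorising: a self-merge on attr1 builds all pairs of each group at once, and the rest becomes plain column arithmetic.

# All ordered pairs within each attr1 group, then keep each unordered pair once.
pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]
pairs['pair_value'] = (pairs['value_1'] - pairs['value_2']) ** 2
# The 4 smallest pairs per group, analogous to the head(4) inside the loop above.
top = pairs.sort_values('pair_value').groupby('attr1').head(4)
print(top[['attr1', 'attr2_1', 'attr2_2', 'pair_value']])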