I have a DataFrame with a single column 'value'. I want to split it by space, remove the first item from the split, and recombine the remaining items into a vector column.
It's very easy to do with a UDF or by converting to and from an RDD, but I want to use only the DataFrame API, for performance and code-simplicity reasons.
The best I could do was this:
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
df = sqlContext.createDataFrame([['10 11 12']], ['value'])
df_split = df.select(F.split('value', ' ').alias('split'))
n = df_split.select(F.size(df_split['split'])).collect()[0][0]
df_columns = df_split.select([F.col('split')[i].astype('int').alias(str(i)) for i in range(1, n)])
v = VectorAssembler(inputCols=[str(i) for i in range(1, n)], outputCol='result')
df_result = v.transform(df_columns).select('result')
It works, but requires an extra action (to get the size of the column after split), and a lot of code for such a simple task. Is there a simpler way of doing this?
In addition, VectorAssembler won't work for non-numeric types.
Spark 2.0.0, Python 3.5.
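For reference only, a hedged sketch of how this could look on newer Spark releases (slice/transform via F.expr need 2.4+, pyspark.ml.functions.array_to_vector needs 3.1+, so this does not apply to the Spark 2.0.0 used here); it avoids the extra action because the array size is never collected:
import pyspark.sql.functions as F
from pyspark.ml.functions import array_to_vector  # Spark 3.1+ only

df_result = (
    df.select(F.split('value', ' ').alias('parts'))
      # drop the first element and cast the rest to double in a single expression
      .select(F.expr("transform(slice(parts, 2, size(parts) - 1), x -> cast(x as double))").alias('arr'))
      # turn the numeric array into an ML vector column
      .select(array_to_vector('arr').alias('result'))
)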
I have an Excel file that I import into a dataframe. I want to extract the contents of one column into several columns.
Here is the original:
After importing it into pandas in Python, I get this data with '\n':
So, I want to extract the contents of that column. Could you share an idea or some code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")

ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].replace('Vector:', '', 1)  # remove the 'Vector:' prefix
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x

values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First, you split each element of the Details column into a list of strings. Second, you deal with the 'Vector:....' special case and filter out the label tokens. Third, you store all the values in a list, which is in turn converted to a NumPy array with shape (length, 3). Finally, you drop the old 'Details' column and concatenate the original data with the df created from the split strings.
You may want to try a more efficient way to transform your data while reading, by applying these ideas inside the pd.read_excel call using converters.
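As a hedged sketch of that converters idea (the 'Details' column name and the 'Type:'/'Vector:'/'Mission:' layout are assumed from the data above), each cell can be parsed while the file is read and then expanded into columns:
import pandas as pd

def parse_details(cell):
    # split the cell, strip the 'Vector:' prefix and drop the label tokens
    x = cell.split()
    x[2] = x[2].replace('Vector:', '', 1)
    return tuple(v for v in x if v not in ['Type:', 'Mission:'])

data = pd.read_excel("the_data.xlsx", converters={'Details': parse_details})
data[['Type', 'Vector', 'Mission']] = pd.DataFrame(data['Details'].tolist(), index=data.index)
data = data.drop('Details', axis=1)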
Usually pyspark.ml.feature.IDF returns a single outputCol containing a SparseVector. All I need is N columns with real-number values, where N is the number of features defined in IDF (to use that dataframe in CatBoost later).
I have tried to convert the column to an array:
import pyspark.sql.functions as F
import pyspark.sql.types as T

def dense_to_array(v):
    # pull each element of the Spark vector out as a Python float
    return [float(x) for x in v]

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))
data = data.withColumn('tf_idf_features_array', dense_to_array_udf('tf_idf_features'))
and after that used pandas to convert it to columns:
import pandas as pd

data = data.toPandas()
cols = [f'tf_idf_{i}' for i in range(32)]
data = pd.DataFrame(data['tf_idf_features_array'].values.tolist(), columns=cols)
I don't like this approach because I find it really slow. Is there a way to solve my problem in pyspark without pandas?
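One hedged idea (a sketch, assuming the tf_idf_features_array column created by the UDF above and the same 32 features): once the vector is an array column, its elements can be selected directly as columns in Spark, so the toPandas step is not needed:
n_features = 32  # number of IDF features, assumed as above
data = data.select(
    '*',
    *[F.col('tf_idf_features_array')[i].alias(f'tf_idf_{i}') for i in range(n_features)]
)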
I have a dataframe like this:
Now I want to normalize the strings in the 'comments' column to the word 'election'. I tried using fuzzywuzzy but wasn't able to apply it to the pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column, like this:
Assume that I have around 100k rows, and there can be many possible variations of the word 'election'.
Kindly guide me on this part.
Building on the answer you gave, you can use the pandas apply, stack and groupby functions to accelerate your code. You have input such as:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'Merchant details': ['Alpha co','Bravo co'],
'Comments':['electionsss are around',
'vote in eelecttions']})
For the column 'Comments', you can create a temporary multi-index DF containing one word per row by splitting and using the stack function:
df_temp = pd.DataFrame(
{'split_comments':df['Comments'].str.split(' ',expand=True).stack()})
Then you create the column with the corrected word (following your idea), using apply and a comparison with fuzz.ratio:
df_temp['corrected_comments'] = df_temp['split_comments'].apply(
lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)
Finally, you write the corrected data back into the Comments column of df using the groupby and join functions:
df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
lambda wd: ' '.join(wd['corrected_comments']))
Don't operate on the dataframe. The overhead will kill you. Turn the column into a list, then iterate over that. Finally, assign that list back to the column.
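A minimal sketch of that list-based approach, assuming the same df and the 75 ratio threshold used above:
from fuzzywuzzy import fuzz

comments = df['Comments'].tolist()        # pull the column out of the dataframe
for i, text in enumerate(comments):
    words = text.split()
    # replace anything close enough to 'election' with the normalized word
    comments[i] = ' '.join('election' if fuzz.ratio(w, 'election') > 75 else w
                           for w in words)
df['Comments'] = comments                 # assign the list back to the column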
OK, I tried this myself and came up with this code:
for i in range(len(df)):
    a = df.comments[i].split()
    for j in word:  # word is the list of target words, e.g. ['election']
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.comments[i] = ' '.join(a)
But this approach seems slow for a large dataframe.
Can someone suggest a more Pythonic way of implementing this?
Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the selection of columns is arbitrary?
from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)
What I would like to do:
in_memory = ddf.iloc[:,2:4].compute()
What I have been able to do:
ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()
map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.
Although iloc is not implemented for dask-dataframes, you can achieve the indexing easily enough as follows:
cols = list(ddf.columns[2:4])
ddf[cols].compute()
This has the additional benefit that dask immediately knows the types of the selected columns and needs to do no additional work. For the map_partitions variant, dask at the very least needs to check the data types produced, since the function you call is completely arbitrary.
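As a hedged illustration (reusing ddf, cols and df from above): the column selection already knows its dtypes, while map_partitions has to infer them unless you pass a meta argument:
print(ddf[cols].dtypes)                      # dtypes are known without reading any data
in_memory = ddf.map_partitions(lambda x: x.iloc[:, 2:4],
                               meta=df.iloc[:, 2:4]).compute()  # meta skips dtype inference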
I have this DataFrame in pandas that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code to make it work with pandas, in order to increase performance and/or decrease code complexity.
Code:
import numpy as np
import pandas as pd
import itertools

# Construct DataFrame
samples = 40
a = np.random.randint(3, size=(1, samples))
b = np.random.randint(9, size=(1, samples))
c = np.random.randn(1, samples)
d = np.append(a, b, axis=0)
e = np.append(d, c, axis=0)
e = e.transpose()
df = pd.DataFrame(e, columns=['attr1', 'attr2', 'value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')

# drop duplicate rows so (attr1, attr2) will be a key
df = df.drop_duplicates(['attr1', 'attr2'])
#df = df.reset_index()
print(df)

for key, tup in df.groupby('attr1'):
    print('Group', key, ' length ', len(tup))
    # generate pairs
    agg = []
    for v1, v2 in itertools.combinations(list(tup['attr2']), 2):
        p1_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v1)]['value'])
        p2_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v2)]['value'])
        agg.append([key, (v1, v2), (p1_val - p2_val) ** 2])
    # insert pairs into a dataframe
    p = pd.DataFrame(agg, columns=['group', 'pair', 'value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    # Perform some operation in df based on pair values
    #....
I am afraid that pandas DataFrames cannot provide such sophisticated analysis functionality.
Do I have to stick to traditional python like in the example?
I'm new to Pandas so any comments/suggestions are welcome.
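For what it's worth, here is a hedged sketch (not from the original post) of one pandas-centric alternative, reusing the df built above: the per-group pair generation can be expressed as a self-merge on attr1, which avoids the explicit Python loops:
pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))      # all row pairs within each attr1 group
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]           # keep each unordered pair once
pairs['value'] = (pairs['value_1'] - pairs['value_2']) ** 2  # squared difference per pair
top = pairs.sort_values('value').groupby('attr1').head(4)    # four smallest pairs per group, as in the loop
print(top[['attr1', 'attr2_1', 'attr2_2', 'value']])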