I have a UDF that returns a tuple. I want to apply it to an input column and, depending on what I need, use either out1 or out2 as the value for my column.
Something like this:
def my_f(inp):
    return out1, out2

df = df.withColumn('first_val', F.udf(my_f, StringType())(F.col('inp_col'))[0])
df = df.withColumn('second_val', F.udf(my_f, StringType())(F.col('inp_col'))[1])
I want the first_val column to hold the first element of the tuple, and second_val the second. This code, of course, doesn't work.
I worked around it by passing the required part of the tuple as a function argument, and that worked. Like this:
def my_f(inp, out='full'):
    if out == 'first':
        return out1
    elif out == 'second':
        return out2
    else:  # case 'full'
        return out1, out2

df = df.withColumn('first_val', F.udf(my_f, StringType())(F.col('inp_col'), F.lit('first')))
df = df.withColumn('second_val', F.udf(my_f, StringType())(F.col('inp_col'), F.lit('second')))
But is there a simpler way to get the nth element of the tuple inline, without passing this parameter?
If your UDF returns a tuple, you should change your return type to ArrayType(StringType()), assuming you are returning a tuple of strings. Then you will be able to access the first and second elements of your tuple using the [n] notation. Here is an example:
import pyspark.sql.functions as F
import pyspark.sql.types as T
...

@F.udf(T.ArrayType(T.StringType()))
def my_f(inp):
    ...
    return (out1, out2)

df = df.withColumn('first_val', my_f('inp_col')[0])
df = df.withColumn('second_val', my_f('inp_col')[1])
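For reference, here is a minimal end-to-end sketch of this approach. The SparkSession setup, the sample data, and the split-on-underscore logic inside the UDF are made-up stand-ins for your real function:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a_1',), ('b_2',)], ['inp_col'])

@F.udf(T.ArrayType(T.StringType()))
def my_f(inp):
    # hypothetical body: split the input into two parts
    out1, out2 = inp.split('_')
    return (out1, out2)

df = df.withColumn('first_val', my_f('inp_col')[0])
df = df.withColumn('second_val', my_f('inp_col')[1])
df.show()  # first_val holds 'a'/'b', second_val holds '1'/'2'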
In case you need different types in your tuple, you might want to consider returning a StructType instead. Here is an example where the first element of the tuple is a string and the second is an integer:
import pyspark.sql.functions as F
import pyspark.sql.types as T
...

@F.udf(T.StructType([
    T.StructField("first", T.StringType()),
    T.StructField("second", T.IntegerType())
]))
def my_f(inp):
    ...
    return {"first": out1, "second": out2}

df = df.withColumn('first_val', my_f('inp_col')["first"])
df = df.withColumn('second_val', my_f('inp_col')["second"])
I'm returning a dataframe from a mapInPandas function in PySpark, so I need all values of the ID column separated by commas, like this: 'H57R6HU87','A1924334','496A4806'
x1['ID'] looks like this:
H57R6HU87
A1924334
496A4806
Here is my code to get the unique IDs; I am getting TypeError: string indices must be integers:
# batch_iter = cust.toPandas()
for x1 in batch_iter:
    IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
You probably don't need a loop; try:
batch_iter = cust.toPandas()
IDs = ','.join(f"'{i}'" for i in batch_iter['ID'].unique())
Or you can try using Spark functions only:
df2 = df.select(F.concat_ws(',', F.collect_set('ID')).alias('ID'))
If you want to use mapInPandas:
import pandas as pd

def pandas_func(iterator):
    for x1 in iterator:
        IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
        yield pd.DataFrame({'ID': IDs}, index=[0])

# mapInPandas requires an output schema
df.mapInPandas(pandas_func, schema='ID string')
# But I suspect you want to do this instead:
# df.repartition(1).mapInPandas(pandas_func, schema='ID string')
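To sanity-check the mapInPandas variant end to end, here is a self-contained sketch with made-up sample data (the duplicate ID shows that unique() deduplicates within the batch):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('H57R6HU87',), ('A1924334',), ('496A4806',), ('A1924334',)], ['ID'])

def pandas_func(iterator):
    for x1 in iterator:
        IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
        yield pd.DataFrame({'ID': IDs}, index=[0])

# repartition(1) so every ID lands in a single pandas batch
df.repartition(1).mapInPandas(pandas_func, schema='ID string').show(truncate=False)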
I have a JSON structure which I need to convert into a dataframe. I have converted it with the pandas library, but I am having issues with two columns: one is an array and the other is a key-value pair.
Pito                     Value
{"pito-key": "Number"}   [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How can I break these columns out into the dataframe?
As far as I understand your question, you can apply regular expressions to do that.
import pandas as pd
import re

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(row):
    # pull the VALUE field out of the 'value' column
    s = row['value']
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(row):
    # pull the key field out of the 'pito' column
    s = row['pito']
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your scary strings into the values you want them to have.
Let me know if that's not what you meant.
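As a side note: since both columns already contain JSON, a sketch based on the standard json module (assuming the strings are valid JSON) avoids the brittleness of regular expressions:

import json
import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# parse the JSON once, then pull out the fields of interest
df['pito'] = df['pito'].apply(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].apply(lambda s: int(json.loads(s)[0]['VALUE']))
print(df.head())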
I am trying to merge two CSVs, transforming the values in one CSV by looking up constant values in the other. I am able to get a Series, but not able to get the correct cell value. Can you please suggest?
I am calling the function below while reading the main CSV, to transform the language column:
dataDF['language'] = dataDF['language'].apply(translateLanguagetest)

def translateLanguagetest(keystring):
    print("keystring" + keystring)
    ref_Data_File = Path(r'C:\sampletest') / "constant.csv"
    refDataDF = pd.read_csv(ref_Data_File)
    refDataDF['refKey'] = refDataDF['sourcedomain'] + "#" + refDataDF['value'] + "#" + refDataDF['targetdomain']
    refDataDF['refValue'] = refDataDF['target']
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] == 'languageSRC#' + keystring + '#languagetarget')
    print("modRef:", modRef)
    cleanedRef = modRef.dropna()
    print(cleanedRef)
    # this is the lookup that fails
    value = cleanedRef.loc[('refValue')]
    return value
The contents of constant.csv are:
value,sourcedomain,targetdomain,target
ita,languageSRC,languagetarget,it
eng,languageSRC,languagetarget,en
Got the solution, and it was a simple one. Being new to Python, it took some time to find the answer. I now read the constants CSV up front and pass the constants dataframe as a parameter to the methods that transform the column values.
import unittest
from pathlib import Path
import pandas as pd

class AdvancedTestSuite(unittest.TestCase):
    """Advanced test cases."""

    def test_transformation(self):
        data_File = Path(r'C:\Test_python\stackflow') / "data.csv"
        data_mod_File = Path(r'C:\Test_python\stackflow') / "data_mod.csv"
        dataDF = pd.read_csv(data_File)
        ref_Data_File = Path(r'C:\Test_python\stackflow') / "constant.csv"
        refDataDF = pd.read_csv(ref_Data_File)
        refDataDF['refKey'] = refDataDF['sourcedomain'] \
            + "#" + refDataDF['value'] + "#" + refDataDF['targetdomain']
        refDataDF['refValue'] = refDataDF['target']
        dataDF['language'] = dataDF['language'].apply(
            lambda x: translateLanguagetest(x, refDataDF))
        dataDF['gender'] = dataDF['gender'].apply(
            lambda x: translateGendertest(x, refDataDF))
        dataDF.to_csv(data_mod_File, index=False)

def translateLanguagetest(keystring, refDataDF):
    print("keystring" + keystring)
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] ==
        'languageSRC#' + keystring + '#languagetarget')
    # drop the NaN rows; modRef is a pandas Series
    cleanedRef = modRef.dropna()
    # after cleanup only one row remains, so item() selects its value
    value = cleanedRef.item()
    return value

def translateGendertest(keystring, refDataDF):
    print("keystring" + keystring)
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] ==
        'genderSRC#' + keystring + '#gendertarget')
    # drop the NaN rows; modRef is a pandas Series
    cleanedRef = modRef.dropna()
    # after cleanup only one row remains, so item() selects its value
    value = cleanedRef.item()
    return value

if __name__ == '__main__':
    unittest.main()
The data.csv before transformation:
Id,language,gender
1,ita,male
2,eng,female
The constant.csv:
value,sourcedomain,targetdomain,target
ita,languageSRC,languagetarget,it
eng,languageSRC,languagetarget,en
male,genderSRC,gendertarget,Male
female,genderSRC,gendertarget,Female
The csv after transformation:
Id,language,gender
1,it,Male
2,en,Female
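As an aside, a more idiomatic pandas sketch for this kind of lookup (assuming the same data.csv and constant.csv layout as above; build_lookup is a hypothetical helper) builds one plain dict per domain and uses Series.map instead of the where/dropna dance:

import pandas as pd

refDataDF = pd.read_csv('constant.csv')

def build_lookup(ref_df, source_domain, target_domain):
    # keep only the rows of this domain and map value -> target
    mask = ((ref_df['sourcedomain'] == source_domain)
            & (ref_df['targetdomain'] == target_domain))
    return dict(zip(ref_df.loc[mask, 'value'], ref_df.loc[mask, 'target']))

dataDF = pd.read_csv('data.csv')
dataDF['language'] = dataDF['language'].map(
    build_lookup(refDataDF, 'languageSRC', 'languagetarget'))
dataDF['gender'] = dataDF['gender'].map(
    build_lookup(refDataDF, 'genderSRC', 'gendertarget'))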
I created a dictionary with pandas and I'm trying to get only the values:
a             b
hello_friend  HELLO<by>
hi_friend     HI<byby>
good_friend   GOOD<bybyby>
I would like to get the list of values, apply multiple methods to it, and at the end return the keys with the modified values.
def open_pandas():
    df = pandas.read_csv('table.csv', encoding='utf-8')
    dico = df.groupby('a')['b'].apply(list).to_dict()
    return dico

def methods_values(dico):
    removes = b.str.replace(r'<.*>', '')
    b_lower = removes.astype(str).str.lower()
    b_list = dico.to_dict('b')
    # here, I'm going to apply a clustering on the values
    return dico_with_modified_values
I need the two functions (but my second function is not working). My desired output:
{"hello_friend": ['hello'],"hi_friend": ['hi'], "good_friend": ['good']}
Is this possible?
I think you need to first process column b of the DataFrame and then convert it to a dictionary of lists:
df = pandas.read_csv('table.csv', encoding='utf-8')
df['b'] = df['b'].str.replace(r'<.*>', '', regex=True).str.lower()
dico = df.groupby('a')['b'].apply(list).to_dict()
print(dico)
{'good_friend': ['good'], 'hello_friend': ['hello'], 'hi_friend': ['hi']}
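If you want to keep the two-function structure from the question, a possible sketch (same cleaning logic, just split across the two functions; regex=True is spelled out for newer pandas versions):

import pandas

def open_pandas():
    # just load the csv; cleaning happens in methods_values
    return pandas.read_csv('table.csv', encoding='utf-8')

def methods_values(df):
    # strip the <...> tags, lowercase, then build the dictionary of lists
    df['b'] = df['b'].str.replace(r'<.*>', '', regex=True).str.lower()
    return df.groupby('a')['b'].apply(list).to_dict()

dico = methods_values(open_pandas())
print(dico)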
I'm trying to write a function that swaps a dictionary of targets for results in a pandas dataframe. I'd like to match a tuple of values and swap in new values. I tried building it as follows, but the row select isn't working. I feel like I'm missing some critical function here.
import pandas

testData = pandas.DataFrame([["Cats", "Parrots", "Sandstone"], ["Dogs", "Cockatiels", "Marble"]],
                            columns=["Mammals", "Birds", "Rocks"])
target = ("Mammals", "Birds")
swapVals = {("Cats", "Parrots"): ("Rats", "Canaries")}

for x in swapVals:
    # Attempt 1:
    # testData.loc[x, target] = swapVals[x]
    # Attempt 2:
    testData[testData.loc[:, target] == x, target] = swapVals[x]
This was written in Python 2, but the basic idea should work for you. It uses the apply function:
import pandas

testData = pandas.DataFrame([["Cats", "Parrots", "Sandstone"], ["Dogs", "Cockatiels", "Marble"]],
                            columns=["Mammals", "Birds", "Rocks"])
swapVals = {("Cats", "Parrots"): ("Rats", "Canaries")}
target = ["Mammals", "Birds"]

def swapper(in_row):
    temp = tuple(in_row.values)
    if temp in swapVals:
        return list(swapVals[temp])
    else:
        return in_row

testData[target] = testData[target].apply(swapper, axis=1)
testData
Note that if you loaded the other keys into the dict, you could do the apply without the swapper function:
import pandas

testData = pandas.DataFrame([["Cats", "Parrots", "Sandstone"], ["Dogs", "Cockatiels", "Marble"]],
                            columns=["Mammals", "Birds", "Rocks"])
swapVals = {("Cats", "Parrots"): ("Rats", "Canaries"), ("Dogs", "Cockatiels"): ("Dogs", "Cockatiels")}
target = ["Mammals", "Birds"]

testData[target] = testData[target].apply(lambda x: list(swapVals[tuple(x.values)]), axis=1)
testData
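If you would rather keep the row-selection style of your second attempt, a boolean-mask sketch also works: compare the target columns against each old tuple, reduce with all(axis=1), and assign the replacement through .loc:

import pandas

testData = pandas.DataFrame(
    [["Cats", "Parrots", "Sandstone"], ["Dogs", "Cockatiels", "Marble"]],
    columns=["Mammals", "Birds", "Rocks"])
target = ["Mammals", "Birds"]
swapVals = {("Cats", "Parrots"): ("Rats", "Canaries")}

for old, new in swapVals.items():
    # rows where every target column matches the old tuple
    mask = (testData[target] == list(old)).all(axis=1)
    testData.loc[mask, target] = list(new)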