I have just started using Databricks/PySpark. I'm using Python with Spark 2.1. I have uploaded data to a table; the table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could see others suggesting was to convert the dataframe to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this raises a "job aborted due to stage failure" error:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table.
For example, append something to each string in the column, or perform a split on a character, and then put the result back into a dataframe so I can .show() or display it.
You cannot:
Use flatMap, because it will flatten the Row.
Use append, because:
neither tuple nor Row has an append method, and
append (where present on a collection) is executed for its side effects and returns None.
I would use withColumn:
df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.flatMap(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a UDF:
from pyspark.sql.functions import udf
def iplookup(s):
    return ...  # Some lookup logic
iplookup_udf = udf(iplookup)
df.withColumn("foo", iplookup_udf("c0"))
The default return type is StringType, so if you want something else you should adjust it.
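For instance, a minimal sketch putting these pieces together (the length check below is just a stand-in for real lookup logic):

from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import IntegerType

# Stand-in for real lookup logic; it returns an int, so the return type is adjusted
strlen_udf = udf(lambda s: len(s) if s is not None else None, IntegerType())

df2 = (df
       .withColumn("foo", lit("anything"))             # constant string column
       .withColumn("c0_len", strlen_udf(col("_c0"))))  # UDF result column
df2.show()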
I have a PySpark dataframe which looks like this:
RowNumber | value
1         | [{mail=abc#xyz.com, Name=abc}, {mail=mnc#xyz.com, Name=mnc}]
2         | [{mail=klo#xyz.com, Name=klo}, {mail=mmm#xyz.com, Name=mmm}]
The column "value" is of string type.
root
 |-- value: string (nullable = false)
 |-- rowNumber: integer (nullable = false)
Step 1: I need to explode the dictionaries inside the list on each row of the "value" column, so that each row holds a single dictionary.
Step 2: Then further explode that column, so that the resulting table has the dictionary keys (mail, Name) as separate columns.
However, when I try to get to Step 1 using:
df.select(explode(col('value')).alias('value'))
it shows me error:
AnalysisException: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string
How do I convert this string under column 'value' to compatible data types so that I can proceed with exploding the dictionary elements as valid array/json (step1) and then into separate columns (step2) ?
Please help.
EDIT: There may be a simpler way to do this with the from_json function and StructTypes; for that, you can check out this link.
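For reference, here is a minimal sketch of that from_json route, assuming the strings were actually valid JSON with mail and Name keys and a reasonably recent Spark version (the schema below is hypothetical):

from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Hypothetical schema, assuming the strings looked like
# [{"mail": "abc@xyz.com", "Name": "abc"}, ...]
schema = ArrayType(StructType([
    StructField("mail", StringType()),
    StructField("Name", StringType()),
]))

parsed = (df
          .withColumn("value", from_json(col("value"), schema))  # string -> array of structs
          .withColumn("value", explode(col("value")))            # one struct per row
          .select("rowNumber", "value.mail", "value.Name"))      # struct fields -> columns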
The ast way
To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we convert this function to a PySpark UDF, then the following code will suffice:
from pyspark.sql.functions import udf, col, explode, map_values
from pyspark.sql.types import ArrayType, MapType, StringType
from ast import literal_eval

literal_eval_udf = udf(literal_eval, ArrayType(MapType(StringType(), StringType())))
table = table.withColumn("value", literal_eval_udf(col("value"))) # Make strings into ArrayTypes of MapTypes
table = table.withColumn("value", explode(col("value"))) # Explode ArrayTypes such that each row contains a MapType
After applying these functions to the table, what remains is what you originally referred to as the start of "step 2." From here, we want to turn each key of the "value" maps into its own column, filled with the corresponding values. This is accomplished with another function, map_values, which gives us the dict values:
table = table.withColumn("value", map_values(col("value")))
Now the value column contains an ArrayType of the values contained in each dictionary. To make a separate column for each of these, we simply add them in a loop:
keys = ['mail', 'Name']
for k in range(len(keys)):
    table = table.withColumn(keys[k], table.value[k])
Then you can drop the original value column, since you won't need it anymore: you'll now have the columns mail and Name with the information from the corresponding maps.
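For example, something like:

table = table.drop("value")
table.show()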
I'm attempting to use Pandas with a JSON object to flatten it, clean up the data, and write it to a relational database.
For any List objects, I want to convert them to a string and create a new column that has a count of the values in the original column.
I've gotten as far as getting a Series indicating which columns contained a list. Now I want to filter that Series to get back only the columns that were True. I feel like there should be a straightforward way to filter this Series down to only the True items, but methods like filter only seem to work on the index.
Current:
d False
a.b True
Desired:
a.b True
My original code:
import pandas as _pd
data = {"a":{"b":['x','y','z']},"c":1,"d":None}
df = _pd.json_normalize(data).convert_dtypes()
ldf = df.select_dtypes(include=['object']).applymap(lambda x: isinstance(x,list)).max()
Any suggestions on how to easily filter this down to just the true values?
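One possible sketch, using boolean indexing on the Series itself (untested against your exact frame):

true_only = ldf[ldf]                      # keep only the entries that are True
list_columns = true_only.index.tolist()   # e.g. ['a.b']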
I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I thought this would return the dataframe with "_something" appended to every column name, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the Series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose it implies that a DataFrame isn't just a collection of Series, which means in all likelihood, I now have to explicitly rename things on the df)
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name + '_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logics:
df[e].rename(df[e].name+'_Something').apply(...)
If you use df.apply directly, you can't change the column name. There is no way to do it that I can think of.
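A minimal sketch of that pattern combined with your sumn primitive, on a made-up dataframe (column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))

out = pd.concat([sumn(3, df[c]) for c in df], axis=1)
print(out.columns)  # Index(['a_sum_3', 'b_sum_3'], dtype='object')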
I am iterating over a groupby column in a pandas dataframe in Python 3.6 with the help of a for loop. The problem with this is that it becomes slow if I have a lot of data. This is my code:
import pandas as pd
dataDict = {}
for metric, df_metric in frontendFrame.groupby('METRIC'):  # Creates frames for each metric
    dataDict[metric] = df_metric.to_dict('records')  # Converts dataframe to dictionary
frontendFrame is a dataframe containing two columns: VALUE and METRIC. My end goal is basically to create a dictionary with a key for each metric, containing all the data connected to it. I know this should be possible with lambda or map, but I can't get it working with multiple arguments. I tried:
frontendFrame.groupby('METRIC').apply(lambda x: print(x))
How can I solve this and make my script faster?
If you do not need any calculation after the groupby, do not group the data; you can use .loc to get what you need:
s = frontendFrame.METRIC.unique()
frontendFrame.loc[frontendFrame.METRIC == s[0]]
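If you still need the full dictionary, here is a sketch of building it with the same .loc idea (same frontendFrame columns assumed):

dataDict = {m: frontendFrame.loc[frontendFrame.METRIC == m].to_dict('records')
            for m in frontendFrame.METRIC.unique()}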
I'm reading in a .csv file using pandas, and then I want to filter out the rows where a specified column's value is not in a dictionary, for example. So something like this:
df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
                 names=['col1', 'col2', 'col3', 'col4'])
c = df.col4.value_counts(normalize=True).head(20)
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2])) # Get odd and create dict
df_filtered = filter out all rows where col4 not in values
After searching around a bit I tried using the following to filter it:
df_filtered = df[df.col4 in values]
but that unfortunately didn't work.
I've done the following to make it work for what I want to do, but it's incredibly slow for a large .csv file, so I thought there must be a built-in way to do it in pandas:
t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]
If you want to check against the dictionary values:
df_filtered = df[df.col4.isin(values.values())]
If you want to check against the dictionary keys:
df_filtered = df[df.col4.isin(values.keys())]
As A.Kot mentioned, you could use the values method of the dict to search. But values returns either a list or a view object, depending on your version of Python.
If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict then you are using the wrong data structure.
A set will improve your lookup performance while keeping the check simple:
df_filtered = df[df.col4.isin(values)]
If you use values elsewhere, and you want to check against the keys, then you're ok because membership testing against keys is efficient.
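A small sketch of that set-based variant, reusing the value_counts result c from the question (col4 as above):

wanted = set(c.index.tolist()[1::2])     # build a set instead of a dict
df_filtered = df[df.col4.isin(wanted)]   # fast membership test per element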