Object to dictionary to use get() - python pandas

I'm having some issues with a column in my CSV whose type is 'object', but it should be a dict series (a dict for each row).
The point is to turn each row into a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is what it currently looks like as an 'object' column:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is how it should look in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?

Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If each row is already being detected as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a 'malformed node or string' error, consider converting the value to str first and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
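If the ValueError comes from empty cells that read_csv loads as NaN, one option is to parse only the rows that are actual strings and fall back to a default otherwise. A minimal sketch, reusing the file path from the question and assuming 0 as the fallback id:
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')

def extract_id(cell):
    # NaN cells come back as floats, not strings, from read_csv
    if not isinstance(cell, str):
        return 0
    # parse the stringified dict and pull out the 'id' value
    return ast.literal_eval(cell).get('id', 0)

df['Conta'] = df['Conta'].map(extract_id)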

Related

Replace datetime column in DF with hour from int column

I have a pyspark df with an hour column (int) like this:
hour
0
0
1
...
14
And I have an execution_datetime variable that looks like 2022-01-02 17:23:11
Now I want to calculate a new column for my DF that holds my execution_datetime but with the hour replaced by the values from the hour column. The output should look like:
| hour | exec_dttm_with_hour |
| ---- | ------------------- |
| 0 | 2022-01-02 00:23:11 |
| 0 | 2022-01-02 00:23:11 |
| 1 | 2022-01-02 01:23:11 |
| ... | ... |
| 14 | 2022-01-02 14:23:11 |
I know there are ways using i.e. .collect(), then edit the list and insert as new col. But I need to make use of sparks parallel execution since it could be a super high data load. Also, casting it to pandas and then editing it is not suitable for my use case.
Thanks in advance for any suggestions!
You can use the make_timestamp function if you're on Spark 3+; otherwise you can use a UDF.
from pyspark.sql import functions as func

execution_datetime = '2022-01-02 17:23:11'  # as given in the question

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('static_dttm', func.lit(execution_datetime).cast('timestamp')). \
    withColumn('dttm',
               func.expr('''make_timestamp(year(static_dttm),
                                           month(static_dttm),
                                           day(static_dttm),
                                           hour,
                                           minute(static_dttm),
                                           second(static_dttm)
                                           )
                         ''')
               ). \
    drop('static_dttm'). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+
Using a UDF:
from pyspark.sql.types import TimestampType

def update_ts(string_ts, hour_col):
    import datetime
    dttm = datetime.datetime.strptime(string_ts, '%Y-%m-%d %H:%M:%S')
    return datetime.datetime(dttm.year, dttm.month, dttm.day, hour_col, dttm.minute, dttm.second)

update_ts_udf = func.udf(update_ts, TimestampType())

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('dttm', update_ts_udf(func.lit(execution_datetime), func.col('hour'))). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+
You can use concat to connect multiple strings in a row and lit to add a constant value to each row.
In the following code, a new column timestamp is introduced where the first 11 characters of execution_datetime are concatenated with the characters after the hour, and the hour values are inserted in between. It also makes sure that the hours have a leading zero.
import pyspark.sql.functions as f
df = df.withColumn('timestamp', f.concat(f.lit(execution_datetime[0:11]), f.lpad(f.col('hour'), 2, '0'), f.lit(execution_datetime[13:])))
Remark: This might be faster than using the timestamp functions suggested in samkart's answer, but it is also less safe at catching wrong inputs.
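If an actual timestamp column is needed rather than a string, the concatenated value can be cast back, for example with to_timestamp. A small sketch; the column name timestamp follows the code above, and the format string is assumed to match the question's example value:
import pyspark.sql.functions as f
# convert the concatenated string column into a proper timestamp type
df = df.withColumn('exec_dttm_with_hour', f.to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss'))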

Split a string on ";" into an array and drop the trailing ";" if it exists

I want to create a new array column from a string column that uses ";" as a separator, deleting the trailing ";" if it exists, using Python/PySpark:
Inputs :
"511;520;611;"
"322;620"
"3;321;"
"334;344"
Expected output:
| Column | new column |
| -------------- | --------------- |
| "511;520;611;" | [511, 520, 611] |
| "322;620" | [322, 620] |
| "3;321;" | [3, 321] |
| "334;344" | [334, 344] |
I tried:
from pyspark.sql.functions import split, col
data = data.withColumn("newcolumn", split(col("column"), ";"))
but I get an empty string at the end of the array, as shown below, and I want to remove it if it exists:
| Column | new column |
| -------------- | ----------------------------- |
| "511;520;611;" | [511, 520, 611, empty string] |
| "322;620" | [322, 620] |
| "3;321;" | [3, 321, empty string] |
| "334;344" | [334, 344] |
For Spark version >= 2.4, use the filter function with an x != '' condition to filter out empty strings from the array:
from pyspark.sql.functions import expr
data = data.withColumn("newcolumn", expr("filter(split(column, ';'), x -> x != '')"))
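For Spark versions below 2.4, where the filter higher-order function is not available, a possible alternative (my assumption, not part of the answer above) is to strip a trailing ";" with regexp_replace before splitting:
from pyspark.sql.functions import regexp_replace, split, col
# remove a trailing ';' (if any) first, then split on ';'
data = data.withColumn("newcolumn", split(regexp_replace(col("column"), ";$", ""), ";"))
If the elements should be integers rather than strings, the resulting array can additionally be cast with .cast('array<int>').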

Python : 'Series' objects are mutable, thus they cannot be hashed

I have a DataFrame df with text as below :
|---------------------|-----------------------------------|
| File_name | Content |
|---------------------|-----------------------------------|
| BI1.txt | I am writing this letter ... |
|---------------------|-----------------------------------|
| BI2.txt | Yes ! I would like to pursue... |
|---------------------|-----------------------------------|
I would like to create an additional column which provides the syllable count with :
df['syllable_count']= textstat.syllable_count(df['content'])
The error :
Series objects are mutable, thus they cannot be hashed
How can I make the Content column hashable? How can I fix this error?
Thanks for your help!
Try doing it this way:
df['syllable_count'] = df.content.apply(lambda x: textstat.syllable_count(x))
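Since textstat.syllable_count expects a single string, the function can also be passed to apply directly. A minimal sketch, assuming the column is named 'Content' exactly as in the example table:
df['syllable_count'] = df['Content'].apply(textstat.syllable_count)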

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

This solution, in theory, works perfectly for what I need, which is to create a new copy of a dataframe while excluding certain nested struct fields. Here is a minimally reproducible artifact of my issue:
>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)
which you can instantiate like such:
schema = StructType([StructField("big", ArrayType(StructType([
StructField("keep", StringType()),
StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
My goal is to convert the dataframe (along with the values in the columns I want to keep) to one that excludes certain nested structs, like delete for example.
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
According to the solution I linked that tries to leverage pyspark.sql's to_json and from_json functions, it should be accomplishable with something like this:
new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])
test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))
>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)
>>> test_df.show()
+----+
| big|
+----+
|null|
+----+
So either I'm not following his directions right, or it doesn't work. How do you do this without a udf?
Pyspark to_json documentation
Pyspark from_json documentation
It should work; you just need to adjust your new_schema so it describes only the column 'big', not the whole dataframe:
new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
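Putting the fix together as a self-contained sketch (the imports are my addition; the rest follows the answer above), the resulting schema should keep only the 'keep' field inside the array of structs:
from pyspark.sql.functions import to_json, from_json, col
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json(col("big")), new_schema))
test_df.printSchema()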

Applying a udf function in a distributed fashion in PySpark

Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
|| value || lang ||
| 3 | en |
| 4 | ua |
Say I want to have a new column per specific class where I would add a float number to the given value (this is not very relevant to the final question; in reality I do a prediction with sklearn there, but for simplicity let's assume we are adding something, the idea being that I am modifying the value in some way). So, given a dict classes={'1':2.0, '2':3.0}, I would like to have a column for each class where I add the class value to the value from the DF and then save it to a CSV:
class_1.csv
|| value || lang || my_class | modified ||
| 3 | en | 1 | 5.0 | # this is 3+2.0
| 4 | ua | 1 | 6.0 | # this is 4+2.0
class_2.csv
|| value || lang || my_class | modified ||
| 3 | en | 2 | 6.0 | # this is 3+3.0
| 4 | ua | 2 | 7.0 | # this is 4+3.0
So far I have the following code that works and modifies the value for each defined class, but it is done with a for loop and I am looking for a more advanced optimization for it:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit

# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

my_df = spark.read.csv("some_file.csv")

# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to column, let's pretend we just add a specific value per specific class that is read from a dictionary
    class_dict = {'1': 2.0, '2': 3.0}  # would be loaded from somewhere
    return float(value + class_dict[separate_class])

# iterate over each given class later
class_dict = {'1': 2.0, '2': 3.0}  # in reality have more than 10 classes

# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())

# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value", "my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_" + my_class + ".csv")
So the question is: is there a better/faster way of doing this than in a for loop?
I would use some form of join, in this case crossJoin. Here's a MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSV's I think there is no better way than to use a loop.
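To still end up with one file per class as in the question, a loop can filter the already-joined result before writing; a sketch under that assumption, reusing the names from the MWE above:
# write one CSV per class, reusing the single crossJoin result
for row in classes.select('class_key').distinct().collect():
    my_class = row['class_key']
    res.filter(F.col('class_key') == my_class) \
       .write.format("csv").save("class_" + str(my_class) + ".csv")
Alternatively, res.write.partitionBy('class_key').csv(path) writes everything in one pass with one sub-directory per class, though the output layout differs from separately named files.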
