Python: 'Series' objects are mutable, thus they cannot be hashed

I have a DataFrame df with text as below :
| File_name | Content                         |
|-----------|---------------------------------|
| BI1.txt   | I am writing this letter ...    |
| BI2.txt   | Yes ! I would like to pursue... |
I would like to create an additional column which provides the syllable count with :
df['syllable_count']= textstat.syllable_count(df['content'])
The error :
Series objects are mutable, thus they cannot be hashed
How can I change the Content column to be hashable? How can I fix this error?
Thanks for your help !

Try doing it this way:
df['syllable_count'] = df.content.apply(lambda x: textstat.syllable_count(x))
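If the lambda feels redundant, a small variation is to pass the function itself to apply (a sketch, assuming the column is actually named Content as in the table above; adjust the name to match your frame):
import textstat
# Apply textstat.syllable_count to each cell of the text column.
df['syllable_count'] = df['Content'].apply(textstat.syllable_count)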

Related

Object to dictionary to use get() python pandas

I'm having some issues with a column in my csv whose dtype is 'object', but it should be a Series of dicts (one dict per row).
The goal is to treat each row as a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is what the 'object' column looks like:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is what it should look like in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's already being read as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a 'malformed node or string' error, consider converting the value with str first and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
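If the ValueError comes from missing rows, a minimal sketch that parses only the non-null values and leaves the rest untouched (assuming the NaN rows should simply be skipped):
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')
# Only parse rows that actually contain a string; NaN is what triggers
# "malformed node or string: nan".
mask = df['Conta'].notna()
df.loc[mask, 'Conta'] = df.loc[mask, 'Conta'].map(
    lambda x: ast.literal_eval(x).get('id', 0))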

Split a string on ';' into an array and delete the trailing ';' at the end of the string if it exists

I want to create a new column based on a string column that uses ';' as a separator, and drop the trailing ';' if it exists, using Python/PySpark:
Inputs :
"511;520;611;"
"322;620"
"3;321;"
"334;344"
Expected output:
| Column         | New column    |
|----------------|---------------|
| "511;520;611;" | [511,520,611] |
| "322;620"      | [322,620]     |
| "3;321;"       | [3,321]       |
| "334;344"      | [334,344]     |
I tried:
data = data.withColumn(
    "newcolumn",
    split(col("column"), ";"))
but I get an empty string at the end of the array, as shown below, and I want to remove it if it exists:
| Column         | New column                  |
|----------------|-----------------------------|
| "511;520;611;" | [511,520,611, empty string] |
| "322;620"      | [322,620]                   |
| "3;321;"       | [3,321, empty string]       |
| "334;344"      | [334,344]                   |
For Spark version >= 2.4, use the filter function with an x != '' condition to filter out the empty strings in the array:
from pyspark.sql.functions import expr
data = data.withColumn("newcolumn", expr("filter(split(column, ';'), x -> x != '')"))

How to group column values into 'others'?

I have a huge list of website names in my dataframe.
e.g. array(['google', 'facebook', 'yahoo', 'youtube', and many other small websites])
The dataframe has around 40 more websites.
I want to group the other website names as 'others'.
My input table is something like
|Website |
|-------------|
|google.com |
|youtube.com |
|yahoo.com |
|nyu.com |
|something.com|
My desired output will be something like
|Website |
|-----------|
|google.com |
|youtube.com|
|yahoo.com |
|others |
|others |
I tried a few things but they didn't work. Should I rename them manually? Or is there a way to create a new column and mark them as 'others', keeping a few exceptions as above?
Thanks in advance.
try:
m=df['Website'].isin(['google.com','youtube.com','yahoo.com'])
#Finally:
df.loc[~m,'Website']='others'
OR
m=df['Website'].str.contains('google|youtube|yahoo')
#Finally:
df.loc[~m,'Website']='others'
Try using str.contains (negated, so everything that is not one of the listed sites becomes 'others'):
df.loc[~df['Website'].str.contains('google|youtube|yahoo|facebook'),'Website']='others'
Maybe...
# maintain a list of sites you wish to keep
sitesToKeep = ['google.com', 'youtube.com', 'yahoo.com']
# for all rows where the value in the column 'Website' is not present in the list 'sitesToKeep' change the value to 'other'
df.loc[~df.Website.isin(sitesToKeep), 'Website'] = 'Other'
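Equivalently, numpy.where can express the same keep-or-replace logic in one line (a sketch, assuming the same three sites should be kept):
import numpy as np

keep = ['google.com', 'youtube.com', 'yahoo.com']
# Rows whose site is in the keep list stay as they are; everything else becomes 'others'.
df['Website'] = np.where(df['Website'].isin(keep), df['Website'], 'others')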

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that
the imdbRating field has a mixed type of data, such as random strings, movie titles, movie URLs, and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type, and hence it returns None for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that some of the data was corrupt, with not all fields present. First, I tried using pandas by reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows, which had fewer columns than the actual schema. I then tried to read the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark csv reader has a similar option that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
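If re-reading the file is not convenient, another sketch is to cast the column to double and drop the rows that fail the cast (casting a non-numeric string yields null in Spark, so only real ratings survive):
from pyspark.sql.functions import col

ratings = (imdb_data
    .withColumn('imdbRating', col('imdbRating').cast('double'))
    .filter(col('imdbRating').isNotNull())
    .sort('imdbRating')
    .select('imdbRating'))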

Python Pandas MD5 Value not in index

I'm trying to modify and add some columns in an imported csv file.
The idea is that I want 2 extra columns, one with the MD5 value of the email address, and one with the SHA256 value of the email.
+----+-----------+---------+
| id | email | status |
| 1 | 1#foo.com | ERROR |
| 2 | 2#foo.com | SUCCESS |
| 3 | 3#bar.com | SUCCESS |
+----+-----------+---------+
I have tried with:
df['email_md5'] = md5_crypt.hash(df[df.email])
This gives me an error saying:
KeyError: "['1#foo.com'
'2#foo.com'\n '3#bar.com'] not
in index"
I have seen in another post (Pandas KeyError: value not in index) that it is suggested to use reindex, but I can't get this to work.
If you are looking for md5_crypt.hash, you will have to apply the hash function of the md5_crypt module to each of the emails using apply():
from passlib.hash import md5_crypt
df['email_md5'] = df['email'].apply(md5_crypt.hash)
Output
id  email      status   email_md5
1   1#foo.com  ERROR    $1$lHP8aPeE$5T4jqc/qir9yFszVikeSM0
2   2#foo.com  SUCCESS  $1$jyOWkcrw$I8iStC3up3cwLLLBwnT5S/
3   3#bar.com  SUCCESS  $1$oDfnN5UH$/2N6YljJRMfDxY2gXLYCA/
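Note that md5_crypt produces a salted crypt-style hash, so its output changes on every call. If the goal is the plain MD5 and SHA256 digests of the address (as the question describes), a sketch with the standard hashlib module:
import hashlib

# Plain hex digests of the email string; deterministic, unlike md5_crypt.
df['email_md5'] = df['email'].apply(lambda e: hashlib.md5(e.encode()).hexdigest())
df['email_sha256'] = df['email'].apply(lambda e: hashlib.sha256(e.encode()).hexdigest())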
