Python: 'Series' objects are mutable, thus they cannot be hashed

I have a DataFrame df with text as below :
| File_name | Content                         |
|-----------|---------------------------------|
| BI1.txt   | I am writing this letter ...    |
| BI2.txt   | Yes ! I would like to pursue... |
I would like to create an additional column which provides the syllable count with :
df['syllable_count']= textstat.syllable_count(df['content'])
The error :
Series objects are mutable, thus they cannot be hashed
How can I change the Content column to be hashable? How can I fix this error?
Thanks for your help !

Try doing it this way:
df['syllable_count'] = df.content.apply(lambda x: textstat.syllable_count(x))
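If the lambda feels redundant, a small variation is to pass the function itself to apply (a sketch, assuming the column is actually named Content as in the table above; adjust the name to match your frame):
import textstat
# Apply textstat.syllable_count to each cell of the text column.
df['syllable_count'] = df['Content'].apply(textstat.syllable_count)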

Related

Object to dictionary to use get() python pandas

I'm having some issues with a column in my csv whose dtype is 'object', but it should be a Series of dicts (one dict per row).
The goal is to treat each row as a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is what the 'object' column looks like:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is what it should look like in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's already being read as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a 'malformed node or string' error, consider converting the value with str first and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
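If the ValueError comes from missing rows, a minimal sketch that parses only the non-null values and leaves the rest untouched (assuming the NaN rows should simply be skipped):
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')
# Only parse rows that actually contain a string; NaN is what triggers
# "malformed node or string: nan".
mask = df['Conta'].notna()
df.loc[mask, 'Conta'] = df.loc[mask, 'Conta'].map(
    lambda x: ast.literal_eval(x).get('id', 0))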

Split a string on ';' into an array and delete the trailing ';' at the end of the string if it exists

I want to create a new column based on a string column that uses ';' as a separator, and drop the trailing ';' if it exists, using Python/PySpark:
Inputs :
"511;520;611;"
"322;620"
"3;321;"
"334;344"
Expected output:
| Column         | New column    |
|----------------|---------------|
| "511;520;611;" | [511,520,611] |
| "322;620"      | [322,620]     |
| "3;321;"       | [3,321]       |
| "334;344"      | [334,344]     |
I tried:
data = data.withColumn(
    "newcolumn",
    split(col("column"), ";"))
but I get an empty string at the end of the array, as shown below, and I want to remove it if it exists:
| Column         | New column                  |
|----------------|-----------------------------|
| "511;520;611;" | [511,520,611, empty string] |
| "322;620"      | [322,620]                   |
| "3;321;"       | [3,321, empty string]       |
| "334;344"      | [334,344]                   |
For Spark version >= 2.4, use the filter function with an x != '' condition to filter out the empty strings in the array:
from pyspark.sql.functions import expr
data = data.withColumn("newcolumn", expr("filter(split(column, ';'), x -> x != '')"))

How to group column values into 'others'?

I have a huge list of website names in my dataframe.
e.g. array(['google', 'facebook', 'yahoo', 'youtube', and many other small websites])
The dataframe has around 40 more websites.
I want to group the other website names as 'others'.
My input table is something like
|Website |
|-------------|
|google.com |
|youtube.com |
|yahoo.com |
|nyu.com |
|something.com|
My desired output will be something like
|Website |
|-----------|
|google.com |
|youtube.com|
|yahoo.com |
|others |
|others |
I tried a few things but they didn't work. Should I rename them manually? Or is there a way to create a new column and mark them as 'others', keeping a few exceptions as above?
Thanks in advance.
try:
m=df['Website'].isin(['google.com','youtube.com','yahoo.com'])
#Finally:
df.loc[~m,'Website']='others'
OR
m=df['Website'].str.contains('google|youtube|yahoo')
#Finally:
df.loc[~m,'Website']='others'
Try using str.contains (negated, so everything that is not one of the listed sites becomes 'others'):
df.loc[~df['Website'].str.contains('google|youtube|yahoo|facebook'),'Website']='others'
Maybe...
# maintain a list of sites you wish to keep
sitesToKeep = ['google.com', 'youtube.com', 'yahoo.com']
# for all rows where the value in the column 'Website' is not present in the list 'sitesToKeep' change the value to 'other'
df.loc[~df.Website.isin(sitesToKeep), 'Website'] = 'Other'
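Equivalently, numpy.where can express the same keep-or-replace logic in one line (a sketch, assuming the same three sites should be kept):
import numpy as np

keep = ['google.com', 'youtube.com', 'yahoo.com']
# Rows whose site is in the keep list stay as they are; everything else becomes 'others'.
df['Website'] = np.where(df['Website'].isin(keep), df['Website'], 'others')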

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that
the imdbRating field has a mixed type of data, such as random strings, movie titles, movie URLs, and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type, and hence it returns None for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that some of the data was corrupt, with not all fields present. First, I tried using pandas by reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows, which had fewer columns than the actual schema. I then tried to read the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark csv reader has a similar option that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
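If re-reading the file is not convenient, another sketch is to cast the column to double and drop the rows that fail the cast (casting a non-numeric string yields null in Spark, so only real ratings survive):
from pyspark.sql.functions import col

ratings = (imdb_data
    .withColumn('imdbRating', col('imdbRating').cast('double'))
    .filter(col('imdbRating').isNotNull())
    .sort('imdbRating')
    .select('imdbRating'))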

Python Pandas MD5 Value not in index

I'm trying to modify and add some columns in an imported csv file.
The idea is that I want 2 extra columns, one with the MD5 value of the email address, and one with the SHA256 value of the email.
+----+-----------+---------+
| id | email | status |
| 1 | 1#foo.com | ERROR |
| 2 | 2#foo.com | SUCCESS |
| 3 | 3#bar.com | SUCCESS |
+----+-----------+---------+
I have tried with:
df['email_md5'] = md5_crypt.hash(df[df.email])
This gives me an error saying:
KeyError: "['1#foo.com'
'2#foo.com'\n '3#bar.com'] not
in index"
I have seen in another post (Pandas KeyError: value not in index) that it is suggested to use reindex, but I can't get this to work.
If you are looking for md5_crypt.hash, you will have to apply the hash function of the md5_crypt module to each of the emails using apply():
from passlib.hash import md5_crypt
df['email_md5'] = df['email'].apply(md5_crypt.hash)
Output
id  email      status   email_md5
1   1#foo.com  ERROR    $1$lHP8aPeE$5T4jqc/qir9yFszVikeSM0
2   2#foo.com  SUCCESS  $1$jyOWkcrw$I8iStC3up3cwLLLBwnT5S/
3   3#bar.com  SUCCESS  $1$oDfnN5UH$/2N6YljJRMfDxY2gXLYCA/
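Note that md5_crypt produces a salted crypt-style hash, so its output changes on every call. If the goal is the plain MD5 and SHA256 digests of the address (as the question describes), a sketch with the standard hashlib module:
import hashlib

# Plain hex digests of the email string; deterministic, unlike md5_crypt.
df['email_md5'] = df['email'].apply(lambda e: hashlib.md5(e.encode()).hexdigest())
df['email_sha256'] = df['email'].apply(lambda e: hashlib.sha256(e.encode()).hexdigest())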
