Python Pandas MD5 Value not in index

I'm trying to modify and add some columns in an imported csv file.
The idea is that I want 2 extra columns, one with the MD5 value of the email address, and one with the SHA256 value of the email.
+----+-----------+---------+
| id | email     | status  |
+----+-----------+---------+
| 1  | 1#foo.com | ERROR   |
| 2  | 2#foo.com | SUCCESS |
| 3  | 3#bar.com | SUCCESS |
+----+-----------+---------+
I have tried with:
df['email_md5'] = md5_crypt.hash(df[df.email])
This gives me an error saying:
KeyError: "['1#foo.com'
'2#foo.com'\n '3#bar.com'] not
in index"
I have seen in another post, Pandas KeyError: value not in index, that it's suggested to use reindex, but I can't get this to work.

If you are looking for md5_crypt.hash, you will have to apply the hash function of the md5_crypt module to each of the emails using Series.apply():
from passlib.hash import md5_crypt
df['email_md5'] = df['email'].apply(md5_crypt.hash)
Output
id      email   status                           email_md5
 1  1#foo.com    ERROR  $1$lHP8aPeE$5T4jqc/qir9yFszVikeSM0
 2  2#foo.com  SUCCESS  $1$jyOWkcrw$I8iStC3up3cwLLLBwnT5S/
 3  3#bar.com  SUCCESS  $1$oDfnN5UH$/2N6YljJRMfDxY2gXLYCA/
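Note that md5_crypt produces a salted, crypt-style hash, so its output changes on every call. If plain MD5 and SHA-256 hex digests of the email are what is wanted (the question mentions both), a minimal sketch using the standard hashlib module would be:
import hashlib
# plain (unsalted) hex digests of each email address
df['email_md5'] = df['email'].apply(lambda e: hashlib.md5(e.encode('utf-8')).hexdigest())
df['email_sha256'] = df['email'].apply(lambda e: hashlib.sha256(e.encode('utf-8')).hexdigest())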

Related

Object to dictionary to use get() python pandas

I'm having some issues with a column in my csv whose type is 'object', but it should be a series of dicts (one dict per row).
The point is to treat each row as a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is how it looks as an 'object' column:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is how it should look in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's being correctly detected as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a malformed node or string error, consider converting to str first and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
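Note that str(nan) produces 'nan', which ast.literal_eval() still cannot parse, so if the malformed node or string: nan error persists, one option is to drop those rows first. A minimal sketch, assuming the NaN rows can simply be discarded:
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')
df = df.dropna(subset=['Conta'])                          # drop rows where 'Conta' is NaN
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])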

PySpark - How to loop through the dataframe and match against another common value in another dataframe

This is a recommender system: I have a Dataframe which contains about 10 recommended items for each user (recommendation_df), and another Dataframe which consists of the recent purchases of each user (recent_df).
I am trying to code this task, but I can't get the syntax and the manipulation right.
I am implementing a hit/miss ratio: basically, for every new_party_id in recent_df, if any of its merch_store_code values matches a merch_store_code for the same party_id in recommendation_df, count += 1 (hit).
Then I calculate the hit/miss ratio as count / total user count.
(In recent_df each user might have multiple recent purchases, but if any of those purchases is in the recommendation list for the same user, take it as a hit (count += 1).)
recommendation_df
+--------------+----------------+-----------+----------+
|party_id_index|merch_store_code| rating| party_id|
+--------------+----------------+-----------+----------+
| 148| 900000166| 0.4021678|G18B00332C|
| 148| 168339566| 0.27687865|G18B00332C|
| 148| 168993309| 0.15999989|G18B00332C|
| 148| 168350313| 0.1431974|G18B00332C|
| 148| 168329726| 0.13634883|G18B00332C|
| 148| 168351967|0.120235085|G18B00332C|
| 148| 168993312| 0.11800903|G18B00332C|
| 148| 168337234|0.116267696|G18B00332C|
| 148| 168993256| 0.10836013|G18B00332C|
| 148| 168339482| 0.10341005|G18B00332C|
| 463| 168350313| 0.93455887|K18M926299|
| 463| 900000072| 0.8275664|K18M926299|
| 463| 700012303| 0.70220494|K18M926299|
| 463| 700012180| 0.23209469|K18M926299|
| 463| 900000157| 0.1727839|K18M926299|
| 463| 700013689| 0.13854747|K18M926299|
| 463| 900000166| 0.12866624|K18M926299|
| 463| 168993284|0.107065596|K18M926299|
| 463| 168993269| 0.10272527|K18M926299|
| 463| 168339566| 0.10256036|K18M926299|
+--------------+----------------+-----------+----------+
recent_df
+------------+---------------+----------------+
|new_party_id|recent_purchase|merch_store_code|
+------------+---------------+----------------+
| A11275842R| 2022-05-21| 168289403|
| A131584211| 2022-06-01| 168993311|
| A131584211| 2022-06-01| 168349493|
| A131584211| 2022-06-01| 168350192|
| A182P3539K| 2022-03-26| 168341707|
| A182V2883F| 2022-05-26| 168350824|
| A183B5482P| 2022-05-10| 168993464|
| A183C6900K| 2022-05-14| 168338795|
| A183D56093| 2022-05-20| 700012303|
| A183J5388G| 2022-03-18| 700013650|
| A183U8880P| 2022-04-01| 900000072|
| A183U8880P| 2022-04-01| 168991904|
| A18409762L| 2022-05-10| 168319352|
| A18431276J| 2022-05-14| 168163905|
| A18433684M| 2022-03-21| 168993324|
| A18433978F| 2022-05-20| 168341876|
| A184410389| 2022-05-04| 900000166|
| A184716280| 2022-04-06| 700013653|
| A18473797O| 2022-05-24| 168330339|
| A18473797O| 2022-05-24| 168350592|
+------------+---------------+----------------+
Here is my current coding logic:
count = 0
def hitratio(recommendation_df, recent_df):
    for i in recent_df['new_party_id']:
        for j in recommendation_df['party_id']:
            if (i = j) & i.merch_store_code == j.merch_store_code:
                count += 1
    return (count/recent_df.count())
In Spark, refrain from looping over rows. Spark does not work like that; you need to think in terms of whole columns, not a row-by-row scenario.
You need to join both tables and select the users, but they need to be without duplicates (distinct):
from pyspark.sql import functions as F
df_distinct_matches = (
    recent_df
    .join(recommendation_df, F.col('new_party_id') == F.col('party_id'))
    .select('party_id').distinct()
)
hit = df_distinct_matches.count()
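To reflect the hit definition in the question (same user and a matching store code), the join condition would also need the merch_store_code equality. A hedged sketch completing the ratio under that reading, using the distinct users of recent_df as the denominator:
from pyspark.sql import functions as F

df_hits = (
    recent_df
    .join(recommendation_df,
          (F.col('new_party_id') == F.col('party_id'))
          & (recent_df['merch_store_code'] == recommendation_df['merch_store_code']))
    .select('party_id').distinct()      # one row per user with at least one hit
)
hit_ratio = df_hits.count() / recent_df.select('new_party_id').distinct().count()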
Assumption: I am taking the total row count of recent_df as the denominator for calculating the hit/miss ratio; you can change the formula.
from pyspark.sql import functions as F
matching_cond = ((recent_df["merch_store_code"] == recommendation_df["merch_store_code"])
                 & (recommendation_df["party_id"].isNotNull()))
df_recent_fnl = recent_df.join(recommendation_df, recent_df["new_party_id"] == recommendation_df["party_id"], "left")\
    .select(recent_df["*"], recommendation_df["merch_store_code"], recommendation_df["party_id"])\
    .withColumn("hit", F.when(matching_cond, F.lit(True)).otherwise(F.lit(False)))
# the ratio is added in a second step, once df_recent_fnl exists
df_recent_fnl = df_recent_fnl.withColumn("hit/miss",
    F.lit(df_recent_fnl.filter(F.col("hit")).count() / recent_df.count()))
Do let me know if you have any questions around this. If you like my solution, you can upvote.
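If a single number is preferred over a constant column, one way to pull it out (a small sketch building on df_recent_fnl above):
# every row carries the same "hit/miss" value, so read it from the first row
hit_miss_ratio = df_recent_fnl.select("hit/miss").first()[0]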

django get record which is latest and other column value

I have a model with some columns; among them are 2 columns: equipment_id (a CharField) and date_saved (a DateTimeField).
I have multiple rows with the same equipment_id but different date_saved (each time the user saves the record I save the current date and time).
I want to retrieve the record that has a specific equipment_id and is the latest saved, i.e.:
| Equipment_id | Date_saved           |
| ------------ | -------------------- |
| 1061a        | 26-DEC-2020 10:10:23 |
| 1061a        | 26-DEC-2020 10:11:52 |
| 1061a        | 26-DEC-2020 10:22:03 |
| 1061a        | 26-DEC-2020 10:31:15 |
| 1062a        | 21-DEC-2020 10:11:52 |
| 1062a        | 25-DEC-2020 10:22:03 |
| 1073a        | 20-DEC-2020 10:31:15 |
I want to retrieve, for example, the latest record for equipment_id=1061a.
I have tried various approaches without success:
prg = Program.objects.filter(equipment_id=id)
program = Program.objects.latest('date_saved')
When I use program, I get the latest record saved, with no relation to the previous filter.
You can chain the filtering as:
result = Program.objects.filter(equipment_id=id).latest('date_saved')
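If no row matches the filter, latest() raises Program.DoesNotExist; an alternative that returns None instead (a sketch using standard queryset methods) is:
# order by newest first and take the first match, or None if there is none
program = Program.objects.filter(equipment_id=id).order_by('-date_saved').first()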

PySpark : Performance optimization during create_map for categorizing the page visit

I am working on optimizing the below operation, whose execution time is relatively high on the actual (large) dataset. I tried it on the two PySpark datasets below (1 & 2) to arrive at the "page_category" column of dataset-2.
pyspark dataset-1 :
page_click | page_category
---------------------------
facebook   | Social_network
insta      | Social_network
coursera   | educational
Another dataset, on which I am applying the create_map operation, looks like:
pyspark dataset-2 :
id | page_click
---------------
1 | facebook
2 |Coursera
I am creating a dictionary from dataset-1 and applying the following:
from itertools import chain
from pyspark.sql.functions import create_map, lit
page_map = create_map([lit(x) for x in chain(*dict_dataset_1.items())])
dataset_2 = dataset_2.withColumn('page_category', page_map[dataset_2['page_click']])
and then performing withColumn on the 'page_click' column of dataset-2 to arrive at another column called 'page_category'.
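For reference, dict_dataset_1 is not shown above; a minimal sketch of how it might be built (assuming dataset-1 is small enough to collect to the driver and is held in a DataFrame named dataset_1):
# hypothetical construction of the lookup dict from dataset-1
dict_dataset_1 = {row['page_click']: row['page_category'] for row in dataset_1.collect()}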
final dataset :
id | page_click | Page_category
-------------------------------
1 | facebook |social_network
2 |Coursera |educational
But this operation is taking too much time to complete, more than 4-5 minutes. Is there another way to speed it up?
Thank you
Implement a simple broadcast join:
from pyspark.sql.functions import broadcast
# df1 = dataset-1 (the small lookup table), df2 = dataset-2
df2.join(broadcast(df1), df2.page_click == df1.page_click, 'left')\
   .select(df2.id, df2.page_click, df1.page_category).show()
+---+----------+--------------+
| id|page_click| page_category|
+---+----------+--------------+
| 1| facebook|Social_network|
| 2| coursera| educational|
+---+----------+--------------+
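Whether Spark would broadcast df1 automatically depends on its estimated size; the explicit broadcast() hint above forces it regardless. The relevant setting, shown here only as a sketch (assuming a SparkSession named spark), is spark.sql.autoBroadcastJoinThreshold:
# allow automatic broadcast of tables up to ~50 MB (the explicit hint works either way)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)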

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into comma-separated strings, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it's possible for me to not have to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty.
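A minimal sketch of that fix, assuming the collated records arrive as a list of dicts named records (a hypothetical name) whose first entry repeats the header values:
for data in records[1:]:                      # skip the header-like first entry
    resultData.loc[len(resultData)] = data    # append each record as a new row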
