How to generate an md5 hash of a column in a pandas dataframe [duplicate] - python

This question already has an answer here:
Hash each row of pandas dataframe column using apply
(1 answer)
Closed 2 years ago.
I have a dataframe df with these columns:
Index(['learner_assignment_xid', 'assignment_xid', 'assignment_attempt_xid',
'learner_xid', 'section_xid', 'final_score_unweighted',
'attempt_score_unweighted', 'points_possible_unweighted',
'scored_datetime', 'gradebook_category_weight', 'status', 'is_deleted',
'is_scorable', 'drop_state', 'is_manual', 'created_datetime',
'updated_datetime'],
dtype='object')
I want to add a new column to this df called checksum, which will concatenate some of these columns and compute an md5 hash of the result.
I am trying this:
df_gradebook['updated_checksum']=df_gradebook['final_score_unweighted'].astype(str)+df_gradebook['attempt_score_unweighted'].astype(str)+df_gradebook['points_possible_unweighted'].astype(str)+df_gradebook['scored_datetime'].astype(str)+df_gradebook['status'].astype(str)+df_gradebook['is_deleted'].astype(str)+df_gradebook['is_scorable'].astype(str)+df_gradebook['drop_state'].astype(str)+df_gradebook['updated_datetime'].astype(str)
The part I am struggling with is the hash. How do I apply md5 after the concatenation is done?
I can do this in Spark Scala like this:
.withColumn("update_checksum",md5(concat(
$"final_score_unweighted",
$"attempt_score_unweighted",
$"points_possible_unweighted",
$"scored_datetime",
$"status",
$"is_deleted",
$"is_scorable",
$"drop_state",
$"updated_datetime"
)))
I wanted to know how I can do md5 in Python.

import hashlib

df_gradebook['concat'] = (
    df_gradebook['final_score_unweighted'].astype(str)
    + df_gradebook['attempt_score_unweighted'].astype(str)
    + df_gradebook['points_possible_unweighted'].astype(str)
    + df_gradebook['scored_datetime'].astype(str)
    + df_gradebook['status'].astype(str)
    + df_gradebook['is_deleted'].astype(str)
    + df_gradebook['is_scorable'].astype(str)
    + df_gradebook['drop_state'].astype(str)
    + df_gradebook['updated_datetime'].astype(str)
)
df_gradebook['digest'] = df_gradebook['concat'].apply(lambda x: hashlib.md5(x.encode()).hexdigest())
Don't do everything in a single line, it makes it harder to read.
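If you'd rather not repeat .astype(str) nine times, here is a minimal sketch of the same idea driven by a column list (assuming the df_gradebook frame from the question); .agg(''.join, axis=1) joins the stringified row values before hashing, with no intermediate 'concat' column:
import hashlib
import pandas as pd

checksum_cols = [
    'final_score_unweighted', 'attempt_score_unweighted',
    'points_possible_unweighted', 'scored_datetime', 'status',
    'is_deleted', 'is_scorable', 'drop_state', 'updated_datetime',
]

# Cast every column once, join each row into a single string, then hash it.
df_gradebook['updated_checksum'] = (
    df_gradebook[checksum_cols]
    .astype(str)
    .agg(''.join, axis=1)
    .apply(lambda s: hashlib.md5(s.encode()).hexdigest())
)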


Why does df.loc not seem to work in a loop (key error) [duplicate]

This question already has answers here:
Why do I get an IndexError (or TypeError, or just wrong results) from "ar[i]" inside "for i in ar"?
(4 answers)
How to iterate over rows in a DataFrame in Pandas
(31 answers)
Closed 3 months ago.
Can anyone tell me why df.loc can't seem to work in a loop like so:
import pandas as pd

example_data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'score': [10, 20, 30, 40, 50, 60]
}
example_data_df = pd.DataFrame(example_data)
for row in example_data_df:
    print(example_data_df.loc[row, 'ID'])
and raises the error "KeyError: 'ID'"?
Outside of the loop, this works fine:
row = 1
print(example_data_df.loc[row, 'ID'])
I have been trying different versions of this, such as example_data_df['ID'].loc[row], and tried to see if the problem is with the type of object in the columns, but nothing worked.
Thank you in advance!
EDIT: In case it plays a role, here is why I think I need to use the loop: I have two dataframes, A and B, and I need to append certain columns from B to A, but only for those rows where A and B have a matching value in a particular column. B is longer than A, and not all rows in A are contained in B. I don't know how this would be possible without looping; that would be another question I might ask separately.
If you check 'row' at each step, you'll notice that iterating directly over a DataFrame yields the column names.
You want:
for idx, row in example_data_df.iterrows():
    print(example_data_df.loc[idx, 'ID'])
Or, better:
for idx, row in example_data_df.iterrows():
    print(row['ID'])
Now, I don't know why you want to iterate manually over the rows, but know that this should be limited to small datasets as it's the least efficient method of working with a DataFrame.
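For the use case in the EDIT, a row loop is usually unnecessary: a left merge keeps every row of A and pulls in the matching columns from B. A minimal sketch, assuming hypothetical frames A and B with a shared 'key' column and a 'score' column to bring over:
import pandas as pd

# Hypothetical data for illustration; 'key' is the matching column.
A = pd.DataFrame({'key': [1, 2, 3], 'a_val': ['x', 'y', 'z']})
B = pd.DataFrame({'key': [1, 3, 4, 5], 'score': [10, 30, 40, 50]})

# Left merge: every row of A is kept; rows with no match in B get NaN.
result = A.merge(B[['key', 'score']], on='key', how='left')
print(result)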

Adding rows and duplicating values in a Pandas based on a list of duplicates [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
So here's my daily challenge:
I have an Excel file containing a list of streets, and some of those streets will be doubled (or tripled) based on their road type. For instance:
In another Excel file, I have the street names (without duplicates) and their mean distances between features, such as this:
Both Excel files have been converted to pandas dataframes like so:
duplicates_df = pd.DataFrame()
duplicates_df['Street_names'] = street_names
dist_df=pd.DataFrame()
dist_df['Street_names'] = names_dist_values
dist_df['Mean_Dist'] = dist_values
dist_df['STD'] = std_values
I would like to find a way to append the values of mean distance and STD multiple times in duplicates_df whenever a street has more than one occurrence, but I am struggling with the proper syntax. This is probably an easy fix, but I've never done this before.
The desired output would be:
Any help would be greatly appreciated!
Thanks again!
pd.merge(duplicates_df, dist_df, on="Street_names")
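To see why this one-liner is enough: merge performs a many-to-one join, so every duplicate street name in duplicates_df picks up its matching Mean_Dist and STD from dist_df. A small sketch with made-up street data:
import pandas as pd

# Made-up data for illustration.
duplicates_df = pd.DataFrame({'Street_names': ['Oak', 'Oak', 'Elm', 'Pine']})
dist_df = pd.DataFrame({
    'Street_names': ['Oak', 'Elm', 'Pine'],
    'Mean_Dist': [12.5, 8.0, 20.1],
    'STD': [1.2, 0.5, 3.3],
})

merged = pd.merge(duplicates_df, dist_df, on='Street_names')
print(merged)
# 'Oak' appears twice in the output, carrying the same Mean_Dist and STD both times.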

I have mixed values (string + float) in a dataframe column; how can I change them to object? [duplicate]

This question already has answers here:
Convert columns to string in Pandas
(9 answers)
Closed 4 years ago.
I have a data frame where a few of the columns have mixed-type values (string + float), and when I write them with to_excel I get a warning notification in Excel.
This is the dataframe:
df = placementsummaryfinalnew.loc[:,["Placement# Name","PRODUCT","Engagements Rate",
"Viewer CTR","Engager CTR","Viewer VCR",
"Engager VCR","Interaction Rate","Active Time Spent"]]
I tried to convert them using a few lines:
placementvdxsummaryfirst["Viewer VCR"] = placementvdxsummaryfirst["Viewer VCR"].astype(object)
It's not working.
Then I tried this:
placementvdxsummaryfirst["Viewer VCR"] = placementvdxsummaryfirst["Viewer VCR"].astype(float)
It's giving an error.
Then I tried this:
placementvdxsummaryfirst['Viewer VCR'] = pd.to_numeric(placementvdxsummaryfirst['Viewer VCR'],errors='coerce')
It's working, but it's replacing the "N/A" values with blanks, which I don't want.
Kindly help.
Try:
placementvdxsummaryfirst["Viewer VCR"] = placementvdxsummaryfirst["Viewer VCR"].astype(str)
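A quick sketch of why this works, with made-up values: astype(str) renders every cell, including the floats, as a string, so the column has a single uniform dtype and the "N/A" entries are left untouched:
import pandas as pd

# Made-up mixed column for illustration.
df = pd.DataFrame({'Viewer VCR': [0.42, 'N/A', 0.77]})
df['Viewer VCR'] = df['Viewer VCR'].astype(str)
print(df['Viewer VCR'].tolist())  # ['0.42', 'N/A', '0.77'] -- all strings, 'N/A' preserved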

Using the format function to name columns [duplicate]

This question already has an answer here:
Renaming columns when using resample
(1 answer)
Closed 5 years ago.
The line of code below takes columns that represent each month's total sales and averages the sales by quarter.
mdf = tdf[sel_cols].resample('3M',axis=1).mean()
What I need to do is title the columns with a str (I cannot use the pandas Period function).
I am attempting to use the following code, but I cannot get it to work.
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year, [1, 2, 3, 4][x.quarter==1]))
I want the columns to read... 2000q1, 2000q2, 2000q3, 2000q4, 2001q1,... etc, but keep getting wrong things like 2000q1, 2000q1, 2000q1, 2000q2, 2001q1.
How can I use the .format function to make this work properly?
The easiest way is to use the quarter attribute of each datetime column label, like so:
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year,x.quarter))
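A minimal sketch of what the lambda does, using made-up quarter-end timestamps like the ones resample('3M') produces as column labels:
import pandas as pd

# Made-up quarter-end labels for illustration.
labels = pd.to_datetime(['2000-03-31', '2000-06-30', '2000-09-30',
                         '2000-12-31', '2001-03-31'])
print(['{:}q{:}'.format(x.year, x.quarter) for x in labels])
# ['2000q1', '2000q2', '2000q3', '2000q4', '2001q1']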

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key in the dictionary, 'merged', and use the pd.merge method to merge the 4 existing dataframes according to their timestamp (I want complete rows, so the 'inner' join method will be appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
    timestamp   nxt_cap
0  2013-12-04  15091900
1  2013-12-05  14936300
2  2013-12-06  11237100
3  2013-12-07   7031430
4  2013-12-08   6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not first set to data2['dogecoin'] (or some similar data), then the merge won't work, as the values are nonexistent in 'merged'.
EDIT: my desired result is to create one merged dataframe, stored as a new element in the dictionary data2 (data2['merged']), containing the merged dataframes from the other elements of data2.
Try seeding 'merged' with an actual named dataframe and then merging the rest onto it; you must begin with at least a first one:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
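As a sketch of the same first-element-is-special pattern without an explicit loop (assuming the data2 dict and coins list from the question), functools.reduce folds the pairwise merges across the list:
import pandas as pd
from functools import reduce

# Inner-join every coin frame on 'timestamp'; reduce handles the
# "seed with the first frame, merge in the rest" pattern for us.
data2['merged'] = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp'),
    (data2[coin] for coin in coins),
)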
