rename values from df.columns - python

Suppose I have a dataframe whose columns are as shown below:
print(df.columns)
Index(["CustomerID", "CompanyName", "ContactName", "ContactTitle",
       "Address", "City", "Region", "PostalCode",
       "Country", "Phone", "Fax"],
      dtype='object')
I want to remove quotes from the values.
This is what I have tried:
for cols in list(df.columns):
    new_col = str(cols).replace('"', '').replace('"', '')
    cols.rename(index={cols: new_col})
Any solution would be helpful.

The column labels are plain strings; the quotes you see are just how print displays strings inside an Index. The values themselves contain no quote characters, so there is nothing to remove.
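You can confirm this with a quick check (a minimal sketch, assuming the df above):

print(df.columns[0])           # CustomerID  <- no quotes in the actual value
print('"' in df.columns[0])    # False
print(type(df.columns[0]))     # <class 'str'>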

Remap values in a Pandas column based on dictionary key/value pairs using RegEx in replace() function

I have the following Pandas dataframe:
foo = {
    "first_name": ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name": ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary": [101000, 50000, 56943, 330532, 92750],
}
df = pd.DataFrame(foo)
I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.
To do this, I use the following hard-coded RegEx patterns in the .replace() method:
df[['first_name']] = df[['first_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.NaN, regex=True)
This approach works, but I have 15-20 columns, so it is going to be difficult to maintain.
I'd like to set up a dictionary that looks as follows:
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}
Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:
df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?
For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?
Thanks!
P.S. if there is a more elegant approach, I'm all ears.
One approach would be to loop over the columns attribute and look each pattern up by column name:

regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}

for column in df.columns:
    # assumes every column of df has an entry in regex_patterns
    df[column] = df[column].replace(regex_patterns[column], np.NaN, regex=True)
You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace().
In your case:
d = {k: {v: np.nan} for k, v in regex_patterns.items()}
df.replace(d, regex=True)
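Putting the pieces together, a minimal self-contained sketch using the question's sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name": ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary": [101000, 50000, 56943, 330532, 92750],
})

regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}

# build {column: {pattern: replacement}} and clean every column in one call
d = {k: {v: np.nan} for k, v in regex_patterns.items()}
df = df.replace(d, regex=True)
print(df)

Since regex replacement only applies to string cells, the integer salary column should pass through unchanged.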

Rename Pandas columns with inplace

I want to rename a few columns. When I set inplace=False I am able to view the result. Why does setting inplace=True result in None? (I refreshed the runtime before running with inplace=True.)
ckd_renamed = ckd_data.rename(
    columns={"id": "Id",
             "age": "Age",
             "bp": "blood_pressure"},
    inplace=True)
The dataset that I use is
"https://raw.githubusercontent.com/dphi-official/Datasets/master/Chronic%20Kidney%20Disease%20(CKD)%20Dataset/ChronicKidneyDisease.csv"
Because when you set inplace=True, there is no new copy created, and ckd_data is modified, well, in-place.
The convention of returning None for in-place modifications stems from how e.g. Python's .sort() works.
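The same convention in plain Python, as a quick illustration:

lst = [3, 1, 2]
print(lst.sort())  # None: the list was sorted in place, nothing is returned
print(lst)         # [1, 2, 3]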
After
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
you'd continue to refer to the dataframe, now with modified column names, as ckd_data.
EDIT: If you're using Colab and you want the cell to also display the new value, you should be able to add an additional expression statement with just the name of the variable.
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
ckd_data

Remove duplicate values while group by in pandas data frame

Given the input data frame (posted as an image), I need the required output (also an image): one row per entity_label, with its entity_text values collected into a tuple.
I am able to get close using groupby, as df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text")), but duplicates are still there in the output tuple.
You can use SeriesGroupBy.unique() to get the unique values of entity_text before converting them to a tuple, as follows:
(df.groupby("entity_label", sort=False)["entity_text"]
   .unique()
   .apply(tuple)
   .reset_index(name="entity_text")
)
Result:
  entity_label                                                       entity_text
0    job_title  (Full Stack Developer, Senior Data Scientist, Python Developer)
1      country                                      (India, Malaysia, Australia)
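Note that sort=False keeps the groups in order of first appearance (job_title before country); with sort=True, as in the question's attempt, the rows would come back sorted by entity_label instead.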
Try this:
import pandas as pd

df = pd.DataFrame({
    'entity_label': ["job_title", "job_title", "job_title", "job_title",
                     "country", "country", "country", "country", "country"],
    'entity_text': ["full stack developer", "senior data scientiest", "python developer",
                    "python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],
})
df.drop_duplicates(inplace=True)
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
df.drop_duplicates().reset_index().drop(['index'], axis='columns')
output:
  entity_label                                        entity_text
0    job_title  full stack developer,senior data scientiest,py...
1      country                     Inida,Malaysia,India,Australia
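Here the first drop_duplicates() is what actually removes the repeated (entity_label, entity_text) pairs; after the transform makes every row within a group identical, the second drop_duplicates() collapses each group to a single row. (The misspellings "scientiest" and "Inida" come from the sample data itself.)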

Merge rows of a dataframe according to values in a specific column by concatenating the strings in the other columns into one

I have a dataframe that looks like this:
df1 = pd.DataFrame({
    "Business_Process_Activity": ["SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
    "Case": [1, 1, 2, 2, 2, 3, 4],
    "Application": ["MicrosoftWord", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftOutlook", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"],
    "Activity_of_the_User": ["SavingADocument", "SendingAnEmail", "SavingADocument", "SendingAnEmail", "SendingAnEmail", "SavingADocument", "SavingADocument"],
    "Receiver_email_root": ["None", "idatta91 adarandall larryjacob", "None", "idatta91 larryjacob", " vanessaHudgens prithakaur", "None", "None"],
    "Receiever_email_domains": ["None", "gmail yahoo", "None", "gmail", "gmail yahoo", "None", "None"],
    "Receiver_email_count_Catg": ["None", "Few", "None", "Double", "Double", "None", "None"],
    "Subject": ["None", "Activity Report", "None", "Project Progress Report", "Project Progress Report 2", "None", "None"]
})
I want to merge the rows of the dataframe according to the Case column: if two or more rows share the same case number, the strings in the other columns of those rows are concatenated into one row.
The value of Business_Process_Activity is also the same for all rows of a given case. For that column I do not want to concatenate the values but keep only one of them, since it needs to stay categorical. I want the final dataframe to look like this:
df2 = pd.DataFrame({
    "Case": [1, 2, 3, 4],
    "Business_Process_Activity": ["SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
    "Application": ["MicrosoftWord MicrosoftOutlook", "MicrosoftWord MicrosoftOutlook MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"],
    "Activity_of_the_User": ["SavingADocument SendingAnEmail", "SavingADocument SendingAnEmail SendingAnEmail", "SavingADocument", "SavingADocument"],
    "Receiver_email_root": ["idatta91 adarandall larryjacob", "idatta91 larryjacob vanessaHudgens prithakaur", "None", "None"],
    "Receiever_email_domains": ["gmail yahoo", "gmail gmail yahoo", "None", "None"],
    "Receiver_email_count_Catg": ["Few", "Double Double", "None", "None"],
    "Subject": ["Activity Report", "Project Progress Report Project Progress Report 2", "None", "None"]
})
If real strings are merged with a "None" cell, the "None" string should be dropped, since the cell is no longer empty. The duplicate case numbers disappear as the rows are merged into one.
How do I do this? Thanks in advance!
The idea is to remove None values and "None" strings within each group, join what remains, and finally replace empty strings with None:
df = (df1.groupby('Case')
         .agg(lambda x: ' '.join(x[x.ne('None') & x.notna()]))
         .where(lambda x: x.astype(bool), None)
         .reset_index())
Another solution, with a custom function:

def f(x):
    y = x[x.ne('None') & x.notna()]
    return None if y.empty else ' '.join(y)

df = df1.groupby('Case').agg(f).reset_index()
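Note that both variants also space-join Business_Process_Activity (yielding e.g. "SendingReportToManager SendingReportToManager" for case 1), whereas the desired output keeps a single value; the answer below handles that column separately with first().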
Use:
g = df1.groupby('Case')
df2 = g.agg(lambda s: ' '.join(s[s.ne('None')] if s.ne('None').any() else ['None']))
df2['Business_Process_Activity'] = g['Business_Process_Activity'].first()
df2 = df2.reset_index()
# print(df2)
   Case  Business_Process_Activity  ...  Receiver_email_count_Catg                                             Subject
0     1     SendingReportToManager  ...                        Few                                     Activity Report
1     2     SendingReportToManager  ...              Double Double  Project Progress Report Project Progress Report 2
2     3  PreparingAndSendingAgenda  ...                       None                                                None
3     4  PreparingAndSendingAgenda  ...                       None                                                None

extracting values by keywords in a pandas column

I have a column whose values are lists of dictionaries. I extracted only the values under the name key and saved them to a list. Since I need to feed the column to a TfidfVectorizer, each cell has to be a single string of words. My code is as follows.
import json

def transform(s, to_extract):
    return [obj[to_extract] for obj in json.loads(s)]

cols = ['genres', 'keywords']
for col in cols:
    lst = df[col]
    df[col] = list(map(lambda x: transform(x, to_extract='name'), lst))
    df[col] = [', '.join(x) for x in df[col]]
For testing, here are 2 rows:

data = {
    'genres': [[{"id": 851, "name": "dual identity"}, {"id": 2038, "name": "love of one's life"}],
               [{"id": 5983, "name": "pizza boy"}, {"id": 8828, "name": "marvel comic"}]],
    'keywords': [[{"id": 9663, "name": "sequel"}, {"id": 9715, "name": "superhero"}],
                 [{"id": 14991, "name": "tentacle"}, {"id": 34079, "name": "death", "id": 163074, "name": "super villain"}]]
}
df = pd.DataFrame(data)
I'm able to extract the necessary data and save it accordingly. However, I find the code too verbose, and I would like to know if there's a more Pythonic way to achieve the same outcome.
The desired output for one row is a single string delimited only by commas, e.g. 'dual identity,love of one's life'.
Is this what you need?
df.applymap(lambda x : pd.DataFrame(x).name.tolist())
Out[278]:
genres keywords
0 [dual identity, love of one's life] [sequel, superhero]
1 [pizza boy, marvel comic] [tentacle, super villain]
Update
df.applymap(lambda x : pd.DataFrame(x).name.str.cat(sep=','))
Out[280]:
genres keywords
0 dual identity,love of one's life sequel,superhero
1 pizza boy,marvel comic tentacle,super villain
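For comparison, a plain list-comprehension version (a minimal sketch, assuming each cell already holds a list of dicts rather than a JSON string):

for col in ['genres', 'keywords']:
    # pull the 'name' field out of every dict and comma-join per row
    df[col] = [','.join(d['name'] for d in cell) for cell in df[col]]

This avoids building a temporary DataFrame for every cell, which is what the applymap approach does.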
