I am trying to automatically normalize column names to a single canonical term when loading a DataFrame. The following code works:
import pandas as pd
df=pd.read_csv('Test.csv', encoding = "ISO-8859-1", index_col=0)
firstCol = ['FirstName', 'First Name', 'Nombre', 'NameFirst', 'Name', 'Given name', 'given name']
df.rename(columns={typo: 'First_Name' for typo in firstCol}, inplace=True)
addressCol=['Residence','Primary Address', 'primary address' ]
df.rename(columns={typo: 'Address' for typo in addressCol}, inplace=True)
computerCol=['Laptop','Desktop', 'server', 'mobile' ]
df.rename(columns={typo: 'Computer' for typo in computerCol}, inplace=True)
Is there a more efficient way of looping or rewriting it so it is less redundant?
The only way I can think of is to reduce it to a single df.rename call by building the complete mapping dictionary once, e.g.:
replacements = {
    'First_Name': ['FirstName', 'First Name', 'Nombre', 'NameFirst', 'Name', 'Given name', 'given name'],
    'Address': ['Residence', 'Primary Address', 'primary address'],
    # ...
}
df.rename(columns={el: k for k, v in replacements.items() for el in v}, inplace=True)
It should be marginally more efficient in terms of function-call overhead, but mainly I'd view it as more readable: a dict whose keys are the "to" values and whose values are the lists of "from" names to replace.
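A minimal, self-contained sketch of the one-pass rename (the column names here are just illustrative):
import pandas as pd
replacements = {
    'First_Name': ['FirstName', 'First Name', 'Nombre'],
    'Address': ['Residence', 'Primary Address'],
}
# Invert {canonical: [aliases]} into {alias: canonical} and rename once
mapping = {alias: canonical for canonical, aliases in replacements.items() for alias in aliases}
df = pd.DataFrame(columns=['Nombre', 'Residence', 'Other'])
df = df.rename(columns=mapping)
print(df.columns.tolist())  # ['First_Name', 'Address', 'Other']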
Original and expected result: [table screenshots omitted]
Task:
I am trying to merge the values of the 'url' column into one row whenever the other column ('Full Path') has the same name, using Python in a Jupyter notebook.
I have tried using groupby, but it doesn't give me the result I want.
Code:
df.groupby("Full Path").apply(lambda x: ", ".join(x)).reset_index()
The result is not what I am expecting (output screenshot omitted).
The reason it is not working is that you need to modify the 'Full Path' column before passing it to groupby, since the full paths differ (the file rows carry extra path segments).
Based on the sample here the following should work:
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')  # keep only the first two path segments
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})          # join urls with a placeholder
test['url'] = test['url'].str.replace("Next", "\n")                      # turn the placeholder into a newline
This code of course assumes that the grouping you want occurs in the first two segments of the full path. The \n will disappear when you write the df out to Excel.
NOTE: Unless the Type and Date fields are all the same value within a group, you cannot include them in the groupby. If, for example, you did groupby(['Full Path', 'Type', 'Date']), not all the links for an individual path+folder combination would end up aggregated together. If you want Type and Date included as newline-separated columns like url, add them to the agg statement and apply the same replace to them, as sketched below.
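A sketch of that variant, using the same placeholder trick (this is an illustration, not part of the original code):
test = df.groupby(by=['Full Path']).agg({
    'url': ', Next'.join,
    'Type': ', Next'.join,
    'Date': ', Next'.join,
})
for col in ['url', 'Type', 'Date']:
    test[col] = test[col].str.replace("Next", "\n")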
Code used for testing:
import pandas as pd
pd.options.display.max_colwidth = 999
data_dict = {
    'Full Path': [
        'downloads/Residences Singapore',
        'downloads/Residences Singapore/15234523524352',
        'downloads/Residences Singapore/41242341324',
    ],
    'Type': [
        'Folder',
        'File',
        'File',
    ],
    'Date': [
        '07-05-22 19:24',
        '07-05-22 19:24',
        '07-05-22 19:24',
    ],
    'url': [
        'https://www.google.com/drive/storage/345243534534522345',
        'https://www.google.com/drive/storage/523405923405672340567834589065',
        'https://www.google.com/drive/storage/90658360945869012141234',
    ],
}
df = pd.DataFrame(data_dict)
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
test
Output (screenshot omitted): a single row for 'downloads/Residences Singapore' with the three urls joined.
Just group by the 'Full Path' field and aggregate the url values with a comma separator.
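A minimal sketch of that suggestion (assuming 'Full Path' has already been normalized as in the answer above):
test = df.groupby('Full Path', as_index=False).agg({'url': ', '.join})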
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd@gmail.com', 'chris@gmail.com', np.nan, 'ben@gmail.com', np.nan, np.nan, 'joe@gmail.com', 'rick@gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate the personal information of the matching people and output it to a file.
E.g. if we look at most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
Note that the email column in the sample data is named ' email', with a leading space.
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, among the options when writing to csv, you can keep the index, but do not forget to reset it before writing.
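For example, a small illustrative variant (the file name is just an example):
# Keep the index but reset it so it is contiguous before writing
subset = df[df['most_visited_airport'] == 'Heathrow']['name'].dropna()
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv')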
I have the below JSON string in data. I want it to look like the Expected Result Below
import json
import pandas as pd
data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
Expected Result:
   Category_matchType Category_expression Action_matchType Action_expression Label_matchType Label_expression
0               EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following this example, I've tried using json_normalize and then using various forms of melt, stack, unstack, pivot, etc. But there has to be an easier way!
# This produces the result below, from which I can start using reshaping functions, but it seems messy
df = pd.json_normalize(data, 'eventConditions')
       type matchType expression
0  CATEGORY     EXACT        ABC
1    ACTION     EXACT        DEF
2     LABEL    REGEXP    GHI|JKL
We can use json_normalize to read the JSON data into a pandas dataframe, then use stack followed by unstack to reshape it:
df = pd.json_normalize(data, 'eventConditions')
# Index by (occurrence count within each type, type), stack the remaining
# columns, then unstack both the type and the attribute name into columns
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
  CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
If your data is not too large, you could process the JSON data first and then create the dataframe, like this:
import pandas as pd
import json
data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # Append in place; assigning the result of .append() would store None
                if col_name not in new_data:
                    new_data[col_name] = [event[key]]
                else:
                    new_data[col_name].append(event[key])
df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=['type'])                  # long format: type, variable, value
df['type'] = df['type'] + '_' + df['variable']  # build names like CATEGORY_matchType
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T                                       # transpose so the combined names become columns
I have a data frame with the following columns:
job_post.columns
Index(['Job.ID_list', 'Provider', 'Status', 'Slug', 'Title', 'Position',
'Company', 'City', 'State.Name', 'State.Code', 'Address', 'Latitude',
'Longitude', 'Industry', 'Job.Description', 'Requirements', 'Salary',
'Listing.Start', 'Listing.End', 'Employment.Type', 'Education.Required',
'Created.At', 'Updated.At', 'Job.ID_desc', 'text'],
dtype='object')
I want to select only the following columns from the dataframe:
columns_job_post = ['Job.ID_listing', 'Slug', 'Position', 'Company', 'Industry', 'Job.Description','Employment.Type', 'Education.Required', 'text'] # columns to keep
However, I get the result:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'
I solved the issue by writing:
jobs_final = job_post.reindex(columns = columns_job_post)
Similarly, I have a data frame with the following columns:
cand_exp.columns
Index(['Applicant.ID', 'Position.Name', 'Employer.Name', 'City', 'State.Name',
'State.Code', 'Start.Date', 'End.Date', 'Job.Description', 'Salary',
'Can.Contact.Employer', 'Created.At', 'Updated.At'],
dtype='object')
I also selected just some columns from the whole list using .loc, but I didn't get the KeyError: Passing list-like...
columns_cand_exp = ['Applicant.ID', 'Position.Name', 'Employer.Name', 'Job.Description', 'Salary']  # columns to keep
resumes_final = cand_exp.loc[:, columns_cand_exp]
What is the reason for this?
Thank you in advance!
Because in the first example you introduced column names that do not exist in the original data frame (e.g. 'Job.ID_listing', while the actual column is 'Job.ID_list').
In the second example, all the columns were already present in the original data frame.
As the error says: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'.
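A small sketch of the difference (toy frame with illustrative column names):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.reindex(columns=['a', 'missing'])  # 'missing' becomes a column of NaN
df.loc[:, ['a', 'missing']]           # raises KeyError: label not found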
I have a csv merge that has many columns. I am having trouble formatting price columns. I need them to follow this format: $1,000.00. Is there a function I can use to achieve this for just two columns (Sales Price and Payment Amount)? Here is my code so far:
df3 = pd.merge(df1, df2, how='left', on=['Org ID', 'Org Name'])
cols = ['Org Name', 'Org Type', 'Chapter', 'Join Date', 'Effective Date', 'Expire Date',
'Transaction Date', 'Product Name', 'Sales Price',
'Invoice Code', 'Payment Amount', 'Add Date']
df3 = df3[cols]
df3 = df3.fillna("-")
out_csv = root_out + "report-merged.csv"
df3.to_csv(out_csv, index=False)
A solution that I thought was going to work, but I get an error (ValueError: Unknown format code 'f' for object of type 'str'):
df3['Sales Price'] = df3['Sales Price'].map('${:,.2f}'.format)
Based on your error ("Unknown format code 'f' for object of type 'str'"), the columns that you are trying to format are being treated as strings, so using .astype(float) in the code below addresses this. One caveat: your code runs df3.fillna("-") first, so if either price column contained NaN it now holds the string "-", which astype(float) cannot parse; do the formatting before the fillna step.
There is not a great way to set this formatting during (within) your to_csv call. However, as an intermediate step you could use:
cols = ['Sales Price', 'Payment Amount']
# Convert to numeric first, then format each cell as $X,XXX.XX
df3.loc[:, cols] = df3[cols].astype(float).applymap('${:,.2f}'.format)
Then call to_csv.
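For example, reusing the out_csv path from earlier in the question:
df3.to_csv(out_csv, index=False)  # same write call as before, now with formatted prices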