Get a list of column headers based on string list - python

Problem: I have a dataframe whose column headers contain variations of several strings: 'Fee_code', 'zip_code', etc., and also others such as 'street_address', 'violation_street_address', etc.
Expected Outcome: A list with all the column headers that match the keywords: Fee, address, code, name, and possibly others depending on the specific file I'll work on. Note that I DO want to keep the 'agency_name' column header.
Solution: I came up with this function to list all of the column headers matching the strings above (and a few more):
def drop_cols(df):
    list1 = list(df.filter(like='nam', axis=1))
    list1.remove('agency_name')
    list2 = list(df.filter(like='add', axis=1))
    list3 = list(df.filter(like='fee', axis=1))
    list4 = list(df.filter(like='code', axis=1))
    list5 = list(df.filter(like='status', axis=1))
    entry = list1 + list2 + list3 + list4 + list5
    return entry
Challenge: This code works, but it's bulky and I'm wondering if there is a better way to achieve the same result.
Sample of column headers: ['ticket_id', 'agency_name', 'inspector_name', 'violator_name', 'violation_street_number', 'violation_street_name', 'violation_zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'city', 'state', 'zip_code', 'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date', 'violation_code', 'violation_description', 'disposition', 'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due', 'payment_date', 'payment_status', 'collection_status', 'grafitti_status', 'compliance_detail', 'compliance']

One way you could go about it:
# create a search pattern of relevant terms
search = '|'.join(['fee', 'address', 'code', 'name'])
# use the filter method in pandas with the regex option,
# then drop the 'agency_name' column
# d is the dataframe
d.filter(regex=search, axis=1).drop('agency_name', axis=1)
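For completeness, here is a minimal, self-contained sketch against a few of the sample headers above. One thing to watch: filter's regex match is case-sensitive, so an inline (?i) flag (my own addition, not part of the original answer) also catches capitalised variants such as 'Fee_code':
import pandas as pd

# empty frame with a handful of the sample headers, just to illustrate
d = pd.DataFrame(columns=['ticket_id', 'agency_name', 'violation_zip_code',
                          'mailing_address_str_name', 'admin_fee', 'Fee_code'])

search = '(?i)' + '|'.join(['fee', 'address', 'code', 'name'])  # case-insensitive match
cols = d.filter(regex=search, axis=1).drop('agency_name', axis=1).columns.tolist()
print(cols)
# ['violation_zip_code', 'mailing_address_str_name', 'admin_fee', 'Fee_code']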

Related

How to make dictionary of column names in PySpark?

I am receiving files and for some files columns are named differently.
For example:
In file 1, column names are: "studentID" , "ADDRESS", "Phone_number".
In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".
After loading the file data into dataframes, I want to pass a dictionary that defines mappings like:
StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID
ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS -> S_ADDRESS
The reason for doing this is that my next dataframe reads column names like "STUDENT_ID" and "S_ADDRESS", and if those names are not found it will throw an error for files whose columns are not standardized. I want to run my dataframe and get values from those files after renaming them as in the mapping above. One more question: when I run the new dataframe, will it pick the column name from the dictionary that has data in it?
You can have the dictionary as you want and use toDF with a list comprehension in order to rename the columns.
Input dataframe and column names:
from pyspark.sql import functions as F
df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
print(df.columns)
# ['Common_ID', 'ADDRESS', 'COL3']
Dictionary and toDF:
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS'
}
df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])
Resultant column names:
print(df.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']
Use a dict and a list comprehension. An easier way, which also works even if some of the columns are not in the dictionary, is:
df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns ]).show()
+----------+---------+----+
|STUDENT_ID|S_ADDRESS|COL3|
+----------+---------+----+
+----------+---------+----+
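If you prefer not to rebuild all column names in one toDF call, an equivalent sketch (my own variant, not from the answers above) chains withColumnRenamed over the same dict_cols; columns missing from the dictionary are left untouched:
from functools import reduce

df_renamed = reduce(
    lambda acc, c: acc.withColumnRenamed(c, dict_cols.get(c, c)),
    df.columns,
    df
)
print(df_renamed.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']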

melt some values of a column to a separate column

I have a dataframe like below:
But I want to have a new data frame where the state is a separate column, as below:
Do you know how to do it using Python? Thank you so much.
If you provide an example dataset it would be helpful and we can work on it. I created an example dataset like the table below:
The numbers were chosen randomly.
I am not sure if there is an easy way. You should put all your states in a list beforehand. The main idea behind my approach is detecting the empty rows between the states: the first string after an empty row is the state name, and that name is filled down until the next empty row is reached. (Since there might be another non-state name, like 'United States', that also follows an empty row, we create the states list beforehand to avoid mistakes.)
Here is my approach:
import pandas as pd
import numpy as np

data = pd.read_excel("data.xlsx")
states = ["Alabama", "Alaska", "Arizona"]
data['states'] = np.nan  # creating the 'states' column
flag = ""
for index, value in data['location'].items():
    if pd.notnull(value):
        if value in states:
            flag = value
        data.loc[index, 'states'] = flag

# relocating the 'states' column to the second position in the dataframe
column = data.pop('states')
data.insert(1, 'states', column)
And the result:
Well, let's say we have this data:
data = {
    'County': [
        ' USA',
        '',
        ' Alabama',
        'Autauga',
        'Baldwin',
        'Barbour',
        '',
        ' Alaska',
        'Aleutians',
        'Anchorage',
        '',
        ' Arizona',
        'Apache',
        'Cochise'
    ]
}
df = pd.DataFrame(data)
We could use empty lines as marks of a new state like this:
spaces = (df.County == '')
states = spaces.shift().fillna(False)
df['States'] = df.loc[states, 'County']
df['States'].ffill(inplace=True)
In the code above, states is a mask of the cells directly under empty lines, which is where the state names sit. In the next step we assign those values, aligned by index, to the new column. After that we apply a forward fill of NaN values, which repeats each state until the next one appears.
Additionally we could do some cleaning. This, IMO, would be more relevant somewhat earlier, but anyway:
df['States'] = df['States'].fillna('')
df.loc[spaces, 'States'] = ''
df.loc[states, 'States'] = ''
This method relies on the structure having empty rows between states. Let's try something different in case there are no empty rows between states. Say we have data like this (no empty rows, no spaces around names):
data = [
    'USA',
    'Alabama',
    'Autauga',
    'Baldwin',
    'Barbour',
    'Alaska',
    'Aleutians',
    'Anchorage',
    'Arizona',
    'Apache',
    'Cochise'
]
df = pd.DataFrame(data, columns=['County'])
We can work with a list of known states and pandas.Series.isin in this case. All the other logic can stay the same:
States = ['Alabama','Alaska','Arizona',...]
mask = df.County.isin(States)
df = df.assign(**{'States':df.loc[mask, 'County']}).ffill()
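For reference, a sketch of the first few rows this should produce on the sample list above (my own run-through; the 'USA' row keeps NaN because there is no state above it):
print(df.head(5))
#     County   States
# 0      USA      NaN
# 1  Alabama  Alabama
# 2  Autauga  Alabama
# 3  Baldwin  Alabama
# 4  Barbour  Alabama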

I have a comma-separated list of email ids in a column in python. I want to extract a unique list of domain names (sorted) in a new column

I have a column in a Python data frame containing comma-separated lists of email ids. I want to extract the unique list of domain names, sorted in alphabetical order.
Email Ids                                             Required Output
jgj#myu.com                                           myu.com
abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com    gmail.com, svc.com, yyy.com
zya#try.com,abs#cba.com                               cba.com, try.com
I tried the following code, however it returns the output of the first row for all rows.
def Dom1(lpo):
    mylist1 = []
    for i in lpo:
        domain = str(i).split("#")[1]
        domain1 = domain.replace('>', '')
        domain1 = domain1.replace(']', " ")
        if domain1 not in mylist1:
            mylist1.append(domain1)
            mylist1 = sorted(mylist1, key=str.lower)
    return mylist1

df['Email_Id1'] = df.apply(lambda row: Dom1(df['Email_Id']), axis=1)
How to fix this issue?
I assume that the column Email_Id contains lists of email ids.
Here is how your dataframe should look. All the values should be lists, even if a list has only one item. I have a feeling that a single email is not being stored as a list of strings, and this is probably your source of error.
df = pd.DataFrame({'Email_Id': [['jgj#myu.com'],
                                ['abc#gmail.com', 'lll#yyy.com', 'xyz#svc.com', 'abc#yyy.com'],
                                ['zya#try.com', 'abs#cba.com']]})
df
df
Initial Dataframe
And then with a few minor changes and cleanup here is how you can apply the lambda function.
Apply it to only a series instead of the whole dataframe.
Also I am not sure why you are calling domain.replace('>', '') and domain1.replace(']', ' '); domain names should not contain such characters.
You don't need to sort after every insertion. Just sort it while returning the list as it will be called only once.
Change your variable names so that they make sense.
You could use a python set, but if you do not have a lot of emails in a single row, a list should do just fine
def get_domain(emails):
    domains = []
    for email in emails:
        d = str(email).split("#")[1]
        if d not in domains:
            domains.append(d)
    return sorted(domains, key=str.lower)

df['Email_Id1'] = df['Email_Id'].apply(lambda x: get_domain(x))
df
Final Dataframe
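As mentioned above, a Python set could be used instead of a list; a minimal sketch of that variant (same assumed input format, with '#' standing in for the at-sign as in the question):
def get_domain_set(emails):
    # a set deduplicates automatically; sort case-insensitively when returning
    domains = {str(email).split("#")[1] for email in emails}
    return sorted(domains, key=str.lower)

df['Email_Id1'] = df['Email_Id'].apply(get_domain_set)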
I would simply do a one-liner here:
df["domains"]=df["emails"].apply(lambda row: [ email[email.find("#")+1:] for email in row]).apply(sorted)
import re

col1 = ['jgj#myu.com', 'abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com', 'zya#try.com,abs#cba.com']
df1 = pd.DataFrame({'Email Ids': col1})

def getUniqueEmail(st1):
    result_obj = {}
    for i in st1.split(','):
        domain = re.sub('^.+#', '', i)  # strip everything up to and including the '#'
        if domain not in result_obj:
            result_obj[domain] = 1
    return ','.join(sorted(result_obj.keys(), key=str.lower))

df1['Required output'] = df1['Email Ids'].apply(lambda x: getUniqueEmail(x))
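A quick check of what this produces for the sample col1 above (my own run-through of the code):
print(df1['Required output'].tolist())
# ['myu.com', 'gmail.com,svc.com,yyy.com', 'cba.com,try.com']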

Python Pandas Pivot Table group by match_id

I have this dataframe example:
match_id, map_type, server and duration_minutes are common variables of a match. In this example we have 5 different matches.
profile_id, country, rating, color, team, civ, won are specific variables for every player that played this specified match.
How can I obtain a new dataframe with this structure?
match_id, map_type, server, duration_minutes, profile_id_player1, country_player1, rating_player1, color_player1, team_player1, civ_player1, won_player1, profile_id_player2, country_player2, rating_player2, color_player2, team_player2, civ_player2, won_player2?
Only one row by match_id with all specific variables for every player.
EDIT: This is the result from #darth baba's solution, almost done:
Thank you in advance.
First group by match_id, then aggregate all the other columns into lists, and then expand those lists into columns. To achieve that, try this:
df = df.groupby(['match_id', 'map_type', 'server', 'duration_minutes'])[['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won']].agg(list)
df = pd.concat([df[i].apply(pd.Series).set_index(df.index) for i in df.columns], axis=1).reset_index()
# Rename the columns accordingly
df.columns = [ 'match_id', 'map_type', 'server', 'duration_minutes', 'profile_id_player1', 'country_player1', 'rating_player1', 'color_player1', 'team_player1', 'civ_player1', 'won_player1', 'profile_id_player2', 'country_player2', 'rating_player2', 'color_player2', 'team_player2', 'civ_player2', 'won_player2']
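An alternative sketch (my own, not part of the answer above): number the players within each match with cumcount, then pivot to one row per match. Here df_long is a hypothetical name for the original per-player dataframe (one row per player per match), and this relies on a reasonably recent pandas that accepts a list for pivot's index:
# number players 1, 2, ... within each match
df_long['player'] = df_long.groupby('match_id').cumcount() + 1
wide = df_long.pivot(index=['match_id', 'map_type', 'server', 'duration_minutes'],
                     columns='player',
                     values=['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won'])
# flatten the (variable, player number) MultiIndex into names like 'country_player1'
wide.columns = [f'{col}_player{num}' for col, num in wide.columns]
wide = wide.reset_index()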

Save data frame from inside for loop

I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized=True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np

np.random.seed(12)

def transforming_data(df, col_1, col_2, normalized=True):
    ''' takes in df, groups col_1 according to col_2 and returns df '''
    df = df[col_1].groupby(df[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df

numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
                          'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
                          'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
                          'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})

test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1 Don't know Maybe No Yes
# Category
# Anaconda 0.275229 0.232416 0.217125 0.275229
# Elephant 0.220588 0.270588 0.255882 0.252941
# Panda 0.258258 0.222222 0.273273 0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this - but cannot figure out how to save the data frame
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you rather be able to access these temp_df's by the column name that was used to transform them, then you could assign each to a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
Here's the to_csv() docs.
The simplest solution would be to save the result dataframes into a list. Assuming that all columns you want to loop over contain the text Response in their name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively you can also obtain the exact same result with a list comprehension instead of a for-loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]
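Combining the two ideas above (my own sketch, not part of either answer): if you want the results keyed by column name rather than by position, a dict comprehension works the same way:
result_dframes = {
    col_name: transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response').columns
}
# e.g. result_dframes['Response 1'] reproduces the table printed earlier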
