loop through pandas columns inside function - python

I have following function:
def match_function(column):
df_1 = df[column].str.split(',', expand=True)
df_11=df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
match = df_11.iloc[:, 0][0]/df_11.sum(axis=1)*100
df[column] = match
return match
this functuion only works if I enter specific column name
how to change this function in the way, if I pass it a certain dataframe, it will loop through all of its columns automatically. so I won't have to enter each column separately?
ps. I know the function it self written very poorly, but im kinda new to coding, sorry

You need to wrap the function so that it does this iteratively over all columns.
If you add this to your code then it'll iterate over the columns while returning the match results in a list (as you will have multiple results as you're running over multiple columns).
def match_over_dataframe_columns(dataframe):
return [match_function(column) for column in dataframe.columns]
results = match_over_dataframe_columns(df)

Instead of inputting column to your function, input the entire dataframe. Then, cast the columns of the df to a list and loop over the columns, performing your analysis on each column. For example:
def match_function(df):
columns = df.columns.tolist()
matches = {}
for column in columns:
#do your analysis
#instead of returning match,
matches[column] = match
return matches
This will return a dictionary with keys of your columns and values of the corresponding match value.

just loop through the columns
def match_function(df):
l_match = []
for column in df.columns:
df_1 = df[column].str.split(',', expand=True)
df_11=df_1.apply(lambda s: s.value_counts(), axis=1).fillna(0)
match = df_11.iloc[:, 0][0]/df_11.sum(axis=1)*100
df[column] = match
l_match.append(match)
return l_match

Related

Selecting specific columns in where condition using Pandas

I have a below Dataframe with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
In Below code, I get Processdate based on Processid==204 (without Column Names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I wan the same result for Two columns at once without column names, Something like below code:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get above result but I dont want Column names, Index and data type.
Can someone help?
I think you are looking for header argument in to_string parameters. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)

call dataframe from list of dataframes python

I have a use case where i have an unknown list of dfs that are generated from a groupby. The groupby is contained in a list. Once the groupby gets done, a unique df is created for each iteration.
I can dynamically create a list and dictionary of the dataframe names, however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. data and Code below:
Code so far:
data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']];
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns = ['PAT_ID', 'sex', 'race_ethnicity','existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
def sex_race_summary(groupbyColumns):
grouplist = groupbyColumns[grouping].copy();
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index();
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
return df;
for grouping in range(len(groupbyColumns)):
exec(f'df_{grouping} = sex_race_summary(groupbyColumns)');
print(df_0)
# create a dictionary
dict_of_dfs = dict();
# a list of the dataframes
df_names = [];
for i in range(len(groupbyColumns)):
df_names.append('df_'+ str(i))
print (df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index();
print(i)
The index of the list matches the suffix of the dataframe.
But i is being passed as a string and thus throwing an error. The reason I need to build it this way and not hard code the dataframes is because I will not know how many df_x will be created
Thanks for your help!
Update based on R Y A N comment:
Thank you so much for this! you actually gave me another idea, so i made some tweeks: this is my final code that results in what I want: a summary and transposed table for each iteration. However, I am learning that globals() are bad practice and I should use a dictionary instead. How could I convert this to a dictionary based process? #create summary tables for each pair of group by columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
for grouping in range(len(groupbyColumns)):
grouplist = groupbyColumns[grouping].copy();
prefix = grouplist[0];
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0);
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
#transpose the dataframe from long to wide
df_transposed = df.pivot_table(index= df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index().fillna(0);
# create new df with the index as suffix
globals()['%s_summary'%prefix] = df;
globals()['%s_summary_transposed' %prefix] = df_transposed;
You can use the globals() method to index the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
# call globals() to return a dictionary of global vars and index on i
i_df = globals()[i]
# changing the index= argument to indirectly reference the first column since it's 'sex' in df_0 but 'race_ethnicity' in df_1
df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum')
print(df_transposed)
However with just this line added, the .pivot_table() function gives an index error since the 'sex' column exists only in df_0 and not df_1 in df_names. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]) in the .pivot_table() method to handle the different column names of the dfs coming through the loop

Apply function to specific row in dataframe

I have a dataframe of which I want to clean up a specific row, in this case the first row. I have written a function in which I believe will return the string if the regex is matched.
def clean_cells(string):
if '201' in string:
return re.findall('201[0-9]', string)[0]
else:
return string
I want to apply this function to the first row in the dataframe then replace the first row with the cleaned up version then contatenate the rest of the original dataframe.
I have tried:
df = df.iloc[[0]].apply(clean_cells)
Select first row by position and all columns with : and assign back:
df.iloc[0, :] = df.iloc[0, :].apply(clean_cells)
Another solution:
df.iloc[0] = df.iloc[0].apply(clean_cells)

How to use Apply() and self defined function to change data in DataFrame?

What is the easiest way to make some changes in the index column of different rows in a DataFrame ?
def fn(country):
if any(char.isdigit() for char in country):
return country[:-2]
else:
return country
df.loc["Country"].apply(fn,axis=1)
I cant test now. Can you try: df['Country'] = df.apply(lambda row: fn(row),axis = 1) and change your function argument to take the row into account (like row['Country']). This way you can manipulate anything you want row by row using other column values.

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. What I want to do is for all the column names I would like to add back ticks(`) at the start of the column name and end of column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python. when we apply the code it should return a data frame.
Use list comprehension in python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
To add prefix or suffix:
Refer df.columns for list of columns ([col_1, col_2...]). This is the dataframe, for which we want to suffix/prefix column.
df.columns
Iterate through above list and create another list of columns with alias that can used inside select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using inside select, do not forget to unpack list with asterisk(*). We can assign it back to same or different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return list of new columns(aliased).
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
for c in sdf.columns:
sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
return sdf
You can amend sdf.columns as you see fit.
You can use withColumnRenamed method of dataframe in combination with na to create new dataframe
df.na.withColumnRenamed('testing user', '`testing user`')
edit : suppose you have list of columns, you can do like -
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
output :
DataFrame[`First`: string, `Last`: string, `Age`: string]
here is how one can solve the similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df])
I had a dataframe that I duplicated twice then joined together. Since both had the same columns names I used :
df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
list(df.schema.names)[idx] + '_prec'),
range(len(list(df.schema.names))),
df)
Every columns in my dataframe then had the '_prec' suffix which allowed me to do sweet stuff

Categories