Changing all the indexes of all dataframes in a dict - python

I have a dictionary with a list of dataframes, each with a column that is in datetime format (column name "Datetime Format"). I am attempting to set the index of each dataframe to be that column, and am having difficulty.
I've simplified the issue and tried to find a solution, but my technique is not sticking:
def test_func(dataframe):
    dataframe = dataframe.set_index('Datetime Format')
    return dataframe

test_dict = {'DF_1': df1, 'DF_2': df2}

for k, v in test_dict.items():
    v = test_func(v)
Upon looking at the resulting test_dict, or each individual dataframe (df1 and df2), I was not successful at setting the indexes to be the 'Datetime Format' column.
I know when I do:
df1 = df1.set_index('Datetime Format')
it works correctly.
Please advise as to how to get this to function through a list (or dict in this case).
Thank you!

The set_index function returns a new DataFrame by default, which is why your changes aren't sticking.
There are two ways around this: you could re-assign the dict value with the DataFrame returned by the function.
for k, v in test_dict.items():
    test_dict[k] = test_func(v)
Or you could pass the inplace argument when calling set_index.
def test_func(dataframe):
    # set_index with inplace=True returns None, so don't assign the result
    dataframe.set_index('Datetime Format', inplace=True)
This modifies the original DataFrame in place, without creating a new copy.
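For completeness, a minimal runnable sketch of the re-assignment approach, using a dict comprehension; df1 and df2 here are hypothetical frames standing in for your data:

import pandas as pd

# hypothetical frames; any DataFrames with a 'Datetime Format' column work
df1 = pd.DataFrame({'Datetime Format': pd.to_datetime(['2021-01-01', '2021-01-02']), 'val': [1, 2]})
df2 = pd.DataFrame({'Datetime Format': pd.to_datetime(['2021-02-01', '2021-02-02']), 'val': [3, 4]})

test_dict = {'DF_1': df1, 'DF_2': df2}

# rebuild the dict, binding each key to the re-indexed copy
test_dict = {k: v.set_index('Datetime Format') for k, v in test_dict.items()}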

Related

Pandas conditionally copy values from one column to another row

I have this Dataframe:
I would like to copy the value of the Date column to the New_Date column, but not only to the exact same row: I want it copied to every row that has the same User_ID value.
So, it will be:
I tried groupby and then copy, but groupby turned all the values into lists, and since other columns can hold different values in different rows for the same User_ID, it messed up many things.
I tried also:
df['New_Date'] = df.apply(lambda x: x['Date'] if x['User_ID'] == x['User_ID'] else x['New_Date'], axis=1)
But it only copied values to the same row and left the other two empty.
And this:
if (df['User_ID'] == df['User_ID']):
    df['New_Date'] = np.where(df['New_Date'] == '', df['Date'], df['New_Date'])
None accomplished my intention.
Help is appreciated, Thanks!
Try this:
df['New_Date'] = df.groupby('User_ID')['Date'].transform('first')
If I'm understanding you correctly, just copy the Date column and then forward-fill it with .ffill() (or equivalently .fillna(method='ffill')). If you post your data as text I can provide example code.
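A minimal runnable sketch of the transform approach, with made-up data standing in for the table missing from the question:

import pandas as pd

# hypothetical data: Date is filled in on only one row per User_ID
df = pd.DataFrame({
    'User_ID': [1, 1, 2, 2],
    'Date': ['2021-01-05', None, '2021-03-09', None],
})

# broadcast the first non-null Date of each User_ID group to every row of that group
df['New_Date'] = df.groupby('User_ID')['Date'].transform('first')
print(df)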

call dataframe from list of dataframes python

I have a use case with an unknown number of dfs that are generated from a groupby. The groupby columns are contained in a list, and each iteration of the groupby creates a distinct df.
I can dynamically create a list and a dictionary of the dataframe names; however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. Data and code below:
Code so far:
data_list = [['A','F','B','existing'], ['B','F','W','new'], ['C','M','H','new'], ['D','M','B','existing'], ['E','F','A','existing']]

# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns=['PAT_ID', 'sex', 'race_ethnicity', 'existing_new_client'])

groupbyColumns = [['sex', 'existing_new_client'], ['race_ethnicity', 'existing_new_client']]

def sex_race_summary(groupbyColumns):
    grouplist = groupbyColumns[grouping].copy()
    # aggregate unique patient counts for the current grouping
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index()
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    return df

for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)')

print(df_0)

# create a dictionary
dict_of_dfs = dict()

# a list of the dataframe names
df_names = []
for i in range(len(groupbyColumns)):
    df_names.append('df_' + str(i))

print(df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index()
    print(i)
The index of the list matches the suffix of the dataframe name. But i is passed as a string, not a DataFrame, and is thus throwing an error. The reason I need to build it this way and not hard-code the dataframes is that I will not know in advance how many df_x will be created.
Thanks for your help!
Update based on R Y A N's comment:
Thank you so much for this! You actually gave me another idea, so I made some tweaks. This is my final code, which results in what I want: a summary and a transposed table for each iteration. However, I am learning that globals() is bad practice and I should use a dictionary instead. How could I convert this to a dictionary-based process?

# create summary tables for each pair of groupby columns
groupbyColumns = [['sex', 'existing_new_client'], ['race_ethnicity', 'existing_new_client']]

for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # aggregate unique patient counts for the current grouping
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    # transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # store the new dfs under names prefixed with the grouping column
    globals()['%s_summary' % prefix] = df
    globals()['%s_summary_transposed' % prefix] = df_transposed
You can use the globals() function to look up the global variable (here a DataFrame) by the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # reference the first column indirectly, since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum')
    print(df_transposed)
However, with just the globals() lookup added, .pivot_table() raises an error, since the 'sex' column exists only in df_0 and not in df_1. To fix this, reference the first column indirectly (i.e. index=i_df.columns[0]) in the .pivot_table() call to handle the different column names of the dfs coming through the loop.
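To answer the follow-up in the update: a minimal sketch of the same loop with a plain dict instead of globals(), reusing data_long and groupbyColumns from the question (the summaries name is an assumption):

summaries = {}

for grouping in groupbyColumns:
    prefix = grouping[0]
    df = data_long.groupby(grouping).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    df['variables'] = df[prefix].str.lower() + '_' + df['existing_new_client'].str.lower()
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # store both frames under descriptive keys instead of creating global variables
    summaries[f'{prefix}_summary'] = df
    summaries[f'{prefix}_summary_transposed'] = df_transposed

# access a result by key, e.g. summaries['sex_summary_transposed']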

How to obtain output files labeled with dictionary keys

I am a Python/pandas user and I have multiple dataframes like df1, df2, df3, ....
I want to name them A, B, C, ..., so I wrote the following.
df_dict = {"A":df1, "B":df2,'C':df3,....}
Each dataframe has "Price" column and I want to know the output from the following formula.
frequency=df.groupby("Price").size()/len(df)
I made the following definition and want to obtain outputs from each dataframe.
def Price_frequency(df, keys=["Price"]):
    frequency = df.groupby(keys).size() / len(df)
    return frequency.reset_index().to_csv("Output_%s.txt" % (df), sep='\t')
As a first trial, I did
Price_frequency(df1,keys=["Price"])
but this did not work. It seems %s is wrong.
Ideally, I want output files named as "Output_A.txt", "Output_B.txt"...
If you could help me, I would be grateful for that very much.
A couple of points:
%s here receives the DataFrame itself, so the file name gets the DataFrame's repr rather than its name; you need to pass the name in separately. In Python 3.6+ you can also use formatted string literals, which you may find more readable.
Your function doesn't need to return anything here. You are using it to output csv files in a loop. Don't feel the need to add a return statement if it doesn't serve a purpose.
So you can do the following:
def price_frequency(df_dict, df_name, keys=['Price']):
    frequency = df_dict[df_name].groupby(keys).size() / len(df_dict[df_name].index)
    frequency.reset_index().to_csv(f'Output_{df_name}.txt', sep='\t')

df_dict = {'A': df1, 'B': df2, 'C': df3}

for df_name in df_dict:
    price_frequency(df_dict, df_name, keys=['Price'])
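For a self-contained run, hypothetical frames could look like this (the values are made up):

import pandas as pd

df1 = pd.DataFrame({'Price': [10, 10, 20]})
df2 = pd.DataFrame({'Price': [5, 15, 15, 15]})
df3 = pd.DataFrame({'Price': [7]})
# the loop above then writes Output_A.txt, Output_B.txt and Output_C.txt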
Iterating through the columns will produce one output file per column.
def Price_frequency(df):
    for col in df.columns[2:]:
        frequency = df.groupby(col).size() / len(df)
        # write one file per column instead of returning after the first
        frequency.reset_index().to_csv("Output_%s.txt" % col, sep='\t')
Reference: Pandas: Iterate through columns and starting at one column
Note: haven't gotten to test this yet

Why can I create a DataFrame using function returning Series?

I want to aggregate a pandas DataFrame with the following function f. The original DataFrame df has many columns and I want to extract only a few of them into a new DataFrame.
I cannot understand why I have to return a Series. I would think that I need to return a DataFrame, because the output is multidimensional.
from pandas import Series

def f(x):
    return Series(dict(Number_of_tweets=x['content'].count(),
                       Company=x['Company'].min(),
                       Description=x['from_user_description'].min(),
                       ))

account_count = df.groupby('from_user_screen_name').apply(f)
print(len(account_count))
account_count
You have to create a Series because, for each value of the column from_user_screen_name, only one aggregate value per column is returned. groupby.apply then joins all the Series together into a DataFrame.
Your solution can be rewritten with the agg function:
d = {'content': 'count', 'Company': 'min', 'from_user_description': 'min'}
account_count = df.groupby('from_user_screen_name').agg(d)
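Note that the agg dict keeps the original column names (content, Company, from_user_description) rather than the names used in f. If you want the renamed columns, named aggregation (available in pandas 0.25+) is one way to get them; a sketch:

account_count = df.groupby('from_user_screen_name').agg(
    Number_of_tweets=('content', 'count'),
    Company=('Company', 'min'),
    Description=('from_user_description', 'min'),
)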

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, as such I want it to only calculate the value once for unique values.
The only solution I've been able to come up with has been as follows:
This first step exists because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one exists because .merge has to be used on DataFrames.
new_dataframe = pd.DataFrame(index=data['column'].unique(), data=new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to my library of common tools that I hedonistically import with from me_tools import *:
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values, index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns; the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
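A quick usage sketch with made-up data (the squaring function is just a stand-in for the expensive one):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2, 3, 3, 3]})

def expensive(v):
    # stand-in for a costly computation
    return v ** 2

df = apply_unique(df, orig_col='x', new_col='x_sq', func=expensive)
# expensive() runs once per unique value of x, not once per row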
Edit:
As an aside: it is best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col:
if new_col in df:
    df = df.drop(new_col, axis='columns')
