Save data frame from inside for loop - python

I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized = True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe
    '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(df, col_1, col_2, normalized = True):
    ''' takes in df, groups col_1 according to col_2 and returns df '''
    # use the df argument rather than the global dataframe
    df = df[col_1].groupby(df[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
                          'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
                          'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
                          'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1  Don't know     Maybe        No       Yes
# Category
# Anaconda      0.275229  0.232416  0.217125  0.275229
# Elephant      0.220588  0.270588  0.255882  0.252941
# Panda         0.258258  0.222222  0.273273  0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this, but I cannot figure out how to save the data frame:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)

If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you would rather access these temp_df's by the column name that was used to transform them, you could store each one in a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
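As a rough, self-contained sketch of the dictionary approach: the small frame below is a made-up stand-in for the question's data, and the grouping column is skipped in the loop on the assumption that grouping 'Category' by itself is not wanted.

```python
import pandas as pd

# Small hypothetical frame standing in for the question's data.
dataframe = pd.DataFrame({
    'Category': ['Panda', 'Panda', 'Elephant', 'Elephant'],
    'Response 1': ['Yes', 'No', 'Yes', 'Yes'],
    'Response 2': ['Much', 'Much', 'A bit', 'Much'],
})

def transforming_data(df, col_1, col_2, normalized=True):
    """Share of each col_1 value within every col_2 group."""
    return (df[col_1].groupby(df[col_2])
            .value_counts(normalize=normalized)
            .unstack(fill_value=0))

# Assumption: the grouping column itself should not be transformed.
response_cols = [c for c in dataframe.columns if c != 'Category']

# One result per column, keyed by the column that produced it.
temp_dfs = {col: transforming_data(dataframe, col, 'Category')
            for col in response_cols}

# Look up a single result by column name.
print(temp_dfs['Response 1'])
```

Each entry stays available after the loop, so temp_dfs['Response 2'] can be used in the same way.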
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
See the pandas to_csv() docs for the available options.
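A minimal sketch of the write-to-disk variant, in case it helps. The two small frames, the temporary output directory, and the file naming scheme are all assumptions made for illustration; spaces in column names are replaced so the file names are easier to handle.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the per-column results built in the loop.
temp_dfs = {
    'Response 1': pd.DataFrame({'Yes': [0.5, 1.0]}, index=['Panda', 'Elephant']),
    'Response 2': pd.DataFrame({'Much': [1.0, 0.5]}, index=['Panda', 'Elephant']),
}

out_dir = Path(tempfile.mkdtemp())  # assumption: any writable directory works

written = []
for column, frame in temp_dfs.items():
    # Replace spaces so 'Response 1' becomes 'temp_df_Response_1.csv'.
    path = out_dir / 'temp_df_{}.csv'.format(column.replace(' ', '_'))
    frame.to_csv(path)
    written.append(path)

# Reading a file back with index_col=0 restores the row labels.
restored = pd.read_csv(written[0], index_col=0)
```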

The simplest solution would be to save the resulting dataframes in a list. Assuming that all columns that you want to loop over have the text Response in their column name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively, you can obtain exactly the same result with a list comprehension instead of a for loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]

Related

Add values from a nested JSON to a pandas dataframe

I have the following JSON object:
{"code":"Ok","matchings":[{"confidence":0.025755,"geometry":"qnp{bBww{kH??~D_I}E_J{EaJ{E{I{AsCoJgQfKuTjJwNtF}HdBuBnAgBpFsF~EeEzAsAt#i#lA}#x#q#lEmCjDuBdDoAvFmAfYmEtAUrJyDj#_#h#m#`#u#T}#J{#B_A?gAGmAM}#Su#]u#wN{QwI{KcA}Aa#gASiAWsBOwCGmDCoJ??cEH?{FA{HgIXuG`#eHrAsLdDkI|CkIfDq#VoDlB_GzDaE`D_A|#kA`AeAx#sI~G}DlDk#j#mClCiOrQwGvJiGxJoFdK_HjP{Pne#aLt\\sK~]oKb_#sG~TeJ`_#q#fD{#dEoBlMwBxQaAbI{Dh\\wKrfAiRbvBy#`KaLjwAyHj_AANM~AUxC}#tKi#bHe#jGfBj#t#V|#\\TFjAXz#HhASxAy#vCcBjX~GvG`BlEjAv\\xJfBf#dThG~Ad#nFrBnCbBdCvBzB`DbCfEr{#b~A","legs":[{"annotation":{"nodes":[330029575,5896466632,330029575,5896466588,5896466587,5896466586,5896466637,330029340,330029339,330029338,1497356855,1880770263,46388213,1880770262,1880770257,2021835257,3306177380,46387099,2021835255,6909770873,46385948,6909770874,46384887,46382454]},"steps":[],"distance":332.2,"duration":93.1,"summary":"","weight":93.1},{"annotation":{"nodes":[46384887,46382454,5888264001,6909802199,3296872014,6909802198,5888264003,6909802197,3296872012,6909802194,6909802195,6909802193,6909802196,3296872013,3296872015]},"steps":[],"distance":88.1,"duration":13.5,"summary":"","weight":13.5},{"annotation":{"nodes":[3296872013,3296872015,6909802186,6909802187,6909770884,3296872017,6909802185,4904066416,3296872018,1614187163]},"steps":[],"distance":62.3,"duration":12.4,"summary":"","weight":12.4},{"annotation":{"nodes":[3296872018,1614187163,2054127599,1614187129,5896479942,6909802219,46384372,1027299576,6909802220,46389815]},"steps":[],"distance":144,"duration":25.2,"summary":"","weight":25.2},{"annotation":{"nodes":[6909802220,46389815,6296436095,6296436094,298079716,6296436096,46391324,1083528076,6909802221,6909802222,46393158]},"steps":[],"distance":90.6,"duration":10.1,"summary":"","weight":10.1},{"annotation":{"nodes":[6909802222,46393158,46393795,6909802223,1027299602,6909802224,46396846,46398397,2054127645,46399502,46400708,1027299589,6712474212,6903665704,46402805,46403163,4374153462]},"steps":[],"distance":422.9,"duration":40.1,"summary"
:"","weight":40.1},{"annotation":{"nodes":[46403163,4374153462,46404084,1027299603,364146312,2262500170]},"steps":[],"distance":273.6,"duration":24.7,"summary":"","weight":24.7},{"annotation":{"nodes":[364146312,2262500170,5289718695]},"steps":[],"distance":170.9,"duration":15.3,"summary":"","weight":15.3},{"annotation":{"nodes":[2262500170,5289718695,2054127657,1693195716,46408565,6913837768,1693195721,2262500247,1693195714,2262500104,1693195717]},"steps":[],"distance":56.9,"duration":14.2,"summary":"","weight":14.2},{"annotation":{"nodes":[46397705,46401323,46405521]},"steps":[],"distance":86.6,"duration":12.6,"summary":"","weight":12.6},{"annotation":{"nodes":[46401323,46405521,46410773]},"steps":[],"distance":156.5,"duration":22.5,"summary":"","weight":22.5},{"annotation":{"nodes":[46405521,46410773,452003319,452003320]},"steps":[],"distance":95.4,"duration":13.8,"summary":"","weight":13.8},{"annotation":{"nodes":[452003319,452003320,46411428,46414457,46419384,46421801]},"steps":[],"distance":226.4,"duration":32.6,"summary":"","weight":32.6},{"annotation":{"nodes":[46419384,46421801,46421802,46421735]},"steps":[],"distance":69.2,"duration":10,"summary":"","weight":10},{"annotation":{"nodes":[46421802,46421735,46421416]},"steps":[],"distance":34.1,"duration":4.9,"summary":"","weight":4.9},{"annotation":{"nodes":[46421735,46421416,46420466]},"steps":[],"distance":2.7,"duration":0.3,"summary":"","weight":0.3},{"annotation":{"nodes":[46421416,46420466]},"steps":[],"distance":31.4,"duration":4.6,"summary":"","weight":4.6},{"annotation":{"nodes":[46421416,46420466,452003307,452003308,46421260,46422467,5761752102,46423905]},"steps":[],"distance":135.5,"duration":25,"summary":"","weight":25},{"annotation":{"nodes":[5761752102,46423905,46424346,5777055555,5713213408,46425605,5777055050,5777346784,5777055556,5713221227,46426685,46427741,3175895442,3183752428,5826014405,46428227]},"steps":[],"distance":106.5,"duration":14.9,"summary":"","weight":14.9},{"annotation":{"nodes
":[5826014405,46428227,3175895443,5826014406,3175895444,5826014368,5826014369,5826014374,46429570,5826014373,5826014375,5826014372,5826014358,5826014371,5826014370,5826014376]},"steps":[],"distance":172.7,"duration":15.7,"summary":"","weight":15.7},{"annotation":{"nodes":[2054127660,2054127638,2054127605,6296435009,2054127599,6909770882,3296872018,4904066416,6909802185,3296872017,6909770884,6909802187,6909802186,3296872015,3296872013,6909802196,6909802193,6909802195,6909802194,3296872012,6909802197,5888264003,6909802198,3296872014,6909802199,5888264001,46382454,46384887,6909770874,46385948,6909770873,2021835255,46387099,3306177380,2021835257]},"steps":[],"distance":317.7,"duration":46.1,"summary":"","weight":46.1},{"annotation":{"nodes":[3306177380,2021835257,1880770257,1880770262,46388213,1880770263,1497356855,330029338,330029339,330029340,5896466637]},"steps":[],"distance":150.4,"duration":29.4,"summary":"","weight":29.4}],"distance":80317.8,"duration":10983.5,"weight_name":"duration","weight":10983.5}],"tracepoints":[{"alternatives_count":0,"waypoint_index":0,"matchings_index":0,"location":[4.929932,52.372217],"name":"Willem Theunisse Blokstraat","distance":10.791613,"hint":"CAkHgHAJBwAlAAAAAAAAAAAAAAAAAAAALCd0QQAAAAAAAAAAAAAAACUAAAAAAAAAAAAAAAAAAAABAAAAjDlLAPkiHwP3OEsAGiMfAwAArxMz7Ejh"},null,{"alternatives_count":0,"waypoint_index":1,"matchings_index":0,"location":[4.932506,52.3709],"name":"Frans de Wollantstraat","distance":11.915926,"hint":"pwUBAPYEAYAHAAAARwAAAAAAAAAAAAAA3_qaQE0JPUIAAAAAAAAAAAcAAABHAAAAAAAAAAAAAAABAAAAmkNLANQdHwPtQksAxB0fAwAA_xUz7Ejh"},{"alternatives_count":0,"waypoint_index":472,"matchings_index":0,"location":[4.932745,52.373288],"name":"Piet Heinkade","distance":0.98867,"hint":"gwUBgMgFAQAFAAAADQAAABoBAABYAAAAQMS3QHTNW0HsWZ1DmZ2WQgUAAAANAAAAGgEAAFgAAAABAAAAiURLACgnHwN9REsAIycfAwoADwkz7Ejh"},null,null,{"alternatives_count":1,"waypoint_index":473,"matchings_index":0,"location":[4.934022,52.371637],"name":"Piet 
Heinkade","distance":2.713742,"hint":"NA8HADsPB4ACAAAADwAAADoAAAA-AAAAjU82QIAqg0FUpSdCLoWJQgIAAAAPAAAAOgAAAD4AAAABAAAAhklLALUgHwNfSUsAsCAfAwQAvxUz7Ejh"},null,null,{"alternatives_count":1,"waypoint_index":474,"matchings_index":0,"location":[4.93213,52.371794],"name":"Frans de Wollantstraat","distance":10.337677,"hint":"AgUBgAcFAQABAAAABAAAAAwAAAAAAAAA1paeP-KrBUAomAdBAAAAAAEAAAAEAAAADAAAAAAAAAABAAAAIkJLAFIhHwOrQksAeiEfAwIA7xQz7Ejh"},{"alternatives_count":1,"waypoint_index":475,"matchings_index":0,"location":[4.93074,52.372528],"name":"Isaac Titsinghkade","distance":0.65222,"hint":"AwkHgAYJBwA5AAAACwAAAAAAAACMAAAA_Fe_QWP_k0AAAAAA33FqQjkAAAALAAAAAAAAAIwAAAABAAAAtDxLADAkHwOtPEsANCQfAwAADw4z7Ejh"},null,null]}
I want to add all values that belong to the key nodes to one column in a pandas dataframe.
When I run:
for i in output["matchings"][0]['legs']:
    result = i['annotation']['nodes']
    df = pd.DataFrame(result, columns=['node'])
df
only a fraction gets added to the dataframe. What am I doing wrong?
At the end of your for loop, df holds only the nodes of the last leg, because every iteration overwrites it. You have to collect the nodes of all legs into a single dataframe instead.
Extending your code:
frames = []
for i in output["matchings"][0]['legs']:
    result = i['annotation']['nodes']
    frames.append(pd.DataFrame(result, columns=['node']))
# DataFrame.append was removed in pandas 2.0; pd.concat combines the pieces
df = pd.concat(frames, ignore_index=True)
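An alternative sketch that avoids building a dataframe per leg: flatten all node lists first and construct the dataframe once. The tiny output dict below is a hypothetical stand-in with the same nesting as the JSON in the question.

```python
import pandas as pd

# Tiny hypothetical response with the same nesting as the JSON above.
output = {
    "matchings": [{
        "legs": [
            {"annotation": {"nodes": [1, 2, 3]}},
            {"annotation": {"nodes": [3, 4]}},
        ]
    }]
}

# Flatten every leg's node list into one list, then build the frame once.
nodes = [node
         for leg in output["matchings"][0]["legs"]
         for node in leg["annotation"]["nodes"]]
df = pd.DataFrame({"node": nodes})
```

Nodes shared by neighbouring legs appear more than once; calling drop_duplicates() on the result is an option if that is not wanted.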

How to generalize a function written for a specific column in a dataframe to be usable on any similar column?

How can I adjust this code so that it is useable for any column in the dataframe? Currently it only works on the column called "Gaps", but I have 10 other columns to which I need to apply this same function.
def get_averages(df: pd.DataFrame, column: str) -> pd.DataFrame:
    '''
    Add a column in place, with the averages
    of each `Num` cyclical item for each row
    '''
    # work with a new dataframe
    df2 = (
        df[['FileName', 'Num', column]]
        .explode('Gaps', ignore_index=True)
    )
    df2.Gaps = df2.Gaps.astype(float)
    df2['tag'] = (  # add cyclic tags to each row, within each FileName
        df2.groupby('FileName')[column]
        .transform('cumcount')  # similar to range(len(group))
        % df2.Num  # get the modulo of the row number within the group
    )
    # get averages and collect into lists
    df2 = df2.groupby(['FileName', 'tag'])[column].mean()  # get average
    df2.rename(f'{column}_avgs', inplace=True)
    # collect in a list by Filename and merge with original df
    df2 = df2.groupby('FileName').agg(list)
    df = df.merge(df2, on='FileName')
    return df
df = get_averages(df, 'Gaps')
Use the parameter variable instead of hard-coding the column name:
    df2 = (
        df[['FileName', 'Num', column]]
        .explode(column, ignore_index=True)
    )
    df2[column] = df2[column].astype(float)
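To illustrate the idea in isolation, here is a simplified sketch in which every reference to the column goes through the parameter. The helper explode_as_float and the 'Widths' column are made up for this example; it is not the full get_averages function.

```python
import pandas as pd

# Hypothetical frame with two list-valued columns; 'Widths' is made up.
df = pd.DataFrame({
    'FileName': ['a', 'b'],
    'Gaps': [[1, 2], [3]],
    'Widths': [[10, 20], [30]],
})

def explode_as_float(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Explode one list column and cast it to float, using only `column`."""
    out = df[['FileName', column]].explode(column, ignore_index=True)
    out[column] = out[column].astype(float)
    return out

# The same function now serves every list column.
exploded = {col: explode_as_float(df, col) for col in ['Gaps', 'Widths']}
```

Note that bracket access (out[column]) is needed here; attribute access such as out.Gaps only ever works for one fixed name.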

Is there a better way to manipulate column names in a pandas dataframe?

I'm working with a large dataframe and need a way to dynamically rename column names.
Here's a slow method I'm working with:
# Create a sample dataframe
df = pd.DataFrame.from_records([
{'Name':'Jay','Favorite Color (BLAH)':'Green'},
{'Name':'Shay','Favorite Color (BLAH)':'Blue'},
{'Name':'Ray','Favorite Color (BLAH)':'Yellow'},
])
# Current columns are: ['Name', 'Favorite Color (BLAH)']
# ------
# build two lambdas to clean the column names
f_clean = lambda x: x.split('(')[0] if ' (' in x else x
f_join = lambda x: '_'.join(x.split())
df.columns = df.columns.map(f_clean).map(f_join).str.lower()
# Columns are now: ['name', 'favorite_color']
Is there a better method for solving this?
You could define a clean function and apply it to all the columns using a list comprehension.
def clean(name):
    name = name.split('(')[0] if ' (' in name else name
    name = '_'.join(name.split())
    return name.lower()  # lowercase to match the names in the question
df.columns = [clean(col) for col in df.columns]
It's clear what's happening and not overly verbose.
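As a variation, DataFrame.rename accepts a callable for its columns argument, so the same kind of helper can be passed directly. This is only a sketch; the clean function below also lowercases the result so that it produces the column names the question asks for.

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {'Name': 'Jay', 'Favorite Color (BLAH)': 'Green'},
    {'Name': 'Shay', 'Favorite Color (BLAH)': 'Blue'},
])

def clean(name):
    # Drop a trailing parenthetical, join words with underscores, lowercase.
    name = name.split('(')[0] if ' (' in name else name
    return '_'.join(name.split()).lower()

# rename applies the callable to every column label and returns a new frame.
renamed = df.rename(columns=clean)
```

Because rename returns a new frame, the original df keeps its column names unless the result is assigned back.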

How do I fix the For Loop to return a certain character from a DataFrame?

I have imported an Excel file into a DataFrame and iterated over a column called "Title" to pick out the titles that contain certain keywords. The matching titles are stored in a list called match_titles. What I want to do now is write a for loop that returns the value in the column before "Title" (the "Asin" column) for each title in match_titles. I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the whole, unfiltered column df['Asin'] to your list a once for every value in match_titles. Nothing in the loop filters df.
One solution is to add a boolean column that marks the matching titles, and then return the Asin column after filtering on that column:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
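A small self-contained sketch of the same filtering, using a boolean mask with .loc instead of a helper column. The three rows are invented stand-ins for the Excel sheet.

```python
import pandas as pd

# Hypothetical rows standing in for the Excel sheet.
df = pd.DataFrame({
    'Track': [1, 2, 3],
    'Asin': ['B001', 'B002', 'B003'],
    'Title': ['Electric Chainsaw 16 inch', 'Baby Stroller', 'Cordless Leaf Blower'],
})

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
keyword_sets = [set(k.lower().split()) for k in excludes]

def is_match(title):
    # A title matches when it contains every word of at least one keyword.
    words = set(title.lower().split())
    return any(keywords <= words for keywords in keyword_sets)

# Boolean Series: True for rows whose title matches.
mask = df['Title'].apply(is_match)

# Select the Asin values of the matching rows in one step.
result = df.loc[mask, 'Asin']
```

Here result holds the Asin values of the first and third rows only.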

Pandas: apply a specific function to columns and create column in new dataframe

I have a dataframe df1, like this:
date          sentence
29/03/2019    i like you
.....
I want to create a new dataframe df2 like this:
date          verb    object
29/03/2019    like    you
....
using a function like this:
def getSplit(df1):
    verbList = []
    objList = []
    for row in df1['sentence']:
        verb = getVerb(row)
        obj = getObj(row)
        verbList.append(verb)
        objList.append(obj)
    df2 = df1[['date']].copy()
    df2['verb'] = verbList
    df2['object'] = objList
    return df2
My function runs well, but it's slow. Could someone help me improve it so that it runs faster?
Thank you
You can use the pandas apply method to process the column:
def getverb(row):
    pass  # Your function
def getobj(row):
    pass  # Your function
df2 = df1.copy()  # Make a copy of your dataframe.
df2['verb'] = df2['sentence'].apply(getverb)
df2['object'] = df2['sentence'].apply(getobj)
df2.drop('sentence', axis=1, inplace=True)  # Drop the sentence column
df2
I hope this helps.
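For a runnable illustration, here is a sketch with placeholder extractors. The real getVerb and getObj are not shown in the question, so the versions below simply pick the second and third word and are purely hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: the real getVerb/getObj are not shown in the question.
def getVerb(sentence):
    return sentence.split()[1]

def getObj(sentence):
    return sentence.split()[2]

df1 = pd.DataFrame({
    'date': ['29/03/2019', '30/03/2019'],
    'sentence': ['i like you', 'we eat rice'],
})

# assign builds both new columns at once; drop removes the source column.
df2 = (df1.assign(verb=df1['sentence'].apply(getVerb),
                  object=df1['sentence'].apply(getObj))
          .drop(columns='sentence'))
```

This leaves df1 untouched and gives df2 the columns date, verb, and object.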
