How does Python pandas process a list of tables?

I have this simple clean_data function, which will round the numbers in the input data frame. The code works, but I am very puzzled why it works. Could anybody help me understand?
The part where I got confused is this: table_list is a new list of data frames, so after running the code, each item inside table_list should be formatted, while tablea, tableb, and tablec should stay the same. But apparently I am wrong. After running the code, all three tables are formatted correctly. What is going on? Thanks a lot for the help.
table_list = [tablea, tableb, tablec]

def clean_data(df):
    for i in df:
        df[i] = df[i].map(lambda x: round(x, 4))
    return df

map(clean_data, table_list)

The simplest way is to break this code down completely:
# List of 3 dataframes
table_list = [tablea, tableb, tablec]

# Function that cleans one dataframe.
# It gets applied to each dataframe in table_list
# when the built-in Python function map is used AFTER this function.
def clean_data(df):
    # for loop:
    # df[i] is a different column of df on each iteration;
    # i iterates through the column names.
    for i in df:
        # df[i] = ... overwrites column i.
        # df[i].map(lambda x: round(x, 4)) in this case does the same
        # thing as df[i].apply(lambda x: round(x, 4)): it rounds each
        # element of the column and assigns the reformatted column
        # back to the column.
        df[i] = df[i].map(lambda x: round(x, 4))
    # Return the formatted SINGLE dataframe.
    return df

# I expect this is where the confusion comes from.
# This is a built-in Python (not pandas) function that applies the
# function clean_data to each item in table_list
# and returns the results.
# map was also used inside clean_data above. That map was the pandas
# Series.map method, not the same function as this map. They do
# similar things, but not exactly the same.
map(clean_data, table_list)
Hope that helps.

In Python, a list of dataframes, or of any complex objects, is simply a list of references that point to the underlying data frames. For example, the first element of table_list is a reference to tablea. Therefore, clean_data goes directly to that data frame, i.e., tablea, following the reference given by table_list[0].
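A minimal sketch makes this visible (the two small frames are made-up stand-ins for the original tables; note that in Python 3 the built-in map is lazy, so it is wrapped in list() here to force the calls to actually run):

```python
import pandas as pd

# Two small stand-ins for tablea and tableb
tablea = pd.DataFrame({"x": [1.23456, 2.34567]})
tableb = pd.DataFrame({"x": [3.45678]})

table_list = [tablea, tableb]

def clean_data(df):
    for i in df:
        df[i] = df[i].map(lambda x: round(x, 4))
    return df

# In Python 3, map() is lazy; list() forces clean_data to run
list(map(clean_data, table_list))

# The originals were mutated through the shared references
print(table_list[0] is tablea)  # True: same object, not a copy
print(tablea)                   # the x column is now rounded
```

Because table_list[0] and tablea are the very same object, rounding through one name is visible through the other.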

Related

Looping over a list and defining a dataframe using the list element in Python

I have a list of names. For each name, I start with my dataframe df and use the elements in the list to define new columns for df. After my data manipulation is complete, I eventually create a new data frame whose name is partially derived from the list element.
list = ['foo', 'bar']
for x in list:
    df = prior_df
    # (long code for manipulating df)
    new_df_x = df
    new_df_x.to_parquet('new_df_x.parquet')
    del new_df_x
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
The reason I am using this approach is that, if I don't use a loop and just add the foo and bar columns one after another to the original df, my data gets very big and highly fragmented before I go from wide to long, and I encounter an insufficient-memory error. The workaround for me is to create a loop, store the data frame for each element, and then join the long-format data frames together at the very end. Therefore, I cannot use the approach suggested in other answers, such as creating dictionaries.
I am stuck at the line
new_df_x = df
where within the loop, I am using the list element in the name of the data frame.
I'd appreciate any help.
IIUC, you only want the filenames, i.e. the stored parquet files, to have the foo and bar markers, and you can reuse the variable name itself.
list = ['foo', 'bar']
for x in list:
    df = prior_df
    # (long code for manipulating df)
    df.to_parquet(f'new_df_{x}.parquet')
    del df
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
Here is an example, if you are looking to define dataframe variable names from a list element.
import pandas as pd

data = {"A": [42, 38, 39], "B": [13, 25, 45]}
prior_df = pd.DataFrame(data)
list = ['foo', 'bar']
variables = locals()
for x in list:
    df = prior_df.copy()  # assign a copy of the dataframe to df
    # (simple code for manipulating df)
    # -----------------------------------
    if x == 'foo':
        df['B'] = df['A'] + df['B']
    if x == 'bar':
        df['B'] = df['A'] - df['B']
    # -----------------------------------
    new_df_x = "new_df_{0}".format(x)
    variables[new_df_x] = df
    # del variables[new_df_x]
print(new_df_foo)  # print the 1st df variable
print(new_df_bar)  # print the 2nd df variable
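Writing into locals() works here, but a plain dict avoids dynamically created variable names and is generally easier to debug and iterate over. A sketch of the same loop (same made-up data) using a dict keyed by the generated names:

```python
import pandas as pd

data = {"A": [42, 38, 39], "B": [13, 25, 45]}
prior_df = pd.DataFrame(data)

results = {}  # dict replaces the dynamically named variables
for x in ['foo', 'bar']:
    df = prior_df.copy()
    if x == 'foo':
        df['B'] = df['A'] + df['B']
    if x == 'bar':
        df['B'] = df['A'] - df['B']
    results[f'new_df_{x}'] = df

print(results['new_df_foo'])
print(results['new_df_bar'])
```

Each result is then reachable as results['new_df_foo'] etc., and you can loop over results.items() when merging at the end.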

Pandas Dataframe: Function doesn't preserve my custom column order when returning df

I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest unchanged. It works well inside the function, but when it returns the dataframe, the column order is lost.
My desperate solution was to put the code in the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    df = df[new_columns]
    return df
Try using return df.reindex(new_columns, axis=1), and keep in mind that DataFrame modifications are not in place unless you specify inplace=True. Therefore you need to explicitly assign the result returned by your function back to your df variable, e.g. df = define_columns_order(df).
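A small self-contained sketch of this (the extra VALUE column and the empty frame are made up for illustration); the reassignment at the call site is the crucial step:

```python
import pandas as pd

def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    return df.reindex(new_columns, axis=1)

# Columns deliberately out of order
df = pd.DataFrame(columns=['CATEGORY', 'VALUE', 'LINE_ID', 'PARENT.CATEGORY'])

df = define_columns_order(df)  # without this reassignment, df keeps the old order
print(df.columns.tolist())
# ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY', 'VALUE']
```

If the function is called as just define_columns_order(df), the reordered frame is returned and immediately discarded, which is exactly the symptom described in the question.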

How to filter multiple dataframes in a loop?

I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019, df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since','host_response_time','host_response_rate',
        'host_acceptance_rate','host_is_superhost','host_total_listings_count',
        'host_has_profile_pic','host_identity_verified',
        'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
        'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
        'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
        'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
        'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
        'review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
        'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
        ]]
but it doesn't apply the filter to the data frames. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
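A runnable miniature of the same idea (two toy frames and a short column list standing in for the long one):

```python
import pandas as pd

# Toy stand-ins for the 2019 dataframes
df_spain_2019 = pd.DataFrame({"price": [100, 120], "beds": [2, 3], "noise": [1, 2]})
df_venice_2019 = pd.DataFrame({"price": [90], "beds": [1], "noise": [5]})
df_list_2019 = [df_spain_2019, df_venice_2019]

cols = ["price", "beds"]  # stand-in for the long column list
filtered = [df[cols] for df in df_list_2019]

print(filtered[0].columns.tolist())      # ['price', 'beds']
print(df_spain_2019.columns.tolist())    # the originals keep all columns
```

The selection produces new objects collected in filtered; the original frames are untouched, which is why assigning inside the for loop appeared to do nothing.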
As soon as you write var = new_value, you do not change the original object; you make the variable refer to a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
    'host_acceptance_rate','host_is_superhost','host_total_listings_count',
    'host_has_profile_pic','host_identity_verified',
    'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
    'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
    'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
    'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
    'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
    'review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
    'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
    ])

for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware: pandas experts recommend preferring the df = df. ... idiom over df...(..., inplace=True), because it allows chaining operations. So you should ask yourself whether @timgeb's answer can be used instead. Anyway, this one should work for your requirements.
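To see the difference in behavior, a small demo (toy frames and a short keep set, made up for illustration) showing that the inplace drop really does mutate the objects the old names are bound to:

```python
import pandas as pd

df_a = pd.DataFrame({"price": [100], "beds": [2], "noise": [1]})
df_b = pd.DataFrame({"price": [90], "beds": [1], "noise": [5]})
df_list = [df_a, df_b]

keep = {"price", "beds"}
for data in df_list:
    # inplace=True mutates the dataframe object itself,
    # so the change is visible through df_a and df_b as well
    data.drop(columns=[c for c in data.columns if c not in keep], inplace=True)

print(df_a.columns.tolist())  # ['price', 'beds']
print(df_b.columns.tolist())  # ['price', 'beds']
```

Contrast this with data = data[cols], which only rebinds the loop variable and leaves df_a and df_b unchanged.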

Python - Function that uses a dataframe to store the results from other functions

I'm trying to write a function, maybe a very simple one, that stores the results from other functions and at the end prints all the results (like a logger function).
For that I've the following code:
import pandas as pd

def append_rows(id, result):
    df = pd.DataFrame([])
    df = df.append(pd.DataFrame(
        {'id': id,
         'result': result}, index=[0]), ignore_index=False, sort=False)
    return df

def calculator_1():
    for i in range(5):
        print(append_rows(i, 'Draft' + i+1))

def calculator_1():
    for i in range(2):
        print(append_rows(i, 'Draft' + 1))

print(append_rows('', ''))
My expected result is:
1,Draft2
2,Draft3
3,Draft4
4,Draft5
5,Draft6
1,Draft1
2,Draft1
But the actual result is:
"",""
My requirement is to have a single function that stores the results from other functions, instead of having multiple dataframes from each function and concatenating all of them into one at the end.
Anyone knows how can I do that?
Thanks!
With the current append_rows function as is, you are creating a new dataframe in each iteration. It's not entirely clear what you want to achieve, but I imagine you are interested in adding new rows to your dataframe in each iteration?
In that case I would recommend the following steps:
create a dataframe outside of a function
create a list_of_lists outside of the function
add each newly created list from your loop to the list of lists
append the list of lists to the dataframe after the loop
If you are simply interested in creating a log of your iterations, then I don't see why you would need a dataframe at all; you can simply print the values in a loop.
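The steps above can be sketched as follows (the calculator bodies and names are illustrative, and the string bugs from the question, such as 'Draft' + i+1, are fixed with an f-string):

```python
import pandas as pd

rows = []  # list of [id, result] lists, shared by all calculators

def append_rows(id_, result):
    rows.append([id_, result])

def calculator_1():
    for i in range(5):
        append_rows(i, f'Draft{i + 1}')

def calculator_2():
    for i in range(2):
        append_rows(i, 'Draft1')

calculator_1()
calculator_2()

# Build the dataframe once, after all the loops have run
log = pd.DataFrame(rows, columns=['id', 'result'])
print(log)
```

Appending to a plain list inside the loops and constructing the DataFrame once at the end is also much faster than growing a DataFrame row by row.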

Using pandas apply() function on a dataframe to create a new dataframe

I have a problem that has been annoying me for some time now. I have written a function that should, based on the row values of a dataframe, create a new dataframe filled with values based on a condition in the function. My function looks like this:
def intI():
    df_ = pd.DataFrame()
    df_ = df_.fillna(0)
    for index, row in Anno.iterrows():
        genes = row['AR_Genes'].split(',')
        df = pd.DataFrame()
        if 'intI1' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        elif 'intI2' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        else:
            df['Year'] = row['Year']
            df['Integrase'] = 0
            df_ = df_.append(df)
    return df_
When I call it like this: Newdf = Anno['AR_Genes'].apply(intI()), I get the following error:
TypeError: 'DataFrame' object is not callable
I really do not understand why it does not work. I have done similar things before, but there seems to be a difference that I do not get. Can anybody explain what is wrong here?
*******************EDIT*****************************
Anno in the function is the dataframe that the function shall be run on. It contains strings, for example a,b,c,ad,c
DataFrame.apply takes a function and applies it to all rows/columns of the DataFrame. The error occurs because intI() is evaluated first: your function runs, returns a DataFrame, and that DataFrame (not the function itself) is then passed to apply, which tries to call it.
Why do you use .fillna(0) on a newly created, empty DataFrame?
Wouldn't this work? Newdf = intI()
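A simplified, self-contained rework of the idea, called directly rather than through apply. The toy Anno data is made up, and rows are collected as dicts instead of appending empty frames (the original df['Year'] = row['Year'] assignments on an empty frame would produce zero-row results):

```python
import pandas as pd

# Toy stand-in for the Anno dataframe
Anno = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "AR_Genes": ["a,intI1,c", "b,c", "intI2,d"],
})

def intI():
    records = []
    for _, row in Anno.iterrows():
        genes = row["AR_Genes"].split(",")
        # 1 if either integrase gene is present, else 0
        flag = 1 if ("intI1" in genes or "intI2" in genes) else 0
        records.append({"Year": row["Year"], "Integrase": flag})
    return pd.DataFrame(records)

Newdf = intI()  # call the function directly; no apply() needed
print(Newdf)
```

Since intI iterates over Anno itself and builds a whole new frame, there is nothing for apply to map over; a direct call does the job.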
