I have a list of names. For each name, I start with my dataframe df and use the elements in the list to define new columns for the df. After my data manipulation is complete, I eventually create a new data frame whose name is partially derived from the list element.
list = ['foo','bar']
for x in list:
    df = prior_df
    # (long code for manipulating df)
    new_df_x = df
    new_df_x.to_parquet('new_df_x.parquet')
    del new_df_x
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
The reason I am using this approach is that, if I don't use a loop and just add the foo and bar columns one after another to the original df, my data gets really big and highly fragmented before I go from wide to long, and I run into an insufficient-memory error. The workaround for me is to create a loop, store the data frame for each element, and then at the very end join the long-format data frames together. Therefore, I cannot use the approach suggested in other answers, such as creating dictionaries, etc.
I am stuck at the line
new_df_x = df
where, within the loop, I am using the list element in the name of the data frame.
I'd appreciate any help.
IIUC, you only want the filenames, i.e. the stored parquet files, to carry the foo and bar markers, so you can reuse the variable name itself.
list = ['foo','bar']
for x in list:
    df = prior_df
    # (long code for manipulating df)
    df.to_parquet(f'new_df_{x}.parquet')
    del df
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
Here is an example, if you are looking to define a dataframe variable name from a list element.
import pandas as pd

data = {"A": [42, 38, 39], "B": [13, 25, 45]}
prior_df = pd.DataFrame(data)
list = ['foo', 'bar']
variables = locals()
for x in list:
    df = prior_df.copy()  # assign a dataframe copy to the variable df
    # (simple code for manipulating df)
    # -----------------------------------
    if x == 'foo':
        df['B'] = df['A'] + df['B']
    if x == 'bar':
        df['B'] = df['A'] - df['B']
    # -----------------------------------
    new_df_x = "new_df_{0}".format(x)
    variables[new_df_x] = df
    # del variables[new_df_x]
print(new_df_foo)  # print the 1st df variable
print(new_df_bar)  # print the 2nd df variable
I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest unchanged. It works well inside the function, but when it returns the dataframe, it loses the column order.
My desperate solution was to put the code in the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    df = df[new_columns]
    return df
Try using return df.reindex(new_columns, axis=1), and keep in mind that DataFrame modifications are not in place unless you specify inplace=True; therefore you need to explicitly assign the result returned by your function to your df variable.
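A minimal sketch of the whole fix, using a toy frame just for illustration (your real df comes from your own data):

import pandas as pd

def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    # reindex returns a new DataFrame; nothing happens in place
    return df.reindex(new_columns, axis=1)

# toy frame just to demonstrate the call
df = pd.DataFrame(columns=['OTHER', 'LINE_ID', 'PARENT.CATEGORY', 'CATEGORY'])
df = define_columns_order(df)   # reassign, or the new column order is lost
print(df.columns.tolist())      # ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY', 'OTHER']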
I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019, df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since','host_response_time','host_response_rate',
                 'host_acceptance_rate','host_is_superhost','host_total_listings_count',
                 'host_has_profile_pic','host_identity_verified',
                 'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
                 'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
                 'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
                 'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
                 'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
                 'review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
                 'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
                 ]]
but it doesn't apply the filter to the data frames. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame; you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example:
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
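If you still need the individual variables afterwards, one option (just a sketch, assuming the four frames from the question and an abbreviated column list) is to rebind the original names to the filtered results:

cols = ['host_since', 'host_response_time']  # abbreviated column list for illustration

# each name is rebound to its filtered copy, in the same order as df_list_2019
df_spain_2019, df_amsterdam_2019, df_venice_2019, df_sicily_2019 = [
    df[cols] for df in df_list_2019
]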
As soon as you write var = new_value, you do not change the original object; you make the variable refer to a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
            'host_acceptance_rate','host_is_superhost','host_total_listings_count',
            'host_has_profile_pic','host_identity_verified',
            'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
            'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
            'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
            'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
            'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
            'review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
            'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
            ])
for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware: pandas experts recommend preferring the df = df. ... idiom over df...(..., inplace=True) because it allows chaining operations. So you should ask yourself whether @timgeb's answer cannot be used. Anyway, this one should work for your requirements.
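For completeness, a small sketch of that non-inplace idiom, rebuilding the list instead of mutating each frame (it assumes the keep set and df_list_2019 defined above):

# rebind each slot of the list to a new, reduced DataFrame
# instead of dropping columns in place
df_list_2019 = [
    df.drop(columns=[col for col in df.columns if col not in keep])
    for df in df_list_2019
]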
I'm trying to write a function, maybe a very simple one, in which I want to store the results from other functions and at the end print all the results (like a logger function).
For that I've the following code:
import pandas as pd

def append_rows(id, result):
    df = pd.DataFrame([])
    df = df.append(pd.DataFrame(
        {'id': id,
         'result': result}, index=[0]), ignore_index=False, sort=False)
    return df

def calculator_1():
    for i in range(5):
        print(append_rows(i, 'Draft' + i+1))

def calculator_1():
    for i in range(2):
        print(append_rows(i, 'Draft' + 1))

print(append_rows('', ''))
My expected result is:
1,Draft2
2,Draft3
3,Draft4
4,Draft5
5,Draft6
1,Draft1
2,Draft1
But the actual result is:
"",""
My requirement is to have a single function that stores the results from other functions, instead of having multiple dataframes from each function and concatenating all of them into one at the end.
Anyone knows how can I do that?
Thanks!
With the current append_rows function as is, you are creating a new dataframe in each iteration. It's not entirely clear what you want to achieve, but I imagine you could be interested in adding new rows to your dataframe in each iteration?
In that case I would recommend the following steps (a small sketch follows the list):
create a dataframe outside of a function
create a list_of_lists outside of the function
add each newly created list from your loop to the list of lists
append the list of lists to the dataframe after the loop
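A sketch along those lines (the names and Draft values are made up; here the dataframe is simply built once from the collected rows after the loops, rather than appended to):

import pandas as pd

# the list of lists lives outside the function, so every call adds to it
rows = []

def log_result(id, result):
    # add each newly created row (a list) to the list of lists
    rows.append([id, result])

# the loops only collect rows
for i in range(5):
    log_result(i, f'Draft{i + 1}')
for i in range(2):
    log_result(i, 'Draft1')

# build the dataframe once, after the loops
log_df = pd.DataFrame(rows, columns=['id', 'result'])
print(log_df)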
If you are simply interested in creating a log of your iterations, then I don't see why you would need a dataframe at all; you can simply print the values in a loop.
I have a problem that has been annoying me for some time now. I have written a function that should, based on the row values of a dataframe, create a new dataframe filled with values based on a condition in the function. My function looks like this:
def intI():
    df_ = pd.DataFrame()
    df_ = df_.fillna(0)
    for index, row in Anno.iterrows():
        genes = row['AR_Genes'].split(',')
        df = pd.DataFrame()
        if 'intI1' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        elif 'intI2' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        else:
            df['Year'] = row['Year']
            df['Integrase'] = 0
            df_ = df_.append(df)
    return df_
When I call it like this, Newdf=Anno['AR_Genes'].apply(intI()), I get the following error:
TypeError: 'DataFrame' object is not callable
I really do not understand why it does not work. I have done similar things before, but there seems to be a difference that I do not get. Can anybody explain what is wrong here?
*******************EDIT*****************************
Anno in the function is the dataframe that the function shall be run on. It contains a string, for example a,b,c,ad,c
DataFrame.apply takes a function, which it applies to all rows/columns of the DataFrame. That error occurs because your function is called first and returns a DataFrame, which you then pass to apply; apply then tries to call that DataFrame, hence the TypeError.
Why do you use .fillna(0) on a newly created, empty DataFrame?
Would not this work? Newdf = intI()
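To illustrate the difference, a small sketch of what apply actually expects: a callable that receives each element, not an already-built DataFrame (the helper name and the toy data below are made up):

import pandas as pd

# apply expects a function; it calls it once per element of the Series
def has_integrase(genes_str):
    genes = genes_str.split(',')
    return 1 if ('intI1' in genes or 'intI2' in genes) else 0

Anno = pd.DataFrame({'Year': [2001, 2002],
                     'AR_Genes': ['intI1,blaTEM', 'sul1,tetA']})
Anno['Integrase'] = Anno['AR_Genes'].apply(has_integrase)  # pass the function itself
print(Anno[['Year', 'Integrase']])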