How to cycle through a list of pandas dataframes - python

I'm trying to work out the correct method for cycling through a number of pandas dataframes using a 'for loop'. All of them contain 'year' columns from 1960 to 2016, and from each df I want to remove the columns '1960' to '1995'.
I created a list of dfs and also a list of str values for the years.
dflist = [apass, rtrack, gdp, pop]

dfnewlist = []
for i in range(1960, 1996):
    dfnewlist.append(str(i))

for df in dflist:
    df = df.drop(dfnewlist, axis=1)
My for loop runs without error, but it does not remove the columns.
Edit - Just to add, when I do this manually without the for loop, such as below, it works fine:
gdp = gdp.drop(dfnewlist, axis = 1)

This is a common issue with for loops. When you write

for df in dflist:

and then rebind df, the change does not affect the actual object in the list, only the local name df.
Use enumerate to fix it:

for i, df in enumerate(dflist):
    dflist[i] = df.drop(dfnewlist, axis=1)
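For illustration, here is a minimal self-contained sketch of the enumerate fix, using toy stand-ins for the real dataframes (errors='ignore' is used because the toys only contain a couple of the year columns):

import pandas as pd

apass = pd.DataFrame({'country': ['A'], '1960': [1], '1996': [2]})
gdp = pd.DataFrame({'country': ['B'], '1960': [3], '1996': [4]})
dflist = [apass, gdp]
dfnewlist = [str(i) for i in range(1960, 1996)]

for i, df in enumerate(dflist):
    dflist[i] = df.drop(dfnewlist, axis=1, errors='ignore')

print(dflist[0].columns.tolist())  # ['country', '1996']

Note that the list now holds the trimmed frames, while the original names (apass, gdp) still point to the untrimmed ones.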

To ensure some robustness, you can use the errors='ignore' flag: if one of the columns doesn't exist, the drop won't error out.
However, your real problem is that when you loop, df starts out referring to an item in the list. You then rebind the name df by assigning it the result of df.drop(dfnewlist, axis=1). This does not replace the dataframe in your list as you'd hoped; it just points the name df at a new object and leaves the list item untouched.
Instead, you can use the inplace=True flag.
drop_these = [*map(str, range(1960, 1996))]

for df in dflist:
    df.drop(drop_these, axis=1, errors='ignore', inplace=True)
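Equivalently, if you'd rather avoid inplace=True, you can rebuild the list with a comprehension:

dflist = [df.drop(drop_these, axis=1, errors='ignore') for df in dflist]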

Related

Iterating through multiple data frames with for loop to change column names to lower case and snake case? (Pandas)

I am trying to utilize a for-loop to iterate through multiple data frames and change each df's column names to snake case and lower case. I made a data frame list and then created a nested for-loop to iterate through the list and then iterate through each column name. The loop runs but it makes no changes to my data frame. Any help on this?
df_list = [df_1, df_2, df_3, df_4, df_5]

for df in df_list:
    for col in df.columns:
        col.replace(' ', '_').lower()
You didn't assign the replaced column name back: strings are immutable, so col.replace(' ', '_').lower() returns a new string that is immediately discarded. You can try:

for df in df_list:
    df.rename(columns=lambda col: col.replace(' ', '_').lower(), inplace=True)
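A quick check with a toy frame (the column names here are made up for illustration):

import pandas as pd

df_1 = pd.DataFrame({'First Name': ['Ada'], 'Birth Year': [1815]})
df_list = [df_1]

for df in df_list:
    df.rename(columns=lambda col: col.replace(' ', '_').lower(), inplace=True)

print(df_1.columns.tolist())  # ['first_name', 'birth_year']

Because rename(..., inplace=True) mutates each dataframe rather than rebinding the loop variable, the originals in df_list are updated.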

Dropping column in dataframe with assignment not working in a loop

I have two dataframes (df_train and df_test) containing a column ('Date') that I want to drop.
As far as I understood, I could do it in two ways, i.e. either by using inplace or by assigning the dataframe to itself, like:
if 'Date' in df_train.columns:
    df_train.drop(['Date'], axis=1, inplace=True)
OR
if 'Date' in df_train.columns:
    df_train = df_train.drop(['Date'], axis=1)
Both methods work on a single dataframe, but the former should be more memory friendly, since the assignment version creates a copy of the dataframe.
The weird thing is, I have to do it for both the dataframes, so I tried to do the same within a loop:
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        data.drop(['Date'], axis=1, inplace=True)
and
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        data = data.drop(['Date'], axis=1)
and the weird thing is that, in this case, only the first way (using inplace) works. If I use the second way, the 'Date' columns aren't dropped.
Why is that?
It doesn't work because assigning to data inside the loop doesn't change the list of dataframes; it only rebinds the loop variable to a new object. You can collect the results instead:
lst = []
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        lst.append(data.drop(['Date'], axis=1))
print(lst)
Now lst contains all the dataframes.
It's better to use a list comprehension:
res = [data.drop(['Date'], axis=1) for data in [df_train, df_test] if 'Date' in data.columns]
Here, you will get a copy of both dataframes with the columns dropped. (Note that the if clause filters, so a dataframe without a 'Date' column would be omitted from res entirely.)
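If the goal is to update the original names rather than build a new list, one option is a sketch like the following, using drop's errors='ignore' so frames without a 'Date' column pass through unchanged:

import pandas as pd

df_train = pd.DataFrame({'Date': ['2020-01-01'], 'x': [1]})
df_test = pd.DataFrame({'Date': ['2020-01-02'], 'x': [2]})

df_train, df_test = (
    df.drop(columns='Date', errors='ignore') for df in (df_train, df_test)
)
print(df_train.columns.tolist())  # ['x']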

No changes to original dataframe after applying loop

I have a list of dataframes such that
df_lst = [df1, df2]
I also created a function which removes the rows with '0' in the dataframe:
def dropzeros(df):
    newdf = df[df['x'] != 0.0]
    return newdf
I tried applying this through a loop and placed an assignment variable within the loop, but the original dataframe remained unchanged even after running the loop.
for df in df_lst:
    df = dropzeros(df)
I also tried using list comprehensions to go about it
df_lst = [dropzeros(df) for df in df_lst]
I know the function works, since when I print(len(df)) before and after calling dropzeros(df) the length drops. How might I go about this so that my original dataframes are altered after running the loop?
That's because assigning to the variable df in your for loop does not write back into your list; you are merely rebinding the name df to a new object on each iteration.
You can assign via enumerate and pipe your function:
for idx, df in enumerate(df_lst):
    df_lst[idx] = df.pipe(dropzeros)
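df.pipe(dropzeros) is simply dropzeros(df) written in method-chain style; a minimal sketch with toy data:

import pandas as pd

def dropzeros(df):
    return df[df['x'] != 0.0]

df_lst = [pd.DataFrame({'x': [0.0, 1.0, 2.0]}),
          pd.DataFrame({'x': [0.0, 0.0, 3.0]})]

for idx, df in enumerate(df_lst):
    df_lst[idx] = df.pipe(dropzeros)

print([len(df) for df in df_lst])  # [2, 1]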

Concat pandas dataframes without following a certain sequence

I have data files, converted to pandas dataframes, that sometimes share column names and sometimes share a time-series index. I want to combine them all into one dataframe, matching on both column and index whenever possible. Since there is no naming sequence, they arrive in random order for concatenation. Concatenating two dataframes with different columns along axis=1 works well, but if the resulting dataframe is then combined with a new df whose column name matches one of the earlier merged dataframes, the concat fails. For example, with these data files:
import pandas as pd

df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)

data1 = pd.DataFrame()
file_list = [df1, df2, df3]  # fails
# file_list = [df2, df3, df1]  # works

for fn in file_list:
    if data1.empty == True or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try this. In my case there is no way to load all the DataFrames first and check their column names; if I could, I would first combine all dfs with matching column names and only then concat the resulting dataframes (all with different columns) along axis=1, which I know always works. A solution that requires preloading all the DataFrames and rearranging the concatenation order is therefore not possible (it was only done for the working example above). I need the flexibility to concatenate the information into the larger dataframe data1 in whatever order it arrives. Please let me know if you have a suitable approach.
If you step through the loop, you will find that the first iteration goes into the if branch, so data1 is equal to df1. The second iteration goes into the else branch, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After the else, data1 has some duplicated column names; in every row of those duplicated columns, one of the two values is NaN and the other is a float. This is why the subsequent pd.concat() fails.
You can aggregate the duplicated columns before concatenating to get rid of them (note the added numpy import, which the original snippet needs):

import numpy as np

for fn in file_list:
    if data1.empty == True or fn.columns[1] in data1.columns:
        # new: collapse duplicated column names before the row-wise concat
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)

(In recent pandas versions groupby(axis=1) is deprecated; transposing, grouping on level=0, and transposing back is one workaround.)
After that, you would get
data1.shape
(30, 23)
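As a side note (not part of the original answer): if the goal is simply to merge each incoming frame into data1 on both index and columns, DataFrame.combine_first may sidestep the branching entirely, since it aligns on the union of both axes and fills missing cells from the other frame. A sketch, assuming overlapping cells never hold conflicting values:

import pandas as pd

data1 = pd.DataFrame()
for fn in file_list:
    # Union of index and columns; existing non-NaN values in data1 win
    data1 = data1.combine_first(fn)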

Pandas how to concat two dataframes without losing the column headers

I have the following toy code:
import pandas as pd

df = pd.DataFrame()
df["foo"] = [1, 2, 3, 4]

df2 = pd.DataFrame()
df2["bar"] = [4, 5, 6, 7]

df = pd.concat([df, df2], ignore_index=True, axis=1)
print(list(df))
Output: [0,1]
Expected Output: [foo,bar] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function, or concat parameter that I am unaware of?
Thanks!
As stated in the merge, join, and concatenate documentation, ignore_index removes all existing axis labels and uses a range (0, ..., n-1) instead; since you concatenate along axis=1, it is the column names that get reset. So you should get the result you want once you remove the ignore_index argument or set it to False (the default):
df = pd.concat([df, df2], axis=1)
This will join your df and df2 based on indexes (rows with the same index are concatenated side by side; where the other dataframe has no row for an index, the values are filled with NaN).
If your dataframes have different indexes and you still want to concatenate them this way, you can either create a temporary common index to join on, or set the new dataframe's columns yourself after using concat(..., ignore_index=True).
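For a quick check, here is the toy code again with ignore_index removed:

import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4]})
df2 = pd.DataFrame({'bar': [4, 5, 6, 7]})

df = pd.concat([df, df2], axis=1)
print(list(df))  # ['foo', 'bar']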
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]
