How to apply the same change to multiple dataframes - Python

I have 4 dataframes and I need to recalculate the same column in each of them.
I tried to create a list of the dataframes and then use a for loop to iterate over the list and apply the changes. After the loop, if I inspect the dataframes, no changes have been applied to them.
list_df = [df1, df2, df3, df4]
for df in list_df:
    df = df[3:]  # trying to trim the first 3 rows of each dataframe
So after this code, if I inspect df1 to see whether the for loop changed anything - sadly, it is still exactly as it was before the loop.

You need to invest some time learning how assignment in Python works. Assignment never mutates data. Inside the loop you are just binding the name df = ... to a new object and then doing nothing with it; the name is rebound on the next iteration, and the original dataframes are never touched.
Incidentally, df[3:] does slice off the first three rows of a DataFrame (slicing inside [] slices rows positionally), but df.iloc[3:] is the explicit, preferred way to do positional row slicing.
One solution to your problem is:
list_df[:] = [df.iloc[3:] for df in list_df]
This slice assignment mutates list_df in place, replacing its elements with new, trimmed DataFrames. You will see the trimmed DataFrames as the elements of list_df, but not when you inspect the names df1, ..., df4 after the operation: those names still point to the original DataFrames.
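A minimal sketch of that name-binding behaviour (the dataframe contents are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'a': range(5)})
df2 = pd.DataFrame({'a': range(5)})
list_df = [df1, df2]

# rebuild the list with trimmed copies
list_df[:] = [df.iloc[3:] for df in list_df]

print(len(list_df[0]))  # 2 -> the list element is trimmed
print(len(df1))         # 5 -> the name df1 still points to the original

# to update the names as well, rebind them explicitly
df1, df2 = list_df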

How can I use `pd.concat` to join all columns at once instead of calling `frame.insert` many times?

I have to create a new dataframe in which each column is determined by a function that takes two arguments. The problem is that for each column the function needs a different argument, which is given by the number of the column.
There are about 6k rows and 200 columns in the dataframe.
The function that defines each column of the new dataframe is defined like this:
def phiNT(M, nT):
    M = M[M.columns[:nT]]                            # keep the first nT columns
    d = pd.concat([M.iloc[:, nT - 1]] * nT, axis=1)  # repeat the nT-th column nT times
    d.columns = M.columns
    D = M - d                                        # difference from the nT-th column
    D = D.mean(axis=1)                               # row-wise mean of the differences
    return D
I tried to create an empty dataframe and then add each column using a loop:
A = pd.DataFrame()
for i in range(1, len(M.columns)):
    A[i] = phiNT(M, i)
But this is what pops up:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
So I need a way to use pd.concat to create all the columns at once.
You should build all the columns in a list or generator and then call pd.concat once on it to create a new dataframe containing all of them, instead of inserting one column at a time.
The following uses a generator to be memory efficient:
results = (phiNT(M, i) for i in range(1, len(M.columns)))
A = pd.concat(results, axis=1)

This is how it would be done with a list:

A = pd.concat([phiNT(M, i) for i in range(1, len(M.columns))], axis=1)
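One detail worth noting (an assumption on my part, since phiNT returns an unnamed Series): pd.concat will label the resulting columns 0, 1, 2, ..., while the original loop labelled them 1 through len(M.columns) - 1. If you want to reproduce the original labels, you can rename them afterwards:

A.columns = range(1, len(M.columns))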

Adding columns to a pandas.DataFrame with previous row values before calling apply()

I need to add a new column to a pandas dataframe, where the value is calculated from the value of a column in the previous row.
Coming from a non-functional background (C#), I am trying to avoid loops, since I have read that they are an anti-pattern in pandas.
My plan is to use Series.shift to add a new column to the dataframe holding the previous row's value, call DataFrame.apply, and finally remove the additional column. E.g.:
def my_function(row):
    # perform complex calculations with row.time, row.time_previous and other values
    # return the result
    ...

df["time_previous"] = df.time.shift(1)
df["result"] = df.apply(my_function, axis=1)   # "result" stands in for the new column
df = df.drop("time_previous", axis=1)          # drop returns a new frame unless inplace=True
In reality, I need to create four additional columns like this. Is there a better alternative to accomplish this without a loop? Is this a good idea at all?
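A minimal runnable version of the plan above (the column names and the example calculation are invented for illustration, not taken from the question):

import pandas as pd

df = pd.DataFrame({"time": [1.0, 2.5, 4.0, 7.5]})

def my_function(row):
    # example calculation: time elapsed since the previous row
    return row.time - row.time_previous

df["time_previous"] = df["time"].shift(1)
df["elapsed"] = df.apply(my_function, axis=1)
df = df.drop("time_previous", axis=1)
print(df)  # the first row's elapsed is NaN, since there is no previous row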

Looping through DataFrame via zip

I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
    # statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
    df['Name'] = time.time()
The problem is that every single row is getting the same value instead of different values.
The main problem is in the assignment:
df['Name'] = time.time()

This says to grab the current time and assign it to every cell in the Name column: you are referencing the whole column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is a one-element tuple holding that row's Name value, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[row, 'Name'] = time.time()
or
row['Name'] = time.time()
depending on how you define row in your iteration.
Does that get you to a solution?
The following also works:
import pandas as pd
import time

# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                        'age': [23, 27, 30, 35]})

# iterate through each row in the dataframe
col_idx = df.columns.get_loc('name')  # column position, so we can use iloc
for i in df.itertuples():
    # i[0] is the row's index, which equals its position with the default RangeIndex
    df.iloc[i[0], col_idx] = time.time()
So, essentially, we use the dataframe's index to indicate each row's position: the first index points to the first row in the dataframe, and so on.
EDIT: as pointed out in the comment, using .index to iterate over rows is not good practice. So, let's use the number of rows of the dataframe itself. This can be obtained via df.shape, which returns a (rows, columns) tuple, so we only need the row count, df.shape[0].
2nd EDIT: switched to df.itertuples() for a performance gain and to .iloc for integer-based indexing.
Additionally, the official pandas docs recommend using loc for assignment to a pandas dataframe, to avoid the pitfalls of chained indexing. More information here: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
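For completeness, here is the purely positional variant gestured at in the first EDIT, using df.shape[0] rather than the index (same example df and col_idx as above):

for pos in range(df.shape[0]):  # df.shape is a (rows, columns) tuple
    df.iloc[pos, col_idx] = time.time()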

Create a loop to dynamically select rows from a dataframe, then append the selected rows to another dataframe: df.query()

I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to split its rows based on the unique values found in the field 'Part ID'. I would then like to take each set of rows and append it, one at a time, to an empty dataframe called "emptydf", which has the same column headings as the "Claims" dataframe.
Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than combing through the dataframe manually each week. I was thinking of somehow combining df.where() with a for loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
# 1. Create an empty dataframe
emptydf = Claims[0:0]

# 2. Parse the dataframe by Part ID and append to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)

As you can see, I can only hard-code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby('Part_ID'))

This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID, and the second is the sub-dataframe holding that Part ID's rows.
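A small runnable illustration of what that list of tuples contains (the data is invented for the example):

import pandas as pd

Claims = pd.DataFrame({'Part_ID': [1009, 1009, 2044],
                       'Amount': [10, 20, 30]})

for part_id, subset in Claims.groupby('Part_ID'):
    print(part_id, len(subset))  # each subset is a dataframe of that Part ID's rows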

Looping through a list of pandas dataframes

Two quick pandas questions for you.
I have a list of dataframes I would like to apply a filter to.
countries = [us, uk, france]
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
When I run this, the dataframes don't change afterwards. Why is that?
If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.
for df in countries:
    df["Continent"] = "Europe"
As a follow up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list then applied transformations to each df in the list. After I transformed these different dfs, I called the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to update the results. Could anybody shed any light on why that is?
Taking a look at this answer, you can see that for df in countries: is equivalent to something like

for idx in range(len(countries)):
    df = countries[idx]
    # do something with df
which obviously won't actually modify anything in your list. It is generally bad practice to modify a list while iterating over it in a loop like this.
A better approach would be a list comprehension, you can try something like
countries = [us, uk, france]
countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
             for df in countries]
Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
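A hedged sketch of that single-DataFrame suggestion (us, uk and france are the question's dataframes; the country labels and the use of assign are my own illustration):

combined = pd.concat([us.assign(Country="US"),
                      uk.assign(Country="UK"),
                      france.assign(Country="France")])
mask = (combined["Send Date"] > '2016-11-01') & (combined["Send Date"] < '2016-11-30')
november = combined[mask]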
For why
for df in countries:
    df["Continent"] = "Europe"
modifies countries, while
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
does not, see why should I make a copy of a data frame in pandas. df is a reference to the actual DataFrame in countries, not a copy of it, so an in-place modification through the reference, such as declaring a new column, affects the original DataFrame. Taking a subset, however, modifies nothing: it builds a brand-new DataFrame, and the assignment merely rebinds the name df to that new object, leaving the DataFrame inside countries untouched.
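A tiny sketch of the difference, using the is operator to show which object the name df points to (made-up data):

import pandas as pd

countries = [pd.DataFrame({"Send Date": ['2016-10-15', '2016-11-15']})]
df = countries[0]

df["Continent"] = "Europe"          # in-place: countries[0] gains the column
print(df is countries[0])           # True -> still the same object

df = df[df["Send Date"] > '2016-11-01']
print(df is countries[0])           # False -> df now names a new object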
