Looping through a list of pandas dataframes - python

Two quick pandas questions for you.
I have a list of dataframes I would like to apply a filter to.
countries = [us, uk, france]
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
When I run this, the df's don't change afterwards. Why is that?
If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.
for df in countries:
    df["Continent"] = "Europe"
As a follow up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list then applied transformations to each df in the list. After I transformed these different dfs, I called the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to update the results. Could anybody shed any light on why that is?

Taking a look at this answer, you can see that for df in countries: is equivalent to something like
for idx in range(len(countries)):
    df = countries[idx]
    # do something with df
which only rebinds the local name df on each pass and never stores anything back into your list. It is also generally bad practice to modify a list while iterating over it in a loop like this.
A better approach is a list comprehension; you can try something like
countries = [us, uk, france]
countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
             for df in countries]
Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
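A minimal sketch of the single-DataFrame idea, assuming three small made-up frames in place of us, uk and france; the country column name and the frames dict are illustrative choices, not something from the question:

```python
import pandas as pd

# Hypothetical stand-ins for the us, uk and france frames in the question.
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-30", "2016-11-15"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-02", "2016-12-01"])})
france = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-29"])})
frames = {"us": us, "uk": uk, "france": france}

# Tag each frame with its country, then stack them into one frame.
combined = pd.concat(
    [df.assign(country=name) for name, df in frames.items()],
    ignore_index=True,
)

# One vectorized filter covers every country at once.
mask = (combined["Send Date"] > "2016-11-01") & (combined["Send Date"] < "2016-11-30")
november = combined[mask]
print(november)
```

If a per-country view is needed later, november[november["country"] == "uk"] or a groupby on country gets it back without a list of frames.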

For why
for df in countries:
    df["Continent"] = "Europe"
modifies countries, while
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
does not, see "why should I make a copy of a data frame in pandas". Inside the loop, df is a name bound to the same DataFrame object that is stored in countries, so modifying that object through df is visible through the list as well. Adding a new column is such a modification. Taking a subset is not: it builds a new DataFrame, and the assignment merely rebinds the name df to that new object, leaving the one in countries untouched.
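The difference can be checked directly. This is a small sketch with made-up frames and a made-up value column, not the data from the question:

```python
import pandas as pd

# Hypothetical one-column frames standing in for the question's data.
uk = pd.DataFrame({"value": [1, 2, 3]})
france = pd.DataFrame({"value": [0, 5, 6]})
countries = [uk, france]

# Mutation: the column is added to the objects the list already holds.
for df in countries:
    df["Continent"] = "Europe"

# Rebinding: each pass builds a new frame and points the name df at it.
for df in countries:
    df = df[df["value"] > 2]

print(countries[0] is uk)           # does the list still hold the original object?
print("Continent" in uk.columns)    # is the in-place change visible through uk?
print(len(france), len(df))         # original row count vs. last filtered frame
```

After the second loop, df is just the last filtered frame; nothing in countries refers to it.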

Related

overwriting dataframes in pandas

I have a given dataframe
new_df :
ID   summary   text_len
1    xxx       45
2    aaa       34
I am performing some df manipulation by adding one column per keyword, with the keywords taken from a different df, like this:
keywords = df["keyword"].to_list()
for key in keywords:
    new_df[key] = new_df["summary"].str.lower().str.count(key)
new_df
From here I need two separate dataframes so that I can perform a few actions on each of them (add some columns, do some calculations, etc.).
I need a dataframe with occurrences, as produced by the code above, and a binary dataframe.
WHAT I DID:
Assigned the dataframe for occurrences:
df_freq = new_df  # because it is already calculated and done
Created another dataframe, the binary one, on top of new_df:
#select only numeric columns to change them to binary
numeric_cols = new_df.select_dtypes("number", exclude='float64').columns.tolist()
new_df_binary = new_df
new_df_binary['text_length'] = new_df_binary['text_length'].astype(int)
new_df_binary[numeric_cols] = (new_df_binary[numeric_cols] > 0).astype(int)
Everything works fine and I can perform the math I need, but when I come back to df_freq it is no longer the dataframe with occurrences. It looks like it changed along with the binary one.
I need separate tables so that I can perform separate math on them. Do you know how I can avoid this overwriting issue?
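An assignment such as df_freq = new_df does not copy anything; both names then refer to the same DataFrame. To get an independent dataframe, you may use pandas' copy method with the deep argument set to True:
df_freq = new_df.copy(deep=True)
Setting deep=True (which is the default) ensures that modifications to the data or indices of the copy do not impact the original dataframe.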
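A short sketch of the alias-versus-copy behaviour, using a made-up miniature of new_df with a single hypothetical keyword column kw:

```python
import pandas as pd

# Hypothetical miniature of new_df with one keyword-count column.
new_df = pd.DataFrame({"ID": [1, 2], "summary": ["xxx", "aaa"], "kw": [3, 0]})

alias = new_df                           # same object under a second name
df_freq = new_df.copy(deep=True)         # independent copy holding the counts
new_df_binary = new_df.copy(deep=True)   # independent copy for the 0/1 flags

# Turn the counts into 0/1 flags on the binary copy only.
new_df_binary["kw"] = (new_df_binary["kw"] > 0).astype(int)

print(alias is new_df)
print(df_freq["kw"].tolist())
print(new_df_binary["kw"].tolist())
```

Making both working frames copies keeps new_df itself unchanged too, which helps if it is needed again later.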

How to do certain changes to multiple dataframe

I have 4 dataframes and need to recalculate the same column in each of them.
I tried to create a list of the dataframes and then use a for loop to iterate over the list and apply the changes. After the loop, if I call a dataframe, no changes have been applied to it.
List_df = [df1, df2, df3, df4]
for df in List_df:
    df = df[3:]  # trimming the first 3 rows of the data frame
So after this code, if I call df1 to see whether the for loop changed anything, sadly it is still just as it was before the loop.
You need to invest some time learning how assignment in Python works. Assignment never mutates data. With df = ... you are just binding the name df to some object and then doing nothing with it. Then you rebind the name in the next iteration of the loop.
In addition, df[3:] does slice off the first three rows here, but df.iloc[3:] states the positional intent explicitly and is the clearer choice.
One solution to your problem is:
List_df[:] = [df.iloc[3:] for df in List_df]
This operation will mutate List_df by replacing its elements with new DataFrames. You will see the trimmed DataFrames as the elements of List_df, but not when you inspect the names df1, ..., df4 after the operation. These names still point to the original DataFrames.
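A small sketch of what slice assignment does and does not change, using two made-up five-row frames; same_list is a hypothetical second name added only to show that the list object itself is mutated:

```python
import pandas as pd

# Hypothetical five-row frames standing in for df1 and df2.
df1 = pd.DataFrame({"x": range(5)})
df2 = pd.DataFrame({"x": range(5, 10)})
List_df = [df1, df2]
same_list = List_df   # second name for the same list object

# Slice assignment replaces the contents of the existing list in place.
List_df[:] = [df.iloc[3:] for df in List_df]

print(len(same_list[0]))   # trimmed frame, visible through the other name too
print(len(df1))            # the name df1 still refers to the untrimmed frame
```

To have the names themselves refer to the trimmed frames, they need to be rebound explicitly, for example df1, df2 = List_df.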

Selecting Various "Pieces" of a List

I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing was quite my case. I apologize if there is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this case I am trying to select everything but 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
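On the syntax question itself: a plain Python list accepts a single index or a single slice inside the brackets, so [:6,8:10,11] raises a TypeError. A sketch of the slice-based version follows, with the column names copied from the question and selected as an illustrative variable name. Note that the end of a slice is exclusive.

```python
# Column names copied from the question.
cols = ['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'veh_class',
        'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
        'greenhouse_gas_score', 'smartway']

# A list index takes one slice at a time, so join several slices with +.
# Index 7 is 'air_pollution_score' and index 11 is 'greenhouse_gas_score'.
selected = cols[:7] + cols[8:11] + cols[12:]
print(selected)
```

Slicing by position breaks silently if the column order changes, which is one reason the name-based approaches above tend to be more robust.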
Let's say df is your dataframe. You can also use filter with a lambda, though it quickly becomes long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to each element of a list and keeps only the elements for which the function returns True.
We define that function with lambda; it returns True only if neither 'air_pollution_score' nor 'greenhouse_gas_score' occurs in the column name.
We're filtering the df.columns.values array, so the resulting list retains only the columns other than the two we mentioned.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3', ....]]  # note the double brackets: a list of column names
Note: data is the source you have already loaded with pandas; the selected columns are stored in a new dataframe, df.

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert a first value into it, and then add further values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append a value to an existing column in a pandas data frame. The thing is, a DataFrame has to keep a matrix-like shape, so every column has the same number of rows. What you can do is add a column with a default value and then update that value row by row:
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value
Don't do it, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
It is better to create a list of data and build the DataFrame with the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
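When the values arrive one at a time, the same idea can be sketched as follows: collect into a plain list, build the frame once, and concatenate if more values turn up later. The source loop and the more frame are made up for illustration.

```python
import pandas as pd

# Collect values in a plain list first; list.append is cheap.
vals = []
for letter in ["a", "b", "c"]:   # hypothetical source of values
    vals.append(letter)

# Build the frame once from the finished list.
df = pd.DataFrame({"New column": vals})

# Later values can be added in one step by concatenating another frame.
more = pd.DataFrame({"New column": ["d", "e"]})
df = pd.concat([df, more], ignore_index=True)
print(df)
```

ignore_index=True renumbers the rows so the result has a clean 0..n-1 index.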
If you need to fill the newly created column with random values, you could also use
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))  # random integers from 1 to 8

How can I drop a column from multiple dataframes stored in a list?

Apologies as I'm new to all this.
I'm playing around with pandas at the moment. I want to drop one particular column across two dataframes stored within a list. This is what I've written.
combine = [train, test]
for dataset in combine:
    dataset = dataset.drop('Id', axis=1)
However, this doesn't work. If I do this explicitly, such as train = train.drop('Id', axis=1), this works fine.
I appreciate in this case it's two lines either way, but is there some way I can use the list of dataframes to drop the column from both?
The reason your solution didn't work is that dataset is a name that points to an item in the list combine. You had the right idea to reassign it with dataset = dataset.drop('Id', axis=1), but all that did was rebind the name dataset; it did not place a new dataframe in the list combine.
Option 1
Create a new list
combine = [d.drop('Id', axis=1) for d in combine]
Option 2
Or alter each dataframe in place with inplace=True
for d in combine:
    d.drop('Id', axis=1, inplace=True)
Or maybe
combine = [df1, df2]
for i in range(len(combine)):
    combine[i] = combine[i].drop('Id', axis=1)
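The options differ in what happens to the original names. A sketch with made-up train and test frames follows; the x column and the cols_after_option1 variable are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for train and test.
train = pd.DataFrame({"Id": [1, 2], "x": [10, 20]})
test = pd.DataFrame({"Id": [3, 4], "x": [30, 40]})

# Option 1: a new list of new frames; train and test keep their 'Id' column.
combine = [d.drop("Id", axis=1) for d in [train, test]]
cols_after_option1 = list(train.columns)
print(list(combine[0].columns), cols_after_option1)

# Option 2: an in-place drop changes the very objects train and test refer to.
for d in [train, test]:
    d.drop("Id", axis=1, inplace=True)
print(list(train.columns), list(test.columns))
```

So if later code keeps using train and test by name, the in-place version or an explicit train, test = combine after Option 1 is needed.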
