I can't understand why this snippet of code:
df = PA.DataFrame()
[df.append(aFunction(x)) for x in aPandaSeries]
does not give me the same DataFrame (df) as:
df = PA.DataFrame()
for x in xrange(len(aPandaSeries)):
df = df.append(aFunction(aPandaSeries[x]))
I am trying to pythonise the second section by using the first section, but df has far fewer rows in the former than the latter.
A couple of things...
Unlike list.append, DataFrame.append does not modify the DataFrame in place; it returns a new DataFrame. So in your list comprehension, df.append(aFunction(x)) builds an appended copy on each iteration and immediately discards it, which is why df ends up with fewer rows. Your second snippet works precisely because it rebinds the result with df = df.append(...).
Also, list comprehensions are meant to build lists of values, not to run side effects, so it makes more sense to rewrite the first snippet as a plain loop:
for x in aPandaSeries:
df = df.append(aFunction(x))
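Since repeated append copies the frame on every iteration, it's usually faster to build all the pieces first and concatenate once. A minimal sketch, where aPandaSeries and aFunction are hypothetical stand-ins:

```python
import pandas as pd

# hypothetical stand-ins for the question's aPandaSeries and aFunction
aPandaSeries = pd.Series([1, 2, 3])

def aFunction(x):
    # returns a one-row DataFrame per input value
    return pd.DataFrame({"value": [x], "squared": [x ** 2]})

# build all pieces, then concatenate once
df = pd.concat([aFunction(x) for x in aPandaSeries], ignore_index=True)
print(df)
```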
Let's say df is a typical pandas.DataFrame instance. I am trying to understand how list(df) can return a list of column names.
The goal here is for me to track it down in the source code to understand how list(<pd.DataFrame>) returns a list of column names.
So far, the best resources I've found are the following:
Get a list from Pandas DataFrame column headers
Summary: There are multiple ways of getting a list of DataFrame column names, and each varies either in performance or idiomatic convention.
SO Answer
Summary: DataFrame follows a dict-like convention, thus coercing with list() would return a list of the keys of this dict-like structure.
pandas.DataFrame source code:
I can't find the part of the source code that shows how list() ends up producing the column names.
DataFrames are iterable. That's why you can pass them to the list constructor.
list(df) is equivalent to [c for c in df]. In both cases, DataFrame.__iter__ is called.
When you iterate over a DataFrame, you get the column names.
Why? Because the developers probably thought this is a nice thing to have.
Looking at the source, __iter__ returns an iterator over the attribute _info_axis, which seems to be the internal name of the columns.
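The equivalence described above can be checked directly (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# iterating a DataFrame yields its column labels
assert list(df) == ["a", "b"]
assert [c for c in df] == ["a", "b"]

# which is the same as iterating the columns axis
assert list(df) == list(df.columns)
```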
Actually, as you correctly stated in your question, one can think of a pandas DataFrame as a list of lists or, more accurately, as a dict-like object.
Take a look at this code which takes a dict and parses it into a df.
import pandas as pd
# create a dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(d)
print(df)
x = list(df)
print(x)
x = list(d)
print(x)
The result in both cases (for the dataframe df and the dict d) is this:
['col1', 'col2']
['col1', 'col2']
This result confirms your thinking that a "DataFrame follows a dict-like convention".
I'm trying to work out the correct method for cycling through a number of pandas dataframes using a 'for loop'. All of them contain 'year' columns from 1960 to 2016, and from each df I want to remove the columns '1960' to '1995'.
I created a list of dfs and also a list of str values for the years.
dflist = [apass,rtrack,gdp,pop]
dfnewlist =[]
for i in range(1960, 1996):
dfnewlist.append(str(i))
for df in dflist:
df = df.drop(dfnewlist, axis = 1)
My for loop runs without error, but it does not remove the columns.
Edit - Just to add, when I do this manually without the for loop, such as below, it works fine:
gdp = gdp.drop(dfnewlist, axis = 1)
This is a common issue with for loops. When you write
for df in dflist:
and then reassign df, the change does not happen to the actual object in the list; only the local name df is rebound.
Use enumerate to fix it:
for i,df in enumerate(dflist):
dflist[i]=df.drop(dfnewlist,axis=1)
To ensure some robustness, you can use the errors='ignore' flag; that way, if one of the columns doesn't exist, the drop won't error out.
However, your real problem is that when you loop, df starts by referring to the thing in the list. But then you overwrite the name df by assigning to that name the results of df.drop(dfnewlist, axis=1). This does not replace the dataframe in your list as you'd hoped but creates a new name df that no longer points to the item in the list.
Instead, you can use the inplace=True flag.
drop_these = [*map(str, range(1960, 1996))]
for df in dflist:
df.drop(drop_these, axis=1, errors='ignore', inplace=True)
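A self-contained sketch of the pitfall and the fix, using toy frames in place of apass, rtrack, etc.:

```python
import pandas as pd

drop_these = [str(y) for y in range(1960, 1996)]

# toy frames standing in for apass, rtrack, gdp, pop
dflist = [pd.DataFrame({"1960": [0], "1995": [0], "2016": [1]}) for _ in range(2)]

# rebinding df inside the loop leaves the list untouched
for df in dflist:
    df = df.drop(drop_these, axis=1, errors='ignore')
assert list(dflist[0]) == ["1960", "1995", "2016"]

# dropping in place changes the objects the list points at
for df in dflist:
    df.drop(drop_these, axis=1, errors='ignore', inplace=True)
assert list(dflist[0]) == ["2016"]
```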
The function that I'm applying is a little expensive, so I want it to calculate the value only once for each unique input.
The only solution I've been able to come up with has been as follows:
This step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one because .merge has to be called on DataFrames.
new_dataframe = pd.DataFrame( index = data['column'].unique(), data = new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution I guess I could just add this function to my library of common tools that I hedonistically > from me_tools import *
def apply_unique(data, column, function):
new_vals = pd.Series(data[column].unique()).apply(function)
new_dataframe = pd.DataFrame(data=new_vals.values, index=data[column].unique())
result = pd.merge(data, new_dataframe, right_index = True, left_on = column)
return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
return df.merge(df[[orig_col]]
.drop_duplicates()
.assign(**{new_col: lambda x: x[orig_col].apply(func)}
), how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (the calling one) to another DataFrame (the passed one) that contains two columns: the original column and the new column transformed from it.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: it's best to drop new_col if it already exists; otherwise the merge will append suffixes (_x, _y) to the duplicated columns:
if new_col in df:
df = df.drop(new_col, axis='columns')
I have this simple clean_data function, which will round the numbers in the input data frame. The code works, but I am very puzzled why it works. Could anybody help me understand?
The part where I got confused is this. table_list is a new list of data frame, so after running the code, each item inside table_list should be formatted, while tablea, tableb, and tablec should stay the same. But apparently I am wrong. After running the code, all three tables are formatted correctly. What is going on? Thanks a lot for the help.
table_list = [tablea, tableb, tablec]
def clean_data(df):
for i in df:
df[i] = df[i].map(lambda x: round(x, 4))
return df
map(clean_data, table_list)
Simplest way is to break down this code completely:
# List of 3 dataframes
table_list = [tablea, tableb, tablec]
# function that cleans 1 dataframe
# This will get applied to each dataframe in table_list
# when the python function map is used AFTER this function
def clean_data(df):
# for loop.
# df[i] will be a different column in df for each iteration
# i iterates through the column names.
for i in df:
# df[i] = will overwrite column i
# df[i].map(lambda x: round(x, 4)) in this case
# does the same thing as df[i].apply(lambda x: round(x, 4))
# in other words, it rounds each element of the column
# and assigns the reformatted column back to the column
df[i] = df[i].map(lambda x: round(x, 4))
# returns the formatted SINGLE dataframe
return df
# I expect this is where the confusion comes from.
# This is a built-in Python (not pandas) function that applies
# clean_data to each item in table_list
# and returns the results.
# map was also used in the clean_data function above. That map was
# a pandas Series method, not the same thing as this built-in map.
# They do similar things, but not exactly the same.
# (Note: in Python 3, map is lazy, so you would need list(map(...))
# to force the calls to actually run.)
map(clean_data, table_list)
Hope that helps.
In Python, a list of dataframes, or any complicated objects, is simply a list of references that will point to the underlying data frames. For example, the first element of table_list is a reference to tablea. Therefore, clean_data will go directly to the data frame, i.e., tablea, following the reference given by table_list[0].
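That reference behaviour can be demonstrated directly; a minimal sketch (assuming Python 3, where the bare map call would otherwise be lazy and do nothing):

```python
import pandas as pd

tablea = pd.DataFrame({"x": [1.23456789]})
table_list = [tablea]

# the list holds a reference to the same object, not a copy
assert table_list[0] is tablea

def clean_data(df):
    for i in df:
        df[i] = df[i].map(lambda x: round(x, 4))
    return df

# wrap in list() to force the calls in Python 3, where map is lazy
list(map(clean_data, table_list))

# the mutation is visible through the original name too
assert abs(tablea.loc[0, "x"] - 1.2346) < 1e-9
```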
I have a for loop. At each iteration a dataframe is created. I want this dataframe to be appended to an overall result dataframe.
Currently I tried to do it with this code:
resultDf = pd.DataFrame()
for name in list:
iterationresult = calculatesomething(name)
resultDf.append(iterationresult)
print(resultDf)
However, the resultDf is empty.
How can this be done?
UPDATE
I think changing
resultDf.append(iterationresult)
to
resultDf = resultDf.append(iterationresult)
does the trick
Not iterative, but how about simply:
df = pd.DataFrame([calculatesomething(name) for name in list])
This is much more straightforward, and faster as well.
Another idiomatic idea could be to do this:
df = pd.DataFrame(list, columns = ["name"])
df["calc"] = df.name.map(calculatesomething)
By the way, it's a bad practice to call a list list, because it will shadow the builtin type.
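A small runnable version of that second idea, with a hypothetical calculatesomething and the list renamed so it no longer shadows the builtin:

```python
import pandas as pd

# hypothetical stand-in for the question's calculatesomething
def calculatesomething(name):
    return len(name)

names = ["alice", "bob"]   # renamed from `list` to avoid shadowing the builtin

df = pd.DataFrame(names, columns=["name"])
df["calc"] = df.name.map(calculatesomething)
print(df)
```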