In Pandas we can generate a correlation matrix with .corr(). My question is quite simple: is the column order of the original dataframe preserved? From my testing it seems to be the case, but I want to be sure.
I am asking because I am on Python 3.7.3, where dictionaries maintain insertion order. I don't know whether this question is related to that, but if Pandas uses dictionaries behind the scenes, it might well be that corr() is ordered as expected on Python 3.6+ but not on earlier versions.
Well, if you look at the source code for corr, the following code is at the start:
numeric_df = self._get_numeric_data()
cols = numeric_df.columns
idx = cols.copy()
mat = numeric_df.values
As you see here, as long as the method _get_numeric_data preserves order, corr should preserve order as well. Diving a bit deeper into _get_numeric_data, you can see this block:
self._consolidate_inplace()
return self.combine([b for b in self.blocks if b.is_numeric], copy)
_consolidate_inplace constructs consolidated sections of the dataframe in a tuple (order preserved), while _get_numeric_data uses a list comprehension to filter this tuple to only numeric blocks (order still preserved).
More to the point, pandas isn't actually using a dictionary for your column names.
Columns themselves are just instances of the Index class, which (per its docstring) is ordered.
So, to answer your question: yes, order is guaranteed in corr, because the way it obtains and iterates through the dataframe columns also preserves order.
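A quick empirical check (a throwaway example, not from the pandas docs) shows the same thing:

import numpy as np
import pandas as pd

# deliberately non-alphabetical column order
df = pd.DataFrame(np.random.rand(10, 3), columns=['z', 'a', 'm'])
corr = df.corr()
print(list(corr.columns))  # ['z', 'a', 'm'] -- original order, not sorted
print(list(corr.index))    # ['z', 'a', 'm'] -- same order on the index axis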
There is this code commonly used in Python pandas: "for index, row in df.iterrows()".
What is the difference between displaying these during the loop:
print(row)
print(row.index)
print(row.index[index])
print(row[index])
I tried printing them but can't comprehend what each one does or how it selects the content, and I can't find a well-explained source online.
For one, it's more concise.
Secondly, you're only supposed to use it for reading data rather than modifying it. According to the docs you may get unpredictable results if you modify while iterating (a concurrent-modification thing, methinks).
As to how it selects it, the docs also say it just returns the individual rows as pd.Series, with the index being the id pandas uses to keep track of each row in the pd.DataFrame. I'd guess it'd be akin to using Python's zip() on a list of ints [0..n] and a list of pd.Series.
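A small example may make the four prints concrete (the frame and its values here are made up):

import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})
for index, row in df.iterrows():
    print(row)               # the whole row as a Series, labelled by column name
    print(row.index)         # the row's own index, i.e. the column names: Index(['a', 'b'])
    print(row.index[index])  # the column name at position `index` -- only meaningful
                             # while the row label happens to be a valid position
    print(row[index])        # looks `index` up in the row Series; with integer row
                             # labels and string column names, older pandas falls back
                             # to positional lookup, and newer versions deprecate that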
I have some legacy code with multiple instances like this...
result = function(df['a'], df['b'], df['z'])
The function accepts *args so I wondered if I could "tidy" the code by doing the following...
result = function(df[['a','b','z']].iteritems())
But iteritems() yields (name, Series) pairs rather than bare Series, so it doesn't work.
Is there a "tidy" way to get access to the list of Series only? (no pairs, no name)
(Changing the function is not ideal; it's designed to work with Scalars and Arrays, and as a Series is ArrayLike they work too. So I just would "like" a list of the Series on their own...)
My best attempt is just to get the Series as arrays instead, but I "dis-like" it due to multiple instances of boilerplate code; it feels like there "should" be a direct way to iterate over the Series:
result = function(*(df[['a','b','z']].to_numpy().T))
Iterating over a DataFrame yields its column names, so you can use a list comprehension:
function(*[df[i] for i in df[["a","b","z"]]])
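A self-contained sketch with a stand-in function (the function body here is invented, just to demonstrate the unpacking):

import pandas as pd

def function(*args):  # stand-in for the legacy *args function
    return sum(args)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'z': [5, 6]})
result = function(*[df[c] for c in df[['a', 'b', 'z']]])
print(result)  # element-wise sum of the three Series: 9 and 12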
If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python does something akin to copying references by value, chosen_df will initially be a reference to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated: append returns a new object, and the assignment merely rebinds the name chosen_df to it. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because list.append() mutates the list's contents directly.
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and referencing each underlying dataframe concretely in turn. But this isn't scalable to more complex examples, and it's a generic solution akin to references that I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df is non-sequential, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a groupby result, you'll need to calculate the index from the parent dataframe to ensure it's truly the max across all rows.
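A minimal sketch of the three cases (parent_df is an assumed name for the frame a slice was taken from):

# case 1: sequential RangeIndex starting at 0
max_idx = len(chosen_df.index) - 1

# case 2: non-sequential numeric index -- take the maximum label instead
max_idx = chosen_df.index.max()

# case 3: chosen_df is a slice of a parent frame -- compute off the parent
max_idx = parent_df.index.max()

chosen_df.loc[max_idx + 1] = chosen_row  # mutates chosen_df in place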
I have two lists in form of pandas DataFrames which both contain a column of names. Now I want to compare these names and return a list of names which appear in both DataFrames. The problem is that my solution is way too slow since both list have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas dataframe alphabetically using df.sort_values in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" is only compared to entries in the second list that start with the same letter.
I suspect that the main reason my program is running so slow is my way of accessing the fields which I am comparing.
I use a specific comparison function to compare the names and access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this specific comparison function is more complex than a simple "==", since I am doing a kind of fuzzy string comparison to make sure names with slightly different spellings still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas dataframe comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
I also thought about switching from pandas to numpy but I read that this might slow it down even further since pandas is faster for big data amounts.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas dataframe? Or is there a faster way in general to run a custom comparison function over two pandas dataframes?
Edit 2: spelling, additional information.
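For what it's worth, one way to sidestep the per-element df.at lookups is to materialise the name columns as plain Python lists before the loops; a minimal sketch, assuming the same who.ratio comparison and save helper as above:

names1 = list1['name'].tolist()  # one bulk extraction instead of repeated .at calls
names2 = list2['name'].tolist()

for i, n1 in enumerate(names1):
    for j, n2 in enumerate(names2):
        if who.ratio(n1, n2) > 75:  # same fuzzy comparison as before
            save(i, j)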
I am new to programming and right now I'm writing a league table in Python. I would like to sort my league first by points; if two teams have the same points I would like to sort them by goal difference, and if they also have the same goal difference, by name.
The first condition is pretty easy and is working by the following:
table.sort(reverse=True, key=Team.getPoints)
How do I insert the two following conditions?
Have the key function return a tuple, with items in decreasing order of priority:
table.sort(reverse=True, key=lambda team: (Team.getPoints(team),
                                           Team.getGoalDifference(team),
                                           Team.getName(team)))
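One caveat: reverse=True reverses the name ordering too. If names should stay ascending while the numeric keys descend, a common idiom is to negate the numeric keys and sort ascending instead (this assumes points and goal difference are numeric):

table.sort(key=lambda team: (-Team.getPoints(team),
                             -Team.getGoalDifference(team),
                             Team.getName(team)))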
Alternatively, you could remember a factoid from Algorithms 101 and make use of the fact that .sort() is a stable sort, which doesn't change the relative order of items in a list that compare as equal. This means you can sort three times, in increasing order of priority:
table.sort(reverse=True, key=Team.getName)
table.sort(reverse=True, key=Team.getGoalDifference)
table.sort(reverse=True, key=Team.getPoints)
This will be slower, but allows you to easily specify whether each step should be done in reverse or not. This can be done without multiple sorting passes using cmp_to_key(), but the comparator function would be nontrivial, something like:
import functools

def cmp(a, b):  # Python 3 removed the built-in cmp(); this is the usual shim
    return (a > b) - (a < b)

def team_cmp(t1, t2):
    # keys in decreasing order of priority, each with its own reverse flag
    for key_func, reverse in [(Team.getPoints, True),
                              (Team.getGoalDifference, True),
                              (Team.getName, True)]:
        result = cmp(key_func(t1), key_func(t2))
        if reverse:
            result = -result
        if result:
            return result
    return 0

table.sort(key=functools.cmp_to_key(team_cmp))
(Disclaimer: the above is written from memory, untested.) Emphasis is on "without multiple passes", which does not necessarily imply "faster". The overhead from the comparator function and cmp_to_key(), both of which are implemented in Python (as opposed to list.sort() and operator.itemgetter(), which should be part of the C core) is likely to be significant.
As an aside, you don't need to create dummy functions to pass to the key parameters. You can access the attribute directly, using:
table.sort(key=lambda t: t.points)
or the attrgetter wrapper from the operator module:
table.sort(key=attrgetter('points'))
Sort the list by name first, then sort again by goal difference, and finally by points. Python's sort is stable, meaning it preserves the order of elements that compare equal.
Python's sorting algorithm is Timsort, which, as ACEfanatic02 points out, is stable, meaning order is preserved. This link has a nice visual explanation of how it works.