If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python assignment does something akin to copying a reference by value, chosen_df will initially refer to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by append; append returns a new dataframe and the name chosen_df is simply rebound to it. I believe this would work if inplace=True were available, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that the equivalent example using lists rather than dataframes does work, because list.append() directly mutates the list's contents.
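To make the contrast concrete, here is a minimal sketch reusing the candidate names from above (pd.concat stands in for append, which recent pandas versions have removed):

import pandas as pd

# Lists: append() mutates the object that both names refer to.
candidate_a = [1, 2]
chosen_list = candidate_a
chosen_list.append(99)
print(candidate_a)          # [1, 2, 99] -- the underlying list was updated

# DataFrames: append()/concat() return a brand-new object, so rebinding
# chosen_df leaves the original candidate untouched.
candidate_a_df = pd.DataFrame({'x': [1, 2]})
chosen_df = candidate_a_df
chosen_df = pd.concat([chosen_df, pd.DataFrame({'x': [3]})])
print(len(candidate_a_df))  # still 2 -- candidate_a_df was not modified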
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But that isn't scalable to more complex examples, and it's a generic solution akin to references that I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a group of a larger dataframe, you'll need to calculate the index from the parent dataframe's maximum to ensure it's truly the max across all rows.
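A rough sketch of those three cases (parent_df is a hypothetical name for the full dataframe a slice would have come from):

import pandas as pd

chosen_df = pd.DataFrame({'name': ['a', 'b']})   # sequential RangeIndex 0..n-1
chosen_row = pd.Series({'name': 'c'})

# Case 1: sequential index starting at 0 -- the length gives the next free label.
max_idx = len(chosen_df.index) - 1

# Case 2: non-sequential index -- take the maximum of the existing labels.
# max_idx = chosen_df.index.max()

# Case 3: chosen_df is a slice of a larger dataframe -- use the parent's maximum
# so the new label cannot collide with rows outside the slice.
# max_idx = parent_df.index.max()

chosen_df.loc[max_idx + 1] = chosen_row
print(chosen_df)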
There is this code commonly used in Python pandas: for index, row in df.iterrows().
What is the difference between displaying these during the loop:
print(row)
print(row.index)
print(row.index[index])
print(row[index])
I tried printing them but can't comprehend what each one does or how it selects the content, and I can't find a well-explained source online.
For one, it's more concise.
Secondly, you're only supposed to use it for displaying data rather than modifying it. According to the docs you may get unpredictable results if you modify what you're iterating over (a concurrent-modification thing, methinks).
As to how it selects things, the docs also say it just returns the individual rows as a pd.Series, with the index being the label pandas uses to keep track of each row in the pd.DataFrame. I'd guess it's akin to using Python's zip() on a list of ints [0..n] and a list of pd.Series.
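A minimal sketch of what those four prints actually show (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})

for index, row in df.iterrows():
    print(index)             # the dataframe's row label: 0, then 1
    print(row)               # the whole row as a pd.Series, labelled by column name
    print(row.index)         # the Series' own index, i.e. the column names: Index(['a', 'b'])
    print(row.index[index])  # positional lookup into the column names: 'a', then 'b'
                             # (only works while the row label is a valid position)
    print(row[index])        # looks the row up by the key `index`; with a non-integer
                             # column index older pandas falls back to positional lookup
                             # (deprecated), so the result depends on your pandas version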
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Read the PySpark style guide from Palantir (hosted on GitHub), which explains when to use F.col() and when not to, along with other best practices.
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations that lead us to prefer the second style:
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on (named df in this case). It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance", which is often challenging to debug.
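A small runnable sketch of the reusability point (the dataframes, column names and the cleaned() helper are invented for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(' Alice ',)], ['customer_name'])
df2 = spark.createDataFrame([(' Bob ',)], ['supplier_name'])

# With F.col() the expression is not tied to any particular dataframe variable,
# so it can live in a reusable helper:
def cleaned(col_name):
    return F.trim(F.lower(F.col(col_name)))

df1.select(cleaned('customer_name')).show()
df2.select(cleaned('supplier_name')).show()

# The attribute style ties the expression to one concrete dataframe:
df1.select(F.trim(F.lower(df1.customer_name))).show()  # cannot be reused for df2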
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling depending on the type of argument you use.
F.col('col_name'), df['col_name'] and df.col_name are the same type of object, a Column. It is almost the same to use one syntax or another. A little difference is that you could write, for example:
df_2.select(F.lower(df.col_name))  # where the column is from another dataframe
# Spoiler alert: it may raise an error!
When you call df.select(F.lower('col_name')), if a lower(smth: String) signature is not defined in Scala, then you will get an error. Some functions are defined with a string as input, others take only Column objects. Try it to find out whether it works, and then use it; otherwise, you can make a pull request on the Spark project to add the new signature.
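A small sketch of the two calling styles (the exact behaviour of the cross-dataframe case varies by Spark version, so treat it as illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('FOO',)], ['col_name'])
df_2 = spark.createDataFrame([(1,)], ['other'])

df.select(F.lower('col_name')).show()         # works: lower() also accepts a column-name string
df.select(F.lower(F.col('col_name'))).show()  # works: explicit Column object

# Referencing a column of one dataframe while selecting from another is where
# things can blow up, as the "spoiler alert" above suggests:
# df_2.select(F.lower(df.col_name)).show()    # typically raises an AnalysisException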
In Pandas we can generate a correlation matrix with .corr(). My question is quite simple: is the column order of the original dataframe preserved? From my testing it seems the case, but I want to be sure.
I am asking because I am on Python 3.7.3 where dictionaries maintain insertion order. I don't know if the current question is related to that, but if Pandas uses dictionaries behind the scenes, then it might very well be that corr() is ordered as expected in Python 3.6+ but not in lower versions.
Well, if you look at the source code for corr, the following code is at the start:
numeric_df = self._get_numeric_data()
cols = numeric_df.columns
idx = cols.copy()
mat = numeric_df.values
As you see here, as long as the method _get_numeric_data preserves order, corr should preserve order as well. Diving a bit deeper into _get_numeric_data, you can see this block:
self._consolidate_inplace()
return self.combine([b for b in self.blocks if b.is_numeric], copy)
_consolidate_inplace constructs consolidated sections of the dataframe in a tuple (order preserved), while _get_numeric_data uses a list comprehension to filter this tuple to only numeric blocks (order still preserved).
More to the point, pandas isn't actually using a dictionary for your column names.
The columns themselves are just instances of the Index class, which (from the docstring) is ordered.
So, to answer your question: yes, order is guaranteed in corr, because the way it obtains and iterates through the dataframe columns also preserves order.
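A quick empirical check, as a sketch with made-up column names:

import numpy as np
import pandas as pd

# Deliberately non-alphabetical column order.
df = pd.DataFrame(np.random.rand(5, 3), columns=['c', 'a', 'b'])

corr = df.corr()
print(list(df.columns))    # ['c', 'a', 'b']
print(list(corr.columns))  # ['c', 'a', 'b'] -- original column order preserved
print(list(corr.index))    # ['c', 'a', 'b'] -- same on the index axis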
I have two lists in form of pandas DataFrames which both contain a column of names. Now I want to compare these names and return a list of names which appear in both DataFrames. The problem is that my solution is way too slow since both list have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas dataframe alphabetically using df.sort_values in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" will only be compared to entries with the same first letter in the second list.
I suspect that the main reason my program is running so slow is my way of accessing the fields which I am comparing.
I use a specific comparison function to compare the names and access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this specific comparison function is more complex than a simple "==", since I am doing a kind of fuzzy string comparison to make sure names with slightly different spellings still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas dataframe comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
I also thought about switching from pandas to numpy but I read that this might slow it down even further since pandas is faster for big data amounts.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas dataframe? Or is there a faster way in general to run a custom comparison function over two pandas dataframes?
Edit 2: spelling, additional information.
I found out that I have a problem understanding when I should access data from a dataframe (df) using df[data] or df.data.
I mostly use the [] method to create new columns, but I can also access data using both df[] and df.data. What's the difference, and how can I better grasp these two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use the . access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
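To make those cases concrete, a small sketch (the labels are made up):

import pandas as pd

s = pd.Series([3, 1, 2], index=['1', 'min', 'index'])

# s.1             -> SyntaxError: '1' is not a valid Python identifier
print(s.min)      # the Series.min method, NOT the element labelled 'min'
print(s['1'])     # 3 -- standard indexing still reaches the element
print(s['min'])   # 1
print(s['index']) # 2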
However, while "indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases [...]", in production you should really use the optimized pandas data access methods such as .loc, .iloc, and .ix, because "[...] since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods."
Using [], the value inside the brackets is evaluated first, so you can pass a variable:
a = "hello"
df[a]  # gives you the column labelled 'hello'
Using . (attribute access):
df.a  # gives you the column literally named 'a'
The difference is that with the first one you can use a variable.
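Putting it together, a short sketch (the column names are invented):

import pandas as pd

df = pd.DataFrame({'hello': [1, 2], 'a': [3, 4]})

a = "hello"
print(df[a])         # the 'hello' column -- the variable is evaluated first
print(df["a"])       # the column literally named 'a'
print(df.a)          # also the column literally named 'a' -- no variable lookup here
print(df.loc[:, a])  # the .loc accessor works with variables too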