Is there a quick way to find all unique identifiers in a dataframe? - python

I have a dataframe with over 100 columns, and I would like to check all pairs of columns to see which combinations act as unique identifiers.

You can use drop_duplicates(subset=...), passing the columns you regard as possible identifiers in the subset argument.
Since you have so many columns, it will probably be easiest to take all the columns and subtract from them the ones you want to disregard (such as value columns).
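For example, a minimal sketch (with a made-up frame and made-up column names) that checks every candidate pair by comparing the deduplicated length against the full length:
import pandas as pd
from itertools import combinations

# made-up example data; replace with your own dataframe
df = pd.DataFrame({
    "shop": [1, 1, 2, 2],
    "product": ["A", "B", "A", "B"],
    "value": [10, 20, 30, 10],
})

ignore = ["value"]  # the columns you disregard, e.g. value columns
candidates = [c for c in df.columns if c not in ignore]

# a pair is a unique identifier if dropping duplicates on it keeps every row
unique_pairs = [
    pair for pair in combinations(candidates, 2)
    if len(df.drop_duplicates(subset=list(pair))) == len(df)
]
print(unique_pairs)  # [('shop', 'product')]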

You can also use collections.Counter (from collections import Counter); see the docs.
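Applied to the same problem, a hedged sketch of the Counter idea: count how often each value combination occurs in a candidate pair of columns; the pair is a unique identifier only if no combination occurs more than once (the data and column names are made up):
from collections import Counter
import pandas as pd

# made-up example data
df = pd.DataFrame({"shop": [1, 1, 2], "product": ["A", "B", "A"]})

counts = Counter(zip(df["shop"], df["product"]))
print(max(counts.values()) == 1)  # True: every (shop, product) combination occurs once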

Related

Showing cells with a particular symbol in a pandas dataframe

I have not seen such a question before, so if you happen to know the answer or have seen the same question, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows. One of the columns is "price" and I need to do some manipulations with it, but the data was parsed from a web page and is not clean, so I cannot convert this column to an integer type even after getting rid of the dollar sign and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells containing /mo and decide whether I can drop them, based on how many of those I have and what the prices are.
So far I have managed to count those cells using
df["price"].str.contains("/").sum()
but when I want to see those cells, I cannot: when I create another variable to extract the slash-containing cells and use "contains" or similar, I get a series of True/False values showing whether each cell contains the slash, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as an index to select the matching rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
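A small self-contained sketch (the prices are made up) showing both the count and the rows themselves:
import pandas as pd

df = pd.DataFrame({"price": ["$3,500", "3500/mo", "$1,200"]})

mask = df["price"].str.contains("/")
print(mask.sum())  # 1 -- how many cells contain the slash
print(df[mask])    # the offending rows themselves, not just True/False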

Why do double square brackets create a DataFrame with loc or iloc?

Comparing:
1. df.loc[:,'col1']
2. df.loc[:,['col1']]
Why does (2) create a DataFrame, while (1) creates a Series?
In principle, when you pass a list, it can contain more than one column name, so it is natural for pandas to give you a DataFrame, because only a DataFrame can hold more than one column. When you pass a string instead of a list, pandas can safely assume you want exactly one column, so returning a Series is not a problem. Treat the two formats and two outcomes as useful flexibility to get whichever you need, a Series or a DataFrame; sometimes you need specifically one of the two.
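A quick illustration with a throwaway frame:
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(type(df.loc[:, "col1"]))    # <class 'pandas.core.series.Series'>
print(type(df.loc[:, ["col1"]]))  # <class 'pandas.core.frame.DataFrame'>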

What's the fastest way to do these tasks?

I originally have some time series data, which looks like this, and I have to do the following:
1. Import it as a dataframe
2. Set the date column as the datetime index
3. Add some indicators, such as a moving average, as new columns
4. Do some rounding (on the values of whole columns)
5. Shift a column one row up or down (just to manipulate the data)
6. Convert the df to a list (because I need to loop over it based on some conditions, and that is a lot faster than looping over a df, and I need speed)
7. Convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
But I found out that converting to a dict takes a lot longer than converting to a list, even if I do it manually instead of using the built-in pandas method.
My question is: is there a better way to do this? Maybe not import it as a dataframe in the first place, and still be able to do points 2 to 5? At the end I need to convert to a dict that allows me to do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this; the format is similar to the df, where each row is basically the date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
On item #6: You don't need to convert the df to a list or loop over it; here are better options (look for the second answer, it says DON'T).
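Regarding item #7, a minimal sketch (the column names and dates are made up) of what df.to_dict(orient="index") returns when the dates are the index; the column names become the inner keys:
import pandas as pd

df = pd.DataFrame(
    {"close": [10.0, 11.0], "ma": [9.5, 10.5]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

records = df.to_dict(orient="index")
print(records)
# {Timestamp('2020-01-01 00:00:00'): {'close': 10.0, 'ma': 9.5},
#  Timestamp('2020-01-02 00:00:00'): {'close': 11.0, 'ma': 10.5}}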

Groupby followed by a series

df.Last_3mth_Avg.isnull().groupby([df['ShopID'],df['ProductID']]).sum().astype(int).reset_index(name='count')
The code above helps me see the number of null values by ShopID and ProductID. The question is: df.Last_3mth_Avg.isnull() becomes a series, so how can groupby([df['ShopID'],df['ProductID']]) be used on it afterwards?
I use the solution from:
Pandas count null values in a groupby function
You should filter your df first:
df[df.Last_3mth_Avg.isnull()].groupby(['ShopID','ProductID']).agg('count')
There are two ways to use groupby:
The common way is to call it on the dataframe, so you just mention the column names in the by= parameter.
The second way is to call it on a series, but pass equally sized series in the by= parameter. This is rarely used and helps when you want to do conversions on a specific column and use groupby in the same line.
So the code line above should work; see the sketch below.
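A sketch of both ways on a tiny made-up frame that reuses the question's column names:
import pandas as pd

# made-up data with the question's column names
df = pd.DataFrame({
    "ShopID": [1, 1, 2],
    "ProductID": ["A", "A", "B"],
    "Last_3mth_Avg": [None, 5.0, None],
})

# way 1: filter first, then group the dataframe by column names
print(df[df.Last_3mth_Avg.isnull()].groupby(["ShopID", "ProductID"]).size())

# way 2: group the derived boolean series by equally sized series
print(df.Last_3mth_Avg.isnull()
        .groupby([df["ShopID"], df["ProductID"]])
        .sum().astype(int).reset_index(name="count"))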

Is pandas.DataFrame.groupby Guaranteed To Be Stable?

I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply implicitly assuming that groupby is stable - that is, if a and b are instances of the same group, and, pre-grouping, a appeared before b, then a will appear before b following the grouping as well.
I think there are several answers implicitly relying on this, but, to be concrete, here is one using groupby+cumsum.
Is there anything actually promising this behavior? The documentation only states:
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Also, since pandas has indices, the functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).
Although the docs don't state it, internally groupby uses a stable sort when generating the groups.
See:
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356
As I mentioned in the comments, this is important if you consider transform, which will return a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to perform additional work, as it would need to sort the Series prior to assigning. In fact, this is mentioned in the comments:
_algos.groupsort_indexer implements counting sort and it is at least
O(ngroups), where
ngroups = prod(shape)
shape = map(len, keys)
That is, linear in the number of combinations (cartesian product) of unique
values of groupby keys. This can be huge when doing multi-key groupby.
np.argsort(kind='mergesort') is O(count x log(count)) where count is the
length of the data-frame;
Both algorithms are stable sort and that is necessary for correctness of
groupby operations.
e.g. consider:
df.groupby(key)[col].transform('first')
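A tiny sketch (made-up data) of the within-group order preservation that transform('first') relies on:
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "a"], "val": [1, 2, 3, 4]})

# rows keep their original order within each group, so 'first'
# picks val=1 for group 'b' and val=2 for group 'a'
print(df.groupby("key")["val"].transform("first"))  # 1, 2, 1, 2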
Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":
sort : bool, default True
Sort group keys. Get better performance by
turning this off. Note this does not influence the order of
observations within each group. Groupby preserves the order of rows
within each group.
