Transitioning from pandas .apply to a vectorization approach - python

I am trying to improve a pandas iteration with a purely vectorized approach. I am a little new to vectorization and am having trouble getting it to work.
Within one dataframe field, I am finding all the unique string-based address records. I need to search the dataframe for each unique address individually and assign a single unique identifier to the returned records. In this way, I can have 1 UID for each address regardless of multiple occurrences in the dataframe.
I have developed an approach that utilizes vectorization with the pandas .apply method.
import uuid
import pandas as pd

def addr_id(x):
    global df
    df['Unq_ID'][df['address'] == x] = uuid.uuid4()

pd.DataFrame(df['address'].unique(), columns=["column1"]).apply(lambda x: addr_id(x["column1"]), axis=1)
However, I am trying to do away with the .apply method completely. This is where I am stuck.
df['Unq_ID'][df['address'] == (pd.DataFrame(df['address'].unique(), columns=["column1"]))["column1"]] = uuid.uuid4()
I keep getting a ValueError: Can only compare identically-labeled Series objects

You want to get rid of the Pandas apply due to performance reasons, right?
May I suggest a different approach to your problem?
You can construct a dict with the unique addresses as keys and fresh uuids as values and then map them onto the DataFrame:
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)
This would be very fast because it avoids looping in Python (which Pandas apply does under the hood).
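A minimal runnable sketch of this approach, using the question's address and Unq_ID column names on made-up data:

import uuid
import pandas as pd

# Toy frame with repeated addresses (invented data for illustration)
df = pd.DataFrame({'address': ['1 Main St', '2 Oak Ave', '1 Main St', '3 Elm Rd']})

# One uuid per unique address, mapped back onto every row
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)

print(df)  # rows sharing an address share the same Unq_ID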

Related

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" and it has over 12k distinct values. For sampling purposes I would like to get a small sample where there are only 1000 different values of this category column.
Before I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient as I'm doing a distinct of the category column and then casting it into a normal python list so I can use isin() filter.
Is there any "spark" way of doing this? I thought maybe something with window functions could do the job, but I can't get it to work.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using left_semi is a good way to filter a table using another table, keeping the same schema, in an efficient way.
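A small end-to-end sketch of the left_semi approach; the sample rows and the cat_ category names are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented sample data: many rows, a handful of distinct categories
df = spark.createDataFrame(
    [(i, "cat_{}".format(i % 5)) for i in range(100)],
    ["id", "category"],
)

# Keep only rows whose category appears in the limited distinct sample
small_distinct = df.select("category").distinct().limit(3)
df_sampled = df.join(small_distinct, "category", "left_semi")

df_sampled.show()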

Creating a new dataframe column as a function of other columns [duplicate]

This question already has answers here:
Pandas Vectorized lookup of Dictionary
(2 answers)
Closed 2 years ago.
I have a dataframe with a Country column. It has rows for around 15 countries. I want to add a Continent column using a mapping dictionary, ContinentDict, that maps country names to continent names.
I see that these two work
df['Population'] = df['Energy Supply'] / df['Energy Supply per Capita']
df['Continent'] = df.apply(lambda x: ContinentDict[x['Country']], axis='columns')
but this does not
df['Continent'] = ContinentDict[df['Country']]
Looks like the issue is that df['Country'] is a Series object, so the third statement is not treated the same as statement 2.
Questions
Would love to understand why statement 1 works but not 3. Is it because dividing two Series objects is defined as an element-wise divide?
Any way to change 3 to say I want an element-wise operation without having to go the apply route?
From your statement "a mapping dictionary, ContinentDict", it looks like ContinentDict is a Python dictionary. In this case,
ContinentDict[some_key]
is a pure Python dictionary lookup, regardless of what object some_key is. That's why the 3rd call fails: df['Country'] is not (and never can be) a key in the dictionary, because a Series is mutable and dictionary keys must be hashable.
Plain dictionary indexing only accepts an exact key and raises an error when the lookup fails.
Pandas does provide a tool for you to replace/map the values:
df['Continent'] = df['Country'].map(ContinentDict)
df['Continent']=df['Country'].map(ContinentDict)
In case 1, you are dealing with two pandas Series, so pandas knows how to divide them element-wise.
In case 2, you have a Python dictionary and a pandas Series; plain dictionary indexing doesn't know what to do with a Series (df['Country'] is a pandas Series, not a key in the dictionary).
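A small runnable illustration of the difference; the country and continent values are just sample data:

import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Japan', 'Brazil']})
ContinentDict = {'France': 'Europe', 'Japan': 'Asia', 'Brazil': 'South America'}

# ContinentDict[df['Country']] raises a TypeError: a Series cannot be used as a dict key.
# .map performs the lookup element by element instead:
df['Continent'] = df['Country'].map(ContinentDict)
print(df)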

MultiLevel indexing pandas trying to collate two different Series

So currently I am trying to write a function which will take a given document id and produce a list of other document ids related to it. This is to be achieved by checking this document id against a table of document ids and user ids; each user id then gets checked for which document ids they have accessed.
I have a function which will return a pandas Series for either of these requests, but now I would like to put them together so I can run calculations.
I believe the best way to go about this is to utilise MultiLevel indexing, producing a DataFrame like this:
user_id   document_id
user_a    doc_a
          doc_b
          doc_c
user_b    doc_d
          doc_e
user_c    doc_f
          doc_g
I am not sure how to go about producing this though. What I can do currently is produce a Series of user_id, I then make a DataFrame with this Series as the first column. I can then produce a second column like so:
df['document_id'] = df['user_id'].apply(lambda x: return_documents(x))
However, all this is doing is producing a series in each cell of the document_id column.
Any help would be appreciated, thanks!
I think what you need is explode:
df['document_id'] = df['user_id'].apply(lambda x: return_documents(x))
df = df.explode('document_id')
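A minimal sketch of the apply-then-explode pattern; return_documents here is a hypothetical stand-in for the real lookup function, and the user/document ids are invented:

import pandas as pd

# Hypothetical stand-in for the real lookup function
def return_documents(user):
    docs = {'user_a': ['doc_a', 'doc_b', 'doc_c'],
            'user_b': ['doc_d', 'doc_e'],
            'user_c': ['doc_f', 'doc_g']}
    return docs.get(user, [])

df = pd.DataFrame({'user_id': ['user_a', 'user_b', 'user_c']})
df['document_id'] = df['user_id'].apply(return_documents)

# explode turns each list into one row per document
df = df.explode('document_id')
print(df)

# For the hierarchical layout shown in the question:
df = df.set_index(['user_id', 'document_id'])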

Python Pandas group by iteration

I am iterating over a groupby column in a pandas dataframe in Python 3.6 with the help of a for loop. The problem with this is that it becomes slow if I have a lot of data. This is my code:
import pandas as pd
dataDict = {}
for metric, df_metric in frontendFrame.groupby('METRIC'):  # Creates frames for each metric
    dataDict[metric] = df_metric.to_dict('records')  # Converts dataframe to dictionary
frontendFrame is a dataframe containing two columns: VALUE and METRIC. My end goal is basically creating a dictionary with a key for each metric containing all data connected to it. I know this should be possible to do with lambda or map but I can't get it working with multiple arguments: frontendFrame.groupby('METRIC').apply(lambda x: print(x))
How can I solve this and make my script faster?
If you do not need any calculation after the groupby, do not group the data; you can use .loc to get what you need:
s = frontendFrame.METRIC.unique()
frontendFrame.loc[frontendFrame.METRIC == s[0], :]
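Applied to the question's goal, the same .loc idea can build the whole dictionary at once; this is a sketch that assumes frontendFrame has the METRIC and VALUE columns described above (the sample values are invented):

import pandas as pd

# Invented sample data matching the question's description
frontendFrame = pd.DataFrame({
    'METRIC': ['cpu', 'cpu', 'mem', 'mem'],
    'VALUE': [0.5, 0.7, 1024, 2048],
})

s = frontendFrame.METRIC.unique()
dataDict = {m: frontendFrame.loc[frontendFrame.METRIC == m].to_dict('records') for m in s}
print(dataDict)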

Filtering pandas DataFrame

I'm reading in a .csv file using pandas, and then I want to filter out the rows where a specified column's value is not in a dictionary for example. So something like this:
df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
                 names=['col1', 'col2', 'col3', 'col4'])
c = df.col4.value_counts(normalize=True).head(20)
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2])) # Get odd and create dict
df_filtered = filter out all rows where col4 not in values
After searching around a bit I tried using the following to filter it:
df_filtered = df[df.col4 in values]
but that unfortunately didn't work.
I've done the following to make it work for what I want to do, but it's incredibly slow for a large .csv file, so I thought there must be a way to do it that's built in to pandas:
t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]
If you want to check against the dictionary values:
df_filtered = df[df.col4.isin(values.values())]
If you want to check against the dictionary keys:
df_filtered = df[df.col4.isin(values.keys())]
As A.Kot mentioned you could use the values method of the dict to search. But the values method returns either a list or a view object depending on your version of Python.
If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict then you are using the wrong data structure.
A set will improve your lookup performance, and the check stays simple:
df_filtered = df[df.col4.isin(values)]
If you use values elsewhere, and you want to check against the keys, then you're ok because membership testing against keys is efficient.
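A small illustration of the set-based check; the column values here are toy data:

import pandas as pd

df = pd.DataFrame({'col4': ['a', 'b', 'c', 'd']})
values = {'a', 'c'}  # a set gives O(1) membership tests

df_filtered = df[df.col4.isin(values)]  # isin accepts a set directly
print(df_filtered)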
