I have a pandas DataFrame df of the form:
user_id time url
4 20140502 'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
7 20140307 'w.lejournal.fr/palmares/palmares-immobilier/'
10 20140604 'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
etc...
I want to use the groupby function to group by user and then compute some statistics on the words appearing in each user's URLs, for example, how many times the word 'actualite' appears in a user's URLs.
For now, my code is:
def my_stat_function(temp_set):
    res = 0
    for (u, t) in temp_set:
        if 'actualite' in u and t > 20140101:
            res += 1
    return res

group_user = df.groupby('user_id')
output_list = []
for (i, group) in group_user:
    dfg = pandas.DataFrame(group)
    temp_set = [tuple(x) for x in dfg[['url', 'time']].values]
    temp_var = my_stat_function(temp_set)
    output_list.append([i] + [temp_var])
outputDf = pandas.DataFrame(data=output_list, columns=['user_id', 'stat'])
My question is: can I avoid iterating group by group to apply my_stat_function? Is there something faster, maybe using apply? I would really like something more "pandas-ish" and faster.
Thank you for your help.
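Edit: is something along these lines (untested, using only the columns above) the kind of vectorized approach I should be aiming for?
import pandas as pd

# count, per user, the rows whose URL contains 'actualite' and whose time is after 20140101
mask = df['url'].str.contains('actualite') & (df['time'] > 20140101)
outputDf = mask.groupby(df['user_id']).sum().reset_index(name='stat')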
I have a large database (circa 9m records) in the form:
user id, product id, qty
I want to understand the frequency with which owners of one product own every other product.
I've attempted to do this with list comprehensions:
cross_owners = dict()
for title in sampled_list:
    cross_owners[title] = dict()
    for title_2 in sampled_list:
        # ownership rows for each product
        a = [x for x in owns_list if x[1] == title]
        b = [x for x in owns_list if x[1] == title_2]
        # users who own title_2
        b_users = set(x[0] for x in b)
        # users who own both products
        c = [x for x in a if x[0] in b_users]
        if len(c) > 0:
            print(title)
            print(title_2)
            print(len(c))
            cross_owners[title][title_2] = len(c)
This works, but it is slow, and there are roughly 50k products, so there are a lot of possible pairings.
I've a sense that I should be using pandas or something more sophisticated, but I'm struggling to see how I should implement that.
To improve the performance of the code, use pandas for the data preprocessing and manipulation. You can create a pandas DataFrame with columns "user_id", "product_id", and "qty", then use the groupby method to group by "product_id" and "user_id" and count the size of each group. This way you have a pre-processed data structure that lets you perform the analysis more efficiently.
Try this example:
import pandas as pd

df = pd.DataFrame(owns_list, columns=["user_id", "product_id", "qty"])
grouped = df.groupby(["product_id", "user_id"]).size().reset_index(name="frequency")

cross_owners = dict()
for title in sampled_list:
    cross_owners[title] = dict()
    product_owners = grouped[grouped["product_id"] == title]["user_id"].tolist()
    for title_2 in sampled_list:
        owners_of_title_2 = grouped[grouped["product_id"] == title_2]["user_id"].tolist()
        shared_owners = set(product_owners) & set(owners_of_title_2)
        cross_owners[title][title_2] = len(shared_owners)
This way, you can obtain the desired results more efficiently than with the original list comprehension approach.
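If the nested loops are still too slow with ~50k products, here is a rough, untested sketch of a fully vectorized alternative: self-merge the deduplicated (user, product) table on user_id and count product pairs with groupby. Be aware that with 9m records the intermediate pair table and the final 50k x 50k matrix can get large.
import pandas as pd

df = pd.DataFrame(owns_list, columns=["user_id", "product_id", "qty"])

# one row per (user, product)
owns = df[["user_id", "product_id"]].drop_duplicates()

# self-merge on user_id: one row per pair of products a user owns together
pairs = owns.merge(owns, on="user_id", suffixes=("", "_other"))

# count the users behind each (product, other product) pair
cross_counts = (pairs.groupby(["product_id", "product_id_other"])
                     .size()
                     .unstack(fill_value=0))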
I have a DataFrame like this:
I have this function:
def df_filter(df, col_key1, col_key2, col_key3, a, b, c):
    df1 = df[(df[col_key1] == a) & (df[col_key2] == b) & (df[col_key3] == c)]
    df1 = df1.groupby(by=['ID'])['satisfaction'].mean()
    return df1
result_1 = df_filter(df, 'key1', 'key2', 'key3', 1, 0, 0)
result_2 = df_filter(df, 'key1', 'key2', 'key3', 1, 1, 0)
result_3 = df_filter(df, 'key1', 'key2', 'key3', 1, 1, 1)
result_4 = df_filter(df, 'key1', 'key2', 'key3', 1, 0, 1)
result_n = df_filter(df, 'key1', 'key2', 'key3', 0, 0, 1)
How can I write a loop to get the results for all possible combinations of key1, key2, and key3 using this function?
Thanks!
Here's a minimal change that gets the job done. To get a list of results, you can add the statement from itertools import product and do:
results = [df_filter(df,'key1','key2','key3',*tup)
for tup in product([0,1],repeat=3)]
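If you also want to know which combination produced each result, a small variation of the same idea keys the results by the tuple:
from itertools import product

results = {tup: df_filter(df, 'key1', 'key2', 'key3', *tup)
           for tup in product([0, 1], repeat=3)}

# e.g. results[(1, 0, 0)] corresponds to result_1 above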
I'm trying to execute a filter in Python, but I'm stuck at the end, when I need to group the result.
I have a json, which is this one: https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3
What I'm trying to do with it is filter "apiFamily", searching for "payments-ted" or "payments-doc". If I find a match, I then must verify that the column "ApiEndpoints" has at least two endpoints in it.
My ultimate goal is to append both "apiFamily" values in one row and all the "ApiEndpoints" in another row. Something like this:
"ApiFamily": [
"payments-ted",
"payments-doc"
]
"ApiEndpoints": [
"/ted",
"/electronic-ted",
"/phone-ted",
"/banking-ted",
"/shared-automated-teller-machines-ted"
"/doc",
"/electronic-doc",
"/phone-doc",
"/banking-doc",
"/shared-automated-teller-machines-doc"
]
I have managed to achieve partial success, searching on a single condition:
#ApiFilter = df[(df['ApiFamily'] == 'payments-pix') & (rolesFilter['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
This obviously extracts only payments-pix which contains two or more ApiEndpoints.
Now, when I try to check both conditions with this:
#ApiFilter = df[((df['ApiFamily'] == 'payments-ted') | (df['ApiFamily'] == 'payments-doc')) & (df['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
I will get the correct rows, but it will obviously list the brand twice.
When I try to groupby the result, all I get is this:
TypeError: unhashable type: 'Series'
My doubt is: how do I avoid this error? I assume I must do some sort of conversion of the columns that have multiple items inside a row, but what is the best method?
I have tried this solution; it is kind of roundabout, but it gets the final result you want.
First get the data into a dictionary object
>>> import requests
>>> url = 'https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3'
>>> response = requests.get(url)
>>> d = response.json()
We just need to pull the ApiFamily and ApiEndpoints into a new dictionary:
>>> dNew = {}
>>> for item in d['data']:
...     if item['ApiFamily'] in ['payments-ted', 'payments-doc']:
...         dNew[item['ApiFamily']] = item['ApiEndpoints']
Change dNew into a dataframe and transpose it.
>>> import pandas as pd
>>> df1 = pd.DataFrame(dNew)
>>> df1 = df1.applymap(lambda x: '\'' + x + '\'')
>>> df2 = df1.transpose()
At this stage df2 looks like this -
>>> print(df2)
0 1 2 3 \
payments-ted '/ted' '/electronic-ted' '/phone-ted' '/banking-ted'
payments-doc '/doc' '/electronic-doc' '/phone-doc' '/banking-doc'
4
payments-ted '/shared-automated-teller-machines-ted'
payments-doc '/shared-automated-teller-machines-doc'
Now join all the columns using the comma symbol
>>> df2['final'] = df2.apply( ','.join , axis=1)
Finally
>>> df2 = df2[['final']]
>>> print(df2)
final
payments-ted '/ted','/electronic-ted','/phone-ted','/bankin...
payments-doc '/doc','/electronic-doc','/phone-doc','/bankin...
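For completeness, here is a more direct pandas sketch (untested against the full JSON, and assuming df is already loaded with an 'ApiFamily' column and a list-valued 'ApiEndpoints' column, as in the question's own filters): filter first, then collapse the matching rows into single list-valued cells, which sidesteps the unhashable-list groupby error entirely.
import pandas as pd

mask = (df['ApiFamily'].isin(['payments-ted', 'payments-doc'])
        & (df['ApiEndpoints'].apply(len) >= 2))
filtered = df[mask]

# one row holding both families and all of their endpoints
result = pd.DataFrame({
    'ApiFamily': [filtered['ApiFamily'].tolist()],
    'ApiEndpoints': [filtered['ApiEndpoints'].explode().tolist()],
})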
I have the below script, which aims to create a "merge based on a partial match" functionality, since this is not possible with the normal .merge() function to the best of my knowledge.
The below works and returns the desired result, but unfortunately it's incredibly slow, to the point that it's almost unusable where I need it.
Been looking around at other Stack Overflow posts that contain similar problems, but haven't yet been able to find a faster solution.
Any thoughts on how this could be accomplished would be appreciated!
import pandas as pd

df1 = pd.DataFrame(['https://wwww.example.com/hi', 'https://wwww.example.com/tri',
                    'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi'],
                   columns=['pages'])
df2 = pd.DataFrame(['hi', 'bi', 'geo'],
                   columns=['ngrams'])
def join_on_partial_match(full_values=None, matching_criteria=None):
    # Changing columns name with index number
    full_values.columns.values[0] = "full"
    matching_criteria.columns.values[0] = "ngram_match"
    # Creating matching column so all rows match on join
    full_values['join'] = 1
    matching_criteria['join'] = 1
    dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)
    # Dropping the 'join' column we created to join the 2 tables
    matching_criteria = matching_criteria.drop('join', axis=1)
    # identifying matching and returning bool values based on whether match exists
    dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)
    # filtering dataset to only 'True' rows
    final = dfFull[dfFull['match'] == True]
    final = final.drop('match', axis=1)
    return final
join = join_on_partial_match(full_values=df1,matching_criteria=df2)
print(join)
>> full ngram_match
0 https://wwww.example.com/hi hi
7 https://wwww.example.com/bi bi
9 https://wwww.example.com/hihibi hi
10 https://wwww.example.com/hihibi bi
For anyone who is interested, I ended up figuring out two ways to do this:
1. The first returns all matches (i.e., it duplicates the input value and pairs it with every partial match).
2. The second returns only the first match.
Both are extremely fast. I just ended up using a pretty simple masking script:
import time
import pandas as pd

def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_all_matches_returned() function takes two Series objects and returns a dataframe with all matching values (duplicating the full value).
    Args:
        full_values = None: This is the series that contains the full values for the matching pair.
        matching_criteria = None: This is the series that contains the partial values for the matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_join1 = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full")
    full_values = full_values.drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        output.append(df_copy)
    final = pd.concat(output)
    end_join1 = (time.time() - start_join1)
    end_join1 = str(round(end_join1, 2))
    len_join1 = len(final)
    return final
def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_first_match_returned() function takes two Series objects and returns a dataframe with only the first matching value.
    Args:
        full_values = None: This is the series that contains the full values for the matching pair.
        matching_criteria = None: This is the series that contains the partial values for the matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_singlejoin = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full").drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        output.append(df_copy)
    final = pd.concat(output)
    # leaves us with only the 1st match for each URL
    final = final.drop_duplicates(subset=['full'])
    end_singlejoin = (time.time() - start_singlejoin)
    end_singlejoin = str(round(end_singlejoin, 2))
    len_singlejoin = len(final)
    return final
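For reference, calling either function with the df1 / df2 from the question would look something like this (the functions expect Series, so pass the single columns):
all_matches = partial_match_join_all_matches_returned(
    full_values=df1['pages'], matching_criteria=df2['ngrams'])
first_matches = partial_match_join_first_match_returned(
    full_values=df1['pages'], matching_criteria=df2['ngrams'])
print(all_matches)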
I have a reproducible example, toy dataframe:
import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'], 'email': ['email#gmail.com', 'othermail#yahoo.com'], 'other_column': ['yes', 'no']})
print(df)
my_customers email other_column
0 John email#gmail.com yes
1 Foo othermail#yahoo.com no
And I apply() a function to the rows, creating a new column inside the function:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
I then apply the function to the df, and we can see that the order has been changed. The columns are now in alphabetical order. Is there any way to avoid this? I would like to keep it in the original order.
df = df.apply(func, axis = 1)
print(df)
email my_customers new_column other_column
0 email#gmail.com John Hello yes
1 othermail#yahoo.com Foo NaN no
Edited for clarification - the above code was too simple
Input:
df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email#gmail.com', 'othermail#yahoo.com'],
                   'api_status': ['data found', 'no data found'],
                   'api_response': ['huge json', 'huge json']})
my_customers email api_status api_response
0 John email#gmail.com data found huge json
1 Foo othermail#yahoo.com no data found huge json
Parsing the api_response, I need to create many new columns in the DF:
def api_parse(row):
    # if we have response data
    if row['api_status'] == 'data found':
        # get response for parsing
        response_data = row['api_response']

        """Let's get associated URLs first"""
        # if there's a URL section in the response
        if 'urls' in response_data.keys():
            # get all associated URLs into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls

        """Get a list of jobs"""
        if 'jobs' in response_data.keys():
            # get all associated jobs and organizations into a list
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1

        """Get a list of education"""
        if 'educations' in response_data.keys():
            # get all degrees into a list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1

        """Get a list of social profiles from the URLs we parsed earlier"""
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row

    elif row['api_status'] == 'no data found':
        # do nothing
        return row
Expected output:
my_customers email api_status api_response job_1 job_2 \
0 John email#gmail.com data found huge json xyz xyz2
1 Foo othermail#yahoo.com no data found huge json nan nan
education_1 facebook other api info
0 foo profile1 etc
1 nan nan nan
You could adjust the order of columns in your DataFrame after running the apply function. For example:
df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]
To reduce the amount of duplication (i.e. by having to retype all column names), you could get the existing set of columns before calling the apply function:
columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
Update based on the author's edits to the original question: whilst I'm not sure the data structure chosen (storing API results in a DataFrame) is the best option, one simple solution could be to extract the new columns after calling the apply function.
# Store the existing columns before calling apply
existing_columns = list(df.columns)
df = df.apply(func, axis = 1)
all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]
df = df[existing_columns + new_columns]
For performance optimisations, you could store the existing columns in a set instead of a list which will yield lookups in constant time due to the hashed nature of a set data structure in Python. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns).
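In code, that change could look roughly like this (keeping an ordered list as well, since a set has no order to select columns by):
existing_columns = list(df.columns)
existing_set = set(existing_columns)        # O(1) membership checks
df = df.apply(func, axis = 1)
new_columns = [column for column in df.columns if column not in existing_set]
df = df[existing_columns + new_columns]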
Finally, as #Parfait very kindly points out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of df = df[existing_columns + new_columns] will make the warnings disappear:
new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
That occurs because you don't assign a value to the new column when row['other_column'] != 'yes'; the returned rows then have different sets of labels, so pandas unions and sorts them, which is why the columns come back in alphabetical order. Just try this:
def func(row):
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
        return row
    else:
        row['new_column'] = ''
        return row

df = df.apply(func, axis = 1)
You can set row['new_column'] to whatever you like for the 'no' case; I just left it blank.
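As an aside (not part of the fix above), if the function really only sets this one value, a vectorized alternative avoids apply and the column-order issue entirely:
import numpy as np

# adds 'new_column' at the end; the existing column order is untouched
df['new_column'] = np.where(df['other_column'] == 'yes', 'Hello', '')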