Removing for loop - using dictionary instead of pandas - python

I have 2 lists:
customer_ids
recommendations (list of lists, each inner list holding 6000 shop_ids)
Each list in recommendations holds the shops recommended for the corresponding customer in customer_ids.
I have to keep only 20 shop_ids per customer, based on the shops in that customer's city.
Desired output:
recommendations (list of lists, each inner list holding 20 shop_ids)
customer_ids = ['1','2','3',...]
recommendations = [['110','589','865'...], ['422','378','224'...],['198','974','546'...]]
Filter: shop's city == customer's city.
To extract the city for customers and shops I have two SQL queries:
df_cust_city = pd.read_sql_query("SELECT id, city_id FROM customer_table", con)  # con is the database connection
df_shop_city = pd.read_sql_query("SELECT shop_id, city FROM shop_table", con)
Code using list
from itertools import islice

filtered_list = []
for cust_id, shop_ids in zip(customer_ids, recommendations):
    cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0]  # get customer city
    df_city_filter = df_shop_city.where(df_shop_city['city'] == cust_city).dropna()  # get all shops in customer city
    df_city_filter = df_city_filter.astype(int)
    filter_shop = df_city_filter['shop_id'].astype(str).values.tolist()  # make a list of shop_ids in customer city
    filtered = [x for x in shop_ids if x in filter_shop]  # filter recommended shop_ids based on city-filtered list
    shop_filtered = list(islice(filtered, 20))
    filtered_list.append(shop_filtered)  # create recommendation list of lists with only 20 filtered shop_ids
Code using pandas
filtered_list = []
for cust_id, shop_ids in zip(customer_ids, recommendations):
    cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0]  # get customer city
    df_city_filter = df_shop_city.where(df_shop_city['city'] == cust_city).dropna()
    recommended_shop = pd.DataFrame(shop_ids, columns=['id'])
    recommended_shop['id'] = recommended_shop['id'].astype(int)
    shop_city_filter = pd.DataFrame({'id': df_city_filter['shop_id'].astype(int)})
    shops_common = recommended_shop.merge(shop_city_filter, how='inner', on='id')
    shops_common.drop_duplicates(subset="id", keep=False, inplace=True)
    filtered = shops_common.head(20)
    shop_filtered = filtered['id'].values.tolist()
    filtered_list.append(shop_filtered)
Time taken for complete for loop to run:
using list: ~8000 seconds
using pandas: ~3000 seconds
I have to run the for loop 22 times.
Is there a way to completely get rid of the for loop? Any tips/pointers on how to achieve this so that it takes less time for 50,000 customers at once? I am trying it out with a dictionary.
df_cust_city:
id city_id
00919245 1
02220205 2
02221669 2
02223750 2
02304202 2
df_shop_city:
shop_id city
28 1
29 1
30 1
31 1
32 1

This will not get rid of the for loop, but how about grouping customers by city first?
That way, the operations leading to filter_shop only have to be performed N_cities times rather than N_customers times. In addition, computing filtered should be significantly faster if the shops of each city are held in a set, so that each membership test is O(1).
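A minimal sketch of that grouping idea, assuming the column names from the question (id / city_id on the customer side, shop_id / city on the shop side) and that the ids in customer_ids and recommendations are strings matching the database values:
from itertools import islice

# Precompute the set of shop_ids per city once: N_cities iterations instead of N_customers.
shops_by_city = (
    df_shop_city.assign(shop_id=df_shop_city['shop_id'].astype(str))
                .groupby('city')['shop_id']
                .apply(set)
                .to_dict()
)

# Map each customer id to its city once.
city_by_customer = df_cust_city.set_index('id')['city_id'].to_dict()

filtered_list = []
for cust_id, shop_ids in zip(customer_ids, recommendations):
    city_shops = shops_by_city.get(city_by_customer[cust_id], set())
    # Set membership keeps the original recommendation order and is O(1) per lookup.
    in_city = (s for s in shop_ids if s in city_shops)
    filtered_list.append(list(islice(in_city, 20)))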

Related

Poor performance filtering one dataframe with another

I have two dataframes: one holds unique records of episodic data, the other lists of events. There are multiple events per episode. I need to loop through the episode data, find all the events that correspond to each episode, and write the resultant events to a new dataframe. There are around 4,000 episodes and 20,000 events. The process is painfully slow, as for each episode I am searching 20,000 events. I am guessing there is a way to reduce the number of events searched on each loop by removing the matched ones, but I am not sure. This is my code (there is additional filtering to assist with matching):
from datetime import datetime

for idx, row in episode_df.iterrows():
    total_episodes += 1
    icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
    tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
    if len(tmp_df.index) < 1:
        empty_episodes += 1
        continue
    # Loop through temp dataframe and write all records with an admission date
    # close to icu_admission to new dataframe
    for idx_a, row_a in tmp_df.iterrows():
        admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
        difference = admission - icu_admission
        if abs(difference.total_seconds()) > 14400:
            continue
        new_df = new_df.append(row_a)
        selected_records += 1
A simplified version of the dataframes:
episode_df:
episode_no HRN name ICU_ADM
1 12345 joe date1
2 78124 ann date1
3 98374 bill date2
4 76523 lucy date3
event_df
episode_no ur admission
1 12345 date1
1 12345 date1
1 12345 date5
7 67899 date9
Not all episodes have events and only events with episodes need to be copied.
This could work:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day
df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2))
pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))
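To apply the same idea to the frames in the question, merge_asof also accepts left_by/right_by for an exact match on the patient id. One caveat: merge_asof keeps only the single nearest event per episode, whereas the original loop keeps every event inside the 4-hour window; a plain merge on HRN/ur followed by a time-difference filter reproduces the loop exactly. A hedged sketch, assuming ICU_ADM and admission have already been parsed with pd.to_datetime:
# merge_asof needs both frames sorted on the time keys
episode_df = episode_df.sort_values('ICU_ADM')
event_df = event_df.sort_values('admission')

# Exact match on patient id, nearest admission within 4 hours (14400 s)
nearest = pd.merge_asof(episode_df, event_df,
                        left_on='ICU_ADM', right_on='admission',
                        left_by='HRN', right_by='ur',
                        tolerance=pd.Timedelta(seconds=14400),
                        direction='nearest')

# To keep *all* events inside the window, as the loop does, use an ordinary merge plus a filter
merged = episode_df.merge(event_df, left_on='HRN', right_on='ur', how='inner')
new_df = merged[(merged['admission'] - merged['ICU_ADM']).abs() <= pd.Timedelta(seconds=14400)]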

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. E.g., I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'],
                  dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?
Here is one way: merge the frame with itself on meeting_id, then sort the two person columns so each pair appears in a canonical order and duplicates can be dropped.
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
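A possible follow-up to get from those pairs to each person's top 15 co-participants, sketched under the assumption that both orderings of a pair should be kept so that every person appears in the left-hand column:
pairs = df.merge(df, on='meeting_id')
pairs = pairs[pairs['person_id_x'] != pairs['person_id_y']]  # drop self-pairs, keep both orderings

top15 = (pairs.groupby('person_id_x')['person_id_y']
              .value_counts()        # co-occurrence count per partner, sorted descending within each person
              .groupby(level=0)
              .head(15))             # the 15 most frequent partners for each person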

Pandas - Change value in column based on its relationship with another column

I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import os

import pandas as pd
import sklearn
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. For example:
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in the keys list.
How can I do this?
You can find all the rows that contain a duplicate Document_ID after the first with the duplicated method. Then create a list of new ids beginning with one more than the max id. Use the loc indexing operator to overwrite the duplicate keys with the new ids.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps...

Replacing values with the first value of a group in a different table

Let's have ratings and books tables.
RATINGS
User-ID ISBN Book-Rating
244662 0373630689 7
19378 0812515595 10
238625 0441892604 9
180315 0140439072 0
242471 3548248950 0
BOOKS
ISBN Book-Title Book-Author Year-Of-Publication Publisher
0393000753 A Reckoning May Sarton 1981 W W Norton
Since many of the books have the same names and authors but different publishers and years of publication, I want to group them by title and replace ISBN in the rating table with the ISBN of the first row in the group.
More concretely, if the grouping looks like this
Book-Name ISBN
Name1 A
B
C
Name2 D
E
Name3 F
G
and the ratings like
User-ID ISBN Book-Rating
X B 3
X E 6
Y D 1
Z F 8
I want ratings to look like
User-ID ISBN Book-Rating
X A 3
X D 6
Y D 1
Z G 8
to save memory needed for pivot_table. The data set can be found here.
My attempt was along the lines of
book_rating_view = ratings.merge(books, how='left', on='ISBN').groupby(['Book-Title'])['ISBN']
ratings['ISBN'].replace(ratings['ISBN'], pd.Series([book_rating_view.get_group(key).min() for key,_ in book_rating_view]))
which doesn't seem to work. Another attempt was to construct the pivot_table directly as
isbn_vector = books.groupby(['Book-Title']).first()
utility = pd.DataFrame(0, index=explicit_ratings['User-ID'], columns=users['User-ID'])
for name, group in explicit_ratings.groupby('User-ID'):
    user_vector = pd.DataFrame(0, index=isbn_vector, columns=[name])
    for row, index in group:
        user_vector[books.groupby(['Book-Title']).get_group(row['ISBN']).first()] = row['Book-Rating']
    utility.join(user_vector)
which leads to a MemoryError, even though the reduced table should fit into memory.
Thanks for any advice!
I would want to see a bit more of the BOOKS dataframe, and above all the desired output, but how about the below? (Even though I usually don't recommend storing lists inside a dataframe...)
Say df1 = RATINGS, df2 = BOOKS:
dfm = df2.merge(df1, on='ISBN').groupby('Book-Title').apply(list)
dfm['Book-Rating'] = dfm['Book-Rating'].map(sum)
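As an alternative, a minimal sketch of the replacement the question describes, assuming the column names shown above: take the first ISBN of each Book-Title group, build an ISBN-to-canonical-ISBN mapping, and rewrite the ratings table with it.
# First ISBN per title, aligned row-by-row with the books table.
first_isbn = books.groupby('Book-Title')['ISBN'].transform('first')
isbn_map = pd.Series(first_isbn.values, index=books['ISBN'].values).to_dict()

# Replace each rated ISBN with the canonical one; leave ISBNs missing from books untouched.
ratings['ISBN'] = ratings['ISBN'].map(isbn_map).fillna(ratings['ISBN'])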

Pandas write variable number of new rows from list in Series

I'm using Pandas as a way to write data from Selenium.
Two example results from a search box ac_results on a webpage:
#Search for product_id = "01"
ac_results = "Orange (10)"
#Search for product_id = "02"
ac_result = ["Banana (10)", "Banana (20)", "Banana (30)"]
Orange returns only one price ($10) while Banana returns a variable number of prices from different vendors, in this example three prices ($10), ($20), ($30).
The code uses regex via re.findall to grab each price and put them into a list. The code works fine as long as re.findall finds only one list item, as for Oranges.
Problem is when there are a variable amount of prices, as when searching for Bananas. I would like to create a new row for each stated price, and the rows should also include product_id and item_name.
Current output:
product_id prices item_name
01 10 Orange
02 [u'10', u'20', u'30'] Banana
Desired output:
product_id prices item_name
01 10 Orange
02 10 Banana
02 20 Banana
02 30 Banana
Current code:
import re

import pandas as pd

df = pd.read_csv("product_id.csv")

def crawl(product_id):
    # Enter search input here, omitted
    # Getting results:
    search_result = driver.find_element_by_class_name("ac_results")
    item_name = re.match("^.*(?=(\())", search_result.text).group().encode("utf-8")
    prices = re.findall("((?<=\()[0-9]*)", search_result.text)
    return pd.Series([prices, item_name])

df[["prices", "item_name"]] = df["product_id"].apply(crawl)
df.to_csv("write.csv", index=False)
FYI: A workable solution with the csv module, but I want to use Pandas.
import csv

with open("write.csv", "a") as data_write:
    wr_data = csv.writer(data_write, delimiter=",")
    for price in prices:  # <-- This is the important part!
        wr_data.writerow([product_id, price, item_name])
# initializing here for reproducibility
pids = ['01','02']
prices = [10, [u'10', u'20', u'30']]
names = ['Orange','Banana']
df = pd.DataFrame({"product_id": pids, "prices": prices, "item_name": names})
The following snippet should work after your apply(crawl).
# convert all of the prices to lists (even if they only have one element)
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
# Create a new dataframe which splits the lists into separate columns.
# Then flatten using stack. The explicit MultiIndex allows us to keep
# the item_name and product_id associated with each price.
idx = pd.MultiIndex.from_tuples(zip(*[df['item_name'], df['product_id']]),
                                names=['item_name', 'product_id'])
df2 = pd.DataFrame(df.prices.tolist(), index=idx).stack()
# drop the hierarchical index and select columns of interest
df2 = df2.reset_index()[['product_id', 0, 'item_name']]
# rename back to prices
df2.columns = ['product_id', 'prices', 'item_name']
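As a side note, on pandas 0.25 or newer the same flattening can probably be done in one step with DataFrame.explode (a sketch, reusing the df built above with prices already normalised to lists):
df.prices = df.prices.apply(lambda x: x if isinstance(x, list) else [x])
df2 = df.explode('prices')[['product_id', 'prices', 'item_name']]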
I was not able to run your code (probably due to missing inputs), but you can probably transform your prices into a list of dicts and then build a DataFrame from there:
d = [{"price":10, "product_id":2, "item_name":"banana"},
{"price":20, "product_id":2, "item_name":"banana"},
{"price":10, "product_id":1, "item_name":"orange"}]
df = pd.DataFrame(d)
Then df is:
item_name price product_id
0 banana 10 2
1 banana 20 2
2 orange 10 1
