I have the following string extracted from a pandas column (it's a sports example):
unpack ="{'TB': [['Brady', 'Godwin'], ['2023-RD1', '2023-RD4']], 'KC': [['Mahomes'], ['2023-RD2']]}"
To unpack the string I use the following:
from ast import literal_eval
t_dict = literal_eval(unpack)
print(t_dict)
which gives me:
{'TB': [['Brady', 'Godwin'], ['2023-RD1', '2023-RD4']], 'KC': [['Mahomes'], ['2023-RD2']]}
I am now trying to extract all of these keys / values to variables/lists. My expected output is:
team1 = 'TB'
team2 = 'KC'
team1_trades_players = ['Brady', 'Godwin']
team1_trades_picks = ['2023-RD1', '2023-RD4']
team2_trades_players = ['Mahomes']
team2_trades_picks = ['2023-RD2']
I have tried the following, but I am unsure how to send the first iteration to team1 and the second iteration to team2:
#extracting team for each pick
for t in t_dict:
    print(t)
Gives me:
TB
KC
And then for the values I can correctly print them, but I am unsure how to send them back to the lists:
#extracting lists for each key:
for traded in t_dict.values():
    #extracting the players traded for each team
    for players in traded[0]:
        print(players)
    #extracting picks for each team
    for picks in traded[1]:
        print(picks)
Produces:
Brady
Godwin
2023-RD1
2023-RD4
Mahomes
2023-RD2
I think I am close but missing the final step of sending these back to their variables/lists. Any help would be greatly appreciated! Thanks!
If the number of teams is known beforehand it is pretty simple:
team1, team2 = t_dict.keys()
team1_trades_players, team1_trades_picks = t_dict[team1]
team2_trades_players, team2_trades_picks = t_dict[team2]
If the number of teams is not known beforehand, I would recommend just using t_dict.
I would recommend putting everything in a nested dict, which you can then access easily:
t_dict = {'TB': [['Brady', 'Godwin'], ['2023-RD1', '2023-RD4']], 'KC': [['Mahomes'], ['2023-RD2']]}
t_nested = {k:{"players": v[0], "picks": v[1]} for k,v in t_dict.items()}
team1 = list(t_nested.keys())[0]
team2 = list(t_nested.keys())[1]
team1_trades_players = t_nested[team1]['players']
team1_trades_picks = t_nested[team1]['picks']
team2_trades_players = t_nested[team2]['players']
team2_trades_picks = t_nested[team2]['picks']
But for most use cases it would probably be better to just keep the nested dict structure and use it directly, instead of creating all these variables, which make everything less dynamic.
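For illustration, here is a minimal sketch of using the nested dict directly (the loop and lookups are just examples):
# iterate over the nested dict instead of creating separate variables
for team, trades in t_nested.items():
    print(team, trades['players'], trades['picks'])
# or look up a specific team directly
print(t_nested['TB']['players'])  # ['Brady', 'Godwin']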
I have a csv file with roughly 50K rows of search engine queries. Some of the search queries are the same, just in a different word order, for example "query A this is " and "this is query A".
I've tested using fuzzywuzzy's token_sort_ratio function to find matching word order queries, which works well, however I'm struggling with the runtime of the nested loop, and looking for optimisation tips.
Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?
Code below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()
data = []
for kw_t1 in tqdm(table1):
    for kw_t2 in table2:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            data += [[kw_t1, kw_t2, score]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Any advice would be appreciated.
Thanks!
Since what you are looking for are strings consisting of identical words (just not necessarily in the same order), there is no need to use fuzzy matching at all. You can instead use collections.Counter to build a word-frequency dict for each string, and group the strings in a dict of lists keyed by those frequency dicts. You can then output the sub-lists whose lengths are greater than 1.
Since dicts are not hashable, you can make them keys of a dict by converting them to frozensets of tuples of key-value pairs first.
This improves the time complexity from the O(n^2) of your code to O(n), while also avoiding the overhead of fuzzy matching.
from collections import Counter
matches = {}
for query in df['keyword']:
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)
data = [match for match in matches.values() if len(match) > 1]
Demo: https://replit.com/#blhsing/WiseAfraidBrackets
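If you still want the pairwise DataFrame that your original code produced, here is a sketch building it from the matches dict above (column names copied from your code):
from itertools import permutations
rows = []
for group in matches.values():
    if len(group) > 1:
        # every ordered pair of queries that share the same word counts
        for q1, q2 in permutations(group, 2):
            rows.append([q1, q2, 100])
data_df = pd.DataFrame(rows, columns=['query', 'queryComparison', 'score'])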
I don't think you need fuzzywuzzy here: you are just checking for equality (score == 100) of the sorted queries, but with token_sort_ratio you are sorting the queries over and over. So I suggest you:
create a "base" list and a "sorted-elements" one
iterate on the elements.
This will still be O(n^2), but you will be sorting 50_000 strings instead of 2_500_000_000!
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table_base = df['keyword'].to_list()
table_sorted = [sorted(kw.split()) for kw in table_base]  # sort the words of each query, mirroring token_sort_ratio
data = []
ln = len(table_base)
for i in range(ln-1):
    for j in range(i+1, ln):
        if table_sorted[i] == table_sorted[j]:
            data += [[table_base[i], table_base[j], 100]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Using pandas apply usually works faster:
kw_t2 = df['keyword'].to_list()
def compare(kw_t1):
    found_duplicates = []
    # compare the current keyword against every other keyword in the column
    for kw in kw_t2:
        score = fuzz.token_sort_ratio(kw_t1, kw)
        if score == 100 and kw_t1 != kw:
            found_duplicates.append(kw)
    return found_duplicates
df["duplicates"] = df['keyword'].apply(compare)
Hi, this is part of my code for a biology project:
import pandas as pd
from scipy import stats
# choosing and loading the file:
df = pd.read_csv('Dafniyot_Data.csv', delimiter=',')
#grouping data by C/I groups:
CII = df[df['group'].str.contains('CII')]
CCI = df[df['group'].str.contains('CCI')]
CCC = df[df['group'].str.contains('CCC')]
III = df[df['group'].str.contains('III')]
CIC = df[df['group'].str.contains('CIC')]
ICC = df[df['group'].str.contains('ICC')]
IIC = df[df['group'].str.contains('IIC')]
ICI = df[df['group'].str.contains('ICI')]
#creating a dictonary of the groups:
dict = {'CII':CII, 'CCI':CCI, 'CCC':CCC,'III':III,'CIC':CIC,'ICC':ICC,'IIC':IIC,'ICI':ICI}
#T test
#FERTUNITY
#using ttest for checking FERTUNITY - grandmaternal(F0)
t_F0a = stats.ttest_ind(CCC['N_offspring'],ICC['N_offspring'],nan_policy='omit')
t_F0b = stats.ttest_ind(CCI['N_offspring'],ICI['N_offspring'],nan_policy='omit')
t_F0c = stats.ttest_ind(IIC['N_offspring'],CIC['N_offspring'],nan_policy='omit')
t_F0d = stats.ttest_ind(CCI['N_offspring'],III['N_offspring'],nan_policy='omit')
t_F0 = {'FERTUNITY - grandmaternal(F0)':[t_F0a,t_F0b,t_F0c,t_F0d]}
I need to repeat the t-test part 6 more times, changing either the groups (CCC, etc.) or the column from the df ('N_offspring', 'survival'), which takes a lot of lines in the project.
I'm trying to find a way to still get the dictionary of each group in the end:
t_F0 = {'FERTUNITY - grandmaternal(F0)':[t_F0a,t_F0b,t_F0c,t_F0d]}
because it's very useful for me later, but in a less repetitive way with fewer lines.
Use itertools.product to generate all the keys, and a dict comprehension to generate the values:
from itertools import product
keys = [''.join(items) for items in product("CI", repeat=3)]
the_dict = { key: df[df['group'].str.contains(key)] for key in keys }
Similarly, you can generate the latter part of your test keys:
half_keys = [''.join(items) for items in product("CI", repeat=2)]
t_F0 = {
    'FERTUNITY - grandmaternal(F0)': [
        stats.ttest_ind(
            the_dict[f"C{half_key}"]['N_offspring'],
            the_dict[f"I{half_key}"]['N_offspring'],
            nan_policy='omit'
        ) for half_key in half_keys
    ],
}
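You mention also needing to repeat this for other columns; as a sketch under that assumption (the 'survival' column name is taken from your description), the same idea extends to several measures:
measures = ['N_offspring', 'survival']  # assumed column names, adjust to your df
t_results = {
    f'{measure} - grandmaternal(F0)': [
        stats.ttest_ind(
            the_dict[f"C{half_key}"][measure],
            the_dict[f"I{half_key}"][measure],
            nan_policy='omit'
        ) for half_key in half_keys
    ] for measure in measures
}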
As an aside, you should not use dict as a variable name: it already has a meaning (the type of dict objects).
As a second aside, this deals with the literal question of how to DRY up creating a dictionary. However, do consider what Chris said in comments; this may be an XY problem.
Update: this cannot be solved 100%, since the number of merchants each user must receive is different, so some users might end up getting some of the same merchants as before. However, is it possible to let them get the same merchants only if there are no other different merchants available?
I have the following excel file:
What I would like to do is redistribute the merchants (Mer_id) so that each user (Origin_pool) gets the same number of merchants as before, but a different set of merchants. For example, after the redistribution, Nick will receive 3 Mer_id's, but not 30303, 101020, 220340. Anna will receive 4 merchants, but not 23401230, 310231, 2030230, 2310505, and so on. Of course, one merchant cannot be assigned to more than one person.
What I have done so far is find the total number of merchants each user must receive and randomly give them a mer_id that was not previously assigned to them. After I find a different mer_id, I remove it from the list so the other users won't receive the same merchant:
import pandas as pd
import numpy as np
import random
df=pd.read_excel('dup_check_origin.xlsx')
dfcounts=df.groupby(['Origin_pool']).size().reset_index(name='counts')
Origin_pool=list(dfcounts['Origin_pool'])
counts=list(dfcounts['counts'])
dict_counts = dict(zip(Origin_pool, counts))
dest_name=[]
dest_mer=[]
for pool in Origin_pool:
    pername = 0
    #for j in range(df.shape[0]):
    while pername <= dict_counts[pool]:
        rn = random.randint(0, df.shape[0]-1)
        rid = df['Mer_id'].iloc[rn]
        if (pool != df['Origin_pool'].iloc[rn]):
            #new_dict[pool]=rid
            pername += 1
            dest_name.append(pool)
            dest_mer.append(rid)
            df = df.drop(df.loc[df['Mer_id'] == rid].index[0])
But it is not efficient at all, given the fact that in the future I might have more data than 18 rows.
Is there any library that does this or a way to make it more efficient?
Several days after your question, but I think this code is bulletproof.
You can wrap the entire code in a function or class.
I only created one function, a recursive one, to handle the leftovers.
There are 3 lists, initialized at the beginning of the code:
pairs -> your final pool list
reshuffle -> the randomly generated pairs that already appear as pool pairs in the excel
still -> handles the repeated pool pairs inside the pullpush function
The pullpush function comes first, because it will be called in different situations.
The first part of the program is a random algorithm that makes pairs from mer_id (merchants) and origin_pool (poolers).
If a pair is not in the excel then it goes to the pairs list, otherwise it goes to the reshuffle list.
Depending on the characteristics of reshuffle, either another random algorithm is called or it is processed by the pullpush function.
If you execute the code once, as it is, and print(pairs), you may find a list with 15, 14, or some other number of pool pairs fewer than 18.
Then, if you print(reshuffle), you will see the rest of the pairs needed to make 18.
To get the full 18 matchings in the pairs variable you must run:
pullpush(reshuffle).
The output here was obtained running the code followed by:
pullpush(reshuffle)
If you want to ensure that mer_id and origin_pool do not repeat for 3 rounds, you can load 2 other excels and split them into oldpair2 and oldpair3.
[[8348201, 'Anna'], [53256236, 'Anna'], [9295, 'Anna'], [54240, 'Anna'], [30303, 'Marios'], [101020, 'Marios'], [959295, 'Marios'], [2030230, 'George'], [310231, 'George'], [23401230, 'George'], [2341134, 'Nick'], [178345, 'Marios'], [220340, 'Marios'], [737635, 'George'], [[2030230, 'George'], [928958, 'Nick']], [[5560503, 'George'], [34646, 'Nick']]]
The code:
import pandas as pd
import random
df = pd.read_excel('dup_check_origin.xlsx')
oldpair = df.values.tolist()  # previous pooling pairs, to check against
merchants = df['Mer_id'].values.tolist()  # convert Mer_id to a list
poolers = df['Origin_pool'].values.tolist()  # convert Origin_pool to a list
random.shuffle(merchants)  # 1st step shuffle
pairs = []  # final pairs list
reshuffle = []  # try again
still = []  # same as reshuffle, used inside pullpush
def pullpush(repetition):
    replacement = repetition  # reshuffle transfer
    for re in range(len(replacement)):
        replace = next(r for r in pairs if r not in replacement)
        repair = [[replace[0], replacement[re][1]],
                  [replacement[re][0], replace[1]]]
        if repair not in oldpair:
            iReplace = pairs.index(replace)  # get index of pair
            pairs.append(repair)
            del pairs[iReplace]  # remove from pairs
        else:
            still.append(repair)
    if still:
        pullpush(still)  # recursive call
for p in range(len(poolers)):  # avoid more merchants than poolers
    pair = [merchants[p], poolers[p]]
    if pair not in oldpair:
        pairs.append(pair)
    else:
        reshuffle.append(pair)
if reshuffle:
    merchants_bis = [x[0] for x in reshuffle]
    poolers_bis = [x[1] for x in reshuffle]
    if len(reshuffle) > 2:  # shuffle needs 3 or more elements
        random.shuffle(merchants_bis)
        reshuffle = []  # clean before the loop
        for n in range(len(poolers_bis)):
            new_pair = [merchants_bis[n], poolers_bis[n]]
            if new_pair not in oldpair:
                pairs.append(new_pair)
            else:
                reshuffle.append(new_pair)
        if len(reshuffle) == len(poolers_bis):  # infinite loop
            pullpush(reshuffle)
    # double pairs and different poolers
    elif (len(reshuffle) == 2 and not [i for i in reshuffle[0] if i in reshuffle[1]]):
        merchants_bis = [merchants_bis[1], merchants_bis[0]]
        new_pair = [[merchants_bis[1], poolers_bis[0]],
                    [merchants_bis[0], poolers_bis[1]]]
        if new_pair not in oldpair:
            pairs.append(new_pair)
        else:
            reshuffle.append(new_pair)
            pullpush(reshuffle)
    else:  # one left or same poolers
        pullpush(reshuffle)
My solution uses dictionaries and lists. I print the result, but you can create a new dataframe with it.
from random import shuffle
import pandas as pd
df = pd.read_excel('dup_check_origin.xlsx')
dpool = {}
mers = list(df.Mer_id.unique())
shuffle(mers)
for pool in df.Origin_pool.unique():
    dpool[pool] = list(df.Mer_id[df.Origin_pool == pool])
for key in dpool.keys():
    inmers = dpool[key]
    cnt = len(inmers)
    new = [x for x in mers if x not in inmers][:cnt]
    mers = [x for x in mers if x not in new]
    print(key, new)
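As a sketch of the dataframe option mentioned above, you could collect the assignments into rows instead of printing them (this replaces the second loop; the Origin_pool and Mer_id column names are reused from the question):
rows = []
for key in dpool.keys():
    inmers = dpool[key]
    cnt = len(inmers)
    new = [x for x in mers if x not in inmers][:cnt]
    mers = [x for x in mers if x not in new]
    rows.extend([key, mer_id] for mer_id in new)
new_df = pd.DataFrame(rows, columns=['Origin_pool', 'Mer_id'])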
I have been working on a no-sql solution to naming a list of N postcodes using a national list of postcodes. So far I have my reference dictionary for the state of NSW in the form:
{'Belowra': 2545, 'Yambulla': 2550, 'Bingie': 2537, ... [n=4700]
My function uses this to look up the names of a postcode:
def look_up_sub(pc, settings):
    output = []
    for suburb, postcode in postcode_dict.items():
        if postcode == pc and settings == 'random':  # select match at random
            print(suburb)  # remove later
            output.append(suburb)
            break  # stop searching for matches
        elif postcode == pc and settings == 'all':  # print all possible names for postcode
            print(suburb)  # remove later
    return output
N = [2000, 2020, 2120, 2019]
for i in N:
    look_up_sub(i, 'random')
>>>Millers Point
>>>Mascot
>>>Westleigh
>>>Banksmeadow
While ok for small lists, when N is sufficiently large this inefficient approach is very slow. I have been thinking about how I could use numpy arrays to speed this up considerably and am looking for faster ways to approach this.
Your data structure is backwards: it should map postcode to suburb. Then, when you pass it a pc, you get a list of suburbs back, and you either select from that list randomly or print all of it.
Here is what you should do. First, reverse your dict:
from collections import defaultdict
post_to_burb = defaultdict(list)
for suburb, postcode in postcode_dict.items():
    post_to_burb[postcode].append(suburb)
Now, your function should do something like:
import random
def look_up_sub(pc, settings):
    output = []
    if settings == "random":
        output.append(random.choice(post_to_burb[pc]))
    elif settings == 'all':
        output.extend(post_to_burb[pc])
    return output
Using numpy here would be unwieldy, especially since you are working with strings. You might get some marginal improvement in runtime, but your overall algorithm would still be linear time. Now it is constant time, once you've set up your post_to_burb dict.
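For illustration, a small usage sketch with the N list from the question, reusing the post_to_burb dict built above:
N = [2000, 2020, 2120, 2019]
for pc in N:
    print(look_up_sub(pc, 'random'))  # one randomly chosen suburb per postcode, returned as a list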
Build a dict from postal code to suburbs:
from collections import defaultdict
code_to_urbs = defaultdict(list)
for suburb, postcode in postcode_dict.items():
    code_to_urbs[postcode].append(suburb)
With that done, you can just write code_to_urbs[postal_code].
I have a ticker that grabs current information for multiple elements and adds it to a list in the format trade_list.append([[trade_id, results]]).
Say we're tracking trade_id's 4555, 5555, 23232. The trade_list will keep ticking away, adding their results to the list; I then want to find the averages of their results individually.
The code works as such:
# Find accounts
for a in accounts:
    # find open trades of accounts
    for t in range(len(trades)):
        # do some math
        trades_list.append([trade_id, result])
        avernum = 0
        average = []
        for r in range(len(trades_list)):
            average.append(trades_list[r][1])  # This is the value attached to the trade_id
            avernum += 1
        results = float(sum(average) / avernum)
        results_list.append([[trade_id, results]])
This fills out really quickly. This is after two ticks:
print(results_list)
[[[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]], [[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]]]
These averages will move and change very quickly. I want to use results_list to track and watch them, then compare previous averages to current ones.
Thinking:
for r in range(len(results_list)):
    if results_list[r][0] == trade_id:
        restick.append(results_list[r][1])
        resnum = len(restick)
        if restick[resnum - 1] > restick[resnum - 2]:  # compare latest average to the previous one
            # do fancy things
Here is some short code that does what I think you have described, although I might have misunderstood. It does basically exactly what you say: select everything that has a certain trade_id and return its average:
TID_INDEX = 0
DATA_INDEX = 1
def id_average(t_id, arr):
    filt_arr = [i[DATA_INDEX] for i in arr if i[TID_INDEX] == t_id]
    return sum(filt_arr) / len(filt_arr)
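A small usage sketch, assuming the entries have been flattened to [trade_id, result] pairs (your results_list currently appends doubly nested [[trade_id, results]] items, so you would flatten them or index one level deeper first):
sample = [[53471, 28.36432], [53477, 31.67835], [53471, 28.36432]]
print(id_average(53471, sample))  # mean of the results recorded for trade_id 53471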