The following scenario is given.
I have 2 dataframes called orders and customers.
I want to find where the custId from the orders dataframe appears in the linkedCustomers column of the customers dataframe. The linkedCustomers field is an array of customer IDs.
The orders dataframe contains approximately 5.800.000 items.
The customer dataframe contains approximately 180 000 items.
I am looking for a way to optimize the following code, because this code runs but is very slow. How can I speed this up?
import pandas as pd
# demo data -- In the real scenario this data was read from csv-/json files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7], 'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})
def getMergeCustomerID(row):
    customerOrderId = row['custId']
    searchMasterCustomer = customers[customers['linkedCustomers'].str.contains(str(customerOrderId))]
    searchMasterCustomer = searchMasterCustomer['id']
    if len(searchMasterCustomer) > 0:
        return searchMasterCustomer
    return customerOrderId
orders['newId'] = orders.apply(lambda x: getMergeCustomerID(x), axis=1)
# expected result
custId orderId newId
1 2 5
2 3 5
3 4 7
4 5 6
I think that in some circumstances this approach can solve your problem:
Build a dictionary first,
myDict = {}
for i, j in customers.iterrows():
    for j2 in j['linkedCustomers']:
        # map every linked customer id to its master customer id
        myDict[j2] = j['id']
then use the dictionary to create the new column:
orders['newId'] = [myDict[i] for i in orders['custId']]
IMO even though this can solve your problem (speed up your program) this is not the most generic solution. Better answers are welcome!
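A variation on the dictionary idea, as a minimal sketch only: build the lookup with a comprehension and let Series.map do the per-row work, falling back to the original custId when no master customer is found (the fallback mirrors the question's function). It assumes linkedCustomers holds iterables of ids, as in the demo data, and that each linked id belongs to at most one master customer.

```python
import pandas as pd

orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

# linked customer id -> master customer id
lookup = {linked: master
          for master, linked_ids in zip(customers['id'], customers['linkedCustomers'])
          for linked in linked_ids}

# map() yields NaN for ids missing from the lookup; keep the original id there
orders['newId'] = orders['custId'].map(lookup).fillna(orders['custId']).astype(int)
print(orders)
```

The dictionary is built once over the customers, and each order then costs a single hash lookup instead of a scan of the customers dataframe.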
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't need any SQL queries either. I've also included a section to immediately discard any redundant genes that cannot have any SNPs within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = pd.merge(snp_df, gene_df, on='chromosome')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) & (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

         SNP        feature_id
8  rs3131972  GeneID:100302278
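Filtering a frame that was merged on chromosome alone means every SNP/gene pair on a chromosome exists in memory before the range test, which may be too much at the sizes mentioned in the question. As a hedged alternative, here is a rough sketch that does the range test per chromosome with NumPy broadcasting and only materialises the matching pairs. The variable names are illustrative, and for millions of SNPs the per-chromosome SNP array would probably still need to be processed in chunks.

```python
import numpy as np
import pandas as pd

snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})

pieces = []
for chromosome, snps in snp_df.groupby('chromosome'):
    genes = gene_df[gene_df['chromosome'] == chromosome]
    if genes.empty:
        continue
    # Boolean matrix: rows are SNPs, columns are genes on this chromosome
    bp = snps['BP'].to_numpy()[:, None]
    inside = (bp >= genes['chr_start'].to_numpy()) & (bp <= genes['chr_stop'].to_numpy())
    snp_idx, gene_idx = np.nonzero(inside)
    pieces.append(pd.DataFrame({
        'SNP': snps['SNP'].to_numpy()[snp_idx],
        'feature_id': genes['feature_id'].to_numpy()[gene_idx],
    }))

genic_snps = pd.concat(pieces, ignore_index=True)
print(genic_snps)
```

The boolean matrix holds one byte per SNP/gene pair rather than a full merged row, which is where the saving comes from.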
I have two pandas data frames. The first one contains a list of unigrams extracted from the text, count and probability of the unigram occurring in the text. The structure looks like this:
word count prob
0 we 109 0.003615
1 investigated 20 0.000663
2 the 1125 0.037315
3 potential 36 0.001194
4 of 1122 0.037215
The second one contains a list of skipgrams extracted from the same text, along with the count and probability of the skipgram occurring in the text. It looks like this:
word count prob
0 (we, investigated) 5 0.000055
1 (we, the) 31 0.000343
2 (we, potential) 2 0.000022
3 (investigated, the) 11 0.000122
4 (investigated, potential) 3 0.000033
Now, I want to calculate the pointwise mutual information (PMI) for each skipgram, which is basically the log of the skipgram probability divided by the product of its unigrams' probabilities. I wrote a function for that, which iterates through the skipgram df, and it works exactly how I want. However, I have huge issues with performance, so I wanted to ask if there is a way to improve my code to make it calculate the PMI faster.
Here's my code:
import math
def calculate_pmi(row):
    # row comes from itertuples(): row[1] is the skipgram tuple, row[3] its probability
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0], 'prob'].iloc[0])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1], 'prob'].iloc[0])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result
pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))
Performance of the function for now is around 483.18it/s, which is super slow, as I have hundreds of thousands of skipgrams to iterate through. Any suggestions would be welcome. Thanks.
This is a good question, and exercise, for new users of pandas. Use df.iterrows only as a last resort and, even then, consider alternatives. There are relatively few occasions when this is the right option.
Below is an example of how you can vectorise your calculations.
import pandas as pd
import numpy as np
uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])
skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])
# first split column of tuples in skip
skip[['word1', 'word2']] = skip['word'].apply(pd.Series)
# set index of uni to 'word'
uni = uni.set_index('word')
# merge prob1 & prob2 from uni to skip
skip['prob1'] = skip['word1'].map(uni['prob'].get)
skip['prob2'] = skip['word2'].map(uni['prob'].get)
# perform calculation and filter columns
skip['result'] = np.log(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
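One detail worth checking: the question's function uses math.log10, while np.log is the natural logarithm, so the two give values that differ by a constant factor of ln(10). A small sketch comparing both on the sample data (the column names pmi_log10 and pmi_ln are only illustrative):

```python
import numpy as np
import pandas as pd

uni_prob = pd.Series({'we': 0.003615, 'investigated': 0.000663, 'the': 0.037315,
                      'potential': 0.001194, 'of': 0.037215})
skip = pd.DataFrame({
    'word1': ['we', 'we', 'we', 'investigated', 'investigated'],
    'word2': ['investigated', 'the', 'potential', 'the', 'potential'],
    'prob': [0.000055, 0.000343, 0.000022, 0.000122, 0.000033],
})

# Product of the two unigram probabilities for every skipgram
independent = skip['word1'].map(uni_prob) * skip['word2'].map(uni_prob)

skip['pmi_log10'] = np.log10(skip['prob'] / independent)  # base used in the question
skip['pmi_ln'] = np.log(skip['prob'] / independent)       # natural log
print(skip)
```

Either base ranks the skipgrams identically; it only matters if the numbers are compared with values computed elsewhere.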
I'm new to python and I could really use your help and guidance at the moment. I am trying to read a csv file with three cols and do some computation based on the first and second column i.e.
A spent 100 A spent 2040
A earned 60
B earned 48
B earned 180
A spent 40
Where A spent 2040 would be the addition of all 'A' and 'spent' amounts. This does not give me an error but it's not logically correct:
entries = {}
records = {}
for row in rows:
    cols = row.split(",")
    truck = cols[0]
    if (truck != 'A' and truck != 'B'):
        continue
    record = cols[1]
    if(record != "earned" and record != "spent"):
        continue
    amount = int(cols[2])
    #print(truck+" "+record+" "+str(amount))
    if truck in entries:
        if record in records:
            records[record].append(amount)
        else:
            records[record] = [amount]
    else:
        entries[truck] = records
        if record in records:
            records[record].append(amount)
        else:
            entries[truck][record] = [amount]
I am aware that this part is incorrect because I would be adding the same inner dictionary list to the outer dictionary but I'm not sure how to go from there:
entries[truck] = records
if record in records:
However, I'm not sure of the syntax to create a new dictionary on the fly that would not be 'records'.
I am getting:
{'B': {'earned': [60, 48], 'spent': [100]}, 'A': {'earned': [60, 48], 'spent': [100]}}
But hoping to get:
{'B': {'earned': [48]}, 'A': {'earned': [60], 'spent': [100]}}
For the kind of calculation you are doing here, I highly recommend Pandas.
Assuming in.csv looks like this:
You can do the totalling with three lines of code:
import pandas
df = pandas.read_csv('in.csv')
totals = df.groupby(['truck', 'type']).sum()
totals now looks like this:
truck  type
A      earned     60
       spent     140
B      earned    228
You will find that Pandas allows you to think on a much higher level and avoid fiddling with lower level data structures in cases like this.
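To make the three lines above reproducible without a file on disk, here is a small sketch that feeds the question's five rows through io.StringIO. The header names truck and type come from the groupby call; the name amount for the third column is an assumption made for this example.

```python
import io
import pandas as pd

# Stand-in for in.csv; 'amount' is a hypothetical column name
csv_text = """truck,type,amount
A,spent,100
A,earned,60
B,earned,48
B,earned,180
A,spent,40
"""

df = pd.read_csv(io.StringIO(csv_text))
totals = df.groupby(['truck', 'type'])['amount'].sum()
print(totals)
```

Individual totals can then be read with a tuple key, for example totals.loc[('A', 'spent')].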
if record in entries[truck]:
    entries[truck][record].append(amount)
else:
    entries[truck][record] = [amount]
I believe this is what you would want? Now we are directly accessing the truck's records, instead of trying to check a local dictionary called records. Just like you did if there wasn't any entry of a truck.
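The same idea can probably be written without the explicit membership checks by using dict.setdefault, which creates a fresh inner dictionary (and a fresh list) per truck the first time it is seen. This is only a sketch; the three sample rows are chosen to match the output the question hopes for.

```python
rows = ["A,spent,100", "A,earned,60", "B,earned,48"]

entries = {}
for row in rows:
    truck, record, amount = row.split(",")
    if truck not in ('A', 'B') or record not in ('earned', 'spent'):
        continue
    # setdefault returns the existing value, or inserts and returns the default
    entries.setdefault(truck, {}).setdefault(record, []).append(int(amount))

print(entries)
```

Because each truck gets its own inner dictionary, amounts recorded for one truck can no longer show up under another.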
Suppose I have the following DataFrames:
Containers:
Key  ContainerCode     Quantity
1    P-A1-2097-05-B01  0
2    P-A1-1073-13-B04  0
3    P-A1-2024-09-H05  0
5    P-A1-2018-08-C05  0
6    P-A1-2089-03-C08  0
7    P-A1-3033-16-H07  0
8    P-A1-3035-18-C02  0
9    P-A1-4008-09-G01  0
Inventory:
Key  SKU      ContainerCode     Quantity
1    22-3-1   P-A1-4008-09-G01  1
2    2132-12  P-A1-3033-16-H07  55
3    222-12   P-A1-4008-09-G01  3
4    4561-3   P-A1-3083-12-H01  126
How do I update the Quantity values in Containers to reflect the number of units in each container based on the information in Inventory? Note that multiple SKUs can reside in a single ContainerCode, so we need to add to the quantity, rather than just replace it, and there may be multiple entries in Containers for a particular ContainerCode.
What are the possible ways to accomplish this, and what are their relative pros and cons?
The following code seems to serve as a good test case:
import itertools
import pandas as pd
import numpy as np
inventory = pd.DataFrame({'Container Code':['A1','A2','A2','A4'],
containers = pd.DataFrame({'Container Code':['A1','A2','A3','A4'],
'Path Order':[1,2,3,4]})
summedInventory = inventory.groupby('Container Code')['Quantity'].sum()
print('Containers Data Frame')
print('\nInventory Data Frame')
print('\nSummed Inventory List')
newContainers = containers.drop('Quantity', axis=1). \
join(inventory.groupby('Container Code').sum(), on='Container Code')
This seems to produce the desired output.
I also tried using a regular merge:
pd.merge(containers.drop('Quantity', axis=1), \
summedInventory,how='inner',left_on='Container Code', right_index=True)
But that produces an 'IndexError: list index out of range'
Any ideas?
I hope I got your scenario correctly. I think you can use:
containers.drop('Quantity', axis = 1).\
join(inventory.groupby('ContainerCode').sum(), \
on = 'ContainerCode')
I'm first dropping quantity from containers because you don't need it - we'll create it from inventory.
Then, we group inventory by the container code, to sum the quantity relevant to each container.
We then perform the join between the two, and each container code present in containers receives the summed quantity from inventory.
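One thing to be aware of with the join is that containers with no inventory rows end up with NaN rather than 0. A possible variant, sketched here on the question's sample data, maps the summed quantities onto the container codes and fills the gaps with 0:

```python
import pandas as pd

containers = pd.DataFrame({
    'ContainerCode': ['P-A1-2097-05-B01', 'P-A1-1073-13-B04', 'P-A1-2024-09-H05',
                      'P-A1-2018-08-C05', 'P-A1-2089-03-C08', 'P-A1-3033-16-H07',
                      'P-A1-3035-18-C02', 'P-A1-4008-09-G01'],
    'Quantity': 0,
})
inventory = pd.DataFrame({
    'SKU': ['22-3-1', '2132-12', '222-12', '4561-3'],
    'ContainerCode': ['P-A1-4008-09-G01', 'P-A1-3033-16-H07',
                      'P-A1-4008-09-G01', 'P-A1-3083-12-H01'],
    'Quantity': [1, 55, 3, 126],
})

# Total units per container code, summed over all SKUs
summed = inventory.groupby('ContainerCode')['Quantity'].sum()

# Containers without inventory get 0 instead of NaN
containers['Quantity'] = containers['ContainerCode'].map(summed).fillna(0).astype(int)
print(containers)
```

Container codes that appear only in inventory (P-A1-3083-12-H01 here) are simply ignored, and duplicate entries in containers each receive the same total.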
I am new to Python. I would like to compute the difference between two rows of a csv file when they have the same id. This csv dataset is built from an SQL table export which has more than 3 million rows.
This is an example of what my time series dataset looks like:
DATE - Product ID - PRICE
26/08 - 1 - 4
26/08 - 2 - 3
27/08 - 1 - 5
27/08 - 2 - 3
For instance, I would like to calculate the difference between the price of the product with id 1 on 26/08 and the price of the same product on the next day (27/08), to estimate the price variation over time. I wondered what the best way would be to manipulate this data and do the calculations in Python, whether with Python's csv module or with SQL queries in the code. I have also heard of the Pandas library... Thanks for your help!
try building a dictionary by product id and analyzing each id after loading
import csv
dd = {}
with open('prod.csv', newline='') as csvf:
    csvr = csv.reader(csvf, delimiter='-')
    for row in csvr:
        # skip blank lines and the header row
        if len(row) == 0 or row[0].startswith('DATE'):
            continue
        dd.setdefault(int(row[1]), []).append((row[0].strip(), int(row[2])))
{1: [('26/08', 4), ('27/08', 5)],
2: [('26/08', 3), ('27/08', 3)]}
this will make it pretty easy to do comparisons
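Since the question also mentions Pandas, here is a rough sketch of the same comparison with groupby and diff. It assumes the rows are already in chronological order within each product (as in the sample), and the column names date, product_id and price are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['26/08', '26/08', '27/08', '27/08'],
    'product_id': [1, 2, 1, 2],
    'price': [4, 3, 5, 3],
})

# Difference to the previous row of the same product; the first row of
# each product has no predecessor and therefore gets NaN
df['price_diff'] = df.groupby('product_id')['price'].diff()
print(df)
```

For a file with several million rows this avoids a Python-level loop over the rows, although the data would have to be sorted by date first if the export does not guarantee the order.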