Comparison between one element and all the others of a DataFrame column - python

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the following rule:
def are_dif(m1, m2, ppm=10):
    # Masses closer than `ppm` parts per million (relative to m1) count as equal.
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from the mass of every other fragment. How can I achieve that selection?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary has the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to the protein.

If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')
df['pdiff'] = (df.mass-df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last rows make this a little tricky, so the next line backfills the NaN in the first row and repeats the last row so that the mask below works correctly. This works for the example here, but might need to be tweaked for other cases (though only as far as the first and last rows are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
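For reference, here is a minimal self-contained sketch of the same sorting idea. It is not the code above verbatim: instead of padding the frame, it treats a missing neighbor at either end as infinitely far away, and the names tol, gap_prev, gap_next and unique_frags are chosen only for this illustration. Because the rule divides by the row's own mass, checking the two adjacent rows after sorting is enough.

```python
import pandas as pd

# The adjusted example data from above.
df = pd.DataFrame({
    "frag": ["TFDEHNAPNSNSNK", "EPGANAIGMVAFK", "GTIK",
             "SPWPSMAR", "LPAK", "NEDSFVVWEQIINSLSALK"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})

tol = 10 * 1e-6  # 10 ppm, as in are_dif (illustrative name)

s = df.sort_values("mass")
# Relative gap to the next-lower and next-higher mass, divided by the row's own mass.
gap_prev = (s["mass"] - s["mass"].shift(1)) / s["mass"]
gap_next = (s["mass"].shift(-1) - s["mass"]) / s["mass"]
# A missing neighbor (first or last row) counts as infinitely far away.
inf = float("inf")
is_unique = (gap_prev.fillna(inf) >= tol) & (gap_next.fillna(inf) >= tol)
unique_frags = s[is_unique]
print(unique_frags)
```

On this data the mask keeps only the LPAK and NEDSFVVWEQIINSLSALK rows.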
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.

Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the numbers in the group. Filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
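The filtering step mentioned above could look roughly like this. It is only a sketch: it rounds to 4 decimals rather than 6 so that the adjusted sample data from the other answer actually falls into shared bins, and the names rounded, counts and singles are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "frag": ["TFDEHNAPNSNSNK", "EPGANAIGMVAFK", "GTIK",
             "SPWPSMAR", "LPAK", "NEDSFVVWEQIINSLSALK"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
})

# Bin the masses by rounding (4 decimals is an assumption for this data).
rounded = np.round(df["mass"], 4)
# Number of distinct frags per bin, indexed by the rounded mass.
counts = df["frag"].groupby(rounded).nunique()
# Keep only the rows whose bin holds exactly one frag.
singles = df[rounded.map(counts) == 1]
print(singles)
```

Keep in mind that rounding is an absolute tolerance, not a relative (ppm) one, and that two very close masses can still land in different bins if they straddle a rounding boundary.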

Another solution is to create a copy of your list (if you need to preserve the original for further processing later), iterate over it, and remove every element that does not satisfy your rule (m1 and m2).
You will get a new list with all unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.

Related

Why is Pandas DataFrame Function 'isin()' taking so much time?

The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs together with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram, where s is the Series returned by value_counts() above:
np.log(s).hist()
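To see the effect of the log transform without downloading the dataset, here is a small sketch on made-up data. The user IDs and counts are invented for illustration, and np.histogram stands in for the plot so that the bin counts can be inspected directly.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: one very active user, one moderate, three occasional.
user_ids = [1] * 100 + [2] * 10 + [3, 4, 5]
ratings = pd.DataFrame({"User-ID": user_ids, "Book-Rating": 5})

# Ratings per user, most active first.
counts = ratings["User-ID"].value_counts()

# Histogram of the log10 counts instead of the raw counts.
log_counts = np.log10(counts)
hist, edges = np.histogram(log_counts, bins=3)
print(counts)
print(hist, edges)
```

With raw counts the three occasional users and the moderate one would share the first bin; on the log scale they are separated.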
First filter on the Book-Rating column to remove the 0 values, then count the values with Series.value_counts and convert the result to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
                                 .sort_index()
                                 .rename_axis('User_ID')
                                 .reset_index(name='booksRated'))
print(usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]

I need faster methods to optimize my loop

So, I'm a Python newbie looking for someone with an idea on how to optimize my code. I'm working with a spreadsheet with over 6000 rows, and this portion of my code seems really inefficient.
for x in range(0, len(df)):
    if df.at[x,'Streak_currency'] != str(df.at[x,'Currency']):
        df.at[x, 'Martingale'] = df.at[x-1, 'Martingale'] + (df.at[x-1, 'Martingale'] )/64
        x+=1
    if df.at[x,'Streak_currency'] == str(df.at[x,'Currency']):
        x+=1
It can take upwards of 8 minutes to run.
With my limited knowledge, I only manage to change my df.loc for df.at, and it helped a lot. But I st
UPDATE
In this section of the code, I'm trying to apply a function based on a previous value until a certain condition is met, in this case,
df.at[x,'Streak_currency'] != str(df.at[x,'Currency']):
I really don't know why this iteration is taking so long. In theory, it should only look at a previous value and apply the function. Here is a sample of the output:
Periodo Currency ... Agrupamento Martingale
0 1 GBPUSD 1 1.583720 <--- starts applying a function over and over.
1 1 GBPUSD 1 1.608466
2 1 GBPUSD 1 1.633598
3 1 GBPUSD 1 1.659123
4 1 GBPUSD 1 1.685047
5 1 GBPUSD 1 1.711376 <- stops applying, since Currency changed
6 1 EURCHF 2 1.256550
7 1 USDCAD 3 1.008720 <- starts applying again until currency changes
8 1 USDCAD 3 1.024481
9 1 USDCAD 3 1.040489
10 1 GBPAUD 4 1.603080
Pandas lookups like df.at[x,'Streak_currency'] are not efficient. For each evaluation of this kind of expression (several times per loop iteration), pandas looks up the column by its name and then fetches the value from it.
You can avoid this cost by storing the columns in variables before the loop. Additionally, you can put each column in a NumPy array so that values can be fetched more efficiently (assuming all the values have the same type).
Finally, string conversions and string comparisons on integers are not efficient. They can be avoided here (assuming the integers are not unreasonably big).
Here is an example:
import numpy as np
streakCurrency = np.array(df['Streak_currency'], dtype=np.int64)
currency = np.array(df['Currency'], dtype=np.int64)
martingale = np.array(df['Martingale'], dtype=np.float64)
for x in range(len(df)):
    if streakCurrency[x] != currency[x]:
        martingale[x] = martingale[x-1] * (65./64.)
        x+=1
    if streakCurrency[x] == currency[x]:
        x+=1
# Update the pandas dataframe
df['Martingale'] = martingale
This should be at least an order of magnitude faster.
Please note that the second condition is useless, since the compared values cannot be equal and different at the same time (this may be a bug in your code)...
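A trimmed-down sketch of the same idea is shown below. It rests on several assumptions that are not in the question: the sample rows are invented, the two currency columns are compared as strings rather than integers, and row 0 is never updated because it has no previous row. It also drops the x += 1 lines, which do not advance a for loop over range and can index one past the last row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the three column names come from the question.
df = pd.DataFrame({
    "Currency": ["GBPUSD", "GBPUSD", "GBPUSD", "EURCHF"],
    "Streak_currency": ["GBPUSD", "", "", "EURCHF"],
    "Martingale": [1.0, 0.0, 0.0, 2.0],
})

# Pull the columns out once so the loop only touches NumPy arrays.
streak = df["Streak_currency"].astype(str).to_numpy()
currency = df["Currency"].astype(str).to_numpy()
martingale = df["Martingale"].to_numpy(dtype=np.float64, copy=True)

# Start at 1: each update depends on the previous row.
for i in range(1, len(df)):
    if streak[i] != currency[i]:
        martingale[i] = martingale[i - 1] * (65.0 / 64.0)

# Write the result back in one assignment.
df["Martingale"] = martingale
print(df)
```

The loop itself stays sequential because each value depends on the one computed just before it; the gain comes from avoiding per-element pandas lookups.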

Date manipulation periods

I have this problem for work. So I have this dataset as follows:
Client Date Transaction Num
A 7/20/2017 1
A 7/26/2017 1
A 7/31/2017 1
A 8/23/2017 2
A 8/31/2017 2
A 9/11/2017 2
A 9/19/2017 3
A 9/27/2017 3
A 10/4/2017 3
B 6/1/2017 1
B 6/29/2017 1
B 7/6/2017 2
B 8/27/2017 3
B 9/28/2017 4
B 10/16/2017 4
B 11/30/2017 5
What I need to do is generate the transaction num based on the date for each client as follows:
For the starting date (for client A, it is 7/20/17), I need to assign a starting transaction Number = 1. Then for every 30 days from this starting date, I need to increment the transaction number by one. So 30 days from 7/20/17 is 8/19/17, so all dates falling within this range get transaction num =1, then if they exceed, the transaction number increments by one for every 30 days from starting date. This pattern goes on, so 30 days from 8/19/17 is 9/18/17, so dates within this range gets transaction num =2, and after 9/18/17, gets transaction num = 3 and so on.
I need to do this for a large Excel file. Any help would be appreciated. If it is easier in Python, please let me know as well.
Thanks,
Sammy
Interesting question, possibly with multiple solutions, but I came up with the one below:
So in C1 enter this formula:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm with CTRL+SHIFT+ENTER, and drag your formula down.
Note: Sorry for the difference in layout of dates, I have to deal with Dutch version of Excel :)
EDIT: Explanation
Step 1 - Get the minimum date corresponding to cell A1:
=MIN(IF($A$1:$A$17=A1,$B$1:$B$17))
Step 2 - Get the difference between cell B1 and the minimum and round it. It doesn't matter whether you round to one or zero decimals:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)
Step 3 - Divide the difference by 30 days:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30
Step 4 - Round this outcome down with the FLOOR function to the closest multiple you want to round to. In this case it will be 1.
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)
Step 5 - Now we just need to add 1 to this outcome to prevent starting at 0:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm each formula with CTRL+SHIFT+ENTER
If the dates are in order, you could just do a VLOOKUP to get the first one and subtract, but @JvdV's answer is more general:
=INT((B2-VLOOKUP(A2,A:B,2,FALSE))/30)+1
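Since the question also asks about Python, here is a rough pandas sketch of the same calculation as the formulas above: days since each client's first date, integer-divided by 30, plus 1. The frame below holds client A's rows and the first three rows of client B from the question; the variable name start is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Client": ["A"] * 9 + ["B"] * 3,
    "Date": ["7/20/2017", "7/26/2017", "7/31/2017", "8/23/2017",
             "8/31/2017", "9/11/2017", "9/19/2017", "9/27/2017",
             "10/4/2017", "6/1/2017", "6/29/2017", "7/6/2017"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# First date per client, repeated on every row of that client.
start = df.groupby("Client")["Date"].transform("min")

# Whole 30-day periods elapsed since the first date, numbered from 1.
df["Transaction Num"] = (df["Date"] - start).dt.days // 30 + 1
print(df)
```

Like the formulas, this puts a date exactly 30 days after the start into period 2; if the boundary day should still count as period 1, the day difference would need to be adjusted before dividing.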

pandas equivalent to numpy.roll

I have a pandas dataframe and I'd like to add a new column that has the contents of an existing column, but shifted relative to the rest of the data frame. I'd also like the value that drops off the bottom to get rolled around to the top.
For example if this is my dataframe:
>>> myDF
coord coverage
0 1 1
1 2 10
2 3 50
I want to get this:
>>> myDF_shifted
coord coverage coverage_shifted
0 1 1 50
1 2 10 1
2 3 50 10
(This is just a simplified example - in real life, my dataframes are larger and I will need to shift by more than one unit)
This is what I've tried and what I get back:
>>> myDF['coverage_shifted'] = myDF.coverage.shift(1)
>>> myDF
coord coverage coverage_shifted
0 1 1 NaN
1 2 10 1
2 3 50 10
So I can create the shifted column, but I don't know how to roll the bottom value around to the top. From internet searches I think that numpy lets you do this with "numpy.roll". Is there a pandas equivalent?
Pandas probably doesn't provide an off-the-shelf method to do exactly what you described; however, if you can step a little bit outside the box, numpy has exactly that.
In your case it is (the second argument is the number of positions to roll by):
import numpy as np
myDF['coverage_shifted'] = np.roll(myDF.coverage, 1)
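As a self-contained sketch on the example frame from the question (the variable n is introduced here only to make the shift size explicit):

```python
import numpy as np
import pandas as pd

myDF = pd.DataFrame({"coord": [1, 2, 3], "coverage": [1, 10, 50]})

n = 1  # number of positions to roll; values falling off the bottom wrap to the top
myDF["coverage_shifted"] = np.roll(myDF["coverage"].to_numpy(), n)
print(myDF)
```

A larger n rolls by more positions, and a negative n rolls in the opposite direction.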
You can pass an additional argument to shift() to achieve what you want, although the previous answer is more helpful in most cases:
last_value = myDF.iloc[-1]['coverage']
myDF['coverage_shifted'] = myDF.coverage.shift(1, fill_value=last_value)
You have to supply the value for fill_value manually, so this only covers a shift of one position.
The same can be applied for rolling in the reverse direction:
first_value = myDF.iloc[0]['coverage']
myDF['coverage_back_shifted'] = myDF.coverage.shift(-1, fill_value=first_value)

pandas - how to combine selected rows in a DataFrame

I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it through networkx, building a directed graph and then converting it to undirected, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe. I would like to return the previous dataframe in the form:
User1 User2 W
0 11 12 3
1 13 14 3
where the common links in the two directions have been merged into one having as W the sum of the single weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data such that User1 is always the lower number ID. Then you can use groupby since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with the following (and then group by User1 and User2 instead of U1 and U2):
In [400]: df.loc[df.User1>df.User2,['User1','User2']] = df.loc[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the above code will be as important as other things you might do. For example, your problem should be amenable to a chunking approach where you iterate over sections of the data, gradually shrinking it on each pass. In that case, the main thing you need to think about is sorting the data before chunking, so as to minimize how many passes you need to make. But doing it that way should allow you to do all the work in memory.
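One way the chunking idea might be sketched is shown below. This is an assumption-laden illustration rather than a tested recipe for the 5 GB file: it reads a tiny in-memory CSV instead of the real gzip file, the helper name merge_undirected is made up, and it simply aggregates each chunk and then aggregates the partial results once more.

```python
import io
import pandas as pd

def merge_undirected(chunks):
    """Sum W over both directions of each user pair, chunk by chunk."""
    partials = []
    for chunk in chunks:
        # Normalize each pair so the lower ID is always in User1.
        lo = chunk[["User1", "User2"]].min(axis=1)
        hi = chunk[["User1", "User2"]].max(axis=1)
        part = (chunk.assign(User1=lo, User2=hi)
                     .groupby(["User1", "User2"], as_index=False)["W"].sum())
        partials.append(part)
    # The same pair can appear in several chunks, so aggregate once more.
    combined = pd.concat(partials, ignore_index=True)
    return combined.groupby(["User1", "User2"], as_index=False)["W"].sum()

# Stand-in for the real file; chunksize=3 splits the pair (13, 14) across chunks.
sample = io.StringIO("User1,User2,W\n11,12,1\n12,11,2\n13,14,1\n14,13,2\n")
result = merge_undirected(pd.read_csv(sample, chunksize=3))
print(result)
```

For the real data, the StringIO object would be replaced by the file path, with whatever separator and compression options the file actually needs.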
