How to compare these strings in Python? [closed]

I have the following string:
1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108
and I'm using a steganography technique to store it in an image. When I retrieve it from the image, sometimes I get it back in complete form, and I have no issue with that. On other occasions the data are only partially retrieved (because the image has been modified or altered), so the result looks something like this:
1679.2235398,-1555.I8\xf3\x1cj~\x9bc\x13\xac\x9e8I>[a\xfdV#\x1c\xe1\xea\xa0\x8ah\x02\xed\xd1\x1c\x84\x96\xe2\xfbk*8'l
Notice that only "1679.2235398,-1555." is retrieved correctly; the rest is where the modification occurred.
Now, how do I compute (as a percentage) how much of the data I successfully retrieved?
Since the lengths are not the same, I can't do a character-by-character comparison; it seems that I need to slice or convert the modified data into some other form to match the length of the original data.
Any tips?

A lot of this is going to depend on the context of your problem, but you have a number of options here.
If your results always look like that, you could just find the longest common subsequence, then divide by the length of the original string for a percentage.
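A minimal sketch of that approach, assuming a classic dynamic-programming LCS (the function and variable names here are mine, not from the question):

def lcs_length(a, b):
    # Classic dynamic-programming LCS with one rolling row to save memory.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

original = "1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108"
retrieved = "1679.2235398,-1555.garbled-rest"  # stand-in for the corrupted payload
percent = 100.0 * lcs_length(original, retrieved) / len(original)
print("%.1f%% recovered" % percent)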
Levenshtein distance is a common way of comparing strings: the number of single-character edits required to turn one string into the other. This question has several answers discussing how to turn that into a percentage.
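Levenshtein itself is not in the standard library (packages such as python-Levenshtein provide it), but the standard library's difflib gives a similar 0-to-1 similarity score; a rough sketch:

import difflib

original = "1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108"
retrieved = "1679.2235398,-1555.garbled-rest"
ratio = difflib.SequenceMatcher(None, original, retrieved).ratio()
print("%.1f%% similar" % (ratio * 100))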
If you don't expect the strings to always come out in the same order, this answer suggests some algorithms used for DNA work.

Well, it really depends. My solution would be something like this:
I would start with the longest string possible and check whether it is in the new string:
if original_string in new_string:
    # something happens here
That check would sit inside a loop that decreases the size of the original string and tries all possible windows. The next pass would use length N-1, which has two possible windows (cutting off the first character or the last one), and so on, until you reach a chosen threshold or windows of length 1. Inside the if conditional, the loop can log the longest string found, and afterwards you can just check the results. Hope that helps.
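A sketch of that shrinking-window idea (the function name and return convention are mine):

def recovered_fraction(original, corrupted):
    # Try the longest windows of `original` first; the first hit is the
    # longest contiguous chunk that survived, as a fraction of the original.
    for size in range(len(original), 0, -1):
        for start in range(len(original) - size + 1):
            if original[start:start + size] in corrupted:
                return float(size) / len(original)
    return 0.0

print(recovered_fraction("1679.2235398,-1555.40390834",
                         "1679.2235398,-1555.garbled"))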

Related

Is there a way to make a loop over huge data faster? [closed]

I have data (a pandas DataFrame) with 10 million rows. The code below loops over the data in Google Colab, but it is very slow when I run it.
Is there a way to write a faster loop with these multiple statements (something like np.where), or another way to solve this problem?
The code is:
for i in range(0, len(data)):
    last = data.head(i)
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]
    avr = select_acc[select_acc['average'] > 0]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()
    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
Generally speaking, the worst thing to do is to iterate over rows.
I can't see a totally iteration-free solution (by "iteration-free" I mean without explicit iterations in Python; of course, any solution iterates somewhere, but some iterate under the hood, in the internal code of pandas or numpy, which is far faster).
But you could at least iterate over account numbers rather than rows (there are certainly fewer account numbers than rows; otherwise you wouldn't need these computations anyway).
For example, you could compute the threshold for an "abnormal" average like this:
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no       # True/False series marking rows of this account
    cs = data[f].average.cumsum()  # cumulative sum of 'average' for this account
    num = f.cumsum()               # running count of rows seen for this account
    data.loc[f, 'lastavr'] = cs / num
After that, the column 'lastavr' contains what your variable lastavrage holds in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs - data.average)/(num - 1) instead of cs/num to match yours exactly. But what for? The only thing you do with this value is compare it to the current data.average, and data.average > (cs - data.average)/(num - 1) if and only if data.average > cs/num. So it is simpler this way, and it avoids a special case for the first row of each account.
Then, once you have that new column (you could also just use a series without adding it as a column, a little like I did with cs and num, which are not columns of data), it is simply a matter of:
pb = (data.average < data.lastavr) & (data.LASTREAD > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'
Note that the fact that I don't see a way to avoid the iteration over ACCOUNTNO doesn't mean there isn't one. In fact, I am pretty sure that with lookup or some combination of join/merge/groupby there is one. But it probably doesn't matter much, because you almost certainly have far fewer distinct ACCOUNTNO values than rows, so the remaining loop is probably negligible. One possible groupby variant is sketched below.
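For instance, a groupby version of the running mean (my own sketch, not tested against the data above; like the loop version, it includes the current row in the mean):

grp = data.groupby('ACCOUNTNO')['average']
data['lastavr'] = grp.cumsum() / (grp.cumcount() + 1)  # running mean per account

pb = (data.average < data.lastavr) & (data.LASTREAD > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'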

Replace values in Power Query column using Python? [closed]

Newbie in Power BI/Power Query and Python here. I hope to ask this question succinctly.
I have a "primary" query in PBI, but I need to change the values of one column (categories) based on the values in the (description) column. I feel there is a better solution than a new conditional if/else column or Replacer.ReplaceText in M code.
An idea I had was to create a list or query of all values in (description) that need their category changed, and somehow use Python to iterate through the (description) list; when it finds a matching value in (description), it drops the new value into (category).
I've googled extensively but can't find the kind of "loop" that I can drop a Python script into in Power Query/Power BI.
What direction should I be heading in, or am I asking the right questions? I'd appreciate any advice!
John
You have a rather simple ETL task at hand that clearly doesn't justify incorporating another language like Python/pandas.
Given the limited information you are sharing, I would use a separate mapping table for your categories and then merge it with your original table. At the end you keep only the columns you are interested in.
E.g. this mapping or translation table has two columns, OLD and NEW. You merge the mapping table with your data table such that OLD equals your Description column (the GUI will help you with that), then expand the newly generated column. Finally, rename the columns you want to keep and remove all the rest. This is far more efficient than a hundred individual replacements.
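If you do end up reaching for Python anyway, the same mapping-table merge looks roughly like this in pandas (all table and column names here are made up for illustration):

import pandas as pd

data = pd.DataFrame({'Description': ['apple pie', 'router'],
                     'Category': ['uncategorized', 'uncategorized']})
mapping = pd.DataFrame({'OLD': ['apple pie'], 'NEW': ['Food']})

merged = data.merge(mapping, how='left', left_on='Description', right_on='OLD')
# Where the mapping matched, take NEW; otherwise keep the original Category.
merged['Category'] = merged['NEW'].fillna(merged['Category'])
result = merged.drop(columns=['OLD', 'NEW'])
print(result)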

Retrieve 13-mer peptide sequence from UniProt ID and specific residue [closed]

I have a list of UniProt IDs, each with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the +/-6 residues around the specific site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE).
Can you suggest a method to do it for a list of IDs + residue of interest using Python, R, or an already available web-tool?
Thanks,
Emanuele
If I enter a list of protein IDs into UniProt at https://www.uniprot.org/uploadlists/, or upload them as a file, I get a table of results. At the top of the table there is an option that allows you to select the columns; one option is the peptide sequence. (No programming needed so far: just upload the list of UIDs you are interested in.)
Now, to extract the specific sequence, this can be done in R using the substr command. Here, we'd want to add/subtract 6 from either end:
len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6 ))
where in your example, ind = 442.
To make this work you need to:
1. Separate your tags into two(+?) columns: the UniProt ID and the site index. You can also include the amino acid if you need it for later analyses.
2. Create a file with just the UniProt IDs, which is fed into the UniProt database.
3. Customize the displayed columns, making sure to get the sequence.
4. Download the result and read it into R.
5. Merge the original data frame (with the site index) with the downloaded results.
6. Generate the sequence in the neighborhood around your point of interest.
It is possible to do this entirely within R; I did that at one point, but I'm not sure you need it unless the whole thing has to be automated. If that's what you need, I would suggest checking out https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, what I was after was not available in the tabular output, and I had to modify my code quite a bit to get to the data I was after.
Hopefully, the Bioconductor solution is easier than what I did.
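If Python is preferred, a small sketch of the same idea is below. It fetches each sequence as FASTA over HTTP; the REST endpoint URL and the tag format are my assumptions, so adjust them to your actual data:

import requests

def site_window(uniprot_id, pos, flank=6):
    # Endpoint assumed; UniProt has changed its API over the years.
    url = "https://rest.uniprot.org/uniprotkb/%s.fasta" % uniprot_id
    resp = requests.get(url)
    resp.raise_for_status()
    seq = "".join(resp.text.splitlines()[1:])  # drop the '>' FASTA header
    start = max(pos - 1 - flank, 0)            # pos is 1-based
    return seq[start:pos + flank]

for tag in ["Q7TQ48_S442"]:  # tag format as in the question
    uid, site = tag.split("_")
    print(uid, site, site_window(uid, int(site[1:])))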

Similarity Analysis in SQL or Python or R [closed]

I have a very interesting problem I have been trying to resolve for the past few days without luck. I have 120k item descriptions that I have to compare against 38k item descriptions and determine the level of similarity between them. Ultimately I want to see whether any of the 38k exist within the 120k, based on similarity.
I found a nice similarity script in Excel and organized my data like a multiplication table, so I can compare each of the 120k descriptions to each of the 38k. The function works; however, the amount of calculation is simply not feasible in Excel. We are talking over 2 billion calculations even if I split this in half (120k x 16k). The function compares the description in A2 to B1, then A2 to C1, and so forth to the end at 16k; then it takes the description in A3 and does the same, 120k times over.
Does anyone know a script in SQL, R, or Python that can do this if I put it on a powerful server?
You are looking for approximate string matching. There is a free add-in for Excel, developed by Microsoft, that performs a so-called fuzzy match. It uses the Jaccard index to determine the similarity of two given values.
1. Make sure both lists of descriptions are in sortable table columns (Ctrl+L).
2. Link the columns in the 'Left Columns' and 'Right Columns' sections by clicking on them, then press the connect button in the middle.
3. Select which columns you want as output (hold Ctrl to select multiple columns on either the left or the right side).
4. Make sure FuzzyLookup.Similarity is checked; this gives the similarity score between 0 and 1.
5. Determine the maximum number of matches shown per compared string.
6. Determine your threshold: the minimum percentage of similarity between two strings before they are marked as a match.
7. Go to cell A1 of a new sheet, because the newly generated similarity table will overwrite existing data.
8. Hit the 'Go' button!
9. Select all the similarity scores and give them more decimal places for a proper result.
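Outside Excel, the same Jaccard idea is easy to script; here is a word-token sketch in Python (list contents and names invented for illustration; for the full 120k x 38k job, a dedicated library such as rapidfuzz would be much faster):

def jaccard(a, b):
    # Jaccard index on word tokens: |intersection| / |union|.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return float(len(ta & tb)) / len(ta | tb) if (ta or tb) else 0.0

short_list = ["red shirt cotton"]                     # stand-in for the 38k
long_list = ["red cotton shirt", "blue denim jeans"]  # stand-in for the 120k

for d in short_list:
    best = max(long_list, key=lambda x: jaccard(d, x))
    print(d, "->", best, round(jaccard(d, best), 2))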

Python program to tell how many image files will fit on a disk/flash drive [closed]

Someone on here asked a question similar to this, but it quickly got downvoted and closed due to the newbie nature of it. So I decided to answer it myself, for others who want to know how to make this nifty program, because it isn't really such a bad idea. So here goes nothing!
First, I took a couple of pictures (in JPEG format, with software I developed myself :3) and got their file sizes; they averaged out at about 25 KB. So to get the number of pictures that fit on a disk, you take its available space in kilobytes and divide it by 25 (or a larger number for other file types, I'm not really sure).
For the program, I ask the user to enter the number of gigabytes available; the program multiplies that by 1048576 (KB in a GB), stores it as a value, then divides it by 25.
So, here is the wonderful code (hopefully the comments roughly explain what is going on; I am not too great at this):
# Main loop
while True:
    # Set number of gigs
    totstor = raw_input("Enter the amount of storage on the desired disk (in GB): ")
    # Just in case you get bored
    if totstor == 'quit':
        break
    try:
        # Do the math
        gigs = int(totstor)
        gigs = round(gigs)
        kilos = gigs * 1048576
        kilos = kilos / 25
        kilos = round(kilos)
        kilos = str(kilos)
        kilos = kilos.strip('.0')
        print 'Space for about ' + kilos + ' standard jpg image files available'
    # If an error occurs, let 'em know
    except:
        print 'Invalid Number!'
    print '\n'
    print '\n'
# Bye
quit()
Anyone who got any help out of this, leave some feedback. Or just tell me how horrendous my code is xD.

I'm not going to tell you your code is horrendous. But you could simplify it.
Everything in the try routine could be replaced by a single line:
print 'Space for about', int(totstor) * 1048576 / 25, 'standard jpg image files available'
In other words, you can print the result of the calculation directly. Let int() take care of the conversion (and of getting rid of the ".0"), and rely on the fact that you can print integers (and most other data types) directly without turning them into strings: you simply chain together the items you need in the output with commas. (There are other ways of getting numbers into the desired output text, but this is the simplest.)
