I have a list of UniProt IDs with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the +/-6 residues around the specific site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE).
Can you suggest a method to do it for a list of IDs + residue of interest using Python, R, or an already available web-tool?
Thanks,
Emanuele
If I enter a list of protein IDs into UniProt from https://www.uniprot.org/uploadlists/ or by uploading a file, I get a table of results. At the top of the table, there is an option that allows you to select the columns - one option is the peptide sequence. (no programming needed so far - just upload the list of UIDs you are interested in).
Now, to extract the specific window, this can be done in R using the substr function. Here, we want to go 6 residues on either side of the site index:
# 13-residue window: 6 residues on either side of the site at position ind
len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6))
where in your example, ind = 442.
To make this work you need to:
1. Separate your tags into two (or three) columns: the UniProt ID, the site index, and optionally the amino acid if you need it for later analyses.
2. Create a file with just the UniProt IDs, which is fed into the UniProt database.
3. Customize the displayed columns, making sure to include the sequence.
4. Download the result and read it into R.
5. Merge the original data frame (with the site index) with the downloaded results.
6. Generate the sequence in the neighborhood around your point of interest. (A Python/pandas sketch of steps 4-6 follows this list, in case you prefer to script that part.)
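Since the question allows Python as well, here is a minimal sketch of the read/merge/slice part (steps 4-6) in Python/pandas. The file name and the Entry/Sequence/site column names are assumptions about how the download and your own site list are organized, so adjust them to your data:

```python
import pandas as pd

# Hypothetical input: one row per tag such as Q7TQ48_S442, already split
# into a UniProt accession and a 1-based site position.
sites = pd.DataFrame({"Entry": ["Q7TQ48"], "site": [442]})

# The customized UniProt download, assumed to be tab-separated with
# 'Entry' and 'Sequence' among the selected columns.
uniprot = pd.read_csv("uniprot_download.tab", sep="\t")

merged = sites.merge(uniprot[["Entry", "Sequence"]], on="Entry", how="left")

# Residue 442 (1-based) sits at 0-based index 441, so the +/-6 window is
# the slice [site - 7 : site + 6]; max() guards sites near the N-terminus.
merged["window13"] = [
    seq[max(site - 7, 0):site + 6] if isinstance(seq, str) else None
    for seq, site in zip(merged["Sequence"], merged["site"])
]
print(merged[["Entry", "site", "window13"]])
```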
It is possible to do this entirely within R - I did that at one point, but I'm not sure you need it unless the entire thing has to be automated. If that's what you need, I would suggest checking out https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, the information I needed was not available in the tabular output, and I had to modify my code quite a bit to get at it.
Hopefully, the Bioconductor solution is easier than what I did.
I'm a newbie in Power BI/Power Query and Python. I hope to ask this question succinctly.
I have a "primary" query in PBI but need to change the values of one column (categories) based on the values in the (description) column. I feel there is a better solution than a new conditional if/else column, or ReplaceReplacer.text in M Code.
An idea I had was to create a list or query of all values in (description) that need to have their category changed, and somehow use Python to iterate through the (description) list; when it finds a matching value in (description), it would drop the new value into (categories).
I've googled extensively but can't find that kind of "loop", or a way to drop a Python script into Power Query/Power BI to do it.
What direction should I be heading in, or am I asking the right questions? I'd appreciate any advice!
John
You have a rather simple ETL task at hand that clearly doesn't justify incorporating another language like Python/pandas.
Given the limited information you are sharing, I would use a separate mapping table for your categories and then merge it with your original table. At the end you keep only the columns you are interested in.
E.g. this mapping or translation table has two columns: OLD and NEW. You then merge this mapping table with your data table such that OLD matches your Description column (the GUI will help you with that), and then expand the newly generated column. Finally, rename the columns you want to keep and remove all the rest. This is far more efficient than 100 individual replacements.
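Purely to make the join logic concrete (in Power BI you would do this with Merge Queries in the GUI rather than Python), here is what the mapping-table approach looks like as a pandas sketch; the table contents and column names are invented for the example:

```python
import pandas as pd

data = pd.DataFrame({
    "description": ["blue widget", "red gadget", "green gizmo"],
    "category":    ["misc",        "misc",       "misc"],
})

# Mapping table: OLD = description values to look for, NEW = replacement category.
mapping = pd.DataFrame({
    "OLD": ["blue widget", "red gadget"],
    "NEW": ["widgets",     "gadgets"],
})

merged = data.merge(mapping, left_on="description", right_on="OLD", how="left")
# Use the mapped category where a match was found, otherwise keep the old one.
merged["category"] = merged["NEW"].fillna(merged["category"])
result = merged.drop(columns=["OLD", "NEW"])
print(result)
```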
I have a very interesting problem that I have been trying to resolve over the past few days without luck. I have 120k item descriptions that I have to compare to 38k item descriptions and determine the level of similarity between them. Ultimately, I want to see whether any of the 38k exist within the 120k based on similarity.
I found a nice similarity function in Excel and organized my data as a multiplication-style table so I can compare each of the 120k descriptions to each of the 38k. The function works; however, the amount of calculation is just not feasible to run in Excel. We are talking about roughly 2 billion calculations even if I split this in half (120k x 16k). The function compares the description in A2 to B1, then A2 to C1, and so forth to the end, which is 16k; then it starts from A3 and does the same, and so on for all 120k rows.
Does anyone know of a script in SQL, R, or Python that can do this if I put it on a powerful server?
You are looking for approximate string matching. There is a free add-in for Excel, developed by Microsoft, for creating so-called fuzzy matches; it uses the Jaccard index to determine the similarity of two given values. The steps below show how to use it, and a small Python sketch of the same Jaccard idea follows them, since the question also asks about scripting this.
1. Make sure that both lists of descriptions are listed in a sortable table column (Ctrl+L);
2. Link the columns in the 'Left Columns' and the 'Right Columns' section by clicking on them and pressing the connect button in the middle;
3. Select which columns you want as output (hold Ctrl if you want to select multiple columns on either the left or the right side);
4. Make sure FuzzyLookup.Similarity is checked; this will give a similarity score between 0 and 1;
5. Determine the maximum number of matches shown per compared string;
6. Determine your threshold. The number represents the minimum percentage of similarity between two strings before it marks them as a match;
7. Go to cell A1 of a new sheet, because the newly generated similarity table will overwrite existing data;
8. Hit the 'Go' button!
9. Select all the similarity scores and format them with more decimals for a proper result.
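Since the question also asks about a script in SQL, R, or Python, here is a minimal Python sketch of the same Jaccard idea applied to word tokens. Note that it compares every pair, so for 120k x 38k descriptions you would still need some blocking or indexing (or a dedicated string-matching library) to keep the run time manageable; the sample descriptions and threshold are made up:

```python
def jaccard(a, b):
    """Jaccard similarity of two descriptions, compared as sets of words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

short_list = ["stainless steel bolt m8", "copper wire 2mm"]
long_list = ["bolt m8 stainless steel 20mm", "rubber gasket", "2mm copper wire spool"]

threshold = 0.5
for item in short_list:
    # Brute-force search for the most similar description in the other list.
    best = max(long_list, key=lambda other: jaccard(item, other))
    score = jaccard(item, best)
    if score >= threshold:
        print(f"{item!r} ~ {best!r} ({score:.2f})")
```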
I have the following string:
1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108
and I'm using a steganography technique to store it in an image. When I retrieve it back out of the image, sometimes I get it back in complete form and have no issue with that. On other occasions, the data are not fully retrieved (due to a modification/alteration of the image), so the result looks something like this:
1679.2235398,-1555.I8\xf3\x1cj~\x9bc\x13\xac\x9e8I>[a\xfdV#\x1c\xe1\xea\xa0\x8ah\x02\xed\xd1\x1c\x84\x96\xe2\xfbk*8'l
Notice that only "1679.2235398,-1555." is correctly retrieved, while the rest is where the modification occurred.
Now, how do I compute (as a percentage) how much I successfully retrieved?
Since the lengths are not the same, I can't do a character-by-character comparison; it seems that I need to slice or convert the modified data into some other form to match the length of the original data.
Any tips?
A lot of this is going to depend on the context of your problem, but you have a number of options here.
If your results always look like that, you could just find the longest common subsequence, then divide by the length of the original string for a percentage.
Levenshtein distance is a common way of comparing strings: it counts the single-character edits required to turn one string into another. This question has several answers discussing how to turn that into a percentage. (Both of these options are sketched in code below.)
If you don't expect the strings to always come out in the same order, this answer suggests some algorithms used for DNA work.
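Here is a minimal sketch of the first two options using Python's standard difflib module; SequenceMatcher gives both the longest matching block and an overall similarity ratio (the ratio is not exactly Levenshtein distance, but it is a comparable normalized measure):

```python
import difflib

original = "1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108"
retrieved = "1679.2235398,-1555.I8\xf3\x1cj~\x9bc\x13\xac\x9e8I>[a\xfdV#"

matcher = difflib.SequenceMatcher(None, original, retrieved)

# Option 1: longest contiguous block shared by the two strings,
# expressed as a fraction of the original string's length.
match = matcher.find_longest_match(0, len(original), 0, len(retrieved))
print(match.size / len(original))

# Option 2: an overall similarity ratio (0.0 - 1.0) that accounts for
# all matching blocks, not just the longest one.
print(matcher.ratio())
```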
Well, it really depends. My solution would be something like this:
I would start with the longest possible string and check whether it is in the new string:
if original_string in new_string:
    pass  # record the match here (e.g. its length)
That would be inside a loop that decreases the size of the original string and tries all possible contiguous slices. So the next one would be N-1 characters long with 2 possible slices (cutting off the first character or the last character), and so on, until you reach a specific threshold or strings of length 1. The loop can store the longest string found inside the if conditional, and afterwards you can just check the results. Hope that helps.
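A minimal sketch of that decreasing-length search (brute force, which is fine for short strings like these):

```python
def longest_recovered_fraction(original, retrieved):
    """Fraction of the original string recovered, based on the longest
    contiguous slice of `original` that still appears in `retrieved`."""
    n = len(original)
    # Try the longest slices first, then progressively shorter ones.
    for length in range(n, 0, -1):
        for start in range(0, n - length + 1):
            if original[start:start + length] in retrieved:
                return length / n
    return 0.0


original = "1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108"
retrieved = "1679.2235398,-1555.I8\xf3\x1cj~\x9bc\x13"
print(longest_recovered_fraction(original, retrieved))  # about 0.33 for this example
```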
I have a CSV file with approximately 700 columns and 10,000 rows. Each of these columns contains attribute information for the object in column 1 of each row. I would like to search through this "database" for specific records that match a set of requirements based on their attribute information.
For instance, one column contains state information in the 2-letter abbreviation form. Another column might contain an acronym referring to a certain geographic characteristic. Suppose I'm looking for all rows where the state is NY and the acronym is GRG.
What packages should I use to handle this work/data analysis in R?
If there are no good packages in R, for handling such a large dataset, what should I look to using?
I am familiar with R, Python, Office and some SQL commands.
Edit: I am not going to modify the dataset, but record (print out or create a subset from) the results of the querying. I'll have a total of 10-12 queries at first to determine if this dataset truly serves my need. But I may possibly have hundreds of queries later - at which point I'd like to switch from manual querying of the dataset to an automated querying (if possible).
You can use the fread function from the data.table package:
http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
or you can import the data in an RDBMS and connect to it using RODBC
http://www.statmethods.net/input/dbinterface.html
or you can use RevoScaleR package from Revolution Analytics
or you can use the cloud to process the data
or you can use ff package
Based on your querying needs, the data.table package is the best fit.
You can use setkey to set the key (index) on the columns you query by.
Depending on how much data is in each column and whether you're planning to do statistical analysis, I would definitely go with R. If there is no analysis, then Python with pandas is a good solution. Do not use Office for those files; it'll give you a headache.
If you're brave and your data is going to grow, implement MongoDB with either R or Python, depending on the needs above.
If you don't want to load the whole file into memory, I suggest using the Python library pandas.
You can enable iterator=True (or pass a chunksize) and then load the file chunk by chunk into memory, looping through each chunk to do your analysis.
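For example, a minimal pandas sketch of that chunked filtering for the NY/GRG query; the file name and the state/acronym column names are placeholders for your actual headers:

```python
import pandas as pd

matches = []
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv("database.csv", chunksize=50_000):
    hit = chunk[(chunk["state"] == "NY") & (chunk["acronym"] == "GRG")]
    matches.append(hit)

result = pd.concat(matches, ignore_index=True)
result.to_csv("ny_grg_subset.csv", index=False)
print(len(result), "matching rows")
```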
If you need any other information, please let me know.
I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL:
"abbeycarsuk.com"
and my algorithm outputs:
['abbey','car','suk'],['abbey','cars','uk']
Clearly the second parsing is the correct one, but the first one is also technically just as correct (apparently 'suk' is a word in the dictionary that I am using).
What would help me out a lot is if there is a word list out there that also contains the frequency/popularity of each word. I could work this into my algorithm, and then the second parsing would be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to be able to use it successfully.
Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself; however, if such a data set already exists, it would make my life a lot easier.
There is an extensive article on this very subject written by Peter Norvig (Google's head of research), which contains worked examples in Python, and is fairly easy to understand. The article, along with the data used in the sample programs (some excerpts of Google ngram data) can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
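A minimal sketch of the frequency-based ranking itself, assuming you have already loaded unigram counts (for example from the ngram excerpts mentioned above) into a dict; the counts below are made-up placeholders:

```python
import math

# Hypothetical unigram counts; in practice load these from a corpus
# or from the ngram data referenced above.
word_counts = {"abbey": 12_000, "car": 350_000, "cars": 180_000,
               "uk": 900_000, "suk": 40}
total = sum(word_counts.values())

def score(parse):
    # Sum of log-probabilities; unknown words get a tiny floor count.
    return sum(math.log(word_counts.get(w, 0.5) / total) for w in parse)

candidates = [["abbey", "car", "suk"], ["abbey", "cars", "uk"]]
best = max(candidates, key=score)
print(best)  # ['abbey', 'cars', 'uk'] with these counts
```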
As you mention, "corpus" is the keyword to search for.
E. g. here is a nice list of resources:
http://www-nlp.stanford.edu/links/statnlp.html
(scroll down)
http://ucrel.lancs.ac.uk/bncfreq/flists.html
This is perhaps the list you want. I guess you can cut down on the size of it to increase performance if it's needed.
Here is a nice big list for you. More information available here.
Have it search using a smaller dictionary first; a smaller dictionary will tend to contain only more commonly used words. Then, if that fails, you could have it use your more complete dictionary that includes words like 'suk'.
You would then be able to skip the word frequency analysis; however, you would take a hit to your overhead by adding another, smaller dictionary. (A tiny sketch of this fallback follows below.)
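One way to sketch that two-dictionary idea in Python is to prefer any candidate parse whose words all come from the smaller, common-word dictionary, and only otherwise fall back to parses from the full dictionary; the word sets here are made up:

```python
def choose_parse(candidates, common_words):
    """Prefer a parse made entirely of common words; otherwise fall back
    to the first candidate produced with the full dictionary."""
    for parse in candidates:
        if all(word in common_words for word in parse):
            return parse
    return candidates[0]

# Hypothetical small dictionary of common words (no 'suk' in it).
common = {"abbey", "car", "cars", "uk"}
candidates = [["abbey", "car", "suk"], ["abbey", "cars", "uk"]]
print(choose_parse(candidates, common))  # ['abbey', 'cars', 'uk']
```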
You might be able to use Will's link that he posted in the comments as a small dictionary.
Edit: also, the link you provided does indeed have a free option where you can download a list of the top 5,000 most-used words.