Newbie in Power BI/Power Query and Python here. I hope to ask this question succinctly.
I have a "primary" query in Power BI but need to change the values of one column (Category) based on the values in the (Description) column. I feel there is a better solution than a new conditional if/else column or Replacer.ReplaceText in M code.
One idea I had was to create a list or query of all Description values whose category needs to change, and somehow use Python to iterate through that list: when it finds a matching Description value, it drops the new value into Category.
I've googled extensively but can't find that kind of "loop" that I can implement with a Python script in Power Query/Power BI.
What direction should I be heading in, or am I asking the right questions? I'd appreciate any advice!
John
You have a rather simple ETL task at hand that clearly doesn't justify incorporating another language like Python/pandas.
Given the limited information you are sharing, I would suggest using a separate mapping table for your categories and merging it with your original table. Afterwards, keep only the columns you are interested in.
E.g. this mapping or translation table has 2 columns: OLD and NEW. Merge this mapping table with your data table such that OLD equals your Description column (the GUI will help you with that), then expand the newly generated column. Finally, rename the columns you want to keep and remove the rest. This is far more efficient than 100 separate replacements.
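For illustration, here is a rough pandas sketch of the same mapping-table idea (the column names Description/Category and the sample values are made up; in Power BI you would do the equivalent merge in the Power Query GUI):

import pandas as pd

# Hypothetical data table and mapping table; replace with your real queries.
data = pd.DataFrame({"Description": ["apple", "bread", "milk"],
                     "Category":    ["misc",  "misc",  "misc"]})
mapping = pd.DataFrame({"OLD": ["apple", "milk"],
                        "NEW": ["fruit", "dairy"]})

# Left-merge on Description == OLD, then take the NEW category where one exists.
merged = data.merge(mapping, how="left", left_on="Description", right_on="OLD")
merged["Category"] = merged["NEW"].fillna(merged["Category"])
result = merged.drop(columns=["OLD", "NEW"])
print(result)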
I have data (a pandas DataFrame) with 10 million rows. This code runs a for loop over the data in Google Colab, but when I execute it, it is very slow.
Is there a way to use a faster loop with these multiple conditions (like np.where), or another solution?
I need help rewriting this code in another way (for example with np.where) to solve this problem.
The code is:
for i in range(0, len(data)):
    last = data.head(i)
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]
    avr = select_acc[select_acc['average'] > 0]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()
    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
Generally speaking, the worst thing to do is to iterate over rows.
I can't see a totally iteration-free solution (by "iteration-free" I mean without explicit iteration in Python; of course, any solution iterates somewhere, but some solutions do the iteration under the hood, in the internal code of pandas or numpy, which is far faster).
But you could at least iterate over account numbers rather than rows (there are certainly fewer account numbers than rows; otherwise you wouldn't need these computations anyway).
For example, you could compute the threshold of "abnormal" average like this
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no          # True/False series of rows matching this account
    cs = data[f].average.cumsum()     # cumulative sum of the 'average' column for this account
    num = f.cumsum()                  # numbering of this account's rows
    data.loc[f, 'lastavr'] = cs / num
After that, the column 'lastavr' contains what your variable lastavrage holds in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs-data.average)/(num-1) instead of cs/num to match yours exactly. But what for? The only thing you do with this value is compare it to the current data.average, and data.average > (cs-data.average)/(num-1) if and only if data.average > cs/num. So it is simpler this way, and it avoids a special case for the first row of each account.
Then, once you have that new column (you could also just use a Series without adding it as a column, a bit like cs and num, which are not columns of data), it is simply a matter of:
pb = (data.average<data.lastavr) & (data.LASTREAD>0)
data.loc[pb,'label']='abnormal'
data.loc[pb,'problem']='error'
Note that the fact that I don't see a way to avoid the iteration over ACCOUNTNO doesn't mean there isn't one. In fact, I am fairly sure that with lookup or some combination of join/merge/groupby there could be. But it probably doesn't matter much, because you most likely have far fewer ACCOUNTNO values than rows, so the remaining loop is probably negligible.
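For completeness, here is a hedged sketch of how that remaining loop could be folded into a groupby (data is the DataFrame from the question; this version excludes the current row like the original loop, ignores the average > 0 filter just as the code above does, and assumes the ACCOUNTNO spelling used in this answer):

# Expanding mean of 'average' per account, excluding the current row,
# with no explicit Python loop.
grp = data.groupby('ACCOUNTNO')['average']
prev_sum = grp.cumsum() - data['average']          # sum of the previous rows in each account
prev_cnt = grp.cumcount()                          # number of previous rows in each account
data['lastavr'] = (prev_sum / prev_cnt).fillna(0)  # 0 when there is no previous row

pb = (data['average'] < data['lastavr']) & (data['LASTREAD'] > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'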
I've attached a screenshot of a table in Excel, but I'm doing this in Python.
I'm trying to recreate the column "predict" in Python; I already have the other columns. I want the first row of "predict" to equal the first row of "ytd", and every value after that to be the "nc" value multiplied by the previous value in the "predict" column. It doesn't have to be done in this particular order or in this exact way; I just want that to be the end result, and any clear help to achieve it would be much appreciated. I feel like there should be a way to do this with conditionals, but I am struggling to find the right combination of information.
Have you got any code in Python already? Is the information already there, or are you reading it from the Excel file and then printing it out or saving it back to the file? I didn't quite understand the question.
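For reference, a minimal pandas sketch of the calculation described in the question (the column names ytd, nc, and predict come from the question; the sample numbers are invented):

import pandas as pd

df = pd.DataFrame({"ytd": [100, 0, 0, 0],            # only the first ytd value is used
                   "nc":  [1.00, 1.10, 1.20, 0.90]})

# predict[0] = ytd[0]; predict[i] = nc[i] * predict[i-1] for i > 0,
# i.e. ytd[0] times the running product of nc, ignoring nc on the first row.
factors = df["nc"].copy()
factors.iloc[0] = 1
df["predict"] = df["ytd"].iloc[0] * factors.cumprod()
print(df)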
I'm learning DataFrames now. I've been stuck on how to get a subset of a DataFrame or table by its label index. I know it's a very simple question, but I couldn't find the solution in the pandas documentation. I hope someone can help me; I'd appreciate it.
So, I have a dataframe named df_teams like below:
[screenshot of the df_teams dataframe]
If I want to get a subtable for a specific team, 'Warriors', I can use df_teams[df_teams['nickname']=='Warriors'], which results in a single row in the form of a DataFrame. My question is: what if I want a subtable of several teams, say the information for both 'Warriors' and 'Hawks', to form a new table? Can I do something similar using a logical index, finishing in one line of code?
You could do a bitwise or on the two conditions using the '|' character.
df_teams[(df_teams['nickname']=='Warriors')|(df_teams['nickname']=='Hawks')]
Alternatively if you have a list of values you want to check against you could instead use the isin method to return rows that have one of the values present in the list.
E.g.
df_teams[df_teams['nickname'].isin(['Warriors','Hawks'])]
I have a list of UniProt IDs with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the +/-6 residues around the specific site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE).
Can you suggest a method to do this for a list of IDs + residues of interest using Python, R, or an already available web tool?
Thanks,
Emanuele
If I enter a list of protein IDs into UniProt from https://www.uniprot.org/uploadlists/ or by uploading a file, I get a table of results. At the top of the table, there is an option that allows you to select the columns - one option is the peptide sequence. (no programming needed so far - just upload the list of UIDs you are interested in).
Now, to extract the specific sequence, this can be done in R using the substr command. Here, we'd want to add/subtract 6 from either end:
len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6 ))
where in your example, ind = 442.
To make this work you need to
Separate your tags into two(+?) columns - the UniprotID and the site index. You can also include the amino acid if you need it for later analyses.
Create a file with just the UniProtIDs which is fed into the UniProt database.
Customize the displayed columns, making sure to get the sequence.
Download the result and read it into R.
Merge the original data frame (with the site index) with the downloaded results.
Generate the sequence in the neighborhood around your point of interest.
It is possible to do this entirely within R - I did that at one point, but I'm not sure you need it unless you need the entire thing to be automated. If that's what you need, I would suggest checking out https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, what I was after was not available in the tabular output, and I had to modify my code quite a bit to get to the data I was after.
Hopefully, the Bioconductor solution is easier than what I did.
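If you would rather script it in Python instead, here is a rough sketch along the same lines (it assumes the UniProt REST endpoint below still returns plain FASTA; the function name is made up):

import requests

def window_around_site(uniprot_id, pos, flank=6):
    # Fetch the FASTA record; the URL is an assumption about the current UniProt REST API.
    fasta = requests.get(f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta").text
    seq = "".join(fasta.splitlines()[1:])        # drop the FASTA header line
    i = pos - 1                                  # 1-based residue number -> 0-based index
    return seq[max(i - flank, 0): i + flank + 1]

# e.g. window_around_site("Q7TQ48", 442) should give the 13-residue window from the question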
I have a very interesting problem I have been trying to resolve for the past few days without luck. I have 120k item descriptions that I have to compare to 38k items and determine the level of similarity between them. Ultimately, I want to see whether any of the 38k exist within the 120k based on similarity.
I found a nice similarity script in Excel and organized my data as a multiplication table so I can compare each description from the 120k to each description in the 38k. See pic below. The function works; however, the amount of calculation is just not feasible in Excel. We are talking about over 2 billion calculations even if I split this in half (120k x 16k). The function compares the description in A2 to B1, then A2 to C1 and so forth to the end, which is 16k; then it moves to the description in A3 and does the same, 120k times over.
Does anyone know a script in SQL, R, or Python that can do this if I run it on a powerful server?
You are looking for approximate string matching. There is a free add-on for Excel, developed by Microsoft, to create a so-called fuzzy match. It uses the Jaccard index to determine the similarity of two given values.
Make sure that both lists of descriptions are listed in a sortable table column (Ctrl+L);
Link the columns in the 'Left Columns' and the 'Right Columns' section by clicking on them and pressing the connect button in the middle;
Select which columns you want as output (hold Ctrl if you want to select multiple columns on either the left or the right side);
Make sure FuzzyLookup.Similarity is checked; this will give the similarity score as a value between 0 and 1;
Determine the maximum number of matches shown per comparable string;
Determine your Threshold. The number represents the minimum percentage of similarity between two strings before it marks it as a match;
Go to a new sheet and select cell A1, because the newly generated similarity table will overwrite existing data;
Hit the 'Go' button!
Select all the similarity scores and give them more decimals for a proper result.
See example.
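If you would rather run this as a script on a server, here is a brute-force Python sketch of the same Jaccard idea (the list names, sample values, and threshold are placeholders; it is still roughly 2 billion comparisons, just without Excel's overhead):

def jaccard(a, b):
    # Token-level Jaccard index between two descriptions.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

# Placeholder lists; replace with your 38k and 120k description lists.
descriptions_38k = ["red steel bolt 10mm", "copper wire 2m"]
descriptions_120k = ["steel bolt red 10 mm", "aluminium rod", "copper wire 2 m"]

threshold = 0.5                          # minimum similarity to count as a match
matches = []
for small in descriptions_38k:
    best = max(descriptions_120k, key=lambda d: jaccard(small, d))
    score = jaccard(small, best)
    if score >= threshold:
        matches.append((small, best, score))
print(matches)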