I'm fairly new to coding and have only written relatively simple scripts here and there when I need them for work. I have a document with an ID column formatted as "Number Word Number", plus values under spec, lower tol, and upper tol columns.
The number in the ID can be an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like the example below:
I'm using Pandas to read the data and do the manipulations I need so my question isn't so much of a how to do it, but more of a how should it best be done.
My code is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. They are very basic if statements like:
if the ID column has "Note" in it, then it's a basic dimension
if the ID column has "Roughness" in it, then it's an Ra value
if the ID column has "Position" in it, then it's a true position, etc.
The problem is I'm not really sure what the "correct" way to do it would be in terms of making it more efficient and simpler. I currently have a series of 30+ if statements and ways to handle the different situations I've run into so far. Virtually all the code I've written follows this overly specific, not very general approach; it works, but I personally find it overcomplicated, and I'm not sure which capabilities of Python/pandas I'm missing that would simplify my code.
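For illustration, checks like the ones described above are often written with pandas string methods rather than per-row ifs; a minimal sketch, assuming a DataFrame df with an 'ID' column (the sample values are made up):

import pandas as pd

df = pd.DataFrame({'ID': ['1 Note 2', '3.5 Roughness 1', '2 Position 4']})  # hypothetical sample data

# Vectorised membership tests over the whole column instead of row-by-row ifs
is_basic = df['ID'].str.contains('Note')
is_ra = df['ID'].str.contains('Roughness')
is_position = df['ID'].str.contains('Position')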
Since you need to test what the value in ID is and do something accordingly, you most probably can't avoid the if statements. What I suggest, since you have already written the code, is to reform the database. Unless there is a very specific reason to keep a structure like this, you should change it as soon as possible.
To be specific: give ID an (auto)incrementing unique number and break the 3 data points of the ID column into 3 separate columns, for example as sketched below.
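A minimal sketch of that split in pandas, assuming the ID column always follows the "Number Word Number" pattern (the new column names are just placeholders):

import pandas as pd

df = pd.DataFrame({'ID': ['1.5 Roughness 2', '3 Position 1']})  # hypothetical sample data

# Split "Number Word Number" into three separate columns
df[['id_num1', 'id_word', 'id_num2']] = df['ID'].str.split(' ', n=2, expand=True)
df['row_id'] = range(len(df))  # simple auto-incrementing unique number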
I have data (a pandas DataFrame) with 10 million rows. This code uses a for loop over the data in Google Colab, but when I run it, it is very slow.
Is there a way to replace the loop and these multiple statements with something faster (like np.where), or some other solution? I need help rewriting this code in another way to solve this problem.
The code is:
for i in range(0, len(data)):
    # All rows before row i
    last = data.head(i)
    # Previous rows belonging to the same account number
    select_acc = last.loc[last['ACOUNTNO'] == data['ACOUNTNO'][i]]
    # Keep only the rows with a positive average
    avr = select_acc[select_acc['average'] > 0]
    if len(avr) == 0:
        lastavrage = 0
    else:
        lastavrage = avr.average.mean()
    # Flag the row if its average dropped below the account's running average
    if (data["average"][i] < lastavrage) and (data['LASTREAD'][i] > 0):
        data["label"][i] = "abnormal"
        data["problem"][i] = "error"
Generally speaking, the worst thing to do is to iterate over rows.
I can't see a totally iteration-free solution (by "iteration-free" I mean "without explicit iterations in Python"; of course, any solution iterates somewhere, but some iterate under the hood, in the internal code of pandas or numpy, which is way faster).
But you could at least try to iterate over account numbers rather than rows (there are certainly fewer account numbers than rows; otherwise you wouldn't need these computations anyway).
For example, you could compute the threshold for an "abnormal" average like this:
for no in data.ACCOUNTNO.unique():
    f = data.ACCOUNTNO == no         # True/False series marking the rows of this account
    cs = data[f].average.cumsum()    # Cumulative sum of 'average' column for this account
    num = f.cumsum()                 # Running count of rows for this account
    data.loc[f, 'lastavr'] = cs / num
After that, the column 'lastavr' contains what your variable lastavrage would be in your code. Well, not exactly: your variable doesn't count the current row, while mine does. We could have computed (cs-data.average)/(num-1) instead of cs/num to have it your way. But what for? The only thing you do with this value is compare it to the current data.average, and data.average > (cs-data.average)/(num-1) exactly when data.average > cs/num (multiply both sides by num-1 and add data.average to see the two inequalities are equivalent). So it is simpler this way, and it avoids a special case for the first row.
Then, once you have that new column (you could also just use a Series without adding it as a column, a little bit like I did for cs and num, which are not columns of data), it is simply a matter of:
pb = (data.average<data.lastavr) & (data.LASTREAD>0)
data.loc[pb,'label']='abnormal'
data.loc[pb,'problem']='error'
Note that the fact that I don't have a way to avoid the iteration over ACCOUNTNO doesn't mean that there isn't one. In fact, I am pretty sure that with lookup or some combination of join/merge/groupby there could be one. But it probably doesn't matter much, because you probably have far fewer values of ACCOUNTNO than rows, so the remaining loop is probably negligible.
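For completeness, a possible groupby-based sketch of the same idea (untested; it computes the same per-account cumulative mean as the loop above, and the lambda still runs once per group, so it is not fully vectorised):

# Cumulative per-account mean of 'average' (includes the current row, like cs/num above)
data['lastavr'] = (data.groupby('ACCOUNTNO')['average']
                       .transform(lambda s: s.expanding().mean()))

pb = (data.average < data.lastavr) & (data.LASTREAD > 0)
data.loc[pb, 'label'] = 'abnormal'
data.loc[pb, 'problem'] = 'error'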
I have a database with 1 million rows, and based on a user's input I need to find the most relevant matches for them.
The way the code was written in the past was by using the library fuzzywuzzy. A ratio between 2 strings was calculated in order to show how similar the strings were.
The problem with that is that we had to run the ratio function for each row from the database, meaning 1 million function calls and the performance is really bad. We've never thought that we'd get to the point of having this much data.
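For reference, a minimal sketch of the kind of per-row scoring described above (the query text, list name, and cutoff are assumptions):

from fuzzywuzzy import fuzz

user_input = "some search text"                 # hypothetical user query
# names: a list of the 1 million strings pulled from the database
scores = [(name, fuzz.ratio(user_input, name)) for name in names]  # one call per row
best = sorted(scores, key=lambda t: t[1], reverse=True)[:10]        # keep the top matches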
I am looking for a better algorithm or solution for handling the search in this case. I've stumbled upon something called TF-IDF (Term Frequency-Inverse Document Frequency). It was described as a solution for "fuzzy matching at scale", way faster.
Unfortunately I couldn't wrap my mind around it and completely understand how it works, and the more I read about it, the more I think it is not what I need, since all the examples I've seen try to find similar matches between two lists, not between one string and one list.
So, am I on the wrong path? And if so, could you please give me some ideas on how I can handle this scenario? Unfortunately, full-text search works only with exact matches, so in our case fuzzy is definitely the way we want to go.
And if you're going to propose the idea of using a separate search engine, we don't want to add a new tool to our infrastructure just for this.
I'm totally new to this forum and new to Python, so I would appreciate it if anybody could help me.
I am trying to build a script in Python based on data that I have in an Excel spreadsheet. I'd like to create an app/script where I can estimate the pregnancy due date and the conception date (for animals) based on measurements that I have taken during ultrasounds. I am able to estimate it with a calculator, but it takes some conversion (from cm to mm, and days to months). To do that in Python, I figured I'd create a variable for each measurement and set each variable equal to its value in days (an integer).
Here is the problem: the main column of my data set is the actual measurement of the babies in mm (known as BPD), but the BPD can be an integer like 5mm or a float like 6.4mm. Since I can't name a variable with a period or a dot in it, what would be the best way to handle my data and assign variables to it? I have tried BPD_4.8 = 77days, but Python tells me there's a syntax error (I'm sure lol), whereas BPD_5 = 78 seems to work. I haven't mastered lists and tuples, nor do I really know how to use them properly, so I'll keep looking online and see what happens.
I'm sure it's something super silly for you guys, but I'm really pulling my hair out and I have nothing but 2 inches of hair lol
This is what my current screen looks like..HELP :(
Howdy and welcome to StackOverflow. The short answer is:
Use a better data structure
You really shouldn't be encoding valuable information into variable names like that. What's going to happen if you want to calculate something with your BPD measurements? Or when you have duplicate BPDs?
This is bad practice. It might seem like a lot of effort to take the time to figure out how to do this properly - but it will be more than worth it if you intend to continue to use Python :)
I'll give you a couple options...
Option 1: Use a dictionary
Dictionaries are common data structures in any language.. so it can pay to know how to use them.
Dictionaries hold information about an object using key/value pairs. For example you might have:
measurements = {
    'animal_1': {'bpd': 4.6, 'due_date_days': 55},
    'animal_2': {'bpd': 5.2, 'due_date_days': 77},
}
An advantage of dictionaries is that they are explicit, ie values have keys which explicitly identify what the information is assigned to. E.g. measurements['animal_1']['due_date_days'] would return the due date for animal 1.
A disadvantage is that it will be harder to compute information / examine relationships than you'll be used to in Excel.
Option 2: Use Pandas
Pandas is a data science library for Python. It's fast, has similar functionality to Excel and is probably well suited to your use case.
I'd recommend you take the time to do a tutorial or two. If you're planning to use Python for data analysis then it's worth using the language and any suitable libraries properly.
You can check out some Pandas tutorials here: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
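As a very rough sketch of what that could look like for this use case (the file name and the column names 'bpd_mm' and 'due_date_days' are assumptions, not taken from your spreadsheet):

import pandas as pd

# Read the spreadsheet; assumes columns named 'bpd_mm' and 'due_date_days'
df = pd.read_excel('ultrasounds.xlsx')

# Look up the estimated due date for a measured BPD of 6.4 mm
match = df.loc[df['bpd_mm'] == 6.4, 'due_date_days']
print(match)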
Good luck!
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe, which can also take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish at the end of the script.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can avoid saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in pandas' documentation. It basically does what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do it since it's implemented in pandas itself.
Notice you need to use .copy() when assigning old_df, before making changes to your current df; otherwise old_df may just be another reference to the same underlying DataFrame rather than an independent copy.
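A minimal sketch of that pattern (the save call at the end is just an example; use whatever format you already save to):

old_df = df.copy()              # snapshot before any modifications (a real copy, not a reference)

# ... interpolation / new columns may or may not be added to df here ...

if not df.equals(old_df):       # only save when something actually changed
    df.to_pickle('results.pkl') # hypothetical save target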
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], you would retrieve values by asking for the value at an address, array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
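A rough sketch of what that single big, Blosc-compressed table could look like in PyTables (column names and string sizes are assumptions):

import tables

class Result(tables.IsDescription):
    string_arg1 = tables.StringCol(32)
    string_arg2 = tables.StringCol(32)
    string_arg3 = tables.StringCol(32)
    string_arg4 = tables.StringCol(32)
    int_arg1 = tables.Int32Col()
    results = tables.Float64Col(shape=(12,))   # the 12 float results per entry

filters = tables.Filters(complevel=5, complib='blosc')  # on-the-fly Blosc compression
with tables.open_file('simulations.h5', mode='w') as h5:
    table = h5.create_table('/', 'results', Result, filters=filters)
    row = table.row
    # one appended row per result entry
    row['string_arg1'] = 'caseA'
    row['string_arg2'] = 'caseB'
    row['string_arg3'] = 'x'
    row['string_arg4'] = 'y'
    row['int_arg1'] = 42
    row['results'] = [0.0] * 12
    row.append()
    table.flush()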
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
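If you went the relational route, a sketch of that six-table layout using sqlite3 (table and column names are made up for illustration; SQLite is just a stand-in for whichever database you'd use):

import sqlite3

conn = sqlite3.connect('simulations.db')
cur = conn.cursor()

# Tables 1-5: single-column tables listing the valid values of each argument
# (SQLite allows untyped columns, so one loop covers both string and int arguments)
for dim in ('string_arg1', 'string_arg2', 'string_arg3', 'string_arg4', 'int_arg1'):
    cur.execute(f"CREATE TABLE {dim} (value NOT NULL PRIMARY KEY)")

# Table 6: only the combinations that actually exist
# (the 12 float results could be added here as extra REAL columns)
cur.execute("""
    CREATE TABLE entries (
        string_arg1, string_arg2, string_arg3, string_arg4, int_arg1,
        PRIMARY KEY (string_arg1, string_arg2, string_arg3, string_arg4, int_arg1)
    )
""")
conn.commit()
conn.close()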
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into detail on solving your specific problem, but the best package I know of that deals with this is Numpy. Numpy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
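For instance, a minimal sketch of that kind of file storage (the array shape and file name are placeholders):

import numpy as np

results = np.random.rand(5000, 12)     # e.g. 12 float results for 5000 entries

np.save('results.npy', results)        # fast binary storage on disk
loaded = np.load('results.npy')        # reload later for analysis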
Hopefully you'll find something in its very easy to read documentation:
Numpy Documentation with Examples