Change specific value in MDF object (Python, asammdf)

I would like to change some values in an mdf file (specifically, I would like to check for consistency, since the measurement instrument for some reason writes 10**10 when no value could be found). I can't figure out how to access specific values and change them. I figured out how to include the channel units in the channel names, which works reasonably fast:
from asammdf import MDF

with MDF(file) as mdf:
    for i, gp in enumerate(mdf.groups):  # add units to channel names (faster than using pandas)
        for j, ch in enumerate(gp.channels):
            mdf.groups[i].channels[j].name = ch.name + " [" + ch.unit + "]"
Unfortunately, gp.channels doesn't seem to have a way to access the data, only some metadata for each channel (or at least I can't figure out the attribute or method).
I already tried to convert to a dataframe, where this is rather easy, but the file is quite large so it takes waaaay too long to sift through all the datapoints - my guess is this could be quite a bit faster if it is done in the mdf directly.
# slow method with dataframe conversion
import numpy as np

data = mdf.to_dataframe()
columns = data.columns.tolist()
for col in columns:
    for i, val in enumerate(data[col]):
        if val == 10**10:
            data.loc[i, col] = np.nan
Downsampling solves the taking too long part, but this is not really a solution either since I do need the original sample rate.
Accessing the data is not a problem, since I can use the select() or get() methods, but I can't change the values - I don't know how. Ideally, I'd change any 10**10 to np.nan.
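For what it's worth, one possible way to do this directly on the MDF, sketched with made-up channel and file names and assuming the affected channels are (or can become) float-typed: get the Signal objects with select(), edit their samples arrays, and write them to a new file.
import numpy as np
from asammdf import MDF

with MDF(file) as mdf:
    signals = mdf.select(["Channel_A", "Channel_B"])  # hypothetical channel names
    for sig in signals:
        # Signal.samples is a numpy array, so the placeholder can be replaced in bulk
        sig.samples = np.where(sig.samples == 10**10, np.nan, sig.samples)
    cleaned = MDF()
    cleaned.append(signals)
    cleaned.save("cleaned.mf4", overwrite=True)  # hypothetical output name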

OK, I figured out how to do it efficiently in pandas, which works for me.
I used a combination of a lambda function and the applymap method of a pandas DataFrame:
data = data.applymap(lambda x: np.nan if x==10**10 else x)
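For reference, the same substitution can also be done with DataFrame.replace in a single vectorized call (assuming data is the converted dataframe from above):
data = data.replace(10**10, np.nan)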

Do you still get the 10**10 values when you call get with ignore_invalidation_bits=False? In MDF v4 the writing applications can use the invalidation bits to mark invalid samples.
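A quick way to test that suggestion on an MDF v4 file (the channel name below is made up):
import numpy as np
from asammdf import MDF

with MDF(file) as mdf:
    sig = mdf.get("SomeChannel", ignore_invalidation_bits=False)  # hypothetical channel name
    # count how many 10**10 placeholders remain once invalidation bits are applied
    print(np.count_nonzero(sig.samples == 10**10))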

Related

(Python) Split a dataframe column into multiple columns at once

I want to add columns df['B'] and df['C'].
I coded it like below:
df['B'] = df['A'].str.split("_").str[0]
df['C'] = df['A'].str.split("_").str[1]
But the split is slower than I expected, so it takes too long as the dataframe gets bigger.
I want to find a more efficient way.
Is it possible to use the split function only once?
df[['B','C']] = df['A'].str.split("_") (this code is just an example)
Or is there a smarter way?
Thanks.
split in Pandas has an expand option which you could use as follows. I've not tested the speed.
df = df.join(df['A'].str.split('_', expand = True).rename(columns = {0 : 'B', 1: 'C'}))
You could use expand to create several columns at a time, and then join to add all those columns
df.join(df.A.str.split('_', expand=True).rename({0:'B', 1:'C'}, axis=1))
That said, I fail to see how the method you are using can be so slow, and I doubt this one is much faster. It is basically the same split. That is not Python's built-in split, by the way; it is a pandas method, so it is already "vectorized".
Unless you have thousands of those columns? In which case, indeed, it is better if you can avoid 1000 calls.
Edit: timing
With 1000000 rows, on my computer, this method takes 2.71 seconds (and you can see that, within the same 10 seconds, you got twice the exact same answer, with only the variation of axis=1 vs columns= for rename, so it might be the good one :D).
Your method takes 2.77 seconds.
With a standard deviation of about 0.03. So sometimes yours is even faster, though I ran it enough times to show with a p-value < 5% that yours is slightly slower, but really only slightly.
I guess, this is just as fast as it gets.
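For anyone who wants to reproduce the comparison, here is a rough timing sketch (the data and sizes are made up; absolute numbers will differ by machine):
import timeit
import pandas as pd

df = pd.DataFrame({"A": ["foo_bar"] * 1_000_000})

def one_split():
    # expand=True splits once and returns both pieces as columns
    return df.join(df["A"].str.split("_", expand=True).rename(columns={0: "B", 1: "C"}))

def two_splits():
    # the original approach: split twice, once per target column
    out = df.copy()
    out["B"] = df["A"].str.split("_").str[0]
    out["C"] = df["A"].str.split("_").str[1]
    return out

print(timeit.timeit(one_split, number=3))
print(timeit.timeit(two_splits, number=3))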

Pandas - value changes when adding a new column to a dataframe

I'm trying to add a new column to a dataframe using the following code.
labels = df1['labels']
df2['labels'] = labels
However, in the later part of my program, I found that there might be something wrong with the assignment. So, I checked it using
labels.equals(other=df2['labels'])
and I got a False. (I added this line instantly after assignment)
I also tried to:
print out part of labels and df2, and it turns out that some lines are indeed different;
check the max and min values of both series, and they are different;
check the number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differ by a lot;
test with a smaller amount of data, which works totally fine.
My dataframe is rather large (40 million+ lines), so I cannot print them all out and check the values. Does anyone have any idea about what might lead to this kind of problem? Or are there any suggestions for further tests?
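One further test that might help narrow it down (a sketch only; it assumes the mismatch comes from index alignment, since df2['labels'] = labels aligns labels on df2's index and fills non-matching index positions with NaN):
# do the two series match once index alignment is taken out of the picture?
print(labels.reset_index(drop=True).equals(df2['labels'].reset_index(drop=True)))

# how many index values of df2 are missing from labels (those rows become NaN on assignment)?
print(df2.index.difference(labels.index).size)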

Find row based on multiple conditions (column values greater than)

My issue is that I need to identify the patient "ID" if anything critical (high conc. XT or increase in Crea) is observed in their blood sample.
Ideally, the sick patients' "ID" should be categorized into one of the three groups, which could be called Bad_30, Bad_40, and Bad_40. If the patients don't make it into one of the "Bad" groups, then they are non-critical.
This might be the way:
critical = df[(df['hour36_XT']>=2.0) | (df['hour42_XT']>=1.5) | (df['hour48_XT']>=0.5)]
not_critical = df[~df.index.isin(critical.index)]
Before using this, you will have to convert the data type of all values to float. You can do that by using dtype=np.float32 while defining the dataframe.
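A minimal sketch of that conversion (the file name is made up; the column names are taken from the question):
import numpy as np
import pandas as pd

df = pd.read_csv("patients.csv", dtype={"hour36_XT": np.float32,
                                        "hour42_XT": np.float32,
                                        "hour48_XT": np.float32})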
You can put multiple conditions within one df.loc bracket. I tried this on your dataset and it worked as expected:
newDf = df.loc[(df['hour36_XT'] >= 2.0) & (df['hour42_XT'] >= 1.0) & (df['hour48_XT'] >= 0.5)]
print(newDf['ID'])
Explanation: I'm creating a new dataframe using your conditions and then printing out the IDs of the resulting dataframe.
Words of advice: you should avoid iterating over Pandas dataframe rows, and once you learn to use Pandas properly you'll be surprised how rarely you need to. This should be one of the first lessons when starting with Pandas, but row iteration is so ingrained in us programmers that we tend to skip over the package's vectorized abilities and reach for loops immediately. If you rely on row iteration when working with Pandas, you'll likely hit annoying slowness once you work with larger datasets and/or more complex operations. I recommend reading up on this; I'm a beginner myself and have found this article to be a good reference point.
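To make the advice concrete, here is a small contrast between row iteration and a vectorized mask (column names come from the question; the threshold is just an example):
# slow: Python-level loop over every row
critical_ids = []
for _, row in df.iterrows():
    if row["hour36_XT"] >= 2.0:
        critical_ids.append(row["ID"])

# fast: one vectorized boolean mask
critical_ids_fast = df.loc[df["hour36_XT"] >= 2.0, "ID"].tolist()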

(Py)Spark combineByKey mergeCombiners output type != mergeCombinerVal type

I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT very few clients have a really HUGE amount of data. So I need to save memory when creating their DFs.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DF.
However this makes many copies of the data (RDD rows, list and pandas DF...) in a single task/node and crashes, and I would like to avoid having that many copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking if y is a pandas DF already, but we can do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        return pandasDF.append(y)
    else:
        return x.append(y)
My question: the docs say nothing about it, but does anyone know if it's safe to assume that this will work? (The return datatype of merging two combiners is not the same as the combiner's.) I could control the datatype in mergeCombinerVal too if the number of "bad" calls is marginal, but it would be very inefficient to append to a pandas DF row by row.
Any better idea to do what I want to do?
Thanks!
PS: Right now I'm packing Spark rows. Would switching from Spark rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps with packing the rows, because my rows are small in size but there are very many of them), and it also lets me group them into a "real" Python list (groupByKey groups into some kind of Iterable which pandas doesn't support, forcing me to create another copy of the structure, which doubles memory usage and crashes). That helps me with memory management when packing the rows into pandas/C datatypes.
Now I can use those lists to build a dataframe directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have given a little less memory usage (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs..., so better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I process most of my clients in a very parallel, low-memory-per-executor job and the biggest ones in a not-so-parallel, high-memory-per-executor job. I decide based on a pre-count I perform.
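A minimal sketch of the approach described above (the RDD shape, column names and variable names are assumed, not taken from the post): group each client's rows into a plain Python list with combineByKey, then build one pandas DF per client.
import pandas as pd

my_cols = ["col_a", "col_b"]  # illustrative column names

def create_combiner(row):
    return [row]

def merge_value(acc, row):
    acc.append(row)
    return acc

def merge_combiners(acc1, acc2):
    acc1.extend(acc2)
    return acc1

# rdd is assumed to be an RDD of (client_id, row_tuple) pairs
grouped = rdd.combineByKey(create_combiner, merge_value, merge_combiners)
client_dfs = grouped.mapValues(lambda rows: pd.DataFrame(rows, columns=my_cols))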

Spreadsheet Manipulation Tricks w/ Python's Pandas

I'm giving myself a crash course in using python and pandas for data crunching. I finally got sick of using spreadsheets and wanted something more flexible than R so I decided to give this a spin. It's a really slick interface and I'm having a blast playing around with it. However, in researching different tricks, I've been unable to find just a cheat sheet of basic spreadsheet functions, particularly with regard to adding formulas to new columns in dataframes that reference other columns.
I was wondering if someone might give me the recommended code to accomplish the 6 standard spreadsheet operations below, just so I can get a better idea of how it works. If you'd like to see a full size rendering of the image just click here
If you'd like to see the spreadsheet for yourself, click here.
I'm already somewhat familiar with adding columns to dataframes, it's mainly the cross-referencing of specific cells that I'm struggling with. Basically, I'm anticipating the answer loosely looking something like:
table['NewColumn']=(table['given_column']+magic-code-that-I-don't-know).astype(float-or-int-or-whatever)
If I would do well to use an additional library to accomplish any of these functions, feel free to suggest it.
In general, you want to be thinking about vectorized operations on columns instead of operations on specific cells.
So, for example, if you had a data column, and you wanted another column that was the same but with each value multiplied by 3, you could do this in two basic ways. The first is the "cell-by-cell" operation.
df['data_prime'] = df['data'].apply(lambda x: 3*x)
The second is the vectorized way:
df['data_prime'] = df['data'] * 3
So, column-by-column in your spreadsheet:
Count (you can add 1 to the right side if you want it to start at 1 instead of 0):
df['count'] = pandas.Series(range(len(df)))
Running total:
df['running total'] = df['data'].cumsum()
Difference from a scalar (set the scalar to a particular value in your df if you want):
df['diff'] = scalar - df['data']
Moving average:
df['moving average'] = df['running total'] / df['count'].astype('float')
Basic formula from your spreadsheet:
I think you have enough to do this on your own.
If statement:
df['new column'] = 0
mask = df['data column'] >= 3
df.loc[mask, 'new column'] = 1
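Putting the snippets together, a small self-contained sketch (the sample data and the scalar 10 are made up for illustration):
import pandas as pd

df = pd.DataFrame({"data": [1, 4, 2, 5, 3]})

df["count"] = pd.Series(range(len(df))) + 1                     # 1-based count
df["running total"] = df["data"].cumsum()
df["diff"] = 10 - df["data"]                                    # difference from a scalar
df["moving average"] = df["running total"] / df["count"].astype("float")

df["new column"] = 0
df.loc[df["data"] >= 3, "new column"] = 1                       # if-statement style column
print(df)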
