My issue is that I need to identify a patient's "ID" if anything critical (a high XT concentration or an increase in Crea) is observed in their blood sample.
Ideally, each sick patient's "ID" should be categorized into one of three groups, which could be called Bad_30, Bad_40, and Bad_40. Patients who don't fall into one of the "Bad" groups are non-critical.
This might be the way:
critical = df[(df['hour36_XT']>=2.0) | (df['hour42_XT']>=1.5) | (df['hour48_XT']>=0.5)]
not_critical = df[~df.index.isin(critical.index)]
Before using this, you will have to convert the values to a float data type. You can do that by passing dtype=np.float32 when defining the DataFrame.
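For example, something like this (the file name is just a placeholder, and only the measurement columns need the cast):

import numpy as np
import pandas as pd

df = pd.read_csv('patients.csv',  # placeholder file name
                 dtype={'hour36_XT': np.float32,
                        'hour42_XT': np.float32,
                        'hour48_XT': np.float32})

# or convert the columns after the DataFrame already exists:
cols = ['hour36_XT', 'hour42_XT', 'hour48_XT']
df[cols] = df[cols].astype(np.float32)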
You can put multiple conditions within one df.loc bracket. I tried this on your dataset and it worked as expected:
newDf = df.loc[(df['hour36_XT'] >= 2.0) & (df['hour42_XT'] >= 1.0) & (df['hour48_XT'] >= 0.5)]
print(newDf['ID'])
Explanation: I'm creating a new dataframe using your conditions and then printing out the IDs of the resulting dataframe.
Words of advice: Avoid iterating over the rows of a pandas DataFrame; once you learn to use pandas properly, you'll be surprised how rarely you need to. It should be one of the first lessons when starting with pandas, but row iteration is so ingrained in us programmers that we tend to skip over the package's powerful vectorized abilities and reach for loops immediately. If you rely on row iteration, you'll likely hit annoying slowness once you start working with larger datasets and/or more complex operations. I recommend reading up on this; I'm a beginner myself and have found this article to be a good reference point.
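If you also need to put each critical patient into one of the named "Bad" groups, a minimal sketch with np.select could look like the following. Which threshold maps to which group is an assumption on my part (the conditions below reuse the thresholds from the critical filter earlier on this page), so adjust them to your actual cut-offs:

import numpy as np

conditions = [
    df['hour36_XT'] >= 2.0,
    df['hour42_XT'] >= 1.5,
    df['hour48_XT'] >= 0.5,
]
groups = ['Bad_30', 'Bad_40', 'Bad_40']  # group names as given in the question
# the first condition that matches wins; everything else is non-critical
df['group'] = np.select(conditions, groups, default='non-critical')
print(df.loc[df['group'] != 'non-critical', ['ID', 'group']])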
Download the Data Here
Hi, I have data something like below and would like to multi-label it.
I want to end up with something like this: target
But the problem is that data gets lost when I multi-label it, something like below:
issue
using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset. Thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there use .explode() to expand out that list into a series (where the index will match the index it came from, and the column values will be the values in that list).
Then use crosstab() to turn that series into a table, with one row per original index value and one column per distinct value.
Then join that back up with the dataframe on the index values.
Keep in mind that when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I did this on just the first 20 rows and ended up with 233 columns. With the 225,000+ rows it will take a while (maybe a minute or so) to process, and you end up with close to 1,300 columns. That may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is finding a way to simplify it a bit, perhaps by combining movie ids into a set number of genres or something like that, and then testing whether simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)
s = df['movieId'].explode()
df = df[['userId']].join(pd.crosstab(s.index, s))
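If you would rather stick with the MultiLabelBinarizer from your own attempt, a roughly equivalent encoding is sketched below. Run it in place of the crosstab/join step, while df.movieId still holds the real lists (i.e. right after the literal_eval line):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['movieId']),
                       columns=mlb.classes_,
                       index=df.index)
df_enc = df[['userId']].join(encoded)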
I have two lists in form of pandas DataFrames which both contain a column of names. Now I want to compare these names and return a list of names which appear in both DataFrames. The problem is that my solution is way too slow since both list have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas DataFrames alphabetically using "df.sort_values" in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" is only compared to entries in the second list that start with the same letter.
I suspect that the main reason my program is running so slow is my way of accessing the fields which I am comparing.
I use a specific comparison function to compare the names and access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this specific comparison function is more complex than a simple "==", since I am doing a kind of fuzzy string comparison to make sure names with slightly different spellings still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas DataFrame comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
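For reference, the first-letter blocking mentioned above would look roughly like this (just a sketch; who.ratio, the 'name' column and save() are the same as in the code above):

from collections import defaultdict

# bucket the second list by first letter so each name is only compared
# against candidates that start with the same letter
buckets = defaultdict(list)
for j, name2 in enumerate(list2['name']):
    buckets[name2[:1].upper()].append((j, name2))

for i, name1 in enumerate(list1['name']):
    for j, name2 in buckets.get(name1[:1].upper(), []):
        if who.ratio(name1, name2) > 75:
            save(i, j)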
I also thought about switching from pandas to numpy, but I read that this might slow it down even further since pandas is faster for large amounts of data.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas DataFrame? Or is there a faster way in general to run a custom comparison function across two pandas DataFrames?
Edit 2: spelling, additional information.
I have a table in Hadoop which contains 7 billion strings which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park' and I'd need to remove 'John' from that, ideally just replacing with '[name]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option, especially considering the number of calculations (20k names * 7 billion input strings).
# nameList contains ['John', 'Emma', etc.]
def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx, '[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)
df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!
In case you're still curious about this: by pushing the work through a Python function (your removeNames function), Spark has to serialize every row back and forth between the JVM and the Python workers, which loses most of the benefit of expressing the operation as a built-in distributed transformation. As suggested in the comments, if you go with the regexp_replace() method instead, Spark keeps all of the work inside its own engine on the distributed nodes, keeping everything distributed and improving performance.
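A minimal sketch of the regexp_replace() approach, reusing session and nameList from the question (it assumes the names contain no regex metacharacters, and that all 20k names fit comfortably in one alternation; otherwise split the pattern across several chained regexp_replace calls):

from pyspark.sql.functions import regexp_replace

df = session.sql("select free_text from table")

# one alternation with word boundaries, e.g. \b(John|Emma|...)\b
pattern = r'\b(' + '|'.join(nameList) + r')\b'
df = df.withColumn('free_text', regexp_replace('free_text', pattern, '[NAME]'))
df.show()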
I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DataFrame as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DataFrame for each client.
This works fine, BUT a few clients have a really HUGE amount of data, so I need to save memory when creating their DataFrames.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DataFrame.
However, this makes many copies of the data (RDD rows, list and pandas DataFrame...) in a single task/node and crashes, and I would like to avoid having that many copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking whether y is already a pandas DataFrame, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        pandasDF = pandasDF.append(y)
        return pandasDF
    else:
        return x.append(y)
My question: the docs say nothing about it, but does anyone know whether it's safe to assume that this will work (the return type of merging two combiners not being the same as the combiner's type)? I could control the data type in mergeCombinerVal too if the number of "bad" calls is marginal, but appending to a pandas DataFrame row by row would be very inefficient.
Any better ideas for how to do what I want?
Thanks!
PS: Right now I'm packing Spark Rows; would switching from Spark Rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps when packing the rows, because my rows are small but there are very many of them), and it also lets me group them into a "real" Python list (groupByKey groups into some kind of Iterable which pandas doesn't support, forcing me to create another copy of the structure, which doubles memory usage and crashes). That helps me with memory management when packing the rows into pandas/C datatypes.
Now I can use those lists to build a DataFrame directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
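In rough code, the approach I ended up with looks something like the sketch below (client_id and myCols are placeholder names, and rdd stands for the raw rows; none of these come from my actual code):

import pandas as pd

def create_combiner(row):
    return [row]

def merge_value(acc, row):
    acc.append(row)
    return acc

def merge_combiners(a, b):
    a.extend(b)
    return a

# key each raw Spark Row by its client and unpack it into a plain Python list
pairs = rdd.map(lambda row: (row['client_id'], list(row)))
grouped = pairs.combineByKey(create_combiner, merge_value, merge_combiners)
# each value is now a real list of lists, which the pandas constructor accepts directly
client_dfs = grouped.mapValues(lambda rows: pd.DataFrame(rows, columns=myCols))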
Still, my original idea should have used a little less memory (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said, it's not guaranteed by the API/docs... better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I'm processing most of my clients in a very parallel low-memory-per-executor job and the biggest ones in a not-so-parallel high-memory-per-executor job; I decide which based on a pre-count I perform.
I'm giving myself a crash course in using python and pandas for data crunching. I finally got sick of using spreadsheets and wanted something more flexible than R so I decided to give this a spin. It's a really slick interface and I'm having a blast playing around with it. However, in researching different tricks, I've been unable to find just a cheat sheet of basic spreadsheet functions, particularly with regard to adding formulas to new columns in dataframes that reference other columns.
I was wondering if someone might give me the recommended code to accomplish the 6 standard spreadsheet operations below, just so I can get a better idea of how it works. If you'd like to see a full-size rendering of the image, just click here.
If you'd like to see the spreadsheet for yourself, click here.
I'm already somewhat familiar with adding columns to dataframes, it's mainly the cross-referencing of specific cells that I'm struggling with. Basically, I'm anticipating the answer loosely looking something like:
table['NewColumn']=(table['given_column']+magic-code-that-I-don't-know).astype(float-or-int-or-whatever)
If I would do well to use an additional library to accomplish any of these functions, feel free to suggest it.
In general, you want to be thinking about vectorized operations on columns instead of operations on specific cells.
So, for example, if you had a data column, and you wanted another column that was the same but with each value multiplied by 3, you could do this in two basic ways. The first is the "cell-by-cell" operation.
df['data_prime'] = df['data'].apply(lambda x: 3*x)
The second is the vectorized way:
df['data_prime'] = df['data'] * 3
So, column-by-column in your spreadsheet:
Count (you can add 1 to the right side if you want it to start at 1 instead of 0):
df['count'] = pandas.Series(range(len(df)))
Running total:
df['running total'] = df['data'].cumsum()
Difference from a scalar (set the scalar to a particular value in your df if you want):
df['diff'] = scalar - df['data']
Moving average (this running-total-over-count version is really a cumulative average, and the count needs to start at 1 rather than 0 to avoid dividing by zero on the first row):
df['moving average'] = df['running total'] / df['count'].astype('float')
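If you want a true windowed moving average (or an explicit cumulative mean), pandas also has these built in; the window size of 3 here is just an example:

df['moving average'] = df['data'].rolling(window=3).mean()
df['cumulative mean'] = df['data'].expanding().mean()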
Basic formula from your spreadsheet:
I think you have enough to do this on your own.
If statement:
df['new column'] = 0
mask = df['data column'] >= 3
df.loc[mask, 'new column'] = 1
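If you have numpy imported, the same if-statement logic can also be written as a one-liner with np.where, which picks between the two values based on the condition:

import numpy as np

df['new column'] = np.where(df['data column'] >= 3, 1, 0)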