I'm giving myself a crash course in using python and pandas for data crunching. I finally got sick of using spreadsheets and wanted something more flexible than R so I decided to give this a spin. It's a really slick interface and I'm having a blast playing around with it. However, in researching different tricks, I've been unable to find just a cheat sheet of basic spreadsheet functions, particularly with regard to adding formulas to new columns in dataframes that reference other columns.
I was wondering if someone might give me the recommended code to accomplish the 6 standard spreadsheet operations below, just so I can get a better idea of how it works.
I'm already somewhat familiar with adding columns to dataframes, it's mainly the cross-referencing of specific cells that I'm struggling with. Basically, I'm anticipating the answer loosely looking something like:
table['NewColumn']=(table['given_column']+magic-code-that-I-don't-know).astype(float-or-int-or-whatever)
If I would do well to use an additional library to accomplish any of these functions, feel free to suggest it.
In general, you want to be thinking about vectorized operations on columns instead of operations on specific cells.
So, for example, if you had a data column, and you wanted another column that was the same but with each value multiplied by 3, you could do this in two basic ways. The first is the "cell-by-cell" operation.
df['data_prime'] = df['data'].apply(lambda x: 3*x)
The second is the vectorized way:
df['data_prime'] = df['data'] * 3
So, column-by-column in your spreadsheet:
Count (you can add 1 to the right side if you want it to start at 1 instead of 0):
df['count'] = pandas.Series(range(len(df)))
Running total:
df['running total'] = df['data'].cumsum()
Difference from a scalar (set the scalar to a particular value in your df if you want):
df['diff'] = scalar - df['data']
Moving average:
df['moving average'] = df['running total'] / df['count'].astype('float')
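One thing to note: running total divided by count gives a cumulative (expanding) average, and it assumes the count starts at 1 so the first row doesn't divide by zero. pandas also has built-ins for this and for a true fixed-window moving average; a minimal sketch (the window size of 3 is just an example):
df['expanding average'] = df['data'].expanding().mean()
df['moving average (window 3)'] = df['data'].rolling(window=3).mean()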
Basic formula from your spreadsheet:
I think you have enough to do this on your own.
If statement:
df['new column'] = 0
mask = df['data column'] >= 3
df.loc[mask, 'new column'] = 1
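As a side note (not in the original answer), numpy.where expresses the same if/else in a single line; a minimal sketch:
import numpy as np
df['new column'] = np.where(df['data column'] >= 3, 1, 0)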
Hi, I have data something like the linked dataset, and would like to multi-label it.
I want it to look something like this: (target screenshot)
But the problem here is that data is lost when I multi-label it, something like below:
(issue screenshot)
using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset. Thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand that list out into a series (where the index will match the index it came from, and the values will be the values in the list).
Then use pd.crosstab to turn that series into a table where each row is the original index and each column is one of the values.
Then join that back up with the dataframe on the index values.
Keep in mind, when you do one-hot-encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows it'll take a while (maybe a minute or so) to process, and you end up with close to 1300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest testing is a way to simplify it a bit to make it less complex, perhaps by combining movie ids into a set number of genres or something like that (see the sketch after the code below). But then test to see if simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)
s = df['movieId'].explode()
df = df[['userId']].join(pd.crosstab(s.index, s))
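If you want to try the simplification suggested above, one hypothetical tweak is to keep only the most frequent movie ids before building the crosstab, replacing the last join above (the cutoff of 50 is arbitrary; s and df are the variables from the code above):
top_ids = s.value_counts().head(50).index
s_top = s[s.isin(top_ids)]
df = df[['userId']].join(pd.crosstab(s_top.index, s_top)).fillna(0)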
I am completely new to Python (I started last week!), so while I looked at similar questions, I have difficulty understanding what's going on and even more difficulty adapting them to my situation.
I have a csv file where rows are dates and columns are different regions (see image 1). I would like to create a file that has 3 columns: Date, Region, and Indicator where for each date and region name the third column would have the correct indicator (see image 2).
I tried turning wide into long data, but I could not quite get it to work; as I said, I am completely new to Python. My second approach was to split it up by columns and then merge it again. I'd be grateful for any suggestions.
This gives you a solution using stack() in pandas:
import pandas as pd
# In your case, use pd.read_csv instead of this:
frame = pd.DataFrame({
    'Date': ['3/24/2020', '3/25/2020', '3/26/2020', '3/27/2020'],
    'Algoma': [None, 0, 0, 0],
    'Brant': [None, 1, 0, 0],
    'Chatham': [None, 0, 0, 0],
})
solution = frame.set_index('Date').stack().reset_index(name='Indicator').rename(columns={'level_1':'Region'})
solution.to_csv('solution.csv')
This is the inverse of doing a pivot, as explained here: Doing the opposite of pivot in pandas Python. As you can see there, you could also consider using the melt function as an alternative.
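A minimal sketch of the melt alternative, using the same frame as above (note that, unlike stack, melt keeps the rows where the indicator is missing):
molten = frame.melt(id_vars='Date', var_name='Region', value_name='Indicator')
molten.to_csv('solution_melt.csv')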
First, your region column is currently 'one hot encoded'. What you are trying to do is to "reverse" the one hot encoding of your region column. Maybe check if this link answers your question:
Reversing 'one-hot' encoding in Pandas.
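For completeness, a common way to reverse one-hot encoding uses idxmax, sketched below on a hypothetical frame; note it only applies when each row has exactly one column set to 1, so for the Date/Region/Indicator output you describe, the stack/melt approach above is the better fit.
onehot = pd.DataFrame({'Algoma': [1, 0], 'Brant': [0, 1]})
onehot['Region'] = onehot.idxmax(axis=1)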
I have a data frame TB_greater_2018 that has 3 columns: country, e_inc_100k_2000, and e_inc_100k_2018. I would like to subtract e_inc_100k_2000 from e_inc_100k_2018, use the returned values to create a new column of the differences, and then sort by the countries with the largest difference. My current code is:
case_increase_per_100k = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018["case_increase_per_100k"] = case_increase_per_100k
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
When I run this, I get a SettingWithCopyWarning. Is there a way to do this without getting the warning? Or just overall a better way of accomplishing the task?
You can do
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
That said, the warning is usually not about computing the difference in a separate step: SettingWithCopyWarning typically means TB_greater_2018 is itself a slice of another DataFrame (for example, the result of a boolean filter), so pandas can't tell whether the assignment writes to the original or to a copy. Making an explicit copy with .copy() when you create the filtered frame avoids it.
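A minimal sketch of that fix, assuming TB_greater_2018 was built by filtering a larger frame (the TB_data frame and its year column are hypothetical names):
TB_greater_2018 = TB_data[TB_data["year"] > 2018].copy()
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]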
I would like to change some values in an mdf file (specifically, I would like to check for consistency, since the measurement instrument for some reason writes 10**10 when no value could be found). I can't figure out how to access specific values and change them. I figured out how to include the channel units in the channel names, which works reasonably fast:
from asammdf import MDF  # assuming the asammdf library

with MDF(file) as mdf:
    # add units to channel names (faster than using pandas)
    for i, gp in enumerate(mdf.groups):
        for j, ch in enumerate(gp.channels):
            mdf.groups[i].channels[j].name = ch.name + " [" + ch.unit + "]"
Unfortunately, gp.channels doesn't seem to have a way to access the data, only some metadata for each channel (or at least I can't figure out the attribute or method).
I already tried to convert to a dataframe, where this is rather easy, but the file is quite large so it takes waaaay too long to sift through all the datapoints - my guess is this could be quite a bit faster if it is done in the mdf directly.
import numpy as np

# slow method with dataframe conversion
data = mdf.to_dataframe()
columns = data.columns.tolist()
for col in columns:
    for i, val in enumerate(data[col]):
        if val == 10**10:
            data.loc[i, col] = np.nan
Downsampling solves the taking too long part, but this is not really a solution either since I do need the original sample rate.
Accessing the data is not a problem, since I can use the select() or get() methods, but I can't change the values - I don't know how. Ideally, I'd change any 10**10 to a np.nan.
OK, I figured out how to do it efficiently in pandas, which works for me.
I used a combination of a lambda function and the applymap method of a pandas DataFrame:
data = data.applymap(lambda x: np.nan if x==10**10 else x)
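For what it's worth, a vectorized alternative that skips the element-wise lambda and is usually faster (same result):
data = data.replace(10**10, np.nan)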
Do you still get the 10**10 values when you call get with ignore_invalidation_bits=False? In mdf v4, the writing application can use the invalidation bits to mark invalid samples.
My issue is that I need to identify the patient "ID" if anything critical (high conc. XT or increase in Crea) is observed in their blood sample.
Ideally, the sick patients' IDs should be categorized into one of three groups, which could be called Bad_30, Bad_40, and Bad_40. If the patients don't make it into one of the "Bad" groups, then they are non-critical.
This might be the way:
critical = df[(df['hour36_XT']>=2.0) | (df['hour42_XT']>=1.5) | (df['hour48_XT']>=0.5)]
not_critical = df[~df.index.isin(critical.index)]
Before using this, you will have to convert the data type of all the values to float. You can do that by passing dtype=np.float32 when defining the data frame.
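A minimal sketch of that conversion, limited to the columns used in the condition above (assuming numpy is imported as np):
cols = ['hour36_XT', 'hour42_XT', 'hour48_XT']
df[cols] = df[cols].astype(np.float32)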
You can put multiple conditions within one df.loc bracket. I tried this on your dataset and it worked as expected:
newDf = df.loc[(df['hour36_XT'] >= 2.0) & (df['hour42_XT'] >= 1.0) & (df['hour48_XT'] >= 0.5)]
print(newDf['ID'])
Explanation: I'm creating a new dataframe using your conditions and then printing out the IDs of the resulting dataframe.
Words of advice: You should avoid iterating over Pandas dataframe rows, and once you learn to utilize Pandas you'll be surprised how rarely you need to do this. This should be the first lesson when starting to use Pandas, but row-by-row thinking is so ingrained in us programmers that we tend to skip over the vectorized abilities of the Pandas package and turn straight to row iteration. If you rely on row iteration when working with Pandas, you'll likely notice it getting annoyingly slow as you move to larger datasets and/or more complex operations. I recommend reading up on this; I'm a beginner myself and have found this article to be a good reference point.
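To make the contrast concrete, here is a tiny hypothetical example of the same flag computed with row iteration and then vectorized:
import pandas as pd
df = pd.DataFrame({'value': [0.3, 1.8, 2.4]})
# row iteration (slow; avoid)
flags = []
for _, row in df.iterrows():
    flags.append(1 if row['value'] >= 2.0 else 0)
df['flag_loop'] = flags
# vectorized (idiomatic pandas)
df['flag_vec'] = (df['value'] >= 2.0).astype(int)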