Select columns in a pandas DataFrame - python

I have a pandas dataframe with hundreds of columns of antibiotic names. Each specific antibiotic is coded in the dataframe as ending in E, T, or P to indicate empirical, treatment, or prophylactic regimens.
An example excerpt from the column list is:
['MeropenemP', 'MeropenemE', 'MeropenemT', 'DoripenemP', 'DoripenemE',
 'DoripenemT', 'ImipenemP', 'ImipenemE', 'ImipenemT', 'BiapenemP',
 'BiapenemE', 'BiapenemT', 'PanipenemP', 'PanipenemE',
 'PanipenemT', 'PipTazP', 'PipTazE', 'PipTazT', 'PiperacillinP',
 'PiperacillinE', 'PiperacillinT']
A small sample of data is located here:
Sample antibiotic data
It is simple enough for me to separate the columns for any one regimen type into its own dataframe with a regex, e.g. to select all of the empirically prescribed antibiotic columns I use:
E_cols = master.filter(axis=1, regex='[a-z]+E$')
Each column has a binary value (0,1) for prescription of each antibiotic regimen type per person (row).
Question:
How would I go about summing across all columns of each regimen type for every row (i.e. counting the 1's) and generating a new column in the dataframe for each result, e.g. total_emperical, total_prophylactic, total_treatment?
The reason I want to add to the existing dataframe is that I wish to filter on other values for each regimen type.

Once you've generated the set of columns that match your regex, you can just create the new total column like so (note that E_cols above is a filtered dataframe, so select its columns):
df['total_emperical'] = df[E_cols.columns].sum(axis=1)
and repeat for the other totals.
Passing axis=1 to sum will sum row-wise
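Putting it together, a minimal sketch that builds all three totals in one pass; the small master frame below is hypothetical and only stands in for the real data, while the column names and regex mirror the question:
import pandas as pd

# Hypothetical miniature of the antibiotic data described in the question.
master = pd.DataFrame({
    'MeropenemE': [1, 0, 1],
    'MeropenemP': [0, 1, 0],
    'MeropenemT': [1, 1, 0],
    'DoripenemE': [0, 0, 1],
})

# One total column per regimen suffix, counting the 1's row-wise.
for suffix, total_col in [('E', 'total_emperical'),
                          ('P', 'total_prophylactic'),
                          ('T', 'total_treatment')]:
    cols = master.filter(axis=1, regex='[a-z]+{}$'.format(suffix))
    master[total_col] = cols.sum(axis=1)

print(master)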

Related

How to modify column names when combining rows of multiple columns into single rows in a dataframe based on a categorical value, plus selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows, and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84 # number of columns - 1, excluding the "Batch" column that the rows are grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works, creating a dataframe df2 that groups the rows by Batch; however, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251). Note that 84*3 = 252, i.e. (number of columns minus the Batch column) * 3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are kept as separate per-Sub_Batch columns in the combined row, and for which columns an average or total value is reported.
for example desired input/output:
Original dataframe
Output dataframe
Note the naming of the columns, that all columns are copied over, and that the columns are ordered according to which Sub_Batch they belong to, i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 to the third Sub_Batch that is part of the Batch.
Ideal output dataframe
Note the naming of the columns, and that in this dataframe there is only a single column that records the Color, as it is identical for all Sub-Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for a Batch. The individual Weight values are recorded, as well as the sum of the weight values in the column 'Total_weight'.
I am 100% okay with the Output Dataframe scenario, as I can simply add the values that I want afterwards using .mean and .sum; I am simply asking whether it can be done using .groupby, as it is not something that I have worked with before, and I know that it does have some ability to sum or average results. A rough sketch of one possible approach follows below.
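No answer is recorded here, but a hedged sketch of one way this could be attempted: a cumcount-based pivot spreads the per-Sub_Batch columns out, and a selective .agg handles the shared and aggregated columns. The miniature data and the Color/Temperature/Weight columns are only stand-ins for the real dataframe:
import pandas as pd

# Hypothetical miniature: up to 3 Sub_Batch rows per Batch.
df_clean = pd.DataFrame({
    'Batch':       ['A', 'A', 'A', 'B', 'B'],
    'Color':       ['red', 'red', 'red', 'blue', 'blue'],
    'Temperature': [20, 22, 21, 30, 31],
    'Weight':      [1.0, 1.5, 2.0, 3.0, 3.5],
})

# Number each Sub_Batch within its Batch (1, 2, 3, ...) so the wide columns keep their order.
df_clean['sub'] = df_clean.groupby('Batch').cumcount() + 1

# Spread the per-Sub_Batch columns into Temperature_1, Temperature_2, ..., Weight_1, Weight_2, ...
wide = df_clean.pivot(index='Batch', columns='sub', values=['Temperature', 'Weight'])
wide.columns = ['{}_{}'.format(col, n) for col, n in wide.columns]

# Selective aggregation: one Color per Batch, plus the mean Temperature and total Weight.
agg = df_clean.groupby('Batch').agg(
    Color=('Color', 'first'),
    Average_Temperature=('Temperature', 'mean'),
    Total_Weight=('Weight', 'sum'),
)

df2 = agg.join(wide).reset_index()
print(df2)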

How to reshape dataframe with pandas?

I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). For the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python with pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use fillna with method 'ffill', which propagates the last valid observation forward to the next one. Then drop the rows that contain NAs.
df['Date'].fillna(method='ffill',inplace=True)
df['Place'].fillna(method='ffill',inplace=True)
df.dropna(inplace=True)
You are going to use the forward-filling method to replace null values with the value of the nearest one above it: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
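For reference, a self-contained sketch of that workflow; the sample values are made up, and .ffill() is shorthand for fillna(method='ffill'):
import pandas as pd

# Hypothetical sample with gaps in Date and Place and a missing ProductCategory.
df = pd.DataFrame({
    'Date':            ['2018-01-01', None, None, '2018-01-02'],
    'Place':           ['London', None, None, 'Paris'],
    'ProductCategory': ['Food', 'Toys', None, 'Food'],
    'Sales':           [10, 5, 0, 7],
})

# Forward-fill Date and Place, then drop the rows that still lack a ProductCategory.
df[['Date', 'Place']] = df[['Date', 'Place']].ffill()
df = df.dropna(subset=['ProductCategory'])
print(df)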
Compute the frequency of the categories in the column by plotting; from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value using the index: index[0] gives the most repeated value, index[1] gives the 2nd most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
then fill the missing values with that value:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The 'ID' column in dataframe DT is a subset of the 'ID' column in dataframe MR; the two dataframes are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to values in DT['ID']? Note that values in 'ID' can appear several times in the same column.
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
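For illustration, a tiny sketch contrasting the two approaches; the data and the mr_val/dt_val columns are made up:
import pandas as pd

# Hypothetical example; only the 'ID' column name comes from the question.
MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'mr_val': ['a', 'b', 'c', 'd', 'e']})
DT = pd.DataFrame({'ID': [2, 3], 'dt_val': ['x', 'y']})

# Option 1: keep the MR rows whose ID also appears in DT (repeated IDs are kept).
filtered = MR.loc[MR.ID.isin(DT.ID)]

# Option 2: combine the columns of both dataframes on matching IDs.
combined = pd.merge(MR, DT, on='ID')

print(filtered)
print(combined)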

How to keep rows in a DataFrame based on column unique sets?

How to keep rows in a DataFrame based on column unique pairs in Python?
I have a massive ocean dataset with over 300k rows. Given that some unique latitude-longitude pairs have multiple depths, I am only interested in keeping the rows that contain unique sets of Latitude-Longitude-Year-Month.
The goal here is to know how many months of sampling for a given Latitude-Longitude location.
I tried using pandas conditions but the sets that I want are dependent on each other.
Any ideas on how to do this?
So far I've tried the following:
# keep Latitude, Longitude, Year and Month
glp = glp[['latitude', 'longitude', 'year', 'month']]
# only keep unique rows
glp.drop_duplicates(keep = False, inplace = True)
but it removes too many rows, as I want those four variables to work together
The code you are looking for is .drop_duplicates()
Assuming your dataframe variable is df, you can use
df.drop_duplicates()
or include a list of column names if you're only looking for unique values within specified columns:
df.drop_duplicates(subset=column_list)  # column_list: the column names you want to compare
Edit:
If that's the case, I guess you could just do
df.groupby(column_list).first()  # first() takes the first values of the other columns
And then you could just use df.reset_index() if you want the unique sets as columns again.
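A small sketch of how this applies to the asker's goal, including the month count per location; the miniature glp data is hypothetical, and note that keep=False drops every copy of a duplicated row, whereas the default keep='first' retains one:
import pandas as pd

# Hypothetical miniature; the four column names come from the question.
glp = pd.DataFrame({
    'latitude':  [10.0, 10.0, 10.0, 20.0],
    'longitude': [50.0, 50.0, 50.0, 60.0],
    'year':      [2000, 2000, 2000, 2001],
    'month':     [1, 1, 2, 3],   # two depths sampled at (10, 50) in January 2000
})

# One row per unique Latitude-Longitude-Year-Month set (keep='first' is the default).
unique_sets = glp.drop_duplicates(subset=['latitude', 'longitude', 'year', 'month'])

# Number of sampled months per latitude-longitude location.
months_per_location = (
    unique_sets.groupby(['latitude', 'longitude'])
    .size()
    .reset_index(name='n_months')
)
print(months_per_location)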

Aggregate Function to dataframe while retaining rows in Pandas

I want to aggregate my data based off a field known as COLLISION_ID and a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs since they have the same Coordinates, but retain a count of their occurrences in the original data-set.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns the following:
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data which are not shown here (~40 additional columns that will be filtered later).
Since you want to filter the data afterwards, use transform rather than a plain count:
df1['count_col']=df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
Then you can filter df1 on the count column.
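Putting it together with the goal of dropping repeated COLLISION_IDs while keeping the count and the other columns, a minimal sketch; the miniature data and the LATITUDE column are only stand-ins:
import pandas as pd

# Hypothetical miniature: repeated COLLISION_IDs plus one column standing in for the ~40 others.
df1 = pd.DataFrame({
    'COLLISION_ID': [101, 101, 102, 103, 103, 103],
    'LATITUDE':     [40.1, 40.1, 40.2, 40.3, 40.3, 40.3],
})

# Attach the per-ID count to every row, then keep one row per COLLISION_ID
# while retaining the remaining columns.
df1['count_col'] = df1.groupby('COLLISION_ID')['COLLISION_ID'].transform('count')
df2 = df1.drop_duplicates(subset=['COLLISION_ID'])

# The count column can then be used for filtering, e.g. IDs that occur more than once.
repeats = df2[df2['count_col'] > 1]
print(repeats)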
