How to keep rows in a DataFrame based on column unique sets? - python

How to keep rows in a DataFrame based on column unique pairs in Python?
I have a massive ocean dataset with over 300k rows. Given that some unique latitude-longitude pairs have multiple depths, I am only interested in keeping rows that contain unique sets of Latitude-Longitude-Year-Month.
The goal here is to know how many months of sampling for a given Latitude-Longitude location.
I tried using pandas conditions, but the columns I want to deduplicate on depend on each other.
Any ideas on how to do this?
So far I've tried the following:
# keep Latitude, Longitude, Year and Month
glp = glp[['latitude', 'longitude', 'year', 'month']]
# only keep unique rows
glp.drop_duplicates(keep = False, inplace = True)
but it removes too many rows, as I want those four variables to be considered together

The code you are looking for is .drop_duplicates()
Assuming your dataframe variable is df, you can use
df.drop_duplicates()
or include a column name list if you're only looking for unique values within specified columns
df.drop_duplicates(subset=column_list)  # column_list is the list of names you want to compare
Edit:
If that's the case, I guess you could just do
df.groupby(column_list).first()  # first() takes the first values of the other columns
And then you could just use df.reset_index() if you want the unique sets as columns again.
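As a minimal sketch with made-up coordinates (the glp values below are hypothetical), this also shows why keep=False removed too much: it discards every duplicated combination entirely, while the default keep='first' retains one row per unique set:
import pandas as pd

# hypothetical sample: one location sampled at two depths in the same month
glp = pd.DataFrame({
    'latitude':  [10.0, 10.0, 10.0, 20.0],
    'longitude': [30.0, 30.0, 30.0, 40.0],
    'year':      [2020, 2020, 2020, 2021],
    'month':     [1, 1, 2, 5],
    'depth':     [5, 50, 5, 5],
})

# keep='first' (the default) retains one row per unique combination
unique_samplings = glp.drop_duplicates(subset=['latitude', 'longitude', 'year', 'month'])
print(len(unique_samplings))  # 3 unique Latitude-Longitude-Year-Month sets
From there, counting sampled months per location is one more step, e.g. unique_samplings.groupby(['latitude', 'longitude']).size().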

Related

Efficient way to create a new column with a count of a value in a set of other columns

The dataset I am using has a number of columns which hold criminal offence codes (e.g. 90, 120, 10) for prisoners. The columns are sparsely populated because of the complex survey routing logic used to capture the data. The data needs to be one-hot encoded to feed into a machine learning model. Creating (number of offense columns) x (number of offense codes) indicator columns does one-hot encode the data, but it creates a dataset that is far too sparse.
I therefore want to create one column for each offense code and, for each row in the dataset, populate it with the count of that code across all columns that hold offenses.
I can imagine a way to do this by converting the dataframe to a dictionary, but this seems very slow and like bad practice for pandas.
import pandas as pd

# dataset is a dataframe
# offense_columns is a list of strings corresponding to column names in the dataset
# create a list of all the codes that appear across all offense columns
all_possible_offense_codes = []
for colname in offense_columns:
    for value in dataset[colname].values:
        if value not in all_possible_offense_codes:
            all_possible_offense_codes.append(value)
# create a copy subset of the dataframe with just the offense columns
offense_cols_subset = dataset[offense_columns]
# convert to dictionary - quicker to loop through than the df
offense_cols_dict = offense_cols_subset.to_dict(orient='index')
# create an empty dictionary to hold the counts and append back onto the main dataframe
all_offense_counts = {}
# look at each row in the dataframe (converted into a dict) one by one
for row, variables in offense_cols_dict.items():
    # create a dict with each offense code as key and 0 as the starting count
    # considered using get(code, 0) rather than prepopulating keys and vals,
    # but different keys across dicts would create alignment issues
    # when appending back onto the dataset df
    this_row_offense_counts = {code: 0 for code in all_possible_offense_codes}
    # then go through each offense column
    for column in offense_columns:
        # find the code stored in this column for this row
        code = variables[column]
        # increment the count by 1
        this_row_offense_counts[code] = this_row_offense_counts[code] + 1
    # once all columns have been counted, store the counts
    all_offense_counts[row] = this_row_offense_counts
# once all rows have been counted, turn into a dataframe
offense_counts_cols = pd.DataFrame.from_dict(all_offense_counts, orient='index')
# join to the original dataframe
dataset = dataset.join(offense_counts_cols)
# drop the sparsely populated offense_columns
dataset = dataset.drop(offense_columns, axis=1)
From what I understood, the melt function should help. Please try this:
pd.melt(dataset, id_vars=[<please add a unique col here>], value_vars=offense_columns)
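melt on its own only reshapes the data to long format; to get one count column per code you can follow it with pd.crosstab. A minimal sketch with made-up data (prisoner_id and all values below are hypothetical):
import numpy as np
import pandas as pd

# hypothetical example: 3 prisoners, 2 sparsely populated offense columns
dataset = pd.DataFrame({
    'prisoner_id': [1, 2, 3],
    'offense_1': [90, 120, 90],
    'offense_2': [10, np.nan, 90],
})
offense_columns = ['offense_1', 'offense_2']

# reshape to long format: one row per (prisoner, offense column) pair
long_format = pd.melt(dataset, id_vars=['prisoner_id'], value_vars=offense_columns)

# count each code per prisoner; crosstab ignores the NaN entries
counts = pd.crosstab(long_format['prisoner_id'], long_format['value'])
# value        10.0  90.0  120.0
# prisoner_id
# 1               1     1      0
# 2               0     0      1
# 3               0     2      0

# join back on the id and drop the sparse columns, as in the question
dataset = dataset.join(counts, on='prisoner_id').drop(offense_columns, axis=1)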

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' values also appear in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
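Since the IDs repeat, note the difference between the two approaches: isin() only filters MR's existing rows, while merge() can multiply rows when an ID appears several times on both sides. A small sketch with hypothetical frames:
import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_col': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 1, 4], 'dt_col': ['x', 'y', 'z']})

# filtering keeps the 2 MR rows whose ID appears in DT, unchanged
filtered = MR.loc[MR.ID.isin(DT.ID), :]

# merging pairs each MR row with every matching DT row: 2 x 2 = 4 rows here
combined = pd.merge(MR, DT, on='ID')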

Aggregate Function to dataframe while retaining rows in Pandas

I want to aggregate my data based off a field known as COLLISION_ID and a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs, since they share the same coordinates, but retain a count of their occurrences in the original dataset.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns only the COLLISION_ID index and its count.
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data which are not shown here (~40 additional columns that will be filtered later).
Since you want to filter afterwards, you should use transform, which keeps all the rows:
df1['count_col'] = df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
Then you can filter df1 on the count_col column.
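A minimal sketch with made-up collision data showing how transform broadcasts the group size back onto every row (so the other ~40 columns survive), after which the repeated IDs can be dropped:
import pandas as pd

# hypothetical data: COLLISION_ID repeats and coordinates are identical per ID
df1 = pd.DataFrame({
    'COLLISION_ID': [101, 101, 102, 103, 103, 103],
    'LATITUDE':     [40.7, 40.7, 40.8, 40.9, 40.9, 40.9],
})

df1['count_col'] = df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')

# keep one row (and its coordinates) per COLLISION_ID
df2 = df1.drop_duplicates(subset='COLLISION_ID')
#    COLLISION_ID  LATITUDE  count_col
# 0           101      40.7          2
# 2           102      40.8          1
# 3           103      40.9          3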

How to insert multiple consecutive columns with empty values to python dataframe

I have a dataframe stations with four columns "1990", "2000", "2006", and "2012" containing area data. To interpolate the years in between, I want to insert columns with empty values in the gaps.
I used pandas.DataFrame.insert to insert columns at specific locations, but couldn't find out how to do that with multiple columns at once, something like pandas.DataFrame.insert[1, ["1991":"1999"], np.nan].
Is there a way to insert multiple columns with a consecutive number/name to fill the gaps?
I appreciate every help!
You won't hear this often for question about pandas, but in this instance, I think looping is probably the clearest solution:
for year in range(1991, 2000):
    df[str(year)] = np.nan
You can then reorder the columns afterwards.
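If you prefer to skip the reordering step, a sketch of the whole flow (the stations values below are made up) is to reindex the columns over the full year range, which inserts the NaN columns already in order, and then interpolate row-wise:
import pandas as pd

# hypothetical stations frame with made-up area values
stations = pd.DataFrame({'1990': [10.0, 5.0], '2000': [12.0, 6.0],
                         '2006': [13.0, 7.5], '2012': [15.0, 9.0]})

# reindex inserts every missing year as a NaN column, in consecutive order
stations = stations.reindex(columns=[str(y) for y in range(1990, 2013)])

# with the columns consecutive, the in-between years can be interpolated
stations = stations.interpolate(axis=1)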

Select columns in a pandas DataFrame

I have a pandas dataframe with hundreds of columns of antibiotic names. Each specific antibiotic is coded in the dataframe as ending in E, T, or P to indicate empirical, treatment, or prophylactic regimens.
An example excerpt from the column list is:
['MeropenemP', 'MeropenemE', 'MeropenemT', 'DoripenemP', 'DoripenemE',
'DoripenemT', 'ImipenemP', 'ImipenemE', 'ImipenemT', 'BiapenemP',
'BiapenemE', 'BiapenemT', 'PanipenemP', 'PanipenemE',
'PanipenemT', 'PipTazP', 'PipTazE', 'PipTazT', 'PiperacillinP',
'PiperacillinE', 'PiperacillinT']
A small sample of data is located here:
Sample antibiotic data
It is simple enough for me to separate out the columns of any one type into a separate dataframe with a regex; e.g. to select all the empirically prescribed antibiotic columns I use:
E_cols = master.filter(axis=1, regex=('[a-z]+E$'))
Each column has a binary value (0,1) for prescription of each antibiotic regimen type per person (row).
Question:
How would I go about summing across the rows (the 1's) of all columns for each regimen type and generating a new column for each result in the dataframe, e.g. total_empirical, total_prophylactic, total_treatment?
The reason I want to add to the existing dataframe is that I wish to filter on other values for each regimen type.
Once you've generated the list of columns that match your regex (note that filter() returns a DataFrame, so take E_cols = master.filter(axis=1, regex='[a-z]+E$').columns if you want just the names), you can create the new total columns like so:
df['total_empirical'] = df[E_cols].sum(axis=1)
and repeat for the other totals.
Passing axis=1 to sum() will sum row-wise.
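As a minimal sketch with made-up data (the master excerpt below is hypothetical), the three totals can be built in one loop over the regimen suffixes:
import pandas as pd

# hypothetical excerpt: binary prescription flags per patient (row)
master = pd.DataFrame({
    'MeropenemE': [1, 0, 1],
    'MeropenemT': [0, 1, 0],
    'MeropenemP': [0, 0, 1],
    'DoripenemE': [1, 1, 0],
})

# filter(regex=...) selects the columns for one suffix; sum(axis=1)
# adds up the 1's across each row
for suffix, name in [('E', 'total_empirical'),
                     ('T', 'total_treatment'),
                     ('P', 'total_prophylactic')]:
    master[name] = master.filter(axis=1, regex=f'[a-z]+{suffix}$').sum(axis=1)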
