Title URL Price Address Rental_Type
0 House URL $600 Auburn Apartment
1 House URL $600 Auburn Apartment
2 House URL $900 NY Apartment
3 Room! URL $1018 NaN Office
4 Room! URL $910 NaN Office
I'm trying to drop duplicates by Title, but with two different rules. For rows with Rental_Type == 'Office' I want to drop every duplicate, while for rows with Rental_Type == 'Apartment' I want to keep the first duplicate. So in this example rows 3 and 4 would drop, and only row 1 would drop out of rows 0/1.
I would build this up in steps to construct a list of the indices you wish to drop:
# boolean masks for each rental type
offices = df['Rental_Type'] == 'Office'
apts = df['Rental_Type'] == 'Apartment'
# within the Office rows, flag every duplicated Title (keep none)
dup_offices = df[offices].duplicated('Title', keep=False)
# within the Apartment rows, flag duplicates but keep the first occurrence
dup_apts = df[apts].duplicated('Title', keep='first')
# collect the flagged index labels and drop them
to_drop = dup_apts[dup_apts].index.union(dup_offices[dup_offices].index)
df = df.drop(to_drop)
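As a quick sanity check, here are the same steps run end to end on the sample frame (a minimal sketch; note that row 2 shares the Title 'House' as well, so it is flagged as an Apartment duplicate too):
import pandas as pd
# rebuild the sample frame from the question
df = pd.DataFrame({
    'Title': ['House', 'House', 'House', 'Room!', 'Room!'],
    'URL': ['URL'] * 5,
    'Price': ['$600', '$600', '$900', '$1018', '$910'],
    'Address': ['Auburn', 'Auburn', 'NY', None, None],
    'Rental_Type': ['Apartment', 'Apartment', 'Apartment', 'Office', 'Office'],
})
offices = df['Rental_Type'] == 'Office'
apts = df['Rental_Type'] == 'Apartment'
dup_offices = df[offices].duplicated('Title', keep=False)
dup_apts = df[apts].duplicated('Title', keep='first')
to_drop = dup_apts[dup_apts].index.union(dup_offices[dup_offices].index)
print(df.drop(to_drop))  # rows 1, 2, 3 and 4 drop; only row 0 remains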
You can drop the duplicates with your constraints in this fashion:
# drop all duplicates with Rental_Type == 'Office'
df1 = df[df.Rental_Type == 'Office'].drop_duplicates(['Title'], keep=False)
# flag the duplicated Titles among the Rental_Type == 'Apartment' rows
df2 = df[df.Rental_Type == 'Apartment'].duplicated(['Title'], keep='last')
# keep one of the flagged duplicate rows
df3 = df[df.Rental_Type == 'Apartment'][df2.values][1:]
# put them together
df_final = pd.concat([df1, df3])
In [1]: df_final
Out[1]:
Title URL Price Address Rental_Type
1 House URL 600 Auburn Apartment
Hi,
My data contains the names of persons and a list of cities they lived in. I want to group them together under these conditions:
1. first_name and last_name are identical, or
2. (if 1. doesn't hold) their last_name is the same and they have lived in at least one identical city.
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:
>>> df
last_name first_name cities
0 Dorsey Nancy [Moscow, New York]
1 Harper Max [Munich, Paris, Shanghai]
2 Mueller Max [New York, Los Angeles]
3 Dorsey Nancy [New York, Miami]
4 Harper Maxwell [Munich, Miami]
The new dataframe df_id should look like this. The order of id is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or 2 should get the same id.
>>> df_id
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
4 Harper Maxwell [Munich, Miami] 2
My current code:
df = df.reset_index(drop=True)
# explode lists to rows
df_exploded = df.explode('cities')
# define id_counter and a dictionary to match index to id
id_counter = 1
id_matched = dict()
# define id function
def match_id(row):
    global id_counter
    # check if index already matched
    if row.name not in id_matched:
        # get all persons with identical names (condition 1)
        select = df_exploded[(df_exploded['first_name'] == row['first_name']) & (df_exploded['last_name'] == row['last_name'])]
        if select.empty:
            # get all persons with the same last_name and city (condition 2)
            select_2 = df_exploded[(df_exploded['last_name'] == row['last_name']) & (df_exploded['cities'].isin(row['cities']))]
            if select_2.empty:
                # create a new id for this specific person
                id_matched[row.name] = id_counter
            else:
                # record the new id for the whole group in the dictionary
                for i in select_2.index.unique().tolist():
                    id_matched[i] = id_counter
        else:
            # record the new id for the whole group in the dictionary
            for i in select.index.unique().tolist():
                id_matched[i] = id_counter
        # set next id
        id_counter += 1
# run function (progress_apply requires tqdm; call tqdm.pandas() first)
df.progress_apply(match_id, axis=1)
# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index')
df_id_matched.columns = ['id']
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!
Use:
# sample data was changed so each cities entry is a list,
# e.g. 'Moscow, New York' became ['Moscow', 'New York']
df_id = pd.DataFrame({'last_name':['Dorsey','Harper', 'Mueller', 'Dorsey'],
'first_name':['Nancy','Max', 'Max', 'Nancy'],
'cities':[['Moscow', 'New York'], ['Munich', 'Paris', 'Shanghai'], ['New York', 'Los Angeles'], ['New York', 'Miami']]})
#created default index values
df_id = df_id.reset_index(drop=True)
#explode lists to rows
df = df_id.explode('cities')
# flag duplicates across the 3 columns, reduce per original index
# (True if at least one exploded row is a dupe) and sort dupes first
s = (df.duplicated(['last_name','first_name','cities'], keep=False)
       .any(level=0)
       .sort_values(ascending=False))
#create new column with cumulative sum by inverted mask
df_id['id'] = (~s).cumsum().add(1)
print (df_id)
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
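One note if you are on a newer pandas: Series.any(level=0) was deprecated in 1.3 and later removed, and the equivalent spelling is a groupby reduction:
s = (df.duplicated(['last_name','first_name','cities'], keep=False)
       .groupby(level=0).any()
       .sort_values(ascending=False))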
Manufacturer Buy Box Seller
0 Goli Goli Nutrition Inc.
1 Hanes 3rd Street Brands
2 NaN Inspiring Life
3 Sports Research Sports Research
4 Beckham Luxury Linen Thalestris Co.
Hello, I am using a pandas DataFrame to clean this file and want to delete rows whose Buy Box Seller column contains the manufacturer's name. For example, row 0 will be deleted because it contains the string 'Goli' in the Buy Box Seller column.
There are missing values, so first replace them with DataFrame.fillna, then test whether the Manufacturer value appears in Buy Box Seller using a not in check inside DataFrame.apply with axis=1, and filter with boolean indexing:
mask = (df.fillna('Missing vals')
.apply(lambda x: x['Manufacturer'] not in x['Buy Box Seller'], axis=1))
df = df[mask]
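If DataFrame.apply turns out to be slow on a large frame, the same row-wise substring test can be written with zip over the two columns (a sketch with identical semantics):
# fill missing values, then build one boolean flag per row
filled = df.fillna('Missing vals')
mask = [m not in s for m, s in zip(filled['Manufacturer'], filled['Buy Box Seller'])]
df = df[mask]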
How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below and would like to reshape it into the layout shown under "looking to get".
EDIT:
Code to produce current df example
import pandas as pd
# note: the repeated Sales/City/DIV keys differ only by padding spaces
df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a value (e.g. James) has no indent, it turns into a column value until another unindented value (KIM) comes up. So I need a way to carry each unindented header into a new column that stops when a new header appears.
EDIT 2: There are not only two names (KIM or JAMES); there are about 20 names, and not only the three second levels (Sales, City, DIV). Different names have more than 3 second levels; some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})
def rows_to_columns(group):
    # turn every row that is neither the person's name nor 'Sales'
    # into its own column, filled out for the whole group
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group
# rows with an empty '2020' cell hold the person's name
df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
          .groupby('person')
          .apply(rows_to_columns)
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate 'person' column and fill that column down with .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name; otherwise it wreaks havoc.
2. Convert all other values, such as 'div' and 'city', with rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is in the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes. In that case the best way would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the names (James/KIM) are columns with empty values with no link to the other values, and the stacking relies on blank spaces in the column names. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need.
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked, meaning it has multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = { '2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi-level DataFrame, there are many ways to transform it. This is what we will do to flatten it to one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changes across years, so we can reduce the City and DIV DataFrames to a Series each, taking the first year as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
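If you also want the Name header from the desired layout, a small addition on top of the above is to name the index and reset it:
sales_df.index.name = 'Name'      # label the person index
sales_df = sales_df.reset_index() # turn it into a regular column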
For this example I have 2 DataFrames. The genre column in df1 is column 3, but in df2 it is column 2, and the header is slightly different. In my actual script I have to search the column names because the column location varies in each sheet it reads.
How do I recognise different header names as the same thing?
df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
'VENDOR ID': ['1234','4321','4132'],
'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
'Genre': ['Animation', 'Adventure', 'Family'],
'VENDOR ID': ['5678','8765','8576']})
column_names = ['TITLE','VENDOR ID','GENRE(S)']
appended_data = []
sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)
appended_data = pd.concat(appended_data, sort=False)
output:
TITLE VENDOR ID GENRE(S)
0 The Matrix 1234 Action
1 Die Hard 4321 Adventure
2 Kill Bill 4132 Drama
0 Toy Story 5678 NaN
1 Shrek 8765 NaN
2 Frozen 8576 NaN
desired output:
TITLE VENDOR ID GENRE(S)
0 The Matrix 1234 Action
1 Die Hard 4321 Adventure
2 Kill Bill 4132 Drama
0 Toy Story 5678 Animation
1 Shrek 8765 Adventure
2 Frozen 8576 Family
Thank you for taking the time to do that. Asking a good question is very important, and now that you have posed a coherent one I was able to find a simple solution rather quickly:
import pandas as pd
df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
'VENDOR ID': ['1234','4321','4132'],
'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
'Genre': ['Animation', 'Adventure', 'Family'],
'VENDOR ID': ['5678','8765','8576']})
Simple way:
We will use .append() below, but for this to work we need the columns in df1 and df2 to match. In this case we'll simply rename df2's 'Genre' to 'GENRE(S)':
df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']
df3 = df1.append(df2)
print(df3)
GENRE(S) TITLE VENDOR ID
0 Action The Matrix 1234
1 Adventure Die Hard 4321
2 Drama Kill Bill 4132
0 Animation Toy Story 5678
1 Adventure Shrek 8765
2 Family Frozen 8576
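One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same concatenation is written with pd.concat:
df3 = pd.concat([df1, df2])  # drop-in replacement for df1.append(df2)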
More elaborate:
Now, for a single use case this works but there may be cases where you have many mismatched columns and/or have to do this repeatedly. Here is a solution using boolean indexing to find mismatched names, then zip() and .rename() to map the column names:
# RELOAD YOUR ORIGINAL DF'S
df1_find = df1.columns[~df1.columns.isin(df2.columns)]  # col names that aren't in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)]  # col names that aren't in df1
zipped = dict(zip(df2_find, df1_find))  # df2_find as keys, df1_find as values
df2.rename(columns=zipped, inplace=True)  # map zipped dict to the column names
df3 = df1.append(df2)
print(df3)
GENRE(S) TITLE VENDOR ID
0 Action The Matrix 1234
1 Adventure Die Hard 4321
2 Drama Kill Bill 4132
0 Animation Toy Story 5678
1 Adventure Shrek 8765
2 Family Frozen 8576
Keep in mind:
- this way of doing it assumes that both your df's have the same count of columns
- it ALSO assumes that df1 has your ideal column names, which you will use against other dfs to fix their column names
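If the header variants are known up front, a fixed rename map avoids the ordering assumption entirely; the mapping below is hypothetical, so extend it with the variants you actually encounter:
canonical = {'Genre': 'GENRE(S)', 'Genres': 'GENRE(S)'}  # hypothetical header variants
df2 = df2.rename(columns=canonical)  # unmapped columns pass through unchanged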
I hope this helps.
I have an Excel-File with two sheets.
One contains the df1:
Country City Population Planet
Germany Berlin 30500 Earth
Spain Madrid 21021 Earth
...
And the second contains the df2:
Country City Population Planet
Spain Madrid 21021 Earth
...
Now I want to compare the two DataFrames and check whether there are rows in df1 which are also in df2. If yes, I want to add a new column named double to df1 and put an "X" there whenever the row appears in both df1 and df2.
import io
import pandas as pd
# create string data
df1_str = '''Country,City,Population,Planet
Germany,Berlin,30500,Earth
Spain,Madrid,21021,Earth'''
df2_str = '''Country,City,Population,Planet
Spain,Madrid,21021,Earth'''
# read in to dataframe
df1 = pd.read_csv(io.StringIO(df1_str))
# read in to list for iteration
df1_list = pd.read_csv(io.StringIO(df1_str)).values.tolist()
df2_list = pd.read_csv(io.StringIO(df2_str)).values.tolist()
# join all columns and make a unique combination
df1_list = ["-".join(map(str, item)) for item in df1_list]
df2_list = ["-".join(map(str, item)) for item in df2_list]
# check for each df1 combination whether it exists in df2 (one flag per df1 row)
common_flag = []
for item1 in df1_list:
    if item1 in df2_list:  # exact match against the df2 combinations
        common_flag.append("X")
    else:
        common_flag.append(None)
# add the result to dataframe df1
df1["double"] = pd.Series(common_flag)
Make sure the column order is the same in both DataFrames when creating the combination lists.
output:
Country City Population Planet double
0 Germany Berlin 30500 Earth None
1 Spain Madrid 21021 Earth X
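An alternative that skips the string-joining step is a left merge with indicator=True, which marks the rows present in both frames (a sketch; it assumes df2 has no duplicate rows, otherwise the merge would multiply matches):
# merge on all shared columns; '_merge' says whether a row exists in both
merged = df1.merge(df2, how='left', indicator=True)
df1['double'] = merged['_merge'].eq('both').map({True: 'X', False: None})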