I imported the following data frame from Excel into pandas, and I want to drop duplicate rows based on the values of two columns.
# Python 3.5.2
# Pandas library version 0.22
import pandas as pd
# Save the Excel workbook in a variable
current_workbook = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')
# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num')
# current output
print(current_worksheet)
| vend_number | vend_name | quantity | source |
| ----------- | --------------------- | --------- | ------- |
| CHARLS | Charlie & Associates | $5,700.00 | Central |
| CHARLS | Charlie & Associates | $5,700.00 | South |
| CHARLS | Charlie & Associates | $5,700.00 | North |
| CHARLS | Charlie & Associates | $5,700.00 | West |
| HUGHES | Hughinos | $3,800.00 | Central |
| HUGHES | Hughinos | $3,800.00 | South |
| FERNAS | Fernanda Industries | $3,500.00 | South |
| FERNAS | Fernanda Industries | $3,500.00 | North |
| FERNAS | Fernanda Industries | $3,000.00 | West |
....
What I want is to remove the duplicate rows based on the columns quantity and source:
1. Review the quantity and source column values:
1.1. If a vendor has more than one row with the same quantity and one of those rows has source Central, drop the repeated rows for that vendor and keep only the Central row.
1.2. Else, if a vendor has more than one row with the same quantity and none of those rows has source Central, drop the repeated rows and keep just one.
Desired result
| vend_number | vend_name | quantity | source |
| ----------- | --------------------- | --------- | ------- |
| CHARLS | Charlie & Associates | $5,700.00 | Central |
| HUGHES | Hughinos | $3,800.00 | Central |
| FERNAS | Fernanda Industries | $3,500.00 | South |
| FERNAS | Fernanda Industries | $3,000.00 | West |
....
So far, I have tried the following code but pandas is not even detecting any duplicate rows.
print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
I have tried to figure out the solution but I am struggling quite a bit with this problem, so any help is greatly appreciated. Feel free to improve the question.
Here is one way.
# Flag the Central rows so they sort to the top
df['CentralFlag'] = (df['source'] == 'Central')
df = df.sort_values('CentralFlag', ascending=False)\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop('CentralFlag', axis=1)
# vend_number vend_name quantity source
# 0 CHARLS Charlie&Associates $5,700.00 Central
# 4 HUGHES Hughinos $3,800.00 Central
# 6 FERNAS FernandaIndustries $3,500.00 South
# 8 FERNAS FernandaIndustries $3,000.00 West
Explanation
Create a flag column and sort by it descending, so Central rows are prioritised.
Drop duplicates on vend_name and quantity (keeping the first row, i.e. the Central one when it exists), then drop the flag column.
You can do it in two steps:
# Keep all rows whose source is Central
s = df.loc[df['source'] == 'Central', :]
# Keep the vendors that have no Central row at all
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]
# Combine, de-duplicating the non-Central vendors on vend_number and quantity
pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])
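Here s holds the Central rows and t holds the vendors that never appear with a Central source, so the concat returns the Central rows for CHARLS and HUGHES plus the de-duplicated FERNAS rows, which reproduces the desired result above.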
I want to look up the value of the lookup_table Code column based on a combination of text from two different columns of the data table. See the example below:
Data:
VMType    Location
DSv3      East Europe
ESv3      East US
ESv3      East Asia
DSv4      Central US
Ca2       Central US
lookup_table:
Type                                  Code
Dv3/DSv3 - Gen Purpose East Europe    abc123
Dv3/D1 - Gen Purpose West US          abc321
Dav4/DSv4 - Gen Purpose Central US    bbb321
Eav3/ESv3 - Hi Tech East Asia         def321
Eav3/ESv3 - Hi Tech East US           xcd321
Csv2/Ca2 - Hi Tech Central US         xcc321
I want to do something like
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] == Data['VMType'] + '*' + Data['Location']
or to remove the wild card it could be evaluated as follows:
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] contains Data['VMType'] AND lookup_table['Type'] contains Data['Location']
Resulting in:
Data:
VMType    Location      new_column
DSv3      East Europe   abc123
ESv3      East US       xcd321
ESv3      East Asia     def321
DSv4      Central US    abc321
Ca2       Central US    xcc321
Ideally this can be done without iterating through the df.
First, extract VMType and Location columns from the lookup_table's Type string, then merge with your data dataframe on those two columns:
# VMType is the token after the "/" in the first part of Type, e.g. "DSv3"
lookup_table['VMType'] = lookup_table['Type'].str.split(' - ').str[0].str.split('/').str[-1]
# Location is the last two words of Type, e.g. "East Europe"
lookup_table['Location'] = lookup_table['Type'].str.split().str[-2:].str.join(' ')
lookup_table = lookup_table[['VMType', 'Location', 'Code']]
data.merge(lookup_table)
Output:
  VMType     Location    Code
0   DSv3  East Europe  abc123
1   ESv3      East US  xcd321
2   ESv3    East Asia  def321
3   DSv4   Central US  bbb321
4    Ca2   Central US  xcc321
Goal: To create a new column of random values in df2 from an existing column in df1. df1 is from the "original" excel file that I am reading, and df2 is a new dataframe I am making in python. I ultimately want three random values from the df1 column in one data cell of the df2 column. Example below.
Given this source excel file (NBA.xlsx):
df1:
| Final_Game_Day | Champion | MVP | String_Example | Average_Viewership_(millions) |
| -------------- | --------- | -------------- | ----------------------------------------------------------- | ----- |
| 10/11/2020 | Lakers | LeBron James | LeBron wins 4th championship; Lakers win record tying 17th | 7.45 |
| 6/13/2019 | Raptors | Kawhi Leonard | Toronto wins 1st championship denying Warriors 3-peat | 15.14 |
| 6/8/2018 | Warriors | Kevin Durant | Warriors sweep Cavs for back-to-back championships | 17.56 |
| 6/12/2017 | Warriors | Kevin Durant | Durant wins first after leaving OKC | 20.38 |
| 6/19/2016 | Cavaliers | LeBron James | Cavs win their 1st ever title | 20.28 |
| 6/16/2015 | Warriors | Andre Iguodala | Iguodala first 6th man to win MVP | 19.94 |
I am looking for something similar to the Sample_Values column below, where each cell holds three random (or the three most common) data values from the respective column in df1.
df2:
| Column_Name | Sample_Values |
| -------- | ------------- |
| Final_Game_Day | 10/11/2020; 6/13/2019; 6/8/2018 |
| Champion | Warriors; Lakers; Raptors |
| MVP | LeBron James; Kevin Durant; Kawhi Leonard |
| String_Example |LeBron wins 4th championship; Lakers win record tying 17th; Toronto wins 1st championship denying Warriors 3-peat; Warriors sweep Cavs for back-to-back championships |
| Average_Viewership_(millions) | 7.45; 15.14; 17.56 |
Sample code to start (I think only the bottom two portions are where I need to add code to get what I want):
### Setting up
import pandas as pd
import os
import glob
### Setting working directory
path = os.getcwd()
files = os.listdir(path)
### Prep to get all files
from os import listdir
from os.path import isfile, join
### Reading only excel files in folder
FileList_xlsx = [f for f in files if f[-4:] == "xlsx"]
# Initializing empty data frame
df = pd.DataFrame()
# Initializing lists for column names and sample values
Agg_ColumnNames = []
SampleValues = []
# Loop over list of Excel files
for f in FileList_xlsx:
    ReadXlsx = pd.read_excel(f)
    ColumnNames = list(ReadXlsx.columns.values)
    # Sets up first column list in df2
    for a in ColumnNames:
        Agg_ColumnNames.append(a)
    for a in ColumnNames:
        ### Missing code here ###
        pass
# Create final dataframe - Need Sample Values at end
final = {'Column_Name': Agg_ColumnNames, }
To get the n most common values from a column, you can use the index that's returned by value_counts:
n = 3
>>> df1["MVP"].value_counts().index[:n].to_list()
['Kevin Durant', 'LeBron James', 'Andre Iguodala']
So, to get your df2, you could get the three most common values from each column in a dictionary comprehension and do a string.join. It would look like this:
df2 = pd.Series({
    column: "; ".join(str(s) for s in df1[column].value_counts().index[:3])
    for column in df1.columns
}).reset_index()
df2.columns = ["Column_Name", "Sample_Values"]
>>> df2
Column_Name Sample_Values
0 Final_Game_Day 10/11/2020; 6/8/2018; 6/12/2017
1 Champion Warriors; Cavaliers; Raptors
2 MVP Kevin Durant; LeBron James; Andre Iguodala
3 String_Example Toronto wins 1st championship denying Warriors...
4 Average_Viewership_(millions) 17.56; 20.38; 7.45
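If you want three random values per column instead of the three most common, the same comprehension can use sample (a rough sketch along the same lines; the name df2_random is just for illustration, and it assumes each column has at least three non-null values):
# Draw three random non-null values from each column and join them into one string
df2_random = pd.Series({
    column: "; ".join(str(s) for s in df1[column].dropna().sample(3))
    for column in df1.columns
}).reset_index()
df2_random.columns = ["Column_Name", "Sample_Values"]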
I've solved a part of this question Wide to long dataset using pandas but still need more help.
I've a dataset which has columns as:
IA01_Raw_Baseline, IA01_Raw_Midline, IA01_Class1_Endline, etc. I want to break these up so that the middle word (Raw, Class1, etc.) stays as a column name, while the IAxx part and the timeline (Baseline, Midline, Endline) become two separate columns. For example, for one row of data:
ID Country Type Region Gender IA01_Raw_EndLine IA01_Raw_Baseline IA01_Raw_Midline IA02_Raw QA_Include QA_Comments
SC1 France A Europe Male 4 8 1 yes N/A
It should become:
ID Country Type Region Gender Timeline IA Raw Class1 Class2 QA_Include QA_Comments
SC1 France A Europe Male Endline IA01 4 N/A N/A yes N/A
SC1 France A Europe Male Baseline IA01 8 N/A N/A yes N/A
SC1 France A Europe Male Midline IA01 1 N/A N/A yes N/A
This is just the conversion of one row; I have 500+ columns with different IA values from 01 to 12 and attributes like Raw, Class1, Class2, Amount, etc., each with Baseline, Midline, and Endline variants.
When converting, I did break them down into columns: idVars contains the columns that will form the index, and valueVars holds the IA01_Raw_Baseline-type columns:
idVars = list(gd_df.columns[0:40]) + list(gd_df.columns[472:527]) #values that will not pivot
valueVars = gd_df.columns[41:472]#.tolist() #value that will pivot
gd_df2 = gd_df.set_index(idVars)
gd_df2.columns = pd.MultiIndex.from_tuples(map(tuple, gd_df2.columns.str.split('_', n=1)))
gd_out = gd_df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
So this code gave me:
As you can see, I got the IA column the way I wanted, but the timeline is still embedded in the column name. What should I change in my code so that it gives this output:
UPDATE:
By doing this:
s=df.set_index(idVars)
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
s.stack([0,2]).reset_index()
s.to_excel(r'gd_out1.xlsx')
I'm getting:
Since you mentioned wide to long, we can use wide_to_long:
s = pd.wide_to_long(df, ['IA01_Raw'],
                    i=['ID', 'Country', 'Type', 'Region', 'Gender', 'IA02_Raw',
                       'QA_Include', 'QA_Comments'],
                    j='Timeline', suffix=r'\w+', sep='_')
s.columns = pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple), names=['IA', 'raw'])
s.stack(0).reset_index()
Out[27]:
raw ID Country Type Region ... QA_Comments Timeline IA Raw
0 SC1 France A Europe ... NaN EndLine IA01 4
1 SC1 France A Europe ... NaN Baseline IA01 8
2 SC1 France A Europe ... NaN Midline IA01 1
[3 rows x 11 columns]
Update
s=df.set_index(['ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include',
'QA_Comments','IA02_Raw'])
s.columns=pd.MultiIndex.from_tuples(s.columns.str.split('_').map(tuple),names =['IA','raw','Timeline'])
s.stack([0,2]).reset_index()
Hi, assuming I have 2 lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences, for example:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a fuzzy string metric (e.g. via difflib) to scan the sentences and compare phrases against the lists with a given similarity cutoff. Hopefully the result of this would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
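One possible approach (a sketch of my own, not a tested answer): compare each list entry against same-length word n-grams of every sentence with difflib.get_close_matches; the 0.8 cutoff is an assumption you may need to tune.
import difflib
import pandas as pd

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
targets = names + places

df = pd.DataFrame({'Sent': ['Mandy went to France on the Eiffel Tower',
                            'Daniele was dancing on top of the box',
                            'I am eating on top of the table',
                            'Maria went to the valley of the kings']})

def fuzzy_matches(sentence, cutoff=0.8):
    words = sentence.split()
    found = []
    for target in targets:
        # build word n-grams of the same length as the target phrase
        n = len(target.split())
        ngrams = [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]
        if difflib.get_close_matches(target, ngrams, n=1, cutoff=cutoff):
            found.append(target)
    return ', '.join(found)

df['Result'] = df['Sent'].apply(fuzzy_matches)
print(df)
This still loops over the target phrases inside the apply, but it keeps explicit row loops out of your own code.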
I have many Excel files that are in different formats. Some of them look like this, which is the normal case: one header row that can be read into pandas directly.
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Miluakee WI 45458
and some of them are in various formats with multiple headers,
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
This is how it looks in Excel (I am pasting an image here to show how it actually looks in Excel).
As you can see above there are three different levels of headers. For sure every file has a row that starts with First Column.
For an individual file like this, I can read like below, which is just fine.
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=[2])
However, I need a final data frame like this (Appended with files that have only one header),
First Column Second Column Address City State Zip
0 House The Clair 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean it up so that it has only one header row. Had I known the index position of the row that I need as the header, I could simply do something like in this post.
However, as some of those files can have multiple headers (I showed 2 extra headers in the example above, and some files have 4) in different formats, I want to iterate through each file and set the row that starts with First Column as the header at the beginning of the file.
Additionally, I want to drop the rows in the middle of the file that contain First Column.
After I create cleaned files whose headers start with First Column, I can append each data frame and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
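A rough sketch of one way to do this (my own assumptions, e.g. the folder path and the '#' column name; treat it as a starting point rather than a tested solution): read each file with header=None, find the first row containing First Column, promote it to the header, and keep only real data rows.
import glob
import pandas as pd

frames = []
for path in glob.glob(r'mypath\*.xlsx'):  # hypothetical folder of Excel files
    raw = pd.read_excel(path, header=None)
    # locate the first row that contains the value 'First Column' and use it as the header
    header_idx = raw.index[(raw == 'First Column').any(axis=1)][0]
    df = raw.iloc[header_idx + 1:].copy()
    df.columns = raw.iloc[header_idx]
    # keep only real data rows, assuming every data row has a number in the '#' column;
    # this drops repeated header rows and title rows such as 'Table2' or 'Comp ID Info'
    df = df[pd.to_numeric(df['#'], errors='coerce').notna()]
    frames.append(df.drop(columns='#'))

final = pd.concat(frames, ignore_index=True)
print(final)
If some files name the '#' column differently, filtering on the First Column cell instead (dropping rows where it is empty or literally equal to 'First Column') should work the same way.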