Dropping duplicate rows but keeping certain values in pandas (Python)

I have two similar dataframes that I concatenated; they have a lot of repeated values because they are basically the same data set but for different years.
The problem is that one of the sets is missing some values that the other sometimes has.
For example:
Name   Unit  Year  Level
Nik    1     2000  12
Nik    1           12
John   2     2001  11
John   2     2001  11
Stacy  1           8
Stacy  1     1999  8
...
I want to drop duplicates on the subset = ['Name', 'Unit', 'Level'] since some repetitions don't have years.
However, that sometimes leaves me with the rows that have no Year, and I'd like to keep the rows that do have it:
Name   Unit  Year  Level
Nik    1     2000  12
John   2     2001  11
Stacy  1     1999  8
...
How do I keep these values rather than the blanks?

Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
Another solution is GroupBy.first, which returns the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
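If more columns than Year had gaps, the same idea extends to all non-key columns at once (a hedged sketch, reusing the subset list from above):
df = df.groupby(subset, as_index=False, sort=False).first()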

One solution that comes to mind is to first sort the concatenated dataframe by year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
and then drop duplicates with the keep='first' parameter:
df.drop_duplicates(subset=['Name', 'Unit', 'Level'], keep="first")
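Putting both steps together and assigning the result back, a minimal sketch (assuming the same column names as in the question):
df = (df.sort_values('Year')  # NaN years sort last by default
        .drop_duplicates(subset=['Name', 'Unit', 'Level'], keep='first'))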

I would suggest that you look at the creation step of your merged dataset.
When merging the data sets you can do so on multiple keys, i.e.
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to merge the two Year columns, which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
A small working example follows:
import pandas as pd
import numpy as np
left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
                     'Unit': ['2', '4', '6', '2', '4', '12'],
                     'Year': ['', '2009', '1954', '2025', '2012', '2024'],
                     'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
                      'Unit': ['2', '4', '6', '2'],
                      'Year': ['2010', '2009', '1954', '2025'],
                      'Level': ['L1', 'L1', 'L0', 'L4']})
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
df
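As noted above, the only step left afterwards is to drop the Year_r helper column. A hedged, vectorized variant of the same gap-filling (treating empty strings as missing and using fillna instead of the row-wise lambda) could look like this:
df['Year'] = df['Year'].replace('', np.nan).fillna(df['Year_r'])
df = df.drop(columns=['Year_r'])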

Related

JSON list flatten to dataframe as multiple columns with prefix

I have a JSON with some nested/array items like the one below.
I'm looking at flattening it before saving it into a CSV:
[{'SKU': 'SKU1', 'name': 'test name 1',
  'ItemSalesPrices': [{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600},
                      {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}]},
 {'SKU': 'SKU2', 'name': 'test name 2',
  'ItemSalesPrices': [{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}]}]
I have attempted the good solution here, flattern nested JSON and retain columns (or Panda json_normalize), but got nowhere, so I'm hoping to get some tips from the community. The desired output is:
SKU   Name         ItemSalesPrices_OEM_UnitPrice  ItemSalesPrices_OEM_AssetNumber  ItemSalesPrices_RRP_UnitPrice  ItemSalesPrices_RRP_AssetNumber
SKU1  test name 1  1600                           TEST1A                           1500                           TEST1B
SKU2  test name 2                                                                  1500                           TEST2
Thank you
Use json_normalize:
first = ['SKU','name']
df = pd.json_normalize(L,'ItemSalesPrices', first)
print (df)
  SourceNumber AssetNumber  UnitPrice   SKU         name
0          OEM      TEST1A       1600  SKU1  test name 1
1          RRP      TEST1B       1500  SKU1  test name 1
2          RRP       TEST2       1500  SKU2  test name 2
Then you can pivot the values - if numeric, use sum; if strings, use join:
f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x)
df1 = (df.pivot_table(index=first,
                      columns='SourceNumber',
                      aggfunc=f))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.rename_axis(None, axis=1).reset_index()
print (df1)
SKU name AssetNumber_OEM AssetNumber_RRP UnitPrice_OEM \
0 SKU1 test name 1 TEST1A TEST1B 1600.0
1 SKU2 test name 2 NaN TEST2 NaN
UnitPrice_RRP
0 1500.0
1 1500.0
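For reference, the json_normalize call above assumes the input list from the question is bound to a variable named L, e.g.:
import pandas as pd

L = [{'SKU': 'SKU1', 'name': 'test name 1',
      'ItemSalesPrices': [{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600},
                          {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}]},
     {'SKU': 'SKU2', 'name': 'test name 2',
      'ItemSalesPrices': [{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}]}]

df = pd.json_normalize(L, 'ItemSalesPrices', ['SKU', 'name'])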

How to merge several columns into one column with several records using python and pandas?

I have data which I need to transform in order to get 2 columns instead of 4:
data = [['123', 'Billy', 'Bill', 'Bi'],
['234', 'James', 'J', 'Ji'],
['543', 'Floyd', 'Flo', 'F'],
]
processed_data = ?
needed_df = pandas.DataFrame(processed_data, columns=['Number', 'Name'])
I expect the following behaviour:
['123', 'Billy']
['123', 'Bill']
['123', 'Bi']
['234', 'James']
['234', 'J']
['234', 'Ji']
I've tried to use a nested for loop but I'm getting the wrong result:
for row in df.iterrows():
    for col in df.columns:
        new_row = ...
        processed_df = pandas.concat(df, new_row)
Such a construction gives too big a result.
The similar question using sql:
How to split several columns into one column with several records in SQL?
Or, you can convert your existing data into a dataframe and then reshape it with pandas melt:
import pandas as pd
data = [['123', 'Billy', 'Bill', 'Bi'],
['234', 'James', 'J', 'Ji'],
['543', 'Floyd', 'Flo', 'F'],
]
df = pd.DataFrame(data)
df.melt(0).sort_values(0)
Output:
0 variable value
0 123 1 Billy
3 123 2 Bill
6 123 3 Bi
1 234 1 James
4 234 2 J
7 234 3 Ji
2 543 1 Floyd
5 543 2 Flo
8 543 3 F
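To match the needed_df layout exactly (only the Number and Name columns), a possible follow-up, shown here as a sketch, is to drop the variable column and rename:
needed_df = (df.melt(0)
               .sort_values(0)
               .drop(columns='variable')
               .rename(columns={0: 'Number', 'value': 'Name'})
               .reset_index(drop=True))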
Or use a list comprehension to create pairs of Number and Name, then create a new dataframe:
pd.DataFrame([[x, z] for x, *y in data for z in y], columns=['Number', 'Name'])
Number Name
0 123 Billy
1 123 Bill
2 123 Bi
3 234 James
4 234 J
5 234 Ji
6 543 Floyd
7 543 Flo
8 543 F

Keep duplicated row from a specific data frame after append

I have two data frames (df1 & df2). After appending them, I find two duplicate rows (the same values in the three columns ["ID", "City", "Year"]). I would like to keep the one of the duplicate rows that comes from df2.
import pandas as pd
data1 = {'ID': [7, 2],
         'City': ["Berlin", "Paris"],
         'Year': [2012, 2000],
         'Number': [62, 43]}
data2 = {'ID': [7, 5],
         'City': ["Berlin", "London"],
         'Year': [2012, 2019],
         'Number': [60, 100]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df_merged = df1.append(df2)
Is there a way to do this ?
Expected output:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100
new = (pd.concat([df1, df2])
         .drop_duplicates(subset=["ID", "City", "Year"],
                          keep="last",
                          ignore_index=True))
append will be gone in the near future, so please use pd.concat there. Then drop_duplicates over the said columns with keep="last":
In [376]: df1
Out[376]:
ID City Year Number
0 7 Berlin 2012 62
1 2 Paris 2000 43
In [377]: df2
Out[377]:
ID City Year Number
0 7 Berlin 2012 60
1 5 London 2019 100
In [378]: (pd.concat([df1, df2])
...: .drop_duplicates(subset=["ID", "City", "Year"],
...: keep="last",
...: ignore_index=True))
Out[378]:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100
ignore_index makes the index 0, 1, 2 again after drop_duplicates disturbs it.
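If you cannot rely on df2 being concatenated last, a hedged variation is to tag each frame's origin in a temporary helper column (called source here, purely for illustration) and sort on it before dropping duplicates:
merged = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')])
out = (merged.sort_values('source')  # rows from df2 end up last within each duplicate group
             .drop_duplicates(subset=['ID', 'City', 'Year'], keep='last')
             .drop(columns='source')
             .reset_index(drop=True))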

How to get a value from a row that is next day in Python Pandas?

I am trying to get a value from another row which is "next day" data for each person. Let's say I have this example dataset:
import pandas as pd
data = {'date': [20210701, 20210703, 20210704, 20210703, 20210705, 20210705],
        'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
        'a': [1, 0, 1, 1, 1, 0]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
I am trying to create another column with a value of column 'a' of the next day.
So, I created a 'next_day' column with:
df['next_date'] = df['date'] + pd.Timedelta(days=1)
but I am stuck on the next step.
The final data frame should look like this:
import pandas as pd
import numpy as np

data = {'date': [20210701, 20210703, 20210704, 20210703, 20210704, 20210705],
        'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
        'a': [1, 0, 1, 1, 1, 0],
        'new_column': [np.nan, 1, np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
As you can see, the new column takes the value from the next day for each person and takes NaN for the ones that there is no data.
You can utilize numpy where to check for the wanted conditions and df.shift to grab next row values:
df['new_column'] = np.where((df['name'].shift(-1) == df['name']) &
                            (df['next_date'] == df['date'].shift(-1)),
                            df['a'].shift(-1), np.nan)
This seems to work:
date name a next_date
0 2021-07-01 Dave 1 2021-07-02
1 2021-07-03 Dave 0 2021-07-04
2 2021-07-04 Dave 1 2021-07-05
3 2021-07-03 Sue 1 2021-07-04
4 2021-07-05 Sue 1 2021-07-06
5 2021-07-05 Ann 0 2021-07-06
df['next_date'] = df['next_date'].apply(lambda x:df.loc[df.date==x, 'a'])
date name a next_date
0 2021-07-01 Dave 1 NaN
1 2021-07-03 Dave 0 1.0
2 2021-07-04 Dave 1 NaN
3 2021-07-03 Sue 1 1.0
4 2021-07-05 Sue 1 NaN
5 2021-07-05 Ann 0 NaN
Update: Taking 'name' into account
Here is a solution. In order to account for the name as well, we can apply a function to the dataframe as a whole. As it's more complex, define it first,
def get_next_a(x):
    # get the relevant rows: same name, date equal to this row's next_date
    values = df.loc[(df['name'] == x['name']) & (df.date == x.next_date), 'a']
    # return the first matching value, or np.nan if no match is found
    return next((v for v in values), np.nan)
and apply it afterwards:
df['new_column'] = df.apply(get_next_a, axis=1)
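As an aside, the same next-day lookup can also be written as a left self-merge on name and next_date; a hedged sketch, assuming the next_date column from the question has already been created:
lookup = df[['name', 'date', 'a']].rename(columns={'date': 'next_date', 'a': 'new_column'})
df = df.merge(lookup, on=['name', 'next_date'], how='left')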

How can I compare the row values of select columns with the same columns in another dataframe?

I have two data frames with headers as follows:
df1 = pd.DataFrame(columns=['STATE', 'COUNTY', 'QUANTITY'])
df2 = pd.DataFrame(columns=['FIPS', 'STATE', 'COUNTY'])
I want to create a 3rd data frame:
df3 = pd.DataFrame(columns=['FIPS', 'QUANTITY'])
Such that each row in df1 will have its state and county values compared with every row in df2 until a match is found. Once a match is found, the 'FIPS' value from df2 and the 'QUANTITY' value from df1 will be appended to df3.
Basically, I want a data frame that has the FIPS values and Quantity Values per county / state and the csv that I am reading doesn't come with FIPS values.
The Code:
import pandas as pd
import numpy as np
a = [['1', '5', '10'], ['2', '6', '12'], ['3', '7', '11']]
b = [['005', '2', '6'], ['101', '1', '5'], ['201', '3', '7']]
df1 = pd.DataFrame(a, columns=['STATE', 'COUNTY', 'QUANTITY'])
df2 = pd.DataFrame(b, columns=['FIPS', 'STATE', 'COUNTY'])
df3 = pd.DataFrame(columns=['FIPS', 'QUANTITY'])
print(df1)
print(df2)
df3['QUANTITY'] = np.where((df1['STATE'] == df2['STATE']) &
                           (df1['COUNTY'] == df2['COUNTY']),
                           df1['QUANTITY'], np.nan)
df3['FIPS'] = np.where((df1['STATE'] == df2['STATE']) &
                       (df1['COUNTY'] == df2['COUNTY']),
                       df2['FIPS'], np.nan)
Has the Result:
STATE COUNTY QUANTITY
0 1 5 10
1 2 6 12
2 3 7 11
FIPS STATE COUNTY
0 005 2 6
1 101 1 5
2 201 3 7
FIPS QUANTITY
0 NaN NaN
1 NaN NaN
2 201 11
I'm looking for something that gives me:
STATE COUNTY QUANTITY
0 1 5 10
1 2 6 12
2 3 7 11
FIPS STATE COUNTY
0 005 2 6
1 101 1 5
2 201 3 7
FIPS QUANTITY
0 101 10
1 005 12
2 201 11
I am comfortable doing such computations in VBA, C++ and MATLAB however I have no clue how to compare elemental indexes of dataframes in python.
Use DataFrame.merge with the default inner join and then select only the columns you need:
df3 = df1.merge(df2, on=['STATE','COUNTY'])[['FIPS','QUANTITY']]
print (df3)
FIPS QUANTITY
0 101 10
1 005 12
2 201 11
Maybe you can try something like this:
df3 = pd.merge(df1, df2, left_on = ['STATE', 'COUNTY'], right_on= ['STATE', 'COUNTY']) # merge the two dataframes with STATE and COUNTY as join keys
df3 = df3.drop(['STATE', 'COUNTY'], axis = 1) # drop columns you don't need
df3
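If some STATE/COUNTY pairs in df1 have no match in df2 and you still want to keep them (with NaN in FIPS), a hedged variation is a left join instead of the default inner join:
df3 = df1.merge(df2, on=['STATE', 'COUNTY'], how='left')[['FIPS', 'QUANTITY']]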
