pandas How to store rows dropped using `drop_duplicates`?

Note: See EDIT below.
I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a DataFrame that I can update after each .drop or .drop_duplicates operation. Here are 3 examples of the code for which I want to log dropped rows:
df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)
I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement and so allows a log to be generated prior to actually dropping the rows:
df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])
But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.
EDIT
I rewrote my question here to be more precise about the result I want.
I start with a df that has one dupe row:
import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
I then implement the solution from jjp:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])
I print the results:
print(df_keep)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
df_keep is what I expect and want.
print(df_droplog)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
2  sth diff  dupe row    x
df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.
What I want:
print(df_droplog)
# Output:
       col1      col2 col3
2  sth diff  dupe row    x

There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_droplog = pd.concat([df_droplog, df.loc[mask]])
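To check the idea on the sample data from the question's EDIT, here is a minimal sketch. It assumes only that the mask from duplicated is built with the same subset and keep arguments that would be passed to drop_duplicates; nothing is appended at all, the log is simply the masked rows.

```python
import numpy as np
import pandas as pd

# Sample data from the question: rows 0 and 2 share col2/col3.
df = pd.DataFrame(
    [["whatever", "dupe row", "x"],
     ["idx 1", "uniq row", np.nan],
     ["sth diff", "dupe row", "x"]],
    columns=["col1", "col2", "col3"],
)

# True for every row that drop_duplicates(keep="first") would remove.
mask = df.duplicated(subset=["col2", "col3"], keep="first")

df_keep = df.loc[~mask]    # the rows that survive
df_droplog = df.loc[mask]  # the rows that were dropped

# Sanity check: the mask reproduces drop_duplicates exactly.
assert df_keep.equals(df.drop_duplicates(subset=["col2", "col3"], keep="first"))
```

On this data df_droplog holds only the row with index 2, which is the result the question asks for.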

Since you only want the duplicated rows in df_droplog, add only those rows to an empty DataFrame. What you were doing was appending them to the original DataFrame df. Try this (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = pd.concat([df_droplog, df.loc[mask]])
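The question also asks about logging rows removed by plain .drop calls. One possible sketch, assuming the index is unique so that index labels identify rows, is to compare the index before and after each operation. The helper name dropped_rows, the "reason" column, and the sample data are all made up for illustration.

```python
import pandas as pd

def dropped_rows(before, after, reason):
    """Return the rows of `before` whose index labels are missing from `after`.

    Assumes `before` has a unique index.
    """
    dropped = before.loc[before.index.difference(after.index)].copy()
    dropped["reason"] = reason  # hypothetical extra column describing the step
    return dropped

# Made-up data shaped like the question's columns.
df = pd.DataFrame({
    "owner": ["ann", "ann", "bob", "cy"],
    "job_number": [1, 1, 2, 3],
    "submission_time": pd.to_datetime(
        ["2018-01-05", "2018-02-01", "2017-12-31", "2018-06-30"]),
})

log_parts = []  # collect the pieces and concatenate once at the end

# Step 1: drop duplicates.
deduped = df.drop_duplicates(subset=["owner", "job_number"], keep="first")
log_parts.append(dropped_rows(df, deduped, "duplicate"))

# Step 2: drop rows by a condition.
in_2018 = deduped.drop(deduped[deduped.submission_time.dt.year != 2018].index)
log_parts.append(dropped_rows(deduped, in_2018, "not 2018"))

df_droplog = pd.concat(log_parts)
```

Collecting the pieces in a list and calling pd.concat once avoids repeatedly growing a DataFrame, and the same helper works for any operation that returns a subset of the rows.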

Related

how to sum up columns from different dataframes into a single dataframe in pandas

Sample data
import pandas as pd
df1 = pd.DataFrame()
df1["Col1"] = [0,2,4,6,2]
df1["Col2"] = [5,1,3,4,0]
df1["Col3"] = [8,0,5,1,7]
df1["Col4"] = [1,4,6,0,8]
#df1_new = df1.iloc[:, 1:3]
df2 = pd.DataFrame()
df2["Col1"] = [8,2,4,6,2,3,5]
df2["Col2"] = [3,7,3,4,0,6,8]
df2["Col3"] = [5,0,5,1,7,9,1]
df2["Col4"] = [0,4,6,0,8,6,0]
#df2_new = df1.iloc[:, 1:3]
dataframes = [df1, df2]
for df in dataframes:
    df_new = df.iloc[:, 1:3]
    print(df_new.sum(axis=0))
Result from above looks like this:
Col2    13
Col3    21
dtype: int64
Col2    31
Col3    28
dtype: int64
But how can I sum up both dataframes and put it into a single one?
Result should look like this:
Real example looks like this:
import glob
import os
import pandas as pd

xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
#print(xlsx_files)
# loop over the list of Excel files
for f in xlsx_files:
    # create df from each Excel file
    dfs = pd.read_excel(f)
    # grab file name to use it in summarized df
    file_name = f.split("\\")[-1]
    new_df = pd.concat([dfs]).iloc[:,13:28].sum()

You can either sum the dataframes separately and then add the results, or sum the concatenated dataframes:
df1.iloc[:,1:3].sum() + df2.iloc[:,1:3].sum()
pd.concat([df1,df2]).iloc[:,1:3].sum()
In both cases the result is
Col2 44
Col3 49
dtype: int64
You can convert the result from a series to a DataFrame and transpose using
.to_frame().T
to get this output:
Col2 Col3
0 44 49
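As a runnable check of the two approaches on the sample data from the question, here is a short sketch; the variable names separate, combined, and total are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 2, 4, 6, 2], "Col2": [5, 1, 3, 4, 0],
                    "Col3": [8, 0, 5, 1, 7], "Col4": [1, 4, 6, 0, 8]})
df2 = pd.DataFrame({"Col1": [8, 2, 4, 6, 2, 3, 5], "Col2": [3, 7, 3, 4, 0, 6, 8],
                    "Col3": [5, 0, 5, 1, 7, 9, 1], "Col4": [0, 4, 6, 0, 8, 6, 0]})

# Approach 1: sum each frame, then add the two Series (aligned on column name).
separate = df1.iloc[:, 1:3].sum() + df2.iloc[:, 1:3].sum()

# Approach 2: stack the rows of both frames, then sum once.
combined = pd.concat([df1, df2]).iloc[:, 1:3].sum()

assert separate.equals(combined)

# Turn the Series into a one-row DataFrame.
total = combined.to_frame().T
```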
For the code in your updated question, you probably want something like this:
xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
#print(xlsx_files)
# loop over the list of Excel files
new_df = pd.DataFrame()
for f in xlsx_files:
    # create df from each Excel file
    dfs = pd.read_excel(f)
    # grab file name to use it in summarized df
    file_name = f.split("\\")[-1]
    # add this file's rows to the combined frame
    new_df = pd.concat([new_df, dfs])
# sum the selected columns once, after all files have been read
result = new_df.iloc[:,13:28].sum()
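If the file name is meant to label each file's sums in the summarized frame, one possible layout is a row per file plus a total row. The sketch below uses small in-memory frames as hypothetical stand-ins for what pd.read_excel would return, and the file names are made up.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_excel(f), keyed by a made-up file name.
frames = {
    "january.xlsx": pd.DataFrame({"Col1": [0, 2], "Col2": [5, 1], "Col3": [8, 0]}),
    "february.xlsx": pd.DataFrame({"Col1": [8, 2], "Col2": [3, 7], "Col3": [5, 0]}),
}

# One row of column sums per file, indexed by file name.
summary = pd.DataFrame(
    {name: frame.iloc[:, 1:3].sum() for name, frame in frames.items()}
).T

# Grand total across all files as a final row.
summary.loc["total"] = summary.sum()
```

In the real loop, the dict could be filled with frames[file_name] = dfs inside the for block.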

Here is another way to go about it:
Add the column sums of the two DataFrames, convert the result to a DataFrame, transpose it, and then select Col2 and Col3.
(df1.sum() + df2.sum()).to_frame().T[['Col2','Col3']]
Col2 Col3
0 44 49

Get the columnwise sums of both dataframes, take the middle two columns of each, and add them together. Then, transpose the result to turn the rows into columns:
pd.DataFrame((df1.iloc[:, 1:3].sum() + df2.iloc[:, 1:3].sum())).T
This outputs:
Col2 Col3
0 44 49

Here is one way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short[["Col2", "Col3"]].reindex(long.index, fill_value=0) + long[["Col2", "Col3"]]).sum().to_frame().T)
Or, if you need to use iloc for the columns, here is another way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short.iloc[:, 1:3].reindex(long.index, fill_value=0) + long.iloc[:, 1:3]).sum().to_frame().T)
Output (same for both):
Col2 Col3
0 44 49

Converting 3 Columns into 2 Column - Python

I need to convert 3 columns into 2 columns using Python.
Input:
col1  col2  col3
A     2     3
B     4     5
Expected output:
col1  col2
A     2
A     3
B     4
B     5
My code:
import csv

hdr = ['col1', 'col2']
final_output = []
# rows is assumed to be an iterable of dicts keyed by 'col1', 'col2', 'col3'
for row in rows:
    # emit one output row per value column, repeating col1 each time
    output = {'col1': row.get('col1'), 'col2': row.get('col2')}
    output1 = {'col1': row.get('col1'), 'col2': row.get('col3')}
    final_output.append(output)
    final_output.append(output1)
with open('tgt_file.csv', 'w', newline='') as tgt_file:
    csv_writer = csv.DictWriter(tgt_file, fieldnames=hdr, delimiter=',')
    csv_writer.writeheader()
    csv_writer.writerows(final_output)

import pandas as pd
### this is the sample data
df = pd.DataFrame(data=[['A', 2, 3], ['B', 4, 5]],
                  columns=['col1', 'col2', 'col3'])
### this is the solution
ef = []  # create an empty list
for i, row in df.iterrows():
    ef.append([row['col1'], row['col2']])  # pair col1 with the col2 value first
    ef.append([row['col1'], row['col3']])  # then pair col1 with the col3 value
df = pd.DataFrame(data=ef, columns=['col1', 'col2'])  # recreate the dataframe
Remark: more advanced solutions are possible, but I think this one is readable.

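You can try using pd.melt. The melted values get a temporary name here, because recent pandas versions raise an error when value_name matches an existing column such as 'col2':
df = pd.melt(df, id_vars=["col1"], value_name="value").drop(["variable"], axis=1).rename(columns={"value": "col2"})
And then you can sort the dataframe on "col1".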
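Put together on the sample data, the melt approach might look like the sketch below. The name long_df is only for illustration, and the stable sort is an assumption about the desired order (col2 values before col3 values within each col1 key).

```python
import pandas as pd

df = pd.DataFrame([["A", 2, 3], ["B", 4, 5]], columns=["col1", "col2", "col3"])

long_df = (
    df.melt(id_vars="col1", value_name="value")  # columns: col1, variable, value
      .drop(columns="variable")                  # the source column name is not needed
      .rename(columns={"value": "col2"})
      .sort_values("col1", kind="stable")        # keep melt order within each key
      .reset_index(drop=True)
)
```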

How do you drop a column by index?

When I run this code it drops the first row instead of the first column:
df.drop(axis=1, index=0)
How do you drop a column by index?

You can use df.columns[i] to denote the column. Example:
df.drop(df.columns[0], axis=1)
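A small sketch of the same idea, extended to several positions and to a purely positional alternative with iloc. The frame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Look up the label at position 0, then drop by label.
first_dropped = df.drop(columns=df.columns[0])

# Several positions at once: index the column Index with a list.
several_dropped = df.drop(columns=df.columns[[0, 2]])

# Purely positional alternative: keep everything except position 0.
kept_by_position = df.iloc[:, 1:]
```

Note that dropping by label removes every column carrying that label, so the iloc form is the safer choice when column names are duplicated.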

Using the example
df = pd.DataFrame([
[1023.423,12.59595],
[1000,11.63024902],
[975,9.529815674],
[100,-48.20524597]], columns = ['col1', 'col2'])
       col1       col2
0  1023.423  12.595950
1  1000.000  11.630249
2   975.000   9.529816
3   100.000 -48.205246
If you do df.drop(index=0), the row with index 0 is dropped:
     col1       col2
1  1000.0  11.630249
2   975.0   9.529816
3   100.0 -48.205246
If you do df.drop('col1', axis=1), the column named 'col1' is dropped:
        col2
0  12.595950
1  11.630249
2   9.529816
3 -48.205246
df.drop returns a new DataFrame and leaves df unchanged, so assign the result back or use inplace=True where necessary.
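To make the point about inplace concrete, here is a minimal sketch using the example data above; the name result is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1023.423, 1000, 975, 100],
                   "col2": [12.59595, 11.63024902, 9.529815674, -48.20524597]})

# drop returns a new DataFrame; df itself still has both columns.
result = df.drop("col1", axis=1)
assert list(df.columns) == ["col1", "col2"]

# Rebinding the name keeps the change.
df = df.drop("col1", axis=1)
# The in-place form is df.drop("col1", axis=1, inplace=True), which returns None.
```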

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?

Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
   A  B
0  1  3
1  2  4
2  3  5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
   B  A
0  1  3
1  2  4
2  3  5

This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names

If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)

Pandas reindex could help:
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols
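Since the question's tables are not shown, here is a sketch on made-up data with the column names col1 to col5. It swaps the values by assigning a NumPy array, which makes the assignment purely positional while the column names stay where they are.

```python
import pandas as pd

# Hypothetical data; the question's original table is not shown.
df = pd.DataFrame({"col1": ["x", "y"], "col2": [1, 2], "col3": [3, 4],
                   "col4": [5, 6], "col5": [7, 8]})

# The right-hand side is evaluated first, so both pairs are read before
# anything is overwritten.
df[["col2", "col3", "col4", "col5"]] = df[["col4", "col5", "col2", "col3"]].to_numpy()
```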

Add new column in Pandas DataFrame Python [duplicate]

This question already has answers here:
How to add a new column to an existing DataFrame?
(32 answers)
Closed 4 years ago.
I have dataframe in Pandas for example:
Col1 Col2
A 1
B 2
C 3
Now I would like to add one more column named Col3 whose value is based on Col2: if Col2 > 1, then Col3 is 0; otherwise it is 1. So, for the example above, the output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?

You just do the opposite comparison: Col2 <= 1. This returns a boolean Series with False for the values greater than 1 and True for the others. If you convert it to an integer dtype, True becomes 1 and False becomes 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any number to Col3 depending on the value of Col2 you should do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
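Another common option for a two-way choice is numpy.where, which picks between two values element by element. A minimal sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": ["A", "B", "C"], "Col2": [1, 2, 3]})

# Where the condition holds use 0, elsewhere use 1.
df["Col3"] = np.where(df["Col2"] > 1, 0, 1)
```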

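The easiest way that I found for adding a column to a DataFrame was to use the "add" method (DataFrame.add). Here's a snippet of code, also with the output to a CSV file. Note that including the "columns" argument allows you to set the name of the column (which happens to be the same as the name of the np.array that I used as the source of the data). This works because add aligns the two frames on their labels, and fill_value=0 stands in for the side that lacks a column, so each new column comes through as its own values plus 0. It therefore only suits numeric data.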
# now to create a PANDAS data frame
df = pd.DataFrame(data = FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data = FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = scaled, columns=['scaled'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add( df2, fill_value=0 )
#print(df)
df.to_csv('FF_data_frame.csv')
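For comparison, plain column assignment or assign adds columns without relying on arithmetic. This is only a sketch: the three arrays are hypothetical stand-ins for the arrays named in the answer above and are assumed to have equal length.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the arrays used in the answer above.
ff_basal = np.array([0.1, 0.2, 0.3])
ff_prism = np.array([1.0, 2.0, 3.0])
scaled = np.array([10, 20, 30])

df = pd.DataFrame({"FF_maxRSSBasal": ff_basal})

# Plain assignment adds a column in place.
df["FF_maxRSSPrism"] = ff_prism

# assign() returns a new frame with the extra column.
df = df.assign(scaled=scaled)
```

Both forms keep the columns in insertion order and work for non-numeric data as well.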
