Find duplicated CSV columns from a list [python pandas]

I want to find duplicate columns from a list, so not just any duplicated columns.
An example of a correct CSV looks like this:
col1, col2, col3, col4, custom, custom
1,2,3,4,test,test
4,3,2,1,test,test
The list looks like this:
columnNames = ['col1', 'col2', 'col3', 'col4']
So when I run something like df.columns.duplicated() I don't want it to detect the duplicate 'custom' fields, only whether there is more than one 'col1' column, or more than one 'col2' column, etc., and return True when one of those columns is found to be duplicated.
I found that when I include a duplicate 'colN' column name (col4 in the example) and print the columns, it shows me Index(['col1', 'col2', 'col3', 'col4', 'col4.1'], dtype='object').
No idea how to write that line of code.

Use Index.isin + Index.duplicated to create a boolean mask:
c = df.columns.str.rsplit('.', n=1).str[0]
mask = c.isin(columnNames) & c.duplicated()
If you want the duplicated column names, use boolean indexing with this mask:
dupe_cols = df.columns[mask]
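For context, here is a minimal, self-contained sketch of how these two lines behave, assuming a CSV with a duplicated col4 as described in the question:
import io
import pandas as pd

csv = "col1,col2,col3,col4,col4,custom,custom\n1,2,3,4,5,test,test\n"
df = pd.read_csv(io.StringIO(csv))  # header becomes [..., 'col4', 'col4.1', 'custom', 'custom.1']

columnNames = ['col1', 'col2', 'col3', 'col4']
c = df.columns.str.rsplit('.', n=1).str[0]  # strip the '.N' suffix pandas appends
mask = c.isin(columnNames) & c.duplicated()
print(df.columns[mask])  # Index(['col4.1'], dtype='object')
print(mask.any())        # True -> one of the watched columns is duplicated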

When you read this CSV file using pandas, you will not get two columns with the same name: the second 'custom' column name gets replaced by 'custom.1', so from that you can tell how many duplicates there are.
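A quick sketch of that mangling behaviour (the '.1' suffix is pandas' default de-duplication of repeated header names):
import io
import pandas as pd

csv = "col1,col2,custom,custom\n1,2,test,test\n"
df = pd.read_csv(io.StringIO(csv))
print(df.columns.tolist())  # ['col1', 'col2', 'custom', 'custom.1']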

Here is one more way to do it using a list comprehension:
import pandas as pd
df = pd.DataFrame([[1,1,2,3,4,"test","test"],[4,4,3,2,1,"test","test"]],
                  columns=["col1", "col1.1", "col2", "col3", "col4", "custom", "custom"])
print(df)
Out[1]:
   col1  col1.1  col2  col3  col4 custom custom
0     1       1     2     3     4   test   test
1     4       4     3     2     1   test   test
columnNames = ['col1', 'col2', 'col3', 'col4']
splitColumns = pd.Index([i.split('.')[0] for i in df.columns])
[False if col not in columnNames else dup for col, dup in zip(splitColumns, splitColumns.duplicated())]
Out[2]: [False, True, False, False, False, False, False]

Related

Python: Is there a way to get all the column names in a new column if a condition is met (e.g. "Yes")

unique id  col1  col2  col3  New Col
1          Yes   No    Yes   col1, col3
2          No    Yes   No    col2
3          Yes   Yes   No    col1, col2
4          No    No    No
I was wondering how I can get the respective column names into a new column called "New Col" when the value is "Yes".
You could apply a list comprehension to each row to return the column names where the value in the column is 'Yes'.
import pandas as pd
data = {'unique id': [1,2,3,4],
        'col1': ['Yes','No','Yes','No'],
        'col2': ['No','Yes','Yes','No'],
        'col3': ['Yes','No','No','No']}
df = pd.DataFrame(data)
df['new_col'] = df.apply(lambda row: ','.join([col for col in df.columns[1:] if row[col] == 'Yes']), axis=1)
print(df)
""" OUTPUT
unique id col1 col2 col3 new_col
0 1 Yes No Yes col1,col3
1 2 No Yes No col2
2 3 Yes Yes No col1,col2
3 4 No No No
"""
I would use pandas.DataFrame.iterrows to solve the problem. It returns each row as a Series, which allows you to do such a comparison.
import pandas as pd

data = {
    'unique id': [1,2,3,4],
    'col1': ['Yes','No','Yes','No'],
    'col2': ['No','Yes','Yes','No'],
    'col3': ['Yes','No','No','No']
}
df = pd.DataFrame(data)

new_col = []
# We only need the Series part of the tuples returned by df.iterrows
for _, row in df.iterrows():
    # Get the labels (column names) that match the desired condition
    match_columns = row[row == 'Yes'].index.tolist()
    new_col.append(match_columns)
# Assign the result back to the DataFrame as a new column
df['New Col'] = new_col
Now df is what you want. Although I would suggest not using a space in a column name (just as a coding convention), so df['new_col'] = new_col may be better; I used 'New Col' only to match your original request.
This is one solution, although I am sure it's not ideal. Please ask for any details.
# import pandas and make fake data
import pandas as pd

data = {'unique id': [1,2,3,4],
        'col1': ['Yes','No','Yes','No'],
        'col2': ['No','Yes','Yes','No'],
        'col3': ['Yes','No','No','No']}
df = pd.DataFrame(data)

# now find all locations where each column contains 'Yes'
mask1 = df['col1'] == 'Yes'
mask2 = df['col2'] == 'Yes'
mask3 = df['col3'] == 'Yes'

# now build the new desired output column using string manipulation
output = []
for m1, m2, m3 in zip(mask1, mask2, mask3):  # for each row of the dataframe
    contains = []  # collect the components of the string in the new column
    if m1:
        contains.append('col1')
    if m2:
        contains.append('col2')
    if m3:
        contains.append('col3')
    output.append(', '.join(contains))  # build the string

# and add the new column of data to the dataframe
df['New Col'] = output
Basic idea:
Define a function get_filtered_colnames that we can apply row by row. Let this function return the names of the columns where the value in that row is 'Yes'.
Apply get_filtered_colnames to every row in df.
import numpy as np
import pandas as pd

# Creating the dataframe.
df = pd.DataFrame(np.zeros((4, 3)), columns=['col1', 'col2', 'col3'])
df['col1'] = ['Yes', 'No', 'Yes', 'No']
df['col2'] = ['No', 'Yes', 'Yes', 'No']
df['col3'] = ['Yes', 'No', 'No', 'No']

# Defining a function that can be applied row by row to the dataframe.
def get_filtered_colnames(row, colnames):
    # Extracting the column indices where the row contains 'Yes'.
    filtered_idxs = np.where(row == 'Yes')[0]
    # Column names where the row contains 'Yes'.
    filtered_colnames = [colnames[filtered_idx] for filtered_idx in filtered_idxs]
    return filtered_colnames

# Applying the above function row by row (hence, axis=1), passing the column names as a parameter.
df['New Col'] = df.apply(get_filtered_colnames, colnames=df.columns.tolist(), axis=1)
print(df)
This gives the following desired output:
  col1 col2 col3       New Col
0  Yes   No  Yes  [col1, col3]
1   No  Yes   No        [col2]
2  Yes  Yes   No  [col1, col2]
3   No   No   No            []
Edit: You can follow #kevinkayaks comment if you want the output as str instead of list.
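For example, converting the list column to a comma-separated string afterwards (my reading of that comment) could look like:
df['New Col'] = df['New Col'].apply(', '.join)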

pandas How to store rows dropped using `drop_duplicates`?

Note: See EDIT below.
I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a data frame that I can update for each .drop or .drop_duplicates operation. Here are 3 examples of the code for which I want to log dropped rows:
df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)
I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement, which allows a log to be generated prior to actually dropping the rows:
df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])
But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.
EDIT
I rewrote my question here to be more precise about the result I want.
I start with a df that has one dupe row:
import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
I then implement the solution from jjp:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])
I print the results:
print(df_keep)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
df_keep is what I expect and want.
print(df_droplog)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
2  sth diff  dupe row    x
df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.
What I want:
print(df_droplog)
# Output:
       col1      col2 col3
2  sth diff  dupe row    x
There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
Since you only want the duplicated rows in df_droplog, append only those to an empty dataframe. What you were doing was appending them to the original dataframe df. Try this:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
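Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on current versions the last line can be written with pd.concat instead:
df_droplog = pd.concat([df_droplog, df.loc[mask]])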

Excel Pandas Python question on IndexingError, can search and remove columns fine containing certain words, but not rows

I am using this code
searchfor = ["s", 'John']
df = df[~df.iloc[1].astype(str).str.contains('|'.join(searchfor),na=False)]
This returns the error
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
However this works fine if run as a column search
df = df[~df.iloc[:,1].astype(str).str.contains('|'.join(searchfor),na=False)]
I am trying to remove a row based on if the row contains a certain phrase
To drop rows
Create a mask which returns True or False depending on whether each cell contains your strings:
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Then use .any to check for at least one True per row, and keep only the rows where none was found, using boolean indexing:
df_filtered = df[~mask.any(axis=1)]
To drop columns
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Use axis=0 instead of 1 to check each column:
columns_analysis = mask.any(axis=0)
Get the indexes that are True, in order to drop them:
columns_to_drop = columns_analysis[columns_analysis].index.tolist()
df_filtered = df.drop(columns_to_drop, axis=1)
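A self-contained run of the row-dropping version, using made-up sample data (the names below are placeholders, not from the question):
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mary'], 'note': ['ok', 'fine']})
search_for = ['s', 'John']

# True wherever a cell contains any of the search strings.
# (pandas 2.1 renamed applymap to DataFrame.map; applymap still works there.)
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
df_filtered = df[~mask.any(axis=1)]
print(df_filtered)  # only the 'Mary' row survives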
This is related to the way you are slicing your data.
In the first statement, you are asking for the second row of your dataframe (index 1 is the second; if you want the first, change the index to 0), while in the second case you are asking for the second column, and these have different shapes, so the resulting boolean masks do not align with the row index the same way. See this example:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [23, 23]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2  col3
0     1     3    23
1     2     4    23
First row:
df.iloc[0]
col1     1
col2     3
col3    23
Name: 0, dtype: int64
Second column:
df.iloc[:, 1]
0    3
1    4
Name: col2, dtype: int64
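To make the alignment issue concrete, here is a small sketch (sample data made up): a row-based mask is indexed by the column labels, so it cannot filter rows, while a column-based mask is indexed by the row index and aligns fine:
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mary'], 'city': ['Paris', 'Oslo']})
searchfor = ['s', 'John']

row_mask = df.iloc[1].astype(str).str.contains('|'.join(searchfor), na=False)
print(row_mask.index.tolist())  # ['name', 'city'] -> cannot index df's rows

col_mask = df.iloc[:, 0].astype(str).str.contains('|'.join(searchfor), na=False)
print(df[~col_mask])  # works: keeps rows whose first column does not match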
Try this, and if you like the answer, vote for it.
Good luck.

Pandas selecting discontinuous columns from a dataframe

I am using the following to select specific columns from the dataframe comb, which I would like to bring into a new dataframe. The individual selections work fine, e.g. comb.ix[:,0:1], but when I attempt to combine them using +, I get a bad result: the first selection ([:,0:1]) gets stuck on the end of the dataframe, and the values contained in the original col 1 are wiped out while appearing at the end of the row. What is the right way to get just the columns I want? (I'd include sample data, but as you may see, there are too many columns, which is why I'm trying to do it this way.)
comb.ix[:,0:1]+comb.ix[:,17:342]
If you want to concatenate a sub selection of your df columns then use pd.concat:
pd.concat([comb.ix[:,0:1],comb.ix[:,17:342]], axis=1)
So long as the indices match then this will align correctly.
Thanks to #iHightower, you can also sub-select by passing the labels:
pd.concat([df.ix[:,'Col1':'Col5'], df.ix[:,'Col9':'Col15']], axis=1)
Note that .ix will be deprecated in a future version, so the following should work:
In [115]:
df = pd.DataFrame(columns=['col' + str(x) for x in range(10)])
df
Out[115]:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9]
Index: []
In [118]:
pd.concat([df.loc[:, 'col2':'col4'], df.loc[:, 'col7':'col8']], axis=1)
Out[118]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
Or using iloc:
In [127]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')]], axis=1)
Out[127]:
Empty DataFrame
Columns: [col2, col3, col7]
Index: []
Note that iloc slicing is half-open: the end of the range is not included, so you'd have to use the column after the column of interest if you want to include it:
In [128]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')+1], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')+1]], axis=1)
Out[128]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
NumPy has a nice helper named r_, allowing you to solve this with the modern DataFrame selection interface, iloc:
df.iloc[:, np.r_[0:1, 17:342]]
I believe this is a more elegant solution.
It even supports more complex selections:
df.iloc[:, np.r_[0:1, 5, 16, 17:342:2, -5:]]
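Both snippets assume import numpy as np. A quick runnable sketch of the pattern (the column names here are invented for the demo):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(2, 10),
                  columns=['col' + str(x) for x in range(10)])
sub = df.iloc[:, np.r_[0:1, 3:6]]   # positions 0 and 3..5 (end exclusive)
print(sub.columns.tolist())        # ['col0', 'col3', 'col4', 'col5']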
I recently solved it by just appending the ranges together:
r1 = pd.Series(range(5))
r2 = pd.Series([10, 15, 20])
final_range = pd.concat([r1, r2])  # Series.append was removed in pandas 2.0
df.iloc[:, final_range]
Then you will get columns 0 through 4, plus 10, 15 and 20.

Add new column in Pandas DataFrame Python [duplicate]

This question already has answers here:
How to add a new column to an existing DataFrame?
(32 answers)
Closed 4 years ago.
I have a dataframe in pandas, for example:
Col1 Col2
A 1
B 2
C 3
Now I would like to add one more column, named Col3, whose value is based on Col2. As a formula: if Col2 > 1, then Col3 is 0, otherwise it is 1. So, in the example above, the output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?
You just do the opposite comparison, Col2 <= 1. This will return a boolean Series with False for values greater than 1 and True for the others. If you convert it to the int64 dtype, True becomes 1 and False becomes 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any numbers to Col3 depending on the value of Col2, you could do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
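If NumPy is already imported, np.where expresses the same two-way choice in one line; this is a standard alternative, not one of the answers above:
import numpy as np
df['Col3'] = np.where(df['Col2'] > 1, 0, 1)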
The easiest way that I found for adding a column to a DataFrame was to use the "add" function. Here's a snippet of code, also with the output to a CSV file. Note that including the "columns" argument allows you to set the name of the column (which happens to be the same as the name of the np.array that I used as the source of the data).
# now to create a pandas data frame
df = pd.DataFrame(data=FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data=FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=scaled, columns=['scaled'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add(df2, fill_value=0)
#print(df)
df.to_csv('FF_data_frame.csv')
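For reference, a self-contained sketch of this add trick with placeholder arrays (the names below are invented): add aligns on index and columns, and fill_value=0 treats the missing side as 0, so two frames with disjoint columns end up side by side:
import numpy as np
import pandas as pd

a = np.array([1.0, 2.0, 3.0])  # stand-in for FF_maxRSSBasal
b = np.array([4.0, 5.0, 6.0])  # stand-in for FF_maxRSSPrism

df = pd.DataFrame(data=a, columns=['a'])
df2 = pd.DataFrame(data=b, columns=['b'])
df = df.add(df2, fill_value=0)  # union of columns; no overlap, so values pass through
print(df)
#      a    b
# 0  1.0  4.0
# 1  2.0  5.0
# 2  3.0  6.0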
