Dataframe failed to mask rows due to string values - python

I wanted to use column values in one csv file to mask rows in another csv,
as in:
df6 = pd.read_csv('py_all1a.csv')    # file with multiple columns
df7 = pd.read_csv('artexclude1.csv') # file with multiple columns
#
# csv df6 col 1 has the same header and data type as col 8 in df7.
# I want to mask rows in df6 that have a matching col value to any
# in df7. The data in each column is a text value (single word).
#
mask = df6.iloc[:,1].isin(df7.iloc[:,8])
df6[~mask].to_csv('py_all1b.csv', index=False)
#
On that last line, I tried [mask] both with the tilde (which left py_all1b.csv identical to df6) and without it (which produced a file containing only the column headers).
The answer below uses a specific data set. At first it did not work for me because of inconsistencies between the text values: one entry had a stray space while another did not. The answer is correct, and I have added a paragraph showing how the whitespace issue can also be resolved.

Try converting to a set first:
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
This ensures your comparison is against the column's values.
Example
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# 0 1 2
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
# 3 10 11 12
df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
# 0 1 2
# 0 1 2 3
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 1 2 3
With strings
It still works:
df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])
mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
# 0 1 2
# 0 a 2 3
When you are dealing with string data, there may be whitespace problems that cause matches to be missed. As described in this answer, you may instead need to use:
df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
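Note that skipinitialspace only strips whitespace that appears immediately after the delimiter. If the stray spaces can appear anywhere (trailing spaces, for example), a more general sketch, assuming the same files and column positions as above and that both columns hold strings, is to strip both columns before comparing:
left = df6.iloc[:, 1].str.strip()               # column 1 of df6, whitespace removed
right = df7.iloc[:, 8].str.strip()              # column 8 of df7, whitespace removed
mask = left.isin(set(right))
df6[~mask].to_csv('py_all1b.csv', index=False)  # keep only the non-matching rows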

Related

How to find the most frequent value of a column per row, where each column value is a list of values

I have a dataframe that, as a result of a previous group by, contains 5 rows and two columns. column A is a unique name, and column B contains a list of unique numbers that correspond to different factors related to the unique name. How can I find the most common number (mode) for each row?
df = pd.DataFrame({"A": [Name1,Name2,...], "B": [[3, 5, 6, 6], [1, 1, 1, 4],...]})
I have tried:
df['C'] = df[['B']].mode(axis=1)
but this simply creates a copy of the lists from column B. Not really sure how to access each list in this case.
Result should be:
A       B             C
Name1   [3, 5, 6, 6]  6
Name2   [1, 1, 1, 4]  1
Any help would be great.
Here's a method using the statistics module's mode function:
from statistics import mode
Two options:
df["C"] = df["B"].apply(mode)
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
Or
df["C"] = [mode(df["B"][i]) for i in range(len(df))]
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
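One caveat: for multimodal data, statistics.mode raised StatisticsError on Python versions before 3.8; from 3.8 onward it returns the first mode encountered. A small sketch of the difference:
from statistics import mode, StatisticsError

try:
    print(mode([3, 3, 4, 4]))  # two modes; Python >= 3.8 returns 3
except StatisticsError:
    print("multimodal input")  # Python < 3.8 raises instead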
I would use Pandas' .apply() function here. It executes a function on each element in a series. First, we define the function; I'm taking the mode implementation from Find the most common element in a list:
def mode(lst):
    return max(set(lst), key=lst.count)
Then, we apply this function to the B column to get C:
df['C'] = df['B'].apply(mode)
Our output is:
>>> df
A B C
0 Name1 [3, 5, 6, 6] 6
1 Name2 [1, 1, 1, 4] 1

Try to get the cross of 2 series of a pandas table

I am stuck on an issue with a massive pandas table. I would like to detect where two series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})
I would like to add a column to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8],
                   'C': [0, -1, 0, 1]})
So basically I want to get:
0 when there is no cross between series B and A
-1 when series B crosses below series A
1 when series B crosses above series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
m = df['A'].lt(df['B'])
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
output:
A B C
0 1 10 0
1 2 1 -1
2 3 2 0
3 4 8 1
(The original answer included a plot of columns A and B illustrating the crossing points.)
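Note that the downcast argument to fillna is deprecated in recent pandas versions; an equivalent sketch without it:
m = df['A'].lt(df['B'])                                # True where B is above A
df['C'] = m.astype(int).diff().fillna(0).astype(int)   # +1 cross up, -1 cross down, 0 otherwise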

Automatically create multiple python datasets based on column names

I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all columns that end with the same number, with that number pasted as the values of a new column (Bin).
My data frame:
df = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Eas_2": [4, 5, 10, 2],
    "Nor_1": [9, 7, 9, 2],
    "Nor_2": [10, 8, 10, 3],
    "Error_1": [2, 5, 1, 6],
    "Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and paste in the column-name values, but I can separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Nor_1": [9, 7, 9, 2],
    "Error_1": [2, 5, 1, 6],
    "Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those:
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)
for label, data in df.groupby(suffixes, axis=1):
    print('-' * 10, label, '-' * 10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
Eas_1 Nor_1 Error_1
0 3 9 2
1 4 7 5
2 9 9 1
3 1 2 6
---------- 2 ----------
Eas_2 Nor_2 Error_2
0 4 10 5
1 5 8 0
2 10 10 3
3 2 3 2
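Note that groupby(..., axis=1) is deprecated in recent pandas. A minimal sketch that avoids it and also adds the Bin column the question asked for (assuming every column name ends in a number, as above):
dfs = {}
for label in sorted(set(suffixes)):
    cols = df.columns[suffixes == label]   # all columns sharing this suffix
    sub = df[cols].copy()
    sub['Bin'] = int(label)                # paste the suffix number as the Bin value
    dfs[label] = sub

dfs['1']
#    Eas_1  Nor_1  Error_1  Bin
# 0      3      9        2    1
# 1      4      7        5    1
# 2      9      9        1    1
# 3      1      2        6    1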

Convert column suffixes from pandas join into a MultiIndex

I have two pandas DataFrames with (not necessarily) identical index and column names.
>>> df_L = pd.DataFrame({'X': [1, 3],
...                      'Y': [5, 7]})
>>> df_R = pd.DataFrame({'X': [2, 4],
...                      'Y': [6, 8]})
I can join them together and assign suffixes.
>>> df_L.join(df_R, lsuffix='_L', rsuffix='_R')
X_L Y_L X_R Y_R
0 1 5 2 6
1 3 7 4 8
But what I want is to make 'L' and 'R' sub-columns under both 'X' and 'Y'.
The desired DataFrame looks like this:
>>> pd.DataFrame(columns=pd.MultiIndex.from_product([['X', 'Y'], ['L', 'R']]),
...              data=[[1, 5, 2, 6],
...                    [3, 7, 4, 8]])
X Y
L R L R
0 1 5 2 6
1 3 7 4 8
Is there a way I can combine the two original DataFrames to get this desired DataFrame?
You can use pd.concat with the keys argument, along the first axis:
df = (pd.concat([df_L, df_R], keys=['L', 'R'], axis=1)
        .swaplevel(0, 1, axis=1)
        .sort_index(level=0, axis=1))
>>> df
X Y
L R L R
0 1 2 5 6
1 3 4 7 8
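A quick usage note: once the columns are a MultiIndex, you can pull out one side with xs:
>>> df.xs('L', axis=1, level=1)
   X  Y
0  1  5
1  3  7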
For those looking for an answer to the more general problem of joining two data frames with different indices or columns into a multi-index table:
# Prepend a key-level to the column index
# https://stackoverflow.com/questions/14744068
df_L = pd.concat([df_L], keys=["L"], axis=1)
df_R = pd.concat([df_R], keys=["R"], axis=1)
# Join the two dataframes
df = df_L.join(df_R)
# Reorder levels if needed:
df = df.reorder_levels([1,0], axis=1).sort_index(axis=1)
Example:
# Data:
df_L = pd.DataFrame({'X': [1, 3, 5], 'Y': [7, 9, 11]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8], 'Z': [10, 12]})
# Result:
# X Y Z
# L R L R R
# 0 1 2.0 7 6.0 10.0
# 1 3 4.0 9 8.0 12.0
# 2 5 NaN 11 NaN NaN
This also solves the special case of the OP with equal indices and columns; the key level can also be prepended directly:
df_L.columns = pd.MultiIndex.from_product([["L"], df_L.columns])
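Completing that sketch for both frames (assuming the OP's original df_L and df_R with equal indices and columns):
df_R.columns = pd.MultiIndex.from_product([["R"], df_R.columns])
df = df_L.join(df_R).swaplevel(axis=1).sort_index(axis=1)
#    X     Y
#    L  R  L  R
# 0  1  2  5  6
# 1  3  4  7  8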

Drop columns that start with any of a list of strings in Pandas

I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
I entered some code that gave me dataframes with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
    for col in x.columns:
        droplist.append(col)
df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3], [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]], columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
CCC123
0 5
1 1
2 3
3 1
4 1
IIUC (if I understand correctly), let's begin with a dataframe:
df = pd.DataFrame({"A": [0]})
Modify the dataframe to include your columns:
df2 = df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A:
df3 = df2.loc[:, ~df2.columns.str.startswith('A')]
If you need to drop, say, A or B, I would use:
df3 = df2.loc[:, ~(df2.columns.str.startswith('A') | df2.columns.str.startswith('B'))]
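For the OP's case, a more direct sketch uses Python's str.startswith with a tuple of prefixes, so the list of prefixes can grow without changing the logic:
prefixes = ("AAA", "BBB")
df1 = df.drop(columns=[c for c in df.columns if c.startswith(prefixes)])
#    CCC123
# 0       5
# 1       1
# 2       3
# 3       1
# 4       1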
