I have a data frame that has multiple columns, example:
   Prod_A  Prod_B  Prod_C  State  Region
1       1       0       1      1       1
I would like to drop all columns that start with Prod_ (I can't select or drop them by name because the data frame has 200 variables).
Is it possible to do this?
Thank you
Use str.startswith to build a boolean mask, then select the columns to keep with loc and boolean indexing:
df = df.loc[:, ~df.columns.str.startswith('Prod')]
print(df)

   State  Region
1      1       1
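For reference, a minimal self-contained sketch with a toy frame shaped like the example above (the values are made up):

import pandas as pd

# toy frame shaped like the example in the question (values are made up)
df = pd.DataFrame({'Prod_A': [1], 'Prod_B': [0], 'Prod_C': [1],
                   'State': [1], 'Region': [1]}, index=[1])

# keep only the columns whose name does not start with 'Prod'
df = df.loc[:, ~df.columns.str.startswith('Prod')]
print(df)
#    State  Region
# 1      1       1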
First, select all columns to be deleted:
unwanted = df.columns[df.columns.str.startswith('Prod_')]
Then, drop them all:
df.drop(unwanted, axis=1, inplace=True)
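If you prefer not to modify df in place, the same idea works with the columns keyword (a small sketch of the non-inplace form):

unwanted = df.columns[df.columns.str.startswith('Prod_')]
df = df.drop(columns=unwanted)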
We can also use a negative-lookahead regex with DataFrame.filter:
In [269]: df.filter(regex=r'^(?!Prod_).*$')
Out[269]:
   State  Region
1      1       1
Drop all rows where the path column starts with /var:
df = df[~df['path'].map(lambda x: (str(x).startswith('/var')))]
This can be further simplified to:
df = df[~df['path'].str.startswith('/var')]
map + lambda offers more flexibility, because you handle the raw values yourself instead of going through the string accessor. In the example below, rows are removed when they start with /var or are empty (NaN, None, etc.):
df = df[~df['path'].map(lambda x: (str(x).startswith('/var') or not x))]
Drop all rows where the path column starts with /var or /tmp (you can also pass a tuple to startswith):
df = df[~df['path'].map(lambda x: (str(x).startswith(('/var', '/tmp'))))]
The tilde ~ is used for negation; if instead you wanted to keep all rows starting with /var, just remove the ~.
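A small runnable sketch of the tuple form, using hypothetical path values (made up for illustration):

import pandas as pd

# hypothetical paths, just to illustrate passing a tuple to startswith
df = pd.DataFrame({'path': ['/var/log/syslog', '/tmp/x.tmp', '/home/me/notes.txt']})

# keep rows whose path does NOT start with /var or /tmp
df = df[~df['path'].str.startswith(('/var', '/tmp'))]
print(df)
#                  path
# 2  /home/me/notes.txt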
Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
I'm trying to figure out a way to check the string length of each column/row and return any IDs that don't match the required string length of 4 in a new dataframe. This will eventually be turned into a dictionary of incorrect IDs.
This is the approach I'm currently taking:
incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]
and the current output:
  old_id new_id
0    111    NaN
1    NaN    NaN
2    NaN    777
3    NaN    NaN
I'm not sure where to go from here, and I'm sure there's a much better approach, but this is the output I'm looking for: a single-column dataframe, with the column name id, containing just the IDs that don't match the required string length:
id
111
777
In general, DataFrame.applymap is pretty slow, so you should avoid it. I would stack both columns into a single one and select the IDs whose length is not 4:
import pandas as pd
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
ids = df.stack()
bad_ids = ids[ids.str.len() != 4]
Output:
>>> bad_ids
0 old_id 111
2 new_id 777
dtype: object
The advantage of this approach is that you now have the location of the bad IDs, which might be useful later. If you don't need it, you can just use ids = df.stack().reset_index(drop=True).
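To get exactly the single-column frame named id from the question, you could finish with something like this (a sketch reusing bad_ids from above):

out = bad_ids.reset_index(drop=True).rename('id').to_frame()
print(out)
#     id
# 0  111
# 1  777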
Here's part of an answer:
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
all_ids = df.values.flatten()
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids
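If you then want the requested single-column frame named id, one way is to wrap that list (a sketch building on bad_ids above):

bad_ids_df = pd.DataFrame({'id': bad_ids})
print(bad_ids_df)
#     id
# 0  111
# 1  777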
Or, if you are not completely sure what you are doing, you can always use the brute-force method :D
import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
rows, columns = df.shape
# print(df)
for row in range(rows):
    k = df.loc[row]
    for column in range(columns):
        # print(k.iloc[column])
        if len(k.iloc[column]) != 4:
            print("Bad size of ID on row: " + str(row) + " column: " + str(column))
As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:
>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0 111
1 777
dtype: object
To turn that into a single-column df named id, you can extend it with .rename('id').to_frame():
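In full, the chain would look something like this (a sketch of the same idea):

out = df[df.applymap(len) != 4].stack().reset_index(drop=True).rename('id').to_frame()
print(out)
#     id
# 0  111
# 1  777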
I need to filter out/eliminate the first n rows, up to where the "nan" values appear, from a much bigger df like the df2 shown below.
import numpy as np
import pandas as pd

main_df = {
    'Courses': ["Spark", "Java", "Python", "Go"],
    'Discount': [2000, 2300, 1200, 2000],
    'Pappa': [np.nan, np.nan, "2", "ai"],
    'Puppo': ["Glob", "Java", "n", "Godo"],
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(main_df, index=index_labels2)
I tried with:
maino_df = main_df.loc[:, (main_df.iloc [0] != np.nan) & ((main_df.iloc [0,:] < 1000))]
to obtain:
main_dfnew = {
'Courses':["Python","Go"],
'Discount':[1200,2000],
'Pappa':["2","ai"],
'Puppo':["n","Godo"],
}
index_labels2=['r3','r5']
df2 = pd.DataFrame(main_dfnew, index=index_labels2)
but it also eliminates the columns where there is a nan.
IIUC, you want to drop the leading rows that have NaNs, and keep all the rows starting from the first row that has no NaNs?
NB. I am assuming real NaNs here; if not, first use replace or another method to convert the placeholders to NaN, or adapt the comparison to match whatever you consider invalid data.
You could use:
df3 = df2[df2.notna().all(axis=1).cummax()]
output:
Courses Discount Pappa Puppo
r3 Python 1200 2 n
r5 Go 2000 ai Godo
If you just want to remove all the rows with NaNs, use dropna:
df3 = df2.dropna(axis=0)
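To see how the two approaches differ, here is a tiny made-up frame where a NaN appears again after the first complete row (purely illustrative data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2, np.nan, 4]})

# cummax keeps everything from the first complete row onward, including later NaN rows
print(df[df.notna().all(axis=1).cummax()])   # rows 1, 2, 3, 4
# dropna removes every row that contains a NaN
print(df.dropna(axis=0))                     # rows 1, 2, 4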
I have a dataset which has 1854 rows and 106 columns. In its third column there are values like "Worm.Win32.Zwr.c" (and other malware names). I want to check whether a word like 'worm' appears in a row, and if so insert 1 in the target column of that same row:
for rows in malware_data:
    if ('worm' in malware_data[3]):
        malware_data.loc[rows]['target'] = 1
    else:
        malware_data.loc[rows]['target'] = 0
You can do this in several ways:
1) By creating a boolean mask that marks which rows contain the word 'worm':
mask = df[third_column].str.lower().str.contains('worm')
df.loc[mask, 'target'] = 1
df.loc[~mask, 'target'] = 0
Instead of .str.lower().str.contains('worm') you can use .str.contains('(?i)worm') for a case-insensitive match.
if you do not know the name of your third column you could use:
third_column = df.columns[2]
2) By applying a function along your third column of the DataFrame, as @ArunPrabhath suggested:
df['target'] = df[third_column].apply(lambda x: int('worm' in x.lower()))
malware_data['target'] = malware_data[3].apply(lambda row: 1 if ('worm' in row) else 0)
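A more compact vectorized variant (a sketch; it assumes the third column holds strings, and na=False treats missing values as non-matches):

third_column = malware_data.columns[2]
malware_data['target'] = (
    malware_data[third_column]
    .str.contains('worm', case=False, na=False)  # case-insensitive match, NaN -> False
    .astype(int)
)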
I know I can do it like below if we are checking only two columns together.
df['flag'] = df['a_id'].isin(df['b_id'])
where df is a data frame, and a_id and b_id are two columns of the data frame. It returns True or False based on the match, but I need to compare multiple columns together.
For example: if there are a_id , a_region, a_ip, b_id, b_region and b_ip columns. I want to compare like below,
a_key = df['a_id'] + df['a_region'] + df['a_ip']
b_key = df['b_id'] + df['b_region'] + df['b_ip']
df['flag'] = a_key.isin(b_key)
Somehow the above code is always returning False. The output should be like below:
The first row's flag should be True because there is a match: a_key becomes 2a10, which matches the last row of b_key (2a10).
You were going in the right direction, just use:
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
a_key.isin(b_key)
Mine gives the results below:
0 True
1 False
2 False
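For reference, a self-contained sketch using data reconstructed from the example output further down (so the exact values are assumptions):

import pandas as pd

df = pd.DataFrame({
    'a_id': [2, 22222, 33333],
    'a_region': ['a', 'bcccc', 'acccc'],
    'a_ip': [10, 10000, 120000],
    'b_id': [3222222, 43333, 2],
    'b_region': ['sssss', 'ddddd', 'a'],
    'b_ip': [22222, 11111, 10],
})

# cast the numeric parts to str so concatenation builds comparable string keys
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
df['flag'] = a_key.isin(b_key)
print(df['flag'])
# 0     True
# 1    False
# 2    False
# Name: flag, dtype: bool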
You can use isin with a DataFrame as values, but as per the docs:
If values is a DataFrame, then both the index and column labels must match.
So this should work:
# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])
# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)
# Get actual rows with boolean indexing
df_a.loc[matched]
# ... or add boolean flag to dataframe
df['flag'] = matched
Here's one approach using DataFrame.merge, pandas.concat and testing for duplicated values:
df_merged = df.merge(df,
left_on=['a_id', 'a_region', 'a_ip'],
right_on=['b_id', 'b_region', 'b_ip'],
suffixes=('', '_y'))
df['flag'] = pd.concat([df, df_merged[df.columns]]).duplicated(keep=False)[:len(df)].values
[out]
a_id a_region a_ip b_id b_region b_ip flag
0 2 a 10 3222222 sssss 22222 True
1 22222 bcccc 10000 43333 ddddd 11111 False
2 33333 acccc 120000 2 a 10 False
I have a dataframe made from a csv in which missing data is represented by the ? symbol. I want to check how many rows there are in which ? occurs, along with the number of occurrences.
So far I made this, but it shows the number of all rows, not only the ones in which ? occurs.
print(sum([True for idx, row in df.iterrows() if any(row.str.contains('[?]'))]))
You can use apply + str.contains, assuming all your columns are strings.
import numpy as np

c = np.sum(df.apply(lambda x: x.str.contains(r'\?')).values)
If you need to select string columns only, use select_dtypes -
i = df.select_dtypes(exclude=['number']).apply(lambda x: x.str.contains(r'\?'))
c = np.sum(i.values)
Alternatively, to find the number of rows containing ? in them, use
c = df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
Demo -
df
A B
0 aaa ?xyz
1 bbb que!?
2 ? ddd
3 foo? fff
df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
4
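If ? always marks a whole missing cell (rather than appearing inside longer strings, as in the demo above), a plain equality check also works and avoids regex escaping entirely (a sketch under that assumption):

mask = df.eq('?')                      # True where a cell is exactly '?'
rows_with_q = mask.any(axis=1).sum()   # number of rows containing at least one '?'
total_q = mask.sum().sum()             # total number of '?' cells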