I have a dataframe where the column names all have the same format: date_sensor, where the date is in the format yymmdd. Here is a subset of it:
Considering the last date (180722), I would like to keep only one column for that date, according to a pre-defined sensor priority. For example, I would like to define that SN1 is more important than SK3, so the desired result would be the same dataframe, only without column 180722_SK3. The number of columns sharing the last date can be more than two.
This is the solution I implemented:
sensorsImportance = ['SN1', 'SK3']  # list of importance, first item is the most important
sensorsOrdering = {word: i for i, word in enumerate(sensorsImportance)}

def remove_duplicate_last_date(df, sensorsOrdering):
    s = []
    lastDate = df.columns.tolist()[-1].split('_')[0]
    for i in df.columns.tolist():
        if lastDate in i:
            s.append(i.split('_')[1])
    if len(s) > 1:
        keepCol = lastDate + '_' + sorted(s, key=sensorsOrdering.get)[0]
        dropCols = [lastDate + '_' + i for i in sorted(s, key=sensorsOrdering.get)[1:]]
        df.drop(dropCols, axis=1, inplace=True)
    return df
It works fine; however, I feel this is too cumbersome. Is there a better way?
This can be done by splitting the column names, using argsort against the priority list to reorder the dataframe, taking the first column per date with a groupby, and then joining the column names back:
df.columns = df.columns.str.split('_').map(tuple)
sensorsImportance = ['SN1', 'SK3']
idx = df.columns.get_level_values(1).map(dict(zip(sensorsImportance, range(len(sensorsImportance))))).argsort()
df = df.iloc[:, idx].T.groupby(level=0).head(1).T
df.columns = df.columns.map('_'.join)
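For illustration, since the data sample from the question is not shown here, this is how the snippet behaves on a small dataframe with hypothetical column names (the column names and values below are assumptions, not the asker's real data):

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=['180720_SN1', '180721_SK3', '180722_SN1', '180722_SK3'])
sensorsImportance = ['SN1', 'SK3']

df.columns = df.columns.str.split('_').map(tuple)   # e.g. ('180722', 'SK3')
idx = df.columns.get_level_values(1).map(dict(zip(sensorsImportance, range(len(sensorsImportance))))).argsort()
df = df.iloc[:, idx].T.groupby(level=0).head(1).T   # keep the highest-priority sensor per date
df.columns = df.columns.map('_'.join)

# df.columns now contains 180720_SN1, 180721_SK3 and 180722_SN1; 180722_SK3 has been dropped.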
I'm trying to create a new column 'BroadCategory' within a dataframe based on whether values in another column, 'VenueCategory', occur in specific lists. I have 5 lists that I am using to fill in the values in the new column.
For example:
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Bar),'Bar','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Museum_ArtGallery),'Museum/Art Gallery','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Public_Transport),'Public Transport','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Restaurant_FoodVenue),'Restaurant/Food Venue','Other')
I ultimately want the values in the VenueCategory column occurring in the list Bar to be labeled 'Bar', those occurring in the list Museum_ArtGallery to be labeled 'Museum_ArtGallery', etc. My code above doesn't accomplish this.
I tried this in order to keep the values I had previously filled, but it still overwrites the values set by my previous conditions:
df['BroadCategory'] = np.where(df[df.VenueCategory!='Other'].isin(Entertainment_Venue),'Entertainment Venue','Other')
How can I fill the column BroadCategory with the specific values based on whether the values in the VenueCategory column occur in the specified lists Bar, Restaurant, Public_Transport, Museum_ArtGallery, etc.?
Suppose your data is like this:
import pandas as pd

df = pd.DataFrame({'VenueCategory': ['drink', 'wine', 'MOMA', 'MTA', 'sushi', 'Hudson']})
Bar = ['drink', 'wine', 'alcohol']
Museum_ArtGallery = ['MOMA', 'MCM']
Public_Transport = ['MTA', 'MBTA']
Restaurant_FoodVenue = ['sushi', 'chicken']
Prepare a dictionary:
from collections import defaultdict

d = defaultdict(lambda: 'other')
d.update({x: 'Bar' for x in Bar})
d.update({x: 'Museum_ArtGallery' for x in Museum_ArtGallery})
d.update({x: 'Public_Transport' for x in Public_Transport})
d.update({x: 'Restaurant_FoodVenue' for x in Restaurant_FoodVenue})
Build the new column and print the result:
df['BroadCategory'] = df['VenueCategory'].apply(lambda x: d[x])
df
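For the sample data above, this prints something like:

  VenueCategory         BroadCategory
0         drink                   Bar
1          wine                   Bar
2          MOMA     Museum_ArtGallery
3           MTA      Public_Transport
4         sushi  Restaurant_FoodVenue
5        Hudson                 other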
Alternatively, build a small lookup table from the lists and merge it onto the dataframe:

venue_list = [['Bar', Bar],
              ['Museum_ArtGallery', Museum_ArtGallery]
              # etc.
             ]

venue_lookup = pd.concat([
    pd.DataFrame({
        'BroadCategory': venue[0],
        'VenueCategory': venue[1]}) for venue in venue_list]
)

pd.merge(df, venue_lookup, how='left', on='VenueCategory')
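One caveat with the merge approach: venues that appear in none of the lists end up as NaN rather than 'Other' after the left merge, so you may want to fill them in afterwards, for example:

result = pd.merge(df, venue_lookup, how='left', on='VenueCategory')
result['BroadCategory'] = result['BroadCategory'].fillna('Other')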
Your solution is already close. In order not to overwrite previously set values, you should only assign new values to the subset of rows that matches each condition.
To do that, first initialize the new column BroadCategory to 'Other'. Then select the rows of each category with a Boolean mask built from the .isin() calls you are already using, and assign the label to just those rows with df.loc. The code looks like this:
df['BroadCategory'] = 'Other'
df.loc[df['VenueCategory'].isin(Bar), 'BroadCategory'] = 'Bar'
df.loc[df['VenueCategory'].isin(Museum_ArtGallery), 'BroadCategory'] = 'Museum/Art Gallery'
df.loc[df['VenueCategory'].isin(Public_Transport), 'BroadCategory'] = 'Public Transport'
df.loc[df['VenueCategory'].isin(Restaurant_FoodVenue), 'BroadCategory'] = 'Restaurant/Food Venue'
df.loc[df['VenueCategory'].isin(Entertainment_Venue), 'BroadCategory'] = 'Entertainment Venue'
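As a side note beyond the answers above: since the question already uses numpy, np.select can apply several condition/label pairs in a single call, with 'Other' as the default, which also avoids the overwriting problem. A minimal sketch using the question's lists:

import numpy as np

conditions = [df['VenueCategory'].isin(Bar),
              df['VenueCategory'].isin(Museum_ArtGallery),
              df['VenueCategory'].isin(Public_Transport),
              df['VenueCategory'].isin(Restaurant_FoodVenue)]
labels = ['Bar', 'Museum/Art Gallery', 'Public Transport', 'Restaurant/Food Venue']
df['BroadCategory'] = np.select(conditions, labels, default='Other')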
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
Which runs without errors but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Value' entries meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
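To actually discard the flagged groups rather than just label them (a small addition, not part of the answer above), you can then filter on the new column and drop it:

my_df1_Group = my_df1_Group[my_df1_Group['Bad_Row'] == 'Good Row'].drop(columns='Bad_Row')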
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
I am having a hard time with a complex dataframe filtering task.
Here is the problem:
for rows sharing the same 'id' value, the column 'job' can take the values 'fireman', 'nan', 'policeman'.
I would like to filter my dataframe so that, for each id,
I keep only the rows starting from where the value 'fireman' occurs in 'job' for the last consecutive time. I first have to group by 'job' values to filter on:
df.groupby("job").filter(lambda x: f(x))
I don't know which function f is appropriate.
Any ideas?
To try:
df = pd.DataFrame([[79, 1, None], [79, 2, 'fireman'], [79, 3, 'fireman'], [79, 4, None],
                   [79, 5, None], [79, 6, 'fireman'], [79, 7, 'fireman'], [79, 8, 'policeman']],
                  columns=['id', 'day', 'job'])
output = pd.DataFrame([[79, 6, 'fireman'], [79, 7, 'fireman'], [79, 8, 'policeman']],
                      columns=['id', 'day', 'job'])
Here is a version without the need of extra variables:
df.groupby('imo').apply(lambda grp: grp[grp.index >=
    ((grp.polygon.shift() != grp.polygon) &
     (grp.polygon.shift(-1) == grp.polygon) &
     (grp.polygon == 'FE')
    ).cumsum().idxmax()]
).reset_index(level=0, drop=True)
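The snippet above refers to columns 'imo' and 'polygon' and the value 'FE', which do not appear in the question; adapted to the example dataframe from the question ('id', 'job', 'fireman'), a sketch of the same idea would be:

import pandas as pd

df = pd.DataFrame([[79, 1, None], [79, 2, 'fireman'], [79, 3, 'fireman'], [79, 4, None],
                   [79, 5, None], [79, 6, 'fireman'], [79, 7, 'fireman'], [79, 8, 'policeman']],
                  columns=['id', 'day', 'job'])

# For each id: flag rows where a consecutive run of 'fireman' starts (previous value differs,
# next value is also 'fireman'), take the index of the last such start, and keep that row
# plus everything after it.
result = df.groupby('id').apply(lambda grp: grp[grp.index >=
    ((grp.job.shift() != grp.job) &
     (grp.job.shift(-1) == grp.job) &
     (grp.job == 'fireman')
    ).cumsum().idxmax()]
).reset_index(level=0, drop=True)

# result matches the desired output: days 6, 7 and 8.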
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype=str, delimiter=';', error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x: dt.datetime.strptime(x, '%d/%m/%Y'))
df = df.fillna(0)
To solve this I wrote this function:
def format_nombre(df):
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):
                a = df.iloc[i, j].replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I select each cell at the intersection of every row and column, replace the problematic characters, turn the element into a float and put it back into the dataframe. The if ensures that the function doesn't touch the dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my whole csv would take a little less than 60 hours.
I realize this is far from optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False so the '.' is treated as a literal dot, not as a regex wildcard
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)

df = df.apply(to_numeric)
That's assuming all strings are valid numbers. Otherwise use pd.to_numeric (for instance with errors='coerce') instead of astype(float).
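A further option, not from the answer above: since the values come straight from read_csv (see the EDIT in the question), pandas can parse this number format at read time, assuming the file consistently uses '.' as the thousands separator and ',' as the decimal separator (note that dtype=str from the original call has to be dropped, otherwise everything stays a string):

import os
import pandas as pd

# decimal/thousands make read_csv turn '1.700,35' into the float 1700.35 directly;
# the CALDAY column is left as text and can still be parsed with strptime afterwards.
df = pd.read_csv(os.path.join(path, file), delimiter=';', decimal=',', thousands='.', nrows=50)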
I have written some code to essentially do an Excel-style vlookup on two pandas dataframes and want to speed it up.
The structure of the data frames is as follows:
dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'
merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'
sgo_df.columns:
'mkey', 'type'
To combine them, I do the following:
1. For each row in dbase1_df, find the row where its 'SGO10GEO' value matches the 'mkey' value of sgo_df. Obtain the 'type' from that row in sgo_df.
2. 'type' contains an integer ranging from 0 to 10. Create a column name by appending 'ST' to type.
3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column name is the one we obtained in step 2. Output this value into a csv file.
# Read in dbase1 dbf into data frame
dbase1_df = pandas.DataFrame.from_csv(dbase1_file, index_col=False)
merged_df = pandas.DataFrame.from_csv('merged.csv', index_col=False)

lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])

# For each unique value in dbase1 data frame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mukey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST' + tmp[0]
        val = int(row['VALUE'])
        # 2. Obtain hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output: VALUE, type, hmu value
            lup_out.writerow([val, s_type, hmu_val])
        else:
            err_out.writerow([merged_df['GRID'], type, row['GRID']])
Is there anything here that might be a speed bottleneck? Currently it takes around 20 minutes for ~500,000 rows in dbase1_df, ~1,000 rows in merged_df and ~500,000 rows in sgo_df.
Thanks!
You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at a minimum it should help you get the idea:
import pandas as pd

dbase1_df = pd.DataFrame.from_csv('dbase1_file.csv', index_col=False)
sgo_df = pd.DataFrame.from_csv('sgo_df.csv', index_col=False)
merged_df = pd.DataFrame.from_csv('merged_df.csv', index_col=False)

# You need to use the same column names for common columns to be able to do the
# merge operation in pandas, so we changed the column name SGO10GEO to mkey
dbase1_df.columns = [u'VALUE', u'COUNT', u'GRID', u'mkey']

# The operation below merges the two dataframes
Step1_Merge = pd.merge(dbase1_df, sgo_df)

# We need to add a new column that concatenates 'ST' and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST' + str(x))

# We need to change the shape of merged_df and move columns to rows to be able
# to do another merge operation
id = merged_df.ix[:, ['GRID']]
a = pd.merge(merged_df.stack(0).reset_index(1), id, left_index=True, right_index=True)

# We also need to change the automatically generated name to type_2 to be able
# to do the next merge operation
a.columns = [u'type_2', 0, u'GRID']

result = pd.merge(Step1_Merge, a, on=[u'type_2', u'GRID'])
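As a side note, not part of the original answer: DataFrame.from_csv and .ix have since been removed from pandas, so on a current version the same reshaping idea could be sketched with pd.read_csv, pd.melt and an ordinary merge, assuming the same file and column names as above:

import pandas as pd

dbase1_df = pd.read_csv('dbase1_file.csv').rename(columns={'SGO10GEO': 'mkey'})
sgo_df = pd.read_csv('sgo_df.csv')
merged_df = pd.read_csv('merged_df.csv')

# Attach the soil type to each dbase1 row and build the matching 'STx' column name
step1 = pd.merge(dbase1_df, sgo_df, on='mkey')
step1['type_2'] = 'ST' + step1['type'].astype(str)

# Reshape merged_df from wide (one column per STx) to long, then merge on GRID + type_2
long_merged = pd.melt(merged_df, id_vars='GRID', var_name='type_2', value_name='hmu_val')
result = pd.merge(step1, long_merged, on=['GRID', 'type_2'])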