How to drop all but the first column that starts with a pattern? - python

I have code that deletes all columns that start with spike:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist() , axis=1).copy()
Question: How do I modify the code above so that it keeps the first column that starts with spike but deletes all the other columns that start with spike? If the code above is hard to modify, suggest your own versions.

This can be achieved just by changing .tolist() to .tolist()[1:]; the final code looks like:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist()[1:] , axis=1).copy()
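For the sample frame above, a quick sanity check of what is left (just a print I added; output assumes the data defined in the question):
print(df2.columns.tolist())  # ['spike_starts1', 'not']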

You can create a spike flag and drop duplicates which will only keep the first one.
(
df.T.assign(flag=lambda x: x.index.str.slice(0,5))
.drop_duplicates(subset='flag')
.drop('flag', axis=1)
.T
)
spike_starts1 not
0 1 10
1 2 11
2 3 12
Or you can build a dict with only the first spike column and the other non-spike columns.
first_spike = [c for c in df.columns if c.startswith('spike')][0]
(
pd.DataFrame({c: df[c] for c in df.columns
              if not c.startswith('spike') or c == first_spike})
)
Another solution:
(
pd.DataFrame(df.columns)
.assign(F=lambda x: x[0].str[:5])
.drop_duplicates(subset='F')
.pipe(lambda x: df.reindex(columns=x[0]))
)
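If the prefix is not always exactly 5 characters, a variant of the same idea (my own sketch, not part of the original answers) is to build a boolean mask with str.startswith and keep only the first match:
mask = df.columns.str.startswith('spike')        # True for every spike column
keep = df.columns[~mask | (mask.cumsum() == 1)]  # non-spike columns plus the first spike column
df2 = df.reindex(columns=keep)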

Optimizing an Excel to Pandas import and transformation from wide to long data

I need to import and transform xlsx files. They are written in a wide format, and I need to repeat some of the cell information from each row and pair it up with each of the activity columns in that row:
[Edit: changed format to represent the more complex requirements]
Source format

ID  Property  Activity1name  Activity1timestamp  Activity2name  Activity2timestamp
1   A         a              1.1.22 00:00        b              2.1.22 10:05
2   B         a              1.1.22 03:00        b              5.1.22 20:16

Target format

ID  Property  Activity  Timestamp
1   A         a         1.1.22 00:00
1   A         b         2.1.22 10:05
2   B         a         1.1.22 03:00
2   B         b         5.1.22 20:16
The following code works fine to transform the data, but the process is really, really slow:
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records")  # Convert to dict for speed optimization
    for row_dict in tqdm(data_in):  # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step+1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column]+rep]
            data = data.append(new_row, ignore_index=True)  # append dict of new_row to existing data
    data.index.name = "SortKey"
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp # TODO check if works as intended
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with nan
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
Obviously, iterating over each row and then even each column is not at all how to use pandas the right way, but I don't see how this kind of transformation can be vectorized.
I have tried using parallelization (modin) and played around with using dict or not, but it didn't work / help. The rest of the script literally just opens and saves the files, so the problem lies here.
I would be very grateful for any ideas on how to improve the speed!
The df.melt function should be able to do this type of operation much faster.
df = pd.DataFrame({'ID' : [1, 2],
'Property' : ['A', 'B'],
'Info1' : ['x', 'a'],
'Info2' : ['y', 'b'],
'Info3' : ['z', 'c'],
})
data=df.melt(id_vars=['ID','Property'], value_vars=['Info1', 'Info2', 'Info3'])
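For reference, the melted frame looks roughly like this (rows are stacked per value_vars column):
print(data)
#    ID Property variable value
# 0   1        A    Info1     x
# 1   2        B    Info1     a
# 2   1        A    Info2     y
# 3   2        B    Info2     b
# 4   1        A    Info3     z
# 5   2        B    Info3     c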
** Edit to address modified question **
Combine df.melt with a df.pivot_table operation.
import numpy as np
import pandas as pd

# create data
df = pd.DataFrame({'ID' : [1, 2, 3],
'Property' : ['A', 'B', 'C'],
'Activity1name' : ['a', 'a', 'a'],
'Activity1timestamp' : ['1_1_22', '1_1_23', '1_1_24'],
'Activity2name' : ['b', 'b', 'b'],
'Activity2timestamp' : ['2_1_22', '2_1_23', '2_1_24'],
})
# melt dataframe
df_melted = df.melt(id_vars=['ID','Property'],
value_vars=['Activity1name', 'Activity1timestamp',
'Activity2name', 'Activity2timestamp',],
)
# merge categories, i.e. Activity1name Activity2name become Activity
df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'),'variable'] = 'Timestamp'
# add category ids (dataframe may need to be sorted before this operation)
u_category_ids = np.arange(1,len(df_melted.variable.unique())+1)
category_ids = np.repeat(u_category_ids,len(df)*2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) +'_'+ category_ids)
# pivot table
table = df_melted.pivot_table(index=['unique_id','ID','Property',],
columns='variable', values='value',
aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
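With the sample data above, table ends up roughly as follows (pandas also keeps a leftover 'variable' label on the column axis):
print(table)
#    ID Property Activity Timestamp
# 0   1        A        a    1_1_22
# 1   1        A        b    2_1_22
# 2   2        B        a    1_1_23
# 3   2        B        b    2_1_23
# 4   3        C        a    1_1_24
# 5   3        C        b    2_1_24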
Using pd.melt, as suggested by @Pantelis, I was able to speed up this transformation enormously. Before, a file with ~13k rows took 4-5 hours on a brand-new ThinkPad X1 - now it takes less than 2 minutes! That's a speed-up by a factor of about 150, just wow. :)
Here's my new code, for inspiration / reference if anyone has a similar data structure:
def transform(data_in):
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching))
    # Specify columns for pd.melt, transforming wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung": data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row+1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in.loc[0, "Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns, var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True)  # Replace "Erledigungstermin Auftragsschrittbeschreibung" with ActivityName
    data.sort_values(["Auftrags-\npositionsnummer", timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with nan
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data

Create a pandas DataFrame where each cell is a set of strings

I am trying to create a DataFrame like so:
col_a       col_b
{'soln_a'}  {'soln_b'}
In case it helps, here are some of my failed attempts:
import pandas as pd
my_dict_a = {"col_a": set(["soln_a"]), "col_b": set("soln_b")}
df_0 = pd.DataFrame.from_dict(my_dict_a) # ValueError: All arrays must be of the same length
df_1 = pd.DataFrame.from_dict(my_dict_a, orient="index").T # splits 'soln_b' into individual letters
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b).apply(set) # TypeError: 'set' type is unordered
df_3 = pd.DataFrame.from_dict(my_dict_b, orient="index").T # creates DataFrame of lists
df_3.apply(set, axis=1) # combines into single set of {soln_a, soln_b}
What's the best way to do this?
You just need to ensure your input data structure is formatted correctly.
The (default) dictionary -> DataFrame constructor expects the values in the dictionary to be a collection of some type. You just need to make sure you have a collection of set objects, instead of having the key point directly to a set.
So, if I change my input dictionary to have a list of sets, then it works as expected.
import pandas as pd
my_dict = {
"col_a": [{"soln_a"}, {"soln_c"}],
"col_b": [{"soln_b", "soln_d"}, {"soln_c"}]
}
df = pd.DataFrame.from_dict(my_dict)
print(df)
col_a col_b
0 {soln_a} {soln_d, soln_b}
1 {soln_c} {soln_c}
You could apply a list comprehension on the columns:
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b)
df_2 = df_2.apply(lambda col: [set([x]) for x in col])
Output:
col_a col_b
0 {soln_a} {soln_b}
Why not something like this?
df = pd.DataFrame({
'col_a': [set(['soln_a'])],
'col_b': [set(['soln_b'])],
})
Output:
>>> df
col_a col_b
0 {soln_a} {soln_b}

Creating a function to rename columns in a pandas dataframe

I have a dataframe as below:
df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
I tried creating a function which allows me to change a column name dynamically, where I just pass the old column name and the new column name to the function, as below:
def rename_column_name(df, old_column, new_column):
df = df.rename({'{}'.format(old_column) : '{}'.format(new_column)}, axis=1)
return df
This function only works for renaming one column at a time, as below:
new_df = rename_column_name(df, '$a' , 'a')
which gives me this new_df:
new_df = pd.DataFrame({'a':[1,2], '$b': [10,20]})
However, I want to create a function that allows me to rename multiple columns or a single column, depending on my preference, like this:
new_df = rename_column_name(df, ['$a','$b'] , ['a','b'])
And get the new_df as below
new_df = pd.DataFrame({'a':[1,2], 'b': [10,20]})
So, how do I make my function more dynamic, so that I am free to pass either one or several column names and rename them?
You don't need a function; you can do this with dict and zip:
In [265]: old_names = df.columns.tolist()
In [266]: new_names = ['a','b']
In [268]: df = df.rename(columns=dict(zip(old_names, new_names)))
In [269]: df
Out[269]:
a b
0 1 10
1 2 20
Function that OP needs:
In [274]: def rename_column_name(df, old_column_list, new_column_list):
...: df = df.rename(columns=dict(zip(old_column_list, new_column_list)))
...: return df
...:
In [275]: rename_column_name(df,old_names,new_names)
Out[275]:
a b
0 1 10
1 2 20
You can pass this function either a list of columns or a single column name. This should do what you were looking for.
def rename_column_name(df, old_column, new_column):
    if not isinstance(old_column, (list, tuple)):
        old_column = [old_column]
    if not isinstance(new_column, (list, tuple)):
        new_column = [new_column]
    df = df.rename({old: new for old, new in zip(old_column, new_column)}, axis=1)
    return df  # dang i should have used dict.zip like in the other solution :P
I guess ... although I don't understand how this is easier than just calling
df.rename(columns={'$a': 'a', '$b': 'b'})
You can do that with the zip function, where old_column_names and new_column_names should be lists.
def rename_column_name(df, old_column_names, new_column_names):
    # validate that a new name has been passed for every old name
    if len(old_column_names) == len(new_column_names):
        df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    return df
To handle both a single-column rename and lists of columns, the function needs a couple of extra checks, which could be:
def rename_column_name(df, old_column_names, new_column_names):
    if isinstance(old_column_names, list) and isinstance(new_column_names, list):
        # validate that a new name has been passed for every old name
        if len(old_column_names) == len(new_column_names):
            df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    elif isinstance(old_column_names, str) and isinstance(new_column_names, str):
        df = df.rename(columns={old_column_names: new_column_names})
    return df
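A quick usage sketch of the version above, using the frame from the question (the example calls are my own):
df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
print(rename_column_name(df, ['$a', '$b'], ['a', 'b']))  # list form
print(rename_column_name(df, '$a', 'a'))                 # single-column form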

Diff between two dataframes in pandas

I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer') but I was not sure what column to pass in as the 'key' as there really isn't one and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist in both df1 and df2.
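A minimal illustration of the indicator approach; the two small frames here are made up just for the demo:
import pandas as pd

df1 = pd.DataFrame({'Fruits': ['apple', 'banana', 'coconut'], 'Quantity': [1, 2, 3]})
df2 = pd.DataFrame({'Fruits': ['apple', 'banana', 'durian'], 'Quantity': [1, 3, 4]})

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
# 'left_only' rows exist only in df1, 'right_only' rows only in df2
print(diff_df.loc[diff_df['Exist'] != 'both'])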
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
df1.index.symmetric_difference(df2.index)
]
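Note that symmetric_difference works on the index labels, so it assumes matching rows share the same index in both frames. A small illustration with made-up frames:
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'x': [1, 2, 4]}, index=[0, 1, 3])

only_in_one = pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
print(only_in_one)  # the row with index 2 (only in df1) and the row with index 3 (only in df2)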
You can use this function; the output is an ordered dict of 6 dataframes, which you can write to Excel for further analysis.
'df1' and 'df2' refers to your input dataframes.
'uid' refers to the column or combination of columns that make up the unique key. (i.e. 'Fruits')
'dedupe' (default=True) drops duplicates in df1 and df2. (refer to Step 4 in comments)
'labels' (default = ('df1','df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see these rows; they are put one on top of the other and labelled with the dataframe name, so we know which dataframe each row belongs to.
'drop' can take a list of columns to be excluded when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
Fruits Quantity
1 coconut 3
In [11]: dict1['df2_only']
Out[11]:
Fruits Quantity
3 durian 4
In [12]: dict1['Diff']
Out[12]:
Fruits Quantity df1 or df2
0 banana 2 df1
1 banana 3 df2
In [13]: dict1['Merge']
Out[13]:
Fruits Quantity
0 apple 1
Here is the code:
import pandas as pd
from collections import OrderedDict as od
def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()
    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]
    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)
    # Step 2a - Check if the set of column headers are the same
    # (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))
    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right Dataframe...')
        df2 = df2[col1]
    # Step 3 - Check datatypes are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes, 'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')
    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()
    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- Round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))
    # Checks complete, begin merge. Remember to dedupe, provide labels for common_no_match.
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
    if type(uid) == str:
        uid = [uid]
    if type(uid) == list:
        df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
        df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
        df1_only = df1_only[df1_only['Duplicated'] == False]
        df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
        df2_only['Duplicated'] = df2_only.duplicated(keep=False)
        df2_only = df2_only[df2_only['Duplicated'] == False]
        label = labels[0] + ' or ' + labels[1]
        df_lc = df1_only.copy()
        df_lc[label] = labels[0]
        df_rc = df2_only.copy()
        df_rc[label] = labels[1]
        df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
        df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
        df_c1 = df_c[df_c['Duplicated'] == True]
        df_c1 = df_c1.drop('Duplicated', axis=1)
        df_uc = df_c[df_c['Duplicated'] == False]
        df_uc_left = df_uc[df_uc[label] == labels[0]]
        df_uc_right = df_uc[df_uc[label] == labels[1]]
        dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
        dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
        dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)
    return dict_result
Set df2.columns = df1.columns
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index), and df2.index.difference(df1.index), and the two results are your distinct rows.
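A runnable sketch of this approach (using frames similar to the Fruits example earlier; the variable names are my own):
import pandas as pd

df1 = pd.DataFrame([['apple', 1], ['banana', 2], ['coconut', 3]], columns=['Fruits', 'Quantity'])
df2 = pd.DataFrame([['apple', 1], ['banana', 3], ['durian', 4]], columns=['Fruits', 'Quantity'])

df2.columns = df1.columns
a = df1.set_index(df1.columns.tolist())
b = df2.set_index(df2.columns.tolist())

print(a.index.difference(b.index))  # rows only in df1: (banana, 2) and (coconut, 3)
print(b.index.difference(a.index))  # rows only in df2: (banana, 3) and (durian, 4)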
With
left_df.merge(df,left_on=left_df.columns.tolist(),right_on=df.columns.tolist(),how='outer')
you can get the outer join result.
Similarly, you can get the inner join result. Then take the difference between the two, which is what you want.
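A sketch of that idea, using the df1/df2 naming from the question (it assumes both frames have the same columns and no duplicate rows within a single frame):
cols = df1.columns.tolist()
outer = df1.merge(df2, on=cols, how='outer')
inner = df1.merge(df2, on=cols, how='inner')

# rows in the outer join but not in the inner join, i.e. rows not shared by both frames
diff = pd.concat([outer, inner]).drop_duplicates(keep=False)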

Add a suffix to a dataframe called from a dictionary

I am trying to add a suffix to the dataframes that are called from a dictionary.
Here is a sample code below:
import pandas as pd
import numpy as np
from collections import OrderedDict
from itertools import chain
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
num_periods_3 = 5
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
dates3 = pd.date_range('1/1/2000 02:00:00', periods=num_periods_3, freq='10min')
# column_names = ['WS Avg','WS Max','WS Min','WS Dev','WD Avg']
# column_names = ['A','B','C','D','E']
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
column_names_3 = ['E', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = pd.DataFrame(np.random.randn(num_periods_3, len(column_names_3)), index=dates3, columns=column_names_3)
sep0 = '<~>'
suf1 = '_1'
suf2 = '_2'
suf3 = '_3'
ddict = {'df1': df1, 'df2': df2, 'df3': df3}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2', 'Suffix 3']}
Suff = {'Suffix 1': suf1, 'Suffix 2': suf2, 'Suffix 3': suf3}
## apply suffix to each data frame selected in order HERE
# Suffdict = [Suff[x] for x in Suffs['Suffixes']]
# print(Suffdict)
df4 = pd.concat([ddict[x] for x in frames_to_concat['Sheets']],
axis=1,
join='outer')
I want to add a suffix to each dataframe so that they can be distinguished when the dataframes are concatenated. I am having some trouble looking the suffixes up and then applying them to each dataframe. So I have called for df1 and df3 to be concatenated, and I would like only suffix 1 to be applied to df1 and suffix 2 to be applied to df3.
Order does not matter for the dataframe suffix: if df2 and df3 were called, suffix 1 would be applied to df2 and suffix 2 to df3. Obviously the last suffix would not be used.
Unless you have python3.6, you cannot guarantee order in dictionaries. Even if you could with python3.6, that would imply your code would not run in any lower python version. If you need order, you should be looking at lists instead.
You can store your dataframes as well as your suffixes in a list, and then use zip to add a suffix to each df in turn.
dfs = [df1, df2, df3]
sufs = [suf1, suf2, suf3]
df_sufs = [x.add_suffix(y) for x, y in zip(dfs, sufs)]
Based on your code/answer, you can load your dataframes and suffixes into lists, call zip, add a suffix to each one, and call pd.concat.
dfs = [ddict[x] for x in frames_to_concat['Sheets']]
sufs = [Suff[x] for x in Suffs['Suffixes']]
df4 = pd.concat([x.add_suffix(sep0 + y)
for x, y in zip(dfs, sufs)], axis=1, join='outer')
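For the setup above (df1 and df3 with suffixes '_1' and '_2' and separator '<~>'), a quick check of the resulting column names:
print(df4.columns.tolist())
# ['C<~>_1', 'B<~>_1', 'A<~>_1', 'E<~>_2', 'B<~>_2', 'C<~>_2']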
Ended up just making a simple iterator for the problem. Here is my solution
n = 0
for df in frames_to_concat['Sheets']:
    print(ddict[df])
    ddict[df] = ddict[df].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
    n = n + 1
Anyone have a better way to do this?
