I have a pandas DataFrame of unique rows which looks something like this:
df = pd.DataFrame({'O': ['O-1','O-1'],
'B': ['B-1','B-2'],
'C': ['C-1','C-2'],
'R': ['R-1','R-1']},
columns = ['O', 'B', 'C', 'R'])
The columns of df are ordered in a linear parent-child relation: column O is level 1, column B is level 2, and so on. The intention is to convert this df into a tree-like structure for navigation purposes, which would look something like this:
output = pd.DataFrame({'PARENT': ['O-1','O-1','O-1','O-1','O-1','B-1','B-1','B-2','B-2','C-1','C-2'],
'CHILD_TYPE': ['B','B','C','C','R','C','R','C','R','R','R'],
'CHILD': ['B-1','B-2','C-1','C-2','R-1','C-1','R-1','C-2','R-1','R-1','R-1']},
columns = ['PARENT', 'CHILD_TYPE', 'CHILD'])
Filtering on each value of each column in df (as the parent) and then copying all unique values of the remaining columns to its right as children seems like a bad way to achieve this.
Is there a more efficient way?
As mentioned, one way is to filter on each value of each column (as the parent) and then copy all unique values of the remaining columns to its right as children.
A solution with that logic is here:
sample = pd.DataFrame({'O': ['O-1','O-1'],
'B': ['B-1','B-2'],
'C': ['C-1','C-2'],
'R': ['R-1','R-1']}, columns = ['O', 'B', 'C', 'R'])
ls = []
for col in sample:
    for val in sample[col]:
        fs = sample[sample[col] == val]
        fvl = fs.iloc[:, fs.columns.get_loc(col)+1:].T.values.tolist()
        fcl = fs.iloc[:, fs.columns.get_loc(col)+1:].columns.tolist()
        for fc, fvs in zip(fcl, fvl):
            for fv in fvs:
                ls.append([val, fc, fv])
output = pd.DataFrame(ls, columns=['PARENT', 'CHILD_TYPE', 'CHILD']).drop_duplicates()
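A more direct way to build the same edge list (a sketch of an alternative, not part of the original post) is to take each ordered pair of columns, deduplicate it, and concatenate, so no per-value filtering is needed:
import pandas as pd

df = pd.DataFrame({'O': ['O-1', 'O-1'],
                   'B': ['B-1', 'B-2'],
                   'C': ['C-1', 'C-2'],
                   'R': ['R-1', 'R-1']},
                  columns=['O', 'B', 'C', 'R'])

parts = []
cols = list(df.columns)
for i, parent_col in enumerate(cols):
    for child_col in cols[i + 1:]:              # only columns to the right of the parent
        part = df[[parent_col, child_col]].drop_duplicates()
        part.columns = ['PARENT', 'CHILD']
        part['CHILD_TYPE'] = child_col          # child level name, e.g. 'B', 'C', 'R'
        parts.append(part)

output = pd.concat(parts, ignore_index=True)[['PARENT', 'CHILD_TYPE', 'CHILD']]
# Same 11 parent/child rows as above, though in a different row order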
I need to import and transform xlsx files. They are written in a wide format and I need to reproduce some of the cell information from each row and pair it up with information from all the other rows:
[Edit: changed format to represent the more complex requirements]
Source format

ID  Property  Activity1name  Activity1timestamp  Activity2name  Activity2timestamp
1   A         a              1.1.22 00:00        b              2.1.22 10:05
2   B         a              1.1.22 03:00        b              5.1.22 20:16

Target format

ID  Property  Activity  Timestamp
1   A         a         1.1.22 00:00
1   A         b         2.1.22 10:05
2   B         a         1.1.22 03:00
2   B         b         5.1.22 20:16
The following code works fine to transform the data, but the process is really, really slow:
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records")  # Convert to dict for speed optimization
    for row_dict in tqdm(data_in):  # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step+1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column]+rep]
            data = data.append(new_row, ignore_index=True)  # append dict of new_row to existing data
    data.index.name = "SortKey"
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp  # TODO check if works as intended
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
Obviously, iterating over each row and then over each column is not the right way to use pandas, but I don't see how this kind of transformation can be vectorized.
I have tried parallelization (modin) and played around with using a dict or not, but it didn't help. The rest of the script literally just opens and saves the files, so the problem lies here.
I would be very grateful for any ideas on how to improve the speed!
The df.melt function should be able to do this type of operation much faster.
df = pd.DataFrame({'ID' : [1, 2],
'Property' : ['A', 'B'],
'Info1' : ['x', 'a'],
'Info2' : ['y', 'b'],
'Info3' : ['z', 'c'],
})
data=df.melt(id_vars=['ID','Property'], value_vars=['Info1', 'Info2', 'Info3'])
** Edit to address modified question **
Combine the df.melt with a df.pivot_table operation.
# create data
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID' : [1, 2, 3],
'Property' : ['A', 'B', 'C'],
'Activity1name' : ['a', 'a', 'a'],
'Activity1timestamp' : ['1_1_22', '1_1_23', '1_1_24'],
'Activity2name' : ['b', 'b', 'b'],
'Activity2timestamp' : ['2_1_22', '2_1_23', '2_1_24'],
})
# melt dataframe
df_melted = df.melt(id_vars=['ID','Property'],
value_vars=['Activity1name', 'Activity1timestamp',
'Activity2name', 'Activity2timestamp',],
)
# merge categories, i.e. Activity1name Activity2name become Activity
df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'),'variable'] = 'Timestamp'
# add category ids (dataframe may need to be sorted before this operation)
u_category_ids = np.arange(1,len(df_melted.variable.unique())+1)
category_ids = np.repeat(u_category_ids,len(df)*2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) +'_'+ category_ids)
# pivot table
table = df_melted.pivot_table(index=['unique_id','ID','Property',],
columns='variable', values='value',
aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
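As an alternative sketch (my own suggestion, not part of the answer above), the same reshaping can be done without the synthetic unique_id by splitting each column name into a (step, field) pair and stacking the step level; this assumes the column names follow the ActivityNname / ActivityNtimestamp pattern from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Property': ['A', 'B'],
                   'Activity1name': ['a', 'a'],
                   'Activity1timestamp': ['1.1.22 00:00', '1.1.22 03:00'],
                   'Activity2name': ['b', 'b'],
                   'Activity2timestamp': ['2.1.22 10:05', '5.1.22 20:16']})

wide = df.set_index(['ID', 'Property'])
# 'Activity1name' -> ('1', 'name'), 'Activity2timestamp' -> ('2', 'timestamp'), ...
wide.columns = pd.MultiIndex.from_frame(
    wide.columns.str.extract(r'Activity(\d+)(name|timestamp)'),
    names=['step', 'field'])
long = (wide.stack('step')           # one row per (ID, Property, activity step)
            .reset_index()
            .drop(columns='step')
            .rename(columns={'name': 'Activity', 'timestamp': 'Timestamp'}))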
Using pd.melt, as suggested by #Pantelis, I was able to speed up this transformation enormously. Before, a file with ~13k rows took 4-5 hours on a brand-new ThinkPad X1; now it takes less than 2 minutes! That's a speed-up by a factor of about 150, just wow. :)
Here's my new code, for inspiration / reference if anyone has a similar data structure:
def transform(data_in):
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching))
    # Specify columns for pd.melt, transforming wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung": data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row+1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in.loc[0, "Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns, var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True)  # Replace "Erledigungstermin Auftragsschrittbeschreibung" with ActivityName
    data.sort_values(["Auftrags-\npositionsnummer", timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)  # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)  # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
I am trying to add a suffix to dataframes that are looked up from a dictionary.
Here is some sample code:
import pandas as pd
import numpy as np
from collections import OrderedDict
from itertools import chain
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
num_periods_3 = 5
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
dates3 = pd.date_range('1/1/2000 02:00:00', periods=num_periods_3, freq='10min')
# column_names = ['WS Avg','WS Max','WS Min','WS Dev','WD Avg']
# column_names = ['A','B','C','D','E']
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
column_names_3 = ['E', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = pd.DataFrame(np.random.randn(num_periods_3, len(column_names_3)), index=dates3, columns=column_names_3)
sep0 = '<~>'
suf1 = '_1'
suf2 = '_2'
suf3 = '_3'
ddict = {'df1': df1, 'df2': df2, 'df3': df3}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2', 'Suffix 3']}
Suff = {'Suffix 1': suf1, 'Suffix 2': suf2, 'Suffix 3': suf3}
## apply suffix to each data frame selected in order HERE
# Suffdict = [Suff[x] for x in Suffs['Suffixes']]
# print(Suffdict)
df4 = pd.concat([ddict[x] for x in frames_to_concat['Sheets']],
axis=1,
join='outer')
I want to add a suffix to each dataframe so that they can be distinguished when the dataframes are concatenated. I am having some trouble looking the suffixes up and then applying them to each dataframe. I have asked for df1 and df3 to be concatenated, and I would like only suffix 1 to be applied to df1 and suffix 2 to be applied to df3.
Order does not matter for the dataframe suffixes: if df2 and df3 were selected, suffix 1 would be applied to df2 and suffix 2 to df3; the last suffix would simply go unused.
Unless you are on Python 3.7+ (in 3.6 insertion order is only an implementation detail), you cannot guarantee order in dictionaries, and relying on it would mean your code could not run on any lower Python version. If you need order, you should be looking at lists instead.
You can store your dataframes as well as your suffixes in a list, and then use zip to add a suffix to each df in turn.
dfs = [df1, df2, df3]
sufs = [suf1, suf2, suf3]
df_sufs = [x.add_suffix(y) for x, y in zip(dfs, sufs)]
Based on your code/answer, you can load your dataframes and suffixes into lists, call zip, add a suffix to each one, and call pd.concat.
dfs = [ddict[x] for x in frames_to_concat['Sheets']]
sufs = [Suff[x] for x in Suffs['Suffixes']]
df4 = pd.concat([x.add_suffix(sep0 + y)
for x, y in zip(dfs, sufs)], axis=1, join='outer')
Ended up just making a simple counter loop for the problem. Here is my solution:
n = 0
for df in frames_to_concat['Sheets']:
    print(ddict[df])
    ddict[df] = ddict[df].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
    n = n + 1
Anyone have a better way to do this?
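A slightly tidier variant of that loop (a sketch reusing the question's ddict, Suff, Suffs, sep0 and frames_to_concat) drops the manual counter in favour of enumerate:
for n, name in enumerate(frames_to_concat['Sheets']):
    # the n-th selected sheet gets the n-th suffix
    ddict[name] = ddict[name].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
df4 = pd.concat([ddict[name] for name in frames_to_concat['Sheets']],
                axis=1, join='outer')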
I am building functions to help me load data from the web. The problem I am trying to solve is that column names differ depending on the source. For example, Yahoo Finance data column headings look like this: Open, High, Low, Close, Volume, Adj Close. Quandl.com has data sets with DATE, VALUE, date, value, etc. The mix of upper case and lower case throws everything off, and Value and Adj. Close for the most part mean the same thing. I want to associate columns with different names but the same meaning with one value. For example, Adj. Close and value both = AC; Open, OPEN, and open all = O.
So I have a CSV file ("Functions//ColumnNameChanges.txt") that stores dict() keys and values of column names:
Date,D
Open,O
High,H
and then I wrote this function to populate my dictionary
def DictKeyValuesFromText():
    Dictionary = {}
    TextFileName = "Functions//ColumnNameChanges.txt"
    with open(TextFileName, 'r') as f:
        for line in f:
            x = line.find(",")
            y = line.find("/")
            k = line[0:x]
            v = line[x+1:y]
            Dictionary[k] = v
    return Dictionary
This is the output of print(DictKeyValuesFromText())
{'': '', 'Date': 'D', 'High': 'H', 'Open': 'O'}
The next function is where my problems are at
def ChangeColumnNames(DataFrameFileLocation):
    x = DictKeyValuesFromText()
    df = pd.read_csv(DataFrameFileLocation)
    for y in df.columns:
        if y not in x.keys():
            i = input("The column " + y + " is not in the list, give a name:")
            df.rename(columns={y: i})
        else:
            df.rename(columns={y: x[y]})
    return df
df.rename is not working. This is the output I get from print(ChangeColumnNames("Tvix_data.csv")):
The column Low is not in the list, give a name:L
The column Close is not in the list, give a name:C
The column Volume is not in the list, give a name:V
The column Adj Close is not in the list, give a name:AC
Date Open High Low Close Volume \
0 2010-11-30 106.269997 112.349997 104.389997 112.349997 0
1 2010-12-01 99.979997 100.689997 98.799998 100.689997 0
2 2010-12-02 98.309998 98.309998 86.499998 86.589998 0
The column names should be D, O, H, L, C, V. I am missing something; any help would be appreciated.
df.rename works just fine, but it is not inplace by default. Either re-assign its return value or use inplace=True. It expects a dictionary with old names as keys and new names as values.
df = df.rename(columns = {'col_a': 'COL_A', 'col_b': 'COL_B'})
or
df.rename(columns = {'col_a': 'COL_A', 'col_b': 'COL_B'}, inplace=True)
Well, when you already have the dictionary, store it in a variable, say
DC = {'': '', 'Date': 'D', 'High': 'H', 'Open': 'O'}
DC can now be mapped to the dataframe columns like
df.columns = df.columns.map(DC)
In case you want to use rename() method you can simply go with
df = df.rename(columns = DC)
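Putting both points together, a minimal sketch of the question's function (reusing its DictKeyValuesFromText; building one mapping dict and renaming once at the end is my own restructuring, not something from the answers):
def ChangeColumnNames(DataFrameFileLocation):
    known = DictKeyValuesFromText()
    df = pd.read_csv(DataFrameFileLocation)
    mapping = {}
    for y in df.columns:
        if y in known:
            mapping[y] = known[y]
        else:
            # Prompt only for columns missing from the text file
            mapping[y] = input("The column " + y + " is not in the list, give a name:")
    return df.rename(columns=mapping)  # rename is not in place, so return/assign the result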
I am really struggling to make this work...
How can I take a Series, transform it into a DataFrame, add a column to it, and concatenate the results in a loop?
The correct syntax is a mystery to me; the pseudocode is below:
def func_B_Column(df):
    return 1

df_1 = (...)  # columns=['a', 'etc1', 'etc2']
df_2 = pandas.DataFrame(columns=['a', 'b', 'c'])
listOfColumnC = ['c1', 'c2', 'c3']
for var in listOfColumnC:
    series = df_1.groupby('a').apply(func_B_Column)  # series object should have now 'a' as index, and func_B_Column as value
    aux = series.to_frame('b')
    aux['c'] = aux.apply(lambda x: var, axis=1)  # add another column 'c' to the series object
    df_2 = df_2.append(aux)  # concatenate the results as rows, at the end
Edited after the question's refinement
df_2 = DataFrame()
for var in listOfColumnC:
    df_2 = df_2.append(DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var}))
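Since DataFrame.append has been removed in recent pandas releases, a sketch of the same idea that collects the pieces and concatenates once at the end (assuming pandas is imported as pd):
pieces = []
for var in listOfColumnC:
    # one frame per value of 'c'; the groupby result becomes column 'b', the scalar var broadcasts into 'c'
    pieces.append(pd.DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var}))
df_2 = pd.concat(pieces)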
I am trying to extract data from a CSV file using Python's pandas module. The experiment data has 6 columns (let's say a, b, c, d, e, f) and I have a list of model directories. Not every model has all 6 'species' (columns), so I need to split the data specifically for each model. Here is my code:
def read_experimental_data(self, experiment_path):
    [path, fle] = os.path.split(experiment_path)
    os.chdir(path)
    data_df = pandas.read_csv(experiment_path)
    # print data_df
    experiment_species = data_df.keys()  # (a,b,c,d,e,f)
    # print experiment_species
    for i in self.all_models_dirs:  # iterate through a list of model directories
        [path, fle] = os.path.split(i)
        model_specific_data = pandas.DataFrame()
        species_dct = self.get_model_species(i + '.xml')  # gives all the species (columns) in this particular model
        # print species_dct
        # gives me only species that are included in model dir i
        for l in species_dct.keys():
            for m in experiment_species:
                if l == m:
                    # how do i collate these pandas series into a single dataframe?
                    print data_df[m]
The above code gives me the correct data, but I'm having trouble collecting it in a usable format. I've tried to merge and concatenate them, but no joy. Does anybody know how to do this?
Thanks
You can create a new DataFrame from data_df by passing it a list of columns you want,
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
df_filtered = df[['a', 'c']]
or an example using some of your variable names,
import pandas as pd
data_df = pd.DataFrame({'a': [1,2], 'b': [3,4], 'c': [5,6],
'd': [7,8], 'e': [9,10], 'f': [11,12]})
experiment_species = data_df.keys()
species_dct = ['b', 'd', 'e', 'x', 'y', 'z']
good_columns = list(set(experiment_species).intersection(species_dct))
df_filtered = data_df[good_columns]
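Slotting that intersection idea back into the original loop might look roughly like this (a sketch using the question's self.all_models_dirs and get_model_species from inside the same method; collecting the results in a dict keyed by model directory is my own assumption):
model_data = {}  # one filtered DataFrame per model directory
for i in self.all_models_dirs:
    species_dct = self.get_model_species(i + '.xml')
    good_columns = list(set(data_df.columns).intersection(species_dct.keys()))
    model_data[i] = data_df[good_columns]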