Fuse rows by name and sum values of all rows - python

I have a file coming from Tableau that I currently transform manually (to get rid of special characters I couldn't handle in the code, and of unwanted columns).
What I would like is to not have to treat it manually, but I cannot find a solution.
From the original Excel file I only need three columns, which I will call A, B and C. Once all the useless columns are removed, it looks something like this (I don't show the other columns, but they still have to be dropped).
IMPORTANT: in column C the separator is not a regular whitespace but a special character:
A                      B                      C
project name           reference 1            $111 111
fused cell with above  fused cell with above  $15 214
fused cell with above  fused cell with above  $462 134
fused cell with above  fused cell with above  $70 900
project name 2         reference 2            $787 741
fused cell with above  fused cell with above  $41 414
fused cell with above  fused cell with above  $462 134
fused cell with above  fused cell with above  $4 500
fused cell with above  fused cell with above  $2 415
project name 3         reference 3            $111
project name 4         reference 4            $642 874
Edit: a screenshot of the fused cells:
In the final file, I need all the lines of a project to become one by summing the values from column C and putting the result in the row where the project name is indicated.
Thank you in advance for your advice!
Here is my current code to transform the file (yes, I import a CSV, but the original file is an Excel file; the CSV is what I get after transforming the Excel file manually):
elif typeOfFile == "9":
#Import Excel file
DataMonthly = pd.read_csv (filename, usecols = ['A', 'B', 'C'], sep=';')
# Select only Wanted Data
df=pd.DataFrame(DataMonthly)
#Create a "DPAC" column and fill it with the specified code of the entity
df['DPAC'] = 'code'
firstCol = df.pop('DPAC')
df.insert(0, 'DPAC', firstCol)
#Create an 'Item' Column
df['Item'] = np.nan
firstCol = df.pop('Item')
df.insert(1, 'Item', firstCol)
df.columns.values[2] = 'A'
df.columns.values[3] = 'B'
#Create a 'Segment' Column
df['Segment'] = ''
firstCol = df.pop('Segment')
df.insert(4, 'Segment', firstCol)
#Create an 'EndCustomerCountry' Column
df['EndCustomerCountry'] = ''
firstCol = df.pop('EndCustomerCountry')
df.insert(5, 'EndCustomerCountry', firstCol)
df.columns.values[6] = 'C'
#Create a 'SubSegment' Column
df['SubSegment'] = ''
firstCol = df.pop('SubSegment')
df.insert(7, 'SubSegment', firstCol)
#Create a 'TechnologyName' Column
df['TechnologyName'] = ''
firstCol = df.pop('TechnologyName')
df.insert(8, 'TechnologyName', firstCol)
#Define cols
new_cols = ['DPAC', 'Item', 'A', 'B', 'Segment', 'EndCustomerCountry', 'C', 'SubSegment', 'TechnologyName']
df=df.reindex(columns=new_cols)
#Clean "C" column
df['C'] = df['C'].str.replace(r'\s', '')
df['C'] = df['C'].str[1:]
print(df['C'])
# aggregation_functions = {'C': 'sum'}
# df = df.groupby(df['B']).aggregate(aggregation_functions)
#Create Dataframe
df = pd.DataFrame(data=df).reset_index(drop=True)
if file_exists:
df.to_csv ('C:/Program Files/Data_Arranger/Output Files/Monthly.csv', mode='a', header=False, index=False)
else:
df.to_csv ('C:/Program Files/Data_Arranger/Output Files/Monthly.csv', mode='a', header=True, index=False)

Under the assumption that the dataframe df you get from reading the Excel/CSV file looks something like
df = pd.DataFrame({
    "A": ["project1", np.NaN, np.NaN, "project2", np.NaN],
    "B": ["reference1", np.NaN, np.NaN, "reference2", np.NaN],
    "C": ["$111 111", "$15 214", "$462 134", "$70 900", "$787 741"],
})
          A           B         C
0  project1  reference1  $111 111
1       NaN         NaN   $15 214
2       NaN         NaN  $462 134
3  project2  reference2   $70 900
4       NaN         NaN  $787 741
you could try
res = (
    df.assign(C=df["C"].str.replace(r"[$\s]", "", regex=True).astype("int"))
      .groupby([df["A"].ffill(), df["B"].ffill()]).agg({"C": "sum"})
      .reset_index()
)
to get
          A           B       C
0  project1  reference1  588459
1  project2  reference2  858641
The writing back is still a bit unclear to me. If you want the totals to replace all individual contributions to that total, then you could directly do
df["A"], df["B"] = df["A"].ffill(), df["B"].ffill()
df["C"] = (
df.assign(C=df["C"].str.replace("[\$\s]", "", regex=True).astype("int"))
.groupby(["A", "B"])["C"].transform("sum")
)
to get
          A           B       C
0  project1  reference1  588459
1  project1  reference1  588459
2  project1  reference1  588459
3  project2  reference2  858641
4  project2  reference2  858641

Related

Optimizing an Excel to Pandas import and transformation from wide to long data

I need to import and transform xlsx files. They are written in a wide format and I need to reproduce some of the cell information from each row and pair it up with information from all the other rows:
[Edit: changed format to represent the more complex requirements]
Source format
ID  Property  Activity1name  Activity1timestamp  Activity2name  Activity2timestamp
1   A         a              1.1.22 00:00        b              2.1.22 10:05
2   B         a              1.1.22 03:00        b              5.1.22 20:16
Target format
ID  Property  Activity  Timestamp
1   A         a         1.1.22 00:00
1   A         b         2.1.22 10:05
2   B         a         1.1.22 03:00
2   B         b         5.1.22 20:16
The following code works fine to transform the data, but the process is really, really slow:
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records")  # Convert to dict for speed optimization
    for row_dict in tqdm(data_in):  # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step + 1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column] + rep]
            data = data.append(new_row, ignore_index=True)  # append dict of new_row to existing data
    data.index.name = "SortKey"
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp  # TODO check if works as intended
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)   # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)   # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data
Obviously, iterating over each row and then even each column is not at all how to use pandas the right way, but I don't see how this kind of transformation can be vectorized.
I have tried using parallelization (modin) and played around with using dict or not, but it didn't work / help. The rest of the script literally just opens and saves the files, so the problem lies here.
I would be very grateful for any ideas on how to improve the speed!
The df.melt function should be able to do this type of operation much faster.
df = pd.DataFrame({'ID': [1, 2],
                   'Property': ['A', 'B'],
                   'Info1': ['x', 'a'],
                   'Info2': ['y', 'b'],
                   'Info3': ['z', 'c'],
                   })
data = df.melt(id_vars=['ID', 'Property'], value_vars=['Info1', 'Info2', 'Info3'])
** Edit to address modified question **
Combine the df.melt with df.pivot operation.
# create data
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Property': ['A', 'B', 'C'],
                   'Activity1name': ['a', 'a', 'a'],
                   'Activity1timestamp': ['1_1_22', '1_1_23', '1_1_24'],
                   'Activity2name': ['b', 'b', 'b'],
                   'Activity2timestamp': ['2_1_22', '2_1_23', '2_1_24'],
                   })
# melt dataframe
df_melted = df.melt(id_vars=['ID', 'Property'],
                    value_vars=['Activity1name', 'Activity1timestamp',
                                'Activity2name', 'Activity2timestamp'],
                    )
# merge categories, i.e. Activity1name / Activity2name become Activity
df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'), 'variable'] = 'Timestamp'
# add category ids (dataframe may need to be sorted before this operation)
u_category_ids = np.arange(1, len(df_melted.variable.unique()) + 1)
category_ids = np.repeat(u_category_ids, len(df) * 2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) + '_' + category_ids)
# pivot table
table = df_melted.pivot_table(index=['unique_id', 'ID', 'Property'],
                              columns='variable', values='value',
                              aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
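For what it's worth, the same wide-to-long reshape can also be expressed with pd.wide_to_long once the step number is moved to the end of the column names so it can act as a suffix. A small sketch on the question's sample columns (not part of the answer above):

import re
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "Property": ["A", "B"],
    "Activity1name": ["a", "a"],
    "Activity1timestamp": ["1.1.22 00:00", "1.1.22 03:00"],
    "Activity2name": ["b", "b"],
    "Activity2timestamp": ["2.1.22 10:05", "5.1.22 20:16"],
})

# Activity1name -> name1, Activity1timestamp -> timestamp1, ... so the step
# number becomes a suffix that pd.wide_to_long can split off.
df = df.rename(columns=lambda c: re.sub(r"Activity(\d+)(name|timestamp)", r"\2\1", c))

long_df = (
    pd.wide_to_long(df, stubnames=["name", "timestamp"], i=["ID", "Property"], j="step")
      .reset_index()
      .rename(columns={"name": "Activity", "timestamp": "Timestamp"})
      .drop(columns="step")
      .sort_values(["ID", "Activity"])
)
print(long_df)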
Using pd.melt, as suggested by @Pantelis, I was able to speed up this transformation so extremely much, it's unbelievable. Before, a file with ~13k rows took 4-5 hours on a brand-new ThinkPad X1 - now it takes less than 2 minutes! That's a speed-up by a factor of 150, just wow. :)
Here's my new code, for inspiration / reference if anyone has a similar data structure:
def transform(data_in):
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching))
    # Specify columns for pd.melt, transforming wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung":
                 data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row + 1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = \
                data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = \
                data_in.loc[0, "Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns,
                   var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True)  # Replace "Erledigungstermin Auftragsschrittbeschreibung" with ActivityName
    data.sort_values(["Auftrags-\npositionsnummer", timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'.000', '', regex=True, inplace=True)  # Remove trailing zeros from timestamp
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True)  # Replace cells with only spaces with NaN
    data.dropna(axis=0, how="all", inplace=True)   # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True)   # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True)  # Drop rows with empty Timestamp
    data.fillna('', inplace=True)  # Replace NaN values with empty cells
    return data

Python pandas convert csv file into wide long txt file and put the values that have the same name in the "MA" column in the same row

I want to generate, from the CSV file, a file formatted as follows:
CSV file:
Desired output txt file (header italicized):
MA   Am1  Am2  Am3  Am4
MX1  X    Y    -    -
MX2  9    10   11   12
Any suggestions on how to do this? Thank you!
I need help writing the Python code to achieve this. I've tried to loop through every row, but I'm still struggling to find a way to write it.
You can try this:
Based on unique MA value groups, get the values (the 'name' column here).
Create a new dataframe with them.
Expand the values list into columns and add it to the new dataframe.
Copy the name column from the first dataframe.
Reorder the 'name' column.
Code:
import pandas as pd

df = pd.DataFrame([['MX1', 1, 222], ['MX1', 2, 222], ['MX2', 4, 44], ['MX2', 3, 222], ['MX2', 5, 222]],
                  columns=['name', 'values', 'etc'])
df_new = pd.DataFrame(columns=['name', 'values'])
for group in df.groupby('name'):
    df_new.loc[-1] = [group[0], group[1]['values'].to_list()]
    df_new.index = df_new.index + 1
    df_new = df_new.sort_index()
df_expanded = pd.DataFrame(df_new['values'].values.tolist()).add_prefix('Am')
df_expanded['name'] = df_new['name']
cols = df_expanded.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_expanded = df_expanded[cols]
print(df_expanded.fillna('-'))
Output:
  name Am0 Am1  Am2
0  MX2   4   3  5.0
1  MX1   1   2    -
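A more compact variant of the same idea, for reference: number the values within each 'name' group with cumcount and pivot those numbers into columns. A sketch on the same toy data (not part of the answer above):

import pandas as pd

df = pd.DataFrame([['MX1', 1, 222], ['MX1', 2, 222], ['MX2', 4, 44], ['MX2', 3, 222], ['MX2', 5, 222]],
                  columns=['name', 'values', 'etc'])

# Label each value with its position inside its 'name' group (Am1, Am2, ...),
# then pivot those labels into columns.
wide = (
    df.assign(col='Am' + (df.groupby('name').cumcount() + 1).astype(str))
      .pivot(index='name', columns='col', values='values')
      .reset_index()
      .fillna('-')
)
print(wide)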

Pandas dataframe writing to excel as list. But I don't want data as list in excel

I have code which iterates through an Excel file and extracts values from its columns, which end up as lists in a dataframe. When I write the dataframe to Excel, I see the data wrapped in [] and, for strings, in quotes ['']. How can I remove the [''] when I write to Excel?
Also, I want to write only the first value of the Product_ID column to Excel. How can I do that?
result = pd.DataFrame.from_dict(result) # result has list of data
df_t = result.T
writer = pd.ExcelWriter(path)
df_t.to_excel(writer, 'data')
writer.save()
My output to Excel:
I am expecting the output below, and the Product_ID column should only contain the first value of each list.
I tried the code below and am getting an error:
path = "path to excel"
df = pd.read_excel(path, engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel(path)
I am getting below error:
item = eval(data)
TypeError: eval() arg 1 must be a string, bytes or code object
df_t['id'] = df_t['id'].str[0] # this is a shortcut for if you only want the 0th index
df_t['other_columns'] = df_t['other_columns'].apply(lambda x: " ".join(x)) # this is to "unlist" the lists of lists which you have fed into a pandas column
This should give the effect you want, but you have to make sure that the data in each cell is in ['', ...] form; if it's different, you can modify the way it's handled in the data_clean function:
import pandas as pd

df = pd.read_excel("1.xlsx", engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel("new.xlsx")
The following is an example of df and modified new_df(Some randomly generated data):
# df
name Product_ID xxx yyy
0 ['Allen'] ['AF124', 'AC12414'] [124124] [222]
1 ['Aaszflen'] ['DF124', 'AC12415'] [234125] [22124124,124125]
2 ['Allen'] ['CF1sdv24', 'AC12416'] [123544126] [33542124124,124126]
3 ['Azdxven'] ['BF124', 'AC12417'] [35127] [333]
4 ['Allen'] ['MF124', 'AC12418'] [3528] [12352324124,124128]
5 ['Allen'] ['AF124', 'AC12419'] [122359] [12352324124,124129]
# new_df
name Product_ID xxx yyy
0 Allen AF124 124124 222
1 Aaszflen DF124 234125 22124124
2 Allen CF1sdv24 123544126 33542124124
3 Azdxven BF124 35127 333
4 Allen MF124 3528 12352324124
5 Allen AF124 122359 12352324124
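If the cells really are string representations of Python lists, ast.literal_eval is a safer alternative to eval, and an isinstance check avoids the TypeError you get when a cell already holds a list instead of a string. A sketch under those assumptions (the file names are placeholders):

import ast
import pandas as pd

def first_or_blank(cell):
    # Cells may be strings like "['AF124', 'AC12414']" or already-parsed lists.
    if isinstance(cell, str):
        try:
            cell = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell  # plain string, keep as-is
    if isinstance(cell, (list, tuple)):
        return cell[0] if cell else ""
    return cell

df = pd.read_excel("1.xlsx", engine="openpyxl")
new_df = df.applymap(first_or_blank)  # apply the cleanup to every cell
new_df.to_excel("new.xlsx", index=False)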

Comparing multiple columns of same CSV file and returning result to another CSV file using Python

I have a CSV file with 7 columns.
This is my CSV file:
Application,Expected Value,ADER,UGOM,PRD
APP,CVD2,CVD2,CVD2,CVD2
APP1,"VCF7,hg6","VCF7,hg6","VCF8,hg6","VCF7,hg6"
APP1,"VDF9,pova8","VDF9,pova8","VDF10,pova10","VDF9,pova11"
APP2,gf8,gf8,gf8,gf8
APP3,pf8,pf8,gf8,pf8
APP4,vd3,mn7","vd3,mn7","vd3,mn7","vd3,mn7"
Here I want to compare the Expected Value column with the columns after it (that is, ADER, UGOM, PRD).
Here is my code in Python:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('file1.csv', index_col='Expected Value')
df2 = pd.read_csv('file1.csv', index_col='ADER')
df3 = pd.DataFrame(columns=['status'], index=df1.index)
df3['status'] = (df1['Expected Value'] == df2['ADER']).replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
This does not create any output.csv file, nor does it generate any output. Can anyone help?
So I edited the code, based on a comment by @Vlado:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('first.csv')
df3 = pd.DataFrame(columns=['Application','Expected Value','ADER','status of AdER'], index=df1.index)
df3['Application'] = df1['Application']
df3['Expected Value'] = df1['Expected Value']
df3['ADER'] = df1['ADER']
df3['status'] = (df1['Expected Value'] == df1['ADER'])
df3['status'].replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
Now it works for one column, ADER, but my headers after Expected Value are dynamic and may change: sometimes there is one column after Expected Value, sometimes N columns, and the header names may also change. Can someone help with how to do that?
The piece of code given below generates the desired output. It compares the Expected Value column with the rest of the columns after it.
import pandas as pd

df = pd.read_csv("input.csv")
expected_value_index = df.columns.get_loc("Expected Value")
for col_index in range(expected_value_index + 1, len(df.columns)):
    column = df.columns[expected_value_index] + " & " + df.columns[col_index]
    df[column] = df.loc[:, "Expected Value"] == df.iloc[:, col_index]
    df[column].replace([True, False], ["Matching", "No Matching"], inplace=True)
df.to_csv("output.csv", index=None)
I haven't tried replicating your code as of yet but here are a few suggestions:
You do not need to read the df two times.
df1 = pd.read_csv('FinalResult1.csv')
is sufficient.
Then, you can proceed with
df1['status'] = (df1['exp'] == df1['ader'])
df1['status'] = df1['status'].replace([True, False], ['Matching', 'Not Matching'])
Alternatively, you could do this row by row by using the pandas apply method.
If that doesn't work a reasonable first step would be to print your dataframe out to see what is happening.
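For illustration, that row-wise apply variant could look like the sketch below, using the column names from the question (this is an assumption, not code from the answer):

df1['status'] = df1.apply(
    lambda row: 'Matching' if row['Expected Value'] == row['ADER'] else 'Not Matching',
    axis=1,
)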
Try this code
k = list(df1.columns).index('Expected Value') + 1
# get the integer index for the column after 'Expected Value'
df3 = df1.iloc[:, :k]
# copy first k columns
df3 = pd.concat([df3, (df1.iloc[:, k:] == np.repeat(
    df1['Expected Value'].to_frame().values, df1.shape[1] - k, axis=1))], axis=1)
# slice df1 with iloc, which works just as slicing lists in python
# np.repeat, used to repeat 'Expected Value' as many columns as needed (df1.shape[1]-k=3)
# .to_frame, slicing a column from a df returns a 1D series so we turn it back into a 2D df
# .values, returns the underlying numpy array without any index/column names
# ...without .values Pandas would try to find 3 columns named 'Expected Value' in df1
# concatenate previous df3 with this calculation
print(df3)
Output
Application FileName ConfigVariable Expected Value ADER UGOM PRD
0 APP1 FileName1 ConfigVariable1 CVD2 True True True
1 APP1 FileName2 ConfigVariable2 VCF7,hg6 True False True
2 APP1 FileName3 ConfigVariable3 VDF9,pova8 True False False
3 APP2 FileName4 ConfigVariable4 gf8 True True True
4 APP3 FileName5 ConfigVariable5 pf8 True False True
5 APP4 FileName6 ConfigVariable vd3,mn7 True True True
Of course you can do a loop if for some reason you need a special calculation on some column
for colname in df1.columns[k:]:
    df3[colname] = df1[colname] == df1['Expected Value']
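For what it's worth, the broadcasting done with np.repeat above can also be written with DataFrame.eq, which compares every column against a Series row by row. A sketch of the same comparison (the input file name is an assumption):

import pandas as pd

df1 = pd.read_csv('input.csv')
k = df1.columns.get_loc('Expected Value') + 1
# Compare every column after 'Expected Value' against it, row by row.
comparison = df1.iloc[:, k:].eq(df1['Expected Value'], axis=0)
df3 = pd.concat([df1.iloc[:, :k],
                 comparison.replace({True: 'Matching', False: 'Not Matching'})], axis=1)
df3.to_csv('output.csv', index=False)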

Highlight panda df errors based on conditions

Good day SO community,
I have been having an issue with trying to highlight errors in my df, row by row.
reference_dict = {'jobclass': ['A', 'B'], 'Jobs': ['Teacher', 'Plumber']}
dict = {'jobclass': ['A', 'C', 'A'], 'Jobs': ['Teacher', 'Plumber', 'Policeman']}
df = pd.DataFrame(data=dict)

def highlight_rows(df):
    for i in df.index:
        if df.jobclass[i] in reference_dict['jobclass']:
            print(df.jobclass[i])
            return 'background-color: green'

df.style.apply(highlight_rows, axis=1)
I am getting the error:
TypeError: ('string indices must be integers', 'occurred at index 0')
What I hope to get is my df with the values not found in my reference_dict highlighted.
Any help would be greatly appreciated. Cheers!
Edit:
x = {'jobclass': ['A', 'B'], 'Jobs': ['Teacher', 'Plumber']}
d = {'jobclass': ['A', 'C', 'A'], 'Jobs': ['Teacher', 'Plumber', 'Policeman']}
df = pd.DataFrame(data=d)
print(df)

def highlight_rows(s):
    ret = ["" for i in s.index]
    for i in df.index:
        if df.jobclass[i] not in x['jobclass']:
            ret[s.index.get_loc('Jobs')] = "background-color: yellow"
    return ret

df.style.apply(highlight_rows, axis=1)
Tried this and got the whole column highlighted instead of the specific row values that I want.. =/
You can use merge with the indicator parameter to find non-matching values and then create a DataFrame of styles:
x = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
d = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
df = pd.DataFrame(data=d)
print (df)
jobclass Jobs
0 A Teacher
1 C Plumber
2 A Policeman
Detail:
print (df.merge(pd.DataFrame(x) , on='jobclass', how='left', indicator=True))
jobclass Jobs_x Jobs_y _merge
0 A Teacher Teacher both
1 C Plumber NaN left_only
2 A Policeman Teacher both
def highlight_rows(s):
    c1 = 'background-color: yellow'
    c2 = ''
    df1 = pd.DataFrame(x)
    m = s.merge(df1, on='jobclass', how='left', indicator=True)['_merge'] == 'left_only'
    df2 = pd.DataFrame(c2, index=s.index, columns=s.columns)
    df2.loc[m, 'Jobs'] = c1
    return df2

df.style.apply(highlight_rows, axis=None)
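If the merge feels heavy for this case, Series.isin gives the same "not in the reference" mask; a small sketch that highlights whole non-matching rows rather than just the Jobs cell (that is my assumption about the desired output, not part of the answer above):

import pandas as pd

reference_dict = {'jobclass': ['A', 'B'], 'Jobs': ['Teacher', 'Plumber']}
d = {'jobclass': ['A', 'C', 'A'], 'Jobs': ['Teacher', 'Plumber', 'Policeman']}
df = pd.DataFrame(data=d)

def highlight_unknown(data):
    # Build a same-shaped DataFrame of CSS strings and colour the rows whose
    # jobclass is not in the reference list.
    mask = ~data['jobclass'].isin(reference_dict['jobclass'])
    styles = pd.DataFrame('', index=data.index, columns=data.columns)
    styles.loc[mask, :] = 'background-color: yellow'
    return styles

df.style.apply(highlight_unknown, axis=None)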
Good day to you as well!
"What i hope to get is my df with values not found in my reference_dict being highlighted."
If you're looking for values not found in reference_dict to be highlighted, do you mean for the function to be the following?
def highlight_rows(df):
    for i in df.index:
        if df.jobclass[i] not in reference_dict['jobclass']:
            print(df.jobclass[i])
            return 'background-color: green'
Either way, why highlight the rows when you could isolate them? It seems like you want to look at all of the job classes in df where there is not one in reference_dict.
import pandas as pd
reference_dict = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
data_dict = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
ref_df = pd.DataFrame(reference_dict)
df = pd.DataFrame(data_dict)
outliers = df.merge(ref_df, how='outer', on='jobclass') # merge the two tables together, how='outer' includes jobclasses which the DataFrames do not have in common. Will automatically generate columns Jobs_x and Jobs_y once joined together because the columns have the same name
outliers = outliers[ outliers['Jobs_y'].isnull() ] # Jobs_y is null when there is no matching jobclass in the reference DataFrame, so we can take advantage of that by filtering
outliers = outliers.drop('Jobs_y', axis=1) # let's drop the junk column after we used it to filter for what we wanted
print("The reference DataFrame is:")
print(ref_df,'\n')
print("The input DataFrame is:")
print(df,'\n')
print("The result is a list of all the jobclasses not in the reference DataFrame and what job is with it:")
print(outliers)
The result is:
The reference DataFrame is:
jobclass Jobs
0 A Teacher
1 B Plumber
The input DataFrame is:
jobclass Jobs
0 A Teacher
1 C Plumber
2 A Policeman
The result is a list of all the jobclasses not in the reference DataFrame and what job is with it:
jobclass Jobs_x
2 C Plumber
This could have been a tangent but it's what I'd do. I was not aware you could highlight rows in pandas at all, cool trick.
