To remove spaces from df.to_markdown() - python

When reading a txt file into a data frame, I want to add a new column based on the values in the existing columns, i.e. the sum of the numeric values from 'Stock' and 'Delivery'.
The problem is that the original data (from the data supplier) was generated with "df.to_markdown()".
It seems I can't remove the white spaces.
ds = pd.read_csv("C:\\TEMP\\ff.txt", sep="|", header=0, skipinitialspace=True)
ds.columns = ds.columns.str.strip()
ds['new'] = ds['Stock'] + ds['Delivery']
print(ds)
What would be the way to handle such a case? Thank you.
By the way, this simulates the txt file creation with "df.to_markdown()":
import pandas as pd

data = {'Price': [59, 98, 79],
        'Stock': [53, 60, 60],
        'Delivery': [11, 7, 6]}
df = pd.DataFrame(data)

with open("C:\\TEMP\\ff.txt", 'a') as outfile:
    outfile.write(df.to_markdown() + "\n")
    # the with block closes the file automatically, so no explicit close() is needed
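For reference, the resulting ff.txt looks roughly like this (exact column widths depend on the tabulate version, but the pipes and the dashed alignment row are always there):
|    |   Price |   Stock |   Delivery |
|---:|--------:|--------:|-----------:|
|  0 |      59 |      53 |         11 |
|  1 |      98 |      60 |          7 |
|  2 |      79 |      60 |          6 |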

This should do what you need.
ds = pd.read_csv(
    "C:\\TEMP\\ff.txt",
    sep="|",
    skiprows=[1],
    skipinitialspace=True
)
ds.columns = ds.columns.str.strip()
ds = ds.loc[:, ["Price", "Stock", "Delivery"]]
ds['new'] = ds['Stock'] + ds['Delivery']
print(ds)
Output:
   Price  Stock  Delivery  new
0     59     53        11   64
1     98     60         7   67
2     79     60         6   66
skiprows=[1] skips the row at index 1, which is the row with the --------:
With this row removed from the dataframe, pandas automatically interprets the Price, Stock, and Delivery columns as integers, which allows the statement ds['new'] = ds['Stock'] + ds['Delivery'] to work as expected.
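If you want to sanity-check the parse, a minimal sketch (using the ds from the code above):
print(ds.dtypes)
# With skiprows=[1], the Price, Stock and Delivery columns come out as int64;
# without it, the dashed alignment row forces them to object (strings),
# and '+' would concatenate strings instead of adding numbers.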

This works on the example you have provided:
pd.read_csv("~/Downloads/ff.txt", sep=r"\s*\|\s*", engine="python", skiprows=[1])[["Price", "Stock", "Delivery"]]
If you want something else, I suggest you provide an example of it.
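As a side note, the regex separator is the reason engine="python" is passed: the default C engine does not support regular-expression separators. A minimal sketch building on that one-liner to add the 'new' column (same ~/Downloads path as above, which is simply where that answer saved the file):
ds = pd.read_csv("~/Downloads/ff.txt", sep=r"\s*\|\s*", engine="python",
                 skiprows=[1])[["Price", "Stock", "Delivery"]]
ds['new'] = ds['Stock'] + ds['Delivery']
print(ds)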

Related

Python pandas convert csv file into wide long txt file and put the values that have the same name in the "MA" column in the same row

I want to get a file from the csv file formatted as follows:
CSV file:
Desired output txt file (Header italicized):
MA   Am1  Am2  Am3  Am4
MX1  X    Y    -    -
MX2  9    10   11   12
Any suggestions on how to do this? Thank you!
I need help writing the Python code to achieve this. I've tried looping through every row, but I'm still struggling to find a way to write it.
You can try this.
Group by the unique MA values (the 'name' column here) and collect each group's values into a list.
Create a new dataframe with it.
Expand the values list into columns and add them to the new dataframe.
Copy the name column from the first dataframe.
Move the 'name' column to the front.
Code:
import pandas as pd

df = pd.DataFrame([['MX1', 1, 222], ['MX1', 2, 222], ['MX2', 4, 44], ['MX2', 3, 222], ['MX2', 5, 222]],
                  columns=['name', 'values', 'etc'])
df_new = pd.DataFrame(columns=['name', 'values'])
for group in df.groupby('name'):
    df_new.loc[-1] = [group[0], group[1]['values'].to_list()]
    df_new.index = df_new.index + 1
    df_new = df_new.sort_index()
df_expanded = pd.DataFrame(df_new['values'].values.tolist()).add_prefix('Am')
df_expanded['name'] = df_new['name']
cols = df_expanded.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_expanded = df_expanded[cols]
print(df_expanded.fillna('-'))
Output:
  name  Am0  Am1  Am2
0  MX2    4    3  5.0
1  MX1    1    2    -

Pandas dataframe writing to excel as list. But I don't want data as list in excel

I have code which iterates through an Excel file and extracts values from its columns, which end up as lists in a dataframe. When I write the dataframe to Excel, I see the data wrapped in [] and, for strings, in quotes ['']. How can I remove [''] when I write to Excel?
Also, I want to write only the first value of the Product_ID column to Excel. How can I do that?
result = pd.DataFrame.from_dict(result) # result has list of data
df_t = result.T
writer = pd.ExcelWriter(path)
df_t.to_excel(writer, 'data')
writer.save()
My output to excel
I am expecting output as below, and the Product_ID column should only have the first value of each list.
I tried the code below and am getting an error:
path = "path to excel"  # placeholder
df = pd.read_excel(path, engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel(path)
I am getting the error below:
item = eval(data)
TypeError: eval() arg 1 must be a string, bytes or code object
df_t['id'] = df_t['id'].str[0]  # shortcut if you only want the 0th element
df_t['other_columns'] = df_t['other_columns'].apply(lambda x: " ".join(x))  # "unlist" the lists you have fed into a pandas column
This should give the effect you want, but you have to make sure that the data in each cell has the ['', ...] form; if it's different, you can modify the way it's handled in the data_clean function:
import pandas as pd

df = pd.read_excel("1.xlsx", engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel("new.xlsx")
The following is an example of df and the modified new_df (some randomly generated data):
# df
name Product_ID xxx yyy
0 ['Allen'] ['AF124', 'AC12414'] [124124] [222]
1 ['Aaszflen'] ['DF124', 'AC12415'] [234125] [22124124,124125]
2 ['Allen'] ['CF1sdv24', 'AC12416'] [123544126] [33542124124,124126]
3 ['Azdxven'] ['BF124', 'AC12417'] [35127] [333]
4 ['Allen'] ['MF124', 'AC12418'] [3528] [12352324124,124128]
5 ['Allen'] ['AF124', 'AC12419'] [122359] [12352324124,124129]
# new_df
name Product_ID xxx yyy
0 Allen AF124 124124 222
1 Aaszflen DF124 234125 22124124
2 Allen CF1sdv24 123544126 33542124124
3 Azdxven BF124 35127 333
4 Allen MF124 3528 12352324124
5 Allen AF124 122359 12352324124
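One aside on this approach (not part of the original answer): eval will execute whatever expression it finds in a cell, so for data you don't fully control, ast.literal_eval is a safer drop-in that only parses Python literals such as lists and strings. A sketch of data_clean with that swap:
import ast

def data_clean(x):
    for index, data in enumerate(x.values):
        item = ast.literal_eval(data)  # parses "['Allen']" into a list without executing code
        x.values[index] = item[0] if len(item) else ""
    return x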

Python pandas says columns can't be found but they exist within a csv file

So I have this script:
import pandas as pd
import numpy as np

PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'

def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads the csv file from the given path and casts the data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # drops any row whose 'Hugo_Symbol' value contains 'Hugo_Symbol'
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends a quote to the remaining data in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips whitespace from the data in this column
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # masks the rows where 'Variant_Classification' is 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops the masked rows
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
    gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
This is the csv file that the script reads in:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
The script gives this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-60-3f9c00f320bc> in <module>
----> 1 prep_data('test.csv')
<ipython-input-59-2a67d5c44e5a> in prep_data(mutation_path)
21 display(gene_mutation_df)
22 gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
---> 23 gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
24 gene_mutation_df = gene_mutation_df.reset_index()
25 gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
e:\Anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
4546
4547 if missing:
-> 4548 raise KeyError(f"None of {missing} are in the columns")
4549
4550 if inplace:
KeyError: "None of ['Hugo_Symbol', 'patient'] are in the columns"
Previously, I had this as that line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
But that also gave an error that set_names expected one argument but got two.
Any help would be much appreciated.
I would really prefer it if the csv data were changed instead of the script, so that the script could somehow work with set_names instead of set_index.
The issue is:
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
'Hugo_Symbol' is used for the groupby, so it is now in the index, not a column.
In the case of the sample data, an empty dataframe with no columns has been created.
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
print(gene_mutation_df) # print the dataframe to see what it looks like
print(gene_mutation_df.info()) # print the information for the dataframe
gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
# output
Empty DataFrame
Columns: [identifier, Hugo_Symbol, Tumor_Sample_Barcode, Variant_Classification, patient]
Index: []
Empty DataFrame
Columns: []
Index: []
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame
None
Reset the index
Resetting the index will make Hugo_Symbol a column again.
As long as the dataframe is not empty, the KeyError should be resolved.
gene_mutation_df = gene_mutation_df.reset_index() # try adding this line
gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
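A toy illustration of that behaviour, with made-up data rather than the question's (here sum() stands in for the apply call):
import pandas as pd

toy = pd.DataFrame({'Hugo_Symbol': ['A', 'A', 'B'], 'mutated': [1, 1, 1]})
grouped = toy.groupby('Hugo_Symbol').sum()
print(grouped.columns.tolist())                # ['mutated'] ('Hugo_Symbol' has moved into the index)
print(grouped.reset_index().columns.tolist())  # ['Hugo_Symbol', 'mutated']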
Additional Notes
There are a number of lines of code that may be resulting in an empty dataframe:
non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
Test if the dataframe is empty
Use .empty to determine if a dataframe is empty
def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads the csv file from the given path and casts the data to str
    df.columns = df.columns.str.strip()  # clean the column names here if there is leading or trailing whitespace
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # drops any row whose 'Hugo_Symbol' value contains 'Hugo_Symbol'
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends a quote to the remaining data in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips whitespace from the data in this column
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # masks the rows where 'Variant_Classification' is 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops the masked rows
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df = gene_mutation_df.reset_index()  # reset the index here
    print(gene_mutation_df)
    if gene_mutation_df.empty:  # check if the dataframe is empty
        print('The dataframe is empty')
    else:
        # gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)  # not needed; pivot won't work if you do this
        # gene_mutation_df = gene_mutation_df.reset_index()  # not needed; the index was already reset
        gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')  # values needs to be a column in the dataframe
        return gene_patient_mutations.transpose().fillna(0)
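For what it's worth, a minimal usage sketch: called on the sample test.csv from the question, every row is filtered out (the Silent row, the repeated header row, and the barcodes that don't match the -01 pattern), so the function should take the empty branch.
result = prep_data('test.csv')
# prints the intermediate (empty) dataframes, then 'The dataframe is empty';
# result is None because a pivot is only returned when some rows survive the filters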

Python: Add rows with different column names to dict/dataframe

I want to add data (dictionaries) to a dictionary, where every added dictionary represents a new row. It is an iterative process, and it is not known which column names a newly added dictionary (row) will have. In the end I want a pandas dataframe. Furthermore, I have to write the dataframe to a file every 1500 rows (which is a problem, because after 1500 rows it can of course happen that new data is added whose columns are not present in the 1500 rows already written to the file).
I need an approach which is very fast (maybe 26 ms per row). My approach is slow, because it has to check every piece of data for new column names, and in the end it has to reread the file to create a new file where all columns have the same length. The data comes from a queue which is processed in another process.
import pandas as pd

def writingData(writingQueue, exportFullName='path', buffer=1500, maxFiles=150000):
    # writingQueue has no default value, so it comes before the keyword arguments
    imagesPassed = 0
    with open(exportFullName, 'a') as f:
        columnNamesAllList = []
        columnNamesAllSet = set()
        dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
        columnNamesUpdated = False
        for data in iter(writingQueue.get, "STOP"):
            print(imagesPassed)
            dfTemp = pd.DataFrame([data], index=[imagesPassed])
            if set(dfTemp).difference(columnNamesAllSet):
                columnNamesAllSet.update(set(dfTemp))
                columnNamesAllList.extend(list(dfTemp))
                columnNamesUpdated = True
            else:
                columnNamesUpdated = False
            if columnNamesUpdated:
                print('Updated')
                dfTempAll = dfTemp.combine_first(dfTempAll)
            else:
                dfTempAll.iloc[imagesPassed - 1] = dfTemp.iloc[0]
            imagesPassed += 1
            if imagesPassed == buffer:
                dfTempAll.dropna(how='all', inplace=True)
                dfTempAll.to_csv(f, sep='\t', header=True)
                dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
                imagesPassed = 0
Reading it in again:
dfTempAll = pd.DataFrame(index=range(maxFiles), columns=columnNamesAllList)
for number, chunk in enumerate(pd.read_csv(exportFullName, delimiter='\t', chunksize=buffer, low_memory=True, memory_map=True, engine='c')):
    dfTempAll.iloc[number * buffer:(number + 1) * buffer] = pd.concat([chunk, columnNamesAllList]).values  # .to_csv(f, sep='\t', header=False)  # , chunksize=buffer
    # dfTempAll = pd.concat([chunk, dfTempAll])
dfTempAll.reset_index(drop=True, inplace=True)
dfTempAll.to_csv(exportFullName, sep='\t', header=True)
Small example with dataframes
So to make it clear: let's say I have an already existing 4-row dataframe (in the real case it could have 150000 rows, like in the code above) where 2 rows are already filled with data, and I add a new row. It could look like this, with the exception that the new data is a dictionary in the raw input:
df1 = pd.DataFrame(index=range(4), columns=['A','B','D'], data={'A': [1, 2, 'NaN', 'NaN'], 'B': [3, 4, 'NaN', 'NaN'], 'D': [3, 4, 'NaN', 'NaN']})
df2 = pd.DataFrame(index=[2], columns=['A','C','B'], data={'A': [0], 'B': [0], 'C': [0]})
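As an illustration of the alignment problem (a minimal sketch, not part of the question): combine_first aligns the differing columns, and building a frame from a plain list of dicts at the end does the same alignment in a single call, which may be faster than growing the dataframe row by row:
merged = df2.combine_first(df1)
print(merged.columns.tolist())  # ['A', 'B', 'C', 'D'], the union of both column sets

rows = [{'A': 1, 'B': 3, 'D': 3}, {'A': 2, 'B': 4, 'D': 4}, {'A': 0, 'B': 0, 'C': 0}]
print(pd.DataFrame(rows))       # keys missing from a dict simply become NaN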

How do I import CSV to Pandas df where data is organized by an index column with a parent/child relationship?

I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row content and is an index series that repeats for each account (Acct01, Acct02...). Rows with index values 1 and 2 are one-to-one associated with each account (the parent). I would like to flatten this data into a dataframe that associates the account-level data (index = 1, 2) with its associated series data (1000, 1001, 1002, 1003...), the child data, in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos','centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos','centrifugal','N',1001,749,132,987
I've been able to do this in a very mechanical, very slow row-by-row process:
import pandas as pd
import numpy as np
import time

file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = []  # Parse data
acct = []   # Account Data
row = {}    # Assembly Container

# Set dataframe columns
df = pd.DataFrame(columns=['Account', 'Name', 'Type', 'Flag', 'Counter', 'CNT01', 'CNT02', 'CNT03'])

# open the file and read through it line by line
with open(file, 'r') as f:
    for line in f:
        # Strip each line
        pdata = [x.strip() for x in line.split(',')]
        # Use the index to put account-level data into acct[] for the rows with index 1 or 2
        indx = int(pdata[0])
        if indx == 1:
            acct.clear()
            acct.append(pdata[1])
            acct.append(pdata[2])
        elif indx == 2:
            acct.append(pdata[1])
            acct.append(pdata[2])
        else:
            row.clear()
            row['Account'] = acct[0]
            row['Name'] = acct[1]
            row['Type'] = acct[2]
            row['Flag'] = acct[3]
            row['Counter'] = pdata[0]
            row['CNT01'] = pdata[1]
            row['CNT02'] = pdata[2]
            row['CNT03'] = pdata[3]
            if indx > 2:
                # data.append(row)
                df = df.append(row, ignore_index=True)

t1 = time.time()
totalTimeDf = t1 - t0
TTDf = '%.3f' % (totalTimeDf)
print(TTDf + " Seconds to Complete df: " + file)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
This works but is tragically slow. I suspect there is a very easy pythonic way to import and organize this into a df. It appears an OrderedDict will properly organize the data as follows:
import csv
from collections import OrderedDict

od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
    csvfile = open(file_name, 'rt')
except:
    print("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
    key = row[0]
    od.setdefault(key, []).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderedDict I haven't been able to figure out how to combine keys 1 and 2, associate them with the account-specific series keys (1000, 1001), and then append that into a df. How do I go from an OrderedDict to a df while flattening the parent/child data? Or is there a better way to process this data?
I'm not sure if it's the fastest or most pythonic way, but I believe a pandas approach might do, since you need to iterate over groups of rows in a rather specific way:
First, importing the libraries to work with:
import pandas as pd
import numpy as np
Since I didn't have a file to load, I just recreated it as an array (for this part you'll have to do some work yourself, or simply loading it into a pandas DataFrame with 4 columns, like the next step, will be fine):
data = [[1,'Acct01','Freds Autoshop'],
[2,'3-way-Cntrl','Y' ],
[1000,576,686,837 ],
[1001,683,170,775 ],
[1002,333,44,885 ],
[1003,611183,12,1 ],
[1,'Acct02','Daves Tacos' ],
[2,'centrifugal','N' ],
[1000,334,787,143 ] ,
[1001,749,132,987],
[1,'Acct03','Norah Jones' ],
[2,'undertaker','N' ],
[1000,323,1,3 ] ,
[1001,311,2,111 ] ,
[1002,95,112,4]]
Created a dataframe with the above data and added new columns filled with numpy's NaNs (faster than pandas') as placeholders.
df = pd.DataFrame(data)
df['4'] = np.nan
df['5'] = np.nan
df['6'] = np.nan
df['7'] = np.nan
df['8'] = np.nan
df.columns = ['idx', 'Account', 'Name', 'Type', 'Flag', 'Counter', 'CNT01', 'CNT02', 'CNT3']
Making a new df that records every time "AcctXXXX" appears and how many rows below it there are until the next parent.
# Getting the unique "Acct" and their index position into an array
acct_idx_pos = np.array([df[df['Account'].str.contains('Acct').fillna(False)]['Account'].values, df[df['Account'].str.contains('Acct').fillna(False)].index.values])
# Making a df with the transposed array
df_pos = pd.DataFrame(acct_idx_pos.T, columns=['Acct', 'Position'])
# Shifting the values into a new column and filling the last value (nan) with the df length
df_pos['End_position'] = df_pos['Position'].shift(-1)
df_pos['End_position'][-1:] = len(df)
# Making the column we want, that is the number of loops we'll go
df_pos['Position_length'] = df_pos['End_position'] - df_pos['Position']
A custom function that uses a dummy DataFrame and concatenates temporary ones (it will be used later):
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    To avoid retyping the same line of code for every df.
    The parameters should be the temporary df created at each loop and the concatenated df that will
    contain all values, which must first be initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Created a function that will loop to fill each row and drop duplicated rows:
# a complicated loop function
def shorthen_df(df, num_iterations):
    # to not delete the original df
    dataframe = df.copy()
    # for the slicing, we need to start at the first row
    curr_row = 1
    # fill the current row's nan values with values from the following rows
    dataframe.iloc[curr_row-1:curr_row:, 3] = dataframe.iloc[curr_row:curr_row+1:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 4] = dataframe.iloc[curr_row:curr_row+1:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 5] = dataframe.iloc[curr_row+1:curr_row+2:, 0].values
    dataframe.iloc[curr_row-1:curr_row:, 6] = dataframe.iloc[curr_row+1:curr_row+2:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 7] = dataframe.iloc[curr_row+1:curr_row+2:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 8] = dataframe.iloc[curr_row+1:curr_row+2:, 3].values
    # the "num_iterations-2" is because the first two lines are filled and not replaced
    # as the next ones will be, so this varies correctly for each "account"
    for i in range(1, num_iterations-2):
        # replaces the next row with values from the previous row
        dataframe.iloc[curr_row+(i-1):curr_row+i:] = dataframe.iloc[curr_row+(i-2):curr_row+(i-1):].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 5] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 0].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 6] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 1].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 7] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 2].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 8] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 3].values
    # drop the last 2 rows of the df
    dataframe = dataframe[0:len(dataframe)-2]
    return dataframe
Finally, creating the dummy DF that will concat all "Acct" groups, looping over each one with its position, using both functions above.
df_final = pd.DataFrame()
for start, end, iterations in zip(df_pos.Position.values, df_pos.End_position.values, df_pos.Position_length.values):
    df2 = df[start:end]
    df_temp = shorthen_df(df2, iterations)
    df_final = concatenate_loop_dfs(df_temp, df_final)

# Dropping first/unnecessary columns
df_final.drop('idx', axis=1, inplace=True)
# resetting index
df_final.reset_index(inplace=True, drop=True)
df_final
returns
Account Name Type Flag Counter CNT01 CNT02 CNT3
0 Acct01 Freds Autoshop 3-way-Cntrl Y 1000.0 576 686 837
1 Acct01 Freds Autoshop 3-way-Cntrl Y 1001.0 683 170 775
2 Acct01 Freds Autoshop 3-way-Cntrl Y 1002.0 333 44 885
3 Acct01 Freds Autoshop 3-way-Cntrl Y 1003.0 611183 12 1
4 Acct02 Daves Tacos centrifugal N 1000.0 334 787 143
5 Acct02 Daves Tacos centrifugal N 1001.0 749 132 987
6 Acct03 Norah Jones undertaker N 1000.0 323 1 3
7 Acct03 Norah Jones undertaker N 1001.0 311 2 111
8 Acct03 Norah Jones undertaker N 1002.0 95 112 4
