Exclude given columns when sorting a table - python

I am trying to sort a table but would like to exclude given columns, by name, from the sort. In other words, the given columns should remain where they were before sorting. This is aimed at dealing with columns like "Don't know", "NA", etc.
The API I'm using is unique and company-specific, but it uses Python.
A table in this API is an object which is a list of rows, where each row is a list of cells and each cell is a list of cell values.
I currently have a working function which sorts a table, but I am struggling to find a way to modify it to exclude given columns by name.
FYI - "Matrix" can be thought of as the table itself.
def SortColumns(byRow=0, usingCellValue=0, descending=True):
    """
    :param byRow: Use the values in this row to determine the sort order of
        the columns.
    :param usingCellValue: When there are multiple values within a cell, use
        this to control which value row within each cell is used for sorting
        (zero-based).
    :param descending: Determines the order in which the values should be
        sorted.
    """
    for A in range(0, Matrix.Count):
        for B in range(0, Matrix.Count):
            if A == B:
                continue  # do not compare a column against itself
            valA = Matrix[byRow][A][usingCellValue].NumericValue if Matrix[byRow][A].Count > usingCellValue else None
            valB = Matrix[byRow][B][usingCellValue].NumericValue if Matrix[byRow][B].Count > usingCellValue else None
            if descending:
                if valB < valA:
                    Matrix.SwitchColumns(A, B)
            else:
                if valA < valB:
                    Matrix.SwitchColumns(A, B)
I am thinking of adding a new parameter which takes a list of column names, and using it to bypass those columns.
Something like:
def SortColumns(fixedcolumns, byRow=0,usingCellValue=0,descending=True):

While iterating through the columns, you can use the continue statement to skip the columns that you don't want to move. Put these conditions at the start of your two loops (a fuller sketch follows the snippet below):
for A in range(0, Matrix.Count):
    a_name = ???  # somehow get the name of column A
    if a_name in fixedcolumns:
        continue
    for B in range(0, Matrix.Count):
        b_name = ???  # somehow get the name of column B
        if b_name in fixedcolumns:
            continue
        if A == B:
            continue
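Putting it together, a minimal sketch of the modified function. Since the API is proprietary, ColumnLabel is a hypothetical helper standing in for however your API actually exposes a column's name; substitute the real accessor:

def SortColumns(fixedcolumns, byRow=0, usingCellValue=0, descending=True):
    for A in range(0, Matrix.Count):
        if ColumnLabel(A) in fixedcolumns:  # hypothetical name accessor
            continue
        for B in range(0, Matrix.Count):
            if ColumnLabel(B) in fixedcolumns:  # hypothetical name accessor
                continue
            if A == B:
                continue  # do not compare a column against itself
            valA = Matrix[byRow][A][usingCellValue].NumericValue if Matrix[byRow][A].Count > usingCellValue else None
            valB = Matrix[byRow][B][usingCellValue].NumericValue if Matrix[byRow][B].Count > usingCellValue else None
            if descending:
                if valB < valA:
                    Matrix.SwitchColumns(A, B)
            else:
                if valA < valB:
                    Matrix.SwitchColumns(A, B)

Because fixed columns are skipped in both loops, SwitchColumns is only ever called on two movable columns, so the fixed columns keep their positions. As in the original, the None comparisons assume an interpreter that allows ordering None against numbers (Python 2 does; Python 3 raises TypeError).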

Related

Populating a subset of rows in a dataframe with values from another column / Collapsing several columns

First time posting here. I expect there's a better way of doing what I'm trying to do. I've been going round in circles for days and would really appreciate some help.
I am working with survey data about prisoners and their sentences.
Each prisoner has a type for the purpose of the survey, and this is stored in the column 'prisoner_type'. For each prisoner type, there is a group of 5 columns where their offenses can be recorded (not all columns are necessarily used). I'd like to collapse these groups of columns into one set of 5 columns and add these to the dataset so that, on each row, there is one set of 5 columns where I can find the offenses.
I have created a dictionary to look up the column names that the offence codes and offence types are stored in for each prisoner type. The key in the outer dictionary is the prisoner type. Here is an abridged version:
offense_variables = {
    3: {'codes': {1: 'V0114', 2: 'V0115', 3: 'V0116', 4: 'V0117', 5: 'V0118'},
        'off_types': {1: 'V0124', 2: 'V0125', 3: 'V0126', 4: 'V0127', 5: 'V0128'}},
    8: {'codes': {1: 'V0270', 2: 'V0271', 3: 'V0272', 4: 'V0273', 5: 'V0274'},
        'off_types': {1: 'V0280', 2: 'V0281', 3: 'V0282', 4: 'V0283', 5: 'V0285'}}}
I am first creating 10 new columns: offense_1...offense_5 and type_1...type_5.
I am then trying to:
Use pandas .loc to locate all the rows for a given prisoner type.
Set the values for the new columns by looking up the variable for each offense number under that prisoner type in the dictionary, and assigning that column as the new values.
Problems:
The code doesn't terminate. I'm not sure why it's running on and on.
I receive the error message "A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
pris_types = [3, 8]
for pt in pris_types:
    # five offenses are listed in the survey, so we need five columns to hold
    # offence codes and five to hold offence types
    # 1 and 2 are just placeholder values
    for item in [i+1 for i in range(5)]:
        dataset[f'off_{item}_code'] = '1'
        dataset[f'off_{item}_type'] = '2'
    # then use .loc to get indexes for this prisoner type
    # look up the variable of the column that we need to take the values from
    # using the dictionary shown above
    for item in [i+1 for i in range(5)]:
        dataset.loc[dataset['prisoner_type'] == pt,
                    dataset[f'off_{item}_code']] = \
            dataset[offense_variables[pt]['codes'][item]]
        dataset.loc[dataset[prisoner_type] == pt,
                    dataset[f'off_{item}_type']] = \
            dataset[offense_variables[pt]['types'][item]]
The problem is that in your .loc[] sections, you just need to use the column label (string object) to identify the column where values are to be set, not the entire series/column object, as you are currently doing. With your current code, you are creating new columns named with values stored in the dataset[f'off_{item}_type'] columns. So, instead of:
for item in [i+1 for i in range(5)]:
    dataset.loc[dataset['prisoner_type'] == pt,
                dataset[f'off_{item}_code']] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset[prisoner_type] == pt,
                dataset[f'off_{item}_type']] = \
        dataset[offense_variables[pt]['types'][item]]
use:
for item in range(1, 6):
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_code'] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_type'] = \
        dataset[offense_variables[pt]['off_types'][item]]
(I simplified your range loop line too, quoted 'prisoner_type' in the second .loc, and used your dictionary's actual 'off_types' key.)
Also, the statements creating the 10 new columns don't need to be inside the loop over prisoner types; you can move them outside of that loop. In fact, you don't need to create them manually at all: the .loc[] assignments will create the columns for you.
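As a quick, self-contained illustration of that behaviour (a sketch with made-up data, not the survey dataset):

import pandas as pd

df = pd.DataFrame({'prisoner_type': [3, 8, 3],
                   'V0114': ['a', 'b', 'c']})

# Passing the label string creates the column and fills only matching rows.
df.loc[df['prisoner_type'] == 3, 'off_1_code'] = df['V0114']
print(df)
#    prisoner_type V0114 off_1_code
# 0              3     a          a
# 1              8     b        NaN
# 2              3     c          c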

Remove rows from dataframe whose text does not contain items from a list

I am importing data from a table with inconsistent naming conventions. I have created a list of manufacturer names that I would like to use as a basis of comparison against the imported names. Ideally, I will delete all rows from the dataframe that do not align with the manufacturer list. I am trying to create an index vector using a for loop that iterates through each element of the dataframe column and compares it against the list: if the text is there, the index vector is set to true; if not, to false. Finally, I want to use the index vector to drop rows from the original dataframe.
I have tried generators and sets, but to no avail. I thought a for loop would be less elegant but would ultimately work, yet I'm still stuck. My code is below.
meltdat.Products is my dataframe column that contains the imported data
mfgs is my list of manufacturer names
prodex is my index vector
meltdat = pd.DataFrame(
    {"Location": ["S1", "S1", "S1", "S1", "S1"],
     "Date": ["1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020"],
     "Products": ['CC304RED', 'COHoney', 'EtainXL', 'Med467', 'MarysTop'],
     "Sold": [1, 3, 0, 1, 2]})

mfgs = ['CC', 'Etain', 'Marys']

for prods in meltdat.Products:
    if any(mfg in meltdat.Products[prods] for mfg in mfgs):
        prodex[prods] = TRUE
    else:
        prodex[prods] = FALSE
I added example data in the dataframe that mirrors my imported data.
You can use pandas apply (here Series.apply, since meltdat.Products is a column) to build a boolean mask:
meltdat[meltdat.Products.apply(lambda x: any(m in x for m in mfgs))]
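A vectorized alternative (a sketch; note that str.contains interprets the pattern as a regular expression, so manufacturer names containing special characters need escaping):

import re

pattern = '|'.join(re.escape(m) for m in mfgs)
meltdat[meltdat.Products.str.contains(pattern)]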

Tactic for comparing dataframes when column names are different and sequence is unknown

I need to compare two DataFrames at a time to find out whether the values match. One DataFrame is from an Excel workbook and the other is from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This prevents me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, but they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but not only is it very choppy, but I'm concerned it will begin to take up a considerable amount of memory to run.
The solution entails using lists in a list. If the excel value is in the same list as the SQL value they are considered a match and the function will return the final order that the SQL DataFrame must change to in order to match the same order that the Excel DataFrame is using. In case I missed some possibilities and the newly created order list has a different length than what is needed, I simply return the original SQL list of headers in the original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names are much higher than the example below. Any suggestions anyone has on how to improve this function, or offer a better solution to this problem, would be appreciated.
Here is an example:
#Example data
exceltab1 = {'ColA': [1, 2, 3],
             'ColB': [3, 4, 1],
             'ColC': [4, 1, 2]}
exceltab2 = {'cColumn': [10, 15, 17],
             'aColumn': [5, 7, 8],
             'bColumn': [9, 8, 7]}
sqltab1 = {'Col/A': [1, 2, 3],
           'Col/C': [4, 1, 2],
           'Col/B': [3, 4, 1]}
sqltab2 = {'col_banana': [9, 8, 7],
           'col_apple': [5, 7, 8],
           'col_carrot': [10, 15, 17]}
#Code
import pandas as pd

ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)

#This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()

def translate(excel_headers, sql_headers):
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for header in excel_headers:
        for group in translator:
            for item in sql_headers:
                if header in group and item in group:
                    order.append(item)
                    break
    if len(order) != len(sql_headers):
        return sql_headers
    else:
        return order

sq1 = sq1[translate(list(ec1.columns), list(sq1.columns))]

#This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm to do this is Hopcroft-Karp, which has a Python implementation here. That finds maximum matchings, so you still have to check whether it found a perfect matching (that is, each column from A has an associated column from B).
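A minimal sketch of that idea, assuming the networkx library (which ships a Hopcroft-Karp implementation) rather than the linked standalone code, and assuming both frames have the same number of rows:

import networkx as nx
from networkx.algorithms import bipartite

def match_columns(excel_df, sql_df):
    # One vertex per column; the ('excel', name) / ('sql', name) tuples keep
    # the two sides distinct even if a name appears on both.
    G = nx.Graph()
    excel_nodes = [('excel', c) for c in excel_df.columns]
    G.add_nodes_from(excel_nodes)
    G.add_nodes_from([('sql', c) for c in sql_df.columns])
    # Add an edge wherever two columns are element-wise equal.
    for ec in excel_df.columns:
        for sc in sql_df.columns:
            if (excel_df[ec].values == sql_df[sc].values).all():
                G.add_edge(('excel', ec), ('sql', sc))
    matching = bipartite.hopcroft_karp_matching(G, top_nodes=excel_nodes)
    # hopcroft_karp_matching finds a maximum matching; verify it is perfect.
    if any(node not in matching for node in excel_nodes):
        return None
    return {ec: matching[('excel', ec)][1] for ec in excel_df.columns}

# e.g. match_columns(ec1, sq1) -> {'ColA': 'Col/A', 'ColB': 'Col/B', 'ColC': 'Col/C'},
# so the translated order is [match_columns(ec1, sq1)[c] for c in ec1.columns].

This avoids maintaining the hand-written translator list entirely.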

SQLAlchemy: filter rows based on the values contained in cells of another column

I am new to Python and SQLAlchemy, and I am wondering whether we can filter rows of a table based on cell values of a column in the same table.
example:
Sbranch = value
result = Transaction.query.filter(Transaction.branch == Sbranch) \
    .order_by(desc(Transaction.id)).limit(50).all()
If the value of Sbranch is 0, I want to read all the rows regardless of branch; otherwise I want to keep only the rows where Transaction.branch == Sbranch.
I know that this can be achieved by comparing the value of Sbranch (if-else conditions), but it gets complicated as the number of such columns increases.
Example:
Sbranch = value1
trans_by = value2
trans_to = value3
.
.
result = Transaction.query.filter(Transaction.branch == Sbranch,
                                  Transaction.trans_by == trans_by,
                                  Transaction.trans_to == trans_to) \
    .order_by(desc(Transaction.id)).limit(50).all()
I want to apply a similar optional filter on all three columns. Is there any built-in function in SQLAlchemy for this problem?
You can add the filter conditionally, based on the value of SBranch:
query = Transaction.query
if SBranch != 0:
    query = query.filter(Transaction.branch == SBranch)
result = query.order_by(Transaction.id.desc()).limit(50).all()
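The same pattern extends to any number of optional filters. A sketch, assuming the trans_by and trans_to columns from the question and treating 0 as the "don't filter" sentinel for each:

query = Transaction.query
optional_filters = [
    (Transaction.branch, Sbranch),
    (Transaction.trans_by, trans_by),
    (Transaction.trans_to, trans_to),
]
for column, value in optional_filters:
    if value != 0:  # 0 means "do not filter on this column"
        query = query.filter(column == value)
result = query.order_by(Transaction.id.desc()).limit(50).all()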
I think I found a solution. It's not the best, but it reduces the work for the developer (not the processor).
Sbranch = value
branches = []
if Sbranch == 0:
    # Append all the values for which rows may be kept,
    # for example:
    branches = [1, 2, 4, 7, 3, 8]
else:
    branches.append(Sbranch)
result = Transaction.query.filter(Transaction.branch.in_(branches)) \
    .order_by(desc(Transaction.id)).limit(50).all()

How to access particular column name in XlsxWriter through dictionary key?

I want to create an Excel file like the one in the image above, with more than 100 columns:
row_header = ['Student_id', 'Student_name', 'Collage_name', 'From', 'To',
              .........up to 100 column names]
My simple question is how to write flexible code, so that if the order of the names changes I don't have to rewrite it.
For the first row, which contains only the column names, I wrote the following code. This creates the header row:
for item in row_header:
    worksheet.write(row, col, item)
    worksheet.set_row(row, col)
    col += 1
Now the problem is that I am getting a list of dictionaries, where each dictionary contains one student's details, i.e. 100 key-value pairs:
student_list = [
    {'collage_name': 'IIIT-A', 'Student_name': 'Rakesh', 'Student_id': 1, .....up to 100 keys},
    {'collage_name': 'IIIT-G', 'Student_name': 'Shyam', 'Student_id': 2, ........up to 100 keys}]
As you can see, the key order does not match the column-name order. If I write each cell out the way I wrote the header row, it will take 100 lines of code. So I am looking for a solution where the cell value is assigned according to the column name. How can I use XlsxWriter with key-value dictionaries so that less code is required?
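One approach (a sketch, not a tested answer for your exact data): drive both the header row and every data row from row_header, looking each cell value up by key, so the dictionaries' internal ordering never matters. This assumes the dictionary keys exactly match the header names; the question mixes 'collage_name' and 'Collage_name' casing, which would need normalizing first.

import xlsxwriter

workbook = xlsxwriter.Workbook('students.xlsx')
worksheet = workbook.add_worksheet()

# Header row: the column position is fixed by row_header, not by dict order.
for col, name in enumerate(row_header):
    worksheet.write(0, col, name)

# One row per student; look each cell up by its column name.
for row, student in enumerate(student_list, start=1):
    for col, name in enumerate(row_header):
        worksheet.write(row, col, student.get(name))  # blank cell if key missing

workbook.close()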
