Grouping data and comparing it from excel in python - python

I am working on a project that uses Python to select certain values from an Excel file. I am using the xlrd and openpyxl libraries to do this.
The program should work as follows:
Group all the data point entries that belong to a certain card task. These are marked in column E. For example, all of the entries between row 26 and row 28 are in Card Task A, and hence they should be grouped together. Entries without a “Card Task” value in column E should be ignored.
Next…
Look at the value in column N (lastExecTime) of a row and compare that time with the value in column M of the following row.
If the times overlap (column M is less than the previous N value), increment a variable called “count”. Count stores the number of times a procedure overlaps.
Finally…
As for the output, the goal is to create a separate text file that displays which tasks are overlapping, and how many tasks overlap in a certain Card Task.
The problem that I am running into is that I cannot pair the data from a card task.
Here is a sample of the excel data:
(Two screenshots of the spreadsheet were linked here.)
And here is the code that I have written that tells me if there are multiple procedures going on:
from openpyxl import load_workbook
book = load_workbook('LearnerSummaryNoFormat.xlsx')
sheet = book['Sheet1']
for row in sheet.rows:
    # Column E is index 4; empty cells have the value None
    if not str(row[4].value or '').startswith('Card Task'):
        print("Is not a card task: " + str(row[1].value))
Essentially my problem is that I am not able to compare all the values from one card task with each other.

I would read through the data once, as you already do, but store all rows with 'Card Task' in a separate list. Once you have a list of only card task items, you can compare them.
card_task_row_object_list = []
for row in sheet.rows:
    # row[4] is a Cell object, so test its value; empty cells hold None
    if 'Card Task' in str(row[4].value or ''):
        card_task_row_object_list.append(row)
From here you would want to compare the time values. What do you need to check: whether two different card task times overlap?
(index 12, column M: start; index 13, column N: end)
def compare_times(card_task_row_object_list):
    count = 0
    for i, row in enumerate(card_task_row_object_list):
        # Compare each row only with the rows after it, so each pair is counted once
        for comparison_row in card_task_row_object_list[i + 1:]:
            # Two intervals overlap when each one starts before the other ends
            if (comparison_row[12].value < row[13].value
                    and comparison_row[13].value > row[12].value):
                count += 1
    return count
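To tie this back to the grouping the question asks for, here is a minimal sketch that counts overlaps per card task rather than across the whole sheet. It assumes the rows have already been read into plain `(task, start, end)` tuples (columns E, M and N) and that the times are comparable values; the function name `count_overlaps_by_task` and the sample data are made up for illustration.

```python
from collections import defaultdict

def count_overlaps_by_task(rows):
    """Group (task, start, end) tuples by task and count overlaps
    between consecutive entries within each task."""
    groups = defaultdict(list)
    for task, start, end in rows:
        # Skip rows whose column E value is empty or not a card task
        if task and task.startswith("Card Task"):
            groups[task].append((start, end))

    counts = {}
    for task, intervals in groups.items():
        intervals.sort()  # order entries by start time
        count = 0
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            # Overlap: the next entry starts before the previous one has ended
            if next_start < prev_end:
                count += 1
        counts[task] = count
    return counts

# Hypothetical sample data: (column E, column M, column N), times as plain numbers
sample = [
    ("Card Task A", 0, 10),
    ("Card Task A", 5, 12),
    ("Card Task A", 12, 20),
    (None, 3, 4),
    ("Card Task B", 0, 5),
    ("Card Task B", 6, 9),
]
overlaps = count_overlaps_by_task(sample)
print(overlaps)
```

Each key and count in the returned dictionary can then be written to the text file one line at a time with an ordinary `open(..., "w")` block.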

Related

python nested for loop keeps running indefinitely, yet the intended list is being created

I have a dataframe called 'dft' of Netflix's TV shows and movies, with a column named "listed_in" whose entries are strings listing all the genres a show is classified under. Each row has a different number of genres. The genres are written as strings and separated by commas.
A single entry is something like, for example: 'Documentary','International TV Shows','Crime TV Shows'. Another row may be classified under a different number of genres, some of which may be the same as the genres of other rows.
Now I want to create a list of the unique values in all the rows.
genres = []
for i in range(0,len(dft['listed_in'].str.split(','))):
    for j in range(0,len(dft['listed_in'].str.split(',')[i])):
        if (dft['listed_in'].str.split(',')[i][j]) not in genres:
            genres.append(dft['listed_in'].str.split(',')[i][j])
        else:
            pass
This keeps the kernel running indefinitely. But the thing is, the list is being created. If I interrupt the kernel after some time and print the list, it's there.
Then, I create a dataframe out of this list with the intention of having a column with the count of times each genre appears in the original dataframe.
data = {'Genres':genres,'count':[0 for i in range(0,len(genres))]}
gnr = pd.DataFrame(data = data)
Then to change the count column to each genre's count of occurrence:
for i in range(0,65):
    for j in range(0,514):
        if gnr.loc[i,'Genres'] in (dft['listed_in'].str.split(',').index[j]):
            gnr.loc[i,'count'] = gnr.loc[i,'count'] + dft['listed_in'].str.split(',').value_counts()[j]
        else:
            pass
Then again this code keeps running indefinitely, but after interrupting it I saw the count for the 1st entry was updated in the gnr dataframe.
I don't know what is happening.
Are you sure that the process actually hangs? For loops over pandas objects are much slower than you would expect, especially with the number of iterations you are doing (65 * 514 = 33,410), and each iteration here calls str.split on the whole column again. If you haven't already, I'd put in a print(i) so you get some insight into which iteration you're on.
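If the loops do turn out to be the bottleneck, a vectorised version avoids them altogether. The following is only a sketch: the small `dft` below is a hypothetical stand-in for the real dataframe, and it assumes a pandas version that has `Series.explode` (0.25 or later).

```python
import pandas as pd

# Hypothetical stand-in for the 'dft' dataframe in the question
dft = pd.DataFrame({
    "listed_in": [
        "Documentaries, International TV Shows, Crime TV Shows",
        "Crime TV Shows, Comedies",
        "Comedies",
    ]
})

genre_counts = (
    dft["listed_in"]
    .str.split(",")   # each cell becomes a list of genres
    .explode()        # one row per genre
    .str.strip()      # drop spaces left around the commas
    .value_counts()   # occurrences of each unique genre
)

# Same two-column layout as the 'gnr' dataframe in the question
gnr = genre_counts.rename_axis("Genres").reset_index(name="count")
print(gnr)
```

The split happens once for the whole column instead of once per loop iteration, and the unique genres and their counts come out of the same pass.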

How to apply condition on column while iterating over every row of dataframe in python

Link of the data sets of csv file
The link contains a .csv file in which one column is the state name and another is the number of teeth lost. I want to calculate the average number of teeth lost by children in every state. I tried to use df.iterrows, but I cannot apply a condition to a column value of a particular row.
for row in df.iterrows():
    if row["State"] == "NSW":
        Count += row["Number of teeth lost"]
If NSW has three values (2, 3, 4), I need to calculate the average of these three numbers, and the same for the other six states. I am using pandas to manipulate the CSV file.
Try using df.loc[df['State'] == "NSW"]['Number of tooth lost'].mean()
It selects all the rows where the condition inside the .loc bracket is true and then just selects the column 'Number of tooth lost' to compute the mean for.
This is much faster than iterating all the rows as you did because pandas handles the operations on a lower level.
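To get the average for every state at once rather than one state at a time, a groupby does the same selection for each distinct value. This is a sketch with made-up data; the column names "State" and "Number of teeth lost" are assumptions and should be replaced with the actual headers in the CSV file.

```python
import pandas as pd

# Hypothetical data; the real column names come from the CSV file
df = pd.DataFrame({
    "State": ["NSW", "NSW", "NSW", "VIC", "VIC"],
    "Number of teeth lost": [2, 3, 4, 1, 3],
})

# One mean per state, computed without an explicit Python loop
avg_by_state = df.groupby("State")["Number of teeth lost"].mean()
print(avg_by_state)
```

The result is a Series indexed by state, so `avg_by_state["NSW"]` gives the same number as the `.loc` expression above.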

Tactic for comparing dataframes when column names are different and sequence is unknown

I need to compare two DataFrames at a time to find out if the values match or not. One DataFrame is from an Excel workbook and the other is from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This would prevent me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, but they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but not only is it very choppy, but I'm concerned it will begin to take up a considerable amount of memory to run.
The solution entails using lists in a list. If the excel value is in the same list as the SQL value they are considered a match and the function will return the final order that the SQL DataFrame must change to in order to match the same order that the Excel DataFrame is using. In case I missed some possibilities and the newly created order list has a different length than what is needed, I simply return the original SQL list of headers in the original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names are much higher than the example below. Any suggestions anyone has on how to improve this function, or offer a better solution to this problem, would be appreciated.
Here is an example:
#Example data
exceltab1 = {'ColA':[1,2,3],
             'ColB':[3,4,1],
             'ColC':[4,1,2]}
exceltab2 = {'cColumn':[10,15,17],
             'aColumn':[5,7,8],
             'bColumn':[9,8,7]}
sqltab1 = {'Col/A':[1,2,3],
           'Col/C':[4,1,2],
           'Col/B':[3,4,1]}
sqltab2 = {'col_banana':[9,8,7],
           'col_apple':[5,7,8],
           'col_carrot':[10,15,17]}
#Code
import pandas as pd
ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)
#This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()
def translate(excel_headers, sql_headers):
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for i in range(len(excel_headers)):
        for group in translator:
            for item in sql_headers:
                if excel_headers[i] in group and item in group:
                    order.append(item)
                    break
    if len(order) != len(sql_headers):
        return sql_headers
    else:
        return order
sq1 = sq1[translate(list(ec1.columns), list(sq1.columns))]
#This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm to do this with is Hopcroft-Karp, which has a Python implementation here. That finds maximum matchings, so you still have to check whether it found a perfect matching (that is, each column from A has an associated column from B).
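As an illustration of the matching idea, here is a small sketch that uses the simpler augmenting-path method (often called Kuhn's algorithm) instead of Hopcroft-Karp; it finds the same maximum matching, just with a worse worst-case running time. The function name `match_columns` is made up, and the columns are passed as plain lists, which is what `DataFrame.to_dict("list")` would produce.

```python
def match_columns(table_a, table_b):
    """Return {column in A: column in B} for a maximum matching of equal columns."""
    # Edge A -> B whenever the two columns hold identical values
    edges = {a: [b for b in table_b if table_a[a] == table_b[b]] for a in table_a}
    matched_b = {}  # column in B -> column in A currently assigned to it

    def try_assign(a, seen):
        for b in edges[a]:
            if b in seen:
                continue
            seen.add(b)
            # Take b if it is free, or if its current partner can move elsewhere
            if b not in matched_b or try_assign(matched_b[b], seen):
                matched_b[b] = a
                return True
        return False

    for a in table_a:
        try_assign(a, set())
    return {a: b for b, a in matched_b.items()}

# Columns as plain lists, taken from the example data in the question
excel = {"ColA": [1, 2, 3], "ColB": [3, 4, 1], "ColC": [4, 1, 2]}
sql = {"Col/A": [1, 2, 3], "Col/C": [4, 1, 2], "Col/B": [3, 4, 1]}

mapping = match_columns(excel, sql)
# A perfect matching pairs every column on both sides
is_perfect = len(mapping) == len(excel) == len(sql)
print(mapping, is_perfect)
```

Note that this compares values row by row, so it assumes both tables are sorted the same way, and if two columns in one table are identical the mapping between them is arbitrary.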

xlwings syntax for n-rows and specific columns with output to new sheet

I've been trying to find a good resource for the syntax needed with xlwings and have been unsuccessful. I am trying to make a program that will repeat for n rows of data and output certain information to a new sheet. Here is a snippet of the algorithm. If you can point me to a good reference or just lend a hand I'd be grateful.
data = number of rows in worksheet #either input the number manually or automate
for row n to data: #start at row 1 and loop through each line of data
    axles = get row n, column M data #retrieve data in column M
    if axles<2: #Test the data from column M for validity
        continue #return to the for loop and start on next line
    distance = get row n, column Q data #retrieve data in column Q
    if distance < 100 or distance > 300: #test the data from column Q for validity
        continue #return to the for loop and start on next line
    weight = get row n, column P data #retrieve data in column P
    print weight into row n, column A on sheet 2 #display output on a new sheet
xlwings is a pretty cool interface to Excel; the Range object will be doing the heavy lifting for your application. Depending on whether your columns are all together, you could use either the table or vertical methods to read the data in all at once or column by column. Here are two equivalent approaches for a simple set of data in Excel:
axles distance weight
1 150 1.5
2 200 2
1 250 2.5
2 300 3
4 350 3.5
The python code is:
from xlwings import Workbook, Range
wb = Workbook(r'C:\\Full\\Path\\to\\Book1.xlsx')
# Method 1:
# if your table is all together read it in at once:
# read all data in as table
allrows = Range('Sheet1', 'A2').table.value
for rownum, row in enumerate(allrows):
    axles = row[0]
    if axles < 2:
        continue
    distance = row[1]
    if distance < 100 or distance > 300:
        continue
    weight = row[2]
    # +2 to correct for python indexing and header row
    Range('Sheet2', (rownum+2, 1)).value = weight
# Method 2:
# if your columns are separated read them individually:
# read all data in as columns
axles = Range('Sheet1', 'A2').vertical.value
distance = Range('Sheet1', 'B2').vertical.value
weight = Range('Sheet1', 'C2').vertical.value
# in case the columns have different lengths, look at the shortest one
for rownum in range(min(len(axles), len(distance), len(weight))):
    if axles[rownum] < 2:
        continue
    if distance[rownum] < 100 or distance[rownum] > 300:
        continue
    # +2 to correct for python indexing and header row
    Range('Sheet2', (rownum+2, 1)).value = weight[rownum]
In either case, the second and fourth data points will be written to Sheet2 on the same rows as in Sheet1.
xlwings is a package for the Python programming language. To learn Python, you can have a start on the official site, for example: https://www.python.org/about/gettingstarted/

Updating a cell based with the value that came before it

I want to develop a script to update individual cells (row of a specific column) of an attribute table based on the value of the cell that comes immediately before it as well as data in other columns but in the same row. I'm sure that this can be done with cursors but I'm having trouble conceptualizing exactly how to tackle this.
Essentially what I want to do is this:
If Column A, row 13 = a certain value AND Column B, row 13 = a certain value (but different from A), then change Column A, row 13 to be the same value as Column A, row 12.
If this can't be done with cursors then maybe some kind of array or matrix, or list of lists would be the way to go? I'm basically looking for the best direction to take with this. EDIT: My files are shapefiles or I also have them in .csv format. My code is really basic right now:
import arcpy
from arcpy import env
env.workspace = "C:/All Data Files/My Documents All/My Documents/wrk"
inputLyr = "C:/All Data Files/My Documents All/My Documents/wrk/file.lyr"
fields = ["time", "lon", "activityIn", "time", "fixType"]
cursor180 = arcpy.da.SearchCursor(inputLyr, fields, """"lon" = -180""")
for row in cursor180:
    # Print the rows that have no data, along with activity Intensity
    print row[0], row[1], row[2]
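One possible direction, sketched here on plain Python rows rather than on a shapefile: walk the rows in order and remember the previous value of the column as you go. The function name `fill_from_previous`, the field names and the sample values are all hypothetical; with arcpy the same loop body could sit inside an `arcpy.da.UpdateCursor`, calling `updateRow` whenever a row is changed.

```python
def fill_from_previous(rows, col_a, col_b, a_value, b_value):
    """Where col_a == a_value and col_b == b_value, copy col_a from the row before."""
    previous = None
    for row in rows:
        if previous is not None and row[col_a] == a_value and row[col_b] == b_value:
            row[col_a] = previous
        # Remember the (possibly updated) value for the next row
        previous = row[col_a]
    return rows

# Hypothetical rows, e.g. loaded from the .csv version with csv.DictReader
rows = [
    {"lon": 151.2, "fixType": "gps"},
    {"lon": -180, "fixType": "none"},
    {"lon": -180, "fixType": "none"},
    {"lon": 150.9, "fixType": "gps"},
]
fill_from_previous(rows, "lon", "fixType", -180, "none")
print([r["lon"] for r in rows])
```

Because `previous` is updated after each row is fixed, a run of several flagged rows all inherit the last good value; storing the original value instead would make each row copy only its immediate, unmodified predecessor.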
