Finding a logic to map the values of two .tsv files

Finding a logic to map the values of two .tsv files - python

I am working on a problem where I have 2 .tsv files and one has been arranged wrongly with respect to the other one.
When I scan the file , I noticed a pattern which I am unable to put it in terms of coding language. The pattern that I observed was :
For every increase in the row number of metadata file = 8 rows of increment to match in the flipped_metadata.tsv file to match the same values in the metadata file
For every increase in the flipped_metadata file = 12 rows if increment in the metadata.tsv file to match the same values in the flipped_metadata file.
For more clarity I have attached the 2 .tsv files along with this:
Metadata.tsv file and Flipped_metadata.tsv file

The openpyxl library has good functions for dealing with Excel cell locations. These can be used to convert A1 to a proper row and column.
Read each row in and convert the cell reference to a simple numeric row and column value. Use a dictionary to store each cell found with the two values for that cell. e.g. cells[(1,1)] = "123 456"
Whilst reading in, keep a track of the largest row and column.
Create an empty array (list of lists) to allow each cell to be assigned into.
Iterate over all of the dictionary items and assign each value into the array.
Finally save the array to a new CSV file.
For example:
from openpyxl.utils.cell import coordinate_from_string, column_index_from_string
import csv
def flip(input_filename, output_filename):
cells = {}
max_row = 0
max_col = 0
with open(input_filename) as f_input:
for cell, v1, v2 in csv.reader(f_input, delimiter='\t'):
col_letter, row_number = coordinate_from_string(cell)
col_number = column_index_from_string(col_letter)
cells[(row_number, col_number)] = f"{v1} {v2}"
if row_number > max_row:
max_row = row_number
if col_number > max_col:
max_col = col_number
output = [[''] * max_col for _ in range(max_row)]
for (row_number, col_number), values in cells.items():
output[row_number-1][col_number-1] = values
with open(output_filename, 'w', newline='') as f_output:
csv.writer(f_output).writerows(output)
flip('metadata.tsv', 'output_metadata.csv')
flip('flipped_metadata.tsv', 'output_flipped_metadata.csv')
This would give you:
Note: this approach correctly handles all cell references e.g. FK42. It would also handle holes in the data, if A2 was deleted it would still align correctly, as it is not 100% clear if data in cells can be missing,

Related

How to group csv in python without using pandas

I have a CSV file with 3 rows: "Username", "Date", "Energy saved" and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022 Total Energy saved = 1001 , Date: 24/2/2022 Total Energy saved = 700)
I am a beginner at python and typically, I would use pandas to resolve this issue but it is not allowed for this project so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.

My alternative to opening csv files is to use csv module of native python. You read them as a "file" and just extract the values that you need. I filter using the first column and keep only keep the equal index values from the concerned column. (which is thrid and index 2.)
import csv
energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
file = csv.reader(csvfile)
for row in file:
if row[0]=="merrytan":
energy_saved.append(row[2])
energy_saved = sum(map(int, energy_saved))
Now you have a list of just concerned values, and you can sum them afterwards.
Edit - So, I just realized that I left out the time part of your request completely lol. Here's the update.
import csv
my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
for row in csv.reader(file):
if row[0]=="merrytan":
my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to get the date column of the file as well. We need to make a presentation of two "rows" but when Pandas has been prohibited, we will go to dictionary with date as keys and energy as values.
But your date column has repeated values (regardless intended or else) and Dictionaries require keys to be unique. So, we use a loop. You add one date value after another as key and corresponding energy as value to the new dictionary, but when it is already present, you will sum with the existing value instead.

I would turn your CSV file into a two-level dictionary, with username and then date as the keys
infile = open("data.csv", "r").readlines()
savings = dict()
# Skip the first line of the CSV, since that has the column names
# not data
for row in infile[1:]:
username, date_col, saved = row.strip().split(",")
saved = int(saved)
if username in savings:
if date_col in savings[username]:
savings[username][date_col] = savings[username][date_col] + saved
else:
savings[username][date_col] = saved
else:
savings[username] = {date_col: saved}

Increase the speed of an excel file operations (using openpyxl): check value and delete rows operations if condition

I have a medium size excel file, with about 25000 rows.
In the excel file I check if a specific column value is in a list, and if is in the list I delete the row.
I'm using openpyxl.
The code:
count = 1
while count <= ws.max_row:
if ws.cell(row=count, column=2).value in remove_list:
ws.delete_rows(count, 1)
else:
count += 1
wb.save(src)
The code works, but is very slow(take hours) to finish.
I know that is a read-only and write-only modes, but in my case, I use both, first checking and second deleting.

I see you are using a list of rows which you need to delete. Instead, you can create "sequences" of rows to delete, thus changing a delete list like [2,3,4,5,6,7,8,45,46,47,48] to one like [[2, 7],[45, 4]]
i.e. Delete 7 rows starting at row 2, then delete 4 rows starting at row 45
Deleting in bulk is faster than 1 by 1. I deleted 6k rows in around 10 seconds
The following code will convert a list to a list of lists/sequences:
def get_sequences(list_of_ints):
sequence_count = 1
sequences = []
for row in list_of_ints:
next_item = None
if list_of_ints.index(row) < (len(list_of_ints) - 1):
next_item = list_of_ints[list_of_ints.index(row) + 1]
if (row + 1) == next_item:
sequence_count += 1
else:
first_in_sequence = list_of_ints[list_of_ints.index(row) - sequence_count + 1]
sequences.append([first_in_sequence, sequence_count])
sequence_count = 1
return sequences
Then run another loop to delete
for sequence in sequences:
sheet.delete_rows(sequence[0], sequence[1])

Personally, I would do two things:
first transform the list into a set so the lookup of the item takes less time
remove_set = set(remove_list)
...
if ws.cell(row=count, column=2).value in remove_set:
then I would avoid removing the rows in place, as it takes a lot of time to reorganise the data structures representing the sheet.
I would create a new blank worksheet and add to it only the rows which must be kept.
Then save the new worksheet, overwriting the original if you wish.
If it still takes too long, consider using a CSV format so you can treat the input data as text and output it the same way, re-importing the data later from the spreadsheet program (e.g. Ms-Excel)
Have a look at the official docs and at this tutorial to find out how to use the CSV library
Further note: as spotted by #Charlie Clark, the calculation of
ws.max_row
may take some time as well and there is no need to repeat it.
To do that, the easiest solution is to work backwards from the last row down to the first, so that the deleted rows do not affect the position of the ones before them.

When a number of rows have to be deleted from a sheet, I create a list of these row numbers, e.g. remove_list and then I rewrite the sheet to a temporary sheet, excluding these rows. I delete the original sheet and rename the temporary sheet to the original sheet. See my function for doing this below:
def delete_excel_rows_with_openpyxl(workbook, sheet, remove_list):
""" Delete rows with row numbers in remove_list from sheet contained in workbook """
temp_sheet = workbook.create_sheet('TempSheet')
destination_row_counter = 1
for source_row_counter, source_row in enumerate(sheet.iter_rows(min_row=1, max_row=sheet.max_row)):
try:
i = remove_list.index(source_row_counter+1) # enumerate counts from 0 and sheet from 1
# do not copy row
del remove_list[i]
except ValueError:
# copy row
column_count = 1
for cell in source_row:
temp_sheet.cell(row=destination_row_counter, column=column_count).value = cell.value
column_count = column_count + 1
destination_row_counter = destination_row_counter + 1
sheet_title = sheet.title
workbook.remove_sheet(sheet)
temp_sheet.title = sheet_title
return workbook, temp_sheet

Adding on to ketdaddy's response. I tested it and noticed that when you use this sequence in a for loop as suggested, you need to update the row number in every loop to account for the deleted rows.
For example, when you get to the second step in the loop, the start row is not the original start row, it's the original start row minus the rows which were previously deleted.
This code will update ketdaddy's sequence to generate a sequence which takes this into account.
original sequence = get_sequences(deleterows)
updated_sequence=[]
cumdelete = 0
for start, delete in original sequence:
new_start = start-cumdelete
cumdelete = cumdelete + delete
updated_sequence.append([new_start, delete])
updated_sequence

Selectin Dataframe columns name from a csv file

I have a .csv to read into a DataFrame and the names of the columns are in the same .csv file in the previos rows. Usually I drop all the 'unnecesary' rows to create the DataFrame and then hardcode the names of each dataframe
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the columns names are in double quotes "TinMix","Tout..",etc there are exactly 16 rows with names
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variables start.
What I need to do is create a Dataframe from this .csv and place these names in the columns names. I'm new to Python and I'm not very sure how to do it
import pandas as pd
path = r'path-to-file.csv'
data=pd.DataFrame()
with open(path, 'r') as f:
for line in f:
data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0,29)],inplace=True)
x=len(data.iloc[0])
data.drop(data.columns[[0,1,2,x-1,x-2,x-3]],axis=1,inplace=True)
data.reset_index(drop=True,inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the usefull data, I'm dropping all the other columns that arent useful to me and keeping only the values. Last three lines are to reset row/column indexes and to transform the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of coding as a I said before I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file since theres a possibility on changing the CH# - "Name" combination
Thank you very much for the help!

Comment: possible for it to work within the other "OPEN " loop that I have?
Assume Column Names from Row 2 up to 6, Data from Row 7 up to EOF.
For instance (untested code)
data = None
columns = []
with open (path) as fh:
for row, line in enumerate (fh, 1):
if row > 2 and row <= 6:
ch, name = line.split(',')[:2]
columns.append(name)
else:
row_data = [tuple(line.strip().split(','))]
if not data:
data = pd.DataFrame(row_data, columns=columns, ignore_index=True)
else:
data.append(row_data)
Question: ... I would like to get them from the .csv file
Start with:
with open (path) as fh:
for row, line in enumerate (fh, 1):
if row > 2:
ch, name = line.split(',')[:2]

XLRD: Start reading a column from a specific cell / range (Python)

I am trying to read all values within the first sheet of an excel file via xlrd, but I need it to start reading values from row 3 of the excel sheet, until the end of values in the column
Current version reads all information within the columns including the headers, this is not desired
code:
for col in range(sheet.nrows):
names = sheet.cell(col,0)
nums = sheet.cell(col,1)
if names.value != xlrd.empty_cell.value:
if nums.value != xlrd.empty_cell.value:
f.write('\t\t\t\t\t\t\t\t\t'+ '<li><strong>' + names.value + '</strong> '+ repr(nums.value)+'</li>' + "\n")

Change your index in the code..... for col in range(2,sheet.nrows): should give the desired behaviour.
On a sidenote, you should really rename your variables, you're using col as a variable for the number of rows in a sheet (which causes all kinds of confusion).
EDIT to point out that XLREAD is 0 indexed.

Checking for Regular Expressions within a CSV

I'm currently trying to run through my csv file and identify the rows in a column.
The output should be something like "This column contains alpha characters only".
My code currently:
Within a method I have:
print('\nREGULAR EXPRESSIONS\n' +
'----------------------------------')
for x in range(0, self.tot_col):
print('\n' + self.file_list[0][x] +
'\n--------------') # Prints the column name
for y in range(0, self.tot_rows + 1):
if regex.re_alpha(self.file_list[y][x]) is True:
true_count += 1
else:
false_count += 1
if true_count > false_count:
percentage = (true_count / self.tot_rows) * 100
print(str(percentage) + '% chance that this column is alpha only')
true_count = 0
false_count = 0
self.file_list is the csv file in list format.
self.tot_rows & self.tot_col are the total rows and total columns respectively which has been calculated earlier within the program.
regex.re_alpha has been imported from a file and the method looks like:
def re_alpha(column):
# Checks alpha characters
alpha_valid = alpha.match(column)
if alpha_valid:
return True
else:
return False
This currently works, however I am unable to add my other regex checks such as alpha, numeric etc
I have tried to duplicate the if statement with a different regex check but it doesn't work.
I've also tried to do the counts in the regex.py file however the count stops at '1' and returns the wrong information..
I thought creating a class in the regex.py file would help however no avail.
Summary:
I would like to run multiple regex checks against my csv file and have them ordered via columns.
Thanks in advance.

From the code above, the first line of the CSV contains the column names. This means you could make a dictionary to contain your result where the keys are the column names.
from csv import DictReader
reader = DictReader(open(filename)) # filename is the name of the CSV file
results = {}
for row in reader:
for col_name, value in row.items():
results.setdefault(col_name, []).append(regex.re_alpha(value))
Now you have a dictionary called 'results' which has the output from the regex checks stored by column name. You can then output statistics. Or you could save the rows as you read them in a list and once you decide on an order you can go back and output rows to a new CSV file by outputting the items in each dictionary using the keys in the new order.
csv_writer = csv.writer(open(output_filename, 'w'))
new_order = [list of key names in the right order]
for row in saved_data:
new_row = map(row.get, new_order)
csv_writer.writerow(new_row)
Admittedly this is a bit of a sketch but it should get you going.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Finding a logic to map the values of two .tsv files - python

Related

How to group csv in python without using pandas

Increase the speed of an excel file operations (using openpyxl): check value and delete rows operations if condition

Selectin Dataframe columns name from a csv file

XLRD: Start reading a column from a specific cell / range (Python)

Checking for Regular Expressions within a CSV

Categories

Resources