Compare columns and print smaller and bigger rows - python

I have 2 files laid out this way.
file1 has one number per line:
6
4
13
25
35
50
65
75
and so on.....
file2 also has one number per line:
24
45
76
and so on.....
I want to take each value (one at a time) in file2 and compare it with file1; whenever a value in file1 is less than that number, I want to collect those values in a list, sort the list numerically, and print the largest value.
for example:
I take the number 24 from file2, compare it with file1, and see that 6, 4 and 13 are below that number. I then extract them, keep them in a list, sort it, and print the largest value (i.e. 13).

Read each file into a list, converting each line to an int, then sort both lists so we can walk through them efficiently:
file1 = sorted([int(l) for l in open('file1.txt').read().split()])
file2 = sorted([int(l) for l in open('file2.txt').read().split()])

i = 0
for file2_number in file2:
    # Advance i while the next file1 value is still below the current
    # file2 number; since both lists are sorted, i never moves backwards.
    while i+1 < len(file1) and file1[i+1] < file2_number:
        i += 1
    print file1[i]
This currently prints the answers (13 35 75) but you can easily modify it to return a list if wanted.
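Collecting the results into a list instead of printing them is a small change; a minimal sketch of that variant, assuming the same sorted file1 and file2 lists as above:

largest_below = []   # one result per file2 number
i = 0
for file2_number in file2:
    while i+1 < len(file1) and file1[i+1] < file2_number:
        i += 1
    largest_below.append(file1[i])
print largest_below  # [13, 35, 75] for the sample data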

Using Python, after first reading all of the lines of file1 and all of the lines of file2 into two separate lists, you can simply traverse them, comparing each number from file1 to each number in file2, like so:
#First load each of the lines in the data files into two separate lists
file1Numbers = [6, 4, 13, 25, 35, 50, 65, 75]
file2Numbers = [24, 45, 76]

#Loops through each number in file2
for file2Number in file2Numbers:
    #Reset the list of extracted numbers for each file2 number,
    #otherwise results from earlier comparisons leak into later ones
    extractedNumbers = []
    #Loops through each number in file1
    for file1Number in file1Numbers:
        #Compares the current values of the numbers from file1 and file2
        if (file1Number < file2Number):
            #If the number in file1 is less than the number in file2, add
            #the current number to the list of extracted numbers
            extractedNumbers.append(file1Number)
    #Sorts the list of extracted numbers from least to greatest
    extractedNumbers.sort()
    #Prints out the greatest number in the list, which is the number
    #located at the end of the sorted list (position -1)
    print extractedNumbers[-1]

awk solution:
awk 'NR==FNR{a[$0];next} {b[FNR]=$0}
END{
    n=asort(b)                  # asort is gawk-specific
    for(j in a)                 # j iterates over the file2 values
        for(i=n;i>0;i--)
            if(b[i]<j+0){       # j+0 forces a numeric comparison
                print "for "j" in file2 we found : "b[i]
                break
            }
}' file2 file1
output:
for 45 in file2 we found : 35
for 76 in file2 we found : 75
for 24 in file2 we found : 13
Note: there is room to optimize. If performance is critical, you could consider the following (just a suggestion):
sort file1 descending
sort file2 ascending
take the first value from the sorted file2 and go through file1 from the start; when you find the first smaller one, record its position/index x
take the 2nd value from the sorted file2 and start comparing from file1[x]; when the right one is found, update x
continue till the end of file2
The brute-force way takes O(m*n) (or O(n*m), depending on which of n and m is bigger). I didn't analyze the algorithm above rigorously, but it should be faster than O(m*n). ;)
Both python and awk can do the job. If possible, load the two files into memory; if you have monster files, that's a different algorithmic problem, e.g. sorting a huge file.
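As a reference point for that suggestion, here is a minimal Python sketch (my own illustration, not from the answers above) using the standard bisect module: for each file2 value, a binary search into the sorted file1 finds the largest smaller element, costing O(n log n) to sort plus O(log n) per lookup instead of O(m*n) overall.

import bisect

file1 = sorted([6, 4, 13, 25, 35, 50, 65, 75])  # file1 values, sorted ascending
file2 = [24, 45, 76]                            # file2 values, any order

for number in file2:
    pos = bisect.bisect_left(file1, number)  # index of first value >= number
    if pos > 0:                              # pos == 0 means nothing smaller exists
        print "for %d in file2 we found : %d" % (number, file1[pos - 1])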

Related

Find the minimum between two consecutive occurrences of numbers and repeat the process to the end of the file

Given the 0.txt file below:
Report "Curve ABC" for bread and milk on New Mexico City:
Bread and mil price by markets:
milk on market "Onlyfoods"
10
bread on market "Onlyfoods"
23
milk on market "spassus"
30
bread on market "spassus"
4
bread on market "chaim"
56
milk on market "chaim"
96
bread on market "house green"
7
milk on market "house green"
0.8
I would like to compare the first line that contains a number with the next line that also contains a number, determine the minimum of the two, and print the number of the line where the minimum value is. I would like to repeat the same process for the third and fourth occurrences of a number, that is, compare them and get the minimum of that pair, and so on.
For example: it would be necessary to compare 10 and 23; the minimum would be 10 and the line number would be 6. Continuing, between 30 and 4 the minimum is 4 and the line number would be 12.
The output file should be something like:
minimum between 10 and 23 is 10 in line 6
minimum between 30 and 4 is 4 in line 12
.
.
...
I asked a similar question before but got many downvotes, and I do not understand why, given that nobody has pointed me to another question already posted here on Stack Overflow.
I could combine this question, "calculate the difference between number lines file", with this one, Find the line with the min and max value and your line number from text file (Get Value Error Float Type), but I have not been able to work out how to do this.
EDIT UPDATE 1:
Alternative partial pseudocode, a starting point using UNIX/REGEX/AWK. Basically I should:
1 - Extract the line number of each instance of a numeric value and save it to a file 1.txt as a vertical list;
2 - Extract each numeric value itself and save it in 2.txt as a vertical list, then apply the solution given here to find the minimum between each pair of consecutive rows and save the minimum values (in the order extracted) in a file 3.txt;
3 - Paste each line of 3.txt onto the left or right of the corresponding line of 1.txt, using a separator such as a comma or :, and save the result in a 4.txt file.
Read the file, check whether each line parses as a number, collect the numbers together with their line numbers, then compare consecutive pairs to find the minimal value:
file = open("0.txt")
m = None
results = []
indexes = []
for i,d in enumerate(file):
try:
results.append(float(d))
indexes.append(i+1)
except ValueError:
pass
for r in range(1,len(results),2):
if results[r - 1] < results[r]:
print("minimum between",results[r-1],"and",results[r],"is",results[r-1],"line",indexes[r])
else:
print("minimum between",results[r-1],"and",results[r],"is",results[r],"line",indexes[r])

Why are the sizes of these binary files equal although they should not be?

While writing a simple Python script, I encountered a weird problem: two files with different content have the same size.
I have the same sequence of binary digits in two forms, one as a string and one as a list of ints:
char_list = '10101010'
int_list = [1, 0, 1, 0, 1, 0, 1, 0]
Then I convert both to bytearrays:
bytes_from_chars = bytearray(char_list, "ascii")
bytes_from_ints = bytearray(int_list)
Printing these out gives me this result:
bytearray(b'10101010')
bytearray(b'\x01\x00\x01\x00\x01\x00\x01\x00')
but this is OK so far.
Writing this data to disk:
with open("from_chars.hex", "wb") as f:
f.write(bytes_from_chars)
with open("from_ints.hex", "wb") as f:
f.write(bytes_from_ints)
And the sizes of the files are the same, but the files contain different data!
(ls -l and hexdump screenshots omitted: both files are 8 bytes; from_chars.hex contains the bytes 31 30 31 30 31 30 31 30 while from_ints.hex contains 01 00 01 00 01 00 01 00.)
And my question is: why are the file sizes equal? As I understand it, to write a value of 0 or 1 we need 1 bit, and to write the hex values 30 or 31 we need 5 bits (1 1110 and 1 1111).
You cannot write a single bit to a file: storage is addressed in whole bytes. And a 1-bit encoding would be ambiguous anyway: how could you tell the difference between 3 (binary 11) and two separate 1s?
In both cases you are writing an array of 8 bytes; in the first case you are just using the whole byte to store each character ('0' is 0x30 and '1' is 0x31, which is exactly what the hexdump shows).
Think of it as writing a word from the letters 0 and 1: the word 1 is stored as 0000 0001. Without the 0s at the start, you would not be able to tell what the word is.
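You can confirm this from Python itself; a minimal sketch, using the file names from the question:

import os
import binascii

# Both arrays are 8 elements long, so both files are 8 bytes on disk.
print(os.path.getsize("from_chars.hex"))   # 8
print(os.path.getsize("from_ints.hex"))    # 8

# The contents differ: ASCII codes vs. raw integer values.
with open("from_chars.hex", "rb") as f:
    print(binascii.hexlify(f.read()))      # 3130313031303130
with open("from_ints.hex", "rb") as f:
    print(binascii.hexlify(f.read()))      # 0100010001000100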

python - Issue in processing files with big size

Basically, I wanted to create a Python script for my daily tasks in which I compare two files of any size and generate 2 new files, one with the matching records and one with the non-matching records from both files.
I wrote the Python script below and found it works properly for files with few records.
But when I execute the same script on files with 200,000 and 500,000 records, the resulting files do not contain valid output.
So, can you check the script below and help me identify the issue causing the wrong output?
Thanks in advance.
from sys import argv

script, filePathName1, filePathName2 = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1,'r')
    fileObject2 = open(filePathName2,'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1,'a')
    newFileObject2 = open(newFilePathName2,'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])
    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0,len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)
Edit-1: Please note that the above program executes without any error. It's just that the output is incorrect and the program takes much longer to finish on large files.
I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".
Which would be logical because of your list comprehensions:
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0,len(Differece)):
    newFileObject1.write(Differece[i])

Matching = [ match for match in file1 if match in file2 ]
for i in range(0,len(Matching)):
    newFileObject2.write(Matching[i])
Each "in" test performs an O(n) scan of file2's lines, which is okay on a small number of lines but never ends if, say, len(file1) == 100000 and so is file2: you then perform 100000*100000 iterations => 10**10 => forever.
Fix is simple: create sets and use intersection & difference, which are much faster. (Note that sets discard duplicate lines and do not preserve their original order.)
file1 = set(fileObject1.readlines())
file2 = set(fileObject2.readlines())

difference = file1 - file2
for i in difference:
    newFileObject1.write(i)

matching = file1 & file2
for i in matching:
    newFileObject2.write(i)
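Folding those changes back into the original function might look like this; a sketch that keeps the question's file-naming scheme, uses with blocks so every file is closed, and opens the outputs with 'w' so reruns do not append to stale results:

from sys import argv

script, filePathName1, filePathName2 = argv

def FileDifference(filePathName1, filePathName2):
    with open(filePathName1) as f1, open(filePathName2) as f2:
        file1 = set(f1.readlines())
        file2 = set(f2.readlines())
    # 'w' instead of 'a' so rerunning the script does not append to old results
    with open(filePathName1 + ' - NonMatchingRecords.txt', 'w') as out1:
        out1.writelines(file1 - file2)
    with open(filePathName1 + ' - MatchingRecords.txt', 'w') as out2:
        out2.writelines(file1 & file2)

FileDifference(filePathName1, filePathName2)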

Filling tabs until the maximum length of column

I have a tab-delimited txt that looks like
11 22 33 44
53 25 36 25
74 89 24 35 and
But there is no "tab" after 44 and 25. So the 1st and 2nd rows have 4 columns, 3rd row has 5 columns.
To rewrite it so that tabs are shown,
11\t22\t33\t44
53\t25\t36\t25
74\t89\t24\t35\tand
I need a tool to mass-add tabs where entries are missing.
If the maximum number of columns is n (n=5 in the above example), then I want to pad every row with tabs up to that nth column, to make
11\t22\t33\t44\t
53\t25\t36\t25\t
74\t89\t24\t35\tand
I tried to do it with Notepad++, and in Python with replacer code like
map_dict = {'':'\t'}
but it seems I need more logic to do it.
I am assuming your file also contains newlines so it would actually look like this:
11\t22\t33\t44\n
53\t25\t36\t25\n
74\t89\t24\t35\tand\n
If you know for sure that the maximum length of your columns is 5, you can do it like this:
with open('my_file.txt') as my_file:
    y = lambda x: len(x.strip().split('\t'))
    a = [line if y(line) == 5 else '%s%s\n' % (line.strip(), '\t'*(5 - y(line)))
         for line in my_file.readlines()]
# ['11\t22\t33\t44\t\n', '53\t25\t36\t25\t\n', '74\t89\t24\t35\tand\n']
This will add trailing tabs until each line reaches 5 columns. You will get back a list of lines that you then need to write to a file (I used 'my_file2.txt', but you can write back to the original one if you want).
with open('my_file2.txt', 'w+') as out_file:
    for line in a:
        out_file.write(line)
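Since the question asks for the maximum column count n rather than a hard-coded 5, a two-pass variant (my own sketch, same file names as above) could first measure the widest row and then pad every line to it:

with open('my_file.txt') as my_file:
    lines = my_file.readlines()

width = lambda x: len(x.strip().split('\t'))
n = max(width(line) for line in lines)   # widest row, 5 in the example

padded = ['%s%s\n' % (line.strip(), '\t' * (n - width(line)))
          for line in lines]

with open('my_file2.txt', 'w+') as out_file:
    out_file.writelines(padded)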
If I understood it correctly, you can achieve this in Notepad++ alone using the following (screenshot of the Find/Replace settings omitted):
And yes, if you have several files on which you want to perform this, you can record it as a macro and bind it to a key as a shortcut.

Complex parsing query

I have a very complex parsing problem; any thoughts would be appreciated. I have a test.dat file. The file to be parsed looks like this:
* Number = 40
Time = 0
1 10.13 10 10.11 12 13
.
.
Time = n
1 10 10 10 12.50 13
.
.
There are N time blocks and each block has 40 lines, as shown above. What I would like to do is take e.g. the 1st line of the first block, then the 1st line of block #2, and so on, and write them to a new file test_1.dat. Similarly, the 2nd line of every block goes to test_2.dat, and so on. The lines in each block should be written as-is to the new _n.dat file. Is there any way to do this? The number I have assumed here is 40, so if * Number = 40 there will be 40 lines under each time block.
regards,
Ris
You can read the file in as a list of strings (call it fileList), where each string is a different line:
f = open('filename')
fileList = f.readlines()
Then, remove the "header" part of your file with
fileList.pop(0)
fileList.pop(0)
Then, do
outFileContents = {} # This will be a dict, where number -> content of test_number.dat
for outFileName in range(1,41): #outFileName will be the number going after the _ in your filename
outFileContents[outFileName] = []
for n in range(40): # Counting through the time blocks
currentRowIndex = (42 * n) + outFileName # 42 to account for the Time = and blank row
outFileContents[outFileName].append(fileList[currentRowIndex])
Finally you can loop through outFileContents and write the contents of each value to separate files.
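That final step might look like this; a sketch, using the test_<n>.dat naming from the question:

for outFileName, contents in outFileContents.items():
    with open('test_%d.dat' % outFileName, 'w') as outFile:
        outFile.writelines(contents)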
