Problems looking for csv values in txt file using Python

I'm new to Stack Overflow and relatively new to Python. I have tried searching the site for an answer to this question, but haven't found one about matching values between csv and txt files.
I'm writing a simple Python script that reads a row from a large csv file (~600k lines), grabs a value from that row, assigns it to a variable, then uses the variable to find a matching value in a large txt file (~1.8MM lines). It's not working and I'm not sure why.
Here's a snippet from the source.csv file:
DocNo,Title,DOI
1,"Title One",10.1080/02724634.2016.1269539
2,"Title Two",10.1002/2015ja021888
3,"Title Three",10.1016/j.palaeo.2016.09.019
Here's a snippet from the lookup.txt file (note that it's separated by \t):
DOI 10.1016/j.palaeo.2016.09.019 M First
DOI 10.1016/j.radmeas.2015.12.002 M First
DOI 10.1097/SCS.0000000000002859 M First
Here's the offending code:
import csv
with open('source.csv', newline='', encoding="ISO-8859-1") as f, open('lookup.txt', 'r') as i:
    reader = csv.reader(f, dialect='excel')
    counter = 0
    for row in reader:
        doi = row[2]
        doi = str(doi)  # I think this might actually be redundant...
        for line in i:
            if doi in line:
                # This will eventually do more interesting things, but right now it's just a test
                print(doi)
                break
            else:
                # This will be removed--is also just a test (so I can watch progress)
                print(counter)
                counter += 1
Currently, when it runs, it just counts the lines, even though there's a matching doi in each file.
The maddening thing is that when I give doi a hard-coded value, it executes as it should. This makes me think that either the slashes in doi are breaking things somehow, or I need to convert the data type of the doi variable.
For example, this works:
doi = "10.1016/j.palaeo.2016.09.019"
for line in i:
if doi in line:
print(doi)
break
else:
print(counter)
counter += 1
Thanks in advance for your help, I'm at my wit's end!

Your problem is that repeating for line in i: does not start over from the beginning on each pass; it keeps going from where it stopped when you last hit break. Once any DOI has no match, you will effectively read the lookup file to the end, and every later for line in i: loop will do nothing (empty loop).
You probably want to keep the lookup lines in a list as a first step. Turning that into a lookup set or dict by parsing each row would likely be the next step.
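For example, here is a minimal sketch of that approach (it assumes the DOI is the second tab-separated column of lookup.txt, as in the sample):
import csv

# Read the lookup file once and keep the DOIs in a set for O(1) membership tests.
with open('lookup.txt', 'r') as i:
    lookup_dois = {line.split('\t')[1] for line in i if line.strip()}

with open('source.csv', newline='', encoding="ISO-8859-1") as f:
    reader = csv.reader(f, dialect='excel')
    next(reader)  # skip the header row
    for row in reader:
        doi = row[2]
        if doi in lookup_dois:
            print(doi)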
Here is a demonstration of what happens:
!cat 1.txt
row1
row2
row3
!cat 2.txt
row A
row B
row C
with open('1.txt', 'r') as i, open('2.txt', 'r') as j:
    for irow in i:
        print("irow", irow.strip())
        for jrow in j:
            print("jrow", jrow.strip())
irow row1
jrow row A
jrow row B
jrow row C
irow row2
irow row3

You can try this:
import csv
data = csv.reader(open('source.csv'))
data1 = [i.strip('\n').split()[1] for i in open('lookup.txt')]
file_data = [i[-1] for i in data if i[-1] in data1]
Output from sample files provided:
['10.1016/j.palaeo.2016.09.019']

Related

split the data into separate files after encountering a column name

eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',
I have a huge .csv file in which the column names reappear after a certain number of rows. Is there a way in Python to split such data into multiple files as soon as it encounters the repeated column names?
I would like the above data to be in 3 separate .csv files since the same column names appear 3 times.
Challenging! Here's my solution. There is likely a more straightforward way to do this though.
with open("./file.csv", "r") as readfile:
file_number = 0
current_line_no = 0
tmpline = None
for line in readfile:
# count which file you're on. Also use write mode "W" if the first line. Else append.
with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:
# check if the "headers" are appearing and if the current file has more than 1 line.
# Not sure if the header check is the best for your use case. Maybe regex is best here.
if current_line_no != 0 and ("eno" in line and "ename" in line):
file_number += 1 # increment to next file
current_line_no = 0 # reset file number
tmpline = line # remember the "current line". This needs to be added to next file.
continue # continue to next line in readfile
# if there is a templine from previous, add it to this as header.
if tmpline is not None:
writefile.write(tmpline)
tmpline = None
# write the line and increment to new line
writefile.write(line)
current_line_no += 1
I've tried to comment as best as possible. The code opens the output files one by one as it loops through the lines of readfile. For each line it checks whether the line is a "header"; here I simply checked whether "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, the current file is finished and a new one is started. Hopefully this helps!
I know you asked for Python, but there are some questions that just cry out for the power of AWK :)
awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv
Each time a line matches the header pattern eno,ename, the first block switches the output file name to F1.csv, F2.csv, and so on; the second block prints every line (header included) to the current output file.
One way of doing it is to save the headers to a variable, and then, while reading the file, check whether the current row matches the header. If it does, increment a counter that determines which file to write to.
import csv
HEADERS = next(csv.reader(open('data.csv')))
print(HEADERS)
with open('data.csv') as f:
    reader = csv.reader(f)
    file_name_counter = 0
    for row in reader:
        if row == HEADERS:
            file_name_counter += 1
        with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else "a"), newline="") as f:
            writer = csv.writer(f)
            writer.writerow(row)
NOTE: I believe the newline="" argument is necessary on Windows, as otherwise csv.writer() will add an extra new line between each entry.

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than wrangling it in Excel, I want to import it and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
(more rows follow; the original post linked a screenshot of the rest of the file)
I just can't figure out how to read each line only up to the comma. Specifically, I need the Load numbers for one of my arrays/lists, so for example on the first line I only need 62.638 (which would be the first number at the first index of my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice
with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that holds the 0th-index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
To do that you need to:
read the file line by line
remember the last non-empty line before the first number/comma/dot line (== header)
check whether the line consists only of numbers/commas/dots, else increase a skip counter (== data)
seek back to 0
skip enough lines to get to the header or the data
read the rest into a data structure
Create a test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else string is used - no other is tried."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
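As a minimal sketch, here is one way to plot the Load column against Time from the data list produced above (assuming data holds the header row followed by the number rows, as in the output shown):
import matplotlib.pyplot as plt

header, rows = data[0], data[1:]  # split off the header row
load = [row[0] for row in rows]   # 'Load (lbf)' column
time = [row[1] for row in rows]   # 'Time (s)' column

plt.plot(time, load)
plt.xlabel(header[1])  # 'Time (s)'
plt.ylabel(header[0])  # 'Load (lbf)'
plt.show()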
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Skip the header lines
    for i in range(6):
        fd.readline()
    # Read each data line only up to the first comma (the first column)
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
This leads to a list containing the data of the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the Python solution):
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve this without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: one, read all lines; two, determine which lines to use; last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
    if ('"' not in line) and (line != '\n'):
        grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the for loop, I search for a double quote to eliminate any string row, as all strings are enclosed in quotes. The other check skips empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first number on the first line, use
grid[0][0]
since Python's lists count from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. In this case the data starts at row 7 and you read all the columns, then just copy the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it also suggests similar functions at the bottom.

Compare Two Files by Row outputting each row difference to a new line in a new file using python 3.5

This type of question has been asked several times, but I can't seem to find the exact same scenario for Python 3 (3.5 in my case).
I have two files, txt or csv. I need to compare them row by row and output the differences to new lines in a new file.
Here is what I have tried so far. It's close, but I can't figure out how to put each row's differences on its own new line; I only seem to be able to put each word on a new line, or everything on one line.
a = open('test1.txt').read().split()
b = open('test2.txt').read().split()
c = [x for x in b if x not in a]
open('test3.txt', 'wt').write('\n'.join(c)+'\n')
The \n before the .join puts every word on a new row. I don't want each difference on its own row; I want all the differences from one row on the same row.
I hope that makes sense.
Example:
test1.txt:
how are you
I am well
all is good
test2.txt:
how are you
I like toys
all is not well
output:
test3.txt
am well
good
I have also tried this code for CSV, but I get an error.
import csv
f1 = open("test1.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)
f2 = open("test2.csv")
oldFile2 = csv.reader(f2)
oldList2 = []
for row in oldFile2:
    oldList2.append(row)
f1.close()
f2.close()
print [row for row in oldList1 if row not in oldList2]
I get this error; I think it's related to me being on version 3.5 while this code was written for 2.7?
File "test3.py", line 18
print [row for row in oldList1 if row not in oldList2]
^
SyntaxError: Missing parentheses in call to 'print'
Thank you for your help
The problem with your first code is that you read and split the whole file at once, which splits it on any whitespace (not only newlines).
You can simply zip the split lines and compare the words:
with open('test1.txt') as f1, open('test2.txt') as f2, open('result.txt', 'w') as f3:
    for line1, line2 in zip(f1, f2):
        sp1 = line1.split()
        sp2 = line2.split()
        f3.write(' '.join([i for i in sp1 if i not in sp2]) + '\n')
Additionally, you could have a look at difflib if you need fancier output, for example:
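Here is a minimal sketch using the standard library's difflib.unified_diff, reusing the test1.txt/test2.txt names from the question:
import difflib

with open('test1.txt') as f1, open('test2.txt') as f2:
    # unified_diff yields diff lines prefixed with '-', '+', or ' '
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='test1.txt', tofile='test2.txt')
    print(''.join(diff))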
The problem with the second code is simply that print works differently in Python 2 and 3. If you just add parentheses it should work, like this:
print([row for row in oldList1 if row not in oldList2])

How to clean large malformed CSV file using Python

I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple of paths that address one of the issues, only to realize they don't handle another aspect of the malformed data. I would appreciate it if anyone could help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv
inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')
counter = 1
for line in inputfile:
    # Skip header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)
inputfile.close()
outputfile_1.close()

with open('output.csv', 'r') as f:
    text = f.read()

comma_count = text.count(',')  # comma_count/6 = total number of rows
# need to insert a newline after the field contents after every 6th comma
# unfortunately the last field of the row and the first field of the next row
# are now rammed up together because of the newline replaces above...
# then process as normal CSV

# one path I started to go down... but this isn't even functional
groups = text.split(',')
counter2 = 1
while (counter2 <= comma_count/6):
    line = ','.join(groups[:(6*counter2)]), ','.join(groups[(6*counter2):])
    print line
    counter2 = counter2 + 1
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

comma_indices = [m.start() for m in re.finditer(',', text)]  # Find all the commas - the fields are between them
cursor = 0
field_counter = 1
row_count = 0
csv_row = []
for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields, the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])  # Add the last field of this row
        # Reset the cursor to be in the middle of the chunk (after the last field and before the next)
        # and set a flag that we need to start the next csv_row before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').rstrip().lstrip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1
# Write the last row
wr.writerow(csv_row)
outputfile_1.close()
# Process output.csv as normal CSV file...
This is a comment about how I would tackle this.
For each line, I can easily identify the start and end of certain groups:
Hostname - there is only one
usernames - read these until you meet something that does not look like a username (comma delimited)
IP addresses - read these until you meet a timestamp - identified with a pattern match - be aware these are separated by a space rather than a comma. The end of the group is identified by the trailing comma.
timestamp - easy to identify with a pattern match
test1, test2, test3 - certain to be there as comma-delimited fields
Notes: I would use the 'pattern' matches to verify that I have the correct thing in the correct place. It enables spotting errors sooner; a sketch of such checks follows below.
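A minimal sketch of what those pattern checks could look like (the regexes are assumptions derived from the sample rows, not the asker's actual formats):
import re

# Assumed formats based on the sample data; adjust to the real file.
TIMESTAMP = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}$')
IP_ADDRESS = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')

def looks_like_timestamp(field):
    """True if the field matches the sample's timestamp layout."""
    return bool(TIMESTAMP.match(field.strip().lstrip(',')))

def looks_like_ip(field):
    """True if the field is a dotted-quad IP address."""
    return bool(IP_ADDRESS.match(field.strip().lstrip(',')))

print(looks_like_timestamp(',2015-02-11 13:41:54 -0600'))  # True
print(looks_like_ip('192.168.1.1'))                        # True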
From your data excerpt it seems like any line that starts with a comma needs to be joined to the preceding line, and any line starting with anything other than a comma marks a new row.
If that's the case, then you could use something like the following code to clean up the CSV file so that the standard library csv parser can handle it.
#!/usr/bin/python
raw_data = 'somefilename.raw'
csv_data = 'somefilename.csv'

with open(raw_data, 'Ur') as inp, open(csv_data, 'wb') as out:
    row = list()
    for line in inp:
        line = line.rstrip('\n')  # rstrip returns a new string; reassign it
        if line.startswith(','):
            row.append(line)
        else:
            if row:  # don't emit an empty line before the very first row
                out.write(''.join(row) + '\n')
            row = list()
            row.append(line)
    # Don't forget to write the last row!
    out.write(''.join(row) + '\n')
This is a miniature state machine ... accumulating lines into each row until we find a line that doesn't start with a comma, writing the previous row and so on.

Problems with handling files in Python

Good evening everyone,
I am fairly new to Python and at the moment I'm struggling with how to properly edit a file (.txt or .csv) in Python. I am trying to write a little program that will take each line of a text file, encrypt it, then overwrite the file line by line and save it. The relevant part of my code looks like this so far:
with open('/home/path/file.csv', 'r+') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        y = []
        for i in range(0, len(row)):
            x = encrypt(row[i], password)
            y.append(x)
        csvfile.write(''.join(y))
Which, when executed, does nothing. I've played with the code a little; sometimes it runs into a
TypeError: expected a character buffer object
The encryption function returns a string, and my file consists of 3 strings per row, separated by tabs, like this:
key1 value1 value1'
key2 value2 value2'
key3 value3 value3'
...
csv.reader seems to read the file properly and returns one list per row, and y then holds a list of the encrypted phrases. However, I can't seem to get the write() call to actually overwrite the file. Does anyone know how to get around this?
Any help would be greatly appreciated.
Thanks,
Andy
You've opened the file as read-only. You need to open a second file for writing, after collecting the encrypted rows:
with open('/home/path/file.csv', 'r+') as csvfile:
    rows = []
    for row in csv.reader(csvfile, delimiter='\t'):
        y = []
        for i in range(0, len(row)):
            x = encrypt(row[i], password)
            y.append(x)
        rows.append('\t'.join(y))

with open('/home/path/file.csv', 'w') as csvfile:
    csvfile.write('\n'.join(rows))
I never like to overwrite my files, disk space is cheap.
with open('/home/path/file.csv', 'r+') as csvfile:
    with open('/home/path/file.enc', 'w') as csvencryptedfile:
        for row in csv.reader(csvfile, delimiter='\t'):
            y = []
            for i in range(0, len(row)):
                x = encrypt(row[i], password)
                y.append(x)
            csvencryptedfile.write('\t'.join(y))
            csvencryptedfile.write('\n')
