Why is the removal of the double quotes not working? (country = country[1:-1])
Why is the final_list.append line not executing? I am looping through the lines of an input file and matching each against the first element of a list of lists stored in final_list. When I print the output, even when the two values are verifiably the same (for example, when row[0] == Zimbabwe AND country == Zimbabwe), the append statement on the next line does not run.
with open('world_bank_regions.tsv', 'rU') as f:
    next(f)
    for line in f:
        [region, subregion, country] = line.split('\t')
        if country.startswith('"') and country.endswith('"'):
            country = country[1:-1]
        print country  # the double quotes remain
        for row in final_list:  # final_list is a list of lists
            print row[0]  # row[0] == Zimbabwe
            print country  # country == Zimbabwe
            if row[0] == country:
                final_list.append([region, subregion])
print final_list  # no changes were made to the list by the previous steps
You can solve the majority of your problem by using the csv module. (The immediate reason the quotes remain, by the way, is that country is the last field on the line and still carries the trailing newline, so country.endswith('"') is never True.) The second problem you have is that you are modifying the same list you are looping over. This is never a good idea.
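A tiny example (with made-up values) shows why: the loop keeps visiting the elements you append, so the list can grow indefinitely unless something stops it:

```python
# Appending to a list while iterating over it makes the loop
# chase a growing list; a guard is needed just to terminate.
items = [1, 2, 3]
seen = []
for x in items:
    seen.append(x)
    if len(items) < 6:      # guard so the loop eventually stops
        items.append(x * 10)

print(seen)   # the loop also visited the elements appended mid-loop
```

The loop visits 1, 2, 3 and then the freshly appended 10, 20, 30 as well.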
To solve your first problem:
import csv

with open('world_bank_regions.tsv', 'rU') as f:
    reader = csv.DictReader(f, delimiter='\t', quotechar='"')
    for row in reader:
        print(row['Country'])
For your second problem, a good approach is to convert final_list into a dictionary for fast lookups, using the country name as the key.
As the country name is the first entry in the inner lists, you can do this:
lookup = {i[0]: i for i in final_list}
Then your complete code looks like this:
import csv

lookup = {i[0]: i for i in final_list}

with open('world_bank_regions.tsv', 'rU') as f:
    reader = csv.DictReader(f, delimiter='\t', quotechar='"')
    for row in reader:
        if row['Country'] in lookup:
            lookup[row['Country']] += [row['Region'], row['Subregion']]
csv.DictReader uses the first row as the keys for the data in the remaining rows, and yields one dictionary per row when you loop over it.
The example above assumes your input file looks like this:
Country\tRegion\tSubregion
Zimbabwe\tAfrica\t
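To make that concrete, here is a self-contained sketch (io.StringIO stands in for your actual file) showing what DictReader yields for exactly that input:

```python
import csv
import io

# io.StringIO stands in for the real world_bank_regions.tsv file here
sample = "Country\tRegion\tSubregion\nZimbabwe\tAfrica\t\n"
reader = csv.DictReader(io.StringIO(sample), delimiter='\t', quotechar='"')
rows = list(reader)

print(rows[0]['Country'])   # Zimbabwe
print(rows[0]['Region'])    # Africa
```

Note that the trailing empty Subregion field comes back as an empty string, with no stray quotes or newlines to strip by hand.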
from operator import itemgetter
import csv

COLS = 15, 21, 27
COLS1 = 16, 22, 28
filename = "result.csv"
getters = itemgetter(*(col - 1 for col in COLS))
getters1 = itemgetter(*(col - 1 for col in COLS1))

with open('result.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        row = zip(getters(row))
    for row1 in csv.reader(csvfile):
        row1 = zip(getters1(row1))
    print(row)
    print(row1)

with open('results1.csv', "w", newline='') as f:
    fieldnames = ['AAA', 'BBB']
    writer = csv.writer(f, delimiter=",")
    for row in row:
        writer.writerow(row)
        writer.writerow(row1)
I am getting a NameError: name 'row1' is not defined error. I want to write each of the COLS in a separate column in the results1 file. How would I go about this?
So, there are a few things going on in the code that could lead to errors.
First is the way csv.reader(csvfile) works in Python. Each step of the iteration reads the next line of the file and returns it. The csv part simply parses the .csv format and gives you that line as a list of fields, rather than the plain string the standard Python file reader would give you. This is fine for a lot of use cases, but the issue we run into here is that when you run:
for row in csv.reader(csvfile):
    row = zip(getters(row))
csv.reader(csvfile) is advanced for every row in the file, and the for loop only stops when "result.csv" runs out of data. So if you want to use the data from each row, you need to store it somewhere before the file is exhausted. I think that's what you are trying to achieve with row = zip(getters(row)), but row is both the loop variable and the target of the assignment, so every iteration simply overwrites row and nothing gets stored.
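You can see the one-pass behaviour of a reader with a small stand-alone example (io.StringIO stands in for the file here):

```python
import csv
import io

f = io.StringIO("a,b\nc,d\n")
reader = csv.reader(f)

first_pass = list(reader)    # consumes the whole reader
second_pass = list(reader)   # nothing left to read

print(first_pass)    # [['a', 'b'], ['c', 'd']]
print(second_pass)   # []
```

The second pass comes back empty because the underlying file position is already at the end; this is exactly what happens to your second for loop.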
In order to store your csv data, try this:

data = []
for row in csv.reader(csvfile):
    temp = list(getters(row))   # list(...) keeps the three selected values together
    data.append(temp)

This will store each row's selected columns in a list called data.
Then there is the second error, the one you are asking about: row1 not being defined. This happens because the first for loop already ran through every row in the csv file. When you call csv.reader again in the second for loop, it can't read anything, because the file position is at the end and the reader doesn't know to start over at the beginning of the file. Therefore row1 never gets assigned, and when you later call writer.writerow(row1), row1 doesn't exist.
There are a couple of ways to fix this. You could close the file, reopen it (or seek back to the start), and read it again from the beginning. Or you could collect both sets of columns in the same first loop, like this:
data = []
data1 = []
for row in csv.reader(csvfile):
    temp = list(getters(row))
    data.append(temp)
    temp1 = list(getters1(row))
    data1.append(temp1)
Now you will have 3 columns of data in both data and data1.
Now for writing to the "results1.csv" file. Here you used row both as the for loop variable and as the iterable to loop over (for row in row), which does not work. You also call writer.writerow(row) and then writer.writerow(row1) inside that loop, which doesn't do what you want either. Try this instead:
with open('results1.csv', "w", newline='') as f:
    writer = csv.writer(f, delimiter=",")
    for i in range(len(data)):
        writer.writerow(data[i] + data1[i])
Now it also looks like you want to add headers for each column (fieldnames = ['AAA','BBB']). Unfortunately, csv.writer has no writeheader() method; one way is to write the header with csv.DictWriter and writer.writeheader() first:
with open('results1.csv', "w", newline='') as f:
    fieldnames = ['A', 'A', 'A', 'B', 'B', 'B']
    writer = csv.DictWriter(f, delimiter=",", fieldnames=fieldnames)
    writer.writeheader()
    writer = csv.writer(f, delimiter=",")
    for i in range(len(data)):
        writer.writerow(data[i] + data1[i])
Hope this helps!
import csv

with open('example.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter=',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if row not in client_email:
            print row
Assume the code is formatted in blocks properly; it's not translating properly when I copy-paste. I've created a list of company email domain names (as seen in the example), and a loop that should print every row in my CSV that is not matched by the list. Other columns in the CSV file include first name, second name, company name etc., so it is not limited to only emails.
The problem is that when I'm testing, it prints rows containing the emails in the list, e.g. jackson#example.co.uk.
Any ideas?
In your example, row refers to a list of strings, so each row is something like ['First name', 'Second name', 'Company Name'] etc.
Your test row not in client_email checks whether the whole row (a list) equals one of the entries in client_email, which is never true, so every row gets printed; even comparing individual columns would only test for an exact match with one of the elements.
I suspect you want to check whether the text of any column contains one of the elements in client_email.
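The difference is easy to demonstrate with made-up values: membership in a list needs an exact cell match, while in applied to a string does a substring test:

```python
# Made-up example row and domain list
row = ['Jack', 'Smith', 'jackson#example.co.uk']
client_email = ['#example.co.uk', '#moreexamples.com']

# Exact membership: no cell equals '#example.co.uk' exactly
print('#example.co.uk' in row)       # False

# Substring test against one cell: the domain appears inside it
print('#example.co.uk' in row[2])    # True
```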
You could use another loop:
for row in csvfile:
    for column in row:
        # check if the column contains any of the email domains here
        # if it does:
        print row
        break  # done with this row; move on to the next one
To check if a string contains any strings in another list, I often find this approach useful:
s = "xxabcxx"
stop_list = ["abc", "def", "ghi"]
if any(elem in s for elem in stop_list):
    pass
One way to check is to see whether the set of client_email and the set of cells in row have elements in common (by changing the if condition in the loop). Note that this matches only cells that equal a list entry exactly, not substrings:

import csv

with open('example.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter=',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if set(row) & set(client_email):
            print(row)
You can also use any, as follows (again, this tests for exact cell matches):

import csv

with open('untitled.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter=',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if any(item in row for item in client_email):
            print(row)
Another possible way, this time testing each cell for a substring match:

import csv

data = csv.reader(open('example.csv', 'r'))
emails = {'#example.co.uk', '#moreexamples.com', 'lastexample.com'}
for row in data:
    if any(email in cell for cell in row for email in emails):
        print(row)
I need help sorting a list from a text file. I'm reading a .txt and then adding some data, then sorting it by population change %, then lastly, writing that to a new text file.
The only thing that's giving me trouble now is the sort function. I think the for statement syntax is what's giving me issues -- I'm unsure where in the code I would add the sort statement and how I would apply it to the output of the for loop statement.
The population change data I am trying to sort by is the [1] item in the list.
#Read file into script
NCFile = open("C:\filelocation\NC2010.txt")
#Save a write file
PopulationChange = open("C:\filelocation\Sorted_Population_Change_Output.txt", "w")
#Read everything into lines, except for first (header) row
lines = NCFile.readlines()[1:]
#Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010 - population2000) / population2000) * 100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]

#Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular, DictReader will read in the data as dictionaries keyed on the header row. I'm making up the field names below, but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open("C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010 - population2000) / population2000) * 100
    row['popChange'] = "{0:.2f}".format(popChange)

with open("C:\filelocation\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    # extrasaction='ignore' drops the columns not listed in fieldnames
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'],
                        extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)

This will give you a 2 column csv of ['countyName', 'popChange']. You would need to substitute the correct field names for your file.
You need to read all of the lines in the file before you can sort it. I've created a list called change to hold the tuple pair of the population change and the country name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    country_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, country_name))

change.sort()

output_rows = ["{0}, {1:.2f}\n".format(pair[1], pair[0]) for pair in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows which swaps the pair back in the desired order, i.e. country name first.
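This relies on Python comparing tuples element by element, so a list of (pop_change, name) pairs sorts numerically by the change first, with the name only breaking ties (made-up values below):

```python
# Tuples compare element by element: the float sorts first,
# and the name is only used when two changes are equal.
change = [(2.5, 'Wake'), (-1.0, 'Tyrrell'), (2.5, 'Durham')]
change.sort()
print(change)   # [(-1.0, 'Tyrrell'), (2.5, 'Durham'), (2.5, 'Wake')]
```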
I am trying to read the lines of a text file into a list or array in python. I just need to be able to individually access any item in the list or array after it is created.
The text file is formatted as follows:
0,0,200,0,53,1,0,255,...,0.
Where the ... is above, there actual text file has hundreds or thousands more items.
I'm using the following code to try to read the file into a list:
text_file = open("filename.dat", "r")
lines = text_file.readlines()
print lines
print len(lines)
text_file.close()
The output I get is:
['0,0,200,0,53,1,0,255,...,0.']
1
Apparently it is reading the entire file into a list of just one item, rather than a list of individual items. What am I doing wrong?
You will have to split your string into a list of values using split()
So,
lines = text_file.read().split(',')
EDIT:
I didn't realise there would be so much traction to this. Here's a more idiomatic approach.

import csv

with open('filename.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        pass  # do something with row, a list of the comma-separated values
You can also use numpy's loadtxt, like:
from numpy import loadtxt
lines = loadtxt("filename.dat", comments="#", delimiter=",", unpack=False)
So you want to create a list of lists... We need to start with an empty list
list_of_lists = []
next, we read the file content, line by line
with open('data') as f:
    for line in f:
        inner_list = [elt.strip() for elt in line.split(',')]
        # alternatively, if you need to use the file content as numbers:
        # inner_list = [int(elt.strip()) for elt in line.split(',')]
        list_of_lists.append(inner_list)
A common use case is columnar data, but our units of storage are the rows of the file, which we have read one by one, so you may want to transpose your list of lists. This can be done with the following idiom
by_cols = zip(*list_of_lists)
Another common use is to give a name to each column
col_names = ('apples sold', 'pears sold', 'apples revenue', 'pears revenue')
by_names = {}
for i, col_name in enumerate(col_names):
    by_names[col_name] = by_cols[i]
so that you can operate on homogeneous data items
mean_apple_prices = [money / fruits for money, fruits in
                     zip(by_names['apples revenue'], by_names['apples sold'])]
Most of what I've written can be sped up using the csv module from the standard library. Another option is the third party module pandas, which lets you automate most aspects of a typical data analysis (but has a number of dependencies).
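For instance, a rough sketch of the same read-and-transpose steps using csv.reader (io.StringIO stands in here for the 'data' file, with made-up values):

```python
import csv
import io

# In-memory stand-in for the data file used above
raw = "10,20\n30,40\n"
list_of_lists = [row for row in csv.reader(io.StringIO(raw))]

# Transpose rows into columns, materialised as a list for indexed access
by_cols = list(zip(*list_of_lists))
print(by_cols)   # [('10', '30'), ('20', '40')]
```

csv.reader does the splitting and stripping of the delimiter for you, so the manual split(',')/strip() step disappears.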
Update: While in Python 2 zip(*list_of_lists) returns a different (transposed) list of lists, in Python 3 the situation has changed and zip(*list_of_lists) returns a zip object that is not subscriptable.
If you need indexed access you can use
by_cols = list(zip(*list_of_lists))
that gives you a list of lists in both versions of Python.
On the other hand, if you don't need indexed access and what you want is just to build a dictionary indexed by column names, a zip object is just fine...
file = open('some_data.csv')
names = get_names(next(file))  # get_names is assumed to parse the header line
columns = zip(*((x.strip() for x in line.split(',')) for line in file))
d = {}
for name, column in zip(names, columns):
    d[name] = column
This question is asking how to read the comma-separated value contents from a file into an iterable list:
0,0,200,0,53,1,0,255,...,0.
The easiest way to do this is with the csv module as follows:

import csv

with open('filename.dat', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    # now you can easily iterate over spamreader, inside the with block:
    for row in spamreader:
        print(', '.join(row))
See documentation for more examples.
I'm a bit late, but you can also read the text file into a pandas dataframe and then convert the corresponding column to a list.

import pandas as pd

lista = pd.read_csv('path_to_textfile.txt', sep=",", header=None)[0].tolist()

For example:

lista = pd.read_csv('data/holdout.txt', sep=',', header=None)[0].tolist()

Note: the column names of the resulting dataframe are integers; I chose 0 because I was extracting only the first column.
Better this way,

def txt_to_lst(file_path):
    try:
        with open(file_path, "r") as stopword:
            lines = stopword.read().split('\n')
        print(lines)
        return lines
    except Exception as e:
        print(e)
I've seen a few related posts about the numpy module, etc. I need to use the csv module, and it should work for this. While a lot has been written on using the csv module here, I didn't quite find the answer I was looking for. Thanks so much in advance
Essentially I have the following function/pseudocode (tab didn't copy over well...):
import csv

def copy(inname, outname):
    infile = open(inname, "r")
    outfile = open(outname, "w")
    copying = False  # not copying yet
    # if the first string up to the first whitespace in the "name" column of a row
    # equals the first string up to the first whitespace in the "name" column of
    # the row directly below it AND the value in the "ID" column of the first row
    # does NOT equal the value in the "ID" column of the second row, copy these two
    # rows in full to a new table.
For example, if inname looks like this:
ID,NAME,YEAR, SPORTS_ALMANAC,NOTES
(first thousand rows)
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
1003,Boston Red Sox,1918,ESPN
1004,Washington Nationals,2010
(final large amount of rows until last row)
1231231231235,Detroit Tigers,1990,ESPN
Then I want my output to look like:
ID,NAME,YEAR,SPORTS_ALMANAC,NOTES
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
Because the string "New" is the same first string up to the first whitespace in the "Name" column, and the IDs are different. To be clear, I need the code to be as generalizable as possible, since a regular expression on "New" is not what I need; the common first string could be any string. And it doesn't matter what happens after the first whitespace (i.e. "Washington Nationals" and "Washington DC" should still give me a hit, as should the New York examples above...)
I'm confused because in R there is a way to do inname$name to search easily by values in a specific column. I tried writing my script in R first, but it got confusing, so I want to stick with Python.
Does this do what you want (Python 3)?
import csv

def first_word(value):
    return value.split(" ", 1)[0]

with open(inname, "r") as infile:
    with open(outname, "w", newline="") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile)
        column_names = next(in_csv)
        out_csv.writerow(column_names)
        id_index = column_names.index("ID")
        name_index = column_names.index("NAME")
        try:
            row_1 = next(in_csv)
            written_row = False
            for row_2 in in_csv:
                if first_word(row_1[name_index]) == first_word(row_2[name_index]) and row_1[id_index] != row_2[id_index]:
                    if not written_row:
                        out_csv.writerow(row_1)
                    out_csv.writerow(row_2)
                    written_row = True
                else:
                    written_row = False
                row_1 = row_2
        except StopIteration:
            # No data rows!
            pass
For Python 2, use:
with open(outname, "w") as outfile:
    in_csv = csv.reader(infile)
    out_csv = csv.writer(outfile, lineterminator="\n")