Regular Expression, Matrix, CSV in Python - python

I've seen a few related posts about the numpy module, etc. I need to use the csv module, and it should work for this. While a lot has been written on using the csv module here, I didn't quite find the answer I was looking for. Thanks so much in advance
Essentially I have the following function/pseudocode (tab didn't copy over well...):
import csv
def copy(inname, outname):
    infile = open(inname, "r")
    outfile = open(outname, "w")
    copying = False  # not copying yet
    # if the first string up to the first whitespace in the "name" column of a row
    # equals the first string up to the first whitespace in the "name" column of
    # the row directly below it AND the value in the "ID" column of the first row
    # does NOT equal the value in the "ID" column of the second row, copy these two
    # rows in full to a new table.
For example, if inname looks like this:
ID,NAME,YEAR, SPORTS_ALMANAC,NOTES
(first thousand rows)
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
1003,Boston Red Sox,1918,ESPN
1004,Washington Nationals,2010
(final large amount of rows until last row)
1231231231235,Detroit Tigers,1990,ESPN
Then I want my output to look like:
ID,NAME,YEAR,SPORTS_ALMANAC,NOTES
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
Because "New" is the string up to the first whitespace in the "NAME" column of both rows, and the IDs are different. To be clear, I need the code to be as general as possible: a regular expression on "New" is not what I need, since the common first string could be any string at all. It also doesn't matter what comes after the first whitespace (i.e. "Washington Nationals" and "Washington DC" should still give me a hit, as should the New York examples above).
I'm confused because in R there is a way to do:
inname$name to easily look up the values in a specific column. I tried writing my script in R first, but it got confusing, so I want to stick with Python.

Does this do what you want (Python 3)?
import csv

def first_word(value):
    return value.split(" ", 1)[0]

with open(inname, "r") as infile:
    with open(outname, "w", newline="") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile)
        column_names = next(in_csv)
        out_csv.writerow(column_names)
        id_index = column_names.index("ID")
        name_index = column_names.index("NAME")
        try:
            row_1 = next(in_csv)
            written_row = False
            for row_2 in in_csv:
                if first_word(row_1[name_index]) == first_word(row_2[name_index]) and row_1[id_index] != row_2[id_index]:
                    if not written_row:
                        out_csv.writerow(row_1)
                    out_csv.writerow(row_2)
                    written_row = True
                else:
                    written_row = False
                row_1 = row_2
        except StopIteration:
            # No data rows!
            pass
For Python 2, use:
    with open(outname, "w") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile, lineterminator="\n")
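To try the comparison logic without touching any files, the same adjacent-row check can be pulled into a small generator. This is only a sketch: the name adjacent_matches and the sample rows are made up for illustration, it assumes the header row has already been removed, and its first_word splits on any whitespace rather than only on a single space.

```python
def first_word(value):
    # Text up to the first run of whitespace; empty string for a blank value
    parts = value.split(None, 1)
    return parts[0] if parts else ""


def adjacent_matches(rows, id_index, name_index):
    """Yield rows from runs of neighbours that share a first word in the
    name column but have different IDs (hypothetical helper)."""
    previous = None
    previous_written = False
    for row in rows:
        if (previous is not None
                and first_word(previous[name_index]) == first_word(row[name_index])
                and previous[id_index] != row[id_index]):
            if not previous_written:
                yield previous
            yield row
            previous_written = True
        else:
            previous_written = False
        previous = row


# Sample rows taken from the question, without the header
sample = [
    ["1001", "New York Mets", "1900", "ESPN"],
    ["1002", "New York Yankees", "1920", "Guiness"],
    ["1003", "Boston Red Sox", "1918", "ESPN"],
    ["1004", "Washington Nationals", "2010"],
]
result = list(adjacent_matches(sample, id_index=0, name_index=1))
print(result)
```

Because the function only needs an iterable of rows, a csv.reader could be passed in directly in place of the sample list.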

Related

Csv, Python, separating elements in one column to different columns

So I have a CSV file like this,
how can I separate them into different columns like this,
using python without using the pandas lib.
Implementation that should work in python 3.6+.
import csv

with open("input.csv", newline="") as inputfile:
    with open("output.csv", "w", newline="") as outputfile:
        reader = csv.DictReader(inputfile)  # reader
        fieldnames = reader.fieldnames
        writer = csv.DictWriter(outputfile, fieldnames=fieldnames)  # writer
        # make header
        writer.writeheader()
        # loop over each row in input CSV
        for row in reader:
            # get first column
            column: str = str(row[fieldnames[0]])
            numbers: list = column.split(",")
            if len(numbers) != len(fieldnames):
                print("Error: Lengths not equal")
            # write row in output CSV
            writer.writerow({field: num for field, num in zip(fieldnames, numbers)})
Explanation of the code:
The code above works with two files, input.csv and output.csv; the names are self-explanatory.
It reads each row from input.csv and writes the corresponding row to output.csv.
The last line is a "dictionary comprehension" combined with zip (similar to a "list comprehension" for lists). It's a compact way to do a lot in a single line; the same code in expanded form looks like:
row = {}
for field, num in zip(fieldnames, numbers):
    row[field] = num
writer.writerow(row)
The data is already separated into columns with , as the separator, but European versions of Excel usually expect ; as the separator. You can specify the separator when you import the CSV:
https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba
If you really want to change the file content with Python, use the replace function to replace , with ; (see: How to search and replace text in a file?).
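One caveat with a plain text replace of , with ; is that it would also change commas inside quoted fields. A possible alternative is to let the csv module rewrite the rows with a different delimiter. The sketch below uses in-memory text in place of real files, and the function name convert_delimiter is an illustrative assumption rather than something from the answers above.

```python
import csv
import io


def convert_delimiter(source, target, old=",", new=";"):
    """Copy CSV rows from source to target, changing only the delimiter."""
    reader = csv.reader(source, delimiter=old)
    writer = csv.writer(target, delimiter=new, lineterminator="\n")
    writer.writerows(reader)


# In-memory stand-ins for files that would be opened with newline=""
source = io.StringIO('a,b,c\n1,"2,5",3\n')
target = io.StringIO()
convert_delimiter(source, target)
converted = target.getvalue()
print(converted)
```

The comma inside the quoted field "2,5" survives the conversion because the reader parses it as part of a single value.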

How to replace number in csv with a string with python

I am trying to fix the first row of a CSV file. If column name in header starts from anything other than a-z, NUM has to be prepended. The following code fixes the special characters in each column of the first row but somehow can't get the !a-z.
import csv
import glob

path = ('test.csv')
for fname in glob.glob(path):
    with open(fname, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        header = [column.replace ('-','_') for column in header]
        header = [column.replace ('[!a-z]','NUM') for column in header]
What am I doing wrong? Please provide suggestions.
Thanks
You can do it like this.
# csv file:
# 2HELLO,?WORLD
# 1,2
import csv

with open("test.csv", newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    print("Original header", header)
    # Keep names that start with a letter; otherwise swap the first character for "NUM".
    # (Use "NUM" + name instead if the original first character should be kept.)
    header = [name if name[0].isalpha() else "NUM" + name[1:] for name in header]
    print("Modified header", header)
Output:
Original header ['2HELLO', '?WORLD']
Modified header ['NUMHELLO', 'NUMWORLD']
The above list comprehension is equivalent to the following for loop:
for indx in range(len(header)):
    if not header[indx][0].isalpha():
        header[indx] = "NUM" + header[indx][1::]
If you want to replace only names that start with a digit, use the following condition instead:
    if header[indx][0].isdigit():
If your requirements change, you can adapt the condition with other string methods:
https://docs.python.org/2/library/string.html
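I believe you would want to replace the 'column.replace' portion with a regular-expression substitution, since str.replace treats its argument as literal text rather than as a pattern. Inside a character class, negation is written with ^ rather than !, and a leading ^ anchors the match to the start of the name. Something along these lines replaces the first character when it is not a-z:
re.sub(r'^[^a-z]', 'NUM', column)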
The full documentation reference is here for specifics: https://docs.python.org/2/library/re.html
https://www.regular-expressions.info/python.html
Since you said you want to prepend 'NUM', you could do something like this (which could be more efficient, but this shows the basic idea).
import string
column = '123'
if column[0] not in string.ascii_lowercase:
column = 'NUM' + column
# column is now 'NUM123'
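The hyphen fix and the prepending can also be combined in one pass over the header with a regular expression. This is a sketch, assuming that "starts with a-z" means a lowercase ASCII letter; fix_header is a hypothetical name and the sample header is invented.

```python
import re


def fix_header(header):
    """Replace '-' with '_' and prepend 'NUM' to names not starting with a-z."""
    fixed = []
    for column in header:
        column = column.replace("-", "_")
        # ^(?![a-z]) matches the empty position at the start of the string
        # only when the next character is not a lowercase letter, so the
        # substitution inserts 'NUM' without removing anything.
        column = re.sub(r"^(?![a-z])", "NUM", column)
        fixed.append(column)
    return fixed


print(fix_header(["2hello", "?world", "my-col", "name"]))
```

Because the pattern is a zero-width lookahead, the original first character is kept, which matches the "prepend" wording of the question rather than replacing anything.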

Python compare two list

I'm trying to find all the rows in training_full.csv (two columns, "macroclass" and "description") that mention a contact from contatti.csv (which contains two columns, "name" and "surname").
I want to retrieve every "description" in training_full.csv that contains a "name" and "surname" pair from contatti.csv.
The script I've written seems to evaluate only the first row of training_full.csv and therefore prints only that row (when it contains a match).
If I modify training_full.csv so that the first row contains no match, the result is empty.
Here is the code:
import csv
match=[]
with open('xxxxxxxxxxx/training_full1.csv', encoding='utf-8') as csvfile, open('output.csv', 'wb') as output, open('xxxxxxxxxxx/contatti.CSV') as contatti:
    spamreader = csv.reader(csvfile)
    spamreader_contacts = csv.reader(contatti, delimiter=';')
    spamwriter = csv.writer(output)
    for row_desc in spamreader:
        #print(righe[0])
        for row_cont in spamreader_contacts:
            #print(row[0])
            if (row_cont[0] + " " + row_cont[1]) in row_desc[0]:
                match.append(row_desc[0])
print(match)
Thanks for any help,
Filippo.
Looking at your problem, it seems to be separable in three parts:
1) Read the names, and build a list
2) Compare the training file with the names list
3) Write the matches
Doing that, we can end up with a solution similar to:
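import csv

# 1) Read the names and build a list
names = []
with open('xxxxxxxxxxx/contatti.csv', newline='') as f:
    contatti = csv.reader(f, delimiter=';')
    for row in contatti:
        names.append("{} {}".format(row[0], row[1]))

# 2) Compare the training file with the names list
matches = []
with open('xxxxxxxxxxx/training_full1.csv', newline='', encoding='utf-8') as f:
    training = csv.reader(f)
    for row in training:
        for name in names:
            if name in row[1]:  # description being the second column
                matches.append(row[1])
                break

# 3) Write the matches (text mode in Python 3; 'rb'/'wb' cannot be used with the csv module or with encoding=)
with open('output.csv', 'w', newline='') as f:
    output = csv.writer(f)
    for match in matches:
        output.writerow([match])  # wrap in a list so the string is written as one field

print(matches)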
The main issue with your attempt, as pointed out in the comments, was that the inner loop exhausted the contacts csv reader while checking the first row of the training file, so no later row was ever compared against anything. In the solution I present, a list of names is built first, which ensures that we can search for the names multiple times.
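The exhaustion is easy to reproduce in isolation. A csv.reader is a one-shot iterator, so a second pass over it yields nothing. The contact names below are invented for the demonstration, and io.StringIO stands in for the real file.

```python
import csv
import io

contacts = io.StringIO("Mario;Rossi\nAnna;Bianchi\n")
reader = csv.reader(contacts, delimiter=";")

first_pass = [row for row in reader]
second_pass = [row for row in reader]  # the reader is already exhausted here

print(first_pass)
print(second_pass)

# Materialising the rows once gives a list that can be looped over repeatedly
contacts.seek(0)
names = [" ".join(row[:2]) for row in csv.reader(contacts, delimiter=";")]
print(names)
```

The second list comes back empty, which mirrors what happened to the inner loop in the question after the first outer row.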

How to write a dict to a csv

I have a CSV file with one column that holds a person's first and last name. I am trying to use the csv module to split each name into two columns, first and last. The code below puts all of the first names together and all of the last names together in a single row, instead of writing one row per person with the first name in one column and the last name in the next. Thanks for your time.
Code:
import csv
with open('fullnames.csv','r') as f:
    reader = csv.reader(f)
    newcsvdict = {"first name": [], "last name": []}
    for row in reader:
        first = row[0].split()[0]
        last = row[0].split()[1]
        newcsvdict["first name"].append(first)
        newcsvdict["last name"].append(last)
with open('new.csv','w') as f:
    w = csv.DictWriter(f, newcsvdict.keys())
    w.writeheader()
    w.writerow(newcsvdict)
Output:
In this simple case there is little benefit in using a csv.DictWriter, just use csv.writer:
import csv

header = ['first name', 'last name']
with open('fullnames.csv', 'r') as infile, open('new.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(row[0].split() for row in csv.reader(infile))
This works fine provided that the name column in the input CSV always consists of exactly one first name and one surname separated by whitespace. However, if there can be double-barrelled surnames, e.g. Helena Bonham Carter, you need to be more careful about splitting the name. This might work:
row[0].split(' ', 1)
but it assumes that the separator is exactly one space.
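If the separator may be several spaces or a tab, splitting on arbitrary whitespace with a limit of one split is a little more forgiving. This sketch assumes that everything after the first word counts as the surname; split_name is an illustrative helper, not part of the answer above.

```python
def split_name(full_name):
    """Split into [first, rest] on the first run of whitespace."""
    parts = full_name.split(None, 1)
    # Pad so every row has exactly two columns, even for single-word names
    while len(parts) < 2:
        parts.append("")
    return parts


print(split_name("Helena  Bonham Carter"))
print(split_name("Cher"))
```

Padding single-word names keeps the output CSV rectangular, so each row still lines up with the two-column header.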
You can use pandas to write your CSV (you could actually use pandas for the whole problem); this will automatically transpose your data from a dict of columns to a list of rows:
import pandas as pd
df = pd.DataFrame(newcsvdict)
df.to_csv('new.csv', index=False)
You're creating a single list associated with each key. Either use pandas, as @maxymoo suggested, or write each line separately.
import csv
import os

# open() does not expand "~", so expand it explicitly
in_path = os.path.expanduser('~/Documents/names.csv')
out_path = os.path.expanduser('~/Documents/output.csv')

with open(in_path, 'r') as fh:
    reader = csv.reader(fh)
    with open(out_path, 'w+') as o:
        writer = csv.writer(o)
        for row in reader:
            output = row[0].split(' ', 1)
            writer.writerow(output)

Input file comparison not working Python

Why is the removal of the double quotes not working? (country = country[1:-1])
Why is the first_list.append line not executing? I am looping through the lines of an input file and matching it against the first element of a list of lists stored in final_lists. When I print the output, even when the two values are in fact the same verified by the print statements (for example, when row[0]==Zimbabwe AND country==Zimbabwe) the next append statement does not run.
with open('world_bank_regions.tsv', 'rU') as f:
    next(f)
    for line in f:
        [region, subregion, country] = line.split('\t')
        if country.startswith('"') and country.endswith('"'):
            country = country[1:-1]
        print country #the double quotes remain
        for row in final_list: #final list is a list of lists
            print row[0] #row[0] == Zimbabwe
            print country #country == Zimbabwe
            if row[0] == country:
                final_list.append([region, subregion])
print final_list #no changes were made to the list from the previous steps
You can solve the majority of your problem by using the csv module. The second problem you have is that you are modifying the same list you are looping over. This is never a good idea.
To solve your first problem:
import csv

with open('world_bank_regions.tsv', 'rU') as f:
    reader = csv.DictReader(f, delimiter='\t', quotechar='"')
    for row in reader:
        print(row['Country'])
For your second problem, you can approach it two ways. The first is to convert it into a dictionary for fast lookups, using the country name as the key.
As the country name is the first entry in the inner lists, you can do this:
lookup = {i[0]: i for i in final_list}
Then your complete code looks like this:
import csv

lookup = {i[0]: i for i in final_list}
with open('world_bank_regions.tsv', 'rU') as f:
    reader = csv.DictReader(f, delimiter='\t', quotechar='"')
    for row in reader:
        if row['Country'] in lookup:
            lookup[row['Country']] += [row['Region'], row['Subregion']]
csv.DictReader takes the first row and uses its values as keys for the data in the remaining rows, so each iteration over the reader returns a dictionary.
The example above assumes your input file looks like this:
Country\tRegion\tSubregion
Zimbabwe\tAfrica\t
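To see what csv.DictReader yields for a file shaped like this, and how the lookup dictionary updates the original lists, here is a small sketch on in-memory data. It is written for Python 3; the contents of final_list are invented for the example, and the country is quoted in the sample to show the quote handling.

```python
import csv
import io

# Stand-in for the tab-separated file, with the country wrapped in double quotes
tsv = io.StringIO('Country\tRegion\tSubregion\n"Zimbabwe"\tAfrica\t\n')

final_list = [["Zimbabwe", 1980], ["Peru", 1821]]  # hypothetical data
lookup = {entry[0]: entry for entry in final_list}

reader = csv.DictReader(tsv, delimiter="\t", quotechar='"')
for row in reader:
    # The csv module has already stripped the quotes around "Zimbabwe"
    if row["Country"] in lookup:
        # extend() mutates the inner list, which final_list also references
        lookup[row["Country"]].extend([row["Region"], row["Subregion"]])

print(final_list)
```

Since the dictionary values are the same list objects held in final_list, no new entries are appended to final_list while anything is iterating over it.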
