Grabbing CSV Information with Regex in Python - python

I'm trying to extract all the phone numbers from a CSV document and append them to a list in string format. Here is a sample of my input:
someone#somewhere.com,John,Doe,,,(555) 555-5555
And here is the code I am using:
l = []
with open('sample.csv', 'r') as f:
    reader = csv.reader(f)
    for x in reader:
        number = re.search(r'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),',x)
        if number in x:
            l.append(''.join(number))
Basically, I'm trying to check if there is a number at a certain position in the row (where the parentheses are) and then append that to a list as a string using join. However, I keep getting this error:
Traceback (most recent call last):
  File "C:/Users/svillamil/Desktop/Final Phone.py", line 14, in <module>
    number = re.search(b'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),', x)
  File "C:\Users\svillamil\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
How do I get around this?

Iterating over a csv.reader gives you a list of strings for each row.
Taking the value at index 5 already gives you the phone number (if I counted correctly). You don't need a regular expression to do this.
import csv

l = []
with open('sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # each row is already a list of strings, one per column
        number = row[5]
        if number:
            l.append(number)
(Conversely, if you insisted on using a regular expression, you wouldn't need csv to do the splitting and could just iterate over the raw lines of the file.)
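The regex-only route mentioned in the parenthetical could look roughly like the sketch below. The pattern is an assumption: it only recognizes numbers shaped like the sample, (555) 555-5555, and the names PHONE_RE and extract_phones are made up for illustration. A list of strings stands in for the open file.

```python
import re

# Assumed format: "(555) 555-5555". Adjust the pattern for other layouts.
PHONE_RE = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

def extract_phones(lines):
    """Collect every phone-number-shaped substring from an iterable of raw lines."""
    phones = []
    for line in lines:
        phones.extend(PHONE_RE.findall(line))
    return phones

# Stand-in for iterating over the open file object
sample = [
    'someone#somewhere.com,John,Doe,,,(555) 555-5555\n',
    'nobody#nowhere.com,Jane,Roe,,,\n',
]
print(extract_phones(sample))
```

With a real file, passing the file object itself (for example extract_phones(f) inside the with block) should behave the same way, since iterating over a file yields its lines.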

You could just split each line on commas and check whether each element starts with a parenthesized three-character group such as (555), assuming a phone number can appear at any delimited position in the line:
import re

result = []
# Compile once, outside the loop; the raw string keeps \( and \) intact
pattern = re.compile(r"\(...\)")
with open('sandbox.txt', 'r') as f:
    for fileLine in f:
        lineElems = fileLine.strip().split(',')
        for lineElem in lineElems:
            # match() only matches at the start of the element
            if pattern.match(lineElem):
                print("Adding %s" % lineElem)
                result.append(lineElem)

x is a list containing the fields of the row, which is why re.search raises the TypeError: it expects a string.
One approach is to join the list back into a string and then apply the regex:
foo = ','.join(x)
number = re.search(r'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),', foo)
Or you can iterate over each field in the row and check whether it's a phone number:
for row in reader:
    for field in row:
        number = re.search(r'<phone-number-regex>', field)
        # re.search returns a match object, or None if nothing matched
        if number:
            l.append(number.group())
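To see where the TypeError comes from, here is a small reproduction that uses an in-memory string instead of sample.csv. The two short patterns are placeholders chosen for the demonstration, not the pattern from the question.

```python
import csv
import io
import re

# In-memory stand-in for sample.csv, using the row from the question
sample = io.StringIO('someone#somewhere.com,John,Doe,,,(555) 555-5555\n')
row = next(csv.reader(sample))
print(type(row).__name__, row)

try:
    re.search(r'\(', row)  # passing the whole list, as in the question
    error = None
except TypeError as exc:
    error = exc
print('TypeError:', error)

# Searching a single field (a string) works
match = re.search(r'\(\d{3}\)', row[5])
print(match.group())
```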

Related

Python - Get the largest age from a list txt.file

Any idea how I should get the largest age from the text file and print it?
The text file:
Name, Address, Age,Hobby
Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”
Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing
Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”
This is my coding:
with open("iv.txt",encoding="utf8") as file:
    data = file.read()
    splitdata = data.split('\n')
I am not getting what I want from this.
This works! I hope it helps. Let me know if there are any questions.
This approach essentially assumes that values associated with Hobby do not have numbers in them.
import csv

max_age = 0
with open("iv.txt", newline = '', encoding = "utf8") as f:
    # spamreader returns reader object used to iterate over lines of f
    # delimiter=',' is the default but I like to be explicit
    spamreader = csv.reader(f, delimiter = ',')
    # skip first row
    next(spamreader)
    # each row read from file is returned as a list of strings
    for row in spamreader:
        # reversed() returns reverse iterator (start from end of list of str)
        for i in reversed(row):
            try:
                i = int(i)
                break
            # ValueError raised when string i is not an int
            except ValueError:
                pass
        print(i)
        if i > max_age:
            max_age = i
print(f"\nMax age from file: {max_age}")
Output:
18
17
20
Max age from file: 20
csv.reader, from the csv module of Python's standard library, returns a reader object (named spamreader here) that is used to iterate over the lines of f. Each row (i.e. line) read from the file f is returned as a list of strings.
The delimiter (in our case, ',', which is also the default) determines how a raw line from the file is broken up into mutually exclusive but exhaustive parts -- these parts become the elements of the list that is associated with a given line.
Given a raw line, the string associated with the start of the line to the first comma is an element, then the string associated with any part of the line that is enclosed by two commas is also an element, and finally the string associated with the last comma to the end of the line is also an element.
For each line/list of the file, we start iterating from the end of the list, using the reversed built-in function, because we know that age is the second-to-last category. We assume that no hobby value contains a number that would end up as its own element of the list for that line. For example, for the line associated with Abu, if instead of "Badminton, Swimming" we had "Badminton, 30, Swimming", then the code would not have the desired effect, as 30 would be treated as Abu's age.
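The assumption described above can be checked on Abu's line directly. In this sketch, last_int is a hypothetical helper that mirrors the reversed() loop, and the "bad" line is the made-up variant with 30 among the hobbies.

```python
import csv
import io

def last_int(row):
    """Return the last element of row that parses as an int, or None."""
    for item in reversed(row):
        try:
            return int(item)
        except ValueError:
            continue
    return None

good = 'Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”\n'
bad = 'Abu, “18, Jalan Satu, Penang”, 18, “Badminton, 30, Swimming”\n'

# The curly quotes are not CSV quote characters, so every comma splits
good_row = next(csv.reader(io.StringIO(good)))
bad_row = next(csv.reader(io.StringIO(bad)))
print(good_row)
print(last_int(good_row), last_int(bad_row))
```

Because the curly quotes are ordinary characters to the csv module, the address and the hobbies are split at every comma, which is why scanning from the end is needed in the first place.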
I'm sure there is a built-in feature to parse a composite string like the one you posted, but as I don't know of one, I've created a CustomParser class to do the job:
class CustomParser():
    def __init__(self, line: str, delimiter: str):
        self.line = line
        self.delimiter = delimiter

    def split(self):
        word = ''
        words = []
        inside_string = False
        for letter in self.line:
            # a straight or curly quote toggles "inside a quoted field"
            if letter in '“”"':
                inside_string = not inside_string
                continue
            if letter == self.delimiter and not inside_string:
                words.append(word.strip())
                word = ''
                continue
            word += letter
        words.append(word.strip())
        return words

with open('people_data.csv', encoding='utf8') as file:
    next(file)  # skip the header row
    ages = []
    for line in file:
        # convert to int so max() compares numbers, not strings
        ages.append(int(CustomParser(line, ',').split()[2]))
print(max(ages))
Hope that helps.
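As for a built-in route, one possible sketch is to let the csv module do the quoting work. It rests on an assumption: that the curly quotes in the file stand in for ordinary double quotes, so they can be replaced before parsing. The function name max_age and the in-memory sample are made up for illustration.

```python
import csv
import io

def max_age(text):
    """Return the largest value in the third column, skipping the header."""
    # Assumption: curly quotes are meant as ordinary double quotes
    normalized = text.replace('“', '"').replace('”', '"')
    # skipinitialspace=True ignores the blank after each comma, so a
    # quote that follows ", " is still recognized as opening a field
    reader = csv.reader(io.StringIO(normalized), skipinitialspace=True)
    next(reader)  # skip the header row
    return max(int(row[2]) for row in reader if row)

sample = (
    'Name, Address, Age,Hobby\n'
    'Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”\n'
    'Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing\n'
    'Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”\n'
)
print(max_age(sample))
```

For a real file, the text could be obtained with open(..., encoding="utf8").read() before calling the function.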

Nested lists in python containing a single string and not single letters

I need to load text from a file that contains several lines, each consisting of letters separated by commas, into a 2-dimensional list. When I run this, I get a 2-dimensional list, but each nested list contains a single string instead of separate values, so I cannot iterate over the letters. How do I solve this?
def read_matrix_file(filename):
    matrix = []
    with open(filename, 'r') as matrix_letters:
        for line in matrix_letters:
            line = line.split()
            matrix.append(line)
    return matrix
result:
[['a,p,p,l,e'], ['a,g,o,d,o'], ['n,n,e,r,t'], ['g,a,T,A,C'], ['m,i,c,s,r'], ['P,o,P,o,P']]
I need each letter in the nested lists to be a single string so I can use them.
thanks in advance
The split() function splits on whitespace by default. You can fix this by passing the string you want to split on, which in this case is a comma. The code below should work.
def read_matrix_file(filename):
    matrix = []
    with open(filename, 'r') as matrix_letters:
        for line in matrix_letters:
            line = line.split(',')
            matrix.append(line)
    return matrix
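The difference between the two calls is easy to see on a single line from the file. This is only an illustration on a literal string; the variable names are arbitrary.

```python
line = 'a,p,p,l,e\n'  # one raw line, including its trailing newline

by_whitespace = line.split()        # no whitespace between letters, so one piece
by_comma = line.split(',')          # splits on commas, but keeps the newline
cleaned = line.strip().split(',')   # strip the newline first, then split

print(by_whitespace)
print(by_comma)
print(cleaned)
```

Note that splitting on the comma alone leaves the newline attached to the last letter, so stripping the line first is usually what you want.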
The input format you described conforms to CSV format. Python has a library just for reading CSV files. If you just want to get the job done, you can use this library to do the work for you. Here's an example:
Input(test.csv):
a,string,here
more,strings,here
Code:
>>> import csv
>>> lines = []
>>> with open('test.csv') as file:
...     reader = csv.reader(file)
...     for row in reader:
...         lines.append(row)
...
>>>
Output:
>>> lines
[['a', 'string', 'here'], ['more', 'strings', 'here']]
Using the strip() function will get rid of the new line character as well:
def read_matrix_file(filename):
    matrix = []
    with open(filename, 'r') as matrix_letters:
        for line in matrix_letters:
            line = line.split(',')
            line[-1] = line[-1].strip()
            matrix.append(line)
    return matrix

How to append from file into list in Python?

I have a sample file called 'scores.txt' which holds the following values:
10,0,6,3,7,4
I want to be able to somehow take each value from the line, and append it to a list so that it becomes sampleList = [10,0,6,3,7,4].
I have tried doing this using the following code below,
score_list = []
opener = open('scores.txt','r')
for i in opener:
    score_list.append(i)
print (score_list)
which partially works, but for some reason, it doesn't do it properly. It just sticks all the values into one index instead of separate indexes. How can I make it so all the values get put into their own separate index?
You have CSV data (comma separated). Easiest is to use the csv module:
import csv

all_values = []
with open('scores.txt', newline='') as infile:
    reader = csv.reader(infile)
    for row in reader:
        all_values.extend(row)
Otherwise, split the values. Each line you read is a string with the ',' character between the digits:
all_values = []
with open('scores.txt', newline='') as infile:
    for line in infile:
        all_values.extend(line.strip().split(','))
Either way, all_values ends up with a list of strings. If all your values are only consisting of digits, you could convert these to integers:
all_values.extend(map(int, row))
or
all_values.extend(map(int, line.strip().split(',')))
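Put together, the csv variant with the integer conversion might look like this. An io.StringIO object stands in for the opened scores.txt so the snippet runs on its own.

```python
import csv
import io

# Stand-in for open('scores.txt', newline='')
infile = io.StringIO('10,0,6,3,7,4\n')

all_values = []
for row in csv.reader(infile):
    # int() raises ValueError on empty or non-numeric fields
    all_values.extend(map(int, row))

print(all_values)
```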
Here is an efficient way to do it without any external package:
with open('tmp.txt','r') as f:
    score_list = f.readline().rstrip().split(",")
# Convert to list of int
score_list = [int(v) for v in score_list]
print(score_list)
Just use split on comma on each line and add the returned list to your score_list, like below:
opener = open('scores.txt','r')
score_list = []
for line in opener:
    score_list.extend(map(int,line.rstrip().split(',')))
print( score_list )

Convert first 2 elements of tuple to integer

I have a csv file with the following structure:
1234,5678,"text1"
983453,2141235,"text2"
I need to convert each line to a tuple and create a list. Here is what I did
with open('myfile.csv') as f1:
    mytuples = [tuple(line.strip().split(',')) for line in f1.readlines()]
However, I want the first 2 columns to be integers, not strings. I was not able to figure out how to continue with this, except by reading the file line by line once again and parsing it. Can I add something to the code above so that I transform str to int as I convert the file to list of tuples?
This is a csv file. Treat it as such.
import csv

with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    result = [(int(a), int(b), c) for a,b,c in reader]
If there's a chance your input may not be what you think it is:
import csv

with open('test.csv') as csvfile:
    reader = csv.reader(csvfile)
    result = []
    for line in reader:
        this_line = []
        for col in line:
            try:
                col = int(col)
            except ValueError:
                pass
            this_line.append(col)
        result.append(tuple(this_line))
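One practical difference between csv.reader and a plain split(',') is how the quoted third column comes out. The sketch below compares the two on the sample rows from the question, read from an in-memory string instead of a file.

```python
import csv
import io

text = '1234,5678,"text1"\n983453,2141235,"text2"\n'

# Plain split keeps the double quotes as part of the value
split_rows = [tuple(line.strip().split(',')) for line in io.StringIO(text)]

# csv.reader removes the quotes around a quoted field
csv_rows = [(int(a), int(b), c) for a, b, c in csv.reader(io.StringIO(text))]

print(split_rows[0])
print(csv_rows[0])
```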
Instead of trying to cram all of the logic in a single line, just spread it out so that it is readable.
with open('myfile.csv') as f1:
    mytuples = []
    for line in f1:
        tokens = line.strip().split(',')
        mytuples.append( (int(tokens[0]), int(tokens[1]), tokens[2]) )
Real python programmers aren't afraid of using multiple lines.
You can use isdigit() to check whether every character of an element is a digit and, if so, convert it to int. Replace the following:
tuple(line.strip().split(','))
with:
tuple(int(i) if i.isdigit() else i for i in line.strip().split(','))
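One caveat with the isdigit() check is that a negative number such as -12 is left as a string, because the minus sign is not a digit. The sketch below contrasts it with a try/except conversion; to_int_if_possible is a hypothetical helper name, and the input row is made up.

```python
def to_int_if_possible(token):
    """Return int(token) when it parses, otherwise the token unchanged."""
    try:
        return int(token)
    except ValueError:
        return token

tokens = '-12,5678,"text1"'.split(',')

with_isdigit = tuple(int(t) if t.isdigit() else t for t in tokens)
with_try = tuple(to_int_if_possible(t) for t in tokens)

print(with_isdigit)
print(with_try)
```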
You can cram this all into one line if you really want, but god help me I don't know why you'd want to. Try giving yourself room to breathe:
def get_tuple(token_list):
    return (int(token_list[0]), int(token_list[1]), token_list[2])

mytuples = []
with open('myfile.csv') as f1:
    for line in f1.readlines():
        token_list = line.strip().split(',')
        mytuples.append(get_tuple(token_list))
Isn't that way easier to read? I like list comprehension as much as the next guy, but I also like knowing what a block of code does when I sit down three weeks later and start reading it!

Python: read from file into list

I want my program to read from a .txt file, which has data in its lines arranged like this:
NUM NUM NAME NAME NAME. How could I read its lines into a list so that each line becomes an element of the list, and each element would have its first two values as ints and the other three as strings?
So the first line from the file: 1 23 Joe Main Sto should become lst[0] = [1, 23, "Joe", "Main", "Sto"].
I already have this, but it doesn't work perfectly and I'm sure there must be a better way:
read = open("info.txt", "r")
line = read.readlines()
text = []
for item in line:
    fullline = item.split(" ")
    text.append(fullline)
Use str.split() without an argument to have whitespace collapsed and removed for you automatically, then apply int() to the first two elements:
with open("info.txt", "r") as read:
    lines = []
    for item in read:
        row = item.split()
        row[:2] = map(int, row[:2])
        lines.append(row)
Note that here we loop directly over the file object, so there is no need to read all lines into memory first.
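The effect of dropping the argument to split() shows up as soon as a line contains a doubled space or a trailing newline. The line below is a made-up variant of the example row with two spaces after the first number.

```python
line = '1  23 Joe Main Sto\n'  # two spaces after the 1

explicit = line.split(' ')   # empty string for the extra space, newline kept
default = line.split()       # runs of whitespace collapsed, newline removed

row = default[:]
row[:2] = map(int, row[:2])  # convert only the first two fields

print(explicit)
print(default)
print(row)
```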
with open("info.txt") as f:
    # list(...) is needed in Python 3, where map() returns an iterator
    text = [list(map(int, l.split()[:2])) + l.split()[2:] for l in f]
