Extracting data by regex and writing to CSV, Python glob (pandas?) - python

I have a large list of varyingly dirty CSVs containing phone numbers in various formats. What I want is to comb through all of them and export to a single-column CSV of cleaned phone numbers in a simple format. So far, I have pieced together something to work, though it has some issues: (partial revision further below)
import csv
import re
import glob
import string

with open('phonelist.csv', 'wb') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = glob.glob('CSVs\*.csv')
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for column in row:
                    s1 = column.strip()
                    if re.match(r'\b\d\d\d\d\d\d\d\d\d\d\b', s1):
                        if s1 not in seen:
                            seen.add(s1)
                            output.append(s1)
                    elif re.search(r'\b\(\d\d\d\) \d\d\d-\d\d\d\d\b', s1):
                        s2 = filter(lambda x: x in string.digits, s1)
                        if s2 not in seen:
                            seen.add(s2)
                            output.append(s2)
    for val in output:
        out_writer.writerow([val])
I'm putting this together with no formal Python knowledge, just piecing together things I've gleaned from this site. Any advice regarding Pythonic style or utilizing the pandas library for shortcuts would be welcome.
First issue: What's the simplest way to filter down to just the matched values? I.e., a cell may contain 9815556667 John Smith, but I just want the number.
Second issue: This takes forever. I assume it's the lambda part. Is there a faster or more efficient method?
Third issue: How do I glob *.csv in the directory of the program and the CSVs directory (as written)?
I know that's several questions at once, but I got myself halfway there. Any additional pointers are appreciated.
As for the examples requested: this isn't from a real file (these are multi-gigabyte files), but here's what I'm looking for:
John Smith, (981) 991-0987, 9987765543 extension 541, 671 Maple St 98402
(998) 222-0011, 13949811123, Foo baR Us, 2567 Appleberry Lane
office, www.somewebpage.com, City Group, Anchorage AK
9281239812
(345) 666-7777
Should become:
9819910987
9987765543
9982220011
3949811123
3456667777
(I forgot that I need to drop a leading 1 from 11-digit numbers, too)
EDIT: I've changed my current code to incorporate Shahram's advice, so now, instead of the for column in row block above, I have:
for column in row:
    s1 = column.strip()
    result = re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)? ?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1) or re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
    if result:
        tempStr = result.group(1)
        for ch in ['(', ')', '-', ' ']:
            tempStr = tempStr.replace(ch, '')
        if tempStr not in seen:
            seen.add(tempStr)
            output.append(tempStr)
This seems to work for my purposes, but I still don't know how to glob both the current directory and the CSVs subdirectory, and I still don't know whether my code has issues I'm unaware of because of my hodge-podging. Also, on my larger directory this is taking forever: about a gigabyte of CSVs was still running when I killed it at around 20 minutes. I don't know if it's hitting a snag, but judging by the speed at which Python normally chews through any number of CSVs, it feels like I'm missing something.
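On the glob question, one sketch (directory names taken from the post; the helper name is my own) is to glob each directory separately and concatenate the results:

```python
import glob
import os

def find_csvs(*dirs):
    """Collect *.csv paths from each directory; '' means the current directory."""
    paths = []
    for d in dirs:
        paths.extend(glob.glob(os.path.join(d, '*.csv')))
    return sorted(paths)

# e.g. the program's own directory plus the CSVs subdirectory:
# csv_files = find_csvs('', 'CSVs')
```

Using os.path.join keeps this working on both Windows and Unix separators.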

About your first question, you can use the regular expression below to capture the different phone-number formats:
result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
if result:
    if result.group(1) not in seen:
        seen.add(result.group(1))
        output.append(result.group(1))
About your second question: you may want to look at str.replace, which is cheaper than filtering character by character with a lambda. The above code then becomes:
result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
if result:
    tempStr = result.group(1)
    tempStr = tempStr.replace('-', '')  # strings are immutable, so keep the return value
    if tempStr not in seen:
        seen.add(tempStr)
        output.append(tempStr)
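On the speed question, a hedged alternative to both the lambda filter and the character loop is a single precompiled re.sub that strips everything that isn't a digit; the function name here is my own, and it also handles the leading-1 case mentioned in the question:

```python
import re

non_digits = re.compile(r'\D')

def normalize(number_str):
    """Strip non-digit characters and drop a leading 1 from 11-digit numbers."""
    digits = non_digits.sub('', number_str)
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    return digits

# normalize('(981) 991-0987') -> '9819910987'
# normalize('13949811123')    -> '3949811123'
```

Compiling the pattern once outside the per-row loop avoids re-parsing it millions of times.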

Related

Parse unstructured text in python

I am new to Python and am trying to read a PDF file to pull the ID No. I have been successful so far in extracting the text from the PDF file using pdfplumber. Below is the code block:
import pdfplumber

with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    print(raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the optimal way to parse this unstructured text further. The final output I am expecting is just the ID No., i.e. 10101010. On a side note, the script will be run against a fairly huge set of PDFs, so performance is a concern.
Try using a regular expression:
import pdfplumber
import re

with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()

m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
    print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.
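A sketch of that all-pages iteration: keep the regex precompiled and feed it each page's text. The helper name is mine, and the commented pdfplumber loop assumes the API shown in the question:

```python
import re

id_re = re.compile(r'ID No\. : (\d+)')

def find_ids(pages_text):
    """Yield every ID found across an iterable of per-page text strings."""
    for text in pages_text:
        for m in id_re.finditer(text or ''):  # extract_text() can return None
            yield m.group(1)

# with pdfplumber.open('ABC.pdf') as pdf_file:
#     ids = list(find_ids(page.extract_text() for page in pdf_file.pages))
```

Because finditer is used, this also copes with more than one match per page.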
If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') returns the position of the I in ID No. position + 9 is then the first digit of the ID. If the number always has a length of 8, you can get it with int(raw_text[position+9:position+17]).
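A minimal sketch of that find/slice idea against the sample text from the question:

```python
raw_text = """Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal."""

position = raw_text.find('ID No. : ')
if position != -1:
    # 'ID No. : ' is 9 characters long, and the ID itself is 8 digits
    id_no = int(raw_text[position + 9:position + 17])
    # id_no == 10101010
```

The find approach is faster than a regex for a single fixed marker, but breaks silently if the ID length ever varies.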
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 to (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage)) yield m.group(1)
res.foreach(println)

What is a working method for extracting numeric values with associated data from open text?

I tried to look for a solution but nothing was giving me quite what I needed. I'm not sure regex can do what I need.
I need to process a large amount of data where license information is provided. I just need to grab the number of licenses and the name for each license then group and tally the license counts for each company.
Here's an example of the data pulled:
L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00
I would need to grab the license count and license name then add like objects so it would grab (1 Centralized Recording, 1 Centralized Recording, 5 Station, 5 Station) then add license counts and output (2 Centralized Recording, 10 Station)
What would be the easiest way to implement this?
It looks like you're trying to ignore the license number, and get the count and name. So, the following should point you on your way for your data, if it is as uniform as it seems:
import re
r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
m = r.search(" 1 Centralized")
m.groups()
# ('1', 'Centralized')
That regex just says, "Require but ignore one or more spaces; capture the string of digits after them; require but ignore one or more spaces after that; then capture the run of capital letters, lower-case letters, and spaces that follows." (You may need to trim off a newline when you're done.)
The file-handling bit would look like:
with open('/path/to/your_data_file.txt') as f:
    for line in f:
        # run the regex and do stuff for each line
        pass
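Putting the regex and the tallying together, a sketch using collections.Counter on the sample data from the question (the price lines contain no whitespace after the first token, so they simply fail to match and are skipped):

```python
import re
from collections import Counter

line_re = re.compile(r"\S+\s+(\d+)\s+([A-Za-z ]+)")

sample = """L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00"""

totals = Counter()
for line in sample.splitlines():
    m = line_re.match(line)
    if m:
        totals[m.group(2).strip()] += int(m.group(1))

# totals == Counter({'Station': 10, 'Centralized Recording': 2})
```

Counter handles the "group and tally" step so no manual dict bookkeeping is needed.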
import re, io
import pandas as pd

a = open('your_data_file.txt').read()  # re.sub needs the text, not the file object
pd.read_csv(io.StringIO(re.sub(r'(?m).*\s(\d+)\s+(.*\S+)\s+$\n|.*', '\\1,\\2', a)),
            header=None).groupby(1).sum()[0].to_dict()
Pandas is a good tool for jobs like this. You might have to play around with it a bit, and you will need to export your Excel file as a .csv file. In the interpreter, try:
import pandas
raw = pandas.read_csv('myfile.csv')
print(raw.columns)
That will give you the column headings for the csv file. If you have headers name and nums, then you can extract those as a list of tuples as follows:
extract = list(zip(raw.name, raw.nums))
You can then sort this list by name:
extract = sorted(extract)
Pandas probably has a method for compressing this easily, but I can't recall it so:
def accum(c):
    nm = c[0][0]
    count = 0
    result = []
    for x in c:
        if x[0] == nm:
            count += x[1]
        else:
            result.append((nm, count))
            nm = x[0]
            count = x[1]
    result.append((nm, count))
    return result

done = accum(extract)
done = accum(extract)
Now you can write this to a text file as follows (f-strings require Python 3.6+):
with open("myjob.txt", "w+") as fout:
    for x in done:
        line = f"name: {x[0]} count: {x[1]}\n"
        fout.write(line)

Counting how many times a string appears in a CSV file

I have a piece of code that is supposed to tell me how many times a word occurs in a CSV file. Note: the file is pretty large (2 years of text messages).
This is my code:
key_word1 = 'Exmple_word1'
key_word2 = 'Example_word2'
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as a:
    for line in a:
        if (key_word1 or key_word2) in line:
            counter = counter + 1
print(counter)
There are two words because I did not know how to make it non-case sensitive.
To test it I used the Find function in Word on the whole file (using only one of the words, since I was able to do a non-case-sensitive search there), and I got more than double what my code calculated.
At first I used the value_counts() function, BUT I received different values for the same word (searching for Exmple_word1 gave 32 occurrences, then 56, then 2, and so on). I got stuck there for a while, but it got me thinking: I use two keyboards on my phone, which I change regularly. Could it be that the same words are actually different, and that would explain why I am getting these results?
Also, I pretty much checked all sources regarding this matter and I found different approaches that did not actually do what I want them to do. ( the value_counts() method for example)
If that is the case, how can I fix this?
Notice some mistakes in your code:
key_word1 or key_word2 is "lazy" (short-circuiting): if the left operand key_word1 evaluates to True, Python won't even look at key_word2. Since a non-empty string is always truthy, this causes the code to check only whether key_word1 appears in the line.
An example to emphasize:
w1 = 'word1'
w2 = 'word2'
s = 'bla word2'
(w1 or w2) in s
>> False
(w2 or w1) in s
>> True
2. Reading the csv file: I recommend using the csv package (just import it), something like:
import csv
with open('PATH_TO_FILE.csv') as f:
    for line in csv.reader(f):
        # do your logic here
3. Case sensitivity: don't work hard; you can just lower-case each line you read, so you don't need to hold two words.
I guess the solution you are looking for should look something like:
import csv

counter = 0
word_to_search = 'donald'
with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for line in csv.reader(f):
        if any(word_to_search in l for l in map(str.lower, line)):
            counter += 1
Running on input:
bla,some other bla,donald rocks
make,who,great
again, donald is here, hura
will result:
counter=2
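For the original two-word, case-insensitive version, a sketch (the function name is my own) that fixes the or bug by testing each word explicitly:

```python
import csv
import io

def count_rows_containing(words, csv_text):
    """Count CSV rows where any field contains any of the words, ignoring case."""
    words = [w.lower() for w in words]
    counter = 0
    for row in csv.reader(io.StringIO(csv_text)):
        fields = [field.lower() for field in row]
        if any(w in field for w in words for field in fields):
            counter += 1
    return counter
```

Lower-casing both the search words and the fields makes the match case-insensitive without keeping two spellings of each keyword.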

How would I take the number of names in a list and then write the results to a file?

I am fairly new to Python and am having difficulty with this (most likely simple) problem. I'm accepting a file in the format
name_of_sports_team year_they_won_championship
e.g.,
1991 Minnesota
1992 Toronto
1993 Toronto
They are already separated into a nested list [year][name]. I am tasked to add up all the repetitions from the list and display them as such in a new file.
Toronto 2
Minnesota 1
My code is as follows-
def write_tab_seperated(n):
    '''
    N is the filename
    '''
    file = open(n, "w")
    # names are always in the second position?
    data[2] = names
    countnames = ()
    # counting the names
    for x in names:
        # make sure they are all the same
        x = str(name).lower()
        # add one if it shows.
        if x in countnames:
            countnames[x] += 1
        else:
            countnames[x] = 1
    # finish writing the file
    file.close
This is so wrong it's funny, but I planned out where to go from here:
Take the file
separate into the names list
add 1 for each repetition
display in name(tab)number format
close the file.
Any help is appreciated and thank you in advance!
There's a built-in datatype that's perfect for your use case called collections.Counter.
I'm assuming from the sample I/O formatting that your data file columns are tab separated. In the question text it looks like 4-spaces — if that's the case, just change '\t' to ' ' or ' '*4 below.
with open('data.tsv') as f:
    lines = (l.strip().split('\t') for l in f.readlines())
Once you've read the data in, it really is as simple as passing it to a Counter and specifying that it should create counts on the values in the second column.
from collections import Counter

c = Counter(x[1] for x in lines)
And printing them back out for reference:
for k, v in c.items():
    print('{}\t{}'.format(k, v))
Output:
Minnesota 1
Toronto 2
From what I understand through your explanation, the following is my piece of code:
# input.txt is the input file with <year><tab><city> data
with open('input.txt', 'r') as f:
    input_list = [x.strip().split('\t') for x in f]

output_dict = {}
for per_item in input_list:
    if per_item[1] in output_dict:
        output_dict[per_item[1]] += 1
    else:
        output_dict[per_item[1]] = 1

# output file has <city><tab><number of occurrences>
file_output = open("output.txt", "w")
for per_val in output_dict:
    file_output.write(per_val + "\t" + str(output_dict[per_val]) + "\n")
Let me know if it helps.
One of the great things about python is the huge number of packages. For handling tabular data, I'd recommend using pandas and the csv format:
import pandas as pd
years = list(range(1990, 1994))
names = ['Toronto', 'Minnesota', 'Boston', 'Toronto']
dataframe = pd.DataFrame(data={'years': years, 'names': names})
dataframe.to_csv('path/to/file.csv')
That being said, I would still highly recommend to go through your code and learn how these things are done from scratch.
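If you do go the pandas route, the tally itself can be a single value_counts call; the sample data below mirrors the question (Toronto twice, Minnesota once):

```python
import pandas as pd

years = [1991, 1992, 1993]
names = ['Minnesota', 'Toronto', 'Toronto']
df = pd.DataFrame({'year': years, 'name': names})

counts = df['name'].value_counts()
# Toronto      2
# Minnesota    1
# counts.to_csv('counts.tsv', sep='\t', header=False)  # write name<tab>count
```

value_counts sorts by descending frequency, which matches the desired output order.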

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits end trying to figure out how to do this and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re
thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)
while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
            if pattern.search(line):
                doIHaveToCopyTheLine = True
                ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
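For completeness, one more sketch in the spirit of those threads: treat the all-caps signature line as the end of a letter and yield each chunk. The helper name and the pattern are my own approximations of the format shown in the question:

```python
import re

# e.g. "PAUL STONEHILL Los Angeles": two or more all-caps words at line start
signature = re.compile(r'^[A-Z][A-Z .]+[A-Z]\b')

def split_letters(lines):
    """Yield one list of lines per letter; a letter ends at its signature line."""
    current = []
    for line in lines:
        current.append(line)
        if signature.match(line):
            yield current
            current = []
    if current:  # trailing lines with no signature
        yield current
```

Streaming line by line like this avoids holding the whole file in memory and sidesteps the remove-while-iterating bug in the original attempt.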
