Unable to search and compile regex from each line of a file in Python

I am trying to write a program that matches a regex against each line of a file. The initial lines of my file look like this:
Alternate Take with Liz Copeland (Day 1) (12am-1am)
Saturday March 31, 2007
No. Artist Song Album (Label) Comment
buy 1 Tones on Tail Go! (club mix) Everything! (Beggars Banquet)
buy 2 Devo (I Can't Get No) Satisfaction Anthology: Pioneers Who Got Scalped (Warner Archives/Rhino)
My code to match the first line of the file is as follows:
with open("data.csv") as my_file:
for line in my_file:
re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')
num_showtitle_lines_matched = 0
m_show = re.match(re_show, line)
bool(m_show) == 1
if m_show:
num_showtitle_lines_matched += 1
show_title = m_show.group()
print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
print(show_title)
It should give me the result below:
Alternate Take with Liz Copeland (Day 1) (12am-1am)
num_showtitle_lines_matched --> 1
But my code doesn't print any output.
Please let me know how to accomplish this. Thanks in advance.

As in the comment:
just move num_showtitle_lines_matched = 0 above the loop:
with open("data.csv") as my_file:
num_showtitle_lines_matched = 0
for line in my_file:
re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')
m_show = re.match(re_show, line)
bool(m_show) == 1
if m_show:
num_showtitle_lines_matched += 1
show_title = m_show.group()
print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
print(show_title)
Output:
Num show lines matched --> 1
Alternate Take with Liz Copeland (Day 1) (12am-1am)
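As a side note, the same fix can be tidied up a little: compile the pattern once outside the loop, drop the no-op bool(m_show) == 1 line, and print the summary once after the whole file has been read. This is just a sketch of that variant, not a different answer:

import re

re_show = re.compile(r'(Alternate Take with Liz Copeland) \((.*?)\)\s\((.*?)\)')

with open("data.csv") as my_file:
    num_showtitle_lines_matched = 0
    show_title = None
    for line in my_file:
        m_show = re_show.match(line)   # use the precompiled pattern on each line
        if m_show:
            num_showtitle_lines_matched += 1
            show_title = m_show.group()

print("Num show lines matched --> {}".format(num_showtitle_lines_matched))
if show_title:
    print(show_title)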

Related

Rewriting a txt file in python, creating new lines where there is a certain string

I have converted a PDF bank statement to a txt file. Here is a snippet of the .txt file:
15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK
What is the easiest way of re-writing the text file in Python to create a new line at certain points, i.e. after a number 'xx.xx' where a new date such as 'xx Apr' begins?
For example the text to become:
15 Apr 20DDOPEN 100.00
BENNON WATER SRVCS29.00
DDBG BUSINESS106.00...(etc)
I am just trying to make a PDF more readable and useful when working amongst my other files.
If you know of another PDF to txt python converter which works better, I would also be interested.
Thanks for your help
First step would be getting the text file into Python:
with open("file.txt") as file:
    data = file.read()
This next part I initially thought you wouldn't be able to do, but in your example each entry ends with a number XX.XX, and the important thing to notice is that there is a '.' in each number.
Using Python's string find method, you can iteratively look for that '.' and insert a newline character two characters later. You can change the indices below to remove the DD as well if you want.
index = 0
while index != -1:
    index = data.find('.', index)
    if index != -1:
        # insert a newline two characters after the '.', i.e. after the decimals
        data = data[:index + 3] + '\n' + data[index + 3:]
        # jump past the inserted newline so the same '.' is not found again
        index += 4
Then you need to write the new data back to the file:
with open('file.txt', 'w') as file:
    file.write(data)
For the given input the following should work:
import re

counter = 0
l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK"
nums = re.finditer(r"\d+\.\d+", l)
for elem in nums:
    # each insertion shifts the later match positions by one, hence the counter offset
    idx = elem.span()[1] + counter
    l = l[:idx] + '\n' + l[idx:]
    counter += 1
print(l)
The output is:
15 Apr 20DDOPEN 100.00
DDBENNON WATER SRVCS29.00
DDBG BUSINESS106.00
BPC BOB PETROL MINISTRY78.03
BPC BARBARA STREAMING DATA30.50
CRPAYPAL Z4J22FR450.00
CRPAYNAL AAWDL4Z4J22222KHMG30.0019
,028.4917
Apr 20CRCASH IN AT HSBC BANK
Then you should easily be able to write the result line by line to a file.
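If you specifically want the breaks only where a transaction amount is followed by the next transaction code or a new date (which avoids the '30.0019' / ',028.4917' split visible above), a lookahead-based substitution is another option. This is only a sketch: it assumes transaction codes are runs of capital letters, dates look like '17 Apr 20', and the file names are placeholders.

import re

with open("file.txt") as f:                  # placeholder input file name
    data = f.read()

# break after an amount like 123.45 when a transaction code (DD, BPC, CR, ...) follows
data = re.sub(r'(\d+\.\d{2})(?=[A-Z]{2,})', r'\1\n', data)
# break before a new date like "17 Apr 20" that introduces the next day's entries
data = re.sub(r'(?=\d{2} [A-Z][a-z]{2} \d{2}[A-Z]{2})', '\n', data).lstrip('\n')

with open("file_split.txt", "w") as f:       # placeholder output file name
    f.write(data)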

Process lines with different sizes to csv

I'm trying to convert a PDF bank extract to CSV. I'm fairly new to Python, but I managed to extract the text from the PDF. I ended up with something similar to this:
AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit
Some Restaurant 23/12/2019 14:00 -R$ 10,00 R$ 500 credit
Received from John Doe 22/12/2019 15:00 R$ 510 R$ 500,00
03 Games 22/12/2019 15:00 R$ 10 R$ 10,00 debit
I want this output:
AMAZON;23/12/2019;-100,00
Some Restaurant;23/12/2019;-10,00
Received from John Doe;22/12/2019;510
03 Games;22/12/2019;10
The first field has a different length on each line, I don't need the time or the currency formatting, and I don't need the last 2 fields.
I have this code so far (just extracting text from PDF):
import pdfplumber
import sys

url = sys.argv[1]
pdf = pdfplumber.open(url)
pdf_pages = len(pdf.pages)
for i in range(pdf_pages):
    page = pdf.pages[i]
    text = page.extract_text()
    print(text)
pdf.close()
Can anyone give some directions?
Try using the split method to break the text into lines and each line into its separate parts, then pick the parts you need.
The following link explains it very nicely:
https://www.w3schools.com/python/showpython.asp?filename=demo_ref_string_split
from typing import List

def get_date_index(entries_check: List[str]) -> int:
    # look for an entry shaped like dd/mm/yyyy
    for index, entry in enumerate(entries_check):
        if len(entry) != 10:
            continue
        if entry[2] != "/" or entry[5] != "/":
            continue
        # here you could also check that the remaining characters are digits
        return index
    raise ValueError("No date found")

lines: List[str] = text.split("\n")          # text comes from page.extract_text() above
for line in lines:
    entries: List[str] = line.split()
    date_entry_index: int = get_date_index(entries)
    # everything before the date is the name
    name = " ".join(entries[:date_entry_index])
    # the first amount follows its currency token; the sign, if any, is on that token
    amount = entries[date_entry_index + 3]
    if entries[date_entry_index + 2].startswith("-"):
        amount = "-" + amount
    print(f"{name};{entries[date_entry_index]};{amount}")
That should print it.
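As an alternative, a single regular expression can pull the three fields out of each line in one pass. This is only a sketch that assumes every line follows the '<name> dd/mm/yyyy hh:mm [-]R$ <amount> ...' shape shown in the sample:

import re

# pattern for lines like "<name> dd/mm/yyyy hh:mm [-]R$ <amount> R$ <balance> [credit|debit]"
LINE_RE = re.compile(
    r'^(?P<name>.+?)\s+'               # description (non-greedy)
    r'(?P<date>\d{2}/\d{2}/\d{4})\s+'  # date
    r'\d{2}:\d{2}\s+'                  # time, discarded
    r'(?P<sign>-?)R\$\s*'              # the currency token carries the sign
    r'(?P<amount>[\d.,]+)'             # amount
)

def to_csv_row(line: str) -> str:
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"Unrecognised line: {line!r}")
    return f"{m['name']};{m['date']};{m['sign']}{m['amount']}"

print(to_csv_row("AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit"))
# AMAZON;23/12/2019;-100,00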

How to get a sequence after a word with whitespace

For school I have to parse the text that comes after a keyword, spread over several lines with a lot of whitespace, but I just can't get it. The file is a GenBank-style file.
So for example:
BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//
What I have tried is this:
if line.startswith("BLA"):
    start = line.find("BLA")
    end = line.find("//")
    line = line[:end]
    s_string = ""
    string = list()
    if s_string:
        string.append(line)
    else:
        line = line.strip()
        my_seq += line
But what I get is this output:
BLA
and that is the only thing I get, while I want the output to be like this:
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
So I don't know what to do; I tried to get that last output but without success. My teacher told me to do it roughly like this: once you see BLA you can start iterating, and when you see "//" you have to stop, but when I tried it with a True flag like that I got nothing.
I searched online and found suggestions to use Bio.SeqIO, but the teacher said we can't use that.
Here is my solution:
lines = """BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
lines = lines.strip().split("//")
lines = lines[0].split("BLA")
lines = [i.strip() for i in lines]
print("BLA", " ", lines[1])
Output:
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
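If the data comes from a file instead of a hard-coded string, a line-by-line version of the same idea could look like the sketch below. It assumes the header line starts with "BLA", the record ends at "//", and the file name genbank.txt is made up for the example:

from typing import List

def read_record(path: str) -> List[str]:
    """Collect the lines between the 'BLA' header and the '//' terminator."""
    collecting = False
    collected = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith("BLA"):
                collecting = True
                continue
            if line.startswith("//"):
                break
            if collecting and line:
                collected.append(line)
    return collected

# hypothetical usage:
sequence_lines = read_record("genbank.txt")
print("BLA", sequence_lines[0])
for extra in sequence_lines[1:]:
    print(extra)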

Using txt files in Python

I have a txt file which I need to access through Python. The data in the txt file is a football league table in CSV format. The CSV data covers the games played, won and lost, from which each team's points are calculated (2 points for a win, 0 for a loss). I have an idea of how to start this but I'm not sure I have started on the right foot.
How do I calculate the total points for each team? And can I print headings (Team, Played, Won, Lost, Total) above the data? Any support would be appreciated.
CSV Data:
Liverpool,19,7,12
Chelsea,19,8,11
Arsenal,19,0,19
Tottenham,19,7,12
Man Utd,19,7,12
Man City,19,5,14
Southampton,19,3,16
Code:
text_file = open("leagueResults.txt", "r")
print(text_file.read())
text_file.close()
As mentioned in the comments, you should look into the csv module.
However, in your case, since I assume you have just started learning Python and the problem is relatively simple, we can do it by just reading the file line by line and splitting on the delimiter ','.
team_name = []
games_won = []
num_records = 0

with open('leagueResults.txt') as f:
    for line in f:
        record = line.strip().split(',')
        team_name.append(record[0])
        games_won.append(record[2])
        num_records += 1

print("Points Table")
print("============")
for i in range(0, num_records):
    print("%s: %d" % (team_name[i], (int(games_won[i]) * 2)))
Output:
Points Table
============
Liverpool: 14
Chelsea: 16
Arsenal: 0
Tottenham: 14
Man Utd: 14
Man City: 10
Southampton: 6
Notice how I am only interested in team_name and games_won, since those are the only two fields needed to calculate the points per team (games_played is always 19, and games_lost has no effect on the total because a loss is worth 0 points).
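For completeness, here is what the same calculation might look like with the csv module mentioned above, including the header row you asked about. This is just a sketch assuming the Team,Played,Won,Lost layout shown in your data:

import csv

with open("leagueResults.txt", newline="") as f:
    reader = csv.reader(f)
    print(f"{'Team':<15}{'Played':>8}{'Won':>6}{'Lost':>6}{'Total':>7}")
    for team, played, won, lost in reader:
        points = int(won) * 2          # 2 points per win, 0 per loss
        print(f"{team:<15}{played:>8}{won:>6}{lost:>6}{points:>7}")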

Python MapReduce Hadoop Streaming Job that requires 3 input files?

I have 3 small sample input files (the actual files are much larger),
# File Name: books.txt
# File Format: BookID|Title
1|The Hunger Games
2|To Kill a Mockingbird
3|Pride and Prejudice
4|Animal Farm
# File Name: ratings.txt
# File Format: ReaderID|BookID|Rating
101|1|1
102|2|2
103|3|3
104|4|4
105|1|5
106|2|1
107|3|2
108|4|3
# File Name: readers.txt
# File Format: ReaderID|Gender|PostCode|PreferComms
101|M|1000|email
102|F|1001|mobile
103|M|1002|email
104|F|1003|mobile
105|M|1004|email
106|F|1005|mobile
107|M|1006|email
108|F|1007|mobile
I want to create a Python MapReduce Hadoop Streaming Job to get the following output which is the Average Rating by Title by Gender
Animal Farm F 3.5
Pride and Prejudice M 2.5
The Hunger Games M 3
To Kill a Mockingbird F 1.5
I searched this forum and someone pointed out a solution, but it is for 2 input files instead of 3. I gave it a go but am stuck at the mapper because I am not able to sort the records so that the reducer can recognise the first record for each Title and Gender and then start aggregating. My mapper code is below:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    try:
        ReaderID = "-1"
        BookID = "-1"
        Title = "-1"
        Gender = "-1"
        Rating = "-1"
        line = line.strip()
        splits = line.split("|")
        if len(splits) == 2:
            BookID = splits[0]
            Title = splits[1]
        elif len(splits) == 3:
            ReaderID = splits[0]
            BookID = splits[1]
            Rating = splits[2]
        else:
            ReaderID = splits[0]
            Gender = splits[1]
        print('%s\t%s\t%s\t%s\t%s' % (BookID, Title, ReaderID, Rating, Gender))
    except:
        pass
PS: I need to use Python and Hadoop Streaming only. Not allowed to install Python packages like Dumbo, mrjob and etc.
Appreciate your help in advance.
Thanks,
Lobbie
I went through some core Java MapReduce examples, and they all suggest that the three files cannot be joined in a single map job: you first join the first two, and then join that result with the third. Applying your logic to all three at once does not give a good result. So I tried Pandas instead, and it seems to give a promising result. If using Pandas is not a constraint for you, please try my code; otherwise we can try to join these three files with Python dictionaries and lists.
Here is my suggested code. I have just concatenated all the inputs to test it. In your code, comment out my for loop over the concatenated test lists and un-comment the for loop over sys.stdin just above it.
import pandas as pd
import sys

input_string_book = [
    "1|The Hunger Games",
    "2|To Kill a Mockingbird",
    "3|Pride and Prejudice",
    "4|Animal Farm"]
input_string_book_df = pd.DataFrame(columns=('BookID', 'Title'))

input_string_rating = [
    "101|1|1",
    "102|2|2",
    "103|3|3",
    "104|4|4",
    "105|1|5",
    "106|2|1",
    "107|3|2",
    "108|4|3"]
input_string_rating_df = pd.DataFrame(columns=('ReaderID', 'BookID', 'Rating'))

input_string_reader = [
    "101|M|1000|email",
    "102|F|1001|mobile",
    "103|M|1002|email",
    "104|F|1003|mobile",
    "105|M|1004|email",
    "106|F|1005|mobile",
    "107|M|1006|email",
    "108|F|1007|mobile"]
input_string_reader_df = pd.DataFrame(columns=('ReaderID', 'Gender', 'PostCode', 'PreferComms'))

# for line in sys.stdin:
for line in input_string_book + input_string_rating + input_string_reader:
    try:
        line = line.strip()
        splits = line.split("|")
        # classify each record by its number of fields
        # (note: DataFrame.append was removed in pandas 2.x; use pd.concat there)
        if len(splits) == 2:
            input_string_book_df = input_string_book_df.append(
                pd.DataFrame([[splits[0], splits[1]]], columns=('BookID', 'Title')))
        elif len(splits) == 3:
            input_string_rating_df = input_string_rating_df.append(
                pd.DataFrame([[splits[0], splits[1], splits[2]]], columns=('ReaderID', 'BookID', 'Rating')))
        else:
            input_string_reader_df = input_string_reader_df.append(
                pd.DataFrame([[splits[0], splits[1], splits[2], splits[3]]],
                             columns=('ReaderID', 'Gender', 'PostCode', 'PreferComms')))
    except:
        raise

l_concat_1 = input_string_book_df.merge(input_string_rating_df, on='BookID', how='inner')
l_concat_2 = l_concat_1.merge(input_string_reader_df, on='ReaderID', how='inner')

for each_iter in l_concat_2[['BookID', 'Title', 'ReaderID', 'Rating', 'Gender']].iterrows():
    print('%s\t%s\t%s\t%s\t%s' % (each_iter[1][0], each_iter[1][1], each_iter[1][2], each_iter[1][3], each_iter[1][4]))
Output:
1 The Hunger Games 101 1 M
1 The Hunger Games 105 5 M
2 To Kill a Mockingbird 102 2 F
2 To Kill a Mockingbird 106 1 F
3 Pride and Prejudice 103 3 M
3 Pride and Prejudice 107 2 M
4 Animal Farm 104 4 F
4 Animal Farm 108 3 F
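For reference, the dictionary-and-list approach mentioned above could look roughly like the sketch below. It assumes all three files fit in memory on one machine (so it is a plain Python script rather than a streaming mapper/reducer), and it distinguishes the record types by field count just like the mapper:

from collections import defaultdict

books = {}      # BookID -> Title
readers = {}    # ReaderID -> Gender
ratings = []    # (ReaderID, BookID, Rating)

for path in ("books.txt", "ratings.txt", "readers.txt"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):      # skip blanks and comment headers
                continue
            parts = line.split("|")
            if len(parts) == 2:                       # BookID|Title
                books[parts[0]] = parts[1]
            elif len(parts) == 3:                     # ReaderID|BookID|Rating
                ratings.append((parts[0], parts[1], int(parts[2])))
            else:                                     # ReaderID|Gender|PostCode|PreferComms
                readers[parts[0]] = parts[1]

totals = defaultdict(lambda: [0, 0])                  # (Title, Gender) -> [sum, count]
for reader_id, book_id, rating in ratings:
    key = (books[book_id], readers[reader_id])
    totals[key][0] += rating
    totals[key][1] += 1

for (title, gender), (total, count) in sorted(totals.items()):
    print("%s\t%s\t%g" % (title, gender, total / count))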
