How to loop using the .split() function on a text file in Python

I have an HTML file with different team names written throughout the file. I just want to grab the team names. The team names always occur after certain text and end before certain text, so I've used the split function to find them. I'm a beginner, and I'm sure I'm making this harder than it is. data holds the file contents:
teams = data.split('team-away">')[1].split("</sp")[0]
for team in teams:
    print team
This returns each individual character of the first team that it finds (so, for example, if teams = San Francisco 49ers, it prints "S", then "a", and so on) instead of what I need it to do: print "San Francisco 49ers", then on the next line the next team, "Carolina Panthers", and so on.
Thank you!

"I'm a beginner, and I'm sure I'm making this harder than it is."
Well, kind of.
import re
teams = re.findall('team-away">(.*)</sp', data)
(with credit to Kurtis, for a simpler regular expression than I originally had)
Though an actual HTML parser would be best practice.

Don't reinvent the wheel! Look into BeautifulSoup; it'll do the job for you.
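For example, here is a minimal sketch with BeautifulSoup, assuming the team names sit in markup like <span class="team-away">San Francisco 49ers</span> (the exact markup and the file name are assumptions, not something the question confirms):
from bs4 import BeautifulSoup

# "page.html" is a placeholder for the OP's HTML file
with open("page.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# print the text of every span with class "team-away"
for span in soup.find_all("span", class_="team-away"):
    print(span.get_text())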

Related

using str.contains to replace part of a string up to a delimiter

You are my last bastion of hope before I turn to the horrid world of Excel macros.
I have a humongous data frame from Excel that I'm manipulating; Pandas has proved useful for editing, as Excel really struggles.
My final issue is as follows:
Now, I have a column that lists interests by user, with over 100k rows. The problem is that the data was never validated, so I have potentially useful information which I need to manipulate into one of 10 outputs.
I've found using str.replace and str.contains extremely useful, and I think I would build some dicts or lists to iterate through to work the logic.
When I use str.contains on my list it replaces the whole string, but I need to keep the information after the delimiter (as users can have more than one interest).
so I could have
User, Interest
a Racing, Football, Soccer, Kickball, footy, Basketball, Hockey, Running, Jogging, Jogging & Running
b Racing, Jogging, Basketball, Computers, Reading.
c Ice Hockey
So, for example, there are multiple variants of Football which would need to be put into one category, and so forth.
With the assumption that we are only after sports, what would also be an efficient method to clean out the data that is not sport-related?
I hope the entirety of my issue makes sense.
Output:
User, Interest
a Race, Ball Sport, Athletics
b Race, Ball Sport, Athletics
c Athletics
I don't know if you want to clean the file by modifying it, or if you want to selectively filter the interests at runtime, but here is how I would do this:
First I would get the sorted (and unique) list of all interests: copy all of them into a file, one per line, and run something like sort -u FILE > OUTPUT in Bash or similar.
Then I would regroup the interests (Racing and Race => Race).
With these groups, I would create a mapping with a dictionary in Python:
mapping = {
    'racing': 'Race',
    'race': 'Race',
    'football': 'Ball Sport',
    ...
}
Finally, when reading the file, I would use a function to return the validated interests for each line:
def validate_interests(*interests):
    validated = []
    for interest in interests:
        valid = mapping.get(interest.lower(), None)
        if valid is not None:
            validated.append(valid)
    return validated
In [10]: validate_interests('Football', 'Racing')
Out[10]: ['Ball Sport', 'Race']
Of course you would need to iterate over the rows and parse each one into a list of interests, but I won't go into too much detail.
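As a rough sketch of that last step, assuming the data lives in a pandas DataFrame with an 'Interest' column of comma-separated values (the column name and the clean_cell helper are my own, not part of the question):
import pandas as pd

# toy frame standing in for the real 100k-row data
df = pd.DataFrame({
    'User': ['a', 'b', 'c'],
    'Interest': ['Racing, Football, Jogging', 'Racing, Computers', 'Ice Hockey'],
})

def clean_cell(cell):
    # split on the delimiter, then keep only interests that validate
    # against the mapping / validate_interests defined above
    interests = [part.strip() for part in cell.split(',')]
    validated = validate_interests(*interests)
    # drop duplicates while keeping order, then rejoin
    return ', '.join(dict.fromkeys(validated))

df['Interest'] = df['Interest'].apply(clean_cell)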

Removing Duplicates in Json file

I have a JSON file which contains some duplicates, and I am looking for a way to remove them. Here are two examples of the beginning of my JSON texts:
"date": "May 16, 2012 Wednesday", "body": "THE future of one of Scotland's most important listed buildings .... World Monuments Fund. o See a picture gallery of Mavisbank House at scotsman.com/scotland ", "title": "Rescue deal to bring Adam mansion back from brink"
"date": "May 16, 2012 Wednesday", "body": "The future of one of Scotland's most important listed buildings .... World Monuments Fund.", "title": "Rescue deal to bring Adam mansion back from brink"
I have cut the text in the middle because of its length and because the two match perfectly there. As we can see, the text matches almost 100%, except at the beginning (THE vs The) and at the end (the extra sentence: o See a picture gallery of Mavisbank House at scotsman.com/scotland). Along these lines I would like to come up with a way to (I) find the duplicates and (II) remove one of them (note that there can also be more than one duplicate). I just started programming in Python and I am not sure how to handle this problem. Any help is really appreciated!
Kind regards!
I think it would be better if you first convert your JSON string into a model object.
After that you can simply iterate over the elements and remove the duplicates (to whatever level). You can ignore case while comparing the individual elements.
Also, you can simply convert each of your body/title elements to a consistent case and add them to a set for a duplicate check while iterating, as @ForceBru pointed out in the comments.
The following link will point you in the appropriate direction for JSON-to-object conversion.
Is there a python json library can convert json to model objects, similar to google-gson?
Hope this helps.
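For instance, here is a minimal sketch of the set-based check, assuming the file holds a JSON array of objects with "date", "body" and "title" keys (the file names and the choice of date plus title as the duplicate key are assumptions):
import json

with open("articles.json") as f:
    records = json.load(f)

seen = set()
unique = []
for record in records:
    # normalise case so "THE future ..." and "The future ..." compare equal
    key = (record["date"].lower(), record["title"].lower())
    if key not in seen:
        seen.add(key)
        unique.append(record)

with open("articles_deduped.json", "w") as f:
    json.dump(unique, f, indent=2)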

Regex pattern using multiple groups which may or may not exist with text in between

I'm using regex on a list of strings (one string at a time) in order to extract information pertaining to each string. I have an almost functioning pattern which works on all the possible events I will potentially pass into it except one. I'm fairly new to regex, and I am therefore beginning to find it impossible to handle, especially as the pattern gets more complicated. I have multiple possible strings to match, and they all work except one.
Here are the possible strings, one per line. The format is consistent, but the content such as the names, scores and additional information is not.
Goal scored Sunderland 4, Cardiff City 0. Connor Wickham (Sunderland) header from the centre of the box to the bottom left corner. Assisted by Emanuele Giaccherini with a cross following a corner.
Booking Sebastian Larsson (Sunderland) is shown the yellow card.
Foul by Jordon Mutch (Cardiff City).
Dismissal Cala (Cardiff City) is shown the red card.
Penalty conceded by Cala (Cardiff City) after a foul in the penalty area.
They all follow the same format other than goals, and therefore work with my current pattern; however, I would like the goal string to work as well, but it will not, due to the capitalization of the team names. Ideally I would like to capture the team names and scores in two separate groups, home team and away team, although it is not completely necessary.
Here is my current regex pattern which, other than for goals, correctly detects the event, player names, team and any extra information after it. I initially had .* instead of [A-Z]*, which worked on goals but always cut off players' first names, which I believe is due to it being optional within the group.
(?P<event>\A\w+)[^A-Z]*(?P<playername>(?:[A-Z]\w+)*\s\w+\s)(?P<team>\(.+\))(?P<extrainfo>[^\Z.]+)*
To break this down, this is what I am currently trying to look for:
the first word that appears, which is under the event group: (?P<event>\A\w+)
any number of characters which are not a capital letter (the initial reason the goal string is broken): [^A-Z]*
a player name, which can be of any length (some names are a single word, others have multiple parts, hence the non-capturing group to detect any first names): (?P<playername>(?:[A-Z]\w+)*\s\w+\s)
a team name, which is always enclosed in brackets after the player name: (?P<team>\(.+\))
any extra information about the event, so anything after the team name. I also make sure to check it is not just a . so that the matched group can be None: (?P<extrainfo>[^\Z.]+)*
I am currently trying to find a solution along the lines of [^A-Z.]*(?P<hometeam>\w+[^,.])*(?P<awayteam>\w+[^,.])* but this is not working and I am struggling.
A further task, which is trivial but which I would love to add if possible, would be somehow removing the brackets from the team group, so that instead of teamname (Cardiff City) it becomes teamname Cardiff City.
Thanks for the help.
I would suggest splitting this into two tasks:
Extract the goals scored (r"^(?P<event>goal scored) (?P<hometeam>.*) (?P<homescore>\d), (?P<awayteam>.*) (?P<awayscore>\d). (?P<playername>.*) \((?P<scoringteam>.*)\).*$"); and
Extract the other events (r"^(?P<event>booking|foul|dismissal|penalty conceded) (?:by )?(?P<playername>.*) \((?P<teamname>.*)\).*$").
In your example, the former matches:
event [0-11] `Goal scored`
hometeam [12-23] `Sunderland`
homescore [23-24] `4`
awayteam [26-39] `Cardiff City`
awayscore [39-40] `0`
playername [42-56] `Connor Wickham`
scoringteam [58-68] `Sunderland`
And the latter, for example:
event [197-204] `Booking`
playername [205-222] `Sebastian Larsson`
teamname [224-234] `Sunderland`
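A sketch of how the two patterns might be applied in Python; the re.IGNORECASE flag and the small events list are my assumptions, since the patterns above are lower case while the events in the text start with a capital letter:
import re

goal_re = re.compile(
    r"^(?P<event>goal scored) (?P<hometeam>.*) (?P<homescore>\d), "
    r"(?P<awayteam>.*) (?P<awayscore>\d). (?P<playername>.*) \((?P<scoringteam>.*)\).*$",
    re.IGNORECASE)
other_re = re.compile(
    r"^(?P<event>booking|foul|dismissal|penalty conceded) (?:by )?"
    r"(?P<playername>.*) \((?P<teamname>.*)\).*$",
    re.IGNORECASE)

# two of the example strings from the question
events = [
    "Goal scored Sunderland 4, Cardiff City 0. Connor Wickham (Sunderland) header from the centre of the box to the bottom left corner. Assisted by Emanuele Giaccherini with a cross following a corner.",
    "Foul by Jordon Mutch (Cardiff City).",
]

for event in events:
    match = goal_re.match(event) or other_re.match(event)
    if match:
        # the named groups give the team names without the brackets, as asked
        print(match.groupdict())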

Python 3 Extracting Candidate Words from a Debate File

This is my first post, so I'm sorry if I do anything wrong. That said, I searched for the question and found something similar that was never answered due to the OP not giving sufficient information. This is also homework, so I'm just looking for a hint. I really want to get this on my own.
I need to read in a debate file (.txt), and pull and store all of the lines that one candidate says to put in a word cloud. The file format is supposed to help, but I'm blanking on how to do this. The hint is that each time a new person speaks, their name followed by a colon is the first word in the first line. However, candidates' data can span multiple lines. I am supposed to store each person's lines separately. Here is a sample of the file:
LEHRER: This debate and the next three -- two presidential, one vice
presidential -- are sponsored by the Commission on Presidential
Debates. Tonight's 90 minutes will be about domestic issues and will
follow a format designed by the commission. There will be six roughly
15-minute segments with two-minute answers for the first question,
then open discussion for the remainder of each segment.
Gentlemen, welcome to you both. Let's start the economy, segment one,
and let's begin with jobs. What are the major differences between the
two of you about how you would go about creating new jobs?
LEHRER: You have two minutes. Each of you have two minutes to start. A
coin toss has determined, Mr. President, you go first.
OBAMA: Well, thank you very much, Jim, for this opportunity. I want to
thank Governor Romney and the University of Denver for your
hospitality.
There are a lot of points I want to make tonight, but the most
important one is that 20 years ago I became the luckiest man on Earth
because Michelle Obama agreed to marry me.
This is what I have for a function so far:
def getCandidate(myFile):
    file = open(myFile, "r")
    obama = []
    romney = []
    lehrer = []
    file = file.readlines()
I'm just not sure how to iterate through the data so that it separates each person's words correctly. I created a dummy file to create the word cloud, and I'm able to do that fine, so all I am wondering is how to extract the information I need.
Thank you! If there is more information I can offer please let me know. This is a beginning Python course.
EDIT: New code added from a response. This works to an extent, but only grabs the first line of each candidate's response, not their entire response. I need to write code that continues to store each line under that candidate until a new name is at the start of a line.
def getCandidate(myFile, candidate):
    file = open(myFile, "r")
    OBAMA = []
    ROMNEY = []
    LEHRER = []
    file = file.readlines()
    for line in file:
        if line.startswith("OBAMA:"):
            OBAMA.append(line)
        if line.startswith("ROMNEY:"):
            ROMNEY.append(line)
        if line.startswith("LEHRER:"):
            LEHRER.append(line)
    if candidate == "OBAMA":
        return OBAMA
    if candidate == "ROMNEY":
        return ROMNEY
EDIT: I now have a new question. How can I generalize the file so that I can open any debate file between two people and a moderator? I am having a lot of trouble with this one.
I've been given a hint to look at the beginning of each line and check whether its first word ends in ":", but I'm still not sure how to do this. I tried splitting each line on spaces and then looking at the first item in the line, but that's as far as I've gotten.
The hint is this: after you split your lines, iterate over them and check with the string function startswith for each candidate, then append.
The iteration over a file is very simple:
for row in file:
    do_something_with_row
EDIT:
To keep appending lines until you find a new candidate, you have to keep track of the last candidate seen with a variable; if you don't find any match at the beginning of the line, you stick with the same candidate as before.
if line.startswith('OBAMA'):
    last_seen = OBAMA
    OBAMA.append(line)
elif blah blah blah
else:
    last_seen.append(line)
By the way, I would change the definition of the function: instead of taking the name of one candidate and returning only his lines, it would be better to return a dictionary with the candidate names as keys and their lines as values, so you wouldn't need to parse the file more than once. When you work with bigger files this could be a lifesaver.
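A sketch of that dictionary-based version, generalised to any set of speakers (the function name and the "first word ends with a colon" rule follow the hint in the question, but the details are my own):
def get_speakers(filename):
    speakers = {}      # speaker name -> list of lines
    last_seen = None
    with open(filename) as f:
        for line in f:
            first_word = line.split(' ', 1)[0] if line.strip() else ''
            if first_word.endswith(':'):
                # a new speaker starts talking
                last_seen = first_word.rstrip(':')
                speakers.setdefault(last_seen, []).append(line)
            elif last_seen is not None:
                # continuation of the previous speaker
                speakers[last_seen].append(line)
    return speakers

# usage: speeches = get_speakers("debate.txt")
#        obama_lines = speeches.get("OBAMA", [])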

lists and sublists

I use this code to split the data to make a list with three sublists,
splitting when there is a * or a -. But it also reads the \n\n * sequences; I don't know why.
I don't want to read those. Can someone tell me what I'm doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the list it produces is not quite the one I need.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [item.replace("\\n", "\n").strip() for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
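For example, a small sketch of that line-by-line alternative, reading the file and dropping the trailing newline from each line:
with open("data.dat") as f:
    lines = [line.rstrip("\n") for line in f]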
This is going to split on any asterisk you have in the text as well.
A better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):  # lstrip and startswith are methods, so they need ()
        lines.append([line.strip()])   # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem, I believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
    print '***', i[0], '***'
    for j in i[1:]:
        print '\t--', j
    print
Note that I split on a newline plus * or -; this way it won't split on dashes inside the text. I also replace the textual character sequence \ n \ n (r'\n\n') with a real newline character '\n'. And the one-liner expression is a list comprehension, a way to construct lists in one gulp, without multiple .append() calls or + concatenation.
