Converting a text file to a list - python

I have the following text file:
"""[' Hoffa remains      Allen Iverson      Bill Cosby      WWE Payback results      Juneteenth shooting      Miss Utah flub      Octopus pants      Magna Carta Holy Grail      China supercomputer      Sibling bullying ']"""
I would like to create a list from it and apply a function to each name
this is my code so far:
listing = open(fileName, 'r')
lines = listing.read().split(',')
for line in lines:
    # Function

First strip characters like """['] from the start and end of the string using str.strip, then split the resulting string at six spaces (' '*6). Splitting returns a list, but some items still have trailing and leading whitespace, which you can remove using str.strip again.
with open(fileName) as f:
    lis = [x.strip() for x in f.read().strip('\'"[]').split(' '*6)]
    print lis
...
['Hoffa remains', 'Allen Iverson', 'Bill Cosby', 'WWE Payback results', 'Juneteenth shooting', 'Miss Utah flub', 'Octopus pants', 'Magna Carta Holy Grail', 'China supercomputer', 'Sibling bullying']
Applying function to the above list:
List comprehension:
[func(x) for x in lis]
map:
map(func, lis)
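For Python 3, the steps above can be combined into one small function. This is only a minimal sketch: parse_names is a hypothetical helper name, the sample string stands in for the file contents, and it assumes the names are separated by six spaces as described in the answer.

```python
def parse_names(raw, sep=' ' * 6):
    """Strip the wrapping quote/bracket characters, then split on the separator."""
    cleaned = raw.strip().strip('\'"[]')
    # Splitting leaves stray whitespace on some items, so strip each one
    # and drop anything that ends up empty.
    return [item.strip() for item in cleaned.split(sep) if item.strip()]


# Sample text standing in for the file contents (assumed six-space separators).
raw = '"""[\' Hoffa remains      Allen Iverson      Bill Cosby \']"""'
names = parse_names(raw)
print(names)

# Apply a function to every name; str.upper is only a placeholder.
print([name.upper() for name in names])
```

With a real file, the same function could be called on f.read() inside a with open(...) block.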

I would first refer you to some other similar posts.
Also, you can't split on a comma here, because there is no comma between the items you want to separate. split() breaks the string into substrings at the delimiter you give it, in your case a comma ','.

Related

How to split based off two characters "[" and "]" in a string

For example calling .split() on the following would give...
x = """[Chorus: Rihanna & Swizz Beatz]
I just wanted you to know
...more lyrics
[Verse 2: Kanye West & Swizz Beatz]
I be Puerto Rican day parade floatin'
... more lyrics"""
x.split()
print(x)
would give
["I just wanted you to know ... more lyrics", "I be Puerto Rican day parade floatin' ... more lyrics"]
Also, how would you save the deleted parts in brackets? Thank you. Splitting by an unknown string inside two things is hard :/
Use re.split
>>> import re
>>> x = """[Chorus: Rihanna & Swizz Beatz] I just wanted you to know...more lyrics [Verse 2: Kanye West & Swizz Beatz] I be Puerto Rican day parade floatin' ... more lyrics"""
>>> [i.strip() for i in re.split(r'[\[\]]', x) if i]
# ['Chorus: Rihanna & Swizz Beatz', 'I just wanted you to know...more lyrics', 'Verse 2: Kanye West & Swizz Beatz', "I be Puerto Rican day parade floatin' ... more lyrics"]
data=x.split(']')
print(data)
data=data[1::]
print(data)
location=0
for i in data:
    data[location]=i.split('[')[0]
    location=location+1
print(data)
I got this output for your initial input
['I just wanted you to know...more lyrics', "I be Puerto Rican day parade floatin'... more lyrics"]
I hope this helps
Per the python documentation: https://docs.python.org/2/library/re.html
Python is by and large an excellent language with good consistency, but it still has some quirks. You might expect re.split() to take an argument that decides whether the delimiter is returned. Instead, that depends on the pattern: if you surround your regex with parentheses (a capturing group), re.split() returns the delimiter as part of the list.
Here are two ways you might try to accomplish your goal:
re.split("]",string_here)
and
re.split("(])",string_here)
The first way will return the pieces with your delimiter removed. The second way will return the pieces with your delimiter still there, as a separate entry.
For example, running the first example on the string "This is ] a string" would produce:
["This is ", " a string"]
And running the second example would produce:
["This is ", "]", " a string"]
Personally, I'm not sure why they made this strange design choice.
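The difference between the two forms can be checked with a short, self-contained snippet. The variable names here are illustrative only, and the bracket is escaped for readability, although the unescaped patterns above behave the same way.

```python
import re

text = "This is ] a string"

# Without a capturing group the delimiter is discarded.
without_delim = re.split(r"\]", text)

# With a capturing group the delimiter is kept as its own list item.
with_delim = re.split(r"(\])", text)

print(without_delim)  # ['This is ', ' a string']
print(with_delim)     # ['This is ', ']', ' a string']
```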
import re
...
input='[youwontseethis]what[hi]ever'
...
output=re.split(r'\[.*?\]',input)
print(output)
#['','what','ever']
If the input string starts immediately with a 'tag' like your example, the first item in the list will be an empty string. If you don't want that, you could also do this:
import re
...
input='[youwontseethis]what[hi]ever'
...
output=re.split(r'\[.*?\]',input)
output=output[1:] if output[0] == '' else output
print(output)
#['what','ever']
To get the tags simply replace the
output=re.split(r'\[.*?\]',input)
with
output=re.findall(r'\[.*?\]',input)
#['[youwontseethis]','[hi]']
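If both the tags and the text that follows them are needed, one possible approach is a single re.split with a capturing group around the bracket contents. This is a sketch under the assumption that every piece of text is preceded by a tag; the names lyrics, tags, texts and sections are made up for the example.

```python
import re

lyrics = ("[Chorus: Rihanna & Swizz Beatz] I just wanted you to know "
          "[Verse 2: Kanye West & Swizz Beatz] I be Puerto Rican day parade floatin'")

# The capturing group makes re.split return the bracket contents as well,
# so the result alternates: text, tag, text, tag, text ...
parts = re.split(r'\[(.*?)\]', lyrics)

# parts[0] is whatever precedes the first tag (an empty string here).
tags = parts[1::2]
texts = [p.strip() for p in parts[2::2]]
sections = list(zip(tags, texts))

for tag, text in sections:
    print(tag, '->', text)
```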

strip white spaces and new lines when reading from file

I have the following code, which successfully strips end-of-line characters when reading from a file, but doesn't do so for leading and trailing whitespace (I want the spaces in between to be left!)
What is the best way to achieve this? (Note, this is a specific example, so not a duplicate of general methods to strip strings)
My code (try it with the test data: "Mr Moose" is not found, but "Mr Moose ", with a space after Moose, works):
#A COMMON ERROR is leaving in blank spaces and then finding you cannot work with the data in the way you want!
"""Try the following program with the input: Mr Moose
...it doesn't work..........
but if you try "Mr Moose " (that is a space after Moose..."), it will work!
So how to remove both new lines AND leading and trailing spaces when reading from a file into a list. Note, the middle spaces between words must remain?
"""
alldata=[]
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
    for line in f.readlines():
        alldata.append(line.strip())
print(alldata)
print()
print()
for x in alldata:
    teacher_names.append(x.split(delimiter)[col_num])
teacher=input("Enter teacher you are looking for:")
if teacher in teacher_names:
    print("found")
else:
    print("No")
Desired output, on producing the list alldata
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
i.e - remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
Contents of teacherbook:
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English
Thanks in advance
You could use a regex:
txt='''\
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English'''
>>> [re.sub(r'\s*:\s*', ':', line).strip() for line in txt.splitlines()]
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
So your code becomes:
import re
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
    alldata=[re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip() for line in f]
print(alldata)
for x in alldata:
    teacher_names.append(x.split(delimiter)[col_num])
print(teacher_names)
Prints:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
['Mr Moose', 'Mr Goose', 'Mrs Congenelipilling']
The key part is the regex:
re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip()
first \s*: zero or more whitespace characters before the delimiter
{}: placeholder for the delimiter
second \s*: zero or more whitespace characters after the delimiter
For an all Python solution, I would use str.partition to get the left hand and right hand side of the delimiter then strip the whitespace as needed:
alldata=[]
with open("teacherbook.txt") as f:
    for line in f:
        lh,sep,rh=line.rstrip().partition(delimiter)
        alldata.append(lh.rstrip() + sep + rh.lstrip())
Same output
Another suggestion. Your data is more suited to a dict than a list.
You can do:
di={}
with open("teacherbook.txt") as f:
    for line in f:
        lh,sep,rh=line.rstrip().partition(delimiter)
        di[lh.rstrip()]=rh.lstrip()
Or comprehension version:
with open("teacherbook.txt") as f:
    di={lh.rstrip():rh.lstrip()
        for lh,_,rh in (line.rstrip().partition(delimiter) for line in f)}
Then access like this:
>>> di['Mr Moose']
'Maths'
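The dict approach can be tried without a file. The sketch below uses in-memory lines standing in for teacherbook.txt; parse_teachers is a hypothetical helper name, and skipping lines that lack the delimiter is an added assumption, not part of the answer above.

```python
def parse_teachers(lines, delimiter=":"):
    """Build {teacher: subject}, stripping whitespace around both parts."""
    teachers = {}
    for line in lines:
        name, sep, subject = line.partition(delimiter)
        if not sep:
            continue  # skip lines without a delimiter (e.g. blank lines)
        teachers[name.strip()] = subject.strip()
    return teachers


# In-memory lines standing in for teacherbook.txt.
sample = ["Mr Moose : Maths\n", "Mr Goose: History\n", "\n",
          "Mrs Congenelipilling: English\n"]
teachers = parse_teachers(sample)
print("Mr Moose" in teachers)   # the lookup no longer depends on stray spaces
print(teachers["Mr Moose"])
```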
No need to use readlines(): you can simply iterate through the file object to get each line, and use strip() to remove the \n and whitespace. As such, you can use this list comprehension:
with open('teacherbook.txt') as f:
    alldata = [':'.join([value.strip() for value in line.split(':')])
               for line in f]
print(alldata)
Outputs:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
Change:
teacher_names.append(x.split(delimiter)[col_num])
to:
teacher_names.append(x.split(delimiter)[col_num].strip())
remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
You can split your string at the delimiter, strip the whitespace from them, and concatenate them back together again:
for line in f.readlines():
    new_line = ':'.join([s.strip() for s in line.split(':')])
    alldata.append(new_line)
Example:
>>> lines = [' Mr Moose : Maths', ' Mr Goose : History ']
>>> lines
[' Mr Moose : Maths', ' Mr Goose : History ']
>>> data = []
>>> for line in lines:
...     new_line = ':'.join([s.strip() for s in line.split(':')])
...     data.append(new_line)
...
>>> data
['Mr Moose:Maths', 'Mr Goose:History']
You can do it easily with regex - re.sub:
import re
re.sub(r"[\n \t]+$", "", "aaa \t asd \n ")
Out[17]: 'aaa \t asd'
The first argument is the pattern: [...] lists all the characters you want to remove, + means one or more matches, and $ anchors the match at the end of the string.
https://docs.python.org/2/library/re.html
With string.rstrip('something') you can remove that 'something' from the right end of the string like this:
a = 'Mr Moose \n'
print a.rstrip(' \n') # prints 'Mr Moose\n' instead of 'Mr Moose \n\n'
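To see how the stripping methods differ, here is a small Python 3 comparison. The sample string is made up; repr() is used so the remaining whitespace is visible.

```python
line = "  Mr Moose \n"

only_newline = line.rstrip("\n")  # only the trailing newline is removed
right_side = line.rstrip()        # all trailing whitespace is removed
both_sides = line.strip()         # leading and trailing whitespace are removed

print(repr(only_newline))
print(repr(right_side))
print(repr(both_sides))
```

In every case the space between "Mr" and "Moose" is left alone, because these methods only touch the ends of the string.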

Python - print tab delimited two-word set

I have a set of words such as this:
mike dc car dc george dc jerry dc
Each word, mike dc george dc is separated by a space. How can I create a two-word set and separate the two-word set by a tab? I would like to print it to the standard output stdout.
EDIT
I tried using this:
print '\t'.join(hypoth), but it doesn't really cut it. All the words here are just tab delimited. I would ideally like the first two words separated by a space and each two word-set tab delimited.
Assuming you have
two_word_sets = ["mike dc", "car dc", "george dc", "jerry dc"]
use
print "\t".join(two_word_sets)
or, for Python 3:
print("\t".join(two_word_sets))
to print the tab-separated list to stdout.
If you only have
mystr = "mike dc car dc george dc jerry dc"
you can calculate two_word_sets as follows:
words = mystr.split()
two_word_sets = [" ".join(tup) for tup in zip(words[::2], words[1::2])]
This might look a bit complicated, but note that zip(words[::2], words[1::2]) is just [('mike', 'dc'), ('car', 'dc'), ('george', 'dc'), ('jerry', 'dc')]. The rest of the list comprehension joins these together with a space.
Note that for very long lists/input strings in Python 2 you would use izip from itertools, because zip actually creates a list of tuples whereas izip returns an iterator. In Python 3, zip is already lazy.
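Put together for Python 3, the pairing and the tab-joining might look like the sketch below. The function name two_word_sets is chosen for the example; note that a trailing unpaired word would be silently dropped.

```python
def two_word_sets(text):
    """Group whitespace-separated words into 'word word' pairs."""
    words = text.split()
    # zip stops at the shorter slice, so a trailing unpaired word is dropped.
    return [" ".join(pair) for pair in zip(words[::2], words[1::2])]


pairs = two_word_sets("mike dc car dc george dc jerry dc")
line = "\t".join(pairs)
print(line)
```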
You can do this in 1-2 lines, but it is easiest to read if you break it up:
words = "mike dc car dc george dc jerry dc"
wlist = words.split()
mystr = ""
for i in range(0, len(wlist), 2):
    mystr = "%s%s %s\t" % (mystr, wlist[i], wlist[i+1])
print mystr

Cut off middle word from a string python

I am trying to cut off a few words from the scraped data.
3 Bedroom, Residential Apartment in Velachery
There are many rows of data like this. I am trying to remove the word 'Bedroom' from the string. I am using beautiful soup and python to scrape the webpage, and here I am using this
for eachproperty in properties:
    print eachproperty.string[2:]
I know what the above code will do. But I cannot figure out how to just remove the "Bedroom" which is between 3 and ,Residen....
>>> import re
>>> strs = "3 Bedroom, Residential Apartment in Velachery"
>>> re.sub(r'\s*Bedroom\s*', '', strs)
'3, Residential Apartment in Velachery'
or:
>>> strs.replace(' Bedroom', '')
'3, Residential Apartment in Velachery'
Note that strings are immutable, so you need to assign the result of re.sub and str.replace to a variable.
What you need is the replace method:
line = "3 Bedroom, Residential Apartment in Velachery"
line = line.replace("Bedroom", "")
# For multiple lines, build a new list (reassigning the loop variable would not change the list)
lines = [line.replace("Bedroom", "") for line in lines]
A quick answer is
k = input_string.split()
if "Bedroom" in k:
    k.remove("Bedroom")
answer = ' '.join(k)
This won't handle punctuation like in your question. To do that you need
rem = "Bedroom"
answer = input_string  # unchanged if rem is not found
for i in range(len(input_string) - len(rem) + 1):
    if input_string[i:i+len(rem)] == rem:
        # keep everything before and after the first match
        answer = input_string[:i] + input_string[i+len(rem):]
        break
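As an alternative sketch, a regex with word boundaries removes the word together with the space in front of it and leaves longer words such as "Bedrooms" alone. remove_word is a hypothetical helper name, not something from the answers above.

```python
import re


def remove_word(text, word):
    """Remove a whole word plus the whitespace in front of it."""
    # \b keeps 'Bedroom' from matching inside a longer word such as 'Bedrooms'.
    pattern = r'\s*\b{}\b'.format(re.escape(word))
    return re.sub(pattern, '', text).strip()


result = remove_word("3 Bedroom, Residential Apartment in Velachery", "Bedroom")
print(result)
```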

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name Degree CLASS CODE EDU Scores
--------------------------------------------------------------------------------------
John Marshall CSC 78659944 89989 BE 900
Think Code DB I10 MSC 87782 1231 MS 878
Mary 200 Jones CIVIL 98993483 32985 BE 898
John G. S Mech 7653 54 MS 65
Silent Ghost Python Ninja 788505 88448 MS Comp 887
Conditions :
Two or more spaces should be compressed to a delimiter (pipe better? End goal is to store these files in the database).
Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some letters.
First and second columns are both strings. They almost always have more than one space between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!).
The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.
Hope I made sense! I have a feeling that this task can be done in a one-liner. I don't want to loop, loop, loop :(
Muchos gracias "Pythonistas" for reading all the way and not quitting before this sentence!
It still seems to me that there's some format in your files:
>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
...     lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
...     print(lst)
...
[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']
The regex is quite straightforward; the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in the case of the first delimiter. Note that when a line doesn't match, you get an empty list as lst. That would be a red flag to bring up the user interaction described below. You could also skip the header lines by doing:
>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
... # here empty lst indicates issues with regex
Previous variants:
>>> import re
>>> for line in open(fname):
...     lst = re.split(r'\s{2,}', line)
...     l = len(lst)
...     if l in (2,3):
...         lst[l-1:] = lst[l-1].split()
...     print(lst)
...
['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
Another thing to do is simply to let the user decide what to do with questionable entries:
if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes for elements of name: ') # use raw_input in py2k
    idegr = input('enter indexes for elements of degree: ')
Uhm, I was all the time under the impression that the second element might contain spaces, since it's not the case you could just do:
>>> for line in open(fname):
...     name, _, rest = line.partition('  ')  # two spaces: the gap after the first column
...     lst = [name] + rest.split()
...     print(lst)
...
Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.
import re
for line in open(fname):
    name, rest = re.split(r'\s{2,}', line, maxsplit=1)
    print [name] + rest.split()
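Since the stated end goal is a pipe-delimited file, the same idea can be sketched in Python 3 as a function that returns one pipe-joined string per line. to_pipe_delimited is a hypothetical name, the sample rows use assumed spacing, and returning None for lines without a wide gap is just one way to flag them for manual review.

```python
import re


def to_pipe_delimited(line):
    """Split the name off at the first run of 2+ spaces, then split the rest."""
    parts = re.split(r'\s{2,}', line.strip(), maxsplit=1)
    if len(parts) < 2:
        return None  # no wide gap found: flag the line for manual review
    name, rest = parts
    return '|'.join([name] + rest.split())


# Sample rows with assumed spacing.
rows = [
    "John Marshall   CSC   78659944   89989  BE   900",
    "Mary 200 Jones  CIVIL 98993483 32985    BE 898",
    "NoWideGapHere",
]
converted = [to_pipe_delimited(row) for row in rows]
for item in converted:
    print(item)
```

This relies on the condition from the question that only the first column may contain spaces.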
This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).
Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.
Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().
See the Python documentation for str.expandtabs.
If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programmatically (if that is possible), maybe from the heading line.
Are you sure that the second line is all "-", or are there spaces between the columns?
The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:
RecordType  ID1                  ID2         Description
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget
4           87654321             654321      Gizmoid
and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:
sizes = map(len, dash_line.split())
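Building on that hint, a general reader could look roughly like the sketch below. It is not the original reader: column_slices and parse_report are hypothetical names, the sample report assumes left-aligned values under each run of dashes, and re.finditer is used here to get the start and end of each run directly instead of only the widths.

```python
import re


def column_slices(dash_line):
    """Return one slice per run of dashes in the separator line."""
    return [slice(m.start(), m.end()) for m in re.finditer(r'-+', dash_line)]


def parse_report(lines):
    """Cut the heading and data lines at the positions given by the dash line."""
    header, dashes, *rows = [line.rstrip('\n') for line in lines]
    slices = column_slices(dashes)
    names = [header[s].strip() for s in slices]
    return [dict(zip(names, (row[s].strip() for s in slices))) for row in rows]


# Sample report with assumed column alignment.
report = [
    "RecordType  ID1                  ID2         Description",
    "----------- -------------------- ----------- ----------------------",
    "1           12345678             123456      Widget",
    "4           87654321             654321      Gizmoid",
]
records = parse_report(report)
for record in records:
    print(record)
```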
If expandtabs() doesn't work, edit your question to show exactly what you do have i.e. show the result of print repr(line) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.
