How to get text between 2 lines with Python

So I have a text file that is structured like this:
Product ID List:
ABB:
578SH8
EFC025
EFC967
CNC:
HDJ834
HSLA87
...
...
This file continues on with many companies' names and IDs below them. I need to get the IDs of the chosen company and append them to a list, where they will be used to search a website. Here is the current line I have to get the data:
PID = open('PID.txt').read().split()
This works great if there are only product IDs from one company in there and no other text. It does not work for what I plan on doing, however... How can I have the reader read from (for example) after where it says ABB: to before the next company? I was thinking of maybe adding some kind of marker in the file like ABB END to know where to cut to, but I still don't know how to cut out between lines in the first place... If you could let me know, that would be great!

Two consecutive newlines act as a delimiter, so just split there and construct a dictionary of the data:
data = {i.split()[0]: i.split()[1:] for i in open('PID.txt').read().split('\n\n')}

Since the file is structured like that, you could follow these steps (see the sketch after this list):
Split based on the two newline characters \n\n into a list
Split each list on a single newline character \n
Drop the first element for a list containing the IDs for each company
Use the first element (mentioned above) as needed for the company name (make sure to remove the colon)
Also, take a look at regular expressions for parsing data like this.
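A minimal sketch of those steps (it assumes the companies really are separated by blank lines, as this answer requires, and that the 'Product ID List:' header should be dropped first):
companies = {}
with open('PID.txt') as f:
    text = f.read()
# the 'Product ID List:' header is not a company, so drop that line first
text = text.replace('Product ID List:\n', '', 1)
for block in text.split('\n\n'):  # step 1: split on the blank lines between companies
    lines = block.strip().split('\n')  # step 2: split each block into its own lines
    if not lines or not lines[0]:
        continue
    name = lines[0].rstrip(':')  # step 4: the company name without the colon
    companies[name] = lines[1:]  # step 3: the remaining lines are that company's IDs
print(companies.get('ABB'))  # e.g. ['578SH8', 'EFC025', 'EFC967']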

with open('file.txt', 'r') as f:  # open the file
    next(f)  # skip the first line
    results = {}  # initialize a dictionary
    for line in f:  # iterate through the remainder of the file
        if ':' in line:  # if the line contains a :
            current = line.strip()  # strip the whitespace
            results[current] = []  # and add it as a dictionary entry
        elif line.strip():  # otherwise, and if content remains after stripping whitespace,
            results[current].append(line.strip())  # append this line to the relevant list

This should at least get you started, you will likely have better luck using dictionaries than lists, at least for the first part of your logic. By what method will you pass the codes along?
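For example, once the results dictionary above is built, pulling out one company's IDs is a plain lookup (note that the key keeps its trailing colon as written above; search_website is a hypothetical stand-in for however you query the site):
abb_ids = results['ABB:']  # e.g. ['578SH8', 'EFC025', 'EFC967']
for pid in abb_ids:
    search_website(pid)  # hypothetical function: replace with your actual search call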
a = {}
f1 = open(r"C:\sotest.txt", 'r')  # raw string so the backslash is not treated as an escape
current_key = ''
for row in f1:
    strrow = row.strip('\n')
    if strrow == "":
        pass
    elif ":" in strrow:
        current_key = strrow.strip(':')
        a[current_key] = []
    else:
        a[current_key].append(strrow)

for key in a:
    print key
    for item in a[key]:
        print item

Related

Python stripping words upon specific condition in a list of sentences

My starting file was a .txt one that looked like:
https://www.website.com/something1/id=39494 notes !!!! other notes
https://www.website2.com/something1/id=596774 ... notes2 !! other notes2
and so on... so, very messy.
To clean it up I did:
import re
with open('file.txt', 'r') as filehandle:
    places = [current_place.rstrip() for current_place in filehandle.readlines()]
filtered = [x for x in places if x.strip()]
This gave me a list of websites (without spaces in between) but still with notes in the same string.
My goal is to first have a list of "cleaned" websites without any notes afterwards:
https://www.website.com/something1/id=39494
https://www.website2.com/something1/id=596774
For that I thought to target the space after the end of the website and get rid of all the words afterwards:
for s in filtered:
    f = re.search('\s')
This returns an error, but even if it worked it wouldn't return what I thought.
The second step is to strip the website of some characters and compose it like: https://www.website.com/embed/id=39494
but this would come later.
I just wonder how I can achieve the first step, get rid of the notes after the website, and end up with a clean list.
If each line consists of a URL followed by a space and any other text, you can simply split by the space and take the first element of each line:
urls = []
with open('file.txt') as filehandle:
    for line in filehandle:
        if not line.strip():
            continue  # skip empty lines
        urls.append(line.split(" ")[0])
# now the variable `urls` should contain all the URLs you are looking for
EDIT: second step
for url in urls:
    print('<iframe src="{}"></iframe>'.format(url))
You can use this:
# to read the lines
with open('file.txt', 'r') as f:
    strlist = f.readlines()

# list to store the URLs
webs = []
for x in strlist:
    webs.append(x.split(' ')[0])
print(webs)
In case the URL is not always at the beginning of the line, you can try
https?:\/\/www\.\w+\.com\/\w+\/id=(\d+)
then you can use the match groups to get the URL and ID.
Code example
import re

with open('file.txt') as file:
    for line in file:
        m = re.search(r'https?:\/\/www\.\w+\.com\/\w+\/id=(\d+)', line)
        if m:
            print("URL=%s" % m.group(0))
            print("ID=%d" % int(m.group(1)))

How to import a special format as a dictionary in python?

I have a text file in the format below, all on a single line:
username:password;username1:password1;username2:password2;
etc.
What I have tried so far is
with open('list.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)
but I get an error saying that the length is 1 and 2 is required, which indicates the file is not being read as key:value pairs.
Is there any way to fix this or should I just reformat the file in another way?
Thanks for your answers.
What I got so far is:
with open('tester.txt') as f:
    password_list = dict(x.strip(":").split(";", 1) for x in f)
for user, password in password_list.items():
    print(user + " - " + password)
The result comes out as username:password - username1:password1.
What I need is to split username:password so that the key is the username and the value is the password.
Since the variable f in this case is a file object and not a list, the first thing to do would be to get the lines from it. You could use the readlines() method (https://docs.python.org/2/library/stdtypes.html?highlight=readline#file.readlines)* for this.
Furthermore, I think I would use split with the semicolon (";") parameter. This will provide you with a list of strings of "username:password", provided your entire file looks like this.
I think you will figure out what to do after that.
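In case it helps, a minimal sketch of that suggestion (it assumes the whole file is one line like your sample, and skips the empty piece left by the trailing semicolon):
with open('list.txt') as f:
    line = f.readline().strip()  # the file is a single line
d = {}
for pair in line.split(';'):  # 'username:password' chunks
    if pair:  # the trailing ';' leaves an empty chunk; skip it
        user, password = pair.split(':', 1)
        d[user] = password
print(d)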
EDIT
* I assumed you are using Python 2.7 for some reason. In version 3.x you might want to look at the distutils.text_file TextFile class (https://docs.python.org/3.7/distutils/apiref.html?highlight=readlines#distutils.text_file.TextFile.readlines).
Load the text of the file in Python with open() and read() as a string
Apply split(';') to that string to create a list like [username:password, username1:password1, username2:password2]
Do a dict comprehension where you apply split(":") to each item of the above list to split those pairs.
with open('list.txt', 'rt') as f:
    raw_data = f.readlines()[0]

list_data = raw_data.strip().split(';')
user_dict = {x.split(':')[0]: x.split(':')[1] for x in list_data if ':' in x}  # skip the empty piece after the final ';'
print(user_dict)
A dictionary comprehension is useful here.
The comprehension is the one-liner that pulls all the info out of the text file, as requested. Hope your tutor is impressed. Ask him how it works and see what he says. Maybe update your question to include his response.
If you want me to explain, feel free to comment and I shall go into more detail.
The error you're probably getting:
ValueError: dictionary update sequence element #3 has length 1; 2 is required
is because the text line ends with a semicolon. Splitting it on semicolons then results in a list that contains some pairs, and an empty string:
>>> "username:password;username1:password1;username2:password2;".split(";")
['username:password', 'username1:password1', 'username2:password2', '']
Splitting the empty string on colons then results in a single empty string, rather than two strings.
To fix this, filter out the empty string. One example of doing this would be
[element for element in x.split(";") if element != ""]
In general, I recommend you do the work one step at a time and assign to intermediary variables.
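For example, applied to your one-liner it might look like this (a sketch reusing the list.txt name from your question):
with open('list.txt') as f:
    line = f.read().rstrip()
pieces = [element for element in line.split(';') if element != '']  # drop the empty trailing element
d = dict(piece.split(':', 1) for piece in pieces)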
Here's a simple (but long) answer. You need to get the line from the file, and then split it and the items resulting from the split:
results = {}
with open('file.txt') as file:
    for line in file:
        # Only one line, but that's fine
        entries = line.strip().split(';')
        for entry in entries:
            if entry != '':
                # The last item in entries will be blank, due to how split works in this example
                user, password = entry.split(':')
                results[user] = password
Try this.
f = open('test.txt').read().strip()  # strip the trailing newline so the final split piece is empty
data = f.split(";")
d = {}
for i in data:
    if i:
        value = i.split(":")
        d.update({value[0]: value[1]})
print d

Python: Appending string constructed out of multiple lines to list

I'm trying to parse a txt file and put sentences in a list that fit my criteria.
The text file consists of several thousand lines and I'm looking for lines that start with a specific string; let's call this string 'start'.
The lines in this text file can belong together and are separated by \n somewhat at random.
This means I have to look for any string that starts with 'start', put it in an empty string 'complete' and then continue scanning each line after that to see if it also starts with 'start'.
If not then I need to append it to 'complete' because then it is part of the entire sentence. If it does I need to append 'complete' to a list, create a new, empty 'complete' string and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence exists of.
My code thus far:
import sys, string

lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''
with open(sys.argv[1]) as f:
    data = f.read()
    for line in data:
        if line.lower().startswith(startswith):
            completeline = line
        else:
            completeline += line
    lines_1.append(completeline)

# check some stuff in output
for l in lines_1:
    print "______"
    print l
print len(lines_1)
However this puts the entire content into one item in the list, where I'd like everything to be separated.
Keep in mind that the lines composing one sentence can span one, two, 10 or 1000 lines so it needs to spot the next startswith value, append the existing completeline to the list and then fill completeline up with the next sentence.
Much obliged!
Two issues:
Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data: is going character by character through the string. Split your input by newlines, returning a list, which you then iterate over. e.g. for line in data.split('\n'):
Overwriting the completeline inside the loop
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if in the loop to something like this:
if line.lower().startswith(startswith):
    if completeline:
        lines_1.append(completeline)
    completeline = line
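Putting both fixes together, a corrected version of the whole loop might look like this (just a sketch; it reads the file line by line instead of via read(), and remembers to append the final sentence after the loop):
import sys

lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''
with open(sys.argv[1]) as f:
    for line in f:  # iterate over lines, not characters
        line = line.rstrip('\n')
        if line.lower().startswith(startswith):
            if completeline:  # flush the previous sentence first
                lines_1.append(completeline)
            completeline = line
        else:
            completeline = (completeline + ' ' + line).strip()  # continuation of the current sentence
if completeline:
    lines_1.append(completeline)  # don't lose the last sentence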
For a task like this ("I'm trying to parse a txt file and put sentences in a list that fit my criteria") I usually prefer using a dictionary, for example:
from collections import defaultdict

def satisfiesCriteria(criteria, sentence):
    if sentence.lower().startswith(criteria):
        return True
    return False

seperatedItems = defaultdict(list)
for sentence in fileDataAsAList:
    if satisfiesCriteria("start", sentence):
        seperatedItems["start"].append(sentence)
Something like this should suffice. The code is just to give you an idea of what you might like to do. You can have a list of criteria and loop over them, which will add sentences related to the different criteria into the dictionary, something like this:
mycriterias = ['start', 'begin', 'whatever']
for criteria in mycriterias:
    for sentence in fileDataAsAList:
        if satisfiesCriteria(criteria, sentence):
            seperatedItems[criteria].append(sentence)
mind the spellings :p

remove duplicates from list product of tab delimited file and further classification

I have a tab delimited file that I need to extract all of the column 12 content from (which documents categories). However, the column 12 content is highly repetitive, so firstly I need to get a list that just returns the number of categories (by removing repeats). And then I need to find a way to get the number of lines per category. My attempt is as follows:
def remove_duplicates(l):  # define function to remove duplicates
    return list(set(l))

input = sys.argv[1]  # command line arguments to open tab file
infile = open(input)
for lines in infile:  # split content into lines
    words = lines.split("\t")  # split lines into words i.e. columns
    dataB2.append(words[11])  # column 12 contains the desired repetitive categories

dataB2 = dataA.sort()  # sort the categories
dataB2 = remove_duplicates(dataA)  # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
print(len(dataB2))
infile.close()
I have no idea how I would get the number of lines for each category, though.
So my questions are: how do I eliminate the repeats effectively, and how do I get the number of lines for each category?
I suggest using a Python Counter to implement this. A Counter does almost exactly what you are asking for, so your code would look as follows:
from collections import Counter
import sys

count = Counter()
# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for lines in infile:  # split content into lines
        words = lines.split("\t")  # split lines into words i.e. columns
        count.update([words[11]])
print count
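If you also want the categories ordered by how often they appear, Counter's most_common() method returns (category, count) pairs sorted by count:
for category, n in count.most_common():
    print '%s - %d' % (category, n)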
All you need to do is read each line from the file, split it by tabs, grab column 12 for each line and put it in a list. (If you don't care about repeating lines, just make column_12 = set() and use add(item) instead of append(item).) Then you simply use len() to get the length of the collection. Or, if you want both, you can use a list and change it to a set later.
EDIT: To count each category (thank you Tom Morris for alerting me to the fact I didn't actually answer the question), you iterate over the set of column_12 so as to not count anything more than once, and use the list's built-in count() method.
with open(infile, 'r') as fob:
    column_12 = []
    for line in fob:
        column_12.append(line.split('\t')[11])

print 'Unique lines in column 12 %d' % len(set(column_12))
print 'All lines in column 12 %d' % len(column_12)
print 'Count per category:'
for cat in set(column_12):
    print '%s - %d' % (cat, column_12.count(cat))

Trouble sorting a list with python

I'm somewhat new to Python. I'm trying to sort through a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up road). Also, they are all on one line separated by a space. So I need to use 2 arguments: one for the input file and one for the output file. It should be sorted with numbers first and then the words without the special characters, each on a different line. I've been looking at loads of list functions but am having some trouble putting this together, as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys

try:
    infilename = sys.argv[1]  # outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

ifile = open(infilename, 'r')
# ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
           if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
# ofile.writelines(r)
# ofile.close()
The output shows exactly what was in the file, written exactly as the file is written and not sorted at all. The goal is to take a file (arg1.txt), sort it, and make a new file (arg2.txt); these will be the command-line arguments. I used print in this case to speed up the editing but need to have it write to a file. That's why the output file areas are commented out, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
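If you do need that, one way (just a sketch, assuming you only want to keep letters and digits) is to scrub each word before sorting; here data is the list produced by the changed line above:
import re
cleaned = [re.sub(r'[^A-Za-z0-9]', '', word) for word in data]  # 'ro!ad' -> 'road'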
Since they are on the same line you don't really need readlines:
with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
you can use re to filter out unwanted characters
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]", "", data)
partition is maybe overkill:
data = "hello 23frank sam wilbur"
my_list = data.split()  # ["hello", "23frank", "sam", "wilbur"]
print sorted(my_list)
However you will need to do more to force the numbers to sort first, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
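Putting those pieces together into the two-argument script the question describes (just a sketch; it assumes the symbol set [!?#$] is enough and writes one entry per line to the output file):
import re
import sys

infilename, outfilename = sys.argv[1], sys.argv[2]

with open(infilename) as ifile:
    words = ifile.read().split()  # everything is on one line

words = [re.sub(r"[!?#$]", "", w) for w in words]  # 'ro!ad' -> 'road'
words = [w for w in words if w]  # drop anything that was only symbols

numbers = sorted((w for w in words if w[0].isdigit()),
                 key=lambda w: int(re.sub(r"[^0-9]", "", w)))
letters = sorted(w for w in words if not w[0].isdigit())

with open(outfilename, 'w') as ofile:
    ofile.write('\n'.join(numbers + letters) + '\n')  # numbers first, then the words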
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
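For example (a minimal sketch reusing infilename from the question):
with open(infilename) as ifile:
    words = ifile.read().split()  # one string, split on whitespace into individual words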
