regex to find string in string without respect to order? - python

I'm not sure how best to word this, so I'll dive straight into an example.
a bunch of lines we don't care about [...]
This is a nice line I can look for
This is the string I wish to extract
a bunch more lines we don't care about [...]
This line contains an integer 12345 related to the string above
more garbage [...]
But sometimes (and I have no control over this) the order is swapped:
a bunch of lines we don't care about [...]
Here is another string I wish to extract
This is a nice line I can look for
a bunch more lines we don't care about [...]
This line contains an integer 67890 related to the string above
more garbage [...]
The two lines ("nice line" and "string I wish to extract") are always adjacent, but the order is not predictable. The integer-containing line is an inconsistent number of lines below. The "nice line" appears multiple times and is always identical, while the strings and integers I'm extracting may, across the file, be the same as or different from one another.
Ultimately the idea is to populate two lists, one containing the strings and the other containing the integers, both ordered as they are found so the two can later be used as key/value pairs.
What I have no idea how to do, or even if it's possible, is to write a regex that finds the string immediately before OR after a target line?
Doing this in Python, btw.
Thoughts?
edit/addition: So what I'm expecting as a result out of the above sample text would be something like:
list1 = ["This is the string I wish to extract", "Here is another string I wish to extract"]
list2 = [12345, 67890]

A good strategy might be to look for "nice lines" and then search the lines above and below.
See the following (untested) Python pseudocode:
L1, L2 = [], []
lines = open("file.txt").readlines()
for i, line in enumerate(lines):
    if 'nice line' in line:
        before_line = lines[max(i - 1, 0)]
        after_line = lines[min(i + 1, len(lines) - 1)]
        # You can generalize the above to a few lines above and below.
        # Use regex to parse information from `before_line` and `after_line`
        # and add it to the lists: L1, L2.

Related

I have a question on using the split() method and the data's format (solved, uploading as a memo)

I have made a variable like below,
data = '''
hello my name is mj
and I like reading novels and webtoons
nice meeting you all!
'''
and used data.split('\n') to split it into sentences.
The data came out like this:
['', 'hello my name is mj', 'and I like reading novels and webtoons', 'nice meeting you all!', '']
In the above list, why are there double quotation marks (") at the start and at the end? Are the single sentences like 'hello my name is mj' and 'and I like ~' tied up as one string? If so, why?
Wait, while writing this question I think I got the answer: it is not a double quotation, it is two single quotations written in a row. As there is nothing written next to the two '''s, it just made an empty string.
There is a \n character at the beginning and at the end of your string, so it is also part of the return value from split. You can do something like this:
[x for x in data.split('\n') if x]
This uses a list comprehension with a condition to keep only the lines that are not empty.
...to split by sentence
There is the native splitlines method for this. It is advisable to use that method, as it is aware of all the varying conventions for line breaks.
Also, it will not create an extra entry at the end when the input ends with a line break, as in your example. However, since you have an explicit empty line at the beginning, that one would still be included.
It might be a pragmatic solution to just strip your input from surrounding white space:
data.strip().splitlines()
For your example input, this will evaluate to:
[
'hello my name is mj',
'and I like reading novels and webtoons',
'nice meeting you all!'
]

Comparing two text files to EXCLUDE duplicates, and not line-by-line. Specifically, I want the output to exclude any duplicated strings

I feel like this isn't that difficult, but for some reason it is, and I'm sleep-deprived... so yeah. I've been able to neatly format and isolate the words of interest from two long .txt files. I've searched around StackOverflow and can only seem to find line-by-line comparisons, which specifically seek out duplicate strings; I'm trying to do the exact opposite, so that is not what I'm looking for. My objective is to check whether the same string appears ANYWHERE in either txt file (as in, is duplicated; I'm comparing just two), and the resulting output should exclude any and all duplicates and be written to a .txt file, or at least printed to the console. I've read the Python documentation and am aware of set(). I don't mind tips on that, but is there another way to go about it?
edit: each item is solely a string of five numeric characters (there are many of them), if that helps. Thank you in advance!
Both .txt files I'm comparing look like this essentially (I have had to change it a bit, but it is same exact idea).
1-94823 Words Words a numeric percentage time lapsed
2-84729 Words Words a numeric percentage time lapsed
The whole document is like that, line by line; however, there is some overlap between the two txt files, and I am solely interested in the five-digit number after the dash. I apologize that my title is/was unclear: I want to compare every instance of these five-digit numbers from both txt files, specifically exclude any that match up between the two files (not just line-by-line), and output the rest (there are a fair number of duplicates).
Thanks,
Amelia
Once you have a list of those 5-digit numbers you can do this:
List of numbers:
list1 = [12345, 67890, 13579]
list2 = [54321, 67890, 11235]
Create the sets:
set1 = set(list1)
set2 = set(list2)
Get the union without the intersection (the symmetric difference):
non_duplicates_list = list(set1.symmetric_difference(set2))
and the result is:
[11235, 13579, 54321, 12345]
(The exact order may vary, since sets are unordered.)
If there is any problem let me know :)

Removing various symbols from a text

I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols and points that don't actually end sentences.
Example input:
This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
Example output:
And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.
What I did:
with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()

data = data.replace("'", '')
data = data.replace("e.g.", 'for example')
# and so on

with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)
My problems (although number 2 is the most important):
1. I just want a string with this input, but it obviously breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.
2. The code seems very inefficient because I just manually write out the things that I remember to delete/clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to speak?
3. Is there any way to eliminate the headline and the enumeration, and the point (.) that appears in that German date? My code doesn't do that.
Edit: I just remembered stuff like text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?
To easily remove all non-standard symbols you can use str.isalnum(), which returns True only for alphanumeric characters, or str.isascii() for ASCII-only strings. isprintable() seems viable too; a full list of these predicates is in the str documentation. Using those functions you can iterate over the string, filter each character, and join the survivors back into a string. So something like this:
filteredData = ''.join(filter(str.isidentifier, data))
You can also combine those by creating a function that checks multiple string variables like this:
def FilterKey(char: str): return char.isidentifier() and char.isalpha()
Which can be used in filter like this:
filteredData = ''.join(filter(FilterKey, data))
If it returns True, the character is included in the output; if it returns False, it is excluded.
You can also extend this with your own per-character checks in the function's return expression. Afterwards, to remove larger chunks of text, you can use the usual str.replace(old, new).

Python: joining multiple lines into a single line/string and appending it to a list

Basically I have a text file and I am reading it line by line. I want to merge some lines (a part of the text) into a single string and add it as an element to a list.
The parts of the text that I want to combine start with the letters "gi" and end with ">". I can successfully isolate this part of the text, but I am having trouble manipulating it in any way; I would like each part to be a single variable, acting as an individual entity. So far my code only adds single lines to the list.
def lines(File):
    dataFile = open(File)
    list = []
    for letters in dataFile:
        start = letters.find("gi") + 2
        end = letters.find(">", start)
        unit = letters[start:end]
        list.append(unit)
    return list
This is an example:
https://www.dropbox.com/s/1cwv2spfcpp0q0s/pythonmafft.txt?dl=0
So I would like to manipulate every entry in the file as a single string and be able to append it to a list. Every entry is separated by a few empty lines.
First off, don't use list as a variable name. list is a builtin, and you shadow it each time you assign that name elsewhere in your code. Try to use more descriptive names in general and you'll easily avoid this pitfall.
There is an easier way to do what you're asking, since '>gi' (in the example you gave) always appears together as one marker. You can simply use split and it'll give you the units (without '>gi').
def lines(File):
    with open(File) as dataFile:
        wordlist = dataFile.read().split('>gi')
    return wordlist

Looking for a strategy for parsing a file

I'm an experienced C programmer, but a complete Python newbie. I'm learning Python mostly for fun, and as a first exercise I want to parse a text file, extracting the meaningful bits from the fluff and ending up with a tab-delimited string of those bits in a different order.
I've had a blast plowing through tutorials and documentation and stackoverflow Q&As, merrily splitting strings and reading lines from files and etc. Now I think I'm at the point where I need a few road signs from experienced folks to avoid blind alleys.
Here's one chunk of the text I want to parse (you may recognize this as a McMaster order). The actual file will contain one or more chunks like this.
1 92351A603 Lag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5
Your Part Number: 7218-GYROID
22
packs today
5.85
per pack 128.70
Note that the information is split over several lines in the file. I'd like to end up with a tab-delimited string that looks like this:
22\tpacks\tLag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5\t\t92351A603\t5.85\t\t128.70\t7218-GYROID\n
So I need to extract some parts of the string while ignoring others, rearrange them a bit, and re-pack them into a string.
Here's the (very early) code I have at the moment. It reads the file a line at a time and splits each line on several delimiters; I end up with several lists of strings, including a bunch of empty ones where there were double tabs:
import sys
import re

def split(delimiters, string, maxsplit=0):
    """Split the given string with the given delimiters (an array of strings).
    This function lifted from stackoverflow in a post by Kos."""
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

delimiters = "\t", "\n", "\r", "Your Part Number: "

with open(sys.argv[1], 'r') as f:
    for line in f:
        print(split(delimiters, line))
f.close()
Question 1 is basic: how can I remove the empty strings from my lists, then mash all the strings together into one list? In C I'd loop through all the lists, ignoring the empties and sticking the other strings in a new list. But I have a feeling python has a more elegant way to do this sort of thing.
Question 2 is more open ended: what's a robust strategy here? Should I read more than one line at a time in the first place? Make a dictionary, allowing easier re-ordering of the items later?
Sorry for the novel. Thanks for any pointers. And please, stylistic comments are more than welcome, style matters.
You don't need to close the file when using with; the with block already closes it, so the f.close() at the end is redundant.
And if I were to implement this, I might use one big regex to extract the parts from each chunk (with finditer) and reassemble them for output.
You can remove the empty strings with:
new_list = list(filter(None, old_list))
Replace the first parameter with a lambda expression that is True for elements you want to keep; passing None is equivalent to lambda x: x. (In Python 3, filter returns an iterator, hence the list() around it.)
You can mash strings together into one string using:
a_string = "".join(list_of_strings)
That will simply concatenate them, but you can use any string (a tab, say) as the separator.
If you have several lists (of whatever) and you want to join them together into one list, then:
from functools import reduce
new_list = reduce(lambda x, y: x + y, list_of_lists)
If you're new to Python, then functions like filter and reduce (moved into functools in Python 3) may seem a bit alien, but they save a lot of coding time, so it's worth getting to know them.
I think you're on the right track to solving your problem. I'd do this:
1. break everything up into lines
2. break the resulting list into smaller lists, one per order
3. parse the orders into "something meaningful"
4. sort, then output the result
Personally, I'd make a class to handle the last two parts (they logically belong together), but you could get by without it; see the sketch below.
