I have some problems solving the following task.
I have two *.txt files; both contain cities from Austria. In the first file (cities1.txt) the cities are ordered by population.
The first file (cities1.txt) looks like this:
1.,Vienna,Vienna,1.840.573
2.,Graz,Styria,273.838
3.,Linz,Upper Austria,198.181
4.,Salzburg,Salzburg,148.420
5.,Innsbruck,Tyrol,126.851
The second file (cities2.txt) looks like this:
"Villach","Carinthia",60480,134.98,501
"Innsbruck","Tyrol",126851,104.91,574
"Graz","Styria",273838,127.57,353
"Dornbirn","Vorarlberg",47420,120.93,437
"Vienna","Vienna",1840573,414.78,151
"Linz","Upper Austria",198181,95.99,266
"Klagenfurt am Woerthersee","Carinthia",97827,120.12,446
"Salzburg","Salzburg",148420,65.65,424
"Wels","Upper Austria",59853,45.92,317
"Sankt Poelten","Lower Austria",52716,108.44,267
What I'd like to do, or in other words what I should do: the first file, cities1.txt, is already sorted. I only need the second element of every line, i.e. the name of the city. For example, from the line 2.,Graz,Styria,273.838 I only need Graz.
Second, I should print the area of the city, which is the fourth element of every line in cities2.txt. For example, from the third line "Graz","Styria",273838,127.57,353 I only need 127.57.
At the end the console should display the following:
Vienna,414.78
Graz,127.57
Linz,95.99
Salzburg,65.65
Innsbruck,104.91
So, my problem is: how can I do this if I am only allowed to use the re.search() method? The second *.txt file is not in the same order, so I have to bring the cities into the same order as in the first file for this to work, right?
I know it would be much easier to use re.split(), because then you could compare the list elements from both files. But I'm not allowed to do that.
I hope someone can help me, and sorry for the long text.
Here's an implementation based on my earlier comment:
with open('cities2.txt') as c:
    D = {}
    for line in c:
        t = line.split(',')
        cn = t[0].strip('"')
        D[cn] = t[-2]

with open('cities1.txt') as d:
    for line in d:
        t = line.split(',')
        print(f'{t[1]},{D[t[1]]}')
Note that this may not be robust: if there's a city name in cities1.txt that does not exist in cities2.txt, you'll get a KeyError.
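If missing cities are possible in your data, one hedged way around the KeyError is a dict.get lookup with a fallback (the placeholder value and the inline dictionary here are my own stand-ins for D as built above):

```python
# Sketch: the same lookup, but with a fallback value instead of a KeyError.
# This dictionary stands in for D as built from cities2.txt above.
D = {'Vienna': '414.78', 'Graz': '127.57'}

for name in ('Vienna', 'Graz', 'Eisenstadt'):  # Eisenstadt is missing from D
    print(f'{name},{D.get(name, "n/a")}')
```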
This is just a hint, it's your university assignment after all.
import re

TEST = '2.,Graz,Styria,273.838'
RE = re.compile('^[^,]*,([^,]*),')

if match := RE.search(TEST):
    print(match.group(1))  # 'Graz'
Let's break down the regexp:
^      - start of line
[^,]*  - any character except a comma, repeated 0 or more times
         (this is the first field)
,      - one comma character (the field separator)
(      - start capturing; we are interested in this field
[^,]*  - any character except a comma, repeated 0 or more times
         (this is the second field)
)      - stop capturing
,      - one comma character
(we don't care about the rest of the line)
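Putting the hint to work on the whole exercise, here is a sketch using only re.search. It operates on inline sample lines instead of the files, and the second pattern (for the area field of cities2.txt) is my own assumption, built the same way as the one above:

```python
import re

# Stand-ins for the two files; in the real exercise these lines
# would come from open('cities1.txt') and open('cities2.txt').
cities1 = ['1.,Vienna,Vienna,1.840.573', '2.,Graz,Styria,273.838']
cities2 = ['"Graz","Styria",273838,127.57,353',
           '"Vienna","Vienna",1840573,414.78,151']

# Second field of a cities1 line: the city name.
NAME_RE = re.compile(r'^[^,]*,([^,]*),')
# Quoted first field and next-to-last field (the area) of a cities2 line.
AREA_RE = re.compile(r'^"([^"]*)",.*,([^,]*),[^,]*$')

# Build a name -> area lookup from the unordered second file.
areas = {}
for line in cities2:
    if m := AREA_RE.search(line):
        areas[m.group(1)] = m.group(2)

# Walk the sorted first file and print name,area in its order.
for line in cities1:
    if m := NAME_RE.search(line):
        print(f'{m.group(1)},{areas[m.group(1)]}')
```

Because the lookup dictionary is keyed by name, the second file's order no longer matters.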
I have the following text file that is to be edited in a certain manner. The part of the file inside the (:init section is to be overwritten, and nothing except that should be edited.
File:
(define (problem bin-picking-doosra)
(:domain bin-picking-second)
;(:requirements :typing :negative-preconditions)
(:objects
)
(:init
(batsmen first_batsman)
(bowler none_bowler)
(umpire third_umpire)
(spectator no_spectator)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
(umpire third_umpire)
(spectator full_spectator)
)
)
)
In this file I want to replace every line inside the (:init section with the required string. In this case, I want to replace:
(batsmen first_batsman) with (batsmen none_batsmen)
(bowler none_bowler) with (bowler first_bowler)
(umpire third_umpire) with (umpire leg_umpire)
(spectator no_spectator) with (spectator empty_spectator)
The code I currently have is the following:
file_path = "/home/mus/problem_turtlebot.pddl"
s = open(file_path).read()
s = s.replace('(batsmen first_batsman)', '(batsmen '+ predicate_batsmen + '_batsman)')
f = open(file_path, 'w')
f.write(s)
f.close()
The term predicate_batsmen here contains the word none. It works fine this way. This code only satisfies point number 1 mentioned above.
There are three problems that I have.
This code also changes the '(batsmen first_batsman)' part in the (:goal section, which I don't want. I only want it to change the (:init section.
Currently, for the other strings in the (:init section, I have to redo this code with a different statement. E.g. for '(bowler none_bowler)', i.e. point number 2 above, I have to copy the code again, which I think is not a good coding technique. Is there a better way to do it?
Consider the first string in (:init that is to be overwritten, i.e. (batsmen first_batsman). Is there a way in Python such that, no matter what is written in the question-mark part of the string (batsmen ??????_batsman), it gets replaced with none? For now it is 'first', but even if 'second' ((batsmen second_batsman)) or 'last' ((batsmen last_batsman)) is written there, I want to replace it with 'none' ((batsmen none_batsman)).
Any ideas on these issues?
Thanks
First of all you need to find the init-group. The init-group seems to have the structure:
(:init
...
)
where ... is some recurrence of text contained inside parentheses, e.g. "(batsmen first_batsman)". Regular expressions are a powerful way to locate this kind of pattern in text. If you are not familiar with regular expressions (or regex for short), have a look here.
The following regex locates this group:
import re
#Matches the items in the init-group:
item_regex = r"\([\w ]+\)\s+"
#Matches the init-group including items:
init_group_regex = re.compile(r"(\(:init\s+({})+\))".format(item_regex))
init_group = init_group_regex.search(s).group()
Now you have the init-group in init_group. The next step is to locate the terms you want to replace, and actually replace them. re.sub can do just that! First store the mappings in a dictionary:
mappings = {'batsmen first_batsman': 'batsmen ' + predicate_batsmen + '_batsman',
            'bowler none_bowler': 'bowler first_bowler',
            'umpire third_umpire': 'umpire leg_umpire',
            'spectator no_spectator': 'spectator empty_spectator'}
Finding the occurrences and replacing them by their corresponding value one-by-one:
for key, val in mappings.items():
    init_group = re.sub(key, val, init_group)
Finally you can replace the init-group in the original string:
s = init_group_regex.sub(init_group, s)
This is really flexible! You can use regex in mappings to have it match anything you like, including:
mappings = {r'batsmen \w+_batsman': 'batsmen ' + predicate_batsmen + '_batsman'}
to match 'batsmen none_batsman', 'batsmen first_batsman', etc.
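Putting the pieces together, here is a self-contained run with a shortened version of the file inlined as a string, and with regex keys in the mappings as just described (the mapping values are illustrative):

```python
import re

# Inlined, shortened stand-in for the file contents from the question.
s = """(:init
(batsmen first_batsman)
(bowler none_bowler)
)
(:goal (and
(batsmen first_batsman)
)
)"""

# Same patterns as above: one item, then the whole init group.
item_regex = r"\([\w ]+\)\s+"
init_group_regex = re.compile(r"(\(:init\s+({})+\))".format(item_regex))
init_group = init_group_regex.search(s).group()

# Regex keys, so any *_batsman / *_bowler variant is caught.
mappings = {r'batsmen \w+_batsman': 'batsmen none_batsman',
            r'bowler \w+_bowler': 'bowler first_bowler'}
for key, val in mappings.items():
    init_group = re.sub(key, val, init_group)

# Safe here because init_group contains no backslashes; otherwise the
# replacement string passed to sub() would need escaping.
s = init_group_regex.sub(init_group, s)
print(s)
```

The (:goal section is untouched because the outer pattern only matches text introduced by (:init.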
I have a text file with dates, names and addresses, like so:
190524 David Bakerstreet 190515 Peter Hollandstreet etc
I want to precede the dates with {" exactly and follow them with ": exactly, to make them fit a dictionary. I tried the following, but it subs every date in file.txt with the latest one in the loop, instead of changing one at a time, so every date in file.txt becomes the same. How can I do it?
import re

file = open('file.txt')
FILE = file.read()
a = re.compile(r"\d\d\d\d\d\d")  # To find dates like 190213
b = re.findall(a, FILE)  # Finding all the dates and putting them in a list
for k in b:
    for q in FILE.split():
        if k in q:
            c = a.sub("{\"" + k + "\":", FILE)
print(c)
Outcome: {"190515": David Bakerstreet {"190515": Peter Hollandstreet etc
Outcome I want: {"190524": David Bakerstreet {"190515": Peter Hollandstreet etc
You can make use of the \1 token in the replacement to refer to a part of the matched string.
First, capture the 6 digits matched with a group. Use this regex:
(\d{6})
For the replacement string, instead of "{\"" + k + "\":", you can simply use "{\"\\1\":".
You don't actually need that many for loops either. Assuming the input file is small, I don't think you need any for loop.
a = re.compile(r"(\d{6})")
c = a.sub("{\"\\1\":", FILE)
print(c)
If your input file is large, then you might need to read it bit by bit, instead of the whole thing at once.
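As a self-contained demonstration of the backreference substitution (the sample text here is made up to mirror the question):

```python
import re

# Made-up sample standing in for the contents of file.txt.
FILE = '190524 David Bakerstreet 190515 Peter Hollandstreet'

# Capture each 6-digit date and wrap it in {"...": via the \1 backreference.
a = re.compile(r"(\d{6})")
c = a.sub(r'{"\1":', FILE)
print(c)  # {"190524": David Bakerstreet {"190515": Peter Hollandstreet
```

A single sub() call rewrites every date with its own captured digits, which is exactly what the nested loops failed to do.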
I have a .txt file that is currently formatted kind of like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
...
The first column will never have any missing values.
I'm trying to use python to convert this into a .csv file. I know how to do this if I have all of the column data for each row, but my .txt is missing some data in certain columns. How can I convert this to a .csv while making sure the same type of data remains in the same column? Thanks :)
Split by commas. You know the pattern should be: word, word, int (I'm assuming), string in the pattern www.word.word.
If there is only 1 word at the front instead of 2, add another comma after the first word.
If the number is missing, add a comma after the second word.
Etc...
Say you get a line "Susie,www.regexr.com"; you know that there is a missing word and a missing number. Add 2 commas after the first word.
It's essentially a bunch of if statements or a switch-case statement.
There probably is a more elegant way of doing this, but my mind is fried from dealing with server and phone issues all morning.
This isn't tested in any way, I hope I didn't just embarrass myself:
import re

URL_RE = re.compile(r'www\.\S+\.\S+')  # matches URLs like www.google.com
NUM_RE = re.compile(r'^\d+$')          # matches the number column

# read_line is a line read from the file
split_line = read_line.split(',')
num_elements = len(split_line)  # do this only once for efficiency

if num_elements == 3:  # Need to add an element somewhere, depending on what's missing
    if URL_RE.search(split_line[2]):  # Starting at the last element: is it a URL?
        if NUM_RE.search(split_line[1]):  # If the previous element is a number
            # the only element missing is the food string at split_line[1]
            read_line = split_line[0] + ',,' + split_line[1] + ',' + split_line[2]
        else:
            # the number is missing; add a comma after the food string
            read_line = split_line[0] + ',' + split_line[1] + ',,' + split_line[2]
    else:
        # last element isn't a URL; add a comma in its place
        read_line = split_line[0] + ',' + split_line[1] + ',' + split_line[2] + ','
elif num_elements == 2:  # the first element is assumed to always be there
    if URL_RE.search(split_line[1]):  # The second element is a URL
        # Insert 2 commas for the missing string and number
        read_line = split_line[0] + ',,,' + split_line[1]
    elif NUM_RE.search(split_line[1]):  # The second element is the number
        # Insert commas for the missing string and URL
        read_line = split_line[0] + ',,' + split_line[1] + ','
    else:
        # Insert commas for the missing number and URL
        read_line = split_line[0] + ',' + split_line[1] + ',,'
elif num_elements == 1:
    read_line = split_line[0] + ',,,'
I thought about your issue, and I can only offer a half-baked solution, as your CSV file does not mark missing data with something like ,,.
Your current CSV file is like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
If you find a way to change your CSV file to be like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,,35,www.website.com
Charles,banana,,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
you can use the solution below. For info, I've put your input into a text file.
In [1]: import pandas as pd
In [2]: population = pd.read_csv('input_to_csv.txt')
In [3]: mod_population=population.fillna("NaN")
In [4]: mod_population.to_csv('output_to_csv.csv',index=False)
One suggestion would be to do a regex check, if you can assume some kind of uniformity. For example, build a list of regex patterns, since each piece of data seems to be different.
If the second column you read in matches all characters and spaces, it's likely food. On the other hand, if it's a digit match, you should assume that food is missing. If it's a url match, you missed both. You'll want to be thorough with your test cases, but if the actual data is similar to your example you have 3 relatively unique cases, with a string, an integer, and a url. This should make writing regex tasks relatively trivial. Importing re and using re.search should help you test each regex without too much overhead.
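A sketch of that idea, classifying each field with a regex and slotting it into its column. The patterns are my own guesses based on the sample rows, not a tested spec:

```python
import re

# One pattern per column, tried in order: food, number, URL.
PATTERNS = [re.compile(r'^[A-Za-z ]+$'),  # food: letters and spaces only
            re.compile(r'^\d+$'),         # number: digits only
            re.compile(r'^www\.')]        # URL: starts with www.

def normalize(line):
    """Return the line with empty placeholders for missing columns."""
    fields = line.split(',')
    name, rest = fields[0], fields[1:]  # the name column is never missing
    out = [name]
    for pat in PATTERNS:
        if rest and pat.search(rest[0]):
            out.append(rest.pop(0))  # field matches this column: consume it
        else:
            out.append('')           # column missing: leave it empty
    return ','.join(out)

print(normalize('Anita,35,www.website.com'))             # Anita,,35,www.website.com
print(normalize('Charles,banana,www.stackoverflow.com')) # Charles,banana,,www.stackoverflow.com
```

Each non-matching column simply becomes an empty CSV field, which is the ,, form the pandas answer above needs.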
I have an area-codes file that I put into a tuple:
for line1 in area_codes_file.readlines():
    if area_code_extract.search(line1):
        area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do two things:
1 is to keep the number
2 is to know which area code it matched, as I need to put the area codes in brackets.
So far, I was only able to do 1:
for line in txt.readlines():
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line)
How do I do the second part?
The simple (if not necessarily highest performance) approach is to check each prefix individually, and keep the first match:
for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line, next(filter(line.startswith, area_codes)))
Since we know filter(line.startswith, area_codes) will get exactly one hit, we just pull the hit using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re

# Function that will match any of the given prefixes, returning a match object on a hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Returns None on a miss, a match object on a hit
        m = area_code_matcher(line)
        if m is not None:
            # Whatever matched is in the 0th grouping
            print(line, m.group())
Lastly, one final approach you can use if the area codes are of fixed length. Rather than using startswith, you can slice directly; you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset will allow much faster lookup
area_codes_set = frozenset(area_codes)

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Assuming lines that match always start with ###
        if line[:3] in area_codes_set:
            print(line, line[:3])
I have a Spanish novel in a plain text file, and I want to make a Python script that puts a translation in brackets after difficult words. I have a list of the words (with translations) that I want to do this with in a separate text file, which I have tried to format correctly.
I've forgotten everything I knew about Python, which was very little to begin with, so I'm struggling.
This is a script someone helped me with:
bookin = (open("C:\Users\King Kong\Documents\_div_tekstfiler_\coc_es.txt")).read()
subin = open("C:\Users\King Kong\Documents\_div_tekstfiler_\cocdict.txt")
for line in subin.readlines():
    ogword, meaning = line.split()
    subword = ogword + " (" + meaning + ")"
    bookin.replace(ogword, subword)
    ogword = ogword.capitalize()
    subword = ogword + " (" + meaning + ")"
    bookin.replace(ogword, subword)
subin.close()
bookout = open("fileout.txt", "w")
bookout.write(bookin)
bookout.close()
When I ran this, I got this error message:
Traceback (most recent call last):
  File "C:\Python27\translscript_secver.py", line 4, in <module>
    ogword, meaning = line.split()
ValueError: too many values to unpack
The novel is pretty big, and the dictionary I've made consists of about ten thousand key-value pairs.
Does this mean there's something wrong with the dictionary? Or that it's too big?
Been researching this a lot, but I can't seem to make sense of it. Any advice would be appreciated.
line.split() in ogword, meaning = line.split() returns a list, and in this case it may be returning more than 2 values. Write your code in a way that can handle more than two values, for instance by assigning line.split() to a list and then asserting that the list has two items:
mylist = line.split()
assert len(mylist) == 2
ogword, meaning = mylist[:2]
line.split() returns a list of the words (space-separated tokens) in line. The error you get suggests that somewhere your dictionary contains more than just a pair. You can add a trace message to locate the error (see below).
If your dictionary contains richer definitions than a single synonym, you can use the following lines, which put the first word in ogword and the following ones in meaning.
words = line.split()
ogword, meaning = words[0], " ".join(words[1:])
If your dictionary syntax is more complex (composed ogword), you have to rely on an explicit separator. You can still use split to divide your lines (line.split("=") will split a line on "=" characters)
Edit: to ignore and display bad lines, replace ogword, meaning = line.split() with
try:
    ogword, meaning = line.split()
except ValueError:
    print "wrongly formatted line:", line
    continue
split() returns a single list, i.e. one item; you are trying to assign this one thing to two variables.
It will work if the number of items in the list is equal to the number of variables on the left-hand side of the assignment statement, i.e. the list is unpacked and the individual parts are assigned to the variables on the left-hand side.
In this case, as pointed out by @Josvic Zammit, the problem occurs when there are more than 2 items in the list and it cannot properly be "unpacked" and assigned.
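To illustrate, here is a quick demonstration of when unpacking fails and how a starred target absorbs the extra words (the sample line is made up):

```python
line = 'casa house or home'  # a definition line with more than two words

# Plain unpacking fails because split() yields 4 items for 2 targets.
try:
    ogword, meaning = line.split()
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Extended unpacking keeps the first word and joins the rest back together.
ogword, *rest = line.split()
meaning = ' '.join(rest)
print(ogword, '->', meaning)  # casa -> house or home
```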