Modifying data in a column - Python

Hi guys, I have a column like this:
Start
Start = 11122001
Start = 12012014
Start = 23122001
and I want to remove the "Start =" prefix and change the date format so it becomes:
Start
11/12/2001
12/01/2014
23/12/2001
How do I do this properly?

It depends on what you are trying to do.
If you want to remove Start = from each line:
lines = [format_date(re.sub(r"^Start =\s*", '', line)) for line in lines]
(presuming you have your text line by line in a list).
To format the date you need to implement the function format_date,
which will convert dates like 11122001 to 11/12/2001.
There are several ways to do this, depending on the input format.
One of the solutions:
def format_date(x):
    if re.match('[0-9]{8}', x):
        return "/".join([x[:2], x[2:4], x[4:]])
    else:
        return x
First you check whether the line matches the date expression (i.e. looks like a date),
and if it does, rewrite it. Otherwise just return it as is.
Of course, you can combine the solution into one line
and not use a function at all, but in that case
it will be less clear.
Another, map-based solution:
def format_line(line):
    x = re.sub(r"^Start =\s*", '', line)
    if re.match('[0-9]{8}', x):
        return "/".join([x[:2], x[2:4], x[4:]])
    else:
        return x

lines = list(map(format_line, lines))
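Putting the pieces together, here is a minimal runnable sketch using the sample lines from the question (the trailing `$` anchor and the `\s*` in the substitution are small additions so the stripped prefix and the date check line up):

```python
import re

# Sample column from the question
lines = [
    "Start",
    "Start = 11122001",
    "Start = 12012014",
    "Start = 23122001",
]

def format_line(line):
    # Strip the "Start =" prefix (and any whitespace after it)
    x = re.sub(r"^Start =\s*", "", line)
    # If what's left looks like an 8-digit date, insert slashes
    if re.match(r"[0-9]{8}$", x):
        return "/".join([x[:2], x[2:4], x[4:]])
    return x

result = [format_line(line) for line in lines]
print(result)  # ['Start', '11/12/2001', '12/01/2014', '23/12/2001']
```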

Sort with re.search() - Python

I have some problems with solving the following problem.
I have two *.txt files; both files contain cities from Austria. In the first file, "cities1", the cities are ordered by population.
The first file (cities1.txt) is looking like this:
1.,Vienna,Vienna,1.840.573
2.,Graz,Styria,273.838
3.,Linz,Upper Austria,198.181
4.,Salzburg,Salzburg,148.420
5.,Innsbruck,Tyrol,126.851
The second file (cities2.txt) is looking like this:
"Villach","Carinthia",60480,134.98,501
"Innsbruck","Tyrol",126851,104.91,574
"Graz","Styria",273838,127.57,353
"Dornbirn","Vorarlberg",47420,120.93,437
"Vienna","Vienna",1840573,414.78,151
"Linz","Upper Austria",198181,95.99,266
"Klagenfurt am Woerthersee","Carinthia",97827,120.12,446
"Salzburg","Salzburg",148420,65.65,424
"Wels","Upper Austria",59853,45.92,317
"Sankt Poelten","Lower Austria",52716,108.44,267
What I'd like to do, or in other words what I should do, is this: the first file, cities1.txt, is already sorted. I only need the second element of every line, i.e. the name of the city. For example, from the line 2.,Graz,Styria,273.838 I only need Graz.
Second, I should print out the area of the city, which is the fourth element of every line in cities2.txt. For example, from the third line "Graz","Styria",273838,127.57,353 I only need 127.57.
At the end the console should display the following:
Vienna,414.78
Graz,127.57
Linz,95.99
Salzburg,65.65
Innsbruck,104.91
So, my problem now is: how can I do this if I'm only allowed to use the re.search() method? The second *.txt file is not in the same order, so I have to bring the cities into the same order as in the first file for this to work, right?
I know it would be much easier to use re.split(), because then you could compare the list elements from both files. But I'm not allowed to do that.
I hope someone can help me, and sorry for the long text.
Here's an implementation based on my earlier comment:
with open('cities2.txt') as c:
    D = {}
    for line in c:
        t = line.split(',')
        cn = t[0].strip('"')
        D[cn] = t[-2]
with open('cities1.txt') as d:
    for line in d:
        t = line.split(',')
        print(f'{t[1]},{D[t[1]]}')
Note that this may not be robust: if there's a city name in cities1.txt that does not exist in cities2.txt, you'll get a KeyError.
This is just a hint, it's your university assignment after all.
import re

TEST = '2.,Graz,Styria,273.838'
RE = re.compile('^[^,]*,([^,]*),')
if match := RE.search(TEST):
    print(match.group(1))  # 'Graz'
Let's break down the regexp:
^ - start of line
[^,]* - any character except a comma - repeated 0 or more times
this is the first field
, - one comma character
this is the field separator
( - start capturing, we are interested in this field
[^,]* - any character except a comma - repeated 0 or more times
this is the second field
) - stop capturing
, - one comma character
(don't care about the rest of line)
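For what it's worth, the two hints can be combined into a complete sketch that still uses only re.search() for field extraction. The data is inlined here to keep the example self-contained; with the real files you would read lines from cities1.txt and cities2.txt instead:

```python
import re

# Inlined excerpts of the two files from the question
cities1 = """1.,Vienna,Vienna,1.840.573
2.,Graz,Styria,273.838
3.,Linz,Upper Austria,198.181"""

cities2 = '''"Graz","Styria",273838,127.57,353
"Vienna","Vienna",1840573,414.78,151
"Linz","Upper Austria",198181,95.99,266'''

# Second comma-separated field (the city name) in cities1 lines
NAME_RE = re.compile(r'^[^,]*,([^,]*),')
# Quoted city name and the fourth field (the area) in cities2 lines
AREA_RE = re.compile(r'^"([^"]*)","[^"]*",[^,]*,([^,]*),')

areas = {}
for line in cities2.splitlines():
    m = AREA_RE.search(line)
    if m:
        areas[m.group(1)] = m.group(2)

results = []
for line in cities1.splitlines():
    m = NAME_RE.search(line)
    if m:
        results.append(f'{m.group(1)},{areas[m.group(1)]}')

print('\n'.join(results))
```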

Counting date occurrences in python?

I'm currently trying to count the number of times a date occurs within a chat log. For example, the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However, I need to split the date from the time, as I currently treat them as one. I'm currently struggling to solve this, so any help is appreciated. Below is some sample code that I'm using to try to get the date count working. I'm currently using a Counter; however, I'm wondering if there are other ways to count dates.
filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print 'The Dates in the chat are as follows'
print "--------------------------------------------"
for mtxtf in mtxtformat:
    participant = mtxtf.split("(")[0]
    date1 = mtxtf.split("(")[-1]
    message = date1.split(")")[0]
    date.append(date1.strip())
for item in date:
    if item not in number:
        number.append(item)
for item in number:
    occurences = date.count(item)
    print("Date Occurences " + " is: " + str(occurences))
The easiest way would be to use a regex and count the occurrences of the date pattern in the log file. It would be faster, too.
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
    match = parens.search(mtxtf)
    date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for assuming there is no information enclosed in parens before your date-time information (for the current line).
Note2: Ideally enclose this in a try-except to continue on to the next line if the current line doesn't contain useful information.
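As a self-contained illustration of the regex-plus-Counter idea (Python 3, with made-up usernames and dates in the (mm/dd/yyyy hh:mm:ss) layout from the question):

```python
import re
from collections import Counter

# Hypothetical chat log in the question's format
log = """alice (01/02/2020 10:00:00): hi
bob (01/02/2020 10:01:00): hello
alice (02/02/2020 09:30:00): morning"""

# Capture only the date part inside the parens, ignoring the time
parens = re.compile(r'\((\d{2}/\d{2}/\d{4}) \d{2}:\d{2}:\d{2}\)')
dates = parens.findall(log)
occurrences = Counter(dates)
for date, count in occurrences.items():
    print(f'{date}: {count}')
```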

Python read .txt File -> list

I have a .txt file and I want to get its values into a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values with:
direction = []
for line in open(filename):
    d, t = line.strip().split(',')
    direction.append([float(d), long(t)])
return direction
But I have a big problem: when creating the data I forgot to insert a "\n" after each row.
That's why I have this format:
value0, timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get this data into a list the way I want it? It would be a lot of work to generate the data again.
Thanks
Max
import re

data = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", data):
    print(value, timestamp)
You will have to strip the last , but you can insert a comma after every 13 characters following a comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.46467665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,1466615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,1466615025741-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})", r"\1" + ",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.46467665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,1466615025652,-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,1466615025741,-1.65375,1466615025772,-1.6092329,1466615025801,
I coded a quick example using your data, but with len("timestamp") instead of 13 so you can adapt it:
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i, c in enumerate(instr):
    if c == ",":
        next_i = i + len("timestamp") + 1
        print(instr[previous_i:next_i])
        previous_i = next_i
The output is descrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp
I think you could do something like this:
direction = []
for line in open(filename):
    parts = line.split(',')
    v = parts[0]
    for s in parts[1:]:
        t = s[:13]
        direction.append([float(v), long(t)])
        v = s[13:]
If you're using python 3.X, then the long function no longer exists -- use int.
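For comparison, the whole descrambling can also be done with a single re.findall() pass, assuming every timestamp is exactly 13 digits (Python 3, so int instead of long; the sample string is shortened):

```python
import re

# Shortened sample of the concatenated data from the question
data = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531"

# Each record is a value (no commas), a comma, and exactly 13 timestamp digits
pairs = [[float(v), int(t)] for v, t in re.findall(r"([^,]+),(\d{13})", data)]
print(pairs)
```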

In Python, if startswith matches a value in a tuple, I also need to return which value

I have an area codes file I put in a tuple:
for line1 in area_codes_file.readlines():
    if area_code_extract.search(line1):
        area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do two things:
1. keep the number, and
2. know which area code it matched, as I need to put the area code in brackets.
So far, I was only able to do the first:
for line in txt.readlines():
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line)
How do I do the second part?
The simple (if not necessarily highest performance) approach is to check each prefix individually, and keep the first match:
for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line, next(filter(line.startswith, area_codes)))
Since startswith already returned True, we know filter(line.startswith, area_codes) will produce at least one hit, so we just pull the first one using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re

# Function that will match any of the given prefixes, returning a match object on a hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match

for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Returns None on miss, match object on hit
        m = area_code_matcher(line)
        if m is not None:
            # Whatever matched is in the 0th grouping
            print(line, m.group())
Lastly, one final approach you can use if the area codes are of fixed length. Rather than using startswith, you can slice directly; you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset allows much faster lookup
area_codes_set = frozenset(area_codes)
for line in txt:
    is_number = phonenumbers.parse(line, "GB")
    if phonenumbers.is_valid_number(is_number):
        # Assuming lines that match always start with ###
        if line[:3] in area_codes_set:
            print(line, line[:3])
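To see the filter() trick from the first approach in isolation, here is a minimal sketch with made-up area codes and numbers (no phonenumbers validation):

```python
# Hypothetical area codes and phone numbers for illustration only
area_codes = ("0117", "0121", "020")
numbers = ["01179460958", "02079460000", "07700900000"]

results = []
for number in numbers:
    if number.startswith(area_codes):
        # next() pulls the first prefix that actually matched
        code = next(filter(number.startswith, area_codes))
        results.append(f"({code}) {number[len(code):]}")
print(results)  # ['(0117) 9460958', '(020) 79460000']
```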

python : reading a datetime from a log file using regex

I have a log file which has text that looks like this.
Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)
There are two time formats in the file. I need to sort this log file based on the date time format enclosed in [].
This is the regex I am trying to use. But it does not return anything.
t_pat = re.compile(r".*\[\d+/\D+/.*\]")
I want to go over each line in file, be able to apply this pattern and sort the lines based on the date & time.
Can someone help me on this? Thanks!
You have a space in there that needs to be added to the regular expression:
text = "Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)"
matches = re.findall(r"\[\s*(\d+/\D+/.*?)\]", text)
print(matches)
['1/Jul/2013 03:27:12.818']
Next, parse the time using time.strptime:
http://docs.python.org/2/library/time.html#time.strptime
Finally, use the parsed date as a key into a dict with the line as the value, and sort the entries based on the key.
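For example, the extracted string can be parsed with datetime.strptime (the %f directive handles the fractional seconds):

```python
from datetime import datetime

# The string extracted by the regex above
stamp = "1/Jul/2013 03:27:12.818"
parsed = datetime.strptime(stamp, "%d/%b/%Y %H:%M:%S.%f")
print(parsed)  # 2013-07-01 03:27:12.818000
```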
You are not matching the initial space; you also want to group the date for easy extraction, and limit the \D and .* patterns to non-greedy:
t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
Demo:
>>> re.compile(r".*\[\s?(\d+/\D+?/.*?)\]").search(line).group(1)
'1/Jul/2013 03:27:12.818'
You can narrow down the pattern some more; you only need to match 3 letters for the month for example:
t_pat = re.compile(r".*\[\s?(\d{1,2}/[A-Z][a-z]{2}/\d{4} \d{2}:\d{2}:[\d.]{2,})\]")
Read all the lines of the file and use the sort function, passing in a key function that parses out the date and uses it for sorting:
import re
import datetime

def parse_date_from_log_line(line):
    t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
    date_string = t_pat.search(line).group(1)
    format = '%d/%b/%Y %H:%M:%S.%f'
    return datetime.datetime.strptime(date_string, format)

log_path = 'mylog.txt'
with open(log_path) as log_file:
    lines = log_file.readlines()
lines.sort(key=parse_date_from_log_line)