Regex - Skip subsequent matches (Python)

I am trying to split the following into a table, but I am running into issues because the name contains two spaces; otherwise "[\s]{2,}" would work. I also can't split on every whitespace between letters, since the 1st column ends with a letter and the 2nd column starts with one.
I would like to skip any whitespace between letters after the 1st occurrence.
String:
> TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00
TESTID
DR5 777777 0
50000
TEST NAME
23.40
600000.00
1000000 20 5
09 05 18
09 07 18
3876.00

I will try to solve your stated problem (vs the regex thing), because I don't fully understand the regex question.
If I were going to make that string into a list, I would do it like this:
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
my_list = [section for section in my_str.split(" ") if section != ""]
This uses a list comprehension to filter the empty strings out of the split. (Calling split() with no argument does the same thing, splitting on any run of whitespace.)
You can also use a regular expression as the separator.
import re
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
my_list = re.split(r'\s{2,}', my_str)
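Note that the quantifier must be written without an inner space: r'\s{2, }' matches the literal text "{2, }" rather than "two or more whitespace characters". As a quick check, with a hypothetical input where fields are separated by two or more spaces (the spacing in the question may have been collapsed by the page):

```python
import re

# Hypothetical spacing: two spaces between fields, single spaces within a field.
my_str = "TESTID  DR5 777777 0  50000  23.40  600000.00  3876.00"
fields = re.split(r"\s{2,}", my_str)  # no space inside {2,}
print(fields)  # ['TESTID', 'DR5 777777 0', '50000', '23.40', '600000.00', '3876.00']
```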

Pandas : A possible one-liner code that iterates through a list within a cell and extracting combinations from within that list. Using df.apply()

Rather than confusing you with an explanation, I'll let the code show what I'm trying to achieve.
I'm comparing a combination in one dataframe with another.
The 2 dataframes have the following:
an address
the string of the articles from which it was extracted (in a list format)
I WANT TO FIND: the address-article combinations within one file, say reg, that are not present in the other, look. The df names refer to the method by which the addresses were extracted from the articles.
Note: The address when written 'AZ 08' and '08 AZ' should be treated the same.
reg = pd.DataFrame({'Address': {0: 'AZ 08',1: '04 CA',2: '10 FL',3: 'NY 30'},
'Article': {0: '[\'Location AZ 08 here\', \'Went to 08 AZ\']',
1: '[\'Place 04 CA here\', \'Going to 04 CA\', \'Where is 04 CA\']',
2: '[\'This is 10 FL \', \'Coming from FL 10\']',
3: '[\'Somewhere around NY 30\']'}})
look = pd.DataFrame({'Address': {0: 'AZ 08',1: '04 CA',2: 'NY 30' },
'Article': {0: '[\'Location AZ 08 here\']',
1: '[\'Place 04 CA here\', \'Going to 04 CA\', \'Where is 04 CA\']',
2: '[\'Somewhere around NY 30\', \'Almost at 30 NY\']'}})
What I did manage to find is the records in which there is a mismatch, but I was unable to get address-article level info.
My method is shown below.
import ast

def make_set_expanded(string, review):
    rev_l = ast.literal_eval(review)
    s = set(str(string).lower().split())
    s.update(rev_l)
    return s
reg_list_expand = reg.apply(lambda x: make_set_expanded(x['Address'], x['Article']), axis=1).to_list()
look_list_expand = look.apply(lambda x: make_set_expanded(x['Address'], x['Article']), axis=1).to_list()
reg_diff = reg[reg.apply(lambda x: 'Yes' if make_set_expanded(x['Address'], x['Article']) in look_list_expand else 'No', axis=1) == 'No']
look_diff = look[look.apply(lambda x: 'Yes' if make_set_expanded(x['Address'], x['Article']) in reg_list_expand else 'No', axis=1) == 'No']
The function, overall:
treats an address 'AZ 08' and '08 AZ' as the same
shows missing addresses
shows addresses which came from a different article
But instead of showing the whole set as is (i.e. including the pairs which already have a match), I would like to show only the particular combination that's missing.
For example, in reg_diff, instead of showing the whole set again, I'd like to see only the address-article combination
'AZ 08': 'Went to 08 AZ' in the row.
IIUC, try:
convert your "Article" column from string to list
explode to get each article in a separate row.
outer merge with indicator=True to identify which DataFrame each row comes from
filter the merged dataframe to get the required output.
reg["Article"] = reg["Article"].str[1:-1].str.split(", ")
reg = reg.explode("Article")
look["Article"] = look["Article"].str[1:-1].str.split(", ")
look = look.explode("Article")
merged = reg.merge(look, how="outer", indicator=True)
reg_diff = merged[merged["_merge"].eq("left_only")].drop("_merge", axis=1)
look_diff = merged[merged["_merge"].eq("right_only")].drop("_merge", axis=1)
>>> reg_diff
Address Article
1 AZ 08 'Went to 08 AZ'
5 10 FL 'This is 10 FL '
6 10 FL 'Coming from FL 10'
>>> look_diff
Address Article
8 NY 30 'Almost at 30 NY'
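One caveat: the merge above compares Address strings literally, so it does not treat 'AZ 08' and '08 AZ' as the same, which the question's note requires. A minimal sketch of one possible fix, assuming token order is the only difference, is to normalize each address to a sorted tuple of its tokens before merging:

```python
# Sketch (not part of the answer above): normalize an address
# so that token order doesn't matter.
def norm_address(addr):
    return tuple(sorted(addr.split()))

print(norm_address("AZ 08"))  # ('08', 'AZ')
print(norm_address("08 AZ"))  # ('08', 'AZ')
```

Mapping this over the Address column (e.g. reg["Address"].map(norm_address)) before the merge would make both spellings compare equal.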
I'm not fully sure what your logic is looking to do, so I'll start with some example methods that may be useful:
Example of how to use .apply(eval) to format the text as lists. And an example of using .explode() to make those lists into rows.
def format_df(df):
    df = df.copy()
    df.Article = df.Article.apply(eval)
    df = df.explode("Article")
    return df.reset_index(drop=True)
reg, look = [format_df(x) for x in [reg, look]]
print(reg)
print(look)
Output:
Address Article
0 AZ 08 Location AZ 08 here
1 AZ 08 Went to 08 AZ
2 04 CA Place 04 CA here
3 04 CA Going to 04 CA
4 04 CA Where is 04 CA
5 10 FL This is 10 FL
6 10 FL Coming from FL 10
7 NY 30 Somewhere around NY 30
Address Article
0 AZ 08 Location AZ 08 here
1 04 CA Place 04 CA here
2 04 CA Going to 04 CA
3 04 CA Where is 04 CA
4 NY 30 Somewhere around NY 30
5 NY 30 Almost at 30 NY
Example of packing rows back into lists:
reg = reg.groupby('Address', as_index=False).agg(list)
look = look.groupby('Address', as_index=False).agg(list)
print(reg)
print(look)
Output:
Address Article
0 04 CA [Place 04 CA here, Going to 04 CA, Where is 04...
1 10 FL [This is 10 FL , Coming from FL 10]
2 AZ 08 [Location AZ 08 here, Went to 08 AZ]
3 NY 30 [Somewhere around NY 30]
Address Article
0 04 CA [Place 04 CA here, Going to 04 CA, Where is 04...
1 AZ 08 [Location AZ 08 here]
2 NY 30 [Somewhere around NY 30, Almost at 30 NY]

Check for blanks at specified positions in a string

I have the following problem, which I have solved in a very long-winded way, and I would like to know if there is another way to solve it. I have the following string structure:
text = 01 ARA 22 - 02 GAG 23
But due to processing, sometimes the spaces are not added properly and it may look like this:
text = 04 GOR23- 02 OER 23
text = 04 ORO 21-02 RRO 24
text = 04 DRE25- 12 RIS21
When they should look as follows:
text = 04 GOR 23 - 02 OER 23
text = 04 ORO 21 - 02 RRO 24
text = 04 DRE 25 - 12 RIS 21
To add the space in those specific positions, I basically check whether a space exists at that position of the string, and add it if it doesn't.
Is there a more efficient way to do this in Python?
I appreciate any advice.
You can use a regex to capture each of the components in the text, and then replace any missing spaces with a space:
import re
regex = re.compile(r'(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})')
text = ['04 GOR23- 02 OER 23',
'04 ORO 21-02 RRO 24',
'04 DRE25- 12 RIS21']
[regex.sub(r'\1 \2 \3 - \4 \5 \6', t) for t in text]
Output:
['04 GOR 23 - 02 OER 23',
'04 ORO 21 - 02 RRO 24',
'04 DRE 25 - 12 RIS 21']
Here is another way to do so:
data = '04 GOR23- 02 OER 23'
new_data = "".join(char if not i in [2, 5, 7, 8, 10, 13] else f" {char}" for i, char in enumerate(data.replace(" ", "")))

Python - How do I format a hexadecimal number in uppercase and two digits?

I needed to do a hexadecimal counter.
I tried to do it this way:
x = 0
while x != 10:
    print('Number '+'{0:x}'.format(int(x)))
    x = x + 1
The counter is working. The only problem is that the output looks like this
0 5 a f 14 19
1 6 b 10 15 1a
2 7 c 11 16 1b
3 8 d 12 17 1c
4 9 e 13 18 1d
and I would like it to look like this
00 05 0A 0F 14 19
01 06 0B 10 15 1A
02 07 0C 11 16 1B
03 08 0D 12 17 1C
04 09 0E 13 18 1D
How could I do that?
TL;DR (Python 3.6+)
print(f'Number {x:02X}')
This is explained in the Format specification Mini-Language.
To get uppercase letters:
'x' Hex format. Outputs the number in base 16, using lowercase letters for the digits above 9.
'X' Hex format. Outputs the number in base 16, using uppercase letters for the digits above 9.
To get 2 digits:
width is a decimal integer defining the minimum field width. If not specified, then the field width will be determined by the content.
When no explicit alignment is given, preceding the width field by a zero ('0') character enables sign-aware zero-padding for numeric types. This is equivalent to a fill character of '0' with an alignment type of '='.
So, you either need to set the width to 2, the fill to 0, and the alignment to =, or, more simply, just use the special 0 prefix before the width.
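For example, the explicit fill/align form and the shorthand are equivalent:

```python
x = 10
# fill '0', alignment '=', width 2, uppercase hex
print(format(x, '0=2X'))  # 0A
# shorthand: '0' prefix before the width
print(format(x, '02X'))   # 0A
```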
So:
print('Number '+'{0:02X}'.format(int(x)))
While we're at it, this is pretty silly code.
First, x is already an int, so why call int on it?
print('Number '+'{0:02X}'.format(x))
Meanwhile, if the only thing you're putting in a format string is a single format specifier, you don't need str.format, just format:
print('Number ' + format(x, '02X'))
Or, alternatively, the whole point of str.format is that you can throw multiple things into one format string:
print('Number {:02X}'.format(x))
If you are using >= Python 3.6, you can use elegant f-strings:
print(f'Number {x:02X}')
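Putting it together, the original counter with a zero-padded uppercase format prints the requested values:

```python
# The original loop, extended to 30 values so the hex letters show up.
for x in range(30):
    print(f'Number {x:02X}')
```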

How to read Deep-ocean Assessment and Reporting of Tsunamis (DART®) data in python

I am trying to plot water column height with python from DART data,
http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart
import pandas as pd
link = "http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart"
data = pd.read_table(link)
but the data has only one column and I can't access the separated fields:
Site: 23228
0 #Paroscientific Serial: W23228
1 #YY MM DD hh mm ss T HEIGHT
2 #yr mo dy hr mn s - m
3 2014 08 08 06 00 00 1 2609.494
4 2014 08 08 05 45 00 1 2609.550
5 2014 08 08 05 30 00 1 2609.605
6 2014 08 08 05 15 00 1 2609.658
7 2014 08 08 05 00 00 1 2609.703
8 2014 08 08 04 45 00 1 2609.741
9 2014 08 08 04 30 00 1 2609.769
10 2014 08 08 04 15 00 1 2609.787
11 2014 08 08 04 00 00 1 2609.799
12 2014 08 08 03 45 00 1 2609.802
For example, I just want the HEIGHT values as a numpy array; I don't know how to access this specific column.
With pure Python (no NumPy) I would use the csv module:
import urllib2
import csv
u = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
r = csv.reader(u, delimiter=' ')
# skip the headers
for _ in range(3):
    next(r, None)
Now r contains an iterable which gives one row (list of 8 items) at a time for whatever you need. Of course, if you need a list of lists, you may just do list(r).
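(Note that urllib2 is Python 2 only; on Python 3 the equivalent is urllib.request.urlopen.) As a small self-contained check of the row handling, using an in-memory sample in the same format instead of the live URL:

```python
import csv
import io

# Hypothetical sample rows in the DART format.
sample = io.StringIO(
    "2014 08 08 06 00 00 1 2609.494\n"
    "2014 08 08 05 45 00 1 2609.550\n"
)
r = csv.reader(sample, delimiter=' ')
# Filter out empty fields in case a row contains runs of spaces.
heights = [float([f for f in row if f][7]) for row in r]
print(heights)  # [2609.494, 2609.55]
```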
However, as you are handling rather a large amount of data, you may probably want to use NumPy. In that case:
import urllib2
import numpy as np
u = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
arr = np.loadtxt(u, skiprows=3)
This gives you an array of 92551 x 8 values.
Accessing the heights as a NumPy array is then simple:
arr[:,7]
Pandas is another possibility, as you correctly thought. It is just a matter of a few parameters...
import urllib2
import pandas as pd
link = 'http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart'
df = pd.read_table(link, delimiter=r'\s+', skiprows=[1,3], header=1)
Now you have a nice DataFrame with df["HEIGHT"] as the height. (The column names are taken from row 2 of the file.)
And for the plotting...
df["HEIGHT"].plot()
creates a line plot of the height values (the original answer included a screenshot of the plot here).
(Then I guess you will ask how to get the proper date on the X axis. I think that is worth a completely new question...)
Perhaps you can modify the following:
import urllib2, numpy
response = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
all_lines = response.read().splitlines()
lines_of_interest = all_lines[4:]
heights = numpy.zeros(len(lines_of_interest), dtype=float)
for idx, line in enumerate(lines_of_interest):
    heights[idx] = float(line.split()[7])
Then:
>>> heights.shape
(92551,)
>>> heights
array([ 2609.27 , 2609.213, 2609.153, ..., 2611.157, 2611.084,
2611.008])
Etc.

How to match this regular expression in python?

I have the following string s = "~ VERSION 11 11 11.1 222 22 22.222"
I want to extract the following into these variables:
string Variable1 = "11 11 11.1"
string Variable2 = "222 22 22.222"
How do I extract this with a regular expression? Or is there a better alternative way? (Note: there may be variable spacing in between the tokens I want to extract, and the leading character may be something other than a ~, but it will definitely be a symbol.)
For example, it could be:
~ VERSION 11 11 11.1 222 22 22.222
$ VERSION 11 11 11.1 222 22 22.222
# VERSION 11 11 11.1 222 22 22.222
If regular expression does not make sense for this or if there is a better way, please recommend.
How do I perform the extraction into those two variables in Python?
Try this:
import re
test_lines = """
~ VERSION 11 11 11.1 222 22 22.222
$ VERSION 11 11 11.1 222 22 22.222
# VERSION 11 11 11.1 222 22 22.222
"""
version_pattern = re.compile(r"""
[~!@#$%^&*()] # Starting symbol
\s+ # Some amount of whitespace
VERSION # the specific word "VERSION"
\s+ # Some amount of whitespace
(\d+\s+\d+\s+\d+\.\d+) # First capture group
\s+ # Some amount of whitespace
(\d+\s+\d+\s+\d+\.\d+) # Second capture group
""", re.VERBOSE)
lines = test_lines.split('\n')
for line in lines:
    m = re.match(version_pattern, line)
    if m:
        print(line)
        print(m.groups())
which gives output:
~ VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
$ VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
# VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
Note the use of verbose regular expressions with comments.
To convert the extracted version numbers to their numeric representation (i.e. int, float), use the regexp in @Preet Kukreti's answer and convert using int() or float() as suggested.
You can use the split method of str.
v1 = "~ VERSION 11 11 11.1 222 22 22.222"
res_arr = v1.split(' ') # get ['~', 'VERSION', '11', '11', '11.1', '222', '22', '22.222']
and then use elements 2-4 and 5-7 as you want.
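Following that idea, joining the relevant slices back together gives the two variables (using split() with no argument, which also handles variable spacing):

```python
v1 = "~ VERSION 11 11 11.1 222 22 22.222"
res_arr = v1.split()
variable1 = " ".join(res_arr[2:5])  # '11 11 11.1'
variable2 = " ".join(res_arr[5:8])  # '222 22 22.222'
print(variable1)
print(variable2)
```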
import re
pattern_string = r"(\d+)\s+(\d+)\s+([\d\.]+)"  # the regex you are probably after
m = re.match(pattern_string, "222 22 22.222")
groups = None
if m:
    groups = m.groups()
    # groups is ('222', '22', '22.222')
after which you could use int() and float() to convert to primitive numeric types if needed. For performant code you might want to precompile the regex beforehand with re.compile(...), and calling match(...) or search(...) on the resulting precompiled regex object
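A precompiled version of the same pattern would look like this:

```python
import re

# Compile once, reuse for many match() / search() calls.
pattern = re.compile(r"(\d+)\s+(\d+)\s+([\d\.]+)")
m = pattern.match("222 22 22.222")
if m:
    print(m.groups())  # ('222', '22', '22.222')
```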
It is definitely easy with a regular expression. Here is one way to do it:
>>> st="~ VERSION 11 11 11.1 222 22 22.222 333 33 33.3333"
>>> re.findall(r"(\d+[ ]+\d+[ ]+\d+\.\d+)",st)
['11 11 11.1', '222 22 22.222', '333 33 33.3333']
Once you get the result(s) in a list you can index and get the individual strings.