Python: Find and increment a number in a string - python

I can't find a solution to this, so I'm asking here. I have a string that consists of several lines and in the string I want to increase exactly one number by one.
For example:
[CENTER]
[FONT=Courier New][COLOR=#00ffff][B][U][SIZE=4]{title}[/SIZE][/U][/B][/COLOR][/FONT]
[IMG]{cover}[/IMG]
[IMG]IMAGE[/IMG][/CENTER]
[QUOTE]
{description_de}
[/QUOTE]
[CENTER]
[IMG]IMAGE[/IMG]
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
[IMG]IMAGE[/IMG]
[spoiler]
[spoiler=720p]
[CODE=rich][color=Turquoise]
{mediaInfo1}
[/color][/code]
[/spoiler]
[spoiler=1080p]
[CODE=rich][color=Turquoise]
{mediaInfo2}
[/color][/code]
[/spoiler]
[/spoiler]
[hide]
[IMG]IMAGE[/IMG]
[/hide]
[/CENTER]
I'm getting this string from a request and I want to increment the episode by 1. So from 01/5 to 02/5.
What is the best way to make this possible?
I tried to solve this via regex but failed miserably.

Assuming the number you want to change is always after a given pattern, e.g. "Episodes: [/B]", you can use this code:
def increment_episode_num(request_string, episode_pattern="Episodes: [/B]"):
    # Index of the first character after the pattern
    idx = request_string.find(episode_pattern) + len(episode_pattern)
    # Assumes the episode number is exactly two digits wide
    episode_count = int(request_string[idx:idx+2])
    return request_string[:idx] + f"{(episode_count+1):0>2}" + request_string[idx+2:]
For example, given your string:
req_str = """[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
"""
res = increment_episode_num(req_str)
print(res)
which gives you the desired output:
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]02/5

As @Barmar suggested in the comments, and following the example from the documentation of re, you can use re.sub with a replacement function that also pads the result with leading zeros:
import re
pattern = r"(?<=Episodes: \[/B\])\d+?(?=/\d)"
def add_one(matchobj):
    number = str(int(matchobj.group(0)) + 1)
    return "{0:0>2}".format(number)
result = re.sub(pattern, add_one, request)  # request is the string from the question
The pattern uses look-ahead and look-behind to capture only the number that corresponds to Episodes, and should work whether it's in the format 01/5 or 1/5, but always returns in the format 01/5. Of course, you can expand the function so it recognizes the format, or even so it can add different numbers instead of only 1.
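The extension mentioned above (keeping the original width and adding an arbitrary amount) could look roughly like this. It is only a sketch: bump_episode, EPISODE_RE and the step parameter are hypothetical names, and it assumes the number always follows "Episodes: [/B]" as in the question.

```python
import re

# Hypothetical names for illustration; the pattern mirrors the one in the answer.
EPISODE_RE = re.compile(r"(?<=Episodes: \[/B\])\d+(?=/\d)")

def bump_episode(text, step=1):
    """Add `step` to the episode number, keeping its original zero-padded width."""
    def repl(match):
        digits = match.group(0)
        # zfill pads back to the width the number had in the input
        return str(int(digits) + step).zfill(len(digits))
    # count=1 so only the first "Episodes:" field is changed
    return EPISODE_RE.sub(repl, text, count=1)

print(bump_episode("[B]Episodes: [/B]01/5"))     # [B]Episodes: [/B]02/5
print(bump_episode("[B]Episodes: [/B]1/5", 2))   # [B]Episodes: [/B]3/5
```

Because the width is taken from the matched digits, 01/5 stays two digits wide while 1/5 stays unpadded.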

Related

PySpark / Python Slicing and Indexing Issue

Can someone let me know how to pull out certain values from a Python output?
I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}
This should be relatively easy; however, I'm having a problem defining the slicing / indexing configuration.
The following successfully gives me 'ocweeklyreports':
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I get any value after 'ocweeklycur'.
I'm not sure what output you're dealing with or how robust you want this to be, but if it's just a string you can do something like this (a quick and dirty solution).
text = '{"hiveView":"ocweeklycur.ocweeklyreports"}'  # your input string
index_start = text.index('.') + 1  # index just after the '.', where the value you want starts
final_response = text[index_start:-2]  # drop the trailing '"}'
print(final_response)  # Prints ocweeklyreports
Again, not the most elegant solution but hopefully it helps or at least offers a starting point. Another more robust solution would be to use regex but I'm not that skilled in regex at the moment.
You could do almost all of it using regex.
See if this helps:
import re
def search_word(di):
    st = di["config"]["hiveView"]
    # Escape the dot so it matches a literal '.'
    p = re.compile(r'^ocweeklycur\.(?P<word>\w+)')
    m = p.search(st)
    return m.group('word')
if __name__=="__main__":
    d = {'config': {"hiveView":"ocweeklycur.ocweeklyreports"}}
    print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"
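If the config value arrives as a JSON string, as the output in the question suggests, it may be safer to parse it first and then split on the first dot only. A minimal sketch, assuming the raw value looks exactly like the one shown in the question (raw_config is a hypothetical variable name):

```python
import json

# Assumption: the raw value is a JSON string like the one shown in the question.
raw_config = '{"hiveView":"ocweeklycur.ocweeklyreports"}'

hive_view = json.loads(raw_config)["hiveView"]
# partition splits on the first '.' only and returns (before, separator, after)
database, _, table = hive_view.partition(".")
print(table)  # ocweeklyreports
```

Unlike parts[1], partition does not raise an IndexError when there is no dot; table is simply an empty string in that case.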

How to use regex to dynamically find a value in a file in Python

I have a long string like this for example:
V:"production",PUBLIC_URL:"",WDS_SOCKET_HOST:void 0,WDS_SOCKET_PATH:void 0,WDS_SOCKET_PORT:void 0,FAST_REFRESH:!0,REACT_APP_CANDY_MACHINE_ID:"9mn5duMPUeNW5AJfbZWQgs5ivtiuYvQymqsCrZAenEdW",REACT_APP_SOLANA_NETWORK:"mainnet-beta
and I need to get the value of REACT_APP_CANDY_MACHINE_ID with regex. Its value is always 44 characters long, which I hope makes things easier. The file/string I'm pulling it from is much longer, and REACT_APP_CANDY_MACHINE_ID appears multiple times, but its value doesn't change.
You don't need regex for that, just use index() to get the location of REACT_APP_CANDY_MACHINE_ID.
data = 'V:"production",PUBLIC_URL:"",WDS_SOCKET_HOST:void 0,WDS_SOCKET_PATH:void 0,WDS_SOCKET_PORT:void 0,FAST_REFRESH:!0,REACT_APP_CANDY_MACHINE_ID:"9mn5duMPUeNW5AJfbZWQgs5ivtiuYvQymqsCrZAenEdW",REACT_APP_SOLANA_NETWORK:"mainnet-beta'
key = "REACT_APP_CANDY_MACHINE_ID"
start = data.index(key) + len(key) + 2
print(data[start: start + 44])
# 9mn5duMPUeNW5AJfbZWQgs5ivtiuYvQymqsCrZAenEdW
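Since the question asked for a regex, here is a sketch of that route as well. It assumes the value is always wrapped in double quotes right after the key, as in the sample; the data string is shortened here and machine_id is a hypothetical variable name.

```python
import re

# Shortened version of the sample string from the question.
data = ('FAST_REFRESH:!0,REACT_APP_CANDY_MACHINE_ID:'
        '"9mn5duMPUeNW5AJfbZWQgs5ivtiuYvQymqsCrZAenEdW",'
        'REACT_APP_SOLANA_NETWORK:"mainnet-beta')

# Capture everything between the quotes that follow the key.
match = re.search(r'REACT_APP_CANDY_MACHINE_ID:"([^"]+)"', data)
machine_id = match.group(1) if match else None
print(machine_id)
```

re.search stops at the first occurrence, which is enough here because the value is the same everywhere the key appears. It also does not depend on the value being exactly 44 characters long.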

Keep leading zeros when saving numbers that start with 0

I am trying to save a list of site codes, for example:
site_codes = [1302,9033,1103,5005,0016]
Then I want to add the site code to URLs before running web scraping, using site_codes[i], for example:
for i in range(len(site_codes)):
    Data_site_A.append("https://.../"+str(parameters[i])+"site="+str(site_codes[0]))
    Data_site_B.append("https://.../"+str(parameters[i])+"site="+str(site_codes[1]))
But I cannot save 0016 into the list like the other numbers. I have tried many ways, including:
# make a string
str("{0}{1}{2}".format(0,0,16))
# fill the 0
"%04d" % 16
But they all return '0016' instead of 0016. So when I put '0016' into the URLs, it won't work, because it is not a number.
Is there a way to save this number just as 0016? Or, since print("%04d" % 16) prints a bare 0016, is there a way to save the output from there?
For the desired output, the computer should interpret it as:
"https://...."+str(parameters[i])+"site=0016")
# use regular expression
import re
site_codes = '''
site code:
site_A: 1302
site_B: 9033
site_C: 1103
site_D: 5005
site_E: 0016
'''
site_codes = re.findall(r'\d+',site_codes)
for i in range(len(site_codes)):
    Data_site_A.append("https://.../"+str(parameters[i])+"site="+str(site_codes[0]))
    Data_site_B.append("https://.../"+str(parameters[i])+"site="+str(site_codes[1]))
Use str.zfill() to add leading zeros to a number:
Call str(object) with the number as object to convert it to a string.
Call str.zfill(width) on the numeric string to pad it with zeros to the specified width.
a_number = 123
print(a_number)
OUTPUT=
123
# Convert a_number to a string
number_str = str(a_number)
# Pad number_str with zeros to 5 digits
zero_filled_number = number_str.zfill(5)
print(zero_filled_number)
OUTPUT=
00123
Assuming that you really do have a list of integers that can't be kept as strings, and that you are using Python 3.6 or above, you can build the URLs with a simple f-string.
print(f"https://.../{str(parameters[i])}site={site_codes[1]:04d}")
This will pad with leading zeros without the need to resort to zfill.
Alternatively, or if you're running Python below 3.6, this will also work:
print("https://.../{}site={:04d}".format(str(parameters[i]), site_codes[1]))
With a site code of 16, both of the above will give you
https://.../parametersite=0016
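It may also help to see that the quotes around '0016' only appear in the string's repr and never end up in the URL itself, so storing the codes as strings is usually enough. A small sketch; the base URL and parameter value below are placeholders, not the real ones from the question:

```python
# Strings keep their leading zeros; integers cannot.
site_codes = ["1302", "9033", "1103", "5005", "0016"]

# Placeholder values for illustration only.
base = "https://example.com/"
parameter = "param"

urls = [f"{base}{parameter}?site={code}" for code in site_codes]

print(repr(site_codes[-1]))  # '0016'  (quotes are part of the repr only)
print(urls[-1])              # https://example.com/param?site=0016
print(int("0016"))           # 16      (the zeros are lost once it is an int)
```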

how to return string with re when formatting is different?

Introduction to the problem
I have inputs in a .txt file and I want to 'extract' the values when a velocity is given.
Inputs have the form: velocity\tval1\tval2...\tvaln
[...]
16\t1\t0\n
1.0000\t9.3465\t8.9406\t35.9604\n
2.0000\t10.4654\t9.9456\t36.9107\n
3.0000\t11.1235\t10.9378\t37.1578\n
[...]
What have I done
I have written a piece of code to return values when a velocity is requested:
def values(input,velocity):
    return re.findall("\n"+str(velocity)+".*",input)[-1][1:]
It works "backwards" because I want to ignore the first row from the inputs (16\t1\t0\n), this way if I call:
>>>values('inputs.txt',16)
>>>16.0000\t0.5646\t14.3658\t1.4782\n
But it has a big problem: if I call the function for 1, it returns the value for 19.0000
Since I thought all inputs would be in the same format, I made a little fix:
def values(input,velocity):
    if velocity <= 5: #Because velocity goes to 50
        velocity = str(velocity)+'.0'
    return re.findall("\n"+str(velocity)+".*",input)[-1][1:]
And it works pretty well. Maybe it's not the most beautiful (or efficient) way of doing it, but I'm a beginner.
The problem
But this code has a problem: sometimes the inputs have this form:
[...]
16\t1\t0\n
1\t9.3465\t8.9406\t35.9604\n
2\t10.4654\t9.9456\t36.9107\n
3\t11.1235\t10.9378\t37.1578\n
[...]
And, of course, my solution doesn't work.
So, is there any pattern that fits both kinds of inputs?
Thank you for your help.
P.S. I have a solution using split('\n') and indexes, but I would like to solve it with the re library:
def values(input,velocity):
    return input.split('\n')[velocity+1] #+1 to avoid first row
You could use a positive look ahead to check that after your velocity there is either a period or a tab. That will stop you picking up further numbers without hardcoding there must be .0. This means that velocity 1 will be able to match 1 or 1.xxxxx
import re
from typing import List
def find_by_velocity(velocity: int, data: str) -> List[str]:
    return re.findall(r"\n" + str(velocity) + r"(?=\.|\t).*", data)
data = """16\t1\t0\n1\t9.3465\t8.9406\t35.9604\n2\t10.4654\t9.9456\t36.9107\n3\t11.1235\t10.9378\t37.1578\n16\t1\t0\n1.0000\t9.3465\t8.9406\t35.9604\n2.0000\t10.4654\t9.9456\t36.9107\n3.0000\t11.1235\t10.9378\t37.1578\n"""
print(find_by_velocity(1, data))
OUTPUT
['\n1\t9.3465\t8.9406\t35.9604', '\n1.0000\t9.3465\t8.9406\t35.9604']
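To get a single row back, as the function in the question does, the same lookahead idea can be combined with re.MULTILINE. This is only a sketch: last_row_for_velocity is a hypothetical name, the sample data is made up from the rows quoted in the question, and it assumes the first line of the input is always the header row.

```python
import re

def last_row_for_velocity(velocity, data):
    """Return the last data row whose first column is `velocity`, or None."""
    # Drop the header row (e.g. "16\t1\t0") so it can never be returned.
    _, _, body = data.partition("\n")
    # ^ and $ match at line boundaries because of re.MULTILINE.
    pattern = re.compile(
        r"^" + re.escape(str(velocity)) + r"(?=[.\t]).*$", re.MULTILINE
    )
    matches = pattern.findall(body)
    return matches[-1] if matches else None

data = ("16\t1\t0\n"
        "1\t9.3465\t8.9406\t35.9604\n"
        "2\t10.4654\t9.9456\t36.9107\n"
        "16.0000\t0.5646\t14.3658\t1.4782\n")

print(repr(last_row_for_velocity(1, data)))
print(repr(last_row_for_velocity(16, data)))
```

Returning None for a missing velocity avoids the IndexError that indexing an empty findall result with [-1] would raise.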

Apply operation and a division operation in the same step using Python

I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS']
}
def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag,value in x.items():
            if tag in pos_family[flag]:
                cnt +=value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" in the first run and get the noun count and include it again and run, it works fine but if I run it for the first time I get the below error.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
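df2['word_count'] is a pandas Series, but you need a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it. Inside the lambda, that division produces a whole Series for every row instead of a single number, which is why the assignment fails.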
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
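A small self-contained check of this approach, using made-up data; the two rows below and the noun_ratio column name are assumptions for illustration, and check_pos_tag is condensed from the question:

```python
import pandas as pd

pos_family = {"noun": ["NN", "NNS", "NNP", "NNPS"]}

def check_pos_tag(tag_counts, flag):
    # Sum the counts of all tags that belong to the requested family
    return sum(count for tag, count in tag_counts.items()
               if tag in pos_family[flag])

# Made-up sample data: one dict of tag counts per row
df2 = pd.DataFrame({
    "PoS_Count": [{"NN": 2, "VB": 1, "NNS": 1}, {"JJ": 1, "NNP": 1}],
    "word_count": [4, 2],
})

# apply returns one int per row; the division is then done element-wise
noun_count = df2["PoS_Count"].apply(lambda x: check_pos_tag(x, "noun"))
df2["noun_ratio"] = noun_count / df2["word_count"]
print(df2["noun_ratio"].tolist())  # [0.75, 0.5]
```

Dividing two Series aligns them by index, so each row's noun count is divided by that row's word count.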
