Keep leading zeros when saving numbers that start with 0


I am trying to save a list of site codes, for example:
site_codes = [1302,9033,1103,5005,0016]
Then I want to add the site code to URLs before running web scraping, using site_codes[i], for example:
for i in range(len(site_codes)):
    Data_site_A.append("https://.../"+str(parameters[i])+"site="+str(site_codes[0]))
    Data_site_B.append("https://.../"+str(parameters[i])+"site="+str(site_codes[1]))
But I cannot save 0016 into the list the way I save the other numbers. I have tried several approaches, including:
# make a string
str("{0}{1}{2}".format(0,0,16))
# fill the 0
"%04d" % 16
But they all return the string '0016' instead of the number 0016, so when I put '0016' into the URLs it does not work, because it is not a number.
Is there a way to save this number as 0016? Or, since print("%04d" % 16) prints a bare 0016, is there a way to save that output?
For the desired output, the computer should interpret it as:
"https://...."+str(parameters[i])+"site=0016"

# use regular expression
import re
site_codes = '''
site code:
site_A: 1302
site_B: 9033
site_C: 1103
site_D: 5005
site_E: 0016
'''
site_codes = re.findall(r'\d+',site_codes)
for i in range(len(site_codes)):
    Data_site_A.append("https://.../"+str(parameters[i])+"site="+str(site_codes[0]))
    Data_site_B.append("https://.../"+str(parameters[i])+"site="+str(site_codes[1]))

Use str.zfill() to add leading zeros to a number:
Call str(object) with a number as object to convert it to a string.
Call str.zfill(width) on the numeric string to pad it with zeros to the specified width.
a_number = 123
print(a_number)
OUTPUT=
123
# Convert a_number to a string
number_str = str(a_number)
# Pad number_str with zeros to 5 digits
zero_filled_number = number_str.zfill(5)
print(zero_filled_number)
OUTPUT=
00123

Assuming that you really do have a list of integers that can't be kept as strings, and that you are using Python 3.6 or above, you can build the URLs with a simple f-string:
print(f"https://.../{str(parameters[i])}site={site_codes[1]:04d}")
This will pad with leading zeros without the need to resort to zfill.
Alternatively, or if you're running Python below 3.6, this will also work:
print("https://.../{}site={:04d}".format(str(parameters[i]), site_codes[1]))
With a site code of 16, both of the above will give you
https://.../parametersite=0016
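Since a URL is text anyway, the leading zeros survive as long as the site codes stay strings until the URL is built. The following is a minimal sketch of that idea, not the asker's actual scraper: BASE_URL, the parameter names, and the helpers build_urls and normalize are hypothetical placeholders.

```python
# Hypothetical inputs: the base URL and parameter names are placeholders.
BASE_URL = "https://example.invalid/"
parameters = ["temp", "flow"]
site_codes = ["1302", "9033", "1103", "5005", "0016"]  # strings keep leading zeros


def normalize(code, width=4):
    """Turn an int or str site code into a zero-padded string."""
    return str(code).zfill(width)


def build_urls(params, site_code):
    """Return one URL per parameter for a single site code."""
    return [f"{BASE_URL}{param}site={normalize(site_code)}" for param in params]


# Works for a code stored as a string and for one that arrived as the int 16.
data_site_e = build_urls(parameters, site_codes[4])
data_site_e_from_int = build_urls(parameters, 16)
print(data_site_e)
```

Both calls produce the same URLs, so it does not matter whether a code was read as text or as a number.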

Related

PySpark / Python Slicing and Indexing Issue

Can someone let me know how to pull out certain values from a Python output?
I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}
This should be relatively easy; however, I'm having a problem defining the slicing / indexing configuration.
The following will successfully give me 'ocweeklyreports':
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I get any value after 'ocweeklycur'.

I'm not sure what output you're dealing with or how robust you want this to be, but if it's just a string you can do something like this (a quick and dirty solution).
text = '{"hiveView":"ocweeklycur.ocweeklyreports"}'
indexStart = text.index('.') + 1 # Index just past the '.', which is where the value you want begins
finalResponse = text[indexStart:-2] # Drop the trailing '"}'
print(finalResponse) # Prints ocweeklyreports
Again, not the most elegant solution, but hopefully it helps or at least offers a starting point. A more robust solution would be to use regex, but I'm not that skilled in regex at the moment.

You could do almost all of it using regex.
See if this helps:
import re
def search_word(di):
    st = di["config"]["hiveView"]
    # Escape the dot so it matches a literal '.' rather than any character
    p = re.compile(r'^ocweeklycur\.(?P<word>\w+)')
    m = p.search(st)
    return m.group('word')
if __name__=="__main__":
    d = {'config': {"hiveView":"ocweeklycur.ocweeklyreports"}}
    print(search_word(d))

The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"
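If the config value arrives as a JSON string, as the output in the question suggests, a variation on the split approach is to parse it first and then take everything after the first '.'. This is only a sketch under that assumption; the helper name value_after_prefix is made up for illustration.

```python
import json


def value_after_prefix(config_json, key="hiveView"):
    """Return the part of config[key] that follows the first '.'."""
    config = json.loads(config_json)
    # partition splits on the first '.' only and does not raise if it is missing
    _prefix, sep, rest = config[key].partition(".")
    if not sep:
        raise ValueError(f"no '.' found in {config[key]!r}")
    return rest


raw = '{"hiveView":"ocweeklycur.ocweeklyreports"}'
print(value_after_prefix(raw))
```

Unlike a fixed slice such as [12:30], this does not depend on the length of either part of the name.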

Python: Find and increment a number in a string

I can't find a solution to this, so I'm asking here. I have a string that consists of several lines and in the string I want to increase exactly one number by one.
For example:
[CENTER]
[FONT=Courier New][COLOR=#00ffff][B][U][SIZE=4]{title}[/SIZE][/U][/B][/COLOR][/FONT]
[IMG]{cover}[/IMG]
[IMG]IMAGE[/IMG][/CENTER]
[QUOTE]
{description_de}
[/QUOTE]
[CENTER]
[IMG]IMAGE[/IMG]
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
[IMG]IMAGE[/IMG]
[spoiler]
[spoiler=720p]
[CODE=rich][color=Turquoise]
{mediaInfo1}
[/color][/code]
[/spoiler]
[spoiler=1080p]
[CODE=rich][color=Turquoise]
{mediaInfo2}
[/color][/code]
[/spoiler]
[/spoiler]
[hide]
[IMG]IMAGE[/IMG]
[/hide]
[/CENTER]
I'm getting this string from a request and I want to increment the episode by 1. So from 01/5 to 02/5.
What is the best way to make this possible?
I tried to solve this via regex but failed miserably.

Assuming the number you want to change is always after a given pattern, e.g. "Episodes: [/B]", you can use this code:
def increment_episode_num(request_string, episode_pattern="Episodes: [/B]"):
    # Index of the first digit after the pattern
    idx = request_string.find(episode_pattern) + len(episode_pattern)
    episode_count = int(request_string[idx:idx+2])
    return request_string[:idx]+f"{(episode_count+1):0>2}"+request_string[idx+2:]
For example, given your string:
req_str = """[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
"""
res = increment_episode_num(req_str)
print(res)
which gives you the desired output:
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]02/5

As @Barmar suggested in the comments, and following the example from the documentation of re, you can use re.sub with a function, formatting the result so it has the right amount of zero padding:
import re
pattern = r"(?<=Episodes: \[/B\])[\d]+?(?=/\d)"
def add_one(matchobj):
    number = str(int(matchobj.group(0)) + 1)
    return "{0:0>2}".format(number)
request = re.sub(pattern, add_one, request)
The pattern uses look-ahead and look-behind to capture only the number that corresponds to Episodes, and should work whether it's in the format 01/5 or 1/5, but always returns in the format 01/5. Of course, you can expand the function so it recognizes the format, or even so it can add different numbers instead of only 1.
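One way to make the function recognize the format is to pad the new number to the width of the digits it replaces. The sketch below assumes the episode number always follows "Episodes: [/B]" and is followed by "/" and a digit; the names EPISODE_RE and bump_episode are illustrative only.

```python
import re

# Group 1 is the literal prefix, group 2 the current episode number.
EPISODE_RE = re.compile(r"(Episodes: \[/B\])(\d+)(?=/\d)")


def bump_episode(text, step=1):
    """Increment the episode number, keeping its original zero padding."""
    def repl(match):
        digits = match.group(2)
        # zfill to the original width: "01" becomes "02", "1" becomes "2"
        return match.group(1) + str(int(digits) + step).zfill(len(digits))
    return EPISODE_RE.sub(repl, text, count=1)


sample = "[B]Subtitles: [/B]German\n[B]Episodes: [/B]01/5\n"
print(bump_episode(sample))
```

The step argument covers the case of adding a number other than 1.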

Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length

I'm trying to create a phylogenetic tree by making a .phy file from my data.
I have a dataframe
ndf=
ESV trunc
1 esv1 TACGTAGGTG...
2 esv2 TACGGAGGGT...
3 esv3 TACGGGGGG...
7 esv7 TACGTAGGGT...
I checked the length of the elements of the column "trunc":
length_checker = np.vectorize(len)
arr_len = length_checker(ndf['trunc'])
The resulting arr_len gives the same length (=253) for all the elements.
I saved this dataframe as .phy file, which looks like this:
23 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
This is similar to the file used in this tutorial.
However, when I run the command
aln = AlignIO.read('msa.phy', 'phylip')
I get "ValueError: Sequences must all be the same length"
I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!
Thanks

Phylip is generally the fiddliest format in phylogenetics to move between different programs. There is a strict phylip format, a relaxed phylip format, and so on, and it is not easy to know which separator is being used: a space character and/or a carriage return.
I think that you appear to have left a space between the name of the taxon (i.e. the sequence label) and sequence name, viz.
2. esv2
Phylip format looks for the space between the label and the sequence data. In this example the sequence would be 3bp long. Using a "." is generally not a great idea either. The integer doesn't appear to denote a line number.
The other issue is that you could/should try keeping the sequence on the same line as the label and removing the carriage return, viz.
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
Sometimes a carriage return does work (this could be relaxed phylip format), but the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment, though I'm not sure that is needed.
Note that if your taxon name exceeds 10 characters you will need relaxed phylip format, and this format is generally a good idea in any case.
The final solution, if all else fails, is to convert to fasta, import as fasta and then convert to phylip. If all this fails, post back and there's more troubleshooting to do.
Fasta format removes the "23 253" header and then each sequence looks like this,
>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
There is always a carriage return between ">esv2" and the sequence. In addition, ">" is always present as a prefix to the label (taxon name), without any space. You can simply convert via regex, e.g. "re" in Python. As a perl one-liner it would be s/^([a-z]+[0-9]+)/>$1/g type code. I'm pretty sure there'll be an online website that will do this.
You then simply replace "phylip" with "fasta" in your import command. Once imported, you ask BioPython to convert to whatever format you want and it should not have any problem.
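The FASTA layout described above can also be written with plain Python, along with a quick check for sequences whose length differs from the rest. This is a sketch using only the standard library. The helper names to_fasta and odd_lengths are made up, and the short sequences are stand-ins rather than the real data.

```python
from collections import Counter


def to_fasta(records):
    """Format (label, sequence) pairs as FASTA text: '>label', newline, sequence."""
    return "".join(f">{label}\n{seq}\n" for label, seq in records)


def odd_lengths(records):
    """Return {label: length} for sequences that differ from the most common length."""
    counts = Counter(len(seq) for _, seq in records)
    expected = counts.most_common(1)[0][0]
    return {label: len(seq) for label, seq in records if len(seq) != expected}


# Shortened stand-in sequences; the third one is deliberately one base short.
records = [("esv1", "TACGTAGG"), ("esv2", "TACGGAGG"), ("esv3", "TACGGGG")]
print(to_fasta(records))
print(odd_lengths(records))
```

Running odd_lengths on the real label and sequence pairs before writing the file shows whether any sequence differs in length, independently of how the file is formatted.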

First, please read the answer to How to make good reproducible pandas examples. In the future, please provide a minimal reproducible example.
Secondly, Michael G is absolutely correct that phylip is a format that is very particular about its syntax.
The code below will allow you to generate a phylogenetic tree from your Pandas dataframe.
First some imports and let's recreate your dataframe.
import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO
data = {'ESV' : ['esv1', 'esv2', 'esv3'],
'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
}
ndf = pd.DataFrame.from_dict(data)
print(ndf)
Output:
ESV trunc
0 esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1 esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2 esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
Next, write the phylip file in the correct format.
with open("test.phy", 'w') as f:
    f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
    for row in ndf.iterrows():
        f.write("{:10} {}\n".format(*row[1].to_list()))
Output of test.phy:
3 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
Now we can start with the creation of our phylogenetic tree.
# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)
Output:
SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
Calculate the distance matrix:
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)
Output:
esv1 0
esv2 0.3003952569169961 0
esv3 0.6086956521739131 0.6245059288537549 0
Construct the phylogenetic tree using the UPGMA algorithm and draw the tree in ASCII:
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)
Output:
________________________________________________________________________ esv3
_|
| ___________________________________ esv2
|____________________________________|
|___________________________________ esv1
Or make a nice plot of the tree:
Phylo.draw(tree)

adding string objects which are numbers in dictionary

for line in open('transactions.dat','r'):
    item=line.rstrip('\n')
    item=item.split(',')
    custid=item[2]
    amt=item[4]
    if custid in cust1:
        a=cust1[custid]
        b=amt
        c=(a)+(b)
        print(cust1[custid]+" : "+a+" :"+b+":"+c)
        break
    else:
        cust1[custid]=amt
Output:
85.91 : 85.91 :85.91:85.9185.91
Above is my code. When I read from the file, I want to add up the amounts for customers with the same id.
Secondly, there should be no repetition of customer ids in my dictionary.
I am trying to add the customer amounts, which is c, but it gives me a concatenated string instead of the sum of the two. You can see this in the last part of my output, which is the value of c. So how do I add the values?
Sample transaction data:
109400182,2016-09-10,119257029,1094,40.29
109400183,2016-09-10,119257029,1094,9.99
377700146,2016-09-10,119257029,3777,49.37
276900142,2016-09-10,135127654,2769,23.31
276900143,2016-09-10,135127654,2769,25.58

You are reading strings, not floats, from the file. Use amt=float(item[4]) to convert strings representing numbers to floats, and then print(str(cust1[custid])+" : "+str(a)+" :"+str(b)+":"+str(c)) to print them out.

Your code may need a lot of refactoring, but in a nutshell, and if I understand what you are trying to do, you could do
c = float(a) + float(b)
and that should work.
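Putting the float conversion together with the goal of one entry per customer id, a possible refactor accumulates the totals in a dictionary. This is a sketch that assumes the column layout of the sample data (customer id in column 3, amount in column 5). The function name total_by_customer is made up, and io.StringIO stands in for the real transactions.dat file.

```python
import csv
import io
from collections import defaultdict


def total_by_customer(lines):
    """Sum the amount column (index 4) per customer id (index 2)."""
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        totals[row[2]] += float(row[4])  # convert before adding, so this is a sum
    return dict(totals)


# Sample rows from the question; with a real file, pass open('transactions.dat').
sample = io.StringIO(
    "109400182,2016-09-10,119257029,1094,40.29\n"
    "109400183,2016-09-10,119257029,1094,9.99\n"
    "377700146,2016-09-10,119257029,3777,49.37\n"
    "276900142,2016-09-10,135127654,2769,23.31\n"
    "276900143,2016-09-10,135127654,2769,25.58\n"
)
totals = total_by_customer(sample)
for custid, total in totals.items():
    print(f"{custid} : {total:.2f}")
```

Because the dictionary is keyed by customer id, each id appears only once, and there is no need to break out of the loop after the first repeated customer.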

limit a float list into 10 digits

I have a list imported from a data file.
lines=['1628.246', '100.0000', '0.4563232E-01', '0.4898217E-01', '0.3017656E-02', '0.2271272', '0.2437533', '0.1500232E-01', '0.4102987', '0.4117742', '0.5461504E-02', '2.080838', '0.5527303E-03', '-0.4542367E-03', '-0.2238781E-01', '-0.8196812E-03', '-0.3796306E-01', '-0.7906407E-03', '-0.6738000E-03', '0.000000']
I want to generate a new list in which every element is formatted with the same 10 decimal digits, and write it back to a file.
Here is what I did:
newline=map(float,lines)
newline=map("{:.10f}".format,newline)
newline=map(str,newline)
jitterfile.write(join(newline)+'\n')
It works, but it does not look pretty. Any ideas for making it cleaner?

You can build the list in a single line like so:
newline=["{:.10f}".format(float(i)) for i in lines]
jitterfile.write(','.join(newline)+'\n') # join needs a separator string; ',' is just an example
Of note, your third instruction newline=map(str,newline) is redundant: the entries in the list are already strings, so casting them is unnecessary.

The map function also accepts a lambda. Also, since the result of format is already a string you don't need to apply str to your list, and you need to use join with a delimiter like ',':
>>> newline=list(map(lambda x:"{:.10f}".format(float(x)),lines))
>>> newline
['1628.2460000000', '100.0000000000', '0.0456323200', '0.0489821700', '0.0030176560', '0.2271272000', '0.2437533000', '0.0150023200', '0.4102987000', '0.4117742000', '0.0054615040', '2.0808380000', '0.0005527303', '-0.0004542367', '-0.0223878100', '-0.0008196812', '-0.0379630600', '-0.0007906407', '-0.0006738000', '0.0000000000']
jitterfile.write(','.join(newline)+'\n')
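The same formatting can be wrapped in a small helper with the number of decimals as a parameter. This is a sketch only: format_fixed and write_row are hypothetical names, the input is a shortened copy of the list from the question, and io.StringIO stands in for the real jitterfile.

```python
import io


def format_fixed(values, decimals=10):
    """Format numeric strings with a fixed number of decimal places."""
    return [f"{float(v):.{decimals}f}" for v in values]


def write_row(handle, values, sep=","):
    """Write one delimited line of already-formatted values."""
    handle.write(sep.join(values) + "\n")


lines = ['1628.246', '100.0000', '0.4563232E-01', '-0.4542367E-03', '0.000000']
newline = format_fixed(lines)

jitterfile = io.StringIO()  # stand-in for an open file object
write_row(jitterfile, newline)
print(jitterfile.getvalue())
```

The separator is passed explicitly because str.join is a method of the delimiter string, not a standalone function.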
