Separating comma delimited text within a field using Python - python

I'm currently trying to convert a table into RDF using Python and attach the values from each cell to the end of a URL (eg E00 becomes statistics.data.gov.uk/id/statistical-geography/E00).
I can do this for cells containing a single value using the following script:
FirstCode = row[11]
if row[11] != '':
    RDF = RDF + '<http://statistics.data.gov.uk/id/statistical-geography/' + FirstCode + '>.\n'
One field within the database contains multiple values that are comma delimited.
The code above therefore returns all the codes appended to the URL
e.g. http://statistics.data.gov.uk/id/statistical-geography/E00,W00,S00
Whereas I'd like it to return three values
statistics.data.gov.uk/id/statistical-geography/E00
statistics.data.gov.uk/id/statistical-geography/W00
statistics.data.gov.uk/id/statistical-geography/S00
Is there some code that will allow me to separate these out?

Yes, there is the split method.
FirstCode.split(",")
will return a list like ['E00', 'W00', 'S00'].
You can then iterate over the items in the list:
for i in FirstCode.split(","):
    print(i)
This will print out:
E00
W00
S00
This page has some other useful string functions

for i in FirstCode.split(','):
    RDF = RDF + '<http://statistics.data.gov.uk/id/statistical-geography/' + i + '>.\n'
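If the field might contain stray spaces or empty entries (an assumption, since the question does not show the raw data), a slightly more defensive sketch could strip each code and skip blanks. The function name codes_to_rdf and the sample field value are hypothetical and only serve the illustration:

```python
BASE_URL = 'http://statistics.data.gov.uk/id/statistical-geography/'

def codes_to_rdf(field):
    """Return one '<url>.' line per comma-separated code in field."""
    lines = []
    for code in field.split(','):
        code = code.strip()   # drop whitespace around each code
        if code:              # skip empty entries, e.g. from '' or 'E00,,W00'
            lines.append('<' + BASE_URL + code + '>.\n')
    return ''.join(lines)

# Hypothetical cell value with inconsistent spacing.
RDF = codes_to_rdf('E00, W00,S00')
print(RDF)
```

An empty cell yields an empty string here, so the separate check for '' from the question is no longer needed.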

Related

How do I properly write a CSV file within a for loop in python?

I am using the following code to scrape content from a webpage with the end goal of writing to a CSV. On the first iteration I had this portion working, but now that my data is formatted differently it writes the data in a way that gets mangled when I try to view it in Excel.
If I use the code below, the "heading.text" data is correctly put into one cell when viewed in Excel, whereas the contents of "child.text" are packed into one cell rather than being split on the commas. You will see I have attempted to clean up the content of "child.text" to see if that was my issue.
If I remove "heading.text" from "z" and try again, it writes in a way that has Excel showing one letter per cell. In the end I would like each comma-separated value to display in its own cell when viewed in Excel. I believe I am doing something (many things?) incorrectly in structuring "z" and/or when I write the row.
Any guidance would be greatly appreciated. Thank you.
csvwriter = csv.writer(csvfile)
for heading in All_Heading:
    driver.execute_script("return arguments[0].scrollIntoView(true);", heading)
    print("------------- " + heading.text + " -------------")
    ChildElement = heading.find_elements_by_xpath("./../div/div")
    for child in ChildElement:
        driver.execute_script("return arguments[0].scrollIntoView(true);", child)
        #print(heading.text)
        #print(child.text)
        z = (heading.text, child.text)
        print (z)
        csvwriter.writerow(z)
When I print "z" I get the following:
('Flower', 'Afghani 3.5g Pre-Pack Details\nGREEN GOLD ORGANICS\nAfghani 3.5g Pre-Pack\nIndica\nTHC: 16.2%\n1/8 oz - \n$45.00')
When I print "z" with the older code that split the string on "\n" I get the following:
('Flower', "Cherry Limeade 3.5g Flower - BeWell Details', 'BE WELL', 'Cherry Limeade 3.5g Flower - BeWell', 'Hybrid', 'THC: 18.7 mg', '1/8 oz - ', '$56.67")
csvwriter.writerow() takes an iterable; each element is written as a separate field, separated by the writer's delimiter, i.e. it becomes its own cell.
First, let's see what has been happening so far:
(heading.text, child.text) has two elements, i.e. two cells: heading.text and child.text.
(child.text) is simply child.text (it would be a tuple if it were (child.text,)), and a string's elements are its individual characters. Hence each letter got its own cell.
To get different cells in a row we need separate elements in our iterable, so we want something like [heading.text, child.text line 1, child.text line 2, ...]. You were right to split the text into lines, but the lines weren't being added to the row correctly.
Since tuples are immutable, I'll use a list instead.
We know heading.text should take a single cell, so we can start with:
row = [heading.text] # this is what your z is
We want each line to be a separate element so we split child.text:
lines = child.text.split("\n")
# The text doesn’t start or end with a newline so this should suffice
Now we want each element added to the row separately; for that we can use the extend() method on lists:
row.extend(lines)
# e.g. a = [1, 2]; a.extend([3, 4, 5]) leaves a as [1, 2, 3, 4, 5]
Putting it together:
row = [heading.text]
lines = child.text.split("\n")
row.extend(lines)
or unpacking it in a single line:
row = [heading.text, *child.text.split("\n")] # You can also use a tuple here
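As a minimal sketch of the whole flow, the snippet below uses plain strings copied from the printed output in the question in place of the Selenium elements, and an io.StringIO buffer in place of the real csvfile (both are stand-ins chosen for illustration, not part of the original script):

```python
import csv
import io

# Sample values copied from the printed output in the question.
heading_text = 'Flower'
child_text = ('Afghani 3.5g Pre-Pack Details\nGREEN GOLD ORGANICS\n'
              'Afghani 3.5g Pre-Pack\nIndica\nTHC: 16.2%\n1/8 oz - \n$45.00')

buffer = io.StringIO()          # stands in for the real csvfile
csvwriter = csv.writer(buffer)

# One cell for the heading, then one cell per line of the child text.
row = [heading_text, *child_text.split('\n')]
csvwriter.writerow(row)

print(len(row))            # number of cells in the row
print(buffer.getvalue())   # the CSV line as it would appear in the file
```

When writing to a real file in Python 3, opening it with newline='' is the usual recommendation for the csv module so that no extra blank lines appear on Windows.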

How to find the title of a file that sits in between title tags

I have some files that have "TITLE..." followed directly by "JOURNAL...". The specific lines vary from file to file. I am trying to pull all of the information that exists between "...TITLE..." and "...JOURNAL...". So far I am only able to pull the line that contains "TITLE", but in some files the title spills onto the next line.
I deduced that I must use a=line.find("TITLE") and b=line.find("JOURNAL"),
then set up a loop for i in range(a,b):, which displays the index values from 698 to 768, but only the numbers instead of the string. How do I display the string? And how do I then clean that up so it does not display "TITLE", "JOURNAL", and the whitespace between those two and the text I need? Thanks!
This is the version that displays the single line that "TITLE" appears on:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.readline()
    while line:
        line=f.readline()
        if "TITLE" in line:
            line.strip("TITLE ")
            print(line)
    f.close()
extract_title()
This is the current block that displays all of those numbers in increasing order on separate lines:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.read()
    a=line.find("TITLE")
    b=line.find("JOURNAL")
    line.strip()
    f.close()
    if "TITLE" in line and "JOURNAL" in line:
        for i in range(a,b):
            print(i)
extract_title()
Currently, I have from 698-768 displayed like:
698
699
700
etc...
I want to first get them like, 698 699 700,
then convert them to their string value
then I want to understand how to strip the white spaces and the "TITLE" and "JOURNAL" values. Thanks!
I am not sure I understand what you want to achieve, but if I have it right, you have a string similar to "TITLE 659 JOURNAL" and want the value in the middle? If so, you could use slice notation:
line = f.read()
a = line.find("TITLE") + 5 # Because find gives index of the start so we add length
b = line.find("JOURNAL")
value = line[a:b]
value = value.strip() # Strip whitespace
If we now were to return value or print it out we get:
'659'
Similarly, if you want to get the value after JOURNAL, you could use slice notation again:
idx = line.find("JOURNAL") + 7
value = line[idx:] # Start after JOURNAL till end of string
You don't need the loop. Just use slicing:
line = 'fooTITLEspamJOURNAL'
start = line.find('TITLE') + 5 # 5 is len('TITLE')
end = line.find('JOURNAL')
print(line[start:end])
Output:
spam
Another option is to split:
print(line.split('TITLE')[1].split('JOURNAL')[0])
str.split() returns a list; we use indexes to get the element we want.
In slow motion:
part2 = line.split('TITLE')[1]
title = part2.split('JOURNAL')[0]
print(title)
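For titles that spill onto a second line, the sliced text still contains the newline and the indentation of the continuation line. A minimal sketch of one way to flatten it is shown below; the record text is a made-up excerpt in the style of a GenBank file, since the real file contents are not shown in the question:

```python
# Hypothetical excerpt; the actual GenBank1.gb contents are not known.
text = (
    "  AUTHORS   Smith,J. and Jones,K.\n"
    "  TITLE     An example title that is long enough to spill\n"
    "            onto a second line\n"
    "  JOURNAL   Unpublished\n"
)

start = text.find("TITLE")
end = text.find("JOURNAL", start)

if start != -1 and end != -1:
    raw = text[start + len("TITLE"):end]
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, so joining with one space flattens the title.
    title = " ".join(raw.split())
else:
    title = None   # one of the markers is missing

print(title)
```

Using len("TITLE") instead of a literal 5 keeps the offset correct if the marker text ever changes.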

How to write data to CSV file into correct column heading in Python?

I have dictionary data that I need to write into a CSV file under the headings Form, Name, Type and Definition. The dictionary to write is in the code snippet below.
writeData.py
def writeCSV():
    # Dictionary Data to write received from Form Submit
    csvData = {
        'form' : 'Customer',
        'Customer [form]' : 'Customer is module for recording information related to customer such as Name, Address, Date of Birth, Place of Birth, ID Number, etc.',
        'Salutation [label]' : 'A greeting in words or actions, or the words used at the beginning of a letter or speech. This field has values such as Mr, Ms, Miss.',
        'First Name English [label]': 'The name that was given to you when you were born and that comes before your family name. This field accept only English Character.'
    }
    items = {key:value for key,value in csvData.iteritems() if key != 'form'}
    form = csvData.get('form')
    Columns = ['Form','Name','Type','Definition']
    string = ''
    with open("myfile.csv","w+") as f:
        # Write Heading
        for col_header in Columns:
            string = string + "," + col_header
        f.write(string[1:]+'\n')
        # Write Data Body
        string = ''
        for key,value in items.iteritems():
            string = form + "," + key + "," + " " + ","+value
            f.write(string)
            f.write('\n')
    return ''
writeCSV()
However, after I executed the Python script above, the data was written correctly under the headings Form, Name, and Type. Yet under the heading Definition, the data spilled into additional columns beyond the Definition heading.
I searched around but found no clue why it expands across columns like this. Is the amount of data inside one CSV column limited? What's wrong here, and how can I write the data into the correct column of the CSV file? Thanks.
The problem is that CSV delimits columns with a special character. In your case that is a comma ','. But commas also occur in your text, so each one is taken as a delimiter and interpreted as the start of a new column. You can switch from comma to semicolon ';' as the delimiter, but even then you have to ensure that there are no semicolons in your original text.
If you do it this way, you need to change these lines:
string = string + ";" + col_header # ; instead of ,
string = form + ";" + key + ";" + " " + ";"+value
But I would suggest using a library, as @Nathaniel suggests.
You may have success converting your dictionary to a data frame and then saving it to a csv:
import pandas as pd
csvData = {'form' : 'Customer',
'Customer [form]' : 'Customer is module for recording information related to customer such as Name, Address, Date of Birth, Place of Birth, ID Number, etc.',
'Salutation [label]' : 'A greeting in words or actions, or the words used at the beginning of a letter or speech. This field has values such as Mr, Ms, Miss.',
'First Name English [label]': 'The name that was given to you when you were born and that comes before your family name. This field accept only English Character.'
}
# Try either of these
df = pd.DataFrame(csvData, index=[0])
#df = pd.DataFrame.from_dict(csvData)
# Output to csv
df.to_csv('myfile.csv')
Without some example data, it is difficult to test this on your data.
It did not expand into adjacent columns; because of the size of the text, it doesn't fit the column width, and Excel's default is to draw that text over adjacent columns. You can verify this is the case by selecting cells that it appears to have expanded into, and seeing they are in fact empty. You can also change the way these cells are displayed, "wrapping" their contents within the column provided (making the rows taller).
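Regarding the embedded commas, another option is the standard csv module, which quotes any field that contains the delimiter so it stays in one column. The sketch below assumes Python 3, uses two of the entries from the question, and writes to an io.StringIO buffer as a stand-in for myfile.csv:

```python
import csv
import io

csv_data = {
    'form': 'Customer',
    'Customer [form]': 'Customer is module for recording information related to customer such as Name, Address, Date of Birth, Place of Birth, ID Number, etc.',
    'Salutation [label]': 'A greeting in words or actions, or the words used at the beginning of a letter or speech. This field has values such as Mr, Ms, Miss.',
}

form = csv_data['form']
items = {k: v for k, v in csv_data.items() if k != 'form'}

buffer = io.StringIO()   # stands in for open("myfile.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(['Form', 'Name', 'Type', 'Definition'])
for key, value in items.items():
    # The Type column gets a single space, mirroring the original script.
    writer.writerow([form, key, ' ', value])

# Reading the data back shows how many fields each row really has.
buffer.seek(0)
rows = list(csv.reader(buffer))
print([len(r) for r in rows])
```

Because the writer wraps the definitions in double quotes, spreadsheet programs that follow the usual CSV conventions should read each definition as a single cell.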

Unexpected String Translation from Dictionary

I'd like to write a program that reads in a file and translates a short string of text 4 characters long to a new string of 4 characters. Currently, I read in a tab-delimited text file containing two columns: an "old tag" and a "new tag". I'm able to successfully build a dictionary that maps the "old tag" as the key and the "new tag" as the value.
My problem comes in when I attempt to use maketrans() and str.translate(). Somehow my "old_tag" is getting converted to a "new_tag" that I don't even have in my dictionary! I've attached screenshots of what I mean.
"P020" should get converted to "AGAC" as outlined in my dictionary.
The error is that the variable "old_tag" is instead getting converted to "ACAC" (look at the variable "new_tag"). I don't even have ACAC in my translation table!
Here's my function that does the string translate:
def translate_tag(f_in, old_tag, trn_dict):
    """Function to convert any old tags to their new format based on the translation dictionary (variable "trn_dict")."""
    try:
        # tag_lookup = trn_dict[old_tag]
        # trans = maketrans(old_tag, tag_lookup)
        trans = maketrans(old_tag, trn_dict[old_tag]) # Just did the above two lines on one line
    except KeyError:
        print("Error in file {}! The tag {} wasn't found in the translation table. "
              "Make sure the translation table is up to date. "
              "The program will continue with the rest of the file, but this tag will be skipped!".format(f_in, old_tag))
        return None
    new_tag = old_tag.translate(trans)
    return new_tag
Here's my translation table. It's a tab-delimited text file, and the old tag is column 1, and the new tag is column 2. I translate from old tag to new tag.
The strange thing is that it converts just fine for some tags. For example, "P010" gets translated correctly. What could be causing the problem?
You should not use maketrans, as it works on individual characters (per the official documentation). Make it a dictionary, with your original text (1st column) as the key and the new text (2nd column) as its value.
Then you can look up any tag x with trn_dict[x], either wrapped in a try or preceded by a test such as if x in trn_dict.
database = """P001 AAAA
P002 AAAT
P003 AAAG
P004 AAAC
P005 AATA
P006 AATT
P007 AATG
P008 AATC
P009 ATAA
P010 ATAT
P011 ATAG
P012 ATAC
P013 ATTA
P014 ATTT
P015 ATTG
P016 ATTC
P017 AGAA
P018 AGAT
P019 AGAG
P020 AGAC
P021 AGTA
P022 AGTT
P023 AGTG
P024 AGTC
""".splitlines()
trn_dict = {line.split()[0]: line.split()[1] for line in database}
def translate_tag(old_tag, trn_dict):
    """Function to convert any old tags to their new format based on the translation dictionary (variable "trn_dict")."""
    try:
        return trn_dict[old_tag]
    except KeyError:
        print("Error! The tag {} wasn't found in the translation table. "
              "Make sure the translation table is up to date. "
              "The program will continue with the rest of the file, but this tag will be skipped!".format(old_tag))
        return None
print(translate_tag('P020', trn_dict))
shows the expected value AGAC.
(That string-to-list-to-dict code is a quick hack to get the data in the program and is not really part of this how-to.)
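To see where ACAC comes from, it helps to look at the character mapping that maketrans builds. The sketch below assumes Python 3's str.maketrans; the question appears to use an older import of maketrans, which pairs characters by position in the same way for this case:

```python
old_tag, new_tag = "P020", "AGAC"

# maketrans pairs characters by position: P->A, 0->G, 2->A, 0->C.
# '0' appears twice, and the later pairing (0->C) replaces the earlier one.
table = str.maketrans(old_tag, new_tag)
mapping = {chr(k): chr(v) for k, v in table.items()}
print(mapping)

# Every '0' in the tag is therefore translated to 'C'.
result = old_tag.translate(table)
print(result)

# "P010" -> "ATAT" happens to work because both '0's map to the same letter.
ok = "P010".translate(str.maketrans("P010", "ATAT"))
print(ok)
```

This is why some tags translate correctly and others do not: the character-by-character approach only works when every repeated character in the old tag maps to the same character in the new tag.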

How to append all the loop elements in single line while using string Template?

I have tried to make a template for example.py using string.Template, where I substitute each for-loop element into $i in ["CA:"+$i+':'+" "]. It partially works, but only the last element is substituted.
But I want to append all the values on a single line in a certain format.
For example:
What my current script does is as follows:
for i in range(1,4):
    #It takes each "i" element but substitutes only the last one
    str='''s=selection( self.atoms["CA:"+$i+':'+" "].select_sphere(10) )'''
What I am getting is as follows:
s=selection( self.atoms["CA:"+3+':'+" "].select_sphere(10) )
What I am expecting is as follows:
s=selection ( self.atoms["CA:"+1+':'+" "].select_sphere(10),self.atoms["CA:"+2+':'+" "].select_sphere(10),self.atoms["CA:"+3+':'+" "].select_sphere(10) )
My script:
import os
from string import Template
for i in range(1,4):
    str='''
s=selection( self.atoms["CA:"+$i+':'+" "].select_sphere(10) )
'''
    str=Template(str)
    file = open(os.getcwd() + '/' + 'example.py', 'w')
    file.write(str.substitute(i=i))
    file.close()
I use these two scripts to get my desired output:
import os
from string import Template
a=[]
for i in range(1,4):
    a.append(''.join("self.atoms["+ "'CA:' "+str(i)+""':'+" "+"]"+".select_sphere(10)"))
str='''s=selection( $a ).by_residue()'''
str=Template(str)
file = open(os.getcwd() + '/' + 'example.py', 'w')
file.write(str.substitute(a=a))
with open('example.py', 'w') as outfile:
    selection_template = '''self.atoms["CA:"+{}+':'+" "].select_sphere(10)'''
    selections = [selection_template.format(i) for i in range(1, 4)]
    outfile.write('s = selection({})\n'.format(', '.join(selections)))
One problem is that your code, because it opens the output file with mode 'w', overwrites the file on each iteration of the for loop. That is why you only see the last one in the file.
Also I wouldn't use string.Template to perform these substitutions. Just use str.format(). Generate a list of selections and use str.join() to produce the final string:
with open('example.py', 'w') as outfile:
    selection_template = 'self.atoms["CA:"+{}+":"+" "].select_sphere(10)'
    selections = [selection_template.format(i) for i in range(1, 4)]
    outfile.write('s = selection({})\n'.format(', '.join(selections)))
Here selection_template uses {} as a placeholder for variable substitution and a list comprehension is used to construct the selection strings. These selection strings are then joined together using the string ', ' as the separator and the resulting string inserted into the call to selection(), again using str.format().
In this example I use Python's built-in format string method, which is relatively easy to understand. If you prefer to use string templating you can easily adapt it.
The trick is to observe that there are two separate operations to perform:
1. Create the list of arguments
2. Substitute the argument list in the required output line
I use a generator expression as the argument to join to handle the iteration and step 1, and then simple string formatting to accomplish step 2.
I use the string's format method as a bound method to simplify the code by abbreviating the method calls.
main_format = '''
s = selection({})
'''.format
item_format = 'self.atoms["CA:"+{s}+\':\'+" "].select_sphere(10)'.format
items = ", ".join(item_format(s=i) for i in range(1, 4))
print(main_format(items))
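For readers who would rather stay with string.Template, the same two steps can be sketched as follows. The template variable names item_template, main_template and the $args placeholder are illustrative choices, not taken from the original script:

```python
from string import Template

# $i marks where each index goes; $args marks where the joined list goes.
item_template = Template('''self.atoms["CA:"+$i+':'+" "].select_sphere(10)''')
main_template = Template('s = selection( $args )\n')

# Step 1: build the comma-separated argument list.
items = ", ".join(item_template.substitute(i=i) for i in range(1, 4))

# Step 2: substitute the whole list into the output line.
line = main_template.substitute(args=items)
print(line)
```

The resulting line can then be written to example.py once, after the loop has finished, which avoids overwriting the file on every iteration.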
