How do you convert this text file into a dictionary? (Python)

I have a .txt file that reads:
Areca Palm
2018-11-03 18:21:26
Tropical/sub-Tropical plant
Leathery leaves, mid to dark green
Moist and well-draining soil
Semi-shade/full shade light requirements
Water only when top 2 inches of soil is dry
Intolerant to root rot
Propagate by cuttings in water

Canary Date Palm
2018-11-05 10:12:15
Semi-shade, full sun
Dark green leathery leaves
Like lots of water,but soil cannot be water-logged
Like to be root bound in pot
I want to convert this .txt file into a dictionary in Python, and the output should look something like this:
d = {'Areca Palm': ('2018-11-03 18:21:26', 'Tropical/sub-Tropical plant', 'Leathery leaves, mid to dark green', 'Moist and well-draining soil'..etc 'Canary Date Palm': ('2018-11-05 10:12:15', 'Semi-shade, full sun'...)
How do I go about doing this?

The following code shows one way to do this, by reading the file with a very simple two-state state machine:
with open("data.in") as inFile:
    # Initialise dictionary and simple state machine.
    afterBlank = True
    myDict = {}

    # Process each line in turn.
    for line in inFile.readlines():
        line = line.strip()
        if afterBlank:
            # First non-blank after blank (or at file start) is key
            # (blanks after blanks are ignored).
            if line != "":
                key = line
                myDict[key] = []
                afterBlank = False
        else:
            # Subsequent non-blanks are additional lines for key
            # (blank after non-blank switches state).
            if line != "":
                myDict[key].append(line)
            else:
                afterBlank = True

# Dictionary holds lists, make into tuples if desired.
for key in myDict.keys():
    myDict[key] = tuple(myDict[key])

import pprint
pprint.pprint(myDict)
Using your input data gives the following output (printed with pprint to make it a little more readable than the standard print):
{'Areca Palm': ('2018-11-03 18:21:26',
                'Tropical/sub-Tropical plant',
                'Leathery leaves, mid to dark green',
                'Moist and well-draining soil',
                'Semi-shade/full shade light requirements',
                'Water only when top 2 inches of soil is dry',
                'Intolerant to root rot',
                'Propagate by cuttings in water'),
 'Canary Date Palm': ('2018-11-05 10:12:15',
                      'Semi-shade, full sun',
                      'Dark green leathery leaves',
                      'Like lots of water,but soil cannot be water-logged',
                      'Like to be root bound in pot')}

Many parsing problems are greatly simplified by writing a function
to process the file and yield its lines one meaningful section at a time.
Quite often, the logic needed for this part is pretty simple. And
it stays simple, because the function isn't concerned with any other
details about your larger problem.
That step then simplifies the downstream code, which focuses on deconstruction
of one meaningful section at a time. This part can ignore larger file concerns -- also
keeping things simple.
An illustration:
import sys

def get_paragraphs(path):
    par = []
    with open(path) as fh:          # The basic pattern tends to repeat:
        for line in fh:
            line = line.rstrip()
            if line:                # Store lines you want.
                par.append(line)
            elif par:               # Yield prior batch.
                yield par
                par = []
    if par:                         # Don't forget the last one.
        yield par

path = sys.argv[1]

d = {
    p[0] : tuple(p[1:])
    for p in get_paragraphs(path)
}
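For the sample file in the question, d ends up holding roughly the following (a sketch of the result, abbreviated with ... for the remaining lines):

import pprint
pprint.pprint(d)
# {'Areca Palm': ('2018-11-03 18:21:26',
#                 'Tropical/sub-Tropical plant',
#                 ...),
#  'Canary Date Palm': ('2018-11-05 10:12:15',
#                       'Semi-shade, full sun',
#                       ...)}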

Related

How to make a text file (name1:hobby1 name2:hobby2) into this (name1:hobby1, hobby2 name2:hobby1, hobby2)?

I'm new to programming and I need some help. I have a text file with lots of names and hobbies that looks something like this:
Jack:crafting
Peter:hiking
Wendy:gaming
Monica:tennis
Chris:origami
Sophie:sport
Monica:design
Some of the names and hobbies are repeated. I'm trying to make the program display something like this:
Jack: crafting, movies, yoga
Wendy: gaming, hiking, sport
This is my program so far, but the 4 lines from the end are incorrect.
def create_dictionary(file):
    newlist = []
    dict = {}
    file = open("hobbies_database.txt", "r")
    hobbies = file.readlines()
    for rows in hobbies:
        rows1 = rows.split(":")
        k = rows1[0]  # name
        v = (rows1[1]).rstrip("\n")  # hobby
        dict = {k: v}
        for k, v in dict.items():
            if v in dict[k]:
In this case I would use defaultdict.
import sys
from collections import defaultdict

def create_dictionary(inputfile):
    d = defaultdict(list)
    for line in inputfile:
        name, hobby = line.split(':', 1)
        d[name].append(hobby.strip())
    return d

with open(sys.argv[1]) as fp:
    for name, hobbies in create_dictionary(fp).items():
        print(name, ': ', sep='', end='')
        print(*hobbies, sep=', ')
Your example gives me this result:
Sophie: sport
Chris: origami
Peter: hiking
Jack: crafting
Wendy: gaming
Monica: tennis, design
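If defaultdict is new to you, here is a tiny illustration (a hypothetical snippet, not part of the answer) of why it saves the explicit "is the key already there?" check:

from collections import defaultdict

d = defaultdict(list)        # missing keys get a fresh empty list automatically
d['Monica'].append('tennis')
d['Monica'].append('design')
print(d['Monica'])           # ['tennis', 'design']
print(d['Jack'])             # [] - created on first access, no KeyError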
You may try this one (note: Python 2 syntax):
data = map(lambda x: x.strip(), open('hobbies_database.txt'))
tmp = {}
for i in data:
    k, v = i.strip().split(':')
    if not tmp.get(k, []):
        tmp[k] = []
    tmp[k].append(v)

for k, v in tmp.iteritems():
    print k, ':', ','.join(v)
output:
Monica : tennis,design
Jack : crafting
Wendy : gaming
Chris : origami
Sophie : sport
Peter : hiking
You could try something like this. I've deliberately rewritten it because I'm trying to show you how you would go about this in a more "Pythonic" way, making a bit more use of the language.
For example, you can store lists inside the dictionary to represent the data more intuitively. It will then be easier to print the information out in the way you want.
def create_dictionary(file):
    names = {}  # create the dictionary to store your data
    # using a with statement ensures the file is closed properly
    # even if there is an error thrown
    with open("hobbies_database.txt", "r") as file:
        # This reads the file one line at a time;
        # using readlines() would load the whole file into memory in one go.
        # Reading line by line is far better for large data files that won't fit into memory
        for row in file:
            # strip() removes end-of-line characters and trailing white space
            # split() returns a list [] which can be unpacked directly into single variables
            name, hobby = row.strip().split(":")
            # this checks to see if 'name' has been seen before,
            # i.e. is there already an entry in the dictionary
            if name not in names:
                # if not, assign an empty list to the dictionary key 'name'
                names[name] = []
            # this adds the hobby seen on this line to the list
            names[name].append(hobby)

    # This iterates through all the keys in the dictionary
    for name in names:
        # using the string format function you can build up
        # the output string and print it to the screen
        # ", ".join(list) will join all the elements of the list
        # into a single string and place a comma between each
        # set(list) creates a set of unique items, so if a hobby
        # is added twice you will only see it once in the output
        # names[name] is the list [] of hobby strings for that 'name'
        print("{0}: {1}\n".format(name, ", ".join(set(names[name]))))
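Since the comments mention set(): a quick, hypothetical illustration of the de-duplication (note that a set has no guaranteed order, so the hobbies may print in any order):

hobbies = ['gaming', 'hiking', 'sport', 'hiking']
print(", ".join(set(hobbies)))   # e.g. "sport, gaming, hiking" - the duplicate 'hiking' appears once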
Hope this helps, and perhaps points you in the direction of a few more Python concepts. If you haven't been through the introductory tutorial yet, I'd definitely recommend it.

Trying to create index, but variable cannot reach defined value [Python]

I am trying to create an index of a .txt file, which has section titles in all caps. My attempt looks like this:
dictionary = {}
line_count = 0
for line in file:
    line_count += 1
    line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
    list = []
    if line.isupper():
        head = line
    else:
        list = line.split(' ')
        for i in list:
            if i not in stopwords:
                dictionary.setdefault(i, {}).setdefault(head, []).append(line_count)
The head variable, however, never picks up its value, which I am trying to assign from any line that is all caps. My desired output would be something like:
>>dictionary['cat']
{'THE PARABLE': [3894, 3924, 3933, 3936, 3939], 'SNOW': [4501], 'THE CHASE': [6765, 6767, 6772, 6773, 6785, 6802, 6807, 6820, 6823, 6839]}
Here is a slice of the data:
THE GOLDEN BIRD
A certain king had a beautiful garden, and in the garden stood a tree
which bore golden apples. These apples were always counted, and about
the time when they began to grow ripe it was found that every night one
of them was gone.
THE PARABLE
Influenced by those remarks, the bird next morning refused to bring in
the wood, telling the others that he had been their servant long enough,
and had been a fool into the bargain, and that it was now time to make a
change, and to try some other way of arranging the work. Beg and pray
as the mouse and the sausage might, it was of no use; the bird remained
master of the situation, and the venture had to be made. They therefore
drew lots, and it fell to the sausage to bring in the wood, to the mouse
to cook, and to the bird to fetch the water.
At the heart of your problem is this test:
if line.isupper():
This test fails for a line that is just a number (e.g. '0') or other such things, because digits have no uppercase form. Instead try:
if line.upper() == line:
But in general your code could use a little Pythonic love, for example:
import re

data = {}
head = ''
with open('file1', 'rU') as f:
    for line_num, line in enumerate(f):
        line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
        if line.upper() == line:
            head = line
        else:
            for word in (w for w in line.split(' ') if w not in stopwords):
                data.setdefault(word, {}).setdefault(head, []).append(line_num + 1)
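A quick, hypothetical sanity check (not from the original answer) showing the difference between the two tests:

print('0'.isupper())                                  # False - digits have no uppercase form
print('0'.upper() == '0')                             # True  - upper() leaves the line unchanged
print('THE PARABLE'.upper() == 'THE PARABLE')         # True  - heading line
print('A certain king'.upper() == 'A certain king')   # False - body text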

Going through a list individually

I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
lines = []
with open('words.txt') as f:
    lines = f.readlines()
for each in lines:
    wordlist = open('test.txt', 'a')
    word = ##figure out how to get items from list and assign them here
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
    # print url and make sure it's correct
    html = urllib.urlopen(url).read()
    # print html (deprecated)
    soup = BeautifulSoup(html)
    visible_text = soup.find('pre')(text=True)[0]
    print(visible_text, file=wordlist)
Keep everything in a loop, like this:
with open('test.txt', 'a') as wordlist:
    for word in lines:
        url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
        print url
        # print url and make sure it's correct
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)
        visible_text = soup.find('pre')(text=True)[0]
        wordlist.write("{0}\n".format(visible_text))
Secondly, some suggestions:
f.readlines() won't discard the trailing \n. So, I would use f.read().splitlines()
lines = f.read().splitlines()
You don't need to initialize the lines list with [], since you are building the whole list in one shot and assigning it to lines. You only need to initialize a list when you are going to append() to it. So the line below isn't needed:
lines = []
You can handle KeyError by the following:
try:
    value = soup.find('pre', text=True)[0]
    return value
except KeyError:
    return None
I also show how you can use the Python requests library for retrieving the raw HTML page. This makes it easy to check the status code to see whether the retrieval was successful. You can replace the relevant urllib lines with this if you like.
You can install requests in the command line using pip: pip install requests
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

import requests
import urllib2, urllib
from bs4 import BeautifulSoup


def get_html_with_urllib(word):
    url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
    html = urllib.urlopen(url).read()
    return html


def get_html(word):
    url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
    response = requests.get(url)

    # Something bad happened
    if response.status_code != 200:
        return ""

    # Did not get back html
    if not response.headers["Content-Type"].startswith("text/html"):
        return ""

    html = response.content
    return html


def format_definitions(raw_definitions_text):
    # Get individual lines in definitions text
    parts = raw_definitions_text.split('\n')

    # Convert to str
    # Remove extra spaces on the left.
    # Add one space at the end for later joining with next line
    parts = map(lambda x: str(x).lstrip() + ' ', parts)

    result = []
    current = ""
    for p in parts:
        if re.search("\w*[0-9]+:", p):
            # Start of new line. Contains some char followed by <number>:
            # Save previous lines
            result.append(current.replace('\n', ' '))
            # Set start of current line
            current = p
        else:
            # Continue line
            current += p
    result.append(current)
    return '\n'.join(result)


def get_definitions(word):
    # Uncomment this to use requests
    # html = get_html(word)
    # Could not get definition
    # if not html:
    #     return None
    html = get_html_with_urllib(word)
    soup = BeautifulSoup(html, "html.parser")

    # Get block containing definition
    definitions = soup.find("pre").get_text()
    definitions = format_definitions(definitions)
    return definitions


def batch_query(input_filepath):
    with open(input_filepath) as infile:
        for word in infile:
            word = word.strip()  # Remove spaces from both ends
            definitions = get_definitions(word)
            if not definitions:
                print("Could not retrieve definitions for {word}".format(word=word))
            print("Definition for {word} is: ".format(word=word))
            print(definitions)


def main():
    input_filepath = sys.argv[1]  # Alternatively, change this to file containing words
    batch_query(input_filepath)


if __name__ == "__main__":
    main()
Output:
Definition for cat is:
cat
n 1: feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats [syn: true cat]
2: an informal term for a youth or man; "a nice guy"; "the guy's only doing it for some doll" [syn: guy, hombre, bozo]
3: a spiteful woman gossip; "what a cat she is!"
4: the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant; "in Yemen kat is used daily by 85% of adults" [syn: kat, khat, qat, quat, Arabian tea, African tea]
5: a whip with nine knotted cords; "British sailors feared the cat" [syn: cat-o'-nine-tails]
6: a large vehicle that is driven by caterpillar tracks; frequently used for moving earth in construction and farm work [syn: Caterpillar]
7: any of several large cats typically able to roar and living in the wild [syn: big cat]
8: a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis [syn: computerized tomography, computed tomography, CT, computerized axial tomography, computed axial tomography]
v 1: beat with a cat-o'-nine-tails
2: eject the contents of the stomach through the mouth; "After drinking too much, the students vomited"; "He purged continuously"; "The patient regurgitated the food we gave him last night" [syn: vomit, vomit up, purge, cast, sick, be sick, disgorge, regorge, retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate, throw up] [ant: keep down] [also: catting, catted]
Definition for dog is:
dog
n 1: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night" [syn: domestic dog, Canis familiaris]
2: a dull unattractive unpleasant girl or woman; "she got a reputation as a frump"; "she's a real dog" [syn: frump]
3: informal term for a man; "you lucky dog"
4: someone who is morally reprehensible; "you dirty dog" [syn: cad, bounder, blackguard, hound, heel]
5: a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll [syn: frank, frankfurter, hotdog, hot dog, wiener, wienerwurst, weenie]
6: a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward [syn: pawl, detent, click]
7: metal supports for logs in a fireplace; "the andirons were too hot to touch" [syn: andiron, firedog, dog-iron] v : go after with the intent to catch; "The policeman chased the mugger down the alley"; "the dog chased the rabbit" [syn: chase, chase after, trail, tail, tag, give chase, go after, track] [also: dogging, dogged]
Definition for car is:
car
n 1: 4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work" [syn: auto, automobile, machine, motorcar]
2: a wheeled vehicle adapted to the rails of railroad; "three cars had jumped the rails" [syn: railcar, railway car, railroad car]
3: a conveyance for passengers or freight on a cable railway; "they took a cable car to the top of the mountain" [syn: cable car]
4: car suspended from an airship and carrying personnel and cargo and power plant [syn: gondola]
5: where passengers ride up and down; "the car was on the top floor" [syn: elevator car]

Parsing through sequence output- Python

I have this data from sequencing a bacterial community.
I know some basic Python and am in the midst of completing the codecademy tutorial.
For practical purposes, please think of OTU as another word for "species"
Here is an example of the raw data:
OTU ID OTU Sum Lineage
591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752 517 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456 346 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248 330 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284 321 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__
The data includes three things: a reference number for the species, how many times that species is in the sample, and the taxonomy of said species.
What I'm trying to do is add up all the times a sequence is found for a taxonomic family (designated as f_x in the data).
Here is an example of the desired output:
f__Fusobacteriaceae 1600
f__Alcaligenaceae 676
f__Comamonadaceae 321
This isn't for a class. I started learning python a few months ago, so I'm at least capable of looking up any suggestions. I know how it works out from doing it the slow way (copy & paste in excel), so this is for future reference.
If the lines in your file really look like this, you can do
from collections import defaultdict
import re

nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split twice on any whitespace
        if items[0].isdigit():
            key = re.search(r"f__\w+", items[2]).group(0)
            nums[key] += int(items[1])
Result:
>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600,
'f__Alcaligenaceae': 676})
Yet another solution, using collections.Counter:
from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()
        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)
            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]
            # Extract family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]
            # Increase count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print "%s %s" % (family, count)
See the docs for str.split() for the details of the None argument (matching consecutive whitespace).
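A minimal, hypothetical illustration of what split(None, 2) gives you for one of the data rows (lineage truncated for brevity):

line = "591820 1083 k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae; g__u114; s__"
otu_id, otu_sum, lineage = line.split(None, 2)
print(otu_id)    # '591820'
print(otu_sum)   # '1083'
print(lineage)   # 'k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae; g__u114; s__'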
Get all your raw data and process it first - in other words, structure it, and then use the structured data for whatever operations you need; see the sketch below.
If you have GBs of data, you could use Elasticsearch: feed in the raw data, query for the string you need (f__* in this case), get all the matching entries, and add them up.
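A minimal sketch of the "structure first, then aggregate" idea in plain Python (hypothetical field names, no Elasticsearch involved):

import re

# First pass: structure each raw row into a small record.
records = []
with open("file.txt") as f:
    next(f)  # skip the header line
    for line in f:
        if not line.strip():
            continue
        otu_id, otu_sum, lineage = line.split(None, 2)
        match = re.search(r"f__\w+", lineage)
        records.append({
            "otu_id": otu_id,
            "count": int(otu_sum),
            "family": match.group(0) if match else None,
        })

# Second pass: aggregate over the structured records.
totals = {}
for rec in records:
    if rec["family"]:
        totals[rec["family"]] = totals.get(rec["family"], 0) + rec["count"]
print(totals)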
That's very doable with basic Python. Keep the Library Reference under your pillow, as you'll want to refer to it often.
You'll likely end up doing something similar to this (I'll write it the longer, more readable way - there are ways to compress the code and do this more quickly).
# Open up a file handle
file_handle = open('myfile.txt')

# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()

    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()

    # Get the first column
    reference_number = line_parts[0]

    # Get the second column, convert it to an integer
    sum = int(line_parts[1])

    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue

        # remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')

        if taxonomy in sums:
            sums[taxonomy] += int(sum)
        else:
            sums[taxonomy] = int(sum)

# All done, do some fancy reporting. We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))
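One hypothetical way to do the sorting left as an exercise - largest totals first:

for taxonomy, total in sorted(sums.items(), key=lambda kv: kv[1], reverse=True):
    print("%s %d" % (taxonomy, total))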

python stops working in the middle of dataset

I wrote a script to read and plot data into the graphs. I have three input files
wells.csv: a list of observation wells that I want to create graphs for
1201
1202
...
well_summary_table.csv: contains information for each well (e.g. reference elevation, depth to water)
Bore_Name Ref_elev
1201 20
data.csv: contains observation data for each well (e.g. pH, Temp)
RowId Bore_Name Depth pH
1 1201 2 7
Not all wells in wells.csv have data to plot
My script is as follows:
well_name_list = []
new_depth_list = []
pH_list = []

from pylab import *

infile = open("wells.csv", 'r')
for line in infile:
    line = line.strip('\n')
    well = line
    if not well in well_name_list:
        well_name_list.append(well)
infile.close()

for well in well_name_list:
    infile1 = open("well_summary_table.csv", 'r')
    infile2 = open("data.csv", 'r')
    for line in infile1:
        line = line.rstrip()
        if not line.startswith('Bore_Name'):
            words = line.split(',')
            well_name1 = words[0]
            if well_name1 == well:
                ref_elev = words[1]
                for line in infile2:
                    if not line.startswith("RowId"):
                        line = line.strip('\n')
                        words = line.split(',')
                        well_name2 = words[1]
                        if well_name2 == well:
                            depth = words[2]
                            new_depth = float(ref_elev) - float(depth)
                            pH = words[3]
                            new_depth_list.append(float(new_depth))
                            pH_list.append(float(pH))
    fig.plt.figure(figsize=(2, 2.7), facecolor='white')
    plt.axis([0, 8, 0, 60])
    plt.plot(pH_list, new_depth_list, linestyle='', marker='o')
    plt.savefig(well + '.png')
    new_depth_list = []
    pH_list = []
infile1.close()
infile2.close()
It works on more than half of my well list, then it stops without giving me any error message. I don't know what is going on. Can anyone help me with this problem? Sorry if it is an obvious question, I am a newbie.
Many thanks,
@tcaswell spotted a potential issue: you aren't closing infile1 and infile2 after each time you open them, so at the very least you'll have a lot of open file handles floating around, depending on how many wells are in wells.csv. In some versions of Python this may cause issues, but it may not be the only problem - it's hard to say without some test data files. There might also be an issue with seeking back to the start of the file when you move on to the next well. Either of these could cause the behaviour you're seeing. You should avoid problems like this by using with to manage the scope of your open files.
You should also use a dictionary to marry up the well names with the data, and read all of the data up front before doing any plotting. This will let you see exactly how you've constructed your data set and where any issues exist.
I've made a few stylistic suggestions below too. This is obviously incomplete, but hopefully you get the idea!
import csv
from pylab import *  # imports should always go before declarations

well_details = {}  # empty dict

with open('wells.csv', 'r') as well_file:
    well_reader = csv.reader(well_file, delimiter=',')
    for row in well_reader:
        well_name = row[0]
        if well_name not in well_details:
            well_details[well_name] = {}  # dict to store pH, depth, ref_elev

with open('well_summary_table.csv', 'r') as elev_file:
    elev_reader = csv.reader(elev_file, delimiter=',')
    for row in elev_reader:
        well_name = row[0]
        if well_name in well_details:
            well_details[well_name]['elev_ref'] = row[1]
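Worth spelling out the gotcha behind the "seeking back to the start" point above: once a file object has been iterated to the end, looping over it again yields nothing unless you explicitly rewind it. A tiny hypothetical demonstration (assumes any existing text file):

with open('data.csv') as f:
    first_pass = sum(1 for _ in f)    # consumes the whole file
    second_pass = sum(1 for _ in f)   # iterator is exhausted, so this is 0
    f.seek(0)                         # rewind explicitly...
    third_pass = sum(1 for _ in f)    # ...and you can read it again
print(first_pass, second_pass, third_pass)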
