I have a tags object from treetagger's Python wrapper that is apparently a list:
In:
print type (tags)
Out:
<type 'list'>
When I print the content of tags as follows, I get the following lists:
In:
def postag_directory(input_directory, output_directory):
    import codecs, treetaggerwrapper, glob, os
    for filename in sorted(glob.glob(os.path.join(input_directory, '*.txt'))):
        with codecs.open(filename, encoding='utf-8') as f:
            lines = [f.read()]
            #print 'lines:\n',lines
            tagger = treetaggerwrapper.TreeTagger(TAGLANG = 'en')
            tags = tagger.TagText(lines)
            print '\n\n**** labels:****\n\n',tags
Out:
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur']]
[[u'This\tDT\tthis', u'my\tPP$\tmy']]
[[u'was\tVBD\tbe', u'to\tTO\tto', u'be\tVB\tbe', u'my\tPP$\tmy', u'camera\tNN\tcamera', u'for\tIN\tfor', u'long-distance\tJJ\tlong-distance', u'backpacking\tNN\tbackpacking', u'trips\tNNS\ttrip', u'.\tSENT\t.', u'It\tPP\tit']]
However, I would like to get just one single nested list like this:
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur'],[u'This\tDT\tthis', u'my\tPP$\tmy'],[u'was\tVBD\tbe', u'to\tTO\tto', u'be\tVB\tbe', u'my\tPP$\tmy', u'camera\tNN\tcamera', u'for\tIN\tfor', u'long-distance\tJJ\tlong-distance', u'backpacking\tNN\tbackpacking', u'trips\tNNS\ttrip', u'.\tSENT\t.', u'It\tPP\tit']]
I already tried list(), append(), [] and also:
for sublist in [item]:
    new_list = []
    new_list.append(sublist)
    print new_list
Any idea how I can nest each list from tags?
This is a list of one element (another list).
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur']]
So if item is a list of lists, each with one element, then you can do
new_list = [sublist[0] for sublist in item]
If you had more than one element in each sublist, then you'll need another nested loop in that.
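As a minimal sketch of that nested loop (the sample data below is made up for illustration and only reuses tag strings from the question), a comprehension with two for clauses removes exactly one level of nesting, however many inner lists each wrapper holds:

```python
# Hypothetical input: each entry of `item` wraps one or more inner lists.
item = [
    [[u'I\tPP\tI', u'am\tVBP\tbe']],
    [[u'This\tDT\tthis'], [u'my\tPP$\tmy']],
]

# The first clause walks the wrappers, the second walks what they contain.
new_list = [inner for sublist in item for inner in sublist]
print(new_list)
```

With exactly one inner list per wrapper this gives the same result as sublist[0], so it can be used in both cases.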
Though, in reality, you shouldn't use lines = [f.read()]. The documentation uses a single string when you use tag_text, so start with this
# Initialize one tagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
# Loop over the files
all_tags = []
for filename in sorted(glob.glob(os.path.join(input_directory, '*.txt'))):
    with codecs.open(filename, encoding='utf-8') as f:
        # Read the file
        content = f.read()
        # Tag it
        tags = tagger.tag_text(content)
        # add those tags to the master tag list
        all_tags.append(tags)
print(all_tags)
I have several files in a directory, and I want to match them against a list of strings and collect the matches in a dictionary.
The files in the directory look like the following:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list has part of the filename as,
ABC
DEF
So here is what I tried,
import os
import re
dir='/user/home/files'
list='/user/home/list'
samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. But the script is not finding any matches inside the if blocks, so the lists in samp1 and samp2 stay empty.
This is what the output should look like for samp1 and samp2:
{'ABC': [DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz, DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz], 'DEF': [DB_DEF_S1_001_MM_R1.faq.gz, DB_DEF_S1_001_MM_R2.faq.gz]}
Any help would be greatly appreciated
A lot of this code you probably don't need. You could just see if the substring that you have from list is in dir.
The code below reads in the data as lists. You seem to have already done this, so it will simply be a matter of replacing files with the file names you read in from dir and replacing st with the substrings from list (which you shouldn't use as a variable name since it is actually used for something else in Python).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
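For reference, here is a small sketch of why the regex in the question never matches a key, and of one way to keep separate R1 and R2 dictionaries. It uses the file names from the question; the "_ID_" and "_R1"/"_R2" substring tests are assumptions about the naming scheme, not something the question confirms.

```python
import re

files = [
    "DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
    "DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
    "DB_DEF_S1_001_MM_R1.faq.gz",
    "DB_DEF_S1_001_MM_R2.faq.gz",
]
sample_ids = ["ABC", "DEF"]

# group(1) is everything before "_R1", not the short sample ID,
# so the test `m1.group(1) in samp1` can never succeed.
captured = re.search(r"(.*)_R1", files[0]).group(1)
print(captured)

samp1 = {s: [] for s in sample_ids}  # R1 files per sample
samp2 = {s: [] for s in sample_ids}  # R2 files per sample
for name in files:
    for s in sample_ids:
        # Assumption: the ID is always surrounded by underscores.
        if "_%s_" % s not in name:
            continue
        if "_R1" in name:
            samp1[s].append(name)
        elif "_R2" in name:
            samp2[s].append(name)

print(samp1)
print(samp2)
```

Matching on "_ABC_" rather than plain "ABC" is a guard against IDs that happen to be substrings of other parts of the name; drop it if your IDs are not delimited that way.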
You can pass the Python script a directory path, define id_list, use its entries as dict keys, and append each fastq file whose name contains a key:
import os
import sys
dir_path = sys.argv[1]
fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)
id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches',_)]
del linkslist[0:2]
# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]
# remove duplicates
seen = set()
every_link = [] # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)
def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename
every_link_test = every_link[0:5]
print every_link_test
count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    print count, filename
As the error message explains, you cannot use split the way you are using it. split is a string method, but processURL_short_2 returns the tuple (item_str, filename), so content_1 is a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] by running your code; that element is the chunk of text you are looking to search through.
@TigerhawkT3 has a good suggestion in their answer that you should follow too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 has been modified to return a tuple, you'll need to unpack it. Reset count for every speech as well; otherwise each entry holds the running total rather than that speech's own count.
data = {} # initialize a dictionary
for l in every_link_test:
    content_1, filename = processURL_short_2(l) # unpack the content and filename
    count = 0 # restart the count for this speech
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_speech-4427': 79, 'obama_speech-4424': 22, ...}, allowing you to easily store and access your parsed data.
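The per-speech counting can be tried without any network access. The sketch below is only an illustration: the CONTRACTIONS set and the two sample texts are invented, since the question never shows how contractions is defined.

```python
from string import punctuation

# Hypothetical stand-in for the `contractions` collection used in the question.
CONTRACTIONS = {"don't", "can't", "it's", "we're"}


def count_contractions(text, contractions=CONTRACTIONS):
    """Count the whitespace-separated words of `text` that are contractions."""
    count = 0
    for word in text.lower().split():
        # strip() only removes leading/trailing punctuation, so the
        # apostrophe inside "can't" is kept.
        if word.strip(punctuation) in contractions:
            count += 1
    return count


# Invented sample texts keyed by a filename-like label.
speeches = {
    "speech_1": "We can't stop now. It's time, and we don't quit.",
    "speech_2": "No contractions appear in this one.",
}
data = {name: count_contractions(text) for name, text in speeches.items()}
print(data)
```

Keeping the counter local to a function makes it impossible to forget the reset between speeches.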
I have a list in my python code with the following structure:
file_info = ['{file:C:\\samples\\123.exe, directory:C:\\}','{file:C:\\samples\\345.exe, directory:C:\\}',...]
I want to extract just the file and directory values for every value of the list and print it. With the following code, I am able to extract the directory values:
for item in file_info:
    print item.split('directory:')[1].strip('}')
But I am not able to figure out a way to extract the 'file' values. The following doesn't work:
print item.split('file:')[1].strip(', directory:C:\}')
Suggestions? If there is any better method to extract the file and directory values other than this, that would be great too. Thanks in advance.
If the format is exactly what you've shown, you're better off using re:
import re
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
pattern = re.compile(r'\w+:(\w+)')
for item in file_info:
    print re.findall(pattern, item)
or, using string replace(), strip() and split() (a bit hackish and fragile):
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
for item in file_info:
    item = item.strip('}{').replace('file:', '').replace('directory:', '')
    print item.split(', ')
both code snippets print:
['file1', 'dir1']
['file2', 'directory2']
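One caveat, shown as a hedged sketch: the samples above use simplified values, while the strings in the question hold Windows paths. \w does not match a backslash, colon or dot, so the \w+:(\w+) pattern picks up only the drive letter there. Assuming every entry really has the exact layout {file:..., directory:...}, str.partition is one way to split it (parse_entry is a helper name made up for this example):

```python
import re

file_info = [
    "{file:C:\\samples\\123.exe, directory:C:\\}",
    "{file:C:\\samples\\345.exe, directory:C:\\}",
]

# On real paths the simple pattern stops at the first non-word character.
drive_only = re.findall(r"\w+:(\w+)", file_info[0])
print(drive_only)


def parse_entry(entry):
    """Return (file, directory) from one '{file:..., directory:...}' string."""
    body = entry.strip("{}")
    # Assumption: ', directory:' appears exactly once, as the separator.
    file_part, _, dir_part = body.partition(", directory:")
    return file_part[len("file:"):], dir_part


parsed = [parse_entry(entry) for entry in file_info]
for file_value, dir_value in parsed:
    print(file_value, dir_value)
```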
If the file_info items are just dumped JSON items (watch the double quotes), you can use json to load them into dictionaries:
import json
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = json.loads(item)
    print item['file'], item['directory']
or, literal_eval():
from ast import literal_eval
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = literal_eval(item)
    print item['file'], item['directory']
both code snippets print:
file1 dir1
file2 directory2
Hope that helps.
I would do:
import re
regx = re.compile('{\s*file\s*:\s*([^,\s]+)\s*'
                  ','
                  '\s*directory\s*:\s*([^}\s]+)\s*}')
file_info = ['{file:C:\\samples\\123.exe, directory : C:\\}',
             '{ file: C:\\samples\\345.exe,directory:C:\\}'
             ]
for item in file_info:
    print '%r\n%s\n' % (item,
                        regx.search(item).groups())
result
'{file:C:\\samples\\123.exe, directory : C:\\}'
('C:\\samples\\123.exe', 'C:\\')
'{ file: C:\\samples\\345.exe,directory:C:\\}'
('C:\\samples\\345.exe', 'C:\\')
I have two fastq files like the one given below. Each record in the file starts with '#'. My aim is to extract the records that are common between the two files.
#IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
#IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
I have tried this:
First I get a list of read IDs that are common to file1 and file2.
import sys
#('reading files and storing all lines in a list')
data1 = open(sys.argv[1]).read().splitlines()
data2 = open(sys.argv[2]).read().splitlines()
#('listing all read IDs from file1')
list1 = []
for item in data1:
    if '#' in item:
        list1.append(item)
#('listing all read IDs from file2')
list2 = []
for item in data2:
    if '#' in item:
        list2.append(item)
#('finding common reads in file1 and file2')
def intersect(a, b):
    return list(set(a) & set(b))
common = intersect(list1, list2)
Here, I search for the common IDs in the main file and export the data to a new file. The following code works fine for small files but freezes my computer if I try it with larger files. I believe that the 'for' loop is taking too much memory:
#('filtering read data from file1')
mod_data1 = open(sys.argv[1]).read().rstrip('\n').replace('#', ',#')
tab1 = open(sys.argv[1] + '_final', 'wt')
records1 = mod_data1.split(',')
for item in records1[1:]:
    if item.replace('\n', '\t').split('\t')[0] in common:
        tab1.write(item)
Please suggest what I should do with the code above so that it works on larger files (40-100 million records per file, with each record being 4 lines).
Using a list comprehension, you could write:
list1 = [item for item in data1 if '#' in item]
list2 = [item for item in data2 if '#' in item]
You could also define them as sets directly using a set comprehension (depending on the version of Python you are using).
set1 = {item for item in data1 if '#' in item}
set2 = {item for item in data2 if '#' in item}
I'd expect creating the set from the beginning to be faster than creating a list and then making a set out of it.
As for the second part of the code, I am not quite sure yet what you are trying to achieve.
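For the large-file case, one possible approach (a sketch under assumptions, not a tested solution for 100-million-record inputs) is to stop loading whole files and stream the records four lines at a time. Only the read IDs of one file are kept in memory, as a set, so each membership test is a hash lookup instead of a scan through the common list. The sketch assumes every record is exactly four lines and that the first line of a record is its ID; the function names and the tiny temporary files are invented for the demonstration.

```python
import os
import tempfile
from itertools import islice


def read_records(handle):
    """Yield one record at a time as a list of its 4 lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            break
        yield record


def collect_ids(path):
    """Return the set of ID lines (first line of each record) in `path`."""
    with open(path) as handle:
        return {record[0].rstrip("\n") for record in read_records(handle)}


def write_common(source_path, ids, output_path):
    """Copy the records of `source_path` whose ID is in `ids`; return how many."""
    kept = 0
    with open(source_path) as src, open(output_path, "w") as out:
        for record in read_records(src):
            if record[0].rstrip("\n") in ids:
                out.writelines(record)
                kept += 1
    return kept


# Tiny made-up files, just to exercise the functions.
workdir = tempfile.mkdtemp()
path1 = os.path.join(workdir, "file1.fastq")
path2 = os.path.join(workdir, "file2.fastq")
with open(path1, "w") as f:
    f.write("id_a\nACGT\n+\nIIII\nid_b\nTTTT\n+\nIIII\n")
with open(path2, "w") as f:
    f.write("id_b\nTTTT\n+\nIIII\nid_c\nGGGG\n+\nIIII\n")

kept = write_common(path1, collect_ids(path2), path1 + "_final")
print(kept)
```

Running write_common a second time with the roles of the two files swapped would produce the matching output for file2. The ID set still has to fit in memory, so for very large inputs this reduces the footprint rather than removing it.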
This is somewhat of a supplementary question to my recent query about searching dictionary items in a list:
Check if python dictionary contains value and if so return related value
I have an array containing dictionaries in the format:
fileList = [
    {"fileName": "file1.txt", "fileMod": "0000048723"},
    {"fileName": "file2.txt", "fileMod": "0000098573"}
]
I was able to return a list of fileMod values for existing items in the fileList using
a rather neat list comprehension as suggested:
fileMod = [item['fileMod'] for item in fileList if item['fileName'] == filename]
This returns a value if there is a matching filename, but I forgot to include that I also need to know when there is a filename that does not match any of the entries in filelist.
I am sure this should be simple, but I think I have just been looking at it too long to see the woods for the trees.
Perhaps you should use a dictionary rather than a list?
files = {
    'file1.txt': {'fileMod': '0000048723'},
    'file2.txt': {'fileMod': '0000098573'}
}
This stores the same information as your list, but finding elements is easy:
mod = None
if 'file1.txt' in files:
    mod = files['file1.txt']['fileMod']
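If the data already exists as a list, such a dictionary can be built from it in one pass. The following is a small sketch, assuming file names are unique; it maps each name straight to its fileMod string (a flatter layout than the nested dictionary above), and lookup is a helper name chosen for this example:

```python
fileList = [
    {"fileName": "file1.txt", "fileMod": "0000048723"},
    {"fileName": "file2.txt", "fileMod": "0000098573"},
]

# Index the list once by file name (assumes names are unique).
files = {entry["fileName"]: entry["fileMod"] for entry in fileList}


def lookup(filename):
    """Return the fileMod for `filename`, or None when it is not listed."""
    return files.get(filename)


print(lookup("file1.txt"))
print(lookup("missing.txt"))
```

dict.get covers both cases in a single call: a hit returns the value, and a miss returns None instead of raising KeyError.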
"Checking in Python if a list of dictionaries does NOT contain a specific value"
if not any(item for item in fileList if item['fileName'] == filename):
evaluates to True if no dictionary in your list fulfills the condition.
This is possibly faster than building the whole fileMod list, because any() stops as soon as a match is found.
"This returns a value if there is a matching filename, but I forgot to include that I also need to know when there is a filename that does not match any of the entries in filelist." (a different question?)
fileMod = []
fileBad = []
for item in fileList:
    if item['fileName'] == filename:
        fileMod.append(item['fileMod'])
    else:
        fileBad.append(item['fileMod'])
or
fileMod = {True: [], False: []} # a dictionary of lists
for item in fileList:
    fileMod[item['fileName'] == filename].append(item['fileMod'])
This last code builds a dict of lists: fileMod[True] is the list of matches, and fileMod[False] is the list of non-matches.
If the filename doesn't match any entry in the filelist then the list fileMod would be empty.
>>> if fileMod:
...     pass  # Code when the filename matches at least one file
... else:
...     pass  # Code when the filename doesn't match any entry.
To check for empty lists in Python:
>>> l = []
>>> if l:
...     print "not empty"
... else:
...     print "empty"
...
empty
Please note that
fileMod = [item['fileMod'] for item in fileList if item['fileName'] == filename]
returns a list of modification dates (not a single value);
if the list has zero length, then no item in fileList matches the filename.
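When only the first match matters, next() with a default is an alternative to building the list and testing its length. A minimal sketch, where find_file_mod is a name invented for this example:

```python
fileList = [
    {"fileName": "file1.txt", "fileMod": "0000048723"},
    {"fileName": "file2.txt", "fileMod": "0000098573"},
]


def find_file_mod(file_list, filename):
    """Return the fileMod of the first matching entry, or None if none matches."""
    matches = (item["fileMod"] for item in file_list
               if item["fileName"] == filename)
    # next() stops at the first match; the second argument is the fallback.
    return next(matches, None)


print(find_file_mod(fileList, "file2.txt"))
print(find_file_mod(fileList, "file3.txt"))
```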