Extract values from within strings in a list - python

I have a list in my python code with the following structure:
file_info = ['{file:C:\\samples\\123.exe, directory:C:\\}','{file:C:\\samples\\345.exe, directory:C:\\}',...]
I want to extract just the file and directory values for every value of the list and print it. With the following code, I am able to extract the directory values:
for item in file_info:
    print item.split('directory:')[1].strip('}')
But I am not able to figure out a way to extract the 'file' values. The following doesn't work:
print item.split('file:')[1].strip(', directory:C:\}')
Suggestions? If there is any better method to extract the file and directory values other than this, that would be great too. Thanks in advance.
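
A likely reason the 'file' attempt fails, assuming standard str.strip behaviour: strip treats its argument as a set of characters to remove from both ends, not as a suffix. The sketch below (Python 3 syntax, reusing the first sample string from the question) shows the effect and one possible alternative based on split.

```python
item = '{file:C:\\samples\\123.exe, directory:C:\\}'

tail = item.split('file:')[1]

# strip() removes any of the listed characters from BOTH ends,
# so it also eats the leading 'C:\' and the last 'e' of '.exe'.
stripped = tail.strip(', directory:C:\\}')
print(stripped)

# Splitting on the separator instead keeps the path intact.
file_value = tail.split(', directory:')[0]
print(file_value)
```

The second approach still assumes the separator is exactly ', directory:', so it is only as robust as the input format.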

If the format is exactly what you've shown, you could use re:
import re
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
pattern = re.compile(r'\w+:(\w+)')
for item in file_info:
    print re.findall(pattern, item)
or, using string replace(), strip() and split() (a bit hackish and fragile):
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
for item in file_info:
    item = item.strip('}{').replace('file:', '').replace('directory:', '')
    print item.split(', ')
both code snippets print:
['file1', 'dir1']
['file2', 'directory2']
If the file_info items are just dumped json items (watch the double quotes), you can use json to load them into dictionaries:
import json
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = json.loads(item)
    print item['file'], item['directory']
or, literal_eval():
from ast import literal_eval
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = literal_eval(item)
    print item['file'], item['directory']
both code snippets print:
file1 dir1
file2 directory2
Hope that helps.

I would do:
import re
# Raw strings keep the backslashes in the pattern literal.
regx = re.compile(r'{\s*file\s*:\s*([^,\s]+)\s*'
                  r','
                  r'\s*directory\s*:\s*([^}\s]+)\s*}')
file_info = ['{file:C:\\samples\\123.exe, directory : C:\\}',
             '{ file: C:\\samples\\345.exe,directory:C:\\}'
             ]
for item in file_info:
    print '%r\n%s\n' % (item,
                        regx.search(item).groups())
result
'{file:C:\\samples\\123.exe, directory : C:\\}'
('C:\\samples\\123.exe', 'C:\\')
'{ file: C:\\samples\\345.exe,directory:C:\\}'
('C:\\samples\\345.exe', 'C:\\')
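
For the original strings with Windows paths, a regex-free variant is also possible. This is a minimal sketch in Python 3 syntax, assuming every item has exactly the shape '{file:..., directory:...}'; parse_entry is a hypothetical helper name, not something from the answers above.

```python
def parse_entry(entry):
    """Split '{file:..., directory:...}' into a dict (hypothetical helper)."""
    body = entry.strip().strip('{}')
    # partition() splits on the first occurrence of the separator only.
    file_part, _, dir_part = body.partition(', directory:')
    return {
        'file': file_part.partition('file:')[2].strip(),
        'directory': dir_part.strip(),
    }


file_info = ['{file:C:\\samples\\123.exe, directory:C:\\}',
             '{file:C:\\samples\\345.exe, directory:C:\\}']

parsed = [parse_entry(entry) for entry in file_info]
for entry in parsed:
    print(entry['file'], entry['directory'])
```

Unlike the regex above, this sketch does not tolerate extra whitespace around the colons or a missing space after the comma.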

Related

Python regular expression for a string and match them into a dictionary

I have several files in a directory and I want to match them against a list of strings and collect the matches in a dictionary.
The files in the directory look like this:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list contains part of each filename:
ABC
DEF
So here is what I tried,
import os
import re
dir='/user/home/files'
list='/user/home/list'
samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. But the script is not finding any matches inside the if blocks, so the lists in samp1 and samp2 stay empty.
This is what the output should look like for samp1 and samp2:
{'ABC': [DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz, DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz], 'DEF': [DB_DEF_S1_001_MM_R1.faq.gz, DB_DEF_S1_001_MM_R2.faq.gz]}
Any help would be greatly appreciated
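
One plausible explanation, sketched in Python 3 syntax with the first filename from the question: the greedy group in '(.*)_R1' captures everything before '_R1', which is much longer than the sample ID used as a dictionary key, so the membership test never succeeds.

```python
import re

name = 'DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz'
samp1 = {'ABC': [], 'DEF': []}

m1 = re.search('(.*)_R1', name)
prefix = m1.group(1)

print(prefix)           # the whole prefix before '_R1', not just the sample ID
print(prefix in samp1)  # the dictionary keys are only 'ABC' and 'DEF'
```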
You probably don't need a lot of this code. You could just check whether each substring from list occurs in each file name from dir.
The code below reads in the data as lists. You seem to have already done this, so it will simply be a matter of replacing files with the file names you read in from dir and replacing st with the substrings from list (which you shouldn't use as a variable name since it is actually used for something else in Python).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
You can pass the script the directory path as an argument, define id_list, use its entries as dict keys, and append each fastq filename that contains a key:
import os
import sys
dir_path = sys.argv[1]
fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)
id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
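
If the R1 and R2 reads also need to be kept apart, as the two dictionaries in the question suggest, the substring test can be combined with a small regex. This is only a sketch in Python 3 syntax under the assumption that the read tag always appears as '_R1' or '_R2' followed by '_' or '.', and that sample IDs are delimited by underscores; READ_RE and group_reads are hypothetical names.

```python
import re

# '_R1' or '_R2' followed by '_' or '.', e.g. '_R1_001' or '_R2.faq.gz'
READ_RE = re.compile(r'_(R[12])(?=[_.])')


def group_reads(filenames, sample_ids):
    """Map each sample ID to its R1 and R2 files (hypothetical helper)."""
    groups = {sid: {'R1': [], 'R2': []} for sid in sample_ids}
    for fname in filenames:
        match = READ_RE.search(fname)
        if match is None:
            continue  # no recognisable read tag in this name
        for sid in sample_ids:
            # Require underscores around the ID to avoid partial matches.
            if '_%s_' % sid in fname:
                groups[sid][match.group(1)].append(fname)
    return groups


files = ['DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz',
         'DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz',
         'DB_DEF_S1_001_MM_R1.faq.gz',
         'DB_DEF_S1_001_MM_R2.faq.gz']

groups = group_reads(files, ['ABC', 'DEF'])
print(groups)
```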

Regex on list element in for loop

I have a script that searches through config files and finds all matches of strings from another list as follows:
dstn_dir = "C:/xxxxxx/foobar"
dst_list =[]
files = [fn for fn in os.listdir(dstn_dir)if fn.endswith('txt')]
dst_list = []
for file in files:
    parse = CiscoConfParse(dstn_dir+'/'+file)
    for sfarm in search_str:
        int_objs = parse.find_all_children(sfarm)
        if len(int_objs) > 0:
            dst_list.append(["\n","#" *40,file + " " + sfarm,"#" *40])
            dst_list.append(int_objs)
I need to change this part of the code:
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm)
search_str is a list containing strings similar to ['xrout:55','old:23'] and many others.
I need it to find only entries that end with the string from the list I am iterating over as sfarm. My understanding is that this requires using re and matching on something like sfarm$, but I'm not sure how to do this as part of the loop.
Am I correct in saying that sfarm is an iterable? If so, I need to know how to apply a regex to an iterable object in this context.
Strings in python are iterable, so sfarm is an iterable, but that has little meaning in this case. From reading what CiscoConfParse.find_all_children() does, it is apparent that your sfarm is the linespec, which is a regular expression string. You do not need to explicitly use the re module here; just pass sfarm concatenated with '$':
search_str = ['xrout:55','old:23']
...
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm + '$') # one of many ways to concat
...
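
The effect of appending '$' can be checked with plain re, without CiscoConfParse. The sketch below uses Python 3 syntax; the configuration lines are made up for illustration, and re.escape is an extra precaution of mine in case a search string contains regex metacharacters, not part of the answer above.

```python
import re

# Hypothetical configuration lines, for illustration only.
config_lines = [
    'serverfarm host xrout:55',
    'serverfarm host xrout:555',
    'description old:23 backup',
    'probe old:23',
]
search_str = ['xrout:55', 'old:23']

matches = {}
for sfarm in search_str:
    # '$' anchors the pattern at the end of the line being searched.
    pattern = re.compile(re.escape(sfarm) + '$')
    matches[sfarm] = [line for line in config_lines if pattern.search(line)]

print(matches)
```

Whether to escape depends on the intent: if the entries in search_str are meant to be regular expressions themselves, re.escape should be left out.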
Please check this code. It uses the glob module to collect all "*.txt" files in the folder.
See the glob module documentation for more information.
import glob
import re
dst_list = []
search_str = ['xrout:55','old:23']
for file_name in glob.glob(r'C:/Users/dinesh_pundkar\Desktop/*.txt'):
    with open(file_name,'r') as f:
        text = f.read()
    for sfarm in search_str:
        regex = re.compile('%s$'%sfarm)
        int_objs = regex.findall(text)
        if len(int_objs) > 0:
            dst_list.append(["\n","#" *40,file_name + " " + sfarm,"#" *40])
            dst_list.append(int_objs)
print dst_list
Output:
C:\Users\dinesh_pundkar\Desktop>python a.py
[['\n', '########################################', 'C:/Users/dinesh_pundkar\\Desktop\\out.txt old:23', '########################################'], ['old:23']]
C:\Users\dinesh_pundkar\Desktop>
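
One caveat when running a '$' pattern over the text of a whole file: by default '$' matches only at the very end of the string (or just before a final newline), so lines in the middle of the file are not considered. A minimal sketch in Python 3 syntax, with a made-up text, shows the difference that re.MULTILINE makes.

```python
import re

# Hypothetical file content, for illustration only.
text = 'interface old:23\nroute xrout:55\nold:23 trailing\n'

# Without MULTILINE, '$' is tested only at the end of the whole text.
plain = re.findall('old:23$', text)

# With MULTILINE, '$' is tested at the end of every line.
multi = re.findall('old:23$', text, re.MULTILINE)

print(plain)
print(multi)
```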

How to create a nested list from an object?

I have a tags object from treetagger's python wrapper that is apparently a list:
In:
print type (tags)
Out:
<type 'list'>
When I print the content of tags as follows, I get the following lists:
In:
def postag_directory(input_directory, output_directory):
    import codecs, treetaggerwrapper, glob, os
    for filename in sorted(glob.glob(os.path.join(input_directory, '*.txt'))):
        with codecs.open(filename, encoding='utf-8') as f:
            lines = [f.read()]
            #print 'lines:\n',lines
            tagger = treetaggerwrapper.TreeTagger(TAGLANG = 'en')
            tags = tagger.TagText(lines)
            print '\n\n**** labels:****\n\n',tags
Out:
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur']]
[[u'This\tDT\tthis', u'my\tPP$\tmy']]
[[u'was\tVBD\tbe', u'to\tTO\tto', u'be\tVB\tbe', u'my\tPP$\tmy', u'camera\tNN\tcamera', u'for\tIN\tfor', u'long-distance\tJJ\tlong-distance', u'backpacking\tNN\tbackpacking', u'trips\tNNS\ttrip', u'.\tSENT\t.', u'It\tPP\tit']]
However, I would like to get just one single nested list like this:
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur'],[u'This\tDT\tthis', u'my\tPP$\tmy'],[u'was\tVBD\tbe', u'to\tTO\tto', u'be\tVB\tbe', u'my\tPP$\tmy', u'camera\tNN\tcamera', u'for\tIN\tfor', u'long-distance\tJJ\tlong-distance', u'backpacking\tNN\tbackpacking', u'trips\tNNS\ttrip', u'.\tSENT\t.', u'It\tPP\tit']]
I already tried list(), append(), [] and also:
for sublist in [item]:
    new_list = []
    new_list.append(sublist)
    print new_list
Any idea how I can nest each list from tags?
This is a list of one element (another list).
[[u'I\tPP\tI', u'am\tVBP\tbe', u'an\tDT\tan', u'amateur\tJJ\tamateur']]
So if item is a list of lists, each with one element, then you can do
new_list = [sublist[0] for sublist in item]
If you had more than one element in each sublist, then you'll need another nested loop in that.
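
Both cases can be sketched as follows (Python 3 syntax). The per_file_tags list is a shortened, hypothetical stand-in for the tag lists collected across files; itertools.chain covers the case where a sublist holds more than one element.

```python
from itertools import chain

# Shortened stand-in: one entry per file, each wrapping a single tag list.
per_file_tags = [
    [['I\tPP\tI', 'am\tVBP\tbe']],
    [['This\tDT\tthis', 'my\tPP$\tmy']],
]

# Exactly one inner list per entry: take element 0.
nested = [sublist[0] for sublist in per_file_tags]

# Any number of inner lists per entry: flatten one level.
flattened = list(chain.from_iterable(per_file_tags))

print(nested)
print(flattened)
```

With one inner list per entry, as in the question's output, both expressions give the same result.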
Though, in reality, you shouldn't use lines = [f.read()]. The documentation uses a single string when you use tag_text, so start with this
# Initialize one tagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
# Loop over the files
all_tags = []
for filename in sorted(glob.glob(os.path.join(input_directory, '*.txt'))):
    with codecs.open(filename, encoding='utf-8') as f:
        # Read the file
        content = f.read()
        # Tag it
        tags = tagger.tag_text(content)
        # add those tags to the master tag list
        all_tags.append(tags)
print(all_tags)

Parse a json file and add the strings to a URL

How do I parse JSON output, get only the list under data, and then append each string in that list to a URL, for example google.com/Confidential and so on for the other strings?
Here is my JSON output, which I will call "text":
text = {"success":true,"code":200,"data":["Confidential","L1","Secret","Secret123","foobar","maret1","maret2","posted","rontest"],"errs":[],"debugs":[]}
What I am looking to do is get only the list under data. So far, the script I have gives me the entire JSON output.
json.loads(text)
print text
output = urllib.urlopen("http://google.com" % text)
print output.geturl()
print output.read()
jsonobj = json.loads(text)
print jsonobj['data']
Will print the list in the data section of your JSON.
If you want to open each as a link after google.com, you could try this:
def processlinks(text):
    # %s is needed so that text is actually inserted into the URL
    output = urllib.urlopen('http://google.com/%s' % text)
    print output.geturl()
    print output.read()
map(processlinks, jsonobj['data'])
info = json.loads(text)
json_text = json.dumps(info["data"])
json.dumps converts the Python data structure returned by json.loads back into regular JSON text.
So you could then use json_text wherever you were using text before, and it should contain only the selected key, in your case "data".
Perhaps something like this where result is your JSON data:
from itertools import product
base_domains = ['http://www.google.com', 'http://www.example.com']
result = {"success":True,"code":200,"data":["Confidential","L1","Secret","Secret123","foobar","maret1","maret2","posted","rontest"],"errs":[],"debugs":[]}
for path in product(base_domains, result['data']):
    print '/'.join(path) # do whatever
http://www.google.com/Confidential
http://www.google.com/L1
http://www.google.com/Secret
http://www.google.com/Secret123
http://www.google.com/foobar
http://www.google.com/maret1
http://www.google.com/maret2
http://www.google.com/posted
http://www.google.com/rontest
http://www.example.com/Confidential
http://www.example.com/L1
http://www.example.com/Secret
http://www.example.com/Secret123
http://www.example.com/foobar
http://www.example.com/maret1
http://www.example.com/maret2
http://www.example.com/posted
http://www.example.com/rontest
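
Putting the pieces together without any network access, here is a minimal sketch in Python 3 syntax. It assumes the JSON arrives as a string (JSON's true is not valid Python, so the literal has to be quoted) and that each entry is simply appended to a base URL; base_url is an assumption of mine, and quote is used only to percent-encode characters that are unsafe in a URL path.

```python
import json
from urllib.parse import quote

text = ('{"success":true,"code":200,"data":["Confidential","L1","Secret",'
        '"Secret123","foobar","maret1","maret2","posted","rontest"],'
        '"errs":[],"debugs":[]}')

# json.loads returns a new object; its result must be kept.
data = json.loads(text)['data']

base_url = 'http://google.com/'  # assumed base, as in the question
urls = [base_url + quote(name) for name in data]

for url in urls:
    print(url)
```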

Why are these strings escaping from my regular expression in python?

In my code, I load up an entire folder into a list and then try to get rid of every file in the list except the .mp3 files.
import os
import re
path = '/home/user/mp3/'
dirList = os.listdir(path)
dirList.sort()
i = 0
for names in dirList:
    match = re.search(r'\.mp3', names)
    if match:
        i = i+1
    else:
        dirList.remove(names)
print dirList
print i
After I run the file, the code does get rid of some files in the list but keeps these two specifically:
['00. Various Artists - Indie Rock Playlist October 2008.m3u', '00. Various Artists - Indie Rock Playlist October 2008.pls']
I can't understand what's going on, why are those two specifically escaping my search.
You are modifying your list inside a loop. That can cause issues. You should loop over a copy of the list instead (for name in dirList[:]:), or create a new list.
i = 0
modifiedDirList = []
for name in dirList:
    match = re.search(r'\.mp3', name)
    if match:
        i += 1
        modifiedDirList.append(name)
print modifiedDirList
Or even better, use a list comprehension:
dirList = [name for name in sorted(os.listdir(path))
           if re.search(r'\.mp3', name)]
The same thing, without a regular expression:
dirList = [name for name in sorted(os.listdir(path))
           if name.endswith('.mp3')]
Maybe you should use the glob module. Here is your entire script:
>>> import glob
>>> mp3s = sorted(glob.glob('*.mp3'))
>>> print mp3s
>>> print len(mp3s)
As soon as you call dirList.remove(names), the original iterator doesn't do what you want. If you iterate over a copy of the list, it will work as expected:
for names in dirList[:]:
    ....
Alternatively, you can use list comprehensions to construct the right list:
dirList = [name for name in dirList if re.search(r'\.mp3', name)]
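
The skipping effect is easy to reproduce with a short list. The sketch below uses Python 3 syntax and made-up file names: removing an element shifts the following ones left, so the loop's internal index jumps over the element that slid into the freed slot.

```python
# Hypothetical file names, for illustration only.
names = ['a.mp3', 'b.m3u', 'c.pls', 'd.mp3']

# Buggy: removing while iterating over the same list.
buggy = list(names)
for name in buggy:
    if not name.endswith('.mp3'):
        buggy.remove(name)  # 'c.pls' moves into this slot and is skipped

# Safe: build a new list instead of mutating the one being iterated.
fixed = [name for name in names if name.endswith('.mp3')]

print(buggy)
print(fixed)
```

This matches the symptom in the question, where two adjacent non-mp3 entries are involved and the second one survives.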
