Split a list into individual entries - python

I am extracting some emails from a CSV file and then saving them to another CSV file.
The email variable should be in this format:
email = ['email#email.com'], ['email2#company.com'], ['email3#company2.com']
but in certain cases it will be returned as:
email = ['email#email.com', 'email2#email.com'], ['email3#email.com']
In certain rows it finds two emails, which is when it is presented like this.
What would be an efficient way to change it?

The following should be quite efficient:
>>> import itertools
>>> data = [ ['email#email.com', 'email2#email.com'], ['email3#email.com'] ]
>>> [[i] for i in itertools.chain(*data)]
[['email#email.com'], ['email2#email.com'], ['email3#email.com']]

data = [ ['email#email.com', 'email2#email.com'], ['email3#email.com'] ]

def flatten(data):
    for item in data:
        if isinstance(item, basestring):
            yield [item]
        else:
            for i in item:
                yield [i]
or, if you want to support arbitrary levels of nesting:
def flatten(data):
    for item in data:
        if isinstance(item, basestring):
            yield [item]
        else:
            for i in flatten(item):
                yield i
If you only need a list of emails, without each element wrapped in a list (which seems more reasonable to me), the solution is much simpler:
import itertools
print list(itertools.chain.from_iterable(data))

If you are working with CSV files, you may want to try the csv module from the standard library:
http://docs.python.org/library/csv.html
Example:
$ cat > test.csv
['email#email.com', 'email2#email.com'], ['email3#email.com']
$ python
>>> import csv
>>> reader = csv.reader(open('test.csv', 'r'))
>>> for row in reader:
...     print row
...
["['email#email.com'", " 'email2#email.com']", " ['email3#email.com']"]
What I did there may not be exactly what you want, but if you look through the library you should find what you need.
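If the end goal is simply to write one email per row to the new CSV file, here is a minimal sketch combining the csv module with the flattening shown in the other answers (the output filename is only a placeholder):

import csv
import itertools

data = [['email#email.com', 'email2#email.com'], ['email3#email.com']]

# write one email per row; 'emails_out.csv' is a hypothetical filename
# (on Python 3, pass newline='' to open() to avoid blank lines on Windows)
with open('emails_out.csv', 'w') as f:
    writer = csv.writer(f)
    for address in itertools.chain.from_iterable(data):
        writer.writerow([address])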

Related

Python: write files from lists based on match

Please help. I am trying to write separate files from a list of lists based on a string match.
The list X below has 3 sub-lists, and based on a match from filter, I want to filter the lines and write them to separate files.
X = ['apple,banana,fruits,orange', 'dog,cat,animals,horse', 'mouse,elephant,animals,peacock']
filter = (fruits, animals)
From the lists in X, I want to write CSV files separately based on the match found in filter.
I tried the incomplete code below:
def write(Y):
    temp = []
    for elem in Y:
        for n in filter:
            if n in elem:
                temp.append(elem)
Expected output:
cat fruits.csv:
apple,banana,fruits,orange
cat animals.csv
dog,cat,animals,horse
mouse,elephant,animals,peacock
Please help or advise the best method to do this.
Thanks in advance.
You can create a dictionary from your list with the filter words as keys (which become the filenames) and then iterate over the dictionary to write each group to its own file:
import re
from collections import defaultdict

X = ['apple,banana,fruits,orange', 'dog,cat,animals,horse', 'mouse,elephant,animals,peacock']
filter = ('fruits', 'animals')

# group the lines by the filter word they contain
d = defaultdict(list)
for x in X:
    for f in filter:
        if re.search(fr'\b{f}\b', x):
            d[f].append(x)

# write one file per filter word
for k, v in d.items():
    with open(f'{k}.csv', 'w') as fi:
        for y in v:
            fi.write(y + '\n')  # add a newline so each entry is a separate row

Using Zip Function to Create Columns in CSV with non-identical lengths of data

I have a large number of files that are named according to gradually more specific criteria.
Each part of the filename, separated by '_', relates to a drilled-down categorization of that file.
The naming convention looks like this:
TEAM_STRATEGY_ATTRIBUTION_TIMEFRAME_DATE_FILEVIEW
What I am trying to do is iterate through all these files and then pull out a list of how many different occurrences of each naming category exist.
So essentially this is what I've done so far: I iterated through all the files and made a list of each name. I then separated each name by the '_' and appended each part to its respective category list.
Now I'm trying to export them to a CSV file separated by columns, and this is where I'm running into problems:
L = [teams, strategies, attributions, time_frames, dates, file_types]
columns = zip(*L)
list(columns)
with open (_outputfolder_, 'w') as f:
    writer = csv.writer(f)
    for column in columns:
        print(column)
This is a rough estimation of the list I'm getting out:
[{'TEAM1'},
{'STRATEGY1', 'STRATEGY2', 'STRATEGY3', 'STRATEGY4', 'STRATEGY5', 'STRATEGY6', 'STRATEGY7', 'STRATEGY8', 'STRATEGY9', 'STRATEGY10','STRATEGY11', 'STRATEGY12', 'STRATEGY13', 'STRATEGY14', 'STRATEGY15'},
{'ATTRIBUTION1','ATTRIBUTION1','Attribution3','Attribution4','Attribution5', 'Attribution6', 'Attribution7', 'Attribution8', 'Attribution9', 'Attribution10'},
{'TIME_FRAME1', 'TIME_FRAME2', 'TIME_FRAME3', 'TIME_FRAME4', 'TIME_FRAME5', 'TIME_FRAME6', 'TIME_FRAME7'},
{'DATE1'},
{'FILE_TYPE1', 'FILE_TYPE2'}]
What I want the final result to look like is something like:
Team1 STRATEGY1 ATTRIBUTION1 TIME_FRAME1 DATE1 FILE_TYPE1
STRATEGY2 ATTRIBUTION2 TIME_FRAME2 FILE_TYPE2
... ... ...
etc. etc. etc.
But only the first line actually gets stored in the CSV file.
Can anyone help me understand how to get it to iterate past just the first line? I'm sure this is happening because the Team type has only one option, but I don't want this to hinder it.
As the answer I referred to explains, you have to transpose the result and use it; see the post below:
Python - Transposing a list (rows with different length) using numpy fails.
I have used natural sorting to sort the integers and padded the lists with blanks to get the expected outcome.
Natural sorting is slower for larger lists; you can also use third-party libraries, see:
Does Python have a built in function for string natural sort?
import re
import csv

def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

# build one row per entry of the longest category
res = [[] for _ in range(max(len(sl) for sl in columns))]
count = 0
for sl in columns:
    sorted_sl = natural_sort(sl)
    for x, res_sl in zip(sorted_sl, res):
        res_sl.append(x)

# every row after the first gets a leading blank cell
for result in res:
    if count > 0:
        result.insert(0, '')
    count = count + 1

with open("test.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(res)  # the file is closed automatically by the with block
The columns should be converted into lists before printing to the CSV file.
The writerows method can be leveraged to write multiple rows at once.
You can find more information here: https://docs.python.org/2/library/csv.html
The resulting test.csv looks like this:
TEAM1,STRATEGY1,ATTRIBUTION1,TIME_FRAME1,DATE1,FILE_TYPE1
,STRATEGY2,Attribution3,TIME_FRAME2,FILE_TYPE2
,STRATEGY3,Attribution4,TIME_FRAME3
,STRATEGY4,Attribution5,TIME_FRAME4
,STRATEGY5,Attribution6,TIME_FRAME5
,STRATEGY6,Attribution7,TIME_FRAME6
,STRATEGY7,Attribution8,TIME_FRAME7
,STRATEGY8,Attribution9
,STRATEGY9,Attribution10
,STRATEGY10
,STRATEGY11
,STRATEGY12
,STRATEGY13
,STRATEGY14
,STRATEGY15
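Alternatively, itertools.zip_longest can pad the unequal-length columns automatically instead of inserting the blanks by hand, which keeps every value under its own heading rather than shifting cells to the left. A sketch, assuming columns holds the six category collections and natural_sort is defined as above:

import csv
import itertools

# transpose the columns into rows, padding shorter columns with ''
rows = itertools.zip_longest(*(natural_sort(col) for col in columns), fillvalue='')

with open('test.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)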

Extract values from within strings in a list - python

I have a list in my python code with the following structure:
file_info = ['{file:C:\\samples\\123.exe, directory:C:\\}','{file:C:\\samples\\345.exe, directory:C:\\}',...]
I want to extract just the file and directory values for every element of the list and print them. With the following code, I am able to extract the directory values:
for item in file_info:
    print item.split('directory:')[1].strip('}')
But I am not able to figure out a way to extract the 'file' values. The following doesn't work:
print item.split('file:')[1].strip(', directory:C:\}')
Suggestions? If there is a better method to extract the file and directory values than this, that would be great too. Thanks in advance.
If the format is exactly the same as what you've provided, you'd better go with re:
import re
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
pattern = re.compile(r'\w+:(\w+)')
for item in file_info:
    print re.findall(pattern, item)
or, using string replace(), strip() and split() (a bit hackish and fragile):
file_info = ['{file:file1, directory:dir1}', '{file:file2, directory:directory2}']
for item in file_info:
    item = item.strip('}{').replace('file:', '').replace('directory:', '')
    print item.split(', ')
both code snippets print:
['file1', 'dir1']
['file2', 'directory2']
If the file_info items are just dumped JSON (note the double quotes), you can use json to load them into dictionaries:
import json
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = json.loads(item)
    print item['file'], item['directory']
or, using ast.literal_eval():
from ast import literal_eval
file_info = ['{"file":"file1", "directory":"dir1"}', '{"file":"file2", "directory":"directory2"}']
for item in file_info:
    item = literal_eval(item)
    print item['file'], item['directory']
both code snippets print:
file1 dir1
file2 directory2
Hope that helps.
I would do:
import re
regx = re.compile('{\s*file\s*:\s*([^,\s]+)\s*'
                  ','
                  '\s*directory\s*:\s*([^}\s]+)\s*}')

file_info = ['{file:C:\\samples\\123.exe, directory : C:\\}',
             '{ file: C:\\samples\\345.exe,directory:C:\\}'
             ]

for item in file_info:
    print '%r\n%s\n' % (item,
                        regx.search(item).groups())
Result:
'{file:C:\\samples\\123.exe, directory : C:\\}'
('C:\\samples\\123.exe', 'C:\\')
'{ file: C:\\samples\\345.exe,directory:C:\\}'
('C:\\samples\\345.exe', 'C:\\')

How to speed up the pattern search between two lists : python

I have two fastq files like the one given below. Each record in the file starts with '#'. For two such files, my aim is to extract records that are common between the two files.
#IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
#IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
I have tried this:
First, I get a list of read IDs that are common to file1 and file2.
import sys
#('reading files and storing all lines in a list')
data1 = open(sys.argv[1]).read().splitlines()
data2 = open(sys.argv[2]).read().splitlines()
#('listing all read IDs from file1')
list1 = []
for item in data1:
    if '#' in item:
        list1.append(item)
#('listing all read IDs from file2')
list2 = []
for item in data2:
    if '#' in item:
        list2.append(item)
#('finding common reads in file1 and file2')
def intersect(a, b):
    return list(set(a) & set(b))
common = intersect(list1, list2)
Here, I search for common IDs in the main file and export the data to a new file. The following code works fine for small files but freezes my computer if I try it with larger files. I believe the 'for' loop is taking too much memory:
#('filtering read data from file1')
mod_data1 = open(sys.argv[1]).read().rstrip('\n').replace('#', ',#')
tab1 = open(sys.argv[1] + '_final', 'wt')
records1 = mod_data1.split(',')
for item in records1[1:]:
    if item.replace('\n', '\t').split('\t')[0] in common:
        tab1.write(item)
Please suggest what I should do with the code above so that it works on larger files (40-100 million records per file, each record being 4 lines).
Using list comprehension, you could write :
list1 = [item for item in data1 if '#' in item]
list2 = [item for item in data2 if '#' in item]
You could also define them as sets directly using set comprehension (depending on the version of python you are using).
set1 = {item for item in data1 if '#' in item}
set2 = {item for item in data2 if '#' in item}
I'd expect creating the set from the beginning to be faster than creating a list and then making a set out of it.
As for the second part of the code, I am not quite sure yet what you are trying to achieve.
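If the goal of that second part is to copy every record from file1 whose ID line appears in common, a streaming sketch that avoids loading the whole file into memory might look like this (it assumes each record is exactly four lines and that common holds the ID lines built earlier):

import sys

common_set = set(common)  # membership tests on a set are much faster than on a list
with open(sys.argv[1]) as src, open(sys.argv[1] + '_final', 'w') as out:
    while True:
        record = [src.readline() for _ in range(4)]
        if not record[0]:                      # end of file reached
            break
        if record[0].rstrip('\n') in common_set:
            out.writelines(record)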

Import Mongodb to CSV - removing duplicates

I am importing data from Mongo into a CSV file. The import consists of "timestamp" and "text" for each JSON document.
The documents:
{
    name: ...,
    size: ...,
    timestamp: ISODate("2013-01-09T21:04:12Z"),
    data: { text:..., place:...},
    other: ...
}
The code:
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
I would like to remove duplicates (some Mongo docs have the same text), and I would like to keep the first instance (with regard to the time) intact. Is it possible to remove these dupes as I import?
Thanks for your help!
I would use a set to store the hashes of the data, and check for duplicates. Something like this:
import md5
hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = md5.new(r['text']).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        else:
            hashes.add(digest)
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
It's worth noting that you could use the text field directly, but for larger text fields storing just the hash is much more memory efficient.
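For comparison, the same loop written to store the raw text directly (a variant sketch of the code above, trading memory for one less hashing step):

seen = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        if r['text'] in seen:
            continue                           # duplicate text, skip it
        seen.add(r['text'])
        fp.write('"%s","%s"\n' % (r['text'], r['timestamp'].strftime('%H:%M:%S')))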
You just need to maintain a map (dictionary) of (text, timestamp) pairs. The 'text' is the key, so there won't be any duplicates. I will assume the order of reading is not guaranteed to return the oldest timestamp first; in that case you will have to make two passes: one for reading and a later pass for writing.
textmap = {}

def insert(text, ts):
    global textmap
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

for text in textmap:
    print >>fp, text, textmap[text]  # with whatever format desired.
At the end, you can also easily convert the dictionary into a list of tuples, in case you want to sort the results by timestamp before printing, for example.
(See Sort a Python dictionary by value)
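For instance, a one-line sketch of that conversion:

# (text, timestamp) pairs ordered by timestamp, oldest first
ordered = sorted(textmap.items(), key=lambda pair: pair[1])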
