I have about 750 files (.csv) and each line has one entry which is a UUID. My goal for this script is to count how many unique UUIDs exist across all 750 or so files. The file name structure looks like the following:
DATA-20200401-005abf4e3f864dcb83bd9030e63c6da6.csv
As you can see, it has a date and some random id. They're all in the same directory and they all have the same file extension. Each file is newline-delimited and contains one UUID per line, which looks like the following: b0d6e1e9-1b32-48d5-b962-671664484616
I tried merging all the files, but things got messy and this is about 15GB worth of data.
My final goal is to get an output such that it states the number of unique IDs across all the files. For example:
file1:
xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz
file2:
xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz
The final output after scanning these two files would be:
The total number of unique ids is: 2
I reckon using a Counter may be the fastest way to do this:
from collections import Counter
import glob

c = Counter()
for filename in glob.glob("*.csv"):
    with open(filename) as f:
        c.update(line.strip() for line in f)
print(f"The total number of unique ids is: {len(c)}")
The Counter tracks the count of each unique item, so the number of distinct keys (len(c)) is the number of unique UUIDs. It is implemented using a hash table, so it should be fairly quick even with a large number of items.
If you don't have to use Python, then a simple solution might be the command line:
cat *.csv | sort -u | wc -l
This pipes the content of all of the CSV files into sort -u, which sorts and removes duplicates, then pipes that into wc -l, which counts the lines.
Note: sort will spill to disk as needed, and you can control its memory usage with -S size if you like.
I'd be tempted to run this on a powerful machine with lots of RAM.
Maybe something like this would work:
from os import listdir
import re
import pandas as pd

my_folder_path = "C:\\"

# Generic regular expression for the file names
pat = r"DATA-\d{8}-.+\.csv"
p = re.compile(pat)

# UUID column in each file (I don't know if this is the case; adjust accordingly.)
uuid_column = "uuids"

# Empty result dataframe with a single column
result_df = pd.DataFrame(columns=["unique_uuid"])

file_list = [rf"{my_folder_path}\{i}" for i in listdir(my_folder_path)]

for f in file_list:
    # Check for a matching regular expression pattern
    if p.search(f):
        # Read only the UUID column if the pattern matches.
        df = pd.read_csv(f, usecols=[uuid_column])
        # Keep only the values not already present in the result dataframe
        new_values = set(df[uuid_column].values).difference(result_df["unique_uuid"].values)
        result_df = pd.concat(
            [result_df, pd.DataFrame({"unique_uuid": list(new_values)})],
            ignore_index=True,
        )
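To get the final count the question asks for, a one-line addition at the end would report the size of the result (assuming result_df is built as above):
print(f"The total number of unique ids is: {len(result_df)}")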
Concatenating all of the csv files in a directory has been solved in a pretty popular post. The only difference here is that you drop duplicates. This would of course work well only if there is a significant amount of duplicates in each file (at least enough for all of the deduped frames to fit into memory and perform the final drop_duplicates).
There are also some other suggestions in that link, such as skipping the list altogether.
import glob
import pandas as pd

files = glob.glob('./data_path/*.csv')
li = []

for file in files:
    df = pd.read_csv(file, index_col=None, header=None)
    li.append(df.drop_duplicates())

output = pd.concat(li, axis=0, ignore_index=True)
output = output.drop_duplicates()
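If the goal is just the final count, a one-line addition would report it (not part of the linked post):
print(f"The total number of unique ids is: {len(output)}")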
Read all the files and add all the UUIDs to a set as you go. Sets enforce uniqueness, so the length of the set is the number of unique UUIDs you found. Roughly:
import csv
import os

uuids = set()
for path in os.listdir():
    with open(path) as file:
        for row in csv.reader(file):
            uuids.update(row)

print(f"The total number of unique ids is: {len(uuids)}")
This assumes that you can store all the unique UUIDs in memory. If you can't, building a database on disk would be the next thing to try (e.g. replace the set with a sqlite db or something along those lines). If you had a number of unique IDs that's too large to store anywhere, there are still solutions as long as you're willing to sacrifice some precision: https://en.wikipedia.org/wiki/HyperLogLog
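As a rough illustration of the on-disk route, here is a minimal sketch using the standard sqlite3 module; the database file name uuids.db and the single-column table are assumptions made for the example:

import glob
import sqlite3

conn = sqlite3.connect("uuids.db")  # hypothetical on-disk database file
conn.execute("CREATE TABLE IF NOT EXISTS ids (uuid TEXT PRIMARY KEY)")

for path in glob.glob("*.csv"):
    with open(path) as f:
        # PRIMARY KEY + INSERT OR IGNORE deduplicates on disk instead of in memory
        conn.executemany(
            "INSERT OR IGNORE INTO ids VALUES (?)",
            ((line.strip(),) for line in f if line.strip()),
        )
conn.commit()

(count,) = conn.execute("SELECT COUNT(*) FROM ids").fetchone()
print(f"The total number of unique ids is: {count}")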
Related
I hope you can help out a new learner of Python. I could not find my problem in other questions, but if so: apologies. What I basically want to do is this:
Read a large number of text files and search each for a number of string terms.
If the search terms are matched, store the corresponding file name to a new file called "filelist", so that I can tell the good files from the bad files.
Export "filelist" to Excel or CSV.
Here is the code that I have so far:
# text files all contain only simple text, e.g. "6 Apples"
filelist = []

for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()
        if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
            filelist.append(file)

listoffiles = pd.DataFrame(filelist)

writer = pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter')
listoffiles.to_excel(writer, sheet_name='welcome', index=False)
writer.save()

print(filelist)
Questions:
Surely, there is a more elegant or time-efficient way? I need to do this for a large number of files :D
Related to the former, is there a way to solve the reading-in of files using pandas? Or is it less time efficient? For me as a STATA user, having a dataframe feels a bit more like home....
I added the "Latin1" option, as some characters in the raw data created an encoding conflict. Is there a way to find out which characters are causing the problem? Can I get rid of this easily, e.g. by cutting off the first line beforehand (skiprows maybe)?
Just a couple of things to speed up the script:
1.) Compile your regex beforehand, not every time in the loop (also use | to combine multiple strings into one regex!)
2.) Read files line by line, not all at once!
3.) Use any() to stop searching as soon as you get the first match.
For example:
import re
import os

filelist = []
r = re.compile(r'APPLES|ORANGE|BANANA')  # you can add flags=re.I for case-insensitive search

for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):  # read files line by line, not all content at once
            filelist.append(file)  # add to list

# convert list to pandas, etc...
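For completeness, a minimal sketch of that last export step; the column name "filename" is just an illustrative choice, and pandas will pick an Excel engine such as xlsxwriter if it is installed:

import pandas as pd

listoffiles = pd.DataFrame(filelist, columns=["filename"])
# One matching file name per row, same sheet name as in the question
listoffiles.to_excel("ListofFiles.xlsx", sheet_name="welcome", index=False)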
I have a large csv file containing information on sampled pathogens representing several different species. I want to split this csv file by species, so I will have one csv file per species. The data in the file aren't in any particular order. My csv file looks like this:
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044420,EQUI0208,1336,Streptococcus equi,15/10/2010,2010,Belgium,Belgium
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852528,2789STDY5834916,154046,Hungatella hathewayi,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852530,2789STDY5834918,33039,Ruminococcus torques,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852533,2789STDY5834921,40520,Blautia obeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852535,2789STDY5834923,1150298,Fusicatenibacter saccharivorans,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852537,2789STDY5834925,1407607,Fusicatenibacter,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852540,2789STDY5834928,39492,Eubacterium siraeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852544,2789STDY5834932,292800,Flavonifractor plautii,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852551,2789STDY5834939,169435,Anaerotruncus colihominis,2013,2013,United Kingdom,UK
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044418,EQUI0206,1336,Streptococcus equi,05/02/2010,2010,Belgium,Belgium
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044419,EQUI0207,1336,Streptococcus equi,29/07/2010,2010,Belgium,Belgium
The name of the species is at index 5.
I originally tried this:
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("file.csv")),
                         lambda row: row[5]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But this fails because the data aren't ordered by species and there isn't an append argument for the output (that I'm aware of), so each time the script encounters a new entry for a species it has already written to a file, it overwrites the earlier entries.
Is there a simple way to order the data by species and then execute the above script or a way to append the output of the above script to a file instead of overwriting it?
Also I'd ideally like each of the output files to be named after the species they contain.
Thanks.
In reference to your comment: "there isn't an append argument for the output (that I'm aware of)", you can use 'a' instead of 'w' to append to the file, like:
with open("%s.csv" % key, "a") as output:
This is probably not the best approach, though, because if you run the code twice you'll get everything duplicated.
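A minimal sketch of that append-mode variant, keeping the species name at index 5 as in the question (and with the re-run duplication caveat just mentioned):

import csv

with open("file.csv") as source:
    for row in csv.reader(source):
        # Append each row to the file named after its species (column index 5)
        with open("%s.csv" % row[5], "a", newline="") as output:
            csv.writer(output).writerow(row)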
You could sort the csv rows using the same lambda function you're using for the groupby operation:
import csv
from itertools import groupby

groupfunc = lambda row: row[5]

for key, rows in groupby(sorted(csv.reader(open("file.csv")), key=groupfunc), groupfunc):
    with open("%s.csv" % key, "w") as output:
        cw = csv.writer(output)
        cw.writerows(rows)
Notes:
I rewrote the write routine to use the csv module for output.
I created a variable for your lambda so it isn't copy-pasted.
Note that you have to clean up your csv files if you change your input data, because if one species isn't in the new data, the old csv remains on disk. I would do that with some code like:
import glob, os

for f in glob.glob("*.csv"):
    os.remove(f)
But beware of the *.csv pattern because it's too wide and it may be a little too effective on your other csv files :)
Note: This method uses sort and is therefore more memory-hungry. You could instead open each file in append mode, as the other answer suggests, to save memory at the cost of more file I/O.
I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename = "/data examples/" + "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, with the same number of columns and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that the data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)

with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.
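Since the point is to repeat this for several hundred downloaded files, here is a rough sketch of an outer loop; the glob pattern for the file names is an assumption based on the naming scheme in the question:

import glob
from itertools import islice

START_LINE = 20            # data starts on line 21, zero-indexed
DATA_COL = 10              # Bt column, zero-indexed
FAILED_MEASUREMENT = '-999.9'

for path in glob.glob("ace-magnetometer-*.txt"):  # hypothetical naming pattern
    with open(path) as f:
        bt_values = [float(row.split()[DATA_COL])
                     for row in islice(f, START_LINE, None)
                     if row.split()[DATA_COL] != FAILED_MEASUREMENT]
    with open('minBt.txt', 'a') as min_file, open('maxBt.txt', 'a') as max_file:
        print(min(bt_values), file=min_file)
        print(max(bt_values), file=max_file)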
First off, I am very new to Python. When I started to do this it seemed very simple. However I am at a complete loss.
I want to take a text file with as many as 90k entries and put each group of data on a single line separated by ';'. My examples are below. Keep in mind that the groups of data vary in size; they could be two entries or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;
Try something like the following (untested... you can learn a bit of Python by debugging!).
Create a Python file "parser.py":
import sys

f = open('filename.txt', 'r')
for line in f:
    txt = line.strip()
    if txt == '':
        # Blank line: end the current group and start a new output line
        sys.stdout.write('\n')
        sys.stdout.flush()
    else:
        sys.stdout.write(txt + ';')
        sys.stdout.flush()
f.close()
and in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.
Assuming the groups are separated with an empty line, you can use the following one-liner:
>>> print("\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')]))
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
First the file content (open().read()) is split on empty lines with split('\n\n') to produce a list of blocks; then, in each block ([item ... for item in list]), newlines are replaced with semicolons; finally all blocks are printed separated by newlines ("\n".join(list)).
Note that the above is not safe for production; it is the kind of code you would write for interactive data transformation, not for production-level scripts.
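For reference, a slightly more defensive sketch of the same idea with explicit file handling; the output file name formatted.txt is just an example:

with open("file.txt") as f:
    blocks = f.read().split("\n\n")

with open("formatted.txt", "w") as out:  # hypothetical output file name
    for block in blocks:
        lines = [line for line in block.splitlines() if line]
        if lines:
            # Join each group's lines with ';' and keep a trailing ';' as in the expected output
            out.write(";".join(lines) + ";\n")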
What have you tried? What is the text file for, and where does it come from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding and creating GUIs, by the way.
Anyway, here are some basic suggestions.
';'.join(group) will put a ';' between each item of group, effectively creating one long (semicolon-delimited) string.
group.replace("SPACE CHARACTER", ";") will replace any space or other specified character (like a newline) within group with a semicolon.
There are a lot of other methods, including loading the txt file into a Python script, .append() functions, putting the groups into lists, dictionaries, or matrices, etc.
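A tiny sketch of those two building blocks; the group values here are made up for illustration:

group = ["group2", "data", "data", "data"]   # hypothetical group as a list of lines
print(";".join(group) + ";")                 # -> group2;data;data;data;

block = "group2\ndata\ndata\ndata"           # hypothetical group as a single string
print(block.replace("\n", ";") + ";")        # -> group2;data;data;data;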
These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv

res = defaultdict(list)
cgroup = ''

with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())

with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k] + res[k])
Let me explain a bit why I did things the way I did. First, I used the codecs module to open the data file with an explicit codec, since data should always be treated properly, not by guessing what the encoding might be. Then I used a defaultdict, which has nice documentation online, because it's more Pythonic, at least according to Mr. Hettinger; it is one of the patterns worth learning if you use Python.
Lastly, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To meet the right criteria and get the data into a correct csv format, it is better to use what many eyes have seen instead of reinventing the wheel.
I have some large data files and I want to copy out certain pieces of data on each line, basically an ID code. The ID code has a | on one side and a space on the other. I was wondering would it be possible to pull out just the ID. Also I have two data files, one has 4 ID codes per line and the other has 23 per line.
At the moment I'm thinking of something like copying each line from the data file and then subtracting the strings from each other to get the desired ID code, but surely there must be an easier way! Help?
Here is an example of a line from the data file that I'm working with
cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327
and from this line I would want to output on separate lines
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
Use the re module for such a task. The following code shows you how to extract the IDs from a string (it works for any number of IDs as long as they are structured the same way).
import re

s = """cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
results = re.findall(r'\|([^ ]*)', s)  # list of ids extracted from the string
print('\n'.join(results))  # pretty output
Output:
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
To write the output to a file:
with open('out.txt', mode='w') as filehandle:
    filehandle.write('\n'.join(results))
For more information, see the re module documentation.
If all your lines have the given format, a simple split is enough:
# split on '|' and take the first whitespace-separated token after each '|'
ids = [x.split()[0] for x in line.split("|")[1:]]
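A quick check with the example line from the question, assuming the variable line holds one raw line:

line = "cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"
ids = [x.split()[0] for x in line.split("|")[1:]]
print("\n".join(ids))  # Wood_4286, EIK58010, AEV64487.1, PSEBR_a4327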