MIME Type extraction on Codingame - python

I'm trying to solve a puzzle on CodinGame. The goal is to extract the MIME types of some file names, like so:
Input:
2
4
html text/html
png image/png
test.html
noextension
portrait.png
doc.TXT
Output:
text/html
UNKNOWN
image/png
UNKNOWN
My code runs smoothly so far, but with a longer list I get an error that my process has timed out.
Can anybody give me a hint?
My code:
import sys
import math
# Auto-generated code below aims at helping you parse
# the standard input according to the problem statement.
n = int(input()) # Number of elements which make up the association table.
q = int(input()) # Number Q of file names to be analyzed.
mime = dict(input().split() for x in range(n))
fname = [input() for x in range(q)]
# Write an action using print
# To debug: print("Debug messages...", file=sys.stderr)
for b in fname:
    if '.' not in b:
        print('UNKNOWN')
        continue
    else:
        for idx, char in enumerate(reversed(b)):
            if char == '.':
                ext = b[len(b)-idx:]
                for k, v in mime.items():
                    if ext.lower() == k.lower():
                        print(v)
                        break
                else:
                    print('UNKNOWN')
                break

I don't know for certain that it's the reason for your code timing out, but iterating over your dictionary's keys and values to do a case-insensitive match against the keys is always going to be relatively slow. The main reason to use a dictionary is that it has O(1) lookup times. If you don't take advantage of that, you'd be better off with a list of tuples.
Fortunately, it's not that hard to fix up your dict so that you can use indexing instead of iterating to check each key manually. Here's the relevant changes to your code:
# set up the dict with lowercase extensions
mime = {ext.lower(): tp for ext, tp in (input().split() for _ in range(n))}
fnames = [input() for x in range(q)] # changed the variable name here, plural for a list
for filename in fnames: # better variable name than `b`
    if '.' not in filename:
        print("UNKNOWN") # no need for continue
    else:
        _, extension = filename.rsplit('.', 1) # no need to write your own split loop
        print(mime.get(extension.lower(), "UNKNOWN")) # Do an O(1) lookup in the dictionary
In addition to changing up the dictionary creation and lookup code, I've also simplified the code you were using to split off the extension, though my way is still O(N) just like yours (where N is the length of the file names).
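As a side note, the standard library can split off the extension for you as well; here is a minimal sketch of the same lookup using os.path.splitext (it returns the extension with its leading dot, so the dot is stripped before indexing):

import os

def lookup_mime(filename, mime):
    # os.path.splitext("portrait.png") -> ("portrait", ".png")
    _, ext = os.path.splitext(filename)
    if not ext:
        return "UNKNOWN"
    return mime.get(ext[1:].lower(), "UNKNOWN")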

Related

Directory comparison

I have a question about comparing directories. I requested a transfer of data from a company server and received this data on 8 portable hard drives, since the volume was too large to fit on one; each of them contains about 1 TB of data. Now I want to make sure that all the files that were on the company server were fully transferred, and if any files are missing I want to be able to detect that and ask for them. The issue is that the only thing I received from the company is one .txt file in which the detailed directory structure is saved in a tree format. In principle I could just look through it entry by entry, but due to the large amount of data that is just not achievable. I can generate the same kind of directory listing from every single one of the 8 drives that I received, but how can I then compare that one file against those 8 files? I tried different Python comparison scripts that parse line by line, but they don't work, as they compare string by string (line by line) and the listings are not plain path strings, they are in a tree-style format. Does anyone have any suggestions on how to do it? Is there a way to convert the tree-format file into plain path strings so I can compare them in a Python program? Or should I request (not sure if that's possible) another file with the directories saved in a different format than a tree? If so, how should it be generated?
I tried to parse it into lists in Python.
Your task consists of two subtasks:
1. Prepare the data;
2. Compare the trees.
Task 2 can be solved by this Tree class:
from collections import defaultdict

class Tree:
    def __init__(self):
        self._children = defaultdict(Tree)

    def __len__(self):
        return len(self._children)

    def add(self, value: str):
        value = value.removeprefix('/').removesuffix('/')
        if value == '':
            return
        first, rest = self._get_values(value)
        self._children[first].add(rest)

    def remove(self, value: str):
        value = value.removeprefix('/').removesuffix('/')
        if value == '':
            return
        first, rest = self._get_values(value)
        self._children[first].remove(rest)
        if len(self._children[first]) == 0:
            del self._children[first]

    def _get_values(self, value):
        values = value.split('/', 1)
        first, rest = values[0], ''
        if len(values) == 2:
            rest = values[1]
        return first, rest

    def __iter__(self):
        if len(self._children) == 0:
            yield ''
            return
        for key, child in self._children.items():
            for value in child:
                if value != '':
                    yield key + '/' + value
                else:
                    yield key

def main():
    tree = Tree()
    tree.add('a/b/c')
    tree.add('a/b/d')
    tree.add('b/b/e')
    tree.add('b/c/f')
    tree.add('c/b/e')
    # duplicates are okay
    tree.add('b/c/f')
    # leading and trailing slashes are okay
    tree.add('b/c/f/')
    tree.add('/b/c/f')
    print('Before removing:', [x for x in tree])
    tree.remove('a/b/c')
    tree.remove('a/b/d')
    tree.remove('b/b/e')
    # it's okay to remove non-existent values
    tree.remove('this/path/does/not/exist')
    # it will not remove the root-level b folder, because it's not empty
    tree.remove('b')
    print('After removing:', [x for x in tree])

if __name__ == '__main__':
    main()
Output:
Before removing: ['a/b/c', 'a/b/d', 'b/b/e', 'b/c/f', 'c/b/e']
After removing: ['b/c/f', 'c/b/e']
So, your algorithm is as follows:
1. Build a tree of the file paths on the company server (let's call it the A-tree);
2. Build trees of the file paths on the portable hard drives (let's call them B-trees);
3. Remove from the A-tree the file paths that exist in the B-trees;
4. Print the content of the A-tree.
Now all you need is to build these trees, which is task 1 from the beginning of this answer. How to do that depends on the data in your .txt file.
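For illustration only, here is a minimal sketch of task 1, assuming the .txt file contains classic Unix tree output (scaffolding made of │, ├── and └──, one entry per line, four characters per indent level); if your file uses a different layout, the depth calculation will need to be adapted:

def parse_tree_listing(lines):
    # Reconstruct full paths from tree-style output by tracking depth.
    stack = []  # path components of the current branch
    for line in lines:
        stripped = line.rstrip('\n')
        # the entry name starts after the │/├──/└── scaffolding
        name = stripped.lstrip('│ ├└─')
        depth = (len(stripped) - len(name)) // 4
        stack = stack[:depth]
        stack.append(name)
        yield '/'.join(stack)

tree = Tree()
with open('listing.txt', encoding='utf-8') as f:
    for path in parse_tree_listing(f):
        tree.add(path)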

How to traverse dictionary keys in sorted order

I am reading a cfg file, and receive a dictionary for each section. So, for example:
Config-File:
[General]
parameter1="Param1"
parameter2="Param2"
[FileList]
file001="file1.txt"
file002="file2.txt" ......
I have the FileList section stored in a dictionary called section. In this example, I can access "file1.txt" as test = section["file001"], so test == "file1.txt". To access every file of FileList one after the other, I could try the following:
for i in range(1, (number_of_files + 1)):
    access_key = str("file_00" + str(i))
    print(section[access_key])
This is my current solution, but I don't like it at all. First of all, it looks kind of messy in Python, and I will also face problems when more than 9 files are listed in the config.
I could also do it like:
for i in range(1, (number_of_files + 1)):
    if (i <= 9):
        access_key = str("file_00" + str(i))
    elif (i > 9 and i < 100):
        access_key = str("file_0" + str(i))
    print(section[access_key])
But I don't want to start with that because it becomes even worse. So my question is: What would be a proper and relatively clean way to go through all the file names in order? I definitely need the loop because I need to perform some actions with every file.
Use zero padding to generate the file number (see e.g. this SO answer: https://stackoverflow.com/a/339013/3775361). That way you don't have to write the digit-rollover logic yourself; you can use built-in Python functionality to do it for you. If you're using Python 3, I'd also recommend you try out f-strings (one of the suggested solutions at the link above). They're awesome!
If we can assume the file number has three digits, then you can do any of the following to achieve zero padding. All of the below return "015".
i = 15
str(i).zfill(3)
# or
"%03d" % i
# or
"{:0>3}".format(i)
# or
f"{i:0>3}"
Start by looking at the keys you actually have instead of guessing what they might be. You need to filter out the ones that match your pattern, and sort according to the numerical portion.
keys = [key for key in section.keys() if key.startswith('file') and key[4:].isdigit()]
You can add additional conditions, like len(key) > 4, or drop the conditions entirely. You might also consider learning regular expressions to make the checking more elegant.
To sort the names without having to account for padding, you can do something like
keys = sorted(keys, key=lambda s: int(s[4:]))
You can also try a library like natsort, which will handle the custom sort key much more generally.
Now you can iterate over the keys and do whatever you want:
for key in sorted((k for k in section if k.startswith('file') and k[4:].isdigit()), key=lambda s: int(s[4:])):
    print(section[key])
Here is what a solution equipped with re and natsort might look like:
import re
from natsort import natsorted
pattern = re.compile(r'file\d+')
for key in natsorted(k for k in section if pattern.fullmatch(k)):
    print(section[key])

Speed up file matching based on names of files

So I have 2 directories with 2 different file types (e.g. .csv, .png) but with the same basename (e.g. 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.
What I want to do is to get the full paths of the files after matching the basenames, and then DO something with the full path of both files.
I am asking for some help on how to speed up the procedure.
My approach is:
csvList=[a list with the full path of each .csv file]
pngList=[a list with the full path of each .png file]
for i in range(0, len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # eg 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for j in range(0, len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            DO SOMETHING
Moreover, I tried fnmatch, something like:
for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    rel_path = "/path/to/file"
    pattern = "*" + csv_id + "*.png"
    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        DO something
It seems that using the nested for loops is faster. But I want it to be even faster. Are there any other approaches I could use to speed up my code?
First of all, optimize the syntax of your existing loop like this:
for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here
Nested loops are very slow because you need to run the png loop n² times.
Then you can use list comprehensions and list indexing to speed it up more:
## create lists of processed values
## so you don't have to keep running the os library
csv_base_list = [os.path.basename(csv) for csv in csvlist]
csv_id_list = [os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list = [os.path.basename(png) for png in pnglist]
png_id_list = [os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]
## run a single loop with list.index to find each matching pair and record the base values
csv_png_base = [(csv_base_list[csv_id_list.index(png_id)], png_base)
                for png_id, png_base in zip(png_id_list, png_base_list)
                if png_id in csv_id_list]
## csv_png_base contains tuples of (csv_base, png_base)
This logic using list.index reduces the loop count significantly, and there are no repetitive os library calls.
A list comprehension is also slightly faster than a normal loop.
You can then loop through the list and do something with the values, e.g.
for csv_base, png_base in csv_png_base:
    pass  # do something
pandas will do the job much, much faster though, because it will run the loop using a C library.
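For instance, a rough sketch of the pandas route, reusing the id extraction from your code (an inner merge keeps only the ids present in both directories):

import os
import pandas as pd

def id_of(path):
    # "1001_12_15.csv" -> "1001"
    return os.path.splitext(os.path.basename(path))[0].split("_")[0]

csv_df = pd.DataFrame({"id": [id_of(p) for p in csvList], "csv_path": csvList})
png_df = pd.DataFrame({"id": [id_of(p) for p in pngList], "png_path": pngList})

pairs = csv_df.merge(png_df, on="id")  # inner join on the extracted id
for row in pairs.itertuples():
    print(row.csv_path, row.png_path)  # do something with both paths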
You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:
from os.path import basename, splitext
png_lookup = {
    splitext(basename(png_path))[0]: png_path
    for png_path in pngList
}
This allows you to directly look up the png file corresponding to each csv file:
for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass
    else:
        pass  # do something
In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).

Sort images based on a cluster correspondences list

I have the following working code to sort images according to a cluster list which is a list of tuples: (image_id, cluster_id).
One image can only be in one and only one cluster (there is never the same image in two clusters for example).
I wonder if there is a way to shorten the "for+for+if+if" loops at the end of the code, as currently, for each file name, I must check every pair in the cluster list, which makes it a little redundant.
import os
import re
import shutil

srcdir = '/home/username/pictures/'
if not os.path.isdir(srcdir):
    print("Error, %s is not a valid directory!" % srcdir)
    return None

pts_cls  # is the list of pairs (image_id, cluster_id)

filelist = [(srcdir+fn) for fn in os.listdir(srcdir) if
            re.search(r'\.jpg$', fn, re.IGNORECASE)]
filelist.sort(key=lambda var: [int(x) if x.isdigit() else
                               x for x in re.findall(r'[^0-9]|[0-9]+', var)])

for f in filelist:
    fbname = os.path.splitext(os.path.basename(f))[0]
    for e, cls in enumerate(pts_cls):  # for each (img_id, clst_id) pair
        if str(cls[0]) == fbname:  # check if image_id corresponds to file basename on disk
            if cls[1] == -1:  # if cluster_id is -1 (-> noise)
                outdir = srcdir + 'cluster_' + 'Noise' + '/'
            else:
                outdir = srcdir + 'cluster_' + str(cls[1]) + '/'
            if not os.path.isdir(outdir):
                os.makedirs(outdir)
            dstf = outdir + os.path.basename(f)
            if os.path.isfile(dstf) == False:
                shutil.copy2(f, dstf)
Of course, as I am pretty new to Python, any other well explained improvements are welcome!
I think you're complicating this far more than needed. Since your image names are unique (there can only be one image_id), you can safely convert pts_cls into a dict and have fast lookups on the spot instead of looping through the list of pairs each and every time. You are also utilizing regex where it's not needed, and you're packing your paths only to unpack them later.
Also, your code would break if it happens that an image from your source directory is not in pts_cls, as its outdir would never be set (or worse, its outdir would be the one from the previous loop iteration).
I'd streamline it like:
import os
import shutil

src_dir = "/home/username/pictures/"
if not os.path.isdir(src_dir):
    print("Error, %s is not a valid directory!" % src_dir)
    exit(1)  # return is expected only from functions

pts_cls = []  # the list of pairs (image_id, cluster_id), load from wherever...

# convert your pts_cls into a dict - since there cannot be any images in multiple clusters
# the base image name is perfectly ok to use as a key for blazingly fast lookups later
cluster_map = dict(pts_cls)

# get only `.jpg` files; store base name and file name, no need for a full path at this time
files = [(fn[:-4], fn) for fn in os.listdir(src_dir) if fn.lower()[-4:] == ".jpg"]

# no need for sorting based on your code
for name, file_name in files:  # loop through all files
    if name in cluster_map:  # proceed with the file only if in pts_cls
        cls = cluster_map[name]  # get our cluster value
        # get our `cluster_<cluster_id>` or `cluster_Noise` (if cluster == -1) target path
        target_dir = os.path.join(src_dir, "cluster_" + str(cls if cls != -1 else "Noise"))
        target_file = os.path.join(target_dir, file_name)  # get the final target path
        if not os.path.exists(target_file):  # if the target file doesn't exist
            if not os.path.isdir(target_dir):  # make sure our target path exists
                os.makedirs(target_dir, exist_ok=True)  # create the full path if it doesn't
            shutil.copy(os.path.join(src_dir, file_name), target_file)  # copy
UPDATE - If you have multiple 'special' folders for certain cluster IDs (like Noise is for -1), you can create a map like cluster_targets = {-1: "Noise"} where the keys are your cluster IDs and their values are, obviously, the special names. Then you can replace the target_dir generation with: target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls, cls)))
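A quick sketch of that map (the -2 entry is a made-up example ID, just to show it generalizes):

cluster_targets = {-1: "Noise", -2: "Unsorted"}  # special cluster IDs -> folder names
target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls, cls)))
# cls == -1 -> ".../cluster_Noise", cls == 3 -> ".../cluster_3"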
UPDATE #2 - Since your image_id values appear to be integers while filenames are strings, I'd suggest you to just build your cluster_map dict by converting your image_id parts to strings. That way you'd be comparing likes to likes without the danger of type mismatch:
cluster_map = {str(k): v for k, v in pts_cls}
If you're sure that none of the *.jpg files in your src_dir will have a non-integer in their name, you can instead convert the filename into an integer to begin with in the files list generation - just replace fn[:-4] with int(fn[:-4]). But I wouldn't advise that as, again, you never know how your files might be named.

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
1. Export to csv
2. Separate into counties
3. Separate into districts
4. Analyze each district individually, pull out values
5. Write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is: is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin down which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
    toggle = False
    for i, j in enumerate(x):
        if j == "\"":
            toggle = not toggle
        if toggle == True:
            if j == ",":
                x = x[:i] + " " + x[i+1:]
    return x

def districtatize(x):
    # list indexes of entries starting with "for" or "to" of length > 5
    indices = [1]
    for i, j in enumerate(x):
        if len(j) > 2:
            if j[:2] == "to":
                indices.append(i)
        if len(j) > 3:
            if j[:3] == " to" or j[:3] == "for":
                indices.append(i)
        if len(j) > 5:
            if j[:5] == " \"for" or j[:5] == " \'for":
                indices.append(i)
        if len(j) > 4:
            if j[:4] == " \"to" or j[:4] == " \'to" or j[:4] == " for":
                indices.append(i)
    if len(indices) == 1:
        return [x[0], x[1:len(x)-1]]
    new = [x[0], x[1:indices[1]+1]]
    z = 1
    while z < len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z += 1
    return new
# should return a list of lists. First entry will be county
# each successive element in the list will be a list by district

def splitforstos(string):
    for itemind, item in enumerate(string):              # take all exception cases that didn't get processed
        splitfor = re.split('(?<=\d)\s\s(?=for)', item)  # correctly and split them up so that the for begins
        splitto = re.split('(?<=\d)\s\s(?=to)', item)    # a cell
        if len(splitfor) > 1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind, splitfor[0])
            string.insert(itemind+1, splitfor[1])
        elif len(splitto) > 1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind, splitto[0])
            string.insert(itemind+1, splitto[1])

def analyze(x):
    # input should be a string of content
    # target values are nomills, levytype, term, yearcom, yeardue
    clean = cleancommas(x)
    countylist = clean.split(',')
    emptystrip = filter(lambda a: a != '', countylist)
    empt2strip = filter(lambda a: a != ' ', emptystrip)
    singstrip = filter(lambda a: a != '\' \'', empt2strip)
    quotestrip = filter(lambda a: a != '\" \"', singstrip)
    splitforstos(quotestrip)
    distd = districtatize(quotestrip)
    print '\n\ndistrictized\n\n', distd
    county = distd[0]
    for x in distd[1:]:
        if len(x) > 8:
            district = x[0]
            vote1 = x[1]
            votemil = x[2]
            spaceindex = [m.start() for m in re.finditer(' ', votemil)][-1]
            vote2 = votemil[:spaceindex]
            mills = votemil[spaceindex+1:]
            votetype = x[4]
            numyears = x[6]
            yearcom = x[8]
            yeardue = x[10]
            reason = x[11]
            data = [filename, county, district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data", data
        else:
            print "x\n\n", x
            district = x[0]
            vote1 = x[1]
            votemil = x[2]
            spaceindex = [m.start() for m in re.finditer(' ', votemil)][-1]
            vote2 = votemil[:spaceindex]
            mills = votemil[spaceindex+1:]
            votetype = x[4]
            special = x[5]
            splitspec = special.split(' ')
            try:
                forind = [i for i, j in enumerate(splitspec) if j == 'for'][0]
                numyears = splitspec[forind+1]
                yearcom = splitspec[forind+6]
            except:
                forind = [i for i, j in enumerate(splitspec) if j == 'commencing'][0]
                numyears = None
                yearcom = splitspec[forind+2]
            yeardue = str(x[6])[-4:]
            reason = x[7]
            data = [filename, county, district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data other", data
        openfile = csv.writer(open('out.csv', 'a'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1]  # the file is the first argument
f = open(filename, 'r')
contents = f.read()  # entire csv as string
# find the index of every instance of the word COUNTY
separators = [m.start() for m in re.finditer('\w+\sCOUNTY', contents)]  # alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x, y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
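For instance, a minimal sketch with collections.namedtuple (the type and field names here are made up to mirror the data list above):

from collections import namedtuple

LevyRecord = namedtuple('LevyRecord', [
    'filename', 'county', 'district', 'vote1', 'vote2', 'mills',
    'votetype', 'numyears', 'yearcom', 'yeardue', 'reason',
])

data = LevyRecord(filename, county, x[0], x[1], vote2, mills,
                  x[4], x[6], x[8], x[10], x[11])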
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
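A skeleton of that dispatch (the matcher names and return values here are placeholders):

def match_long_row(row):
    # hypothetical matcher for the len(row) > 8 layout
    if len(row) > 8:
        return ('long', row[0])  # would build the real named tuple here
    return None

def match_short_row(row):
    # hypothetical matcher for the shorter "commencing" layout
    if len(row) > 5:
        return ('short', row[0])
    return None

def classify(row):
    for p in (match_long_row, match_short_row):
        match = p(row)
        if match:
            return match
    return None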
