Cleaning list to remove semi-duplicate values

I have a list of video links. Some of these links are near-duplicates: they are the same link except that one ends with x_480.mp4 instead of x.mp4. Not all links have a "_480" version.
How can I clean the list so that, wherever a _480.mp4 version exists, only that version is kept and the plain version is removed, while links without a _480.mp4 version are kept as they are?
Example:
videos=["VfeHB0sga.mp4","G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Expected result:
["G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Note: all links end with .mp4. Also, there is no _480.mp4 link without its original.
By the way, len(videos) is 243.

You can do it in two lines of code:
to_remove = {fn[:-8] + '.mp4' for fn in videos if fn.endswith('_480.mp4')}
cleaned = [fn for fn in videos if fn not in to_remove]
The first line uses a set comprehension to extract all of the _480.mp4
filenames, converting them to their unwanted short versions. They are
stored in a set for quick searching.
The second line uses a list comprehension to filter out the unwanted
filenames.

I'd probably go the dict route, so you don't have to check whether items exist in a list (which becomes a performance problem for large lists). For instance:
list({v[:-8] if v.endswith("_480.mp4") else v[:-4]: v
      for v in sorted(videos)}.values())
That is a compact way to say:
Construct a dictionary whose key is the incoming v without its last 8 characters if it ends with "_480.mp4", or otherwise without its last four characters, and whose value is the full incoming string.
Then take just the values of that dictionary. Since the input was a list, I pass them to the list constructor to get the same type as output.
Or broken down for easier reading, it could look something like this:
videos=["x.mp4","y.mp4","z.mp4","x_480.mp4"]
video_d = {}
for video_name in sorted(videos):
    if video_name.endswith("_480.mp4"):
        video_d[video_name[:-8]] = video_name
    else:
        video_d[video_name[:-4]] = video_name
new_videos = list(video_d.values())
It uses a virtual base name (stripping _480.mp4 or .mp4) as dictionary key. Since you do not care about resulting list order, we've made sure _480 suffixed entries are sorted after the "plain" entries. That way if they appear, they overwrite keys created for values without _480 suffix.

This should work. It loops through the videos looking for ones that end with "_480.mp4". For each of those, it drops the "_480" part to build the title of the video you want to remove, and then removes that title from the list if it is present.
videos=["x.mp4","y.mp4","z.mp4","x_480.mp4"]
#Loops through a copy of the list, so removing items does not disturb the iteration
for video in list(videos):
    if video.endswith("_480.mp4"):
        #Drops the trailing "_480" part of the video title
        start = video[:-8] + ".mp4"
        if start in videos:
            videos.remove(start)
print(videos)

You can even do this with a one-line list comprehension:
[x for x in videos if x.split('.')[0] + '_480.mp4' not in videos]
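A variation on the one-liner above, offered only as a sketch: `not in videos` scans the whole list for every element, and `x.split('.')[0]` would cut a name short if it contained another dot. Assuming every link ends with .mp4 (as the question states), a set lookup plus slicing avoids both issues. The function name drop_superseded is made up for this example.

```python
def drop_superseded(videos):
    """Keep each link unless a '_480' version of it is also in the list."""
    present = set(videos)  # set membership tests are O(1) on average
    # v[:-4] strips the trailing ".mp4" (assumed to be on every link)
    return [v for v in videos if v[:-4] + "_480.mp4" not in present]


videos = ["VfeHB0sga.mp4", "G9uKZiNm.mp4", "VfeHB0sga_480.mp4", "kvlX4Fa4.mp4"]
print(drop_superseded(videos))
# ['G9uKZiNm.mp4', 'VfeHB0sga_480.mp4', 'kvlX4Fa4.mp4']
```

Unlike the dict approach, this keeps the original order of the list.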

Related

Access last string after split function to create new list

I am a beginner in Python and I have been working on code that accesses two types of files (dcd and inp files), combines them, and creates a new list with the matching strings.
I got stuck near the beginning. I want to get all the dcd files here. They all have the .dcd extension, but the first part of the name differs, so I was wondering whether there is a way to access them after I have split the string.
#collect all dcd files into a list
list1 = []
for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
I want to get only the names with the dcd extension, which are at index [5], and create a new list or mutate this one, but I am not sure how to do that.
P.S. I have posted just the first part of the code.
Thank you!
(Screenshots omitted: the oddly sorted output, a better-looking version, and the desired result, which should be sorted and without eq* files.)

Just use sorted with a sort key, os.path.basename, which extracts only the base name of each file so that the sort is performed on it:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)

So this worked. I just added del filename1[:5] to get rid of the other unnecessary string parts:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'),key = os.path.basename):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
    del filename1[:5]
    print(filename1)

Your sort is applied to the parts of each file name. This is not what you want. If I understand correctly, you want to sort the list of file names, not the parts of one file name.
The code given by Jean François is great, but I guess you'd like to get your own code working.
You need to extract the file name by using only the last part of the split.
A split returns a list of strings. Each item is a part of the original.
filename = filename.split('/')[-1]
This line gets you the last part of the split.
Then you can add that part to your list.
And after all that, you can sort your list.
Hope this helps!
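Put together, the steps above might look like the sketch below. The paths are made-up stand-ins for what glob.glob(...) would return, and skipping names that start with "eq" is an assumption based on the wish for a result "without eq* files".

```python
# Hypothetical paths standing in for the glob.glob(...) results
paths = [
    "run1/FEP_SYAF014_a/FEP1/298/x/prod_02.dcd",
    "run1/FEP_SYAF014_a/FEP1/298/x/eq_01.dcd",
    "run2/FEP_SYAF014_b/FEP1/298/y/prod_01.dcd",
]

list1 = []
for path in paths:
    filename = path.split("/")[-1]       # last part of the split = file name
    if not filename.startswith("eq"):    # assumption: drop the eq* files
        list1.append(filename)
list1.sort()                             # sort the list of names, not the parts

print(list1)  # ['prod_01.dcd', 'prod_02.dcd']
```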

Substring with multiple instances of the same character

I am using a Magtek USB reader to read card information.
Right now I can swipe a card and get a long string of information, which goes into a Tkinter Entry textbox and looks like this:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
All of the data has been randomized, but that's the format.
I've got a Tkinter button that gets the text from the entry box, in the format shown above, and runs this:
def printCD(self):
    print(self.carddata.get())
    self.card_data_get = self.carddata.get()
    self.creditnumber = self.card_data_get[
        self.card_data_get.find("B")+1:self.card_data_get.find("^")]
    print(self.creditnumber)
    print(self.card_data_get.count("^"))
This outputs:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
8954756016548963
This yields no issues, but to get the next two variables, firstname and lastname, I would need to reuse self.variable.find("^"), because in the format the caret appears both before LAST and after INITIAL.
So far, when I've tried to do this, I haven't been able to reuse "^".
Any suggestions on how I can split that string of text up into individual variables:
Card Number
First Name
Last Name
Expiration Date

Regex will work for this. I didn't capture everything because you didn't detail what's what but here's an example of capturing the name:
import re
data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"
matches = re.search(r"\^(?P<name>.+)\^", data)
print(matches.group('name'))
# LAST/FIRST INITIAL
If you aren't familiar with regex, here's a way of testing pattern matching: https://regex101.com/r/lAARCP/1 and an intro tutorial: https://regexone.com/
But basically, I'm searching for one or more of any character (.+) between two carets (^).
Actually, since you mentioned wanting first and last names separately, you'd use this regex:
\^(?P<last>.+)/(?P<first>.+)\^
This question may also interest you regarding finding something twice: Finding multiple occurrences of a string within a string in Python
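As a rough sketch, the same idea can be extended to pull out all four fields at once. The pattern below only reflects the sample string in the question. In particular, it assumes the first four digits after the second caret are the expiration date in YYMM form, which should be checked against the reader's documentation. The names TRACK1 and parse_swipe are made up for this example.

```python
import re

# Assumed layout, based only on the sample: %B<number>^<last>/<first>^<YYMM>...
TRACK1 = re.compile(
    r"%B(?P<number>\d+)"                    # card number after the "%B" prefix
    r"\^(?P<last>[^/^]+)/(?P<first>[^^]+)"  # LAST/FIRST between the two carets
    r"\^(?P<expiry>\d{4})"                  # assumption: YYMM right after the second caret
)


def parse_swipe(data):
    """Return a dict of fields, or None if the string does not match."""
    match = TRACK1.search(data)
    if match is None:
        return None
    fields = match.groupdict()
    fields["first"] = fields["first"].strip()
    return fields


data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"
print(parse_swipe(data))
# {'number': '8954756016548963', 'last': 'LAST', 'first': 'FIRST INITIAL', 'expiry': '1809'}
```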

If you find regex difficult you can divide the problem into smaller pieces and attack one at a time:
data = '%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?'
pieces = data.split('^')  # Divide in pieces, one of which contains name
for piece in pieces:
    if '/' in piece:
        last, the_rest = piece.split('/')
        first, initial = the_rest.split()
        print('Name:', first, initial, last)
    elif piece.startswith('%B'):
        print('Card no:', piece[2:])

Parsing multiple occurrences of an item into a dictionary

I am attempting to parse several separate image links from JSON data in Python, but I am having trouble drilling down to the right level, which I believe is because the value is a list of strings.
For the majority of the items, I've had success with the example below, which pulls back everything I need. Everywhere else there is a 1:1 ratio of keys to values, but in this case there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]
I feel like I could add ['VariationPictureSet']['PictureURL'] at the end of my initial input, but that throws an error due to the indices not being integers, but strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4

This is an answer to your comment, since it needed a bit of code.
When using
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures'][0]
Now you can use
for image in resultsdict['images']['VariationSpecificPictureSet']:
    print(image['PictureURL'])

Thanks for the help, @uzzee, it's appreciated. I kept tinkering with it and was able to pull a single flat list of all the image URLs with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']],[])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']]
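For reference, here is one possible way to get the comma-separated output the question asks for, using itertools.chain.from_iterable instead of sum(..., []) to flatten the nested lists. The sample data is copied from the output shown in the question, and the helper name picture_urls is made up for this sketch.

```python
from itertools import chain

# Sample shaped like the 'images' value printed in the question
pictures = [{
    "VariationSpecificPictureSet": [
        {"PictureURL": ["http//imagelink1"], "VariationSpecificValue": "color1"},
        {"PictureURL": ["http//imagelink2"], "VariationSpecificValue": "color2"},
        {"PictureURL": ["http//imagelink3"], "VariationSpecificValue": "color3"},
        {"PictureURL": ["http//imagelink4"], "VariationSpecificValue": "color4"},
    ]
}]


def picture_urls(pictures):
    """Flatten every PictureURL list of the first picture entry into one list."""
    picture_sets = pictures[0]["VariationSpecificPictureSet"]
    return list(chain.from_iterable(entry["PictureURL"] for entry in picture_sets))


print(", ".join(picture_urls(pictures)))
# http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4
```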

Iterate over sections in a config file

I recently got introduced to the library configparser. I would like to be able to check if each section has at least one Boolean value set to 1. For example:
[Horizontal_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 0
The above would cause an error.
[Vertical_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 1
The above would pass. Below is some pseudo code of what I had in mind:
exit_test = False
for sections in config_file:
    section_check = False
    for name in parser.options(section):
        if parser.getboolean(section, name):
            section_check = True
    if not section_check:
        print "ERROR:Please specify a setting in {} section of the config file".format(section)
        exit_test = True
if exit_test:
    exit(1)
Questions:
1) How do I perform the first for loop and iterate over the sections of the config file?
2) Is this a good way of doing this or is there a better way? (If there isn't please answer question one.)

Using ConfigParser, you first have to parse your config file.
After parsing, you can get all the sections with the .sections() method.
You can iterate over each section and use .items() to get all the key/value pairs of that section.
for each_section in conf.sections():
    for (each_key, each_val) in conf.items(each_section):
        print(each_key)
        print(each_val)
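Applied to the check described in the question, a minimal sketch could look like this. It uses the Python 3 configparser module and reads the sample from a string purely for illustration. The helper name sections_without_true_flag is made up. Note that getboolean raises ValueError if an option holds something that is not a recognised boolean.

```python
import configparser

SAMPLE = """\
[Horizontal_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 0

[Vertical_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 1
"""


def sections_without_true_flag(parser):
    """Return the names of sections in which no option evaluates to True."""
    return [
        section
        for section in parser.sections()
        if not any(parser.getboolean(section, name)
                   for name in parser.options(section))
    ]


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)  # for a real file, use parser.read(path) instead
bad_sections = sections_without_true_flag(parser)
for section in bad_sections:
    print("ERROR: Please specify a setting in {} section of the config file".format(section))
```

A script could then finish with `raise SystemExit(1)` when bad_sections is not empty.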

Best bet is to load ALL the lines in the file into some kind of array (I'm going to ignore the issue of how much memory that might use and whether to page through it instead).
From there, you know that lines denoting headings follow a certain format, so you can iterate over your array to create an array of objects, each containing the heading name, the line index (a zero-based reference into the master array), and whether that heading has a value set.
Next, you can iterate over these objects while cross-referencing the master array, and for each heading check the "n" lines (in the master array) between the current heading and the next.
At this point you're down to the individual config values for that heading, so you should easily be able to parse each line and detect a value, whereupon you can break from the loop if it is true, or, for more robustness, run an exclusivity check on that heading's values to ensure that ONLY one value is set.
Using this approach you have access to all the lines, with one object per heading, so your code remains flexible and functional. Optimise afterwards.
Hope that makes sense and is helpful.

To complete the answer by @Nilesh and the comment from @PashMic, here is an example that really iterates over ALL sections, including DEFAULT:
all_section_names: list[str] = conf.sections()
all_section_names.append("DEFAULT")
for section_name in all_section_names:
    for key, value in conf.items(section_name):
        ...
Note that even if there is no real "DEFAULT" section, this will still work. There will just be no item retrieved by conf.items("DEFAULT").

Split words and creating new files with different names(python)

I need to write a program like this:
Write a program that reads a .picasa.ini file and copies pictures into new files whose names are the identification numbers of the people in those pictures (e.g. 8ff985a43603dbf8.jpg). If there are several people in a picture, it makes several copies. If a person appears in several pictures, later copies override earlier ones, so even if person 8ff985a43603dbf8 appears in several pictures, only one file with this name will exist. You may assume that the .picasa.ini file is simple.
I have an .ini file that contains:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),**d5a2d2f6f0d7ccbc**
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),**2623af3d8cb8e040**;rect64(58bf441388df9592),**d85d127e5c45cdc2**
backuphash=8108
...
Is this a good way to start this program?
for line in open(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):
    if line.startswith('faces'):
        line.split() # what must I do here to split the bolded words?
Is there a better way to do this? Remember the .jpg file must be created with a new name, so I think I should link the current .jpg file with the bolded one.

Consider using ConfigParser. Then you will have to split each value by hand, as you describe.

import ConfigParser
import string
config = ConfigParser.ConfigParser()
config.read(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')
imgs = []
for item in config.sections():
    imgs.append(config.get(item, 'faces'))
This is still a work in progress. I just want to ask whether it's correct.
Edit:
I still don't know how to split the bolded words out of there. This split function really is a pain for me.

Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there
Attempt at a solution: the elements you need always seem to have a ',' before them, so you can start by splitting at ',' and taking everything from the element at index 1 onwards ([1:]). Then, if I am thinking about this correctly, you split each of those elements twice more: at ';', taking the element at index 0, and then at ' ', again taking the element at index 0.
for line in open('thingy.ini'):
    if line != "\n":
        personelements = line.split(",")[1:]
        for person in personelements:
            personstring = person.split(";")[0].split(" ")[0]
            # strip() drops the newline left at the end of each line
            print(personstring.strip())
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2
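The ConfigParser idea and the splitting idea can be combined. The following is only a sketch: it uses the Python 3 configparser module (the code above uses the Python 2 name ConfigParser), reads the sample from a string, and assumes each faces entry has the form rect64(...),<person id> with entries separated by ';'. The function names faces_by_image and copy_plan are made up.

```python
import configparser

SAMPLE = """\
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),d5a2d2f6f0d7ccbc
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),2623af3d8cb8e040;rect64(58bf441388df9592),d85d127e5c45cdc2
backuphash=8108
"""


def faces_by_image(ini_text):
    """Map each image name to the list of person ids found in its faces entry."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    result = {}
    for section in parser.sections():
        faces = parser.get(section, "faces", fallback="")
        # Each entry is assumed to look like "rect64(<rect>),<person id>"
        result[section] = [entry.split(",")[1]
                           for entry in faces.split(";") if "," in entry]
    return result


def copy_plan(faces):
    """Map each target file name to its source image; later images win."""
    plan = {}
    for image, person_ids in faces.items():
        for person_id in person_ids:
            plan[person_id + ".jpg"] = image
    return plan


plan = copy_plan(faces_by_image(SAMPLE))
for target, source in plan.items():
    print(source, "->", target)
```

The actual copying could then be done with shutil.copyfile(source, target) for each pair in the plan.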
