I am trying to find, in a list of files, all the Excel, txt, or csv files and append them to a list:
goodAttachments = [i for i in attachments if str(i).split('.')[1].find(['xlsx','csv','txt'])]
This is obviously not working because find() expects a string, not a list. Should I try a list comprehension inside of a list comprehension?
There's no need to split or use a double list comprehension. You can use str.endswith, which accepts a tuple of strings to check as an argument:
goodAttachments = [i for i in attachments if str(i).endswith(('.xlsx', '.csv', '.txt'))]
If you really want to split:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ('xlsx', 'csv', 'txt')]
The first way is better as it accounts for files with no extension.
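A quick comparison of the two approaches on made-up names shows the edge case: a file literally named txt (no extension at all) slips through the split test but not through endswith.

```python
attachments = ["report.xlsx", "notes.txt", "txt", "archive.tar.gz"]

# endswith rejects a file named just "txt", since it has no ".txt" suffix
by_endswith = [a for a in attachments if a.endswith(('.xlsx', '.csv', '.txt'))]

# split('.')[-1] wrongly accepts it, because "txt".split('.')[-1] == "txt"
by_split = [a for a in attachments if a.split('.')[-1] in ('xlsx', 'csv', 'txt')]

print(by_endswith)  # ['report.xlsx', 'notes.txt']
print(by_split)     # ['report.xlsx', 'notes.txt', 'txt']
```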
You could try something like this:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx', 'csv', 'txt']]
This will check if the extension after the last '.' matches one of 'xlsx', 'csv', or 'txt' exactly.
[i for i in attachments if any(e in str(i).split('.')[1] for e in ['xlsx', 'csv', 'txt'])]
Like you said, a nested comprehension (here, a generator expression inside any).
Edit: this will work without splitting; I was trying to replicate the logic of find.
You can check that everything after the last dot is present in a second list. Using [-1] instead of [1] ensures that a file named like.this.txt yields the last split piece, txt, rather than this.
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx','csv','txt']]
I would suggest adding a few more lines rather than trying to create a one-liner with nested list comprehensions. Though that would work, I think splitting these comprehensions out onto separate lines makes the code more readable.
import os
attachments = ['sadf.asdf', 'asd/asd/asd.xlsx']
whitelist = {'.xlsx', '.csv'}
extensions = (os.path.splitext(fp)[1] for fp in attachments)
good_attachments = [fp for fp, ext in zip(attachments, extensions) if ext in whitelist]
I've also used os.path.splitext over str.split, as the file may have multiple dots present and this function is designed for exactly this job.
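For reference, it is os.path.splitext, not os.path.split, that isolates the extension; os.path.split separates the directory from the file name:

```python
import os

# os.path.split separates directory from file name
head_tail = os.path.split('asd/asd/asd.xlsx')        # ('asd/asd', 'asd.xlsx')

# os.path.splitext isolates the final suffix, even with multiple dots
root_ext = os.path.splitext('archive.tar.gz')        # ('archive.tar', '.gz')
ext = os.path.splitext('asd/asd/asd.xlsx')[1]        # '.xlsx'
```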
I have many different strings (which are files) that look like this:
20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc
20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc
20201225_00_ec_op_snowfall_romaniawheat_timeseries.nc
And many more. I want to be able to loop through all of these files and store their file path in a dictionary. To do that, I want the key to be the text that is between the last two instances of an underscore. For example, if this was the file 20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc, then the dict would be
{'argentinacorn': 'path/to/file/20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc'}
How can I loop through and do this pythonically?
You can use regexes to extract the key from the strings like this:
import re
input_string = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc"
dict_key = re.findall(".*_(.+)_[^_]+", input_string)[0]
gives
'argentinacorn'
Or with just a simple split:
dict_key = input_string.split("_")[-2]
Regarding file names, you can get the list from current working directory like this:
import os
file_names = os.listdir()
You can just loop through this list and apply the split/regex as shown above.
A simple split and pick:
parts = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc".split("_")
key = parts[-2]
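Putting the pieces together, the loop-into-a-dictionary step the question asks about could look like this (the file names and directory here are hypothetical, for illustration only):

```python
import os

# hypothetical file names and base directory
file_names = [
    "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc",
    "20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc",
]
base_dir = "path/to/file"

# key = text between the last two underscores, value = full path
paths = {name.split("_")[-2]: os.path.join(base_dir, name) for name in file_names}
```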
I am trying to write the list below into a txt file in Python.
However, I'm also trying to remove the entries that contain 'xxx' from the list, preferably using some sort of if condition: if a URL contains 'xxx', remove it from the list.
Any idea on how to approach this issue?
TTF = ('abc.com/648','xxx.com/246','def.com/566','ghi.com/624','xxx.com/123')
TTF = ('abc.com/648','xxx.com/246','def.com/566','ghi.com/624','xxx.com/123')
filtered = tuple(filter(lambda e: "xxx" not in e, TTF))
print(filtered)
Similar to Green Cloak Guy's answer, but using filter instead.
A simple filtered list comprehension. Strings support in for substring matching, so you can check whether a string contains xxx with just 'xxx' in string.
The result:
TTF_without_xxx = tuple(s for s in TTF if 'xxx' not in s)
# ('abc.com/648', 'def.com/566', 'ghi.com/624')
I have a list of of video links. Some of these links are almost duplicates, meaning they contain almost the same link except that it has x_480.mp4 instead of x.mp4. Not all links have those "_480" at the end.
How can I clean the list to get only the ones that end with _480.mp4, removing their alternate versions, and keep the ones without a _480.mp4 version?
Example:
videos=["VfeHB0sga.mp4","G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Expected result:
["G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Note: all links end with .mp4. Also, there are no _480.mp4 files without an original one.
By the way len(videos) is 243.
You can do it in two lines of code:
to_remove = {fn[:-8] + '.mp4' for fn in videos if fn.endswith('_480.mp4')}
cleaned = [fn for fn in videos if fn not in to_remove]
The first line uses a set comprehension to extract all of the _480.mp4 filenames, converting them to their unwanted short versions. They are stored in a set for quick searching.
The second line uses a list comprehension to filter out the unwanted filenames.
I'd probably go the dict route to not have to check for existence of items in a list (would become a (performance) problem for large lists). For instance:
list({v[:-8] if v.endswith("_480.mp4") else v[:-4]: v
for v in sorted(videos)}.values())
That is a compact way of saying: construct a dictionary whose key is the incoming v without its last 8 characters for values ending with "_480.mp4", or otherwise without its last four characters, and whose value is the full incoming string. Then take just the values of that dictionary; since the input was a list, I've passed them to the list constructor to get the same type as output.
Or broken down for easier reading, it could look something like this:
videos = ["x.mp4", "y.mp4", "z.mp4", "x_480.mp4"]
video_d = {}
for video_name in sorted(videos):
    if video_name.endswith("_480.mp4"):
        video_d[video_name[:-8]] = video_name
    else:
        video_d[video_name[:-4]] = video_name
new_videos = list(video_d.values())
It uses a virtual base name (stripping _480.mp4 or .mp4) as dictionary key. Since you do not care about resulting list order, we've made sure _480 suffixed entries are sorted after the "plain" entries. That way if they appear, they overwrite keys created for values without _480 suffix.
This should work. It loops through the videos until it finds one which ends with "_480.mp4". It then strips the "_480" part to reconstruct the original title, and removes that title from the list if present.
videos = ["x.mp4", "y.mp4", "z.mp4", "x_480.mp4"]
# Loop over a copy, since removing from a list while iterating over it skips items
for video in videos[:]:
    if "_480.mp4" in video:
        # Removes the "_480" part of the video title
        original = video.replace("_480", "")
        if original in videos:
            videos.remove(original)
print(videos)
You can even do this with a one-liner list comprehension.
[x for x in videos if x.split('.')[0] + '_480.mp4' not in videos]
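Checked against the example data from the question:

```python
videos = ["VfeHB0sga.mp4", "G9uKZiNm.mp4", "VfeHB0sga_480.mp4", "kvlX4Fa4.mp4"]

# keep a video only if no "_480" variant of it exists in the list
cleaned = [x for x in videos if x.split('.')[0] + '_480.mp4' not in videos]
print(cleaned)  # ['G9uKZiNm.mp4', 'VfeHB0sga_480.mp4', 'kvlX4Fa4.mp4']
```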
I am a beginner in Python and I have been working on a script that accesses two types of files (dcd and inp files), combines them, and creates a new list with the matching strings.
I got stuck near the beginning. I want to get all the dcd files here. They share the .dcd extension, but the first part of each name differs, so I was wondering whether there is a way to access them after I have split the string.
#collect all dcd files into a list
list1 = []
for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
I want to get only names with dcd extension that are indexed [5] and create a new list or mutate this one, but I am not sure how to do that.
p.s I have just posted first part of the code
Thank you !
Just use sort with a sort key, os.path.basename, which extracts only the basename of the file to perform the sort:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
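A toy example (made-up paths) shows what the key changes: a plain sort would order by directory first, while key=os.path.basename orders by file name alone:

```python
import os

# made-up paths: the directory part would otherwise dominate the sort order
paths = ["b/x2.dcd", "a/x1.dcd", "c/x0.dcd"]

by_basename = sorted(paths, key=os.path.basename)
print(by_basename)  # ['c/x0.dcd', 'a/x1.dcd', 'b/x2.dcd']
```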
So this worked. I just added del filename1[:5] to get rid of other unnecessary string parts:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key=os.path.basename)
for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key=os.path.basename):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
    del filename1[:5]
    print(filename1)
Your sort function is applied to file name parts. This is not what you want. If I understand well you want to sort the filename list, not the parts of the filename.
The code given by Jean François is great but I guess you'd like to get your own code working.
You need to extract the file name by using only the last part of the split
A split returns a list of strings. Each item is a part of the original.
filename = filename.split('/')[-1]
This line will get you the last part of the split
Then you can add that part to your list
And after all that you can sort your list
Hope this helps!
This is my python file:-
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0, int(testInput)):
    testCaseInput = testCase.readline(-6)
    testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this, which obviously I'm missing out on.
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
A negative argument to the readline method specifies the number of bytes to read. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
    full_lines = f.readlines()

# parse full lines to get the text to the right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]
numcases = int(lines[0])
for i in range(1, len(lines), 2):
    caseinput = lines[i]
    caseoutput = lines[i+1]
    ...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
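Filling in the loop body to build exactly the variables the question asks for (the file contents are inlined as a string here so the sketch is self-contained; in practice they would come from readlines() as above):

```python
# sample file contents, inlined for illustration
text = """TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Output-1,1,2,3,5,8,13
"""

# keep only the text to the right of the first "-" on each line
lines = [line.partition('-')[2].rstrip() for line in text.splitlines()]

test_no = int(lines[0])
testCaseInput = [int(lines[i]) for i in range(1, len(lines), 2)]
testCaseOutput = [[int(n) for n in lines[i].split(',')]
                  for i in range(2, len(lines), 2)]

print(test_no)         # 2
print(testCaseInput)   # [5, 7]
print(testCaseOutput)  # [[1, 1, 2, 3, 5], [1, 1, 2, 3, 5, 8, 13]]
```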
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = []
testCaseOut = []
for line in testInput:
    if line.startswith("Input"):
        testCaseIn.append(giveMeAList(line.split("-")[1]))
    elif line.startswith("Output"):
        testCaseOut.append(giveMeAList(line.split("-")[1]))
where giveMeAList() is a function that takes a comma-separated string of numbers and generates a list from it.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
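With the sample file contents inlined as a string, the unzip step produces two tuples, one of inputs and one of outputs:

```python
import re

# sample file contents, inlined for illustration
s = "Input-5\nOutput-1,1,2,3,5\nInput-7\nOutput-1,1,2,3,5,8,13\n"

# findall returns a list of (input, output) pairs; zip(*...) transposes it
inputs, outputs = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n", s))

print(inputs)   # ('5', '7')
print(outputs)  # ('1,1,2,3,5', '1,1,2,3,5,8,13')
```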
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input, output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n", s):
    expectedResults = output.split(",")
    testResults = runTest(input)
    # compare testResults and expectedResults ...
This line has an error:
Ouput-1,1,2,3,5,8,13 # it should be 'Output', not 'Ouput'
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-", ""))
for i in range(0, int(testInput)):
    testCaseInput = testCase.readline().replace("Input-", "")
    testCaseOutput = testCase.readline().replace("Output-", "").split(",")