Python list folders list by numeric order - python

I'm getting a list of all the folders in a specific directory with this code:
TWITTER_FOLDER = os.path.join('TwitterPhotos')
dirs = [d for d in os.listdir(TWITTER_FOLDER) if os.path.isdir(os.path.join(TWITTER_FOLDER, d))]
This is the array: ['1','2','3','4','5','6','7','8','9','10','11'].
And I want to get the array in this order: ['11','10','9','8','7','6','5','4','3','2','1']
So I use this code for that:
dirs.sort(key=lambda f: int(filter(str.isdigit, f)))
and when I use it I get this error:
int() argument must be a string, a bytes-like object or a number, not 'filter'
Any idea what the problem is? Or how can I sort it another way?
It's important that the array is sorted in numeric order, like:
12 11 10 9 8 7 6 5 4 3 2 1
And not:
9 8 7 6 5 4 3 2 12 11 1
Thanks!

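In Python 3, filter returns an iterator rather than a string, so you need to join the characters back into a string before you can convert it to an integer:
dirs.sort(key=lambda f: int(''.join(filter(str.isdigit, f))))
This sorts in ascending order; pass reverse=True as well to get the descending order you asked for.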

Use sorted with key:
In [4]: sorted(f, key=lambda x: int(x), reverse=True)
Out[4]: ['11', '10', '9', '8', '7', '6', '5', '4', '3', '2', '1']
Or you can do f.sort(key=lambda x: int(x), reverse=True) for an in-place sort (here f is the list of folder names).
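The two answers above can be combined into one small sketch. The helper name numeric_key and the sample list are assumptions made for illustration, not part of the original posts, and treating a name without digits as -1 is just one possible choice.

```python
def numeric_key(name):
    """Return the digits found in `name` as an int (hypothetical helper)."""
    # In Python 3, filter() yields an iterator, so join it back into a string.
    digits = ''.join(filter(str.isdigit, name))
    # Assumption: a name without any digit gets -1, so it ends up last
    # when the list is sorted in descending order.
    return int(digits) if digits else -1


# Sample folder names in the lexicographic order a directory listing may give.
dirs = ['1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']
dirs.sort(key=numeric_key, reverse=True)
print(dirs)
```

Because the key extracts digits first, the same helper would also cope with names such as 'photo12', whereas key=int only works when every name is purely numeric.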

Related

Get the number of specific dictionary values being stored in a 2-dimensional list in Python

I have a dictionary that contains fruits as keys and a 2-dimensional list including the line number and the timestamp, in which the fruit name occurs in the transcript file, as values. The 2-dimensional list is needed because the fruits appear several times in the file and I need to consider each single occurrence. The dictionary looks like this:
mydict = {
'apple': [['1', '00:00:03,950'], # 1
['1', '00:00:03,950'], # 2
['9', '00:00:24,030'], # 3
['11', '00:00:29,640']], # 4
'banana': [['20', '00:00:54,449']], # 5
'cherry': [['14', '00:00:38,629']], # 6
'orange': [['2', '00:00:06,840'], # 7
['2', '00:00:06,840'], # 8
['3', '00:00:09,180'], # 9
['4', '00:00:10,830']], # 10
}
Now I would like to print the total number of fruit occurrences, so the desired result is 10. In other words, I want to count the inner [line number, timestamp] lists across all keys, not the individual items inside them (the numbered comments show what I mean).
For this purpose, I tried:
print(len(mydict.values()))
But this code line just gives me the number 4 as result.
And the following code does not work for me either:
count = 0
for x in mydict:
    if isinstance(mydict[x], list):
        count += len(mydict[x])
print(count)
Does anyone have an idea how to get the number 10?
You can obtain the lengths of the sub-lists by mapping them to the len function, and then add them up by passing the resulting sequence of lengths to the sum function:
sum(map(len, mydict.values()))
@blhsing's solution is the best. If you want to keep it with loops, you can do:
mydict = {
'apple': [['1', '00:00:03,950'], # 1
['1', '00:00:03,950'], # 2
['9', '00:00:24,030'], # 3
['11', '00:00:29,640']], # 4
'banana': [['20', '00:00:54,449']], # 5
'cherry': [['14', '00:00:38,629']], # 6
'orange': [['2', '00:00:06,840'], # 7
['2', '00:00:06,840'], # 8
['3', '00:00:09,180'], # 9
['4', '00:00:10,830']], # 10
}
n_fruits = 0
for fruit, occurrences_of_fruit in mydict.items():
    # increment n_fruits by the number of occurrences of the fruit
    # BTW occurrences_of_fruit and mydict[fruit] are the same thing
    n_fruits += len(occurrences_of_fruit)
print(n_fruits) # 10
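To make the difference between the two counts concrete, here is a short sketch that reuses the dictionary from the question in a more compact layout. The names n_keys, per_fruit and total are introduced only for this illustration.

```python
mydict = {
    'apple': [['1', '00:00:03,950'], ['1', '00:00:03,950'],
              ['9', '00:00:24,030'], ['11', '00:00:29,640']],
    'banana': [['20', '00:00:54,449']],
    'cherry': [['14', '00:00:38,629']],
    'orange': [['2', '00:00:06,840'], ['2', '00:00:06,840'],
               ['3', '00:00:09,180'], ['4', '00:00:10,830']],
}

# len() of the values view counts one entry per key, i.e. per fruit.
n_keys = len(mydict.values())

# Counting the inner lists per fruit and summing gives the total occurrences.
per_fruit = {fruit: len(rows) for fruit, rows in mydict.items()}
total = sum(per_fruit.values())

print(n_keys)
print(per_fruit)
print(total)
```

Keeping the per-fruit counts around can be handy if the breakdown is needed later; otherwise sum(map(len, mydict.values())) gives the same total in one line.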

How to use apply for two pandas column including lists to return index in a list in one column using the element in another column?

I have a pandas DataFrame with columns "a" and "b". Each value in column "a" is a list, and each value in column "b" is a list with a single element that might appear in the corresponding list in column "a". Using apply, I want to create a new column "c" that holds the position of the element of b within a (c = (index of b in a) + 1).
Column b is always a list with one element or no element at all. Column a can be of any length, but if it is empty, column b is empty as well. The element in column b is expected to be in column a, and I just want to find the position of its first occurrence in column a.
a                  b      c
['1', '2', '5']    ['2']  2
['2','3','4']      ['4']  3
['2','3','4']      []     0
[]                 []     0
...
I wrote a for loop which works fine but it is pretty slow:
for i in range(0,len(df)):
    if len(df['a'][i])!=0:
        df['c'][i]=df['a'][i].index(*df['b'][i])+1
    else:
        df['c'][i]=0
But I want to use apply to make it faster. The following does not work; any thoughts or suggestions would be greatly appreciated.
df['c']=df['a'].apply(df['a'].index(*df['b']))
First of all, here is a basic method using .apply().
import pandas as pd
import numpy as np
list_a = [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []]
list_b = [['2'], ['4'], [], []]
df_1 = pd.DataFrame(data=zip(list_a, list_b), columns=['a', 'b'])
# Replace empty lists with NaN and unwrap the single element of b
df_1['a'] = df_1['a'].map(lambda x: x if x else np.nan)
df_1['b'] = df_1['b'].map(lambda x: x[0] if x else np.nan)
#df_1['b'] = df_1['b'].map(lambda x: next(iter(x), np.nan))
def calc_c(curr_row: pd.Series) -> int:
    # 0 when either column is empty, otherwise the 1-based position of b in a
    if curr_row['a'] is np.nan or curr_row['b'] is np.nan:
        return 0
    else:
        return curr_row['a'].index(curr_row['b']) + 1
df_1['c'] = df_1[['a', 'b']].apply(func=calc_c, axis=1)
df_1 result:
    a                b    c
--  ---------------  ---  ---
0   ['1', '2', '5']  2    2
1   ['2', '3', '4']  4    3
2   ['2', '3', '4']  nan  0
3   nan              nan  0
I replaced the empty lists with NaN, which I find far more idiomatic and practical.
This is obviously not an ideal solution; I will try to find something else. The more information we have about your program and the DataFrame, the better.
By reading in the data so that the column values are lists, I am able to create an apply function that computes the values for c:
import io, ast
import pandas as pd
#a b
#['1','2','5'] ['2']
#['2','3','4'] ['4']
#['2','3','4'] []
#[] []
csvfile=io.StringIO("""a b
['1','2','5'] ['2']
['2','3','4'] ['4']
['2','3','4'] []
[] []""")
df = pd.read_csv(csvfile, sep=' ', converters={'a' : ast.literal_eval, 'b' : ast.literal_eval })
def a_b_index(hm):
    # 1-based position of the single element of b within a; 0 when b is empty
    if hm.b != []:
        return hm.a.index(hm.b[0]) + 1
    else:
        return 0
df['c'] = df.apply(a_b_index, axis=1)
print(df)
# a b c
#0 [1, 2, 5] [2] 2
#1 [2, 3, 4] [4] 3
#2 [2, 3, 4] [] 0
#3 [] [] 0
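The per-row rule from the question (c is the index of b's element in a, plus one, and 0 when either list is empty) can also be written as a plain function and tested without pandas. The name position_in is hypothetical, and returning 0 when the element is missing from a is an assumption, since the question says that case is not expected.

```python
def position_in(a, b):
    """1-based position of b's single element in a, or 0 (hypothetical helper)."""
    if not a or not b:
        # Either list is empty: the question asks for 0 here.
        return 0
    try:
        # list.index returns the first occurrence, 0-based, so add 1.
        return a.index(b[0]) + 1
    except ValueError:
        # Assumption: an element that is missing from `a` is also reported as 0.
        return 0


# Sample rows taken from the table in the question.
rows_a = [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []]
rows_b = [['2'], ['4'], [], []]
c = [position_in(a, b) for a, b in zip(rows_a, rows_b)]
print(c)
```

With a DataFrame, the same comprehension could be written as df['c'] = [position_in(a, b) for a, b in zip(df['a'], df['b'])]. This typically avoids the cost of building a Series for every row that apply(axis=1) incurs, though the actual speed-up depends on the data.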

How to append a list after looping over a dataframe column?

Assuming I have a dataframe as follows:
df = pd.DataFrame({ 'ids' : ['1', '1', '1', '1', '2', '2', '2', '3', '3'],
'values' : ['5', '8', '7', '12', '2', '1', '3', '15', '4']
}, dtype='int32')
ids values
1 5
1 7
1 8
1 12
2 1
2 3
2 2
3 4
3 15
What I would like to do is loop over the values column, check which values are greater than 6, and append the corresponding id from the ids column to a list.
Even if an id (say 3) has multiple values (4 and 15) and only one of them is greater than 6, I would like that id to be appended to the list.
Example:
Assuming we run a loop over the above mentioned dataframe df, I would like the output as follows:
more = [1, 3]
less = [2]
with more =[] and less = [] being pre-initialized empty lists
What I have so far:
I tried implementing this, but I am surely making a mistake somewhere. The code I have:
less = []
more = []
for value in df['values']:
    for id in df['ids']:
        if (value > 6):
            more.append(id)
        else:
            less.append(id)
Use groupby and boolean indexing to create your lists. This will be much faster than looping:
g = df.groupby('ids')['values'].max()
mask = g.gt(6)
more = g[mask].index.tolist()
less = g[~mask].index.tolist()
print(more)
print(less)
[1, 3]
[2]
You can use boolean indexing to select the rows whose value is greater than 6 and collect their ids into a set of unique ids using:
setA = set(df[df['values'] > 6]['ids'])
This will create a set of all ids in the dataframe:
setB = set(df['ids'])
Now,
more = list(setA)
and for less, take the set difference:
less = list(setB.difference(setA))
That's it!
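The set logic above can be sketched in plain Python, with two lists standing in for the ids and values columns. The names threshold and ids_above are assumptions for illustration. Sets are unordered, so the sketch sorts the results to get a stable order.

```python
# Plain lists standing in for the two DataFrame columns from the question.
ids = [1, 1, 1, 1, 2, 2, 2, 3, 3]
values = [5, 8, 7, 12, 2, 1, 3, 15, 4]

threshold = 6

# An id belongs to `more` as soon as any one of its values exceeds the threshold.
ids_above = {i for i, v in zip(ids, values) if v > threshold}

# Every remaining id goes to `less`; sorted() makes the order deterministic.
more = sorted(ids_above)
less = sorted(set(ids) - ids_above)

print(more)
print(less)
```

This pairs each value with the id on the same row, which is what the nested loop in the question fails to do: it combines every value with every id.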

make a list of matrices from a list of data

I have a hypothetical list of data from a .txt document
0 1 2
3 4 5
6 7 8
(value)
I want to make 2 lists of matrices like this
list 1 list 2
[0, 1, 2] [[0],[1],[2]]
[3, 4, 5] [[3],[4],[5]]
[6, 7, 8] [[6],[7],[8]]
Each item in list 1 must be able to give me a dot product when multiplying by a previously declared matrix.
new list 1
dot([0,1,2],mat)
dot([3,4,5],mat)
dot([6,7,8],mat)
I have tried appending the 3 values from the .txt document into a list as a matrix (this was done within a for loop)
list1.append ([value[0], value[1], value[2]]
but it didn't work. It gave me an error.
Any help is appreciated.
You may achieve it via list comprehension as:
>>> my_string = '0 1 2 3 4 5 6 7 8'
>>> row = 3
>>> my_list = my_string.split() # split the string based on space ' '
>>> list_1 = [my_list[i*3:(i*3)+row] for i in range(row)]
>>> list_1 # Output for example 1
[['0', '1', '2'],
['3', '4', '5'],
['6', '7', '8']]
>>> list_2 = [[[c] for c in my_list[i*3:(i*3)+row]] for i in range(row)]
>>> list_2 # Output for example 2
[[['0'], ['1'], ['2']],
[['3'], ['4'], ['5']],
[['6'], ['7'], ['8']]]
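The comprehensions above leave every element as a string, which cannot be used in a dot product. A minimal sketch that reads the rows line by line and converts them to integers might look like this. The text variable stands in for the file contents, mat is a hypothetical matrix, and the dot helper is a plain-Python stand-in, not the dot function the question refers to.

```python
# Stand-in for the contents of the .txt file from the question.
text = """0 1 2
3 4 5
6 7 8"""

# list 1: one row vector of ints per line.
list_1 = [[int(tok) for tok in line.split()] for line in text.splitlines()]

# list 2: the same numbers as column vectors, one single-element list per entry.
list_2 = [[[x] for x in row] for row in list_1]


def dot(row, matrix):
    """Row vector times matrix in plain Python (illustrative helper)."""
    # zip(*matrix) iterates over the columns of the matrix.
    return [sum(r * m for r, m in zip(row, col)) for col in zip(*matrix)]


# Hypothetical 3x3 matrix; the question does not show the real one.
mat = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 2]]

products = [dot(row, mat) for row in list_1]
print(list_1)
print(list_2)
print(products)
```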
As far as I can see, you forgot a closing bracket.
This:
list1.append ([value[0], value[1], value[2]]
Should be:
list1.append ([value[0], value[1], value[2]])
Edit: This answer is probably not the one you are looking for. Reading the post again, the problem does not seem to be the error, but something else.
It is not really clear what your question is. Maybe you could add some complete code instead of just samples to better explain the problem.

Separate a file in paragraphs

I have a file like this:
cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
I want to split this file into paragraphs by cluster number, like this:
cluster number 1
1
2
3
I tried to search for an answer but couldn't work it out.
Thanks for your help!
Use a regular expression:
import re
input_text = "..."
r = re.findall(r"(cluster number (\d+)\n\n(\d+)\n\n(\d+)\n\n(\d+))", input_text)
print(r)
This code returns the list below:
[('cluster number 1\n\n1\n\n2\n\n3', '1', '1', '2', '3'),
('cluster number 2\n\n1\n\n2\n\n3', '2', '1', '2', '3')]
You can also see the detailed explanation here.
As recommended, you should use regular expressions. Perhaps the re.split function would be suitable here (x holds the contents of the file):
>>> l = re.split(r'cluster number (?:\d+)', x)[1:]
>>> [a.split() for a in l]
[['1', '2', '3'], ['1', '2', '3'], ...]
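For comparison, here is a sketch that splits the text without regular expressions by starting a new paragraph at every header line. The function name split_clusters and its header_prefix parameter are assumptions for illustration. Unlike the \d+ patterns above, it also accepts a non-numeric header such as "cluster number x".

```python
def split_clusters(text, header_prefix="cluster number"):
    """Group lines into paragraphs, one per header line (hypothetical helper)."""
    paragraphs = []
    for line in text.splitlines():
        if line.startswith(header_prefix):
            # A header line opens a new paragraph.
            paragraphs.append([line])
        elif paragraphs and line.strip():
            # Non-empty lines belong to the most recent paragraph;
            # blank lines and anything before the first header are skipped.
            paragraphs[-1].append(line)
    return paragraphs


# Stand-in for the file contents shown in the question.
sample = """cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
"""

paragraphs = split_clusters(sample)
for paragraph in paragraphs:
    print("\n".join(paragraph))
    print()
```

For a real file, the text could be read with open(path).read(), where path is whatever file name applies.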
