Separate a file in paragraphs - python

I have a file like this:
cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
I want to split this file into paragraphs, one per cluster number, like this:
cluster number 1
1
2
3
I tried searching for an answer, but I couldn't work it out.
Thanks for your help!

Use a regular expression:
import re
input_text = "..."
r = re.findall(r"(cluster number (\d+)\n\n(\d+)\n\n(\d+)\n\n(\d+))", input_text)
print(r)
This code returns the list below (the pattern assumes the lines are separated by blank lines):
[('cluster number 1\n\n1\n\n2\n\n3', '1', '1', '2', '3'),
('cluster number 2\n\n1\n\n2\n\n3', '2', '1', '2', '3')]
You can also see a detailed explanation here.

As recommended, you should use regular expressions. Perhaps the re.split function would be suitable here (x holds the file contents as a string):
>>> l = re.split(r'cluster number (?:\d+)', x)[1:]
>>> [a.split() for a in l]
[['1', '2', '3'], ['1', '2', '3'], ...]
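If the headers are not always numeric (the sample ends with "cluster number x"), a plain loop may be easier than a regex. This is only a minimal sketch: the function name split_clusters and its marker parameter are made up for illustration, and it assumes every paragraph starts with a line beginning with "cluster number".

```python
def split_clusters(lines, marker="cluster number"):
    """Group lines into paragraphs, opening a new one at each header line."""
    paragraphs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                        # ignore blank lines
        if line.startswith(marker):
            paragraphs.append([line])       # a header opens a new paragraph
        elif paragraphs:
            paragraphs[-1].append(line)     # body lines join the current one
    return paragraphs


sample = """cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
"""

for paragraph in split_clusters(sample.splitlines()):
    print("\n".join(paragraph))
    print()
```

To read from a real file, pass the open file object instead of sample.splitlines(), since iterating over a file also yields one line at a time.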

Related

Get the number of specific dictionary values being stored in a 2-dimensional list in Python

I have a dictionary that contains fruits as keys and a 2-dimensional list including the line number and the timestamp, in which the fruit name occurs in the transcript file, as values. The 2-dimensional list is needed because the fruits appear several times in the file and I need to consider each single occurrence. The dictionary looks like this:
mydict = {
'apple': [['1', '00:00:03,950'], # 1
['1', '00:00:03,950'], # 2
['9', '00:00:24,030'], # 3
['11', '00:00:29,640']], # 4
'banana': [['20', '00:00:54,449']], # 5
'cherry': [['14', '00:00:38,629']], # 6
'orange': [['2', '00:00:06,840'], # 7
['2', '00:00:06,840'], # 8
['3', '00:00:09,180'], # 9
['4', '00:00:10,830']], # 10
}
Now I would like to print the total number of fruit occurrences, so my desired result is 10. In other words, I want to count the inner lists (one per occurrence), not the individual items inside them (the comments should clarify what I mean).
For this purpose, I tried:
print(len(mydict.values()))
But this code line just gives me the number 4 as result.
And the following code does not work for me either:
count = 0
for x in mydict:
    if isinstance(mydict[x], list):
        count += len(mydict[x])
print(count)
Does anyone have an idea how to get the number 10?
You can obtain the lengths of the sub-lists by mapping them to the len function, and then add them up by passing the resulting sequence of lengths to the sum function:
sum(map(len, mydict.values()))
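As a small illustration of why len(mydict.values()) counts keys while the mapped version counts occurrences, here is a sketch on a shortened, made-up version of the dictionary (two keys, three occurrences in total):

```python
mydict = {
    "apple": [["1", "00:00:03,950"], ["9", "00:00:24,030"]],
    "banana": [["20", "00:00:54,449"]],
}

n_keys = len(mydict.values())                      # one value per key
n_occurrences = sum(map(len, mydict.values()))     # inner lists, summed
same_total = sum(len(v) for v in mydict.values())  # generator-expression form

print(n_keys, n_occurrences, same_total)
```

The generator expression is equivalent to the map call; which one to use is a matter of taste.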
@blhsing's solution is the best. If you want to keep using loops, you can do:
mydict = {
'apple': [['1', '00:00:03,950'], # 1
['1', '00:00:03,950'], # 2
['9', '00:00:24,030'], # 3
['11', '00:00:29,640']], # 4
'banana': [['20', '00:00:54,449']], # 5
'cherry': [['14', '00:00:38,629']], # 6
'orange': [['2', '00:00:06,840'], # 7
['2', '00:00:06,840'], # 8
['3', '00:00:09,180'], # 9
['4', '00:00:10,830']], # 10
}
n_fruits = 0
for fruit, occurrences_of_fruit in mydict.items():
    # increment n_fruits by the number of occurrences of the fruit
    # BTW occurrences_of_fruit and mydict[fruit] are the same thing
    n_fruits += len(occurrences_of_fruit)
print(n_fruits) # 10

Create a dictionary from lists, overwrite duplicate keys

I have my code below. I am trying to create a dictionary from my lists extracted from a txt file but the loop overwrites the previous information:
f = open('data.txt','r')
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('data.txt')]
columns=lines.pop(0)
for i in range(len(lines)):
    lines[i]=lines[i].split(',')
dictt={}
for line in lines:
    dictt[line[0]]=line[1:]
print('\n')
print(lines)
print('\n')
print(dictt)
I know I have to play with:
for line in lines:
    dictt[line[0]] = line[1:]
but what can I do? Do I have to use numpy? If so, how?
My lines list is :
[['USS-Enterprise', '6', '6', '6', '6', '6'],
['USS-Voyager', '2', '3', '0', '4', '1'],
['USS-Peres', '10', '4', '0', '0', '5'],
['USS-Pathfinder', '2', '0', '0', '1', '2'],
['USS-Enterprise', '2', '2', '2', '2', '2'],
['USS-Voyager', '2', '1', '0', '1', '1'],
['USS-Peres', '8', '5', '0', '0', '4'],
['USS-Pathfinder', '4', '0', '0', '2', '1']]
My dict becomes:
{'USS-Enterprise': ['2', '2', '2', '2', '2'],
'USS-Voyager': ['2', '1', '0', '1', '1'],
'USS-Peres': ['8', '5', '0', '0', '4'],
'USS-Pathfinder': ['4', '0', '0', '2', '1']}
It keeps only the last values for each key, but I want to add the values together. I am really confused.
You are trying to append multiple values for the same key. You can use defaultdict for that, or modify your code to use the get method of dictionaries:
for line in lines:
    dictt[line[0]] = dictt.get(line[0], []) + line[1:]
This looks up each key, assigns line[1:] if the key is new, and, if it is a duplicate, appends those values to the ones already stored. Note that the lists are joined with + rather than list.extend, because extend modifies the list in place and returns None.
dict_output = {}
for line in list_input:
    if line[0] not in dict_output:
        dict_output[line[0]] = line[1:]
    else:
        dict_output[line[0]] += line[1:]
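The approaches above concatenate the lists. If "add the values together" instead means an element-wise sum per ship, something like the following sketch may be closer. It assumes every row for a given ship has the same number of columns and that all values are integers; the variable name totals is made up for illustration.

```python
lines = [
    ['USS-Enterprise', '6', '6', '6', '6', '6'],
    ['USS-Voyager', '2', '3', '0', '4', '1'],
    ['USS-Enterprise', '2', '2', '2', '2', '2'],
    ['USS-Voyager', '2', '1', '0', '1', '1'],
]

totals = {}
for name, *values in lines:
    numbers = [int(v) for v in values]      # convert the strings to ints
    if name in totals:
        # add column by column to what is already stored for this ship
        totals[name] = [a + b for a, b in zip(totals[name], numbers)]
    else:
        totals[name] = numbers

print(totals)
```

No numpy is needed for this; zip pairs up the stored totals with the new row one column at a time.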
EDIT: You subsequently clarified in comments that your input has duplicate keys, and you want later rows to overwrite earlier ones.
ORIGINAL ANSWER: The input is not a dictionary, it's a CSV file. Just use pandas.read_csv() to read it:
import pandas as pd
df = pd.read_csv('my.csv', sep=r'\s+', header=None)
df
                0   1  2  3  4  5
0  USS-Enterprise   6  6  6  6  6
1     USS-Voyager   2  3  0  4  1
2       USS-Peres  10  4  0  0  5
3  USS-Pathfinder   2  0  0  1  2
4  USS-Enterprise   2  2  2  2  2
5     USS-Voyager   2  1  0  1  1
6       USS-Peres   8  5  0  0  4
7  USS-Pathfinder   4  0  0  2  1
It seems your input didn't have a header row. If your input columns have names, you can add them with df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E'] or similar.
If you really want to write a dict output (beware of duplicate keys being suppressed), see df.to_dict()

Python if/else statement confusion

How can you create an if/else statement in Python when you have a file with both text and numbers? Let's say I want to replace the values in the third-to-last column of the file below. I want to create an if/else statement that replaces values <5, or a dot ".", with a zero, and, if possible, use that value as an integer for a sum.
A quick and dirty solution using awk would look like this, but I'm curious about how to handle this type of data with Python:
awk -F"[ :]" '{if ( (!/^#/) && ($9<5 || $9==".") ) $9="0" ; print }'
So how do you solve this problem?
Thanks
Input file:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:1:.:3
sample2 1 4 3 5 1:3:2:.:3:3
sample3 2 4 6 7 .:0:6:5:4:0
Desired output:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:0:0:3
sample2 1 4 3 5 1:3:2:0:3:3
sample3 2 4 6 7 .:0:6:5:4:0
SUM = 5
Result so far
['sample1', '1', '2', '3', '4', '1', '0', '2', '0', '0', '3\n']
['sample2', '1', '4', '3', '5', '1', '3', '2', '0', '3', '3\n']
['sample3', '2', '4', '6', '7', '.', '0', '6', '5', '4', '0']
Here's what I have tried so far:
import re
data=open("inputfile.txt", 'r')
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(":.",":0")
        final_nodots=re.split('\t|:',nodots)
        if (int(final_nodots[8]))<5:
            final_nodots[8]="0"
            print (final_nodots)
        else:
            print(final_nodots)
data=open("inputfile.txt", 'r')
import re
sums = 0
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(".","0")
        final_nodots=list(re.findall(r'\d:.+\d+',nodots)[0])
        if (int(final_nodots[6]))<5:
            final_nodots[6]="0"
        print(final_nodots)
        sums += int(final_nodots[6])
print(sums)
You were pretty close, but your final_nodots is the result of a split on ":" rather than a split on the first few numbers, so your 8 should have been a 3. After that, just add a sums counter to keep track of that slot.
['sample1 1 2 3 4 1', '0', '2', '0', '0', '3\n']
There are better ways to achieve what you want but I just wanted to fix your code.
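One possible cleaner version is sketched below. It mirrors the awk one-liner (only the fourth colon-separated value is tested) rather than either snippet above. The function name zero_small_values and its parameters are made up for illustration, and it assumes the colon-separated block is the last whitespace-separated column; the output is re-joined with single spaces.

```python
def zero_small_values(lines, field_index=3, threshold=5):
    """Zero one colon-separated field when it is '.' or below the threshold.

    Returns the rewritten lines and the sum of that field after replacement.
    """
    out, total = [], 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#") or not line.strip():
            out.append(line)                 # comments and blanks pass through
            continue
        *columns, packed = line.split()      # last column holds a:b:c:...
        fields = packed.split(":")
        value = fields[field_index]
        if value == "." or int(value) < threshold:
            fields[field_index] = "0"
        total += int(fields[field_index])
        out.append(" ".join(columns + [":".join(fields)]))
    return out, total


sample = [
    "##Comment1",
    "#Header",
    "sample1 1 2 3 4 1:0:2:1:.:3",
    "sample2 1 4 3 5 1:3:2:.:3:3",
    "sample3 2 4 6 7 .:0:6:5:4:0",
]
rewritten, total = zero_small_values(sample)
print("\n".join(rewritten))
print("SUM =", total)
```

Checking for "." before calling int() avoids the ValueError that int(".") would raise, which is the main thing the if/else has to handle with mixed text and numbers.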

make a list of matrices from a list of data

I have a hypothetical list of data from a .txt document
0 1 2
3 4 5
6 7 8
(value)
I want to make 2 lists of matrices like this
list 1 list 2
[0, 1, 2] [[0],[1],[2]]
[3, 4, 5] [[3],[4],[5]]
[6, 7, 8] [[6],[7],[8]]
Each item in list 1 must be able to give me a dot product when multiplying by a previously declared matrix.
new list 1
dot([0,1,2],mat)
dot([3,4,5],mat)
dot([6,7,8],mat)
I have tried appending the 3 values from the .txt document into a list as a matrix (this was done within a for loop)
list1.append ([value[0], value[1], value[2]]
but it didn't work. It gave me an error.
Any help is appreciated.
You may achieve it via list comprehension as:
>>> my_string = '0 1 2 3 4 5 6 7 8'
>>> row = 3
>>> my_list = my_string.split() # split the string based on space ' '
>>> list_1 = [my_list[i*3:(i*3)+row] for i in range(row)]
>>> list_1 # Output for example 1
[['0', '1', '2'],
['3', '4', '5'],
['6', '7', '8']]
>>> list_2 = [[[c] for c in my_list[i*3:(i*3)+row]] for i in range(row)]
>>> list_2 # Output for example 2
[[['0'], ['1'], ['2']],
[['3'], ['4'], ['5']],
[['6'], ['7'], ['8']]]
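The lists above hold strings, so they would need converting to numbers before any dot product. The following is a rough sketch in plain Python, assuming the file has one row per line; the dot helper and the identity matrix mat are stand-ins for whatever matrix and dot function were declared earlier in the real code.

```python
text = """0 1 2
3 4 5
6 7 8"""

# Row vectors: one list of ints per line of the file.
list_1 = [[int(tok) for tok in line.split()] for line in text.splitlines()]

# Column vectors: each number wrapped in its own single-element row.
list_2 = [[[value] for value in row] for row in list_1]


def dot(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * m for v, m in zip(vector, column)) for column in zip(*matrix)]


mat = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # hypothetical matrix (identity)
new_list_1 = [dot(row, mat) for row in list_1]
print(new_list_1)
```

With the identity matrix the result simply equals list_1, which makes it easy to check that the rows were parsed as intended.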
As far as I can see, you forgot a closing parenthesis.
This:
list1.append ([value[0], value[1], value[2]]
Should be:
list1.append ([value[0], value[1], value[2]])
Edit: This answer is probably not the one you are looking for. Rereading the post, the problem does not seem to be the error but something else.
It is not really clear what your question is. Maybe you could add some complete code instead of just samples to better explain the problem.

Python: split by (different) n spaces

I have lines like this:
2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000
6 30 139 "guid" Other name^7 0 ip.a.dd.res:port 932 25000
I would like to split this, but the problem is that there is a different number of spaces between these "words"...
How can I do this?
Python's split function doesn't care about the number of spaces:
>>> ' 2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'.split()
['2', '20', '164', '"guid"', 'Some', 'name^7', '0', 'ip.a.dd.res:port', '-21630', '25000']
Have you tried split()? It will "compress" the spaces, so after the split you will get:
'2', '20', '164', '"guid"', etc.
>>> l = "1 2 4 'ds' 5 66"
>>> l
"1 2 4 'ds' 5 66"
>>> l.split(' ')
['1', '', '', '2', '', '', '4', "'ds'", '5', '', '66']
>>> [x for x in l.split()]
['1', '2', '4', "'ds'", '5', '66']
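Just use the split() function with no arguments. It splits on runs of whitespace (the equivalent of the regex \s+), so any kind and any number of whitespace characters act as one delimiter.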
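To make the comparison concrete, here is a short sketch. The spacing in the sample line is invented, since the original spacing is not known.

```python
import re

line = '2   20  164 "guid"   Some name^7   0 ip.a.dd.res:port  -21630 25000'

by_split = line.split()                     # no argument: any whitespace run
by_regex = re.split(r"\s+", line.strip())   # explicit \s+ delimiter
by_single_space = line.split(" ")           # one space: yields empty strings

print(by_split == by_regex)
print(by_single_space[:4])
```

Note that a name containing a space ("Some name^7") still comes out as two items either way.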
