I have this text:
>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A
When I split it and remove the line breaks, I get a list that looks like this:
['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
I'm trying to merge all the strings between each part that starts with '>', such that it looks like:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
This is what I have so far, but it doesn't do anything and I'm lost:
my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        temp = ''
        while my_list[item] != '>':
            temp += my_list[item]
            result.append(temp)
print(result)
Andrej has given a compact solution for your problem, but I want to help you by pointing out some issues in your original code.
You have a while inside the if, but when my_list[item] starts with '>', the inner while won't work as intended. The correct thing is to add an else branch to concatenate the following strings.
You append temp to result at every iteration, but at that point temp is not yet the fully concatenated string. The correct time to append is when you meet a '>' again.
After fixing these issues, you may get something like this:
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        if item != 0:
            result.append(temp)
        temp = ''
    else:
        temp += my_list[item]
if item != 0:
    result.append(temp)
print(result)
You can simplify it further:
Avoid the list indexing by iterating over the list directly.
Avoid the final check after the loop by adding a sentinel.
result = []
concat_string = ''  # just a more readable name
for string in my_list + ['>']:  # iterate over the list directly and add a sentinel
    if string[0] == '>':  # or string.startswith('>')
        if concat_string:
            result.append(concat_string)
        concat_string = ''
    else:
        concat_string += string
print(result)
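With the sample my_list from the question, this prints:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']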
You can use itertools.groupby for the task:
from itertools import groupby
lst = [
    ">A1",
    "KKKKKKKK",
    "DDDDDDDD",
    ">A2",
    "FFFFFFFF",
    "FFFFOOOO",
    "DAA",
    ">A3",
    "OOOZDDD",
    "KKAZAAA",
    "A",
]
out = []
for k, g in groupby(lst, lambda s: s.startswith(">")):
    if not k:
        out.append("".join(g))
print(out)
Prints:
["KKKKKKKKDDDDDDDD", "FFFFFFFFFFFFOOOODAA", "OOOZDDDKKAZAAAA"]
Regex version:
data = """>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A"""
import re
patre = re.compile("^>.+\n",re.MULTILINE)
#split on `>xxx`
chunks = patre.split(data)
#remove whitespaces and newlines
blocks = [v.replace("\n","").strip() for v in chunks]
#get rid of leading trailing empty blocks
blocks = [v for v in blocks if v]
print(blocks)
output:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
I am trying to check every name in my list to see whether both the first name and the surname start with a capital letter. I know how I could check it if the names were in two separate lists, but I have this source of data:
data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
The wanted result is: Johnny Loom, Joseph Monroe
This is what I have now, but it doesn't work:
def capital(str_list):
    right = []
    wrong = []
    for n in str_list:
        if n[0].isupper():
            right.append(n)
        if not n[0].isupper():
            wrong.append(n)
    return right
Thanks for any advice on how to solve it, guys!
You can use the built-in all function to check if all the words in a name start with an uppercase letter:
right = [
    name
    for name in data
    if all(i[0].isupper() for i in name.split())
]
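A quick check with the data list from the question (assuming data is defined as above):
print(right)
# ['Johnny Loom', 'Joseph Monroe']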
There is a built-in string method called istitle() for this in Python:
data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
output = []
for x in data:
    if x.istitle():
        output.append(x)
print(output)
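One caveat worth noting: istitle() requires each word to be strictly title-cased (one leading capital followed only by lowercase), so a name with an extra capital inside a word would be rejected:
print('Johnny McLoom'.istitle())  # False
print('Johnny Loom'.istitle())    # True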
You can use split() to get the first and last name from each element and then use and to check both:
def capital(str_list):
    right = []
    for n in str_list:
        first, last = n.split()
        if first[0].isupper() and last[0].isupper():
            right.append(n)
    return right
data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
print(capital(data))
This logic works for names with any number of words using split() and isupper():
def isCapital(name):
    name_splited = name.split(' ')
    i = 0
    for names in name_splited:
        if names[0].isupper():
            i = i + 1
    if i == len(name_splited):
        return True
    else:
        return False
data_right = [i for i in data if isCapital(i)]
data_right
# ['Johnny Loom', 'Joseph Monroe']
I have a very long csv file with repeated blocks of information, however it's not perfectly regular:
T,2002,12,03,09,22,54
B,35,77,27,34,190,400,341,3447,940.3,303.5
G,3229987,41014,25,3447,1784033,21787,16,3447,940.3,303.5
R,3273751,46609,6452,3447,1810631,45933,6382,3447,940.3,303.5
D,NBXX,31,4.267,6.833,6.646,2.270,9.975,3.987
Y,194669,940.3,303.5,298.4,11.6,12.9,5.8,7,0000
T,2002,12,03,09,27,56
B,3520252,76702,297,3447,1906319,39865,305,3447,940.4,303.6
G,3231611,40449,13,3447,1785214,21650,25,3447,940.4,303.6
R,3273277,46425,6431,3447,1813279,45613,6425,3447,940.4,303.6
D,NBXX,28,-6.813,4.314,5.826,1.527,2.997,-9.648
Y,194767,940.4,303.6,298.4,11.4,12.9,5.8,9,0000
Z,2.782e-5,1.512e-5,1.195e-5,1.415e-5,8.290e-6,1.232e-5,2.319e-5
T,2002,12,03,09,32,59
.
.
.
The information isn't completely regular and some of the 'D' lines contain more or fewer than the normal number of elements (most, if not all, 'D' lines contain 9 elements), e.g.:
['D', 'ZBXX', '110', '2.590e-5', '1.393e-5', '1.032e-5e-6']
['D', 'ZBXX', '118', '2.641e-5', '1.402e-5', '1.027e-5', '1.237e-5',
'6.553e-6', '9.466', '290.9', '6.1', '12.0', '6.2', '7', '0000']
And I want it to look like:
Time [yy-mm-dd-hh-mm-ss] D[3] D[4] D[5] D[6] D[7] D[8] Y[4] Y[5]
2002-12-03-09-22-54 4.267 6.833 6.646 2.270 9.975 3.987 303.5
2002-12-03-09-27-56 -6.813 4.314 5.826
2002-12-03-09-32-59
This is the code I have thus far:
import io
import numpy as np
import pandas as pd

year_i = np.array(1999)  # Start year
dataframe_rows = []
for x in range(1, 6):  # we have 5 files
    # Create the name of the file that changes within the loop
    year_str = 'nef' + str(year_i)
    start = 'C:\\Users\\'
    end = ".dat"
    name_file = start + year_str + end  # concat strings
    file_ = open(name_file, 'r+').readlines()
    rows = ""
    for i in range(len(file_)):
        if file_[i].startswith('Z'):  # ignore lines starting with 'Z'
            continue
        string = file_[i]
        if file_[i].startswith('B'):  # ignore lines starting with 'B'
            continue
        string = file_[i]
        if file_[i].startswith('G'):  # ignore lines starting with 'G'
            continue
        string = file_[i]
        if file_[i].startswith('R'):  # ignore lines starting with 'R'
            continue
        string = file_[i]
        if "T," in string:
            if len(rows) > 0:
                dataframe_rows.append(rows[:-1])
                rows = ""
            string = file_[i].replace("\n", "").replace("\r", "")
            string = string[2:].replace(",", "-")
            rows += string + ","
        #if "D," in string:
            # I want to select the last 6 elements and convert them into columns
        #if file_[i].startswith('Y'):
            # I want to select the 3rd, 5th, 6th and last elements and convert them into columns
        else:
            string = file_[i].replace("\n", "").replace("\r", "")
            aux_row += string[2:] + ","
    year_i += 1  # counter

fixed_rows = []
for row in dataframe_rows:
    if len(row.split(",")) == 18:
        fixed_rows.append(row)
df = pd.read_csv(io.StringIO('\n'.join(fixed_rows)))
Assuming:
You always want the last 6 values of the D lines whatever their length, plus the 3rd, 5th, 6th and last values of the Y lines whatever their length (note that the first element is the letter itself, so the ith value corresponds to the (i+1)th element)
T, D and Y lines always exist
I would do something like this (here the input file is just treated as a text file rather than a CSV, so memory usage stays reasonable):
from datetime import datetime
import pandas as pd

finName = 'testInput.csv'
foutName = 'testOutput.csv'
colNames = ['date', 'D[-6]', 'D[-5]', 'D[-4]', 'D[-3]', 'D[-2]', 'D[-1]',
            'Y[4]', 'Y[6]', 'Y[7]', 'Y[-1]']
df = pd.DataFrame(columns=colNames)
dictionary = {}
with open(finName, 'rt') as fin:
    for line in fin:
        if line.startswith('T'):
            dictionary['date'] = datetime(*list(map(int, line.split(',')[1:7])))
        elif line.startswith('D'):
            shortLine = line.split(',')[-6:]
            for i in range(-6, 0):
                colName = 'D[' + str(i) + ']'
                dictionary[colName] = float(shortLine[i])
        elif line.startswith('Y'):
            fullLine = line.split(',')
            for i in [4, 6, 7, -1]:
                colName = 'Y[' + str(i) + ']'
                dictionary[colName] = float(fullLine[i])
            df = df.append(dictionary, ignore_index=True)  # one finished row per 'Y' line
df.to_csv(foutName)
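Note that DataFrame.append is deprecated and was removed in pandas 2.0. Here is a minimal sketch of the same idea for newer pandas, collecting each finished record in a plain list and building the frame once at the end; it assumes the same finName, foutName and colNames as above, and (an extra assumption) resets the record at every new 'T' line so values never carry over between blocks:
from datetime import datetime
import pandas as pd

rows = []
dictionary = {}
with open(finName, 'rt') as fin:
    for line in fin:
        if line.startswith('T'):
            # start a fresh record for each 'T' line
            dictionary = {'date': datetime(*map(int, line.split(',')[1:7]))}
        elif line.startswith('D'):
            shortLine = line.split(',')[-6:]
            for i in range(-6, 0):
                dictionary['D[' + str(i) + ']'] = float(shortLine[i])
        elif line.startswith('Y'):
            fullLine = line.split(',')
            for i in [4, 6, 7, -1]:
                dictionary['Y[' + str(i) + ']'] = float(fullLine[i])
            rows.append(dictionary)  # one finished record per 'Y' line
df = pd.DataFrame(rows, columns=colNames)  # missing keys become NaN
df.to_csv(foutName)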
If D lines do not always have at least 6 values (I think this is your last question), here is an alternative where the 'D[-i]' columns are filled with values when they exist or nan when they don't. In the imports at the beginning of the script, you should add from numpy import nan, then replace the block under elif line.startswith('D'): with:
fullLine = line.split(',')
for i in range(-6, 0):
    colName = 'D[' + str(i) + ']'
    try:
        dictionary[colName] = float(fullLine[i])
    except (IndexError, ValueError):
        dictionary[colName] = nan
According to your expected output, you require data from the lines starting with "T", "D" and "Y".
The following lines can help (assuming there are the same number of T, D and Y lines in the input file):
import datetime
.....
file_ = open(name_file, 'r+').readlines()
values = [line.split(",") for line in file_]
T_data = [str(datetime.datetime(int(line[1]),int(line[2]),int(line[3]),int(line[4]),int(line[5]),int(line[6]))) for line in values if line[0]=="T"]
D_data = [[line[3], line[4], line[5], line[6],line[7],line[8]] for line in values if line[0]=="D"]
Y_data = [[line[3],line[4]] for line in values if line[0]=="Y"]
processed_data = [[T_data[i]]+D_data[i]+Y_data[i] for i in range(len(T_data))]
for line in processed_data:
    print(line)
Update
import datetime
.....
data = []
item = {}
with open(name_file, 'r+') as file:
    for textline in file:
        line = textline.split(",")
        if line[0] == "T":
            if "T" in item.keys():
                if "D" not in item.keys():
                    item["D"] = ["Nan", "Nan", "Nan", "Nan", "Nan", "Nan"]
                if "Y" not in item.keys():
                    item["Y"] = ["Nan", "Nan", "Nan"]
                data.append(item)  # append a dictionary object with "T", "D" and "Y" keys
                item = {}
                #data.append(item["T"] + item["D"] + item["Y"])
            item["T"] = str(datetime.datetime(int(line[1]), int(line[2]), int(line[3]),
                                              int(line[4]), int(line[5]), int(line[6])))
        elif line[0] == "D":
            #item["D"] = [line[3], line[4], line[5], line[6], line[7], line[8]]
            #Use negative indices if you need the last elements
            item["D"] = [line[-6], line[-5], line[-4], line[-3], line[-2], line[-1]]
        elif line[0] == "Y":
            item["Y"] = [line[-6], line[-5], line[-3]]
    if "T" in item:  # don't forget the final block after the loop ends
        data.append(item)
Given some code:
keyword=re.findall(r'ke\w+ = \S+',s)
score=re.findall(r'sc\w+ = \S+',s)
print '%s,%s' %(keyword,score)
The output of above code is:
['keyword = NORTH', 'keyword = GUESS', 'keyword = DRESSES', 'keyword = RALPH', 'keyword = MATERIAL'],['score = 88466', 'score = 83965', 'score = 79379', 'score = 74897', 'score = 68168']
But I want the output on separate lines, in a format like:
NORTH,88466
GUESS,83935
DRESSES,83935
RALPH,73379
MATERIAL,68168
Instead of the last line, do this:
>>> for k, s in zip(keyword, score):
        kw = k.partition('=')[2].strip()
        sc = s.partition('=')[2].strip()
        print '%s,%s' % (kw, sc)
NORTH,88466
GUESS,83965
DRESSES,79379
RALPH,74897
MATERIAL,68168
Here is how it works:
The zip brings the corresponding elements together pairwise.
The partition splits a string like 'keyword = NORTH' into three parts (the part before the equal sign, the equal sign itself, and the part after). The [2] keeps only the last part.
The strip removes leading and trailing whitespace.
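For example, with one of the matched strings:
>>> 'keyword = NORTH'.partition('=')
('keyword ', '=', ' NORTH')
>>> 'keyword = NORTH'.partition('=')[2].strip()
'NORTH'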
Alternatively, you can modify your regexes to do much of the work for you by using groups to capture the keywords and scores without the surrounding text:
keywords = re.findall(r'ke\w+ = (\S+)',s)
scores = re.findall(r'sc\w+ = (\S+)',s)
for keyword, score in zip(keywords, scores):
    print '%s,%s' %(keyword,score)
Hope this will help:
keyword = ['NORTH','GUESS','DERESSES','RALPH']
score = [88466,83935,83935,73379]
for key,value in zip(keyword,score):
print "%s,%s" %(key,value)
One way would be to zip() the two lists together (to iterate over them pairwise) and use str.partition() to grab the data after the =, like this:
def after_equals(s):
    return s.partition(' = ')[-1]

for k,s in zip(keyword, score):
    print after_equals(k) + ',' + after_equals(s)
If you don't want to call after_equals() twice, you could refactor to:
for pair in zip(keyword, score):
    print ','.join(after_equals(data) for data in pair)
If you want to write to a text file (you really should have mentioned this in the question, not in your comments on my answer), then you can take this approach...
with open('output.txt', 'w+') as output:
    for pair in zip(keyword, score):
        output.write(','.join(after_equals(data) for data in pair) + '\n')
Output:
% cat output.txt
NORTH,88466
GUESS,83965
DRESSES,79379
RALPH,74897
MATERIAL,68168