Python code to read a line, split it, match and print

Please note the input file has data in the following format.
COM*IP**Home*57667*work*5646578*cell*45767~
I need Python code to split this line on the separator *; then, if the 4th element is Home, work or cell, the next element ("57667" in this case) should be captured accordingly. The same check should be done for the 6th and 8th elements, and the element following each should be printed. How can I achieve this?
if line.startswith('COM*'):
    s = tuple(line.split('*'))
    if s[3] == 'Home':
        phone = s[4]
        phone.strip()

You should not use explicit subscripts unless absolutely necessary. This is more flexible:
s = 'COM*IP**Home*57667*work*5646578*cell*45767~'
# take the entire string if it does not contain a tilde;
# if there is one or more tildes, take everything that precedes the first occurrence
t, *_ = s.split('~')
tokens = t.split('*')
HWC = {'work', 'home', 'cell'}
for i in range(3, len(tokens) - 1, 2):
    if tokens[i].lower() in HWC:
        print(tokens[i + 1])
Output:
57667
5646578
45767

It is unclear what your expected output is. If your expected output is this {'Home': '57667', 'Work': '5646578', 'Cell': '45767'}, then try this:
line = 'COM*IP**Home*57667*work*5646578*cell*45767~'
data = line.split('*')
phone_numbers = {}
if data[3] == 'Home':
    phone_numbers['Home'] = data[4]
if data[5] == 'work':
    phone_numbers['Work'] = data[6]
if data[7] == 'cell':
    phone_numbers['Cell'] = data[8]
print(phone_numbers)
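If you want that dictionary output without hard-coding each index, the loop idea from the first answer can also build it. The following is only an illustrative sketch combining the two answers, not code from either of them:
line = 'COM*IP**Home*57667*work*5646578*cell*45767~'
tokens = line.split('~')[0].split('*')
phone_numbers = {}
# walk the label/number pairs starting at the 4th element
for i in range(3, len(tokens) - 1, 2):
    label = tokens[i]
    if label.lower() in {'home', 'work', 'cell'}:
        phone_numbers[label.capitalize()] = tokens[i + 1]
print(phone_numbers)  # {'Home': '57667', 'Work': '5646578', 'Cell': '45767'}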


How to group list items based on a specific condition?

I have this text:
>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A
When I split it and remove the line breaks, I get a list that looks like this:
['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
I'm trying to merge all the strings between each part that starts with '>', such that it looks like:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
What I have so far, but it doesn't do anything and I'm lost:
my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        temp = ''
        while my_list[item] != '>':
            temp += my_list[item]
            result.append(temp)
print(result)
@Andrej has given compact code for your problem, but I want to help you by pointing out some issues in your original code.
You have the while nested inside the if, but my_list[item] is a whole string like '>A1', not '>', and item never changes inside the loop, so the inner while does not do what you want. The correct thing is to add an else branch that concatenates the following strings.
You append the string temp to result at each iterative step, but at that point temp is not yet the fully concatenated string. The correct time to append is when you meet the next '>'.
After solving these, you may get something like this:
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        if item != 0:
            result.append(temp)
        temp = ''
    else:
        temp += my_list[item]
if item != 0:
    result.append(temp)
print(result)
You can simplify it further: iterate over the list directly instead of indexing into it, and avoid the final check after the loop by adding a sentinel.
result = []
concat_string = ''  # just a more readable name
for string in my_list + ['>']:  # iterate over the list directly and add a sentinel
    if string[0] == '>':  # or string.startswith('>')
        if concat_string:
            result.append(concat_string)
        concat_string = ''
    else:
        concat_string += string
print(result)
You can use itertools.groupby for the task:
from itertools import groupby

lst = [
    ">A1",
    "KKKKKKKK",
    "DDDDDDDD",
    ">A2",
    "FFFFFFFF",
    "FFFFOOOO",
    "DAA",
    ">A3",
    "OOOZDDD",
    "KKAZAAA",
    "A",
]

out = []
for k, g in groupby(lst, lambda s: s.startswith(">")):
    if not k:
        out.append("".join(g))
print(out)
Prints:
["KKKKKKKKDDDDDDDD", "FFFFFFFFFFFFOOOODAA", "OOOZDDDKKAZAAAA"]
Regex version:
data = """>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A"""
import re
patre = re.compile("^>.+\n",re.MULTILINE)
#split on `>xxx`
chunks = patre.split(data)
#remove whitespaces and newlines
blocks = [v.replace("\n","").strip() for v in chunks]
#get rid of leading trailing empty blocks
blocks = [v for v in blocks if v]
print(blocks)
output:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']

How to check the correct capitalization of first and last names in a list?

I am trying to check, for every entry in my list, whether both the first name and the surname start with the right (capital) first letter. I would know how to check it if they were in two separate lists, but I have this source of data:
data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
And the wanted result must be: Johnny Loom, Joseph Monroe
This is what I have now, but it doesn't work:
def capital(str_list):
    right = []
    wrong = []
    for n in str_list:
        if n[0].isupper():
            right.append(n)
        if not n[0].isupper():
            wrong.append(n)
    return right
Thanks for any advice on how to solve it!
You can use the built-in all function to check whether every word in the name starts with an uppercase letter.
right = [
    name
    for name in data
    if all(i[0].isupper() for i in name.split())
]
There is a built-in string method called istitle() for this in Python:
data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
output = []
for x in data:
    if x.istitle():
        output.append(x)
print(output)
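Note that istitle() is stricter than only checking the first letter of each word: an uppercase character may only follow an uncased character. A small illustration of where the two approaches differ (not part of the answers above, the name is just an example):
name = 'Ronald McDonald'
print(name.istitle())                                   # False, because of the internal capital D
print(all(word[0].isupper() for word in name.split()))  # True, only first letters are checked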
You can use split() to get the first and last name from each element and then use and to check both conditions:
def capital(str_list):
    right = []
    for n in str_list:
        first, last = n.split()
        if first[0].isupper() and last[0].isupper():
            right.append(n)
    return right

data = ['Johnny Loom', 'frank Cedder', 'Marylin monroe', 'Joseph Monroe']
print(capital(data))
This logic works for names with any number of words using split() and isupper():
def isCapital(name):
    name_splited = name.split(' ')
    i = 0
    for names in name_splited:
        if names[0].isupper():
            i = i + 1
    if i == len(name_splited):
        return True
    else:
        return False

data_right = [i for i in data if isCapital(i)]
data_right
# ['Johnny Loom', 'Joseph Monroe']

How do I import a csv file containing repeated blocks of information, where for example a row beginning with 'T' refers to time, etc.?

I have a very long csv file with repeated blocks of information; however, it's not perfectly regular:
T,2002,12,03,09,22,54
B,35,77,27,34,190,400,341,3447,940.3,303.5
G,3229987,41014,25,3447,1784033,21787,16,3447,940.3,303.5
R,3273751,46609,6452,3447,1810631,45933,6382,3447,940.3,303.5
D,NBXX,31,4.267,6.833,6.646,2.270,9.975,3.987
Y,194669,940.3,303.5,298.4,11.6,12.9,5.8,7,0000
T,2002,12,03,09,27,56
B,3520252,76702,297,3447,1906319,39865,305,3447,940.4,303.6
G,3231611,40449,13,3447,1785214,21650,25,3447,940.4,303.6
R,3273277,46425,6431,3447,1813279,45613,6425,3447,940.4,303.6
D,NBXX,28,-6.813,4.314,5.826,1.527,2.997,-9.648
Y,194767,940.4,303.6,298.4,11.4,12.9,5.8,9,0000
Z,2.782e-5,1.512e-5,1.195e-5,1.415e-5,8.290e-6,1.232e-5,2.319e-5
T,2002,12,03,09,32,59
.
.
.
The information isn't completely regular, and some of the 'D' lines contain more or fewer than the normal number of elements; e.g. most if not all 'D' lines contain 9 elements, but I also see lines like:
['D', 'ZBXX', '110', '2.590e-5', '1.393e-5', '1.032e-5e-6']
['D', 'ZBXX', '118', '2.641e-5', '1.402e-5', '1.027e-5', '1.237e-5',
'6.553e-6', '9.466', '290.9', '6.1', '12.0', '6.2', '7', '0000']
And I want it to look like:
Time [yy-mm-dd-hh-mm-ss]  D[3]    D[4]   D[5]   D[6]   D[7]   D[8]   Y[4]  Y[5]
2002-12-03-09-22-54       4.267   6.833  6.646  2.270  9.975  3.987        303.5
2002-12-03-09-27-56      -6.813   4.314  5.826
2002-12-03-09-32-59
This is the code I have thus far:
import io
import numpy as np
import pandas as pd

year_i = np.array(1999)  # Start year
dataframe_rows = []
for x in range(1, 6):  # we have 5 files
    # Create the name of the file that changes within the loop
    year_str = 'nef' + str(year_i)
    start = 'C:\\Users\\'
    end = ".dat"
    name_file = start + year_str + end  # concat strings
    file_ = open(name_file, 'r+').readlines()
    rows = ""
    for i in range(len(file_)):
        string = file_[i]
        if file_[i].startswith('Z'):  # ignore lines starting with 'Z'
            continue
        if file_[i].startswith('B'):  # ignore lines starting with 'B'
            continue
        if file_[i].startswith('G'):  # ignore lines starting with 'G'
            continue
        if file_[i].startswith('R'):  # ignore lines starting with 'R'
            continue
        if "T," in string:
            if len(rows) > 0:
                dataframe_rows.append(rows[:-1])
                rows = ""
            string = file_[i].replace("\n", "").replace("\r", "")
            string = string[2:].replace(",", "-")
            rows += string + ","
        # if "D," in string:
        #     I want to select the last 6 elements and convert them into columns
        # if file_[i].startswith('Y'):
        #     I want to select the 3rd, 5th, 6th and last elements and convert them into columns
        else:
            string = file_[i].replace("\n", "").replace("\r", "")
            rows += string[2:] + ","
    year_i += 1  # counter

fixed_rows = []
for row in dataframe_rows:
    if len(row.split(",")) == 18:
        fixed_rows.append(row)

df = pd.read_csv(io.StringIO('\n'.join(fixed_rows)))
Assuming:
You always want the last 6 values of the D lines whatever their length, plus the 3rd, 5th, 6th and last values of the Y lines whatever their length (note that the first element is the letter itself, so the ith value corresponds to the (i+1)th element)
T, D and Y lines always exist
I would do something like this (here the input file is treated as plain text rather than specifically as csv, and memory is used reasonably):
from datetime import datetime
import pandas as pd

finName = 'testInput.csv'
foutName = 'testOutput.csv'

colNames = ['date', 'D[-6]', 'D[-5]', 'D[-4]', 'D[-3]', 'D[-2]', 'D[-1]',
            'Y[4]', 'Y[6]', 'Y[7]', 'Y[-1]']
df = pd.DataFrame(columns=colNames)
dictionary = {}

with open(finName, 'rt') as fin:
    for i, line in enumerate(fin, 1):
        if line.startswith('T'):
            dictionary['date'] = datetime(*list(map(int, line.split(',')[1:7])))
        elif line.startswith('D'):
            shortLine = line.split(',')[-6:]
            for i in range(-6, 0):
                colName = 'D[' + str(i) + ']'
                dictionary[colName] = float(shortLine[i])
        elif line.startswith('Y'):
            fullLine = line.split(',')
            for i in [4, 6, 7, -1]:
                colName = 'Y[' + str(i) + ']'
                dictionary[colName] = float(fullLine[i])
            df = df.append(dictionary, ignore_index=True)

df.to_csv(foutName)
If D lines do not always have at least 6 values (I think this is your last question), here is an alternative where the 'D[-i]' columns are filled with values when they exist and with nan when they don't. In the imports at the beginning of the script, add from numpy import nan, then replace the block under elif line.startswith('D'): with:
fullLine = line.split(',')
for i in range(-6, 0):
    colName = 'D[' + str(i) + ']'
    try:
        dictionary[colName] = float(fullLine[i])
    except:
        dictionary[colName] = nan
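As a side note, DataFrame.append was removed in pandas 2.0, so on a recent pandas the same idea is usually written by collecting one dictionary per record in a plain list and building the frame once at the end. A minimal sketch under that assumption, with values copied from the sample data purely for illustration:
import pandas as pd

rows = []  # inside the loop, do rows.append(dict(dictionary)) instead of df.append
rows.append({'date': '2002-12-03 09:22:54', 'D[-1]': 3.987, 'Y[4]': 298.4})
rows.append({'date': '2002-12-03 09:27:56', 'D[-1]': -9.648, 'Y[4]': 298.4})

df = pd.DataFrame(rows)  # one construction call replaces the repeated df.append
df.to_csv('testOutput.csv')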
According to your expected output, you need data from the lines starting with "T", "D" and "Y".
The following lines can help (assuming there are the same number of T, D and Y lines in the input file):
import datetime
.....
file_ = open(name_file, 'r+').readlines()
values = [line.split(",") for line in file_]
T_data = [str(datetime.datetime(int(line[1]), int(line[2]), int(line[3]), int(line[4]), int(line[5]), int(line[6]))) for line in values if line[0] == "T"]
D_data = [[line[3], line[4], line[5], line[6], line[7], line[8]] for line in values if line[0] == "D"]
Y_data = [[line[3], line[4]] for line in values if line[0] == "Y"]
processed_data = [[T_data[i]] + D_data[i] + Y_data[i] for i in range(len(T_data))]
for line in processed_data:
    print(line)
Update
import datetime
.....
data = []
item = {}
with open(name_file, 'r+') as file:
    for textline in file:
        line = textline.split(",")
        if line[0] == "T":
            if "T" in item.keys():
                if "D" not in item.keys():
                    item["D"] = ["Nan", "Nan", "Nan", "Nan", "Nan", "Nan"]
                if "Y" not in item.keys():
                    item["Y"] = ["Nan", "Nan", "Nan"]
                data.append(item)  # append a dictionary object with "T", "D" and "Y" keys
                item = {}
                # data.append(item["T"] + item["D"] + item["Y"])
            item["T"] = str(datetime.datetime(int(line[1]), int(line[2]), int(line[3]), int(line[4]), int(line[5]), int(line[6])))
        elif line[0] == "D":
            # item["D"] = [line[3], line[4], line[5], line[6], line[7], line[8]]
            # Use negative indexes if you need the last elements
            item["D"] = [line[-6], line[-5], line[-4], line[-3], line[-2], line[-1]]
        elif line[0] == "Y":
            item["Y"] = [line[-6], line[-5], line[-3]]
    # append the final record that was still being built when the file ended
    if item:
        data.append(item)

Replacing a character in a string with a set of two possible characters

a = ["0$%","0%%%","0$%$%","0$$"]
The above is a corrupted communication code where the first character of each sequence has been disguised as 0. I want to recover the original, correct code by computing a list of all possible sequences (replacing 0 with either $ or %) and then checking which of the sequences is valid. Think of each sequence as corresponding to a letter if correct. For instance, "$$$" could correspond to the letter "B".
This is what I've done so far
raw_decoded = []
word = []
for i in a:
    for j in i:
        if j == "0":
            x = list(itertools.product(["$", "%"], *i[1:]))
            y = ("".join(i) for i in x)
            for i in y:
                raw_decoded.append(i)
for i in raw_decoded:
    letter = code_dict[i]  # access dictionary for converting to alphabet
    word.append(letter)
return word
Try this:
output = []
for elem in a:
    replaced_dollar = elem.replace('0', '$', 1)
    replaced_percent = elem.replace('0', '%', 1)
    # check replaced_dollar and replaced_percent
    # and then write to output
    output.append(replaced_...)
I'm not sure what you mean; perhaps you could add the desired output. What I got from your question could be solved in the following way:
b = []
for el in a:
    if el[0] == '0':
        b.append(el.replace('0', '%', 1))
        b.append(el.replace('0', '$', 1))
    else:
        b.append(el)
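If a sequence could ever contain more than one disguised character, the itertools.product idea from the question can be generalised to enumerate every candidate. A sketch under that assumption (variable names are illustrative):
import itertools

a = ["0$%", "0%%%", "0$%$%", "0$$"]

decoded = []
for seq in a:
    # for every '0', try both '$' and '%'; other characters stay as they are
    options = [('$', '%') if ch == '0' else (ch,) for ch in seq]
    for combo in itertools.product(*options):
        decoded.append(''.join(combo))
print(decoded)  # ['$$%', '%$%', '$%%%', '%%%%', '$$%$%', '%$%$%', '$$$', '%$$']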

Print in single line in Python

Given some code:
keyword=re.findall(r'ke\w+ = \S+',s)
score=re.findall(r'sc\w+ = \S+',s)
print '%s,%s' %(keyword,score)
The output of above code is:
['keyword = NORTH', 'keyword = GUESS', 'keyword = DRESSES', 'keyword = RALPH', 'keyword = MATERIAL'],['score = 88466', 'score = 83965', 'score = 79379', 'score = 74897', 'score = 68168']
But I want the output to be on separate lines, in a format like:
NORTH,88466
GUESS,83935
DRESSES,83935
RALPH,73379
MATERIAL,68168
Instead of the last line, do this:
>>> for k, s in zip(keyword, score):
...     kw = k.partition('=')[2].strip()
...     sc = s.partition('=')[2].strip()
...     print '%s,%s' % (kw, sc)
NORTH,88466
GUESS,83965
DRESSES,79379
RALPH,74897
MATERIAL,68168
Here is how it works:
The zip brings the corresponding elements together pairwise.
The partition splits a string like 'keyword = NORTH' into three parts (the part before the equals sign, the equals sign itself, and the part after); the [2] keeps only the last part.
The strip removes leading and trailing whitespace.
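For example, here is what partition and strip do to one of the matched strings (just an illustration of the two calls above):
>>> 'keyword = NORTH'.partition('=')
('keyword ', '=', ' NORTH')
>>> 'keyword = NORTH'.partition('=')[2].strip()
'NORTH'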
Alternatively, you can modify your regexes to do much of the work for you by using groups to capture the keywords and scores without the surrounding text:
keywords = re.findall(r'ke\w+ = (\S+)', s)
scores = re.findall(r'sc\w+ = (\S+)', s)
for keyword, score in zip(keywords, scores):
    print '%s,%s' % (keyword, score)
Hope this will help:
keyword = ['NORTH', 'GUESS', 'DRESSES', 'RALPH']
score = [88466, 83935, 83935, 73379]
for key, value in zip(keyword, score):
    print "%s,%s" % (key, value)
One way would be to zip() the two lists together (to iterate over them pairwise) and use str.partition() to grab the data after the =, like this:
def after_equals(s):
    return s.partition(' = ')[-1]

for k, s in zip(keyword, score):
    print after_equals(k) + ',' + after_equals(s)
If you don't want to call after_equals() twice, you could refactor to:
for pair in zip(keyword, score):
    print ','.join(after_equals(data) for data in pair)
If you want to write to a text file (you really should have mentioned this in the question, not in your comments on my answer), then you can take this approach...
with open('output.txt', 'w+') as output:
    for pair in zip(keyword, score):
        output.write(','.join(after_equals(data) for data in pair) + '\n')
Output:
% cat output.txt
NORTH,88466
GUESS,83965
DRESSES,79379
RALPH,74897
MATERIAL,68168
