Parse list to get new list with same structure - python

I applied some earlier code to a log and got the following list:
log = ['',
'',
'ABC KLSC: XYZ',
'',
'some text',
'some text',
'%%ABC KLSC: XYZ',
'some text',
'',
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'',
'some text',
'-----------------------------------------------',
'',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
'',
'',
'Some text',
'',
'Some text',
'',
'--- FINISH'
]
and I want to filter those lines so that I get a list containing only the lines with "=" and the lines laid out in column format (those below the headers QUWERW, WALS, RUSZ, CRORS). Additionally, for the column-format lines, I want to store each value with its corresponding header.
I was able to filter the desired lines with the code below (I'm not sure there is a better condition for picking out the column lines):
d1 = [line for line in log if len(line.split()) > 3 or " = " in line]
d1
>>
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
]
But I don't know how to get the output I'm looking for, shown below. Thanks for any help.
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 98028',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 30310'
]

Finding the = lines is straightforward. One way to handle the column values, shown below, is to identify the header row that contains the headings and then zip the following rows with it after splitting on whitespace.
items_list = []
headers = []
for item in log:
    if '=' in item:
        # 'key = value' lines are kept as they are
        items_list.append(item)
    elif len(item.split()) > 3:
        splits = item.split()
        if all(header in splits for header in ['QUWERW', 'WALS', 'RUSZ', 'CRORS']):
            # remember the header row for the column section
            headers = splits
        else:
            # data row: pair each value with its header
            for lhs, rhs in zip(headers, splits):
                items_list.append(f'{lhs} = {rhs}')
print('\n'.join(items_list))
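Running that loop over the log list from the question should reproduce the desired output exactly; as a quick check (assuming log and items_list are defined as above):
expected = [
    'ID = 5', 'TME = KRE', 'DDFFLE = SOFYU', 'QWWRTYA = GRRZNY',
    'QUWERW = P', 'WALS = <NULL>', 'RUSZ = R', 'CRORS = 98028',
    'QUWERW = P', 'WALS = <NULL>', 'RUSZ = R', 'CRORS = 30310',
]
assert items_list == expected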

Related

Iteration skipping lines in a pandas dataframe

I'm trying to iterate through a whole dataframe that is already organized.
The idea is to find when a regular user has a main_user as well: when the keys I use in the code below match, that user has a main_user.
The problem I have is that some lines are being skipped during the iteration and I can't find the error in the code.
Here's the code I'm using:
dataframe = gf.read_excel_base(path, sheet_name)
organized_dataframe = gf.organize_dataframe(dataframe)

main_user_data = {
    'Nome titular': '',
    'Nome beneficiário': '',
    'Id Plano de Benefícios': '',
    'Id Contratado': ''
}
user_data = {
    'Nome titular': '',
    'Nome beneficiário': '',
    'Id Plano de Benefícios': '',
    'Id Contratado': ''
}

main_user_list = []
user_list = []
for i, a in enumerate(organized_dataframe['Id Contratado']):
    if gf.is_main_user(organized_dataframe, i):
        main_user_data = gf.user_to_dict(organized_dataframe, i)
    else:
        user_data = gf.user_to_dict(organized_dataframe, i)
        print(user_data['Nome beneficiário'])
        if (main_user_data['Nome titular'] and main_user_data['Id Plano de Benefícios'] and main_user_data['Id Contratado']) == (user_data['Nome titular'] and user_data['Id Plano de Benefícios'] and user_data['Id Contratado']):
            print('deu match')
            main_user_list.append(main_user_data['Nome beneficiário'])
            user_list.append(user_data['Nome beneficiário'])
print(user_list)
The resulting list always stops somewhere in the middle of the dataframe; there are plenty of lines that should match the conditions in the code, but somehow the code never reaches them.
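One thing worth double-checking in that condition (just a sketch, not a confirmed fix for the skipped rows): (a and b and c) == (x and y and z) does not compare the three fields pairwise. Each and-chain collapses to a single value, so when the earlier fields are non-empty only 'Id Contratado' ends up being compared. Comparing tuples of the fields states the intended match explicitly:
# Sketch: compare the three key fields as tuples instead of and-chains.
main_key = (main_user_data['Nome titular'],
            main_user_data['Id Plano de Benefícios'],
            main_user_data['Id Contratado'])
user_key = (user_data['Nome titular'],
            user_data['Id Plano de Benefícios'],
            user_data['Id Contratado'])
if main_key == user_key:
    print('deu match')
    main_user_list.append(main_user_data['Nome beneficiário'])
    user_list.append(user_data['Nome beneficiário'])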

How to split the given 'key-value' list into two lists separated as 'keys' and 'values' with python

This is my List
List = ['function = function1', 'string = string1', 'hello = hello1', 'new = new1', 'test = test1']
I need to separate the List into two different lists, 'keys' and 'values':
KeyList = ['function', 'string', 'hello', 'new', 'test']
ValueList = ['function1', 'string1', 'hello1', 'new1', 'test1']
There are different possible approaches. One is the method proposed by Tim, but if you are not familiar with re you could also do:
List = ['function = function1', 'string = string1', 'hello = hello1', 'new = new1', 'test = test1']
KeyList = []
ValueList = []
for item in List:
    val = item.split(' = ')
    KeyList.append(val[0])
    ValueList.append(val[1])
print(KeyList)
print(ValueList)
and the output is:
['function', 'string', 'hello', 'new', 'test']
['function1', 'string1', 'hello1', 'new1', 'test1']
You can simply use split(" = ") and unzip the list of key-value pairs to two tuples:
keys, values = zip(*map(lambda s: s.split(" = "), List))
# keys
# >>> ('function', 'string', 'hello', 'new', 'test')
# values
# >>>('function1', 'string1', 'hello1', 'new1', 'test1')
This is based on the fact that zip(*a_zipped_iterable) works as an unzipping function.
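A tiny illustration of that round trip (the names here are only for the example):
pairs = list(zip(['a', 'b', 'c'], [1, 2, 3]))  # [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)                 # ('a', 'b', 'c') and (1, 2, 3)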
We can use re.findall here:
import re

inp = ['function = function1', 'string = string1', 'hello = hello1', 'new = new1', 'test = test1']
keys = [re.findall(r'(\w+) =', x)[0] for x in inp]
vals = [re.findall(r'\w+ = (\w+)', x)[0] for x in inp]
Or, splitting each entry into (key, value) pairs first:
pairs = [x.split(' = ') for x in inp]
keys = [pair[0] for pair in pairs]
values = [pair[1] for pair in pairs]

Split list into sublists at every occurrence of element starting with specific substring

I have a large list that contains a bunch of strings. I need to sort the elements of the original list into a nested list, determined by their placement in the list. In other words, I need to break the original list into sublists, where each sublist starts at an element beginning with 'ABC' and contains everything up to the next such element, and then collect those sublists into a nested list.
So the original list is:
all_results = ['ABCAccount', 'def = 0', 'gg = 0', 'kec = 0', 'tend = 1234567890', 'ert = abc', 'sed = target', 'id = sadfefsd3g3g24b24b', 'ABCAccount', 'def = 0', 'gg = 0', 'kec = 0', 'tend = NA', 'ert = abc', 'sed = source', 'id = sadfefsd3g3g24b24b', 'ABCAdditional', 'addkey = weds', 'addvalue = false', 'ert = abc', 'sed = target', 'id = sadfefsd3g3g24b24b', 'time_zone = EDT']
And I need to return:
split_results = [['ABCAccount', 'def = 0', 'gg = 0', 'kec = 0', 'tend = 1234567890', 'ert = abc', 'sed = target', 'id = sadfefsd3g3g24b24b'], ['ABCAccount', 'def = 0', 'gg = 0', 'kec = 0', 'tend = NA', 'ert = abc', 'sed = source', 'id = sadfefsd3g3g24b24b'], ['ABCAdditional', 'addkey = weds', 'addvalue = false', 'ert = abc', 'sed = target', 'id = sadfefsd3g3g24b24b', 'time_zone = EDT']]
I have tried the following:
split_results = [l.split(',') for l in ','.join(all_results).split('ABC')]
You can work from your original list directly:
def make_split(lst):
    if len(lst) == 0:
        return []
    r0 = []
    r1 = []
    for s in lst:
        if s.startswith("ABC"):
            if r1:
                r0.append(r1)
                r1 = []
        r1.append(s)
    return r0 + [r1]
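Calling it on the list from the question gives back the nested list shown as split_results:
split_results = make_split(all_results)
# three sublists, each starting at an element that begins with 'ABC':
# [['ABCAccount', 'def = 0', ...], ['ABCAccount', ...], ['ABCAdditional', ...]]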

Reading a dynamic table with pandas

I'm using conda 4.5.11 and python 3.6.3 to read a dynamic list, such as this:
[['Results:',
'2',
'Time:',
'16',
'Register #1',
'Field1:',
'999999999999999',
'Field2:',
'name',
'Field3:',
'some text',
'Field4:',
'number',
'Fieldn:',
'other number',
'Register #2',
'Field1:',
'999999999999999',
'Field2:',
'name',
'Field3:',
'type',
'Field4:',
'some text',
'FieldN:',
'some text',
'Register #N',
...
]]
Here is the code for my best try:
data = []
header = []
data_text = []
for data in res:
    part = data.split(":")
    header_text = part[1]
    data_t = part[2]
    header.append(header_text)
    data_text.append(data_t)
df_data = pd.DataFrame(data_text)
df_header = pd.DataFrame(header)
Output
Field1 Field2 Field3 Field4 Fieldn1 Fieldn2 Fieldn
999999999999999 name sometext number number text number
999999999999999 name sometext number number number NAN
999999999999999 name number NAN number text number
Is it possible to read from a list and concat in one DataFrame?
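A minimal sketch of one way this could work, assuming res is the flat list of strings shown above and that every label ending in ':' is immediately followed by its value (both assumptions, not something stated in the question):
import pandas as pd

records = []
current = None
it = iter(res)
for token in it:
    if token.startswith('Register #'):
        current = {}                  # start a new row for each register
        records.append(current)
    elif token.endswith(':') and current is not None:
        # pair the 'FieldX:' label with the value that follows it
        current[token.rstrip(':')] = next(it, None)

df = pd.DataFrame(records)            # one row per register, one column per field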

Delete index in list if multiple strings are matched

I've scraped a website containing a table and I want to format the headers for my desired final output.
headers = []
for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])
print headers
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n 2016\r\n ']
I don't want the 'Click here to read more' or '2016' headers, so I've done the following:
for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]

for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']
Perfect. But is there a better/neater way of removing the unwanted items? Thanks!
headers = filter(lambda h: not 'Click' in h and not '2016' in h, headers)
If you want to be more generic:
banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)
You can consider using list comprehension to get a new, filtered list, something like:
new_headers = [header for header in headers if '2016' not in header]
If you can be sure that '2016' will always be last:
>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']
import re

pattern = '^Click|^2016'
new = [x for x in headers if not re.match(pattern, str(x).strip())]
