Extracting multiple data from a single list - python

I'm working on a text file that contains several kinds of information. I converted it into a list in Python, and now I'm trying to separate the different data into different lists. The data is laid out as follows:
CODE / DESCRIPTION / Unity / Value1 / Value2 / Value3 / Value4, and then it repeats. An example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach so far has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Looping to find a code, based on the code's structure, and using the code's index as the base for finding the remaining values.
Finding a code is easy; it's a distinct type of information among the other data.
For the remaining values, I made a loop to find the next numeric value after a code. That way I can delimit the rest of the indexes:
The unity would be at the code's index + index until isnumeric - 1, since it's the first item before the first numeric value in each line.
The cost would be at the code's index + index until isnumeric + 2; the third value is the only one I need to store.
The description is a little harder, since the number of elements that compose it varies across the list. So I used slicing, starting at the code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing some problems with this approach. Although there will always be more elements in the list after a code, I'm getting this error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
I'd welcome any help debugging my code, or even new solutions to the problem.
Note: I'm also going to have to do this for a very similar data source, but there the code is just a sequence of 7 digits, and thus harder to find among the other data. Any solution that covers this case is also appreciated!

A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
Also, please use a list of dict items instead of 4 different lists, e.g.:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
    words = [] # non-numeric words: the description followed by the unity
    values = []
    for word in textl.split()[1:]:
        # it splits on whitespace by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            words.append(word)
        else:
            values.append(word)
    # the unity is the last word before the first number; everything before it is the description
    description = ' '.join(words[:-1])
    unity = words[-1] if words else None
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
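Since the layout described in the question is fixed at both ends (one code first, four values last), another option is to slice each record from the right instead of testing which words are numeric. The following is only a sketch under assumptions not confirmed by the question: each record sits on its own line, it always ends with exactly four numeric values, and the unity is a single word. The name parse_record is hypothetical.

```python
def parse_record(line):
    """Split one 'CODE DESCRIPTION... UNITY V1 V2 V3 V4' line into a dict.

    Assumes (hypothetically) exactly four trailing numeric values and a
    one-word unity; the description may contain any number of words.
    """
    tokens = line.split()
    if len(tokens) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(tokens)}: {line!r}")
    code = tokens[0]
    # Everything between the code and the four values: description words, then the unity.
    *description_words, unity = tokens[1:-4]
    values = [float(v) for v in tokens[-4:]]
    return {
        'code': code,
        'description': ' '.join(description_words),
        'unity': unity,
        'values': values,
        'cost': values[2],  # the question says only the third value is needed
    }


record = parse_record('P03133 Auxiliar helper un 203.02 417.54 437.22 675.80')
print(record)
```

Because the code is taken by position rather than by its shape, the same function would also accept a record whose code is a plain 7-digit number.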


How to assign a new value to the list in running loop

My data cutting loop seems to run ok in the loop, but when it prints the result outside the loop, the contents are unchanged. Presuming it's buggy because I'm trying to assign to what the for loop is running through, but I don't know.
For reference, it's a small web review scraper project I'm working on. To get it formatted to CSV with pandas I think all the data needs to be the same length, so I'm cutting any lists that are longer than the shortest. The values "cust_stars_result, rev_result, cust_res" are all lists with basic strings stored inside, in this case with lengths 16, 12, and 15. I try to slice everything down to 12 in the end, but the results are not kept. What is the right/best way to go about this?
star_len = len(cust_stars_result)
rev_len = len(rev_result)
custname_len = len(cust_res)
print('customer name length: ' + str(custname_len) + ' -- review length: ' + str(rev_len) + ' -- star length: ' + str(star_len))
datalen = [star_len, rev_len, custname_len]
print(min(datalen))
datapack = [cust_stars_result, rev_result, cust_res]
# LOOPER FOR CULLING
for data in datapack:
    if len(data) != min(datalen):
        print("operating culler to make data even length")
        print(len(data))
        data = data[: min(datalen)]
        print(len(data)) #this comes out OK
    else:
        print("equal length, skipping culler")
        pass
print(datapack) # prints the original values
Inside your loop you update the data variable but that's just reassigning the value of that variable. You want to do something like
for i, data in enumerate(datapack):
    ...
    datapack[i] = data[: min(datalen)]
This will update the datapack element.
While "trying to assign to what the for loop is running through" is a real issue, in this case the problem is rather that your code is not assigning anything to datapack when you change data. Instead, the loop assigns each item in datapack to data, so when you rebind data, datapack remains unchanged.
Instead, try either adding each item to a new list, and then assigning that new list to datapack:
temp = []
for data in datapack:
    ...
    temp.append(data[:min(datalen)])
datapack = temp
Or try using a range or enumerate loop:
for i, data in enumerate(datapack):
    ...
    datapack[i] = data[:min(datalen)]
There are fancier (but less readable and harder to debug) ways to accomplish what you're doing here (slicing off the end of each list), such as the following, which uses a list comprehension and map:
mindatalen = min(map(len, datapack))
datapack = [data[:mindatalen] for data in datapack]
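The difference between rebinding the loop variable and assigning through an index can be checked with a small experiment. This is a minimal sketch; the three lists are made-up stand-ins with the lengths mentioned in the question (16, 12 and 15), not the scraper's real data.

```python
# Stand-in data with the lengths from the question.
datapack = [list(range(16)), list(range(12)), list(range(15))]
shortest = min(len(data) for data in datapack)

# Rebinding: `data` is pointed at a new, shorter list on each iteration,
# but the element stored inside datapack is never replaced.
for data in datapack:
    data = data[:shortest]
lengths_after_rebinding = [len(data) for data in datapack]
print(lengths_after_rebinding)  # [16, 12, 15]

# Index assignment: the slot in datapack itself is overwritten.
for i, data in enumerate(datapack):
    datapack[i] = data[:shortest]
lengths_after_assignment = [len(data) for data in datapack]
print(lengths_after_assignment)  # [12, 12, 12]
```

Note that the index-assignment version stores new lists in datapack; the original lists that were placed into it keep their full length unless they are reassigned as well.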

Extract words from random strings

Below I have some strings in a list:
some_list = ['a','l','p','p','l','l','i','i','r','i','r','a','a']
Now I want to take the word april from this list. There are only enough letters for two aprils in this list, so I want to take those two aprils and append them to another list, extract.
So the extract list should look something like this:
extract = ['aprilapril']
or
extract = ['a','p','r','i','l','a','p','r','i','l']
I tried many times to get everything in extract in order, but I still can't seem to get it.
But I know I can just do this:
a_count = some_list.count('a')
p_count = some_list.count('p')
r_count = some_list.count('r')
i_count = some_list.count('i')
l_count = some_list.count('l')
total_count = [a_count,p_count,r_count,i_count,l_count]
smallest_count = min(total_count)
extract = ['april' * smallest_count]
But I wouldn't be here if I could just use the code above, because I made some rules for solving this problem:
Each of the characters (a, p, r, i and l) is a magical code element. These code elements can't be created out of thin air; each one is unique and has a unique identifier, like a secret number associated with it. So you don't know how to create these magical code elements; the only way to get them is to extract them to a list.
The characters (a, p, r, i and l) must be in order. Imagine they are links in a chain: they only work if they are together. That means we have to put p right next to a, and l must come last.
These important code elements are top-secret stuff, so if you want to get them, the only way is to extract them to a list.
Below are some examples of an incorrect way to do this (breaking the rules):
import re
word = 'april'
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
text = some_list[0] # the string to search
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    effective_counter = {
        k: int(string_counter.get(k, 0)/v)
        for k, v in kernel_counter.items()
    }
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
This might sound really stupid, but this is actually a hard problem (well, for me). I originally designed this problem for myself to practice Python, but it turned out to be way harder than I thought. I just want to see how other people solve it.
If anyone out there knows how to solve this ridiculous problem, please help me out; I am just a fourteen-year-old trying to learn Python. Thank you very much.
I'm not sure what you mean by "cannot copy nor delete the magical codes" - if you want to put them in your output list you will need to "copy" them somehow.
And by the way, your example code (a_count = some_list.count('a') etc.) won't work on a list that holds one long string, since count will always return zero.
That said, a possible solution is:
magics = 'april' # the elements to extract, in the required order
worklist = [c for c in some_list[0]]
extract = []
fail = False
while not fail:
    lastpos = -1
    tempextract = []
    for magic in magics:
        try:
            # only search after the element extracted just before
            pos = worklist.index(magic, lastpos+1)
        except ValueError:
            fail = True
            break
        tempextract.append(worklist.pop(pos))
        lastpos = pos-1
    else:
        extract.append(tempextract)
Alternatively, if you don't want to pop the elements when you find them, you can compute the positions of all the occurrences of the first element (the "a"), and set lastpos to each of those positions at the beginning of each iteration.
This may not be the most efficient way, but the code works and makes the program logic more explicit:
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
word = 'april'
extract = []
string = some_list[0]
for x in range(len(some_list[0])//len(word)): #maximum number of times `word` can appear in `some_list[0]`
    pointer = i = 0
    found = [] # letters matched in this pass
    remove = [] # their positions in `string`
    while i<len(word):
        j=0
        while j<(len(string)-pointer):
            if string[pointer:][j] == word[i]:
                found.append(word[i])
                remove.append(pointer+j)
                i+=1
                pointer += j+1 # keep searching after the matched position
                break
            j+=1
        else:
            break # word[i] does not occur in the rest of the string
    if i<len(word):
        break # incomplete word: stop without extracting the partial match
    extract.extend(found)
    for r_i,r in enumerate(remove):
        string = string[:r-r_i] + string[r-r_i+1:]
print(extract,string)

Python docx - find and replace words with italicized version

I have thought of a few ways to accomplish this, but each is uglier than the next. I'm trying to think of a way to search for all instances of a word in a Word document and italicize them.
I can't upload a Word document, but here's what I had in mind:
A working example would find all instances of billybob, including the one in the table, and italicize them. The problem is the way the runs are frequently aligned means that one run might have billy and the next one might have bob so there's no straightforward way to find all of them.
I'm going to leave this open because the approach I came up with isn't perfect, but it works in the vast majority of the cases. Here is the code:
import re
from docx import Document
document = Document(<YOUR_DOC>)
# Data will be a list of rows represented as dictionaries
# containing each row's data.
characters = {}
for paragraph in <YOUR_PARAGRAPHS>:
    run_string = ""
    run_index = {}
    i = 0
    for x, run in enumerate(paragraph.runs):
        # Create a string consisting of all the runs' text. Theoretically this
        # should always be the same as paragraph.text, but I didn't check
        run_string = run_string + run.text
        # The index i represents the starting position of the run in question
        # within the string. We are creating a dictionary of form
        # {<run_start_location>: <pointer_to_run>}
        run_index[i] = x
        # This will be the start of the next run
        i = i + len(run.text)
    word_you_wanted_to_find = re.findall("some_regex", paragraph.text)
    for word in word_you_wanted_to_find:
        # [m.start() for m in re.finditer(word, run_string)] returns the starting
        # positions of each word that was found
        for word_start in [m.start() for m in re.finditer(word, run_string)]:
            word_end = word_start + len(word)
            # This will be a list of the indices of the runs which have part
            # of the word we want to include
            included_runs = []
            for key in run_index.keys():
                # Remember, the key is the location in the string of the start of
                # the run. In this case, the start of the word start should be less than
                # the key+len(run) and the end of the word should be greater
                # than the key (the start of the run)
                if word_start <= (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
                    included_runs.append(key)
                # If the key is larger than or equal to the end of the word,
                # this means we have found all relevant keys. We don't need
                # to loop over the rest (we could, it just wouldn't be efficient)
                if key >= word_end:
                    break
            # At this point, included_runs is a full list of indices to the relevant
            # runs so we can modify each one in turn.
            for run_key in included_runs:
                paragraph.runs[run_index[run_key]].italic = True
document.save(<MODIFIED_DOC>)
Problem 1
The problem with this approach is that, while uncommon (at least in my doc), it is possible for a single run to contain more than just your target word. So you might end up italicizing an entire run that includes your word and then some. For my use case it didn't make sense to fix that problem here.
Solution
If you were to perfect what I did above you would have to change this code block:
if word_start <= (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
    included_runs.append(key)
Here you have identified the run that has your word. You would need to extend the code to separate the word into its own run and remove it from the current run. Then you could separately italicize that run.
Problem 2
The code shown above doesn't handle both the table and normal text. I didn't need to for my use case, but in the general case you would have to check both.
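The bookkeeping at the heart of the approach above (working out which runs a match touches) can be tried without a document at all. The following is a rough sketch that works on plain strings standing in for each run's text; the function name runs_overlapping and the sample strings are hypothetical, and nothing here calls python-docx.

```python
import re


def runs_overlapping(run_texts, pattern):
    """Return (matched_text, [run indices]) for each regex match in the joined run text.

    run_texts is a list of strings standing in for each run's text.
    """
    # Start offset of every run inside the concatenated string.
    starts = []
    position = 0
    for text in run_texts:
        starts.append(position)
        position += len(text)

    joined = "".join(run_texts)
    results = []
    for match in re.finditer(pattern, joined):
        # A run overlaps the match if it starts before the match ends
        # and ends after the match starts.
        touched = [
            index
            for index, (start, text) in enumerate(zip(starts, run_texts))
            if start < match.end() and match.start() < start + len(text)
        ]
        results.append((match.group(), touched))
    return results


hits = runs_overlapping(["Hello billy", "bob and ", "billybob", "!"], "billybob")
print(hits)  # [('billybob', [0, 1]), ('billybob', [2])]
```

With real runs, the returned indices would be the ones to italicize; the first hit also shows Problem 1, since run 0 contains "Hello " in addition to part of the target word.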

Variable table width with .format

I'm trying to display data from a CSV in a text table. I've got to the point where it displays everything I need; however, the column width still has to be hard-coded, so problems begin if the data is longer than that number.
I currently print the table using .format to handle the formatting. Is there a way to set the width to a variable that depends on the length of the longest piece of data?
for i in range(len(list_l)):
    if i == 0:
        print(h_dashes)
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', (list_l[i][0].upper()),'|', (list_l[i][1].upper()),'|',(list_l[i][2].upper()),'|', (list_l[i][3].upper()),'|'))
        print(h_dashes)
    else:
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', list_l[i][0], '|', list_l[i][1], '|', list_l[i][2],'|', list_l[i][3],'|'))
I realise that the code is far from perfect, however I'm still a newbie so it's piecemeal from various tutorials
You can use a two-pass approach: the first pass gets the maximum length of each field, and the second does what you're currently doing, with the calculated rather than fixed lengths. As per your example with four fields per line, the following shows the basic idea:
# Can set MINIMUM lengths here if desired, eg: lengths = [10, 0, 41, 7]
lengths = [0] * 4
fmtstr = None
for pass_num in range(2): # "pass" is a reserved word, so it can't be a variable name
    for i in range(len(list_l)):
        if pass_num == 0:
            # First pass sets lengths as per data.
            for field in range(4):
                lengths[field] = max(lengths[field], len(list_l[i][field]))
        else:
            # Second pass prints the data.
            # First, set format string if not yet set.
            if fmtstr is None:
                fmtstr = '|'
                for item in lengths:
                    fmtstr += '{:^%ds}|' % (item)
            # Now print item (and header stuff if first item).
            if i == 0: print(h_dashes)
            print(fmtstr.format(list_l[i][0].upper(), list_l[i][1].upper(), list_l[i][2].upper(), list_l[i][3].upper()))
            if i == 0: print(h_dashes)
The construction of the format string is done the first time you process an item in pass two.
It does so by taking a collection like [31,41,59] and giving you the string:
|{:^31s}|{:^41s}|{:^59s}|
There's little point using all those {:^1s} format specifiers when the | is not actually a varying item - you may as well code it directly into the format string.
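The same idea can also be written with a nested replacement field, where the width inside the format spec is itself filled in from an argument ('{:^{width}}'). This is only a sketch: the function name render_table and the sample rows are made up, it returns the lines instead of printing them, and it builds the dashed rule from the computed widths rather than using the h_dashes variable from the question.

```python
def render_table(rows):
    """Return centred, pipe-delimited lines with each column sized to its longest cell."""
    # First pass: the widest cell in each column.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    # One dash per character of a rendered row: cell widths plus the '|' separators.
    dashes = '-' * (sum(widths) + len(widths) + 1)

    lines = [dashes]
    for index, row in enumerate(rows):
        # Only the header row is upper-cased, as in the question's code.
        cells = [cell.upper() for cell in row] if index == 0 else row
        body = '|'.join(
            '{:^{width}}'.format(cell, width=width)
            for cell, width in zip(cells, widths)
        )
        lines.append('|' + body + '|')
        if index == 0:
            lines.append(dashes)
    return lines


table = render_table([['name', 'city'], ['Al', 'Rome'], ['Barbara', 'Oslo']])
print('\n'.join(table))
```

Every returned line has the same length, so the table stays aligned however long the longest field turns out to be.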

Single remove clause in while loop is removing two elements

I am writing a simple secret santa script that selects a "GiftReceiver" and a "GiftGiver" from a list. Two lists and an empty dataframe to be populated are produced:
import pandas as pd
import random
santaslist_receivers = ['Rudolf',
'Blitzen',
'Prancer',
'Dasher',
'Vixen',
'Comet'
]
santaslist_givers = santaslist_receivers
finalDataFrame = pd.DataFrame(columns = ['GiftGiver','GiftReceiver'])
I then have a while loop that selects random elements from each list to pick a gift giver and receiver, then remove from the respective list:
while len(santaslist_receivers) > 0:
    print (len(santaslist_receivers)) #Used for testing.
    gift_receiver = random.choice(santaslist_receivers)
    santaslist_receivers.remove(gift_receiver)
    print (len(santaslist_receivers)) #Used for testing.
    gift_giver = random.choice(santaslist_givers)
    while gift_giver == gift_receiver: #While loop ensures that gift_giver != gift_receiver
        gift_giver = random.choice(santaslist_givers)
    santaslist_givers.remove(gift_giver)
    dummyDF = pd.DataFrame({'GiftGiver':gift_giver,'GiftReceiver':gift_receiver}, index = [0])
    finalDataFrame = finalDataFrame.append(dummyDF)
The final dataframe only contains three elements instead of six:
print(finalDataFrame)
returns
  GiftGiver GiftReceiver
0    Dasher      Prancer
0     Comet        Vixen
0    Rudolf      Blitzen
I have inserted two print lines within the while loop to investigate. These print the length of the list santaslist_receivers before and after the removal of an element. The expected return is to see original list length on the first print, then minus 1 on the second print, then the same length again on the first print of the next iteration of the while loop, then so on. Specifically I expect:
6,5,5,4,4,3,3... and so on.
What is returned is
6,5,4,3,2,1
Which is consistent with the DataFrame having only 3 rows, but I do not see the cause of this.
What is the error in my code or my approach?
You can solve it by simply changing this line
santaslist_givers = santaslist_receivers
to
santaslist_givers = list(santaslist_receivers)
Python variables are essentially references, so both names refer to the same list; i.e., santaslist_givers and santaslist_receivers were accessing the same object in memory in your implementation. To make them distinct, build a new list with list().
For some extra information, see copy.deepcopy.
You should make an explicit copy of your list here:
santaslist_givers = santaslist_receivers
There are multiple options for doing this, as explained in this question.
In this case I would recommend (if you have Python >= 3.3):
santaslist_givers = santaslist_receivers.copy()
If you are on an older version of Python, the typical way to do it is:
santaslist_givers = santaslist_receivers[:]
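The aliasing described in both answers is easy to see on a small list. This is a minimal sketch with a shortened, made-up list of names; the variable names alias and snapshot are only for illustration.

```python
receivers = ['Rudolf', 'Blitzen', 'Prancer']

alias = receivers            # a second name for the SAME list object
snapshot = receivers.copy()  # a new list holding the same elements

receivers.remove('Rudolf')

print(alias is receivers)     # True
print(alias)                  # ['Blitzen', 'Prancer']
print(snapshot is receivers)  # False
print(snapshot)               # ['Rudolf', 'Blitzen', 'Prancer']
```

This is why each pass of the loop in the question shrank "both" lists: every remove call was acting on the one list that both names referred to.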
