Dictreader and regex, indexing issue - python

Hey im having an issue creating a list of all strings from my list that match the regex, and the field names associated with the DictReader.
I am looping through an array of strings, and trying to see if each string matches a pattern:
reader = csv.DictReader(file)
for mystr in reader:
for i in range(len(mystr)):
if re.search(pattern, list(mystr.values())[i]):
data.append([list(reader.fieldnames)[i],list(mystr.values())[i]])
When a string matches the pattern, it appends the matched string and the csv field name to a list.
This works, however there seems to be an issue with it appending a seemingly random field names to the correct and expected matched regex value.
I.E, If my data was ordered
Names, Location, Price
Sometimes the if condition from the regex will append the field name location to the numerical value associated with price. And it seems to have no predictable pattern as to which value is associates...
The results:
[['firstitem'], ['seconditem'], ['thirditem'], ['fourthitem', '27'], ['fifthitem', '201']]
[['firstitem','1'], ['seconditem'], ['thirditem','12'], ['fourthitem'], ['fifthitem']]
etc..
The numbers all appear in the correct order, they just are not aligning in what i can read as a pattern/order so im not sure why they appear somewhat random. Any help would be appreciated.

I think you can simplify your code like this:
reader = csv.DictReader(file)
for mystr in reader:
for fieldname, value in mystr.items():
if re.search(pattern, value):
data.append([fieldname, value])
That way, it is easier to understand…

Given a completely contrived csv like the following (saved as 'test.csv'):
firstitem, seconditem, thirditem, fourthitem, fifthitem
first, price, 1, nothing, important
second, price, 2, over, here
Then the following should extract all columns with integers:
>>> def get_items(pattern, csv_file):
with open(csv_file) as file:
for entry in csv.DictReader(file):
for field_name, value in entry.items():
if re.search(pattern, value):
yield [field_name, value]
>>> data = list(get_items(r'\d+', 'test.csv'))
[[' thirditem', ' 1'], [' thirditem', ' 2']]
Alternatively, you could use if value.strip().isdigit() as the conditional statement rather than having to use regex.

Related

How to split a comma-separated line if the chunk contains a comma in Python?

I'm trying to split current line into 3 chunks.
Title column contains comma which is delimiter
1,"Rink, The (1916)",Comedy
Current code is not working
id, title, genres = line.split(',')
Expected result
id = 1
title = 'Rink, The (1916)'
genres = 'Comedy'
Any thoughts how to split it properly?
Ideally, you should use a proper CSV parser and specify that double quote is an escape character. If you must proceed with the current string as the starting point, here is a regex trick which should work:
inp = '1,"Rink, The (1916)",Comedy'
parts = re.findall(r'".*?"|[^,]+', inp)
print(parts) # ['1', '"Rink, The (1916)"', 'Comedy']
The regex pattern works by first trying to find a term "..." in double quotes. That failing, it falls back to finding a CSV term which is defined as a sequence of non comma characters (leading up to the next comma or end of the line).
lets talk about why your code does not work
id, title, genres = line.split(',')
here line.split(',') return 4 values(since you have 3 commas) on the other hand you are expecting 3 values hence you get.
ValueError: too many values to unpack (expected 3)
My advice to you will be to not use commas but use other characters
"1#\"Rink, The (1916)\"#Comedy"
and then
id, title, genres = line.split('#')
Use the csv package from the standard library:
>>> import csv, io
>>> s = """1,"Rink, The (1916)",Comedy"""
>>> # Load the string into a buffer so that csv reader will accept it.
>>> reader = csv.reader(io.StringIO(s))
>>> next(reader)
['1', 'Rink, The (1916)', 'Comedy']
Well you can do it by making it a tuple
line = (1,"Rink, The (1916)",Comedy)
id, title, genres = line

Using Zip Function to Create Columns in CSV with non-identical lengths of data

I have large number of files that are named according to a gradually more specific criteria.
Each part of the filename separate by the '_' relate to a drilled down categorization of that file.
The naming convetion looks like this:
TEAM_STRATEGY_ATTRIBUTION_TIMEFRAME_DATE_FILEVIEW
What I am trying to do is iterate through all these files and then pull out a list of how many different occurrences of each naming convention exists.
So essentially this is what I've done so far, I iterated through all the files and made a list of each name. I then separated each name by the '_' and then appended each of those to their respective category lists.
Now I'm trying to export them to a CSV file separated by columns and this is where I'm running into problems
L = [teams, strategies, attributions, time_frames, dates, file_types]
columns = zip(*L)
list(columns)
with open (_outputfolder_, 'w') as f:
writer = csv.writer(f)
for column in columns:
print(column)
This is a rough estimation of the list I'm getting out:
[{'TEAM1'},
{'STRATEGY1', 'STRATEGY2', 'STRATEGY3', 'STRATEGY4', 'STRATEGY5', 'STRATEGY6', 'STRATEGY7', 'STRATEGY8', 'STRATEGY9', 'STRATEGY10','STRATEGY11', 'STRATEGY12', 'STRATEGY13', 'STRATEGY14', 'STRATEGY15'},
{'ATTRIBUTION1','ATTRIBUTION1','Attribution3','Attribution4','Attribution5', 'Attribution6', 'Attribution7', 'Attribution8', 'Attribution9', 'Attribution10'},
{'TIME_FRAME1', 'TIME_FRAME2', 'TIME_FRAME3', 'TIME_FRAME4', 'TIME_FRAME5', 'TIME_FRAME6', 'TIME_FRAME7'},
{'DATE1'},
{'FILE_TYPE1', 'FILE_TYPE2'}]
What I want the final result to look like is something like:
Team1 STRATEGY1 ATTRIBUTION1 TIME_FRAME1 DATE1 FILE_TYPE1
STRATEGY2 ATTRIBUTION2 TIME_FRAME2 FILE_TYPE2
... ... ...
etc. etc. etc.
But only the first line actually gets stored in the CSV file.
can anyone help me understand how to iterate just past the first line? I'm sure this is happening because the Team type has only one option, but I don't want this to hinder it.
I referred to the answer, you have to transpose the result and use it.
refer the post below ,
Python - Transposing a list (rows with different length) using numpy fails.
I have used natural sorting to sort the integers and appended the lists with blanks to have the expected outcome.
The natural sorting is slower for larger lists
you can also use third party libraries,
Does Python have a built in function for string natural sort?
def natural_sort(l):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
return sorted(l, key = alphanum_key)
res = [[] for _ in range(max(len(sl) for sl in columns))]
count = 0
for sl in columns:
sorted_sl = natural_sort(sl)
for x, res_sl in zip(sorted_sl, res):
res_sl.append(x)
for result in res:
if (count > 0 ):
result.insert(0,'')
count = count +1
with open ("test.csv", 'w', newline='') as f:
writer = csv.writer(f)
writer.writerows(res)
f.close()
the columns should be converted in to list before printing to csv file
writerows method can be leveraged to print multiplerows
https://docs.python.org/2/library/csv.html -- you can find more information here
TEAM1,STRATEGY1,ATTRIBUTION1,TIME_FRAME1,DATE1,FILE_TYPE1
,STRATEGY2,Attribution3,TIME_FRAME2,FILE_TYPE2
,STRATEGY3,Attribution4,TIME_FRAME3
,STRATEGY4,Attribution5,TIME_FRAME4
,STRATEGY5,Attribution6,TIME_FRAME5
,STRATEGY6,Attribution7,TIME_FRAME6
,STRATEGY7,Attribution8,TIME_FRAME7
,STRATEGY8,Attribution9
,STRATEGY9,Attribution10
,STRATEGY10
,STRATEGY11
,STRATEGY12
,STRATEGY13
,STRATEGY14
,STRATEGY15

python: Search for a word and read the value from a file

I would like to populate a dictionary by fetching the values associated with a tag.
For example: if I have these variables stored in the file as
NUM 0 OPERATION add DATA [0x1, 0x2]
How can I extract the values of NUM, OPERATION and DATA if the order of the tag's is not fixed?
Thanks in advance.
If you can assure that the operation never contains a space you can do something like this with the built-in method str.split:
>>> line.split(' ', 5)[1::2]
['0', 'add', '[0x1, 0x2]']
It splits the line in a list of parts that are separated by and then returns every second entry of this list starting with the second.

From .txt file to dictionary

i'm using python 2.7, and I need to do some algorithm and I need some help:
The function need to read some data: the data model is like this:
# some_album *song_name::writer::duration::song_lyrics
All over the txt file, I need to get into every position like : the album name and the song name using the function split().
I have some questions:
how can I use split() between two characters- example: to an Album name, split between # to * ????
I want to divide all the txt file to a dictionary, the albums is the key's and the value is another dictionary that his key's is the song name and the value is a list of all the lyrics in the song. mt question is how can i do it with a loop or any other idea, because i want it to divide the hull txt file, and not just part of him.
this is what i do until now:
data_file = open("<someplace>","r")
data = data_file.readlines()
data = str(data)
i=0
for i in data:
albums= {data.split('#','*')[0] : data.split("::")[0]}
to print just the album and the name of the first song. I dont understand how to do it with some loop??
Referring to your first question I would recommend to use the "Regular expressions operations module" re for this.
>>> import re
>>> str = 'py=th;on'
>>> lst = re.split("=|;",str)
>>> lst[1]
'th'

Writing a csv file python

So i have a list:
>>> print references
>>> ['Reference xxx-xxx-xxx-007 ', 'Reference xxx-xxx-xxx-001 ', 'Reference xxx-xxx-xxxx-00-398 ', 'Reference xxx-xxx-xxxx-00-399']
(The list is much longer than that)
I need to write a CSV file wich would look this:
Column 1:
Reference xxx-xxx-xxx-007
Reference xxx-xxx-xxx-001
[...]
I tried this :
c = csv.writer(open("file.csv", 'w'))
for item in references:
c.writerows(item)
Or:
for i in range(0,len(references)):
c.writerow(references[i])
But when I open the csv file created, I get a window asking me to choose the delimiter
No matter what, I have something like
R,e,f,e,r,e,n,c,es
writerows takes a sequence of rows, each of which is a sequence of columns, and writes them out.
But you only have a single list of values. So, you want:
for item in references:
c.writerow([item])
Or, if you want a one-liner:
c.writerows([item] for item in references)
The point is, each row has to be a sequence; as it is, each row is just a single string.
So, why are you getting R,e,f,e,r,e,n,c,e,… instead of an error? Well, a string is a sequence of characters (each of which is itself a string). So, if you try to treat "Reference" as a sequence, it's the same as ['R', 'e', 'f', 'e', 'r', 'e', 'n', 'c', 'e'].
In a comment, you asked:
Now what if I want to write something in the second column ?
Well, then each row has to be a list of two items. For example, let's say you had this:
references = ['Reference xxx-xxx-xxx-007 ', 'Reference xxx-xxx-xxx-001 ']
descriptions = ['shiny thingy', 'dull thingy']
You could do this:
csv.writerows(zip(references, descriptions))
Or, if you had this:
references = ['Reference xxx-xxx-xxx-007 ', 'Reference xxx-xxx-xxx-001 ', 'Reference xxx-xxx-xxx-001 ']
descriptions = {'Reference xxx-xxx-xxx-007 ': 'shiny thingy',
'Reference xxx-xxx-xxx-001 ': 'dull thingy']}
You could do this:
csv.writerows((reference, descriptions[reference]) for reference in references)
The key is, find a way to create that list of lists—if you can't figure it out all in your head, you can print all the intermediate steps to see what they look like—and then you can call writerows. If you can only figure out how to create each single row one at a time, use a loop and call writerow on each row.
But what if you get the first column values, and then later get the second column values?
Well, you can't add a column to a CSV; you can only write by row, not column by column. But there are a few ways around that.
First, you can just write the table in transposed order:
c.writerow(references)
c.writerow(descriptions)
Then, after you import it into Excel, just transpose it back.
Second, instead of writing the values as you get them, gather them up into a list, and write everything at the end. Something like this:
rows=[[item] for item in references]
# now rows is a 1-column table
# ... later
for i, description in enumerate(descriptions):
values[i].append(description)
# and now rows is a 2-column table
c.writerows(rows)
If worst comes to worst, you can always write the CSV, then read it back and write a new one to add the column:
with open('temp.csv', 'w') as temp:
writer=csv.writer(temp)
# write out the references
# later
with open('temp.csv') as temp, open('real.csv', 'w') as f:
reader=csv.reader(temp)
writer=csv.writer(f)
writer.writerows(row + [description] for (row, description) in zip(reader, descriptions))
writerow writes the elements of an iterable in different columns. This means that if your provide a tuple, each element will go in one column. If you provide a String, each letter will go in one column. If you want all the content in the same column do the following:
c = csv.writer(open("file.csv", 'wb'))
c.writerows(references)
or
for item in references:
c.writerow(references)
c = csv.writer(open("file.csv", 'w'))
c.writerows(["Reference"])
# cat file.csv
R,e,f,e,r,e,n,c,e
but
c = csv.writer(open("file.csv", 'w'))
c.writerow(["Reference"])
# cat file.csv
Reference
Would work as others have said.
My original answer was flawed due to confusing writerow and writerows.

Categories