I have a simple .csv file encoded in windows-1250: two columns with key-value pairs, separated by semicolons. I would like to create a dictionary from this data. I used this solution: How to read a .csv file into a dictionary in Python. Code below:
import os
import csv
strpath = r"C:\Project Folder"
filename = "to_dictionary.csv"
os.chdir(strpath)
test_csv = open(filename, mode="r", encoding="windows-1250")
dctreader = csv.DictReader(test_csv)
ordereddct = list(dctreader)[0]
finaldct = dict(ordereddct)
print(finaldct)
First of all, the file has 370 rows but I receive only two. Second, Python reads the whole first row as a key and the next row as a value (and then stops, as I mentioned).
# source data
# a;A
# b;B
# c;C
# ... up to 370 rows
# what I need (example; there should be 368 pairs more of course)
finaldct = {"a": "A", "b": "B"}
# what I receive
finaldct = {"a;A": "b;B"}
I have no idea why this happens and couldn't find any working solution.
Note: I would like to avoid using pandas because it seems to work slower in this case.
file has 370 rows but I receive only two
This might be caused by problems with newlines (they differ between systems; see the Newline Wikipedia entry if you want to know more). The csv module docs suggest opening the file with newline='', i.e. in your case
test_csv = open(filename, newline='', mode="r", encoding="windows-1250")
If you have a file with just two columns (assuming they're unquoted, etc.), you don't need to use the csv module at all.
dct = {}
with open("file.txt", encoding="windows-1250") as f:
    for line in f:
        key, _, value = line.rstrip("\r\n").partition(";")
        dct[key] = value
Thank you all, but I finally managed to do it (and without looping)! The solution from Kite misled me a little. Here is my code:
import os
import csv
strpath = r"C:\Project Folder"
filename = "to_dictionary.csv"
os.chdir(strpath)
test_csv = open(filename, mode="r", encoding="windows-1250")
csvreader = csv.reader(test_csv, delimiter=";")
finaldct = dict(csvreader)
print(finaldct)
So, first, I needed to specify the delimiter, but in the reader. Second, there's no need to use DictReader; converting a plain reader to a dictionary suffices.
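For what it's worth, the same one-liner works inside a context manager, which also closes the file if something fails midway. This is just a sketch; the hypothetical sample data below stands in for the question's to_dictionary.csv:

```python
import csv

# Hypothetical stand-in for the question's to_dictionary.csv.
with open("to_dictionary.csv", "w", newline="", encoding="windows-1250") as f:
    f.write("a;A\nb;B\nc;C\n")

# newline='' plus the right delimiter; the with block closes the file for us.
with open("to_dictionary.csv", newline="", encoding="windows-1250") as f:
    finaldct = dict(csv.reader(f, delimiter=";"))

print(finaldct)  # {'a': 'A', 'b': 'B', 'c': 'C'}
```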
You can try the below
data = dict()
with open('test.csv') as f:
    for line in f:
        temp = line.strip()
        if temp:
            k, v = temp.split(';')
            data[k] = v
print(data)
test.csv
1;2
3;5
78;8
6;0
output
{'1': '2', '3': '5', '78': '8', '6': '0'}
Related
I want to create a word dictionary. The dictionary looks like
words_meanings= {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Output: rekindle, pesky, verge, maneuver, accountability
Here rekindle, pesky, verge, maneuver, and accountability are the keys, and relight, annoying, border, activity, and responsibility are the values.
Now I want to create a csv file and my code will take input from the file.
The file looks like
rekindle | pesky | verge | maneuver | accountability
relight | annoying| border| activity | responsibility
So far I use this code to load the file and read data from it.
from google.colab import files
uploaded = files.upload()
import pandas as pd
data = pd.read_csv("words.csv")
data.head()
import csv
reader = csv.DictReader(open("words.csv", 'r'))
words_meanings = []
for line in reader:
words_meanings.append(line)
print(words_meanings)
This is the output of print(words_meanings)
[OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
It looks very odd to me.
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Now I create an empty list and want to append only key values. But the output is [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
I am confused: in the first code block it included only the keys, but now it includes both the keys and their values. How can I overcome this situation?
I would suggest formatting your csv with the key and value on the same row, like this:
rekindle,relight
pesky,annoying
verge,border
This way the following code will work.
words_meanings = {}
with open(file_name, 'r') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
if you want a list of the keys:
list_of_keys = list(words_meanings.keys())
To add keys and values to the file:
def add_values(key: str, value: str, file_name: str):
    with open(file_name, 'a') as file:
        file.writelines(f"\n{key},{value}")

key = input("Input the key you want to save: ")
value = input(f"Input the value you want to save to {key}: ")
add_values(key, value, file_name)
You run the same block of code, but with different objects, and this gives different results.
First you use a normal dictionary (check type(words_meanings)):
words_meanings = {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
and the for-loop gives you the keys from this dictionary.
You could get the same with
keys_letter = list(words_meanings.keys())
or even
keys_letter = list(words_meanings)
Later you use a list with a single dictionary inside it (check type(words_meanings)):
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
and the for-loop gives you the elements of this list, not the keys of the dictionary inside it. So you effectively copy the full dictionary from one list to another.
You could get the same with
keys_letter = words_meanings.copy()
or even the same
keys_letter = list(words_meanings)
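Put differently, to get the keys of the dictionary that sits inside the list, you have to index into the list first (a quick sketch):

```python
from collections import OrderedDict

words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]

# words_meanings[0] is the inner dictionary; list() of a dict gives its keys.
keys_letter = list(words_meanings[0])
print(keys_letter)  # ['\ufeffrekindle', 'pesky']
```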
from collections import OrderedDict
words_meanings = {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = list(words_meanings.keys())
keys_letter = list(words_meanings)
print(keys_letter)
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = words_meanings.copy()
keys_letter = list(words_meanings)
print(keys_letter)
The default field separator for the csv module is a comma. Your CSV file uses the pipe or bar symbol |, and the fields also seem to be fixed width. So, you need to specify | as the delimiter to use when creating the CSV reader.
Also, your CSV file is encoded as big-endian UTF-16 (UTF-16-BE). The file contains a byte-order mark (BOM), but Python is not stripping it off, which is why the string '\ufeffrekindle' starts with the U+FEFF BOM. That can be dealt with by specifying encoding='utf-16' when you open the file.
import csv

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(f, delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
Running this on your CSV file produces this:
{'rekindle ': 'relight ', 'pesky ': 'annoying', 'verge ': 'border', 'maneuver ': 'activity ', 'accountability': 'responsibility'}
Notice that there is trailing whitespace in the key and values. skipinitialspace=True removed the leading whitespace, but there is no option to remove the trailing whitespace. That can be fixed by exporting the CSV file from Excel without specifying a field width. If that can't be done, then it can be fixed by preprocessing the file using a generator:
import csv

def preprocess_csv(f, delimiter=','):
    # assumes that fields cannot contain embedded newlines
    for line in f:
        yield delimiter.join(field.strip() for field in line.split(delimiter))

with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
which now outputs the stripped keys and values:
{'rekindle': 'relight', 'pesky': 'annoying', 'verge': 'border', 'maneuver': 'activity', 'accountability': 'responsibility'}
Since no one was able to help me with the answer, I'm finally posting it here. I hope this helps others.
import csv
file_name="words.csv"
words_meanings = {}
with open(file_name, newline='', encoding='utf-8-sig') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
print(words_meanings)
This is the code to transfer a csv to a dictionary. Enjoy!!!
I have this csv log file which is huge in size (GBs) and has no header row:
1,<timestamp>,BEGIN
1,<timestamp>,fetched from db
1,<timestamp>,some processing
2,<timestamp>,BEGIN
2,<timestamp>,fetched from db
1,<timestamp>,returned success
3,<timestamp>,BEGIN
4,<timestamp>,BEGIN
1,<timestamp>,END
3,<timestamp>,some work
2,<timestamp>,some processing
4,<timestamp>,waiting for
2,<timestamp>,ERROR
3,<timestamp>,attempting other work
4,<timestamp>,ERROR
3,<timestamp>,attempting other work
Each line is a trace log, and the first field is the RequestID.
I need to scan the file and write the logs for only those requests which resulted in 'ERROR' to another file.
import csv

def readFile(filename):
    with open(filename, 'r') as fn:
        reader = csv.reader(fn)
        for line in reversed(list(reader)):
            yield line

def wrt2File():
    rows = readFile('log.csv')
    with open('error.csv', 'w') as fn:
        writer = csv.writer(fn)
        errReqIds = []
        for row in rows:
            if 'ERROR' in row:
                errReqIds.append(row[0])
            if row[0] in errReqIds:
                writer.writerow(row)

wrt2File()
How to improve my code not to use memory for readFile operation and re-usability of this code? I don't want to use pandas, if any better alternative is available.
This does not look like CSV at all. Might I suggest something along the following lines:
def extract(filename):
    previous = dict()
    current = set()
    with open(filename) as inputfile:
        for line in inputfile:
            id, rest = line.split(',', 1)
            if 'ERROR' in line:
                if id in previous:
                    for kept in previous[id]:
                        yield kept
                    del previous[id]
                yield line
                current.add(id)
            elif id in current:
                yield line
            else:
                # Buffer lines we have not yet decided about.
                previous.setdefault(id, []).append(line)
            # Maybe do something here to remove really old entries from previous

def main():
    import sys
    for filename in sys.argv[1:]:
        for line in extract(filename):
            print(line, end='')  # lines already carry their newline

if __name__ == '__main__':
    main()
This simply prints to standard output. You could refactor it to accept an output file name as an option and use write on that filehandle if you like.
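For example, such a refactoring might look like this. It's only a sketch; the function and parameter names are my own, not from the original answer:

```python
import sys

def write_lines(lines, out=None):
    # Send each extracted line to the given filehandle,
    # defaulting to standard output.
    out = out or sys.stdout
    for line in lines:
        out.write(line if line.endswith("\n") else line + "\n")

# Usage sketch: write two hypothetical extracted lines to a file.
with open("errors_out.txt", "w") as f:
    write_lines(["1,t,ERROR", "1,t,END"], f)
```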
Since your file is huge, you need a solution which avoids loading the entire file into memory. The following can do the job:
def find_errors(filename):
    with open(filename) as f:
        return {l[0:3] for l in f if 'ERROR' in l}

def wrt2File():
    error_ids = find_errors('log.csv')
    with open('error.csv', 'w') as fw, open('log.csv') as fr:
        for l in fr:
            if l[0:3] in error_ids:
                fw.write(l)
Note that I assumed the id was the first 3 characters of the line, change it if needed.
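If the ids vary in width, splitting each line on the first comma avoids the fixed-width assumption. A sketch, with a hypothetical sample file standing in for log.csv:

```python
def find_error_ids(filename):
    # The id is everything before the first comma on each line.
    with open(filename) as f:
        return {line.split(",", 1)[0] for line in f if "ERROR" in line}

# Hypothetical sample log.
with open("log_sample.csv", "w") as f:
    f.write("1,t1,BEGIN\n12,t2,BEGIN\n12,t3,ERROR\n1,t4,END\n")

print(find_error_ids("log_sample.csv"))  # {'12'}
```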
Here's something that should be fairly fast, largely because it reads the entire file into memory to process it. You haven't defined what you mean by "efficient", so I assumed it was speed and that your computer has enough memory to do it, since that's what the code in your question does.
import csv
from itertools import groupby
from operator import itemgetter

REQUEST_ID = 0  # Column
RESULT = 2  # Column
ERROR_RESULT = 'ERROR'

keyfunc = itemgetter(REQUEST_ID)

def wrt2File(inp_filename, out_filename):
    # Read log data into memory and sort by request id column.
    with open(inp_filename, 'r', newline='') as inp:
        rows = list(csv.reader(inp))
    rows.sort(key=keyfunc)
    with open(out_filename, 'w', newline='') as outp:
        csv_writer = csv.writer(outp)
        for k, g in groupby(rows, key=keyfunc):
            g = list(g)
            # If any of the lines in the group have the error indicator,
            # write them all to the error csv.
            has_error = False
            for row in g:
                if row[RESULT] == ERROR_RESULT:
                    has_error = True
                    break
            if has_error:
                csv_writer.writerows(g)

wrt2File('log.csv', 'error.csv')
Update:
Since I now know you don't want to read it all into memory, here's an alternative. It reads the entire file twice: the first pass just determines which request ids had errors, and that information is used on the second pass to decide which lines to write to the errors csv. Your OS should do a certain amount of file buffering and data caching, so hopefully it will be an acceptable trade-off.
It's important to note that rows for request ids with errors won't be grouped together in the output file, since this approach doesn't sort them.
import csv

REQUEST_ID = 0  # Column
RESULT = 2  # Column
ERROR_RESULT = 'ERROR'

def wrt2File(inp_filename, out_filename):
    # First pass:
    # Read entire log file and determine which request ids had errors.
    error_requests = set()  # Used to filter rows in second pass.
    with open(inp_filename, 'r', newline='') as inp:
        for row in csv.reader(inp):
            if row[RESULT] == ERROR_RESULT:
                error_requests.add(row[REQUEST_ID])
    # Second pass:
    # Read log file again and write rows associated with request ids
    # which had errors to the output csv.
    with open(inp_filename, 'r', newline='') as inp:
        with open(out_filename, 'w', newline='') as outp:
            csv_writer = csv.writer(outp)
            for row in csv.reader(inp):
                if row[REQUEST_ID] in error_requests:
                    csv_writer.writerow(row)

wrt2File('log.csv', 'error.csv')
print('done')
I'm writing a program that reads names and statistics related to those names from a file. Each line of the file is another person and their stats. For each person, I'd like to make their last name a key, with everything else linked to that key in the dictionary. The program first stores data from the file in an array, and then I'm trying to get those array elements into the dictionary, but I'm not sure how. I'm also not sure whether each iteration of the loop will overwrite the previous contents of the dictionary. Here's the code I'm using to attempt this:
f = open("people.in", "r")
tmp = None
people
l = f.readline()
while l:
    tmp = l.split(',')
    print tmp
    people = {tmp[2] : tmp[0])
    l = f.readline()
people['Smith']
The error I'm currently getting is that the syntax is incorrect, however I have no idea how to transfer the array elements into the dictionary other than like this.
Use key assignment:
people = {}
for line in f:
    tmp = line.rstrip('\n').split(',')
    people[tmp[2]] = tmp[0]
This loops over the file object directly, no need for .readline() calls here, and removes the newline.
You appear to have CSV data; you could also use the csv module here:
import csv

people = {}
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]] = row[0]
or even a dict comprehension:
import csv

with open("people.in", "rb") as f:
    reader = csv.reader(f)
    people = {r[2]: r[0] for r in reader}
Here the csv module takes care of the splitting and removing newlines.
The syntax error stems from trying to close the opening { with a ) instead of }:
people = {tmp[2] : tmp[0]) # should be }
If you need to collect multiple entries per row[2] value, collect these in a list; a collections.defaultdict instance makes that easier:
import csv
from collections import defaultdict

people = defaultdict(list)
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]].append(row[0])
In response to Generalkidd's comment above about multiple people with the same last name, an addition to Martijn Pieters' solution, posted as an answer for better formatting:
import csv

people = {}
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if not row[2] in people:
            people[row[2]] = list()
        people[row[2]].append(row[0])
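The membership test can also be folded into one line with dict.setdefault, which returns the existing list or inserts a new one. A sketch with made-up rows:

```python
people = {}
rows = [["Anna", "42", "Smith"], ["Bob", "17", "Smith"], ["Cara", "9", "Jones"]]
for row in rows:
    # setdefault(key, []) returns people[key], creating the empty list if needed.
    people.setdefault(row[2], []).append(row[0])
print(people)  # {'Smith': ['Anna', 'Bob'], 'Jones': ['Cara']}
```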
I am trying to replace blank values in a certain column (column 6, 'Author', for example) with "DMD" in a CSV using Python. I am fairly new to programming, so a lot of the lingo throws me. I have read through the Python csv documentation, but there doesn't seem to be anything specific to my question. Here is what I have so far. It doesn't run; I get the error 'dict' object has no attribute 'replace'. It seems like there should be something similar to replace for a dict. Also, I am not entirely sure my method of searching the field is accurate. Any guidance would be appreciated.
import csv

inputFileName = "C:\Author.csv"
outputFileName = os.path.splitext(inputFileName)[0] + "_edited.csv"
field = ['Author']

with open(inputFileName) as infile, open(outputFileName, "w") as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, field)
    w.writeheader()
    for row in r:
        row.replace(" ","DMD")
        w.writerow(row)
I think you're pretty close. You need to pass the fieldnames to the writer and then you can edit the row directly, because it's simply a dictionary. For example:
with open(inputFileName, "rb") as infile, open(outputFileName, "wb") as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, r.fieldnames)
    w.writeheader()
    for row in r:
        if not row["Author"].strip():
            row["Author"] = "DMD"
        w.writerow(row)
turns
a,b,c,d,e,Author,g,h
1,2,3,4,5,Smith,6,7
8,9,10,11,12,Jones,13,14
13,14,15,16,17,,18,19
into
a,b,c,d,e,Author,g,h
1,2,3,4,5,Smith,6,7
8,9,10,11,12,Jones,13,14
13,14,15,16,17,DMD,18,19
I like using if not somestring.strip(): because that way it won't matter if there are no spaces, or one, or seventeen and a tab. I also prefer DictReader to the standard reader because this way you don't have to remember which column Author is living in.
[PS: The above assumes Python 2, not 3.]
Dictionaries don't need the replace method because simple assignment does this for you:
for row in r:
    if row[header-6] == "":
        row[header-6] = "DMD"
    w.writerow(row)
Where header-6 is the name of your sixth column
Also note that your call to DictWriter appears to have the wrong fieldnames argument. That argument should be a list (or other sequence) containing all the headers of your new CSV, in order.
For your purposes, it appears to be simpler to use the vanilla reader:
import csv
import os

inputFileName = "C:\Author.csv"
outputFileName = os.path.splitext(inputFileName)[0] + "_edited.csv"

with open(inputFileName) as infile, open(outputFileName, "w") as outfile:
    r = csv.reader(infile)
    w = csv.writer(outfile)
    w.writerow(next(r))  # Writes the header unchanged
    for row in r:
        if row[5] == "":
            row[5] = "DMD"
        w.writerow(row)
(1) To use os.path.splitext, you need to add an import os.
(2) Dicts don't have a replace method; dicts aren't strings. If you're trying to alter a string that's the value of a dict entry, you need to reference that dict entry by key, e.g. row['Author']. If row['Author'] is a string (as it should be in your case), you can call replace on that. Sounds like you need an intro to Python dictionaries; see for example http://www.sthurlow.com/python/lesson06/ .
(3) A way to do this, that also deals with multiple spaces, no spaces etc. in the field, would look like this:
field = 'Author'
marker = 'DMD'
....
## longhand version
candidate = str(row[field]).strip()
if candidate:
    row[field] = candidate
else:
    row[field] = marker
or
## shorthand version
row[field] = str(row[field]).strip() or marker
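A quick check of that idea with made-up rows (a sketch; the plain `or` form is enough here, since an all-whitespace value strips to an empty, falsy string):

```python
marker = 'DMD'
field = 'Author'
rows = [{'Author': '   '}, {'Author': 'Smith'}]
for row in rows:
    # Whitespace-only values strip to '' (falsy) and fall back to the marker.
    row[field] = str(row[field]).strip() or marker
print(rows)  # [{'Author': 'DMD'}, {'Author': 'Smith'}]
```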
Cheers
with open('your file', 'r+') as f2:
    txt = f2.read().replace('#', '').replace("'", '').replace('"', '').replace('&', '')
    f2.seek(0)
    f2.write(txt)
    f2.truncate()
Keep it simple and replace your choice of characters.
I am making a little program that will read and display text from a document. I have got a test file which looks like this:
12,12,12
12,31,12
1,5,3
...
and so on. Now I would like Python to read each line and store it in memory, so when I select to display the data, it will display it in the shell like this:
1. 12,12,12
2. 12,31,12
...
and so on. How can I do this?
I know it is already answered :) To summarize the above:
# It is a good idea to store the filename in a variable.
# The variable can later become a function argument when the
# code is converted to a function body.
filename = 'data.txt'

# Using the newer with construct to close the file automatically.
with open(filename) as f:
    data = f.readlines()

# Or using the older approach and closing the file explicitly.
# Here the data is re-read again; do not use both ;)
f = open(filename)
data = f.readlines()
f.close()

# The data is of the list type. The Python list type is actually
# a dynamic array. The lines also contain the \n; hence the .rstrip()
for n, line in enumerate(data, 1):
    print '{:2}.'.format(n), line.rstrip()

print '-----------------'

# You can later iterate through the list for other purposes, for
# example to read it via the csv.reader.
import csv
reader = csv.reader(data)
for row in reader:
    print row
It prints on my console:
1. 12,12,12
2. 12,31,12
3. 1,5,3
-----------------
['12', '12', '12']
['12', '31', '12']
['1', '5', '3']
Try storing it in an array
f = open("file.txt", "r")
a = []
for line in f:
    a.append(line)
Thanks to @PePr for the excellent solution. In addition, you can try printing the .txt file with the built-in method ''.join(data). For example:
with open(filename) as f:
    data = f.readlines()
print(''.join(data))
You may also be interested in the csv module. It lets you parse, read and write files in the comma-separated values (CSV) format...which your example appears to be in.
Example:
import csv

reader = csv.reader(open('file.txt', 'rb'), delimiter=',')
# Iterate over each row
for idx, row in enumerate(reader):
    print "%s: %s" % (idx + 1, row)
with open('test.txt') as o:
    for i, t in enumerate(o.readlines(), 1):
        print("%s. %s" % (i, t))
#!/usr/local/bin/python
t = 1
with open('sample.txt') as inf:
    for line in inf:
        num = line.strip()  # contents of the current line
        if num:
            fn = '%d.txt' % t  # names the files 1.txt, 2.txt, 3.txt, ...
            print('%d.txt file written' % t)
            #fn = '%s.txt' % num
            with open(fn, 'w') as outf:
                outf.write('%s\n' % num)  # writes the current line to the opened fn file
            t = t + 1