So I have a csv file that has tons of games and info about them and I'm trying to save the game's publisher and the ESRB rating. But for some reason when I print them out it'll randomly skip games and chose wrong cells.
My code:
def simpleLoop(file_name):
output = []
input_file = open(file_name, "r")
for line in input_file:
cells = line.split(",")
output.append((cells[7], cells[13]))
i = 0
while (i <= 10):
print(output[i]) # testing what values i get
i += 1
Screenshot of csv
Output
Expected Output
Any help is appreciated thanks!
Edit: Solved with the help of SimoN
For anyone else facing a similar issue make sure you specify exactly where you want to split. In my case I split at commas but there were commas inside some of the cells. So to fix this I changed:
cells = line.split(",")
To
cells = line.split('","')
Which makes python split after each cell because cells end with a double quote then a comma and the next cell starts with a double quote
There are commas inside some of the cells and you are splitting on these. When you opened the CSV in Excel (or whatever you used) it knew not to split on these as they are surrounded by quotes. I'd suggest using the Python csv module so you can do the same.
Related
I have just started a data quality class in which I got zero instruction on Python but am expected to create a script. There are three instructions for my Python script:
Create a script that loads an entire CSV file and replace all the blank values to NAN
Use genfromtxt function
Write the results set into a different file
I have been working on this for a few hours, but with no previous experience with Python, I am completely stuck! This is what I have so far:
import csv
file = open(quality.csv, 'r')
csvreader = csv.reader(file)
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
rows.append(row)
print(rows)
My first problem is that when I tried using genfromtxt, it would not print out the headers or the entire csv file, it would only print out a few lines. If it matters, all of the values of the csv file are ints/floats, but the headers are strings.
See here
The next problem is I have tried several different ways to replace blank values, but I was not successful. All of the blank fields in this file are in the last column. When I print out the csv in full, this is what the line looks like (I've highlighted the empty value):
See here
Finally, I have no idea what instruction #3 means. I am completely new at this with zero Python knowledge! I think I am unsure of the Python syntax and rules - which I will look into more and learn, however I only had two days to complete this assignment and I do not know anything yet! Thank you in advance.
What you did with genfromtxt seems correct already. With big data like this, terminal only shows some data from the beginning and at the end, and those 3 dots in the middle also indicates the other records you're not seeing there!
I'm trying to export data that I queried from a database to a txt file. I am able to do so with the .to_csv method however it exports with spaces. I've tried to set the (sep) in the query to no space but it is forcing me to use at least one space or item as a seperator. Is there any way to export data to a txt file and not have any spaces in between export?
dataframe
Code I've been using to export to .txt
dataframe.to_csv('Sales_Drivers_ITCSignup.txt',index=False,header=True)
Want it to export like this:
Try
np.savetext(filename, df.values, fmt)
Feel free to ask question in case of any problem.
Took a bit of tinkering but this was the code I was able to come up with. The thought process was to create import the text file, edit it as a list, then re-export it overwriting the previous list. Thanks for all the suggestions!
RemoveCommas = []
RemoveSpaces = []
AddSymbol = []
Removeextra = []
#Import List and execute replacements
SalesDriversTransactions = []
with open('Sales_Drivers_Transactions.txt', 'r')as reader:
for line in reader.readlines():
SalesDriversTransactions.append(line)
for comma in SalesDriversTransactions:
WhatWeNeed = comma.replace(",","")
RemoveCommas.append(WhatWeNeed)
for comma in RemoveCommas:
WhatWeNeed = comma.replace(" ","")
RemoveSpaces.append(WhatWeNeed)
for comma in RemoveSpaces:
WhatWeNeed = comma.replace("þ","þ")
AddSymbol.append(WhatWeNeed)
for comma in AddSymbol:
WhatWeNeed = comma.replace("\n","")
Removeextra.append(WhatWeNeed)
with open('Sales_Drivers_Transactions.txt', 'w')as f:
for i in Removeextra:
f.write(i)
f.write('\n')
I have a text file which contains text in the first 20 or so lines, followed by CSV data. Some of the text in the text section contains commas and so trying csv.reader or csv.dictreader doesn't work well.
I want to skip past the text section and only then start to parse the CSV data.
Searches don't yield much other than instructions to either use csv.reader/csv.dictreader and iterate through the rows that are returned (which doesn't work because of the commas in the text), or to read the file line-by-line and split the lines using ',' as the delimiter.
The latter works up to a point, but it produces strings, not numbers. I could convert the strings to numbers but I'm hoping that there's a simple way to do this either with the csv or numpy libraries.
As requested - Sample data:
This is the first line. This is all just text to be skipped.
The first line doesn't always have a comma - maybe it's in the third line
Still no commas, or was there?
Yes, there was. And there it is again.
and so on
There are more lines but they finally stop when you get to
EndOfHeader
1,2,3,4,5
8,9,10,11,12
3, 6, 9, 12, 15
Thanks for the help.
Edit#2
A suggested answer gave the following link entitled Read file from line 2...
That's kind of what I'm looking for, but I want to be able to read through the lines until I find the "EndOfHeader" and then call on the CSV library to handle the remainder of the file.
The reply by saimadhu.polamuri is part of what I've tried, specifically
with open(filename , 'r') as f:
first_line = f.readline()
for line in f:
#test if line equals EndOfHeader. If true then parse as CSV
But that's where it comes apart - I can't see how to have CSV work with the data from this point forward.
With thanks to #Mike for the suggestion, the code is actually reasonably straightforward.
with open('data.csv') as f: # open the file
for i in range(7): # Loop over first 7 lines
str=f.readline() # just read them. Could also do f.next()
r = csv.reader(f, delimiter=',') # Now pass the file handle to a csv reader
for row in r: # and loop over the resulting rows
print(row) # Print the row. Or do something else.
In my actual code, it will search for the EndOfHeader line and use that to decide where to start parsing the CSV
I'm posting this as an answer, as the question that this one supposedly duplicates doesn't explicitly consider this issue of the file handle and how it can be passed to a CSV reader, and so it may help someone else.
Thanks to all who took time to help.
So I've got about 5008 rows in a CSV file, a total of 5009 with the headers. I'm creating and writing this file all within the same script. But when i read it at the end, with either pandas pd.read_csv, or python3's csv module, and print the len, it outputs 4967. I checked the file for any weird characters that may be confusing python but don't see any. All the data is delimited by commas.
I also opened it in sublime and it shows 5009 rows not 4967.
I could try other methods from pandas like merge or concat, but if python wont read the csv correct, that's no use.
This is one method i tried.
df1=pd.read_csv('out.csv',quoting=csv.QUOTE_NONE, error_bad_lines=False)
df2=pd.read_excel(xlsfile)
print (len(df1))#4967
print (len(df2))#5008
df2['Location']=df1['Location']
df2['Sublocation']=df1['Sublocation']
df2['Zone']=df1['Zone']
df2['Subnet Type']=df1['Subnet Type']
df2['Description']=df1['Description']
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df2.to_csv(newfile, index=False)
print('Done.')
target.close()
Another way I tried is
dfcsv = pd.read_csv('out.csv')
wb = xlrd.open_workbook(xlsfile)
ws = wb.sheet_by_index(0)
xlsdata = []
for rx in range(ws.nrows):
xlsdata.append(ws.row_values(rx))
print (len(dfcsv))#4967
print (len(xlsdata))#5009
df1 = pd.DataFrame(data=dfcsv)
df2 = pd.DataFrame(data=xlsdata)
df3 = pd.concat([df2,df1], axis=1)
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df3.to_csv(newfile, index=False)
print('Done.')
target.close()
But not matter what way I try the CSV file is the actual issue, python is writing it correctly but not reading it correctly.
Edit: Weirdest part is that i'm getting absolutely no encoding errors or any errors when running the code...
Edit2: Tried testing it with nrows param in first code example, works up to 4000 rows. Soon as i specify 5000 rows, it reads only 4967.
Edit3: manually saved csv file with my data instead of using the one written by the program, and it read 5008 rows. Why is python not writing the csv file correctly?
I ran into this issue also. I realized that some of my lines had open-ended quotes, which was for some reason interfering with the reader.
So for example, some rows were written as:
GO:0000026 molecular_function "alpha-1
GO:0000027 biological_process ribosomal large subunit assembly
GO:0000033 molecular_function "alpha-1
and this led to rows being read incorrectly. (Unfortunately I don't know enough about how csvreader works to tell you why. Hopefully someone can clarify the quote behavior!)
I just removed the quotes and it worked out.
Edited: This option works too, if you want to maintain the quotes:
quotechar=None
My best guess without seeing the file is that you have some lines with too many or not enough commas, maybe due to values like foo,bar.
Please try setting error_bad_lines=True. From Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html to see if it catches lines with errors in them, and my guess is that there will be 41 such lines.
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will dropped from the DataFrame that is returned. (Only valid with C parser)
The csv.QUOTE_NONE option seems to not quote fields and replace the current delimiter with escape_char + delimiter when writing, but you didn't paste your writing code, but on read it's unclear what this option does. https://docs.python.org/3/library/csv.html#csv.Dialect
wrote a python script in windows 8.1 using Sublime Text editor and I just tried to run it from terminal in OSX Yosemite but I get an error.
My error occurs when parsing the first line of a .CSV file. This is the slice of the code
lines is an array where each element is the line in the file it is read from as a string
we split the string by the desired delimiter
we skip the first line because that is the header information (else condition)
For the last index in the for loop i = numlines -1 = the number of lines in the file - 2
We only add one to the value of i because the last line is blank in the file
for i in range(numlines):
if i == numlines-1:
dataF = lines[i+1].split(',')
else:
dataF = lines[i+1].split(',')
dataF1 = list(dataF[3])
del(dataF1[len(dataF1)-1])
del(dataF1[len(dataF1)-1])
del(dataF1[0])
f[i] = ''.join(dataF1)
return f
All the lines in the csv file looks like this (with the exception of the header line):
"08/06/2015","19:00:00","1","410"
So it saves the single line into an array where each element corresponds to one of the 4 values separated by commas in a line of the CSV file. Then we take the 3 element in the array, "410" ,and create a list that should look like
['"','4','1','0','"','\n']
(and it does when run from windows)
but it instead looks like
['"','4','1','0','"','\r','\n']
and so when I concatenate this string based off the above code I get 410 instead of 410.
My question is: Where did the '\r' term come from? It is non-existent in the original files when ran by a windows machine. At first I thought it was the text format so I saved the CSV file to a UTF-8, that didn’t work. I tried changing the tab size from 4 to 8 spaces, that didn’t work. Running out of ideas now. Any help would be greatly appreciated.
Thanks
The "\r" is the line separator. The "\r\n" is also a line separator. Different platforms have different line separators.
A simple fix: if you read a line from a file yourself, then line.rstrip() will remove the whitespace from the line end.
A proper fix: use Python's standard CSV reader. It will skip the blank lines and comments, will properly handle quoted strings, etc.
Also, when working with long lists, it helps to stop thinking about them as index-addressed 'arrays' and use the 'stream' or 'sequential reading' metaphor.
So the typical way of handling a CSV file is something like:
import csv
with open('myfile.csv') as f:
reader = csv.reader(f)
# We assume that the file has 3 columns; adjust to taste
for (first_field, second_field, third_field) in reader:
# do something with field values of the current lines here