Reading in Excel file with corrupt data using Python

I am trying to read in a table from a .CSV file which should have 5 columns.
However, some rows have corrupt data, which gives them more than 5 columns.
How do I reject those rows and continue reading?
Using
temp = read_table(folder + r'\temp.txt', sep='\t')
just gives an error and stops the program.
I am new to Python, so any help is appreciated.
Thanks

Look into using Python's csv module.
Without testing the damaged file it is difficult to say whether this will do the trick. However, csv.reader returns each row of a csv file as a list of strings, so you could check whether the list has 5 elements and proceed that way.
A code example:
import csv

out = []
# newline='' is what the csv module expects when opening files in Python 3
with open('file.csv', newline='') as csvfile:
    # tab delimiter to match sep in the question; adjust it to your file
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        # keep only rows with exactly 5 fields
        if len(row) == 5:
            out.append(row)
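A slightly fuller sketch of the same idea, assuming a tab-separated file and that you also want to know which lines were rejected. The function name split_valid_rows and its parameters are made up for illustration; they are not part of the csv module.

```python
import csv
import io


def split_valid_rows(lines, expected=5, delimiter='\t'):
    """Return (good_rows, bad_line_numbers) for an iterable of text lines.

    Hypothetical helper: rows whose field count differs from `expected`
    are skipped and their line numbers are recorded.
    """
    good, bad = [], []
    reader = csv.reader(lines, delimiter=delimiter)
    for row in reader:
        if len(row) == expected:
            good.append(row)
        else:
            # reader.line_num is the number of source lines read so far
            bad.append(reader.line_num)
    return good, bad


# In-memory stand-in for the damaged file: line 2 has 7 fields.
sample = io.StringIO("a\tb\tc\td\te\n1\t2\t3\t4\t5\t6\t7\nf\tg\th\ti\tj\n")
good, bad = split_valid_rows(sample)
print(len(good), bad)
```

With a real file, pass open(path, newline='') instead of the StringIO object.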

Related

Python CSV: Loop stops after reaching empty row

I want to read a CSV file generated by my other script and I need to check 2 columns at the same time. The problem is that my loop stops because some lines have empty values, so it can't reach the values that follow. For example:
HASH 1111
HASH 2222
HASH 3333
HASH 4444
HASH 5555
HASH
HASH
HASH 6666
I can't read past row 5, because rows 6 and 7 have empty values, and I also need to read row 8. Here is my code:
import csv
with open('vts.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile, delimiter=';')
    next(csvReader)
    VTs = []
    for row in csvReader:
        VT = row
        VTs.append(VT)
for row in VTs:
    print(row[0],row[4])
Is there any way to continue the listing without manually sorting the Excel file?
First, a csv file is not an Excel file. The former is a delimited text file, the latter is a binary one.
Next, your problem is not at reading time: the csv module readily accepts files with a variable number of fields per row, including empty lines, which simply give an empty list for row. The error comes from indexing row[4] on a row that has fewer than 5 fields.
So the fix is just:
...
for row in VTs:
    if len(row) > 4:
        print(row[0],row[4])
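To see that behaviour without touching a real file, here is a minimal sketch using an in-memory sample. The sample data is invented for illustration; it is not the asker's vts.csv.

```python
import csv
import io

# Semicolon-separated sample with a short row and a blank line.
text = "h1;h2;h3;h4;h5\nA;1;2;3;4\nB\n\nC;5;6;7;8\n"

rows = list(csv.reader(io.StringIO(text), delimiter=';'))
lengths = [len(row) for row in rows]
print(lengths)

# Only rows that actually have a fifth field are safe to index with row[4].
pairs = [(row[0], row[4]) for row in rows[1:] if len(row) > 4]
print(pairs)
```

The reader itself never fails on the short or blank rows; only the unguarded index would.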
There is no problem with your code except for print(row[0],row[4]): the given data does not have that many columns. I tested your code as follows:
.py
import csv
with open('vts.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile, delimiter=';')
    next(csvReader)
    VTs = []
    for row in csvReader:
        VT = row
        VTs.append(VT)
for row in VTs:
    print(row[0], row[1])
vts.csv
HASH;1111
HASH;2222
HASH;3333
HASH;4444
HASH;5555
HASH;
HASH;
HASH;6666
If your data looks like the sample, you don't really need delimiter=';', since a csv file is comma-separated (hence the name), not semicolon-separated.
Anyway, you can simply skip a row when the intended column does not exist. Assume your input is in proper csv format, as below.
col1,col2
hash1,1111
hash2,2222
...
You can use csv.reader as you did.
import csv
with open('vts.csv') as csvDataFile:
    # no delimiter argument: the default is a comma
    csvReader = csv.reader(csvDataFile)
    next(csvReader)
    # csv.reader returns an iterator, which you can convert to a list as below
    VTs = list(csvReader)
for row in VTs:
    if len(row) == 2:
        print(row[0],row[1])
If your goal is only to inspect the data, you can conveniently use pandas.DataFrame:
import pandas as pd
df = pd.read_csv("vts.csv")
print(df.dropna()) # This will print all rows without any missing data

How to copy entire row of excel (.csv) which contain specific words into another csv file using python?

I have to copy all the rows which contain a specific word into another csv file.
My file is in .csv format and I want to copy all rows which contain the word "Canada" in one of the cells. I have tried the various methods given on the internet, but I am unable to copy my rows. My data contains more than 15,000 lines.
Example of my dataset includes:
tweets date area
dbcjhbc 12:4:19 us
cbhjc 3:3:18 germany
cwecewc 5:6:19 canada
cwec 23:4:19 us
wncwjwk 9:8:18 canada
My code is:
import csv
with open('twitter-1.csv', "r" ,encoding="utf8") as f:
    reader = csv.DictReader(f, delimiter=',')
    with open('output.csv', "w") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=",")
        writer.writeheader()
        for row in reader:
            if row == 'Canada':
                writer.writerow(row)
But this code is not working and I am getting the error:
Error: field larger than field limit (131072)
I know the question is asking for a solution in Python, but I believe this task can be solved more easily with command-line tools.
One-liner using Bash:
grep 'canada' myFile.csv > outputfile.csv
You can do this even without the csv module.
# read file and split by newlines (get list of rows)
with open('input.csv', 'r') as f:
    rows = f.read().split('\n')
# loop over rows and append to list if they contain 'canada'
rows_containing_keyword = [row for row in rows if 'canada' in row]
# create and write lines to output file
with open('output.csv', 'w+') as f:
    f.write('\n'.join(rows_containing_keyword))
All solutions except the grep one (which is probably the fastest if grep is available) load the entire .csv file into memory. Don't do that! You can stream the file and keep only one line in memory at a time.
# "if" is a reserved word in Python, so the file handles need other names
with open('input.csv', 'r') as f_in, open('output.csv', 'w') as f_out:
    for line in f_in:
        if 'canada' in line:
            f_out.write(line)
NOTE: I don't actually have python3 on this computer, so there might be a typo in this code. But I'm confident it's more efficient on sufficiently large files than loading the entire file into memory before manipulating it. It would be interesting to see benchmarks.
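If a plain substring match is too loose, a csv-aware variant can stream row by row as well. The following is only a sketch under a few assumptions: the file has a header row, a cell should match the keyword exactly (ignoring case and surrounding spaces), and the helper name copy_matching_rows is invented for illustration. Raising csv.field_size_limit addresses the "field larger than field limit" error from the question, although that error can also be a sign of a stray quote character swallowing many lines, which is worth checking in the data.

```python
import csv
import io


def copy_matching_rows(src, dst, keyword='canada', field_limit=10**7):
    """Stream rows from src to dst, keeping those with a cell equal to keyword.

    Hypothetical helper; src and dst are open text files (or StringIO objects).
    """
    # Raise the csv module's per-field size cap (the default is 131072).
    csv.field_size_limit(field_limit)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator='\n')
    writer.writerow(next(reader))          # copy the header row
    kept = 0
    for row in reader:                     # one row in memory at a time
        if any(cell.strip().lower() == keyword for cell in row):
            writer.writerow(row)
            kept += 1
    return kept


src = io.StringIO(
    "tweets,date,area\n"
    "dbcjhbc,12:4:19,us\n"
    "cwecewc,5:6:19,Canada\n"
    "wncwjwk,9:8:18,canada\n"
)
dst = io.StringIO()
kept = copy_matching_rows(src, dst)
print(kept)
print(dst.getvalue())
```

For real files, open the input with newline='' and an explicit encoding, and the output with 'w' and newline=''.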
Assuming your .csv data (twitter-1.csv) looks like this:
tweets,date,area
dbcjhbc,12:4:19,us
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
cwec,23:4:19,us
wncwjwk,9:8:18,canada
Using numpy:
import numpy as np
# import .csv data (skipping header)
data = np.genfromtxt('twitter-1.csv', delimiter=',', dtype=str, skip_header=1)
# select only rows where the 'area' column is 'canada'
data_canada = data[np.where(data[:,2]=='canada')]
# export the resulting data
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')
foo.csv will contain:
cwecewc,5:6:19,canada
wncwjwk,9:8:18,canada
If you want to search every entry (every column) for canada, you could use a list comprehension. Assume twitter-1.csv contains an occurrence of canada in the tweets column:
tweets,date,area
dbcjhbc,12:4:19,us
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada
This will return all rows with any occurrence of canada:
out = [i for i, v in enumerate(data) if 'canada' in v]
data_canada = data[out]
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')
Now, foo.csv will contain:
cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada

Only outputting a few lines into a text file, instead of all of them

I've made a Python script that grabs information from a .csv archive, and outputs it into a text file as a list. The original csv file has over 200,000 fields to input and output from, yet when I run my program it only outputs 36 into the .txt file.
Here's the code:
import csv
with open('OriginalFile.csv', 'r') as csvfile:
    emailreader = csv.reader(csvfile)
    f = open('text.txt', 'a')
    for row in emailreader:
        f.write(row[1] + "\n")
And the text file only lists up to 36 strings. How can I fix this? Is maybe the original csv file too big?
After many comments, it turned out that the original problem was the character encoding of the csv file. If you specify the encoding in pandas it will read the file just fine.
Any time you are dealing with a csv file (or excel, sql or R) I would use Pandas DataFrames. The syntax is shorter and it is easier to see what is going on.
import pandas as pd
csvframe = pd.read_csv('OriginalFile.csv', encoding='utf-8')
with open('text.txt', 'a') as output:
    # I think what you wanted was the 2nd column from each row
    # iloc selects by position: ":" is all the rows and 1 is the second column
    # (.ix was removed from pandas; astype(str) guards against non-string values)
    output.write('\n'.join(csvframe.iloc[:, 1].astype(str)))
You might have luck with something like the following:
import csv
with open('OriginalFile.csv', 'r') as csvfile:
    emailreader = csv.reader(csvfile)
    with open('text.txt','w') as output:
        for line in emailreader:
            output.write(line[1]+'\n')
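Since the diagnosis was an encoding problem, the same can be tried with the standard library alone by passing an encoding to open(). This is a sketch that assumes the file is mostly UTF-8 and that replacing undecodable bytes is acceptable; the helper name second_column is made up for illustration.

```python
import csv
import os
import tempfile


def second_column(path, encoding='utf-8'):
    """Return the second field of every row that has one (hypothetical helper)."""
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
    with open(path, newline='', encoding=encoding, errors='replace') as f:
        return [row[1] for row in csv.reader(f) if len(row) > 1]


# Build a small file containing one byte (0xE9) that is not valid UTF-8.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"name,email\nann,ann@example.com\nren\xe9,rene@example.com\n")

values = second_column(tmp.name)
os.remove(tmp.name)
print(values)
```

If you know the real encoding of the file (for example 'latin-1' or 'cp1252'), passing it explicitly is preferable to replacing characters.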

breaking csv file in two files with python

This is my first project using python and I'm not that great at programming. I have a csv file with two tables in it.
table 1 title
row1
row2
...
blank row
blank row
table 2 title
row1
row2
...
Here is my code
import csv
csv_file = open('usagebased.csv')
csv_reader = csv.reader(csv_file, delimiter=',')
next(csv_reader)
So I want to split the file into two csv files. What is the best way to do it? Can I split the file based on the title of table 2 or on the blank rows?
Thanks!
The function csv.reader accepts any object that supports the iterator protocol and yields a string on each call to next(). Knowing that, you can split the text of your csv file in two at the blank rows, turn each half into a list of lines, and feed each list to its own csv.reader.
import csv
with open('usagebased.csv') as f:
    two_tables = f.read().split("\n\n\n")
# Feed first csv.reader (splitlines() turns the text into a list of rows;
# iterating over the bare string would yield single characters)
first_csv = csv.reader(two_tables[0].splitlines(), delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(two_tables[1].splitlines(), delimiter=',')
Thanks to both of you I succeeded. Thank you very much.
with open('usagebased.csv') as f:
    tables = f.read().split("\n\n\n")
# the with blocks close the output files, so the data is flushed to disk
with open('test1.csv', 'w') as file1:
    file1.write(tables[0])
with open('test2.csv', 'w') as file2:
    file2.write(tables[1])
In case we don't know the number of blank lines between the tables, this can help.
I tried to write the code as simply as possible.
Since you don't do anything csv-specific with the file's contents, you don't need the csv library.
FzListe
7MA1, 7OS1
7MA1, 7ZJB
7MA2, 7MA3, 7OS1
76G1, 7MA1, 7OS1
7MA1, 7OS1
71E5, 71E6, 7MA1, FSS1
Here is the code:
name = 0
in_gap = False  # True while we are inside a run of blank lines
with open('test.txt', 'rt') as f:
    for s in f:
        if s == '\n':
            in_gap = True
            continue
        if in_gap:
            # the first non-blank line after a gap starts a new output file
            name += 1
            in_gap = False
        with open(str(name), 'at') as ff:
            ff.write(s)
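The same "split wherever there is a run of blank lines" idea can also be sketched with itertools.groupby, which keeps the grouping logic separate from the file writing. The helper name split_on_blank_runs is invented for illustration, and the sample text stands in for the real file.

```python
import io
from itertools import groupby


def split_on_blank_runs(lines):
    """Group lines into tables separated by one or more blank lines.

    Hypothetical helper: returns a list of tables, each a list of lines.
    """
    tables = []
    # groupby clusters consecutive lines by whether they are blank.
    for is_blank, group in groupby(lines, key=lambda s: s.strip() == ''):
        if not is_blank:
            tables.append(list(group))
    return tables


sample = io.StringIO("table 1 title\nrow1\nrow2\n\n\ntable 2 title\nrow1\n")
tables = split_on_blank_runs(sample)
print(len(tables))
```

Each entry of tables can then be written to its own file with writelines(), or passed straight to csv.reader.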

How to copy multiple rows and one column from one CSV file to another CSV Excel?

I am extremely new to python(coding, for that matter).
Could I please get some help as to how can I achieve this. I have gone through numerous threads but nothing helped.
My input file looks like this:
I want my output file to look like this:
The output is just the first column replicated twice in the second Excel sheet, with a blank line after every 5 rows.
A .csv file can be opened with a normal text editor. Do this and you'll see that the entries for each column are comma-separated (csv = comma-separated values). In your file the separator is most likely a semicolon (;), though.
Since you're new to coding, I recommend trying it manually with a text editor first until you have the desired output, and then trying to replicate it with python.
Also, you should post code examples here and ask specific questions about why the code doesn't work the way you expected.
Below is the solution. Don't forget to configure input/output files and the delimiter:
# raw strings keep the backslashes in Windows paths literal
input_file = r'c:\Temp\input.csv'
output_file = r'c:\Temp\output.csv'
delimiter = ';'
i = 0
output_data = ''
with open(input_file) as f:
    for line in f:
        i += 1
        # repeat the line's content after the delimiter
        output_data += line.strip() + delimiter + line
        if i == 5:
            # blank line after every 5 rows
            output_data += '\n'
            i = 0
with open(output_file, 'w') as file_:
    file_.write(output_data)
Python has a csv module for doing this. It automatically reads each row into a list of columns. You can then simply take the first element and replicate it into the second column of an output file.
import csv
# in Python 3, open csv files in text mode with newline=''
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    input_rows = list(csv_input)
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for line, row in enumerate(input_rows, start=1):
        csv_output.writerow([row[0], row[0]])
        if line % 5 == 0:
            csv_output.writerow([])
Note that it is not advisable to write the updated data directly over the input file: if there were a problem, you would lose your original file.
If your input file has multiple columns, this script will drop the others and simply duplicate the first column.
By default, the csv format separates columns with a comma. This can be changed by specifying the desired delimiter as follows:
csv_output = csv.writer(f_output, delimiter=';')
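For larger inputs, the same transformation can be done row by row without building a list first. The sketch below wraps it in a function so it can be tried on in-memory data; duplicate_first_column and its every parameter are names made up for illustration.

```python
import csv
import io


def duplicate_first_column(src, dst, every=5):
    """Write [first, first] per input row, with an empty row after each `every` rows.

    Hypothetical helper; src and dst are open text files (or StringIO objects).
    """
    writer = csv.writer(dst, lineterminator='\n')
    count = 0
    for row in csv.reader(src):
        if not row:            # skip blank input lines
            continue
        writer.writerow([row[0], row[0]])
        count += 1
        if count % every == 0:
            writer.writerow([])
    return count


src = io.StringIO("a,x\nb,y\nc,z\n")
dst = io.StringIO()
count = duplicate_first_column(src, dst, every=2)
print(dst.getvalue())
```

Here every=2 is used only to keep the sample short; the default of 5 matches the question.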
