Get rid of extra commas from Excel file with Python

I've got a CSV file, and whenever I access the elements I get
aapl,2001-12-4,,,,,
The commas at the end are causing my functions in my other application to not work properly. How can I get rid of any additional commas after the elements?
For example, the line above after correction would be
aapl,2001-12-4
Any help is appreciated, thanks so much.

Why would you remove the trailing commas? Typically, commas with no value between them mean that the particular field is empty.
I think it would be better not to modify the line/file, but instead have your application split the line on commas and then do what you need to do with the resulting list of data:
import csv

with open('test.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        # filter(None, ...) drops the empty strings left by the trailing commas
        print(list(filter(None, row)))
This prints:
['aapl', '2001-12-4']

Here's how to remove the excess commas from the right hand side of a string:
In [2]: mystring = '1,2,3,4,"Hello!",,,,,,,,,'
In [3]: mystring.rstrip(',')
Out[3]: '1,2,3,4,"Hello!"'
In [4]:
Expand on this to perform the comma-stripping operation for each line of a file (a short sketch follows these steps):
Open the original .csv file.
Process one line, removing excess commas.
Write the processed line to a new file.
Repeat until your original .csv file is completely processed.
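A minimal sketch of those steps, assuming hypothetical file names original.csv and cleaned.csv:
# read each line of the original file, strip trailing commas, and write it to a new file
with open('original.csv') as src, open('cleaned.csv', 'w') as dst:
    for line in src:
        dst.write(line.rstrip('\n').rstrip(',') + '\n')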

Use str.rstrip:
>>> 'aapl,2001-12-4,,,,,'.rstrip(',')
'aapl,2001-12-4'

If you can use sed, you can do it this way from the command line:
sed -re 's/,*$//g' temp.csv

One of the simplest tricks is to use the usecols parameter of the read_csv function to limit which columns you read in.
For example:
import pandas as pd
from google.colab import files
import io
uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')), skiprows=1, usecols=range(10), header=None)
This limits the reader to the first 10 columns, since the extra commas start at column 11.
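Applied to the original two-column case, the same idea might look like this (the file name test.csv is a placeholder, not taken from the question):
import pandas as pd

# read only the first two columns; the trailing empty columns are never kept
df = pd.read_csv('test.csv', usecols=range(2), header=None)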

Related

How to prevent double quotes at start and end of DAT file, while writing a pandas DF using to_csv()?

My pandas DataFrame contains a lot of data. I am giving the file a '.DAT' extension (client requirement) and using to_csv() to write the data.
When I open the file in Notepad or any other text viewer, I see double quotes at the start and end of the file:
" col1|Col2|Col3
D1|D2|D3
... So On
D1n|D2n|D3n "
How can I remove these double quotes while writing the DataFrame as a CSV file?
I tried the quoting-related parameters of to_csv and the replace function. Please suggest a parameter combination to eliminate them.
To write to a CSV, you would normally do it like this.
Generally, this should not put quotes in the file by default.
import pandas as pd

df = pd.read_csv(r'path\to\source_folder\input.dat')
df.to_csv(r'path\to\folder\s.dat')
Can we see a sample of the code?
Try this:
df.to_csv("data.dat", header=False, index=False, sep='|')
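If the stray quotes come from the CSV quoting logic, another combination worth sketching (not confirmed against the original data) is to disable quoting entirely:
import csv
import pandas as pd

# hypothetical frame standing in for the real data
df = pd.DataFrame({'Col1': ['D1'], 'Col2': ['D2'], 'Col3': ['D3']})

# QUOTE_NONE turns quoting off completely; this assumes no field
# contains the '|' separator or a newline, otherwise pandas raises an error
df.to_csv('data.dat', sep='|', index=False, quoting=csv.QUOTE_NONE)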

How to split a txt into an array in Python?

I have this txt file and I need to parse it in a Python script.
00:08:e3:ff:fd:90 - 172.21.152.1
70:70:8b:85:67:80 - 172.21.155.4
I want to separate out and store only the MAC addresses in an array. How can I do this?
You can use the built-in function open to read the file, passing the path to the file and the "r" argument to indicate that you want to read it. Then use the readlines method of the file handle, which returns a list of lines. For each line, split the text on the dash; the MAC address will be the first element of the list returned by split.
with open("file.txt", "r") as f :
macs = [line.split(" - ")[0] for line in f.readlines()]
You could achieve this with pandas as well
import pandas as pd
macs = pd.read_table('file.txt', header=None, usecols=[0], delim_whitespace=True)
I think it would be unnecessary to use pandas for this purpose alone. However, if you are already using pandas anyway, I would prefer this approach.

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path
import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({ 'FileName': pdf_files[i], 'Text': scraped_text[i]},ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from #AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the standard library's csv module.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to make sure pdf_text is a single line, or else your CSV file will be somewhat broken. One way to work around that is to pick an arbitrary character to use in place of the newlines. If you pick the pipe character, for example, you can do something like this (already applied in the code above) prior to writer.writerow:
pdf_text.replace('\n', '|')
This is not meant to be a complete example but a starting point. I hope it helps.

Reading .csv files with different footer row length in Python

I am a complete noob at Python so I apologize if the solution is obvious.
I am trying to read some .csv field data in Python for processing. Currently I have:
data = pd.read_csv('somedata.csv', sep=' |,', engine='python', usecols=(range(0,10)), skiprows=155, skipfooter=3)
However, depending on whether the data collection was interrupted, the last few lines of the file may be something like:
#data_end
Run Complete
Or
Run Interrupted
ERROR
A bunch of error codes
Hence I can't just use skipfooter=3. Is there a way for Python to detect the length of the footer and skip it? Thank you.
You can first read the content of your file as a plain text file into a Python list, remove the lines that don't contain the expected number of separators, and then transform the list into an IO stream. This IO stream is then passed to pd.read_csv as if it were a file object.
The code might look like this:
from io import StringIO
import pandas as pd

# adjust these variables to meet your requirements:
number_of_columns = 11
separator = " |, "
# read the content of the file as plain text:
with open("somedata.csv", "r") as infile:
    raw = infile.readlines()
# drop the rows that don't contain the expected number of separators:
raw = [x for x in raw if x.count(separator) == number_of_columns]
# turn the list into an IO stream (after joining the rows into a big string):
stream = StringIO("".join(raw))
# pass the string as an argument to pd.read_csv():
df = pd.read_csv(stream, sep=separator, engine='python',
                 usecols=range(0, 10), skiprows=155)
If you use Python 2.7, you have to replace the first line, from io import StringIO, with the following two lines:
from __future__ import unicode_literals
from cStringIO import StringIO
This is because StringIO requires a unicode string (which is not the default in Python 2.7), and because the StringIO class lives in a different module in Python 2.7.
I think you simply have to resort to counting the commas on each line and manually finding the last correct one. I'm not aware of a read_csv parameter that automates that.
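A rough sketch of that comma-counting idea (the expected count of 10 commas per data row is an assumption, not taken from the actual file):
from io import StringIO
import pandas as pd

expected_commas = 10  # assumed number of separators in a valid data row
good_rows = []
with open('somedata.csv') as f:
    for line in f:
        if line.count(',') == expected_commas:
            good_rows.append(line)

# feed only the rows that look like data to pandas
df = pd.read_csv(StringIO(''.join(good_rows)), header=None)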

How to remove large spaces between the sentences in a text file?

I am working with a Unicode file. After processing it, I am getting very large spacing between sentences, for example:
തൃശൂരില്‍ ഹര്‍ത്താല്‍ പൂര്‍ണം
തൃശൂവില്‍ ഇടതുമുന്നണി ഹര്‍ത്താലില്‍ ജനജീവിതം പൂര്‍ണമായും സ്‌...
ഡി.വൈ.എഫ്‌.ഐ. ഉപരോധം; കലക്‌ടറേറ്റ്‌ സ്‌തംഭിച്ചു
തൃശൂര്‍: നിയമനനിരോധനം, അഴിമതി, വിലക്കയറ്റം എന്നീ വിഷയങ്ങള്‍ മുന്‍...
ബൈക്ക്‌ പോസ്‌റ്റിലിടിച്ച്‌ പതിന്നേഴുകാരന്‍ മരിച്ചു
How can I remove these large spaces?
I have tried this:
" ".join(raw.split())
It is not working at all. Any suggestions?
The easiest way is to write the results to another file, or rewrite your file. Most operating systems don't allow us to edit a file directly in place, especially appending to it. For simple cases like this, rewriting is much simpler:
with open('f.txt') as raw:
    # keep only non-blank lines, dropping the extra empty lines
    data = '\n'.join(line.rstrip() for line in raw if line.strip())
with open('f.txt', 'w') as raw:
    raw.write(data)
Hope this helps!
Assuming raw is your raw data, you need to split it using str.splitlines, filter out the empty lines, and rejoin them with newlines:
print('\n'.join(line for line in raw.splitlines() if line.strip()))
If you are open to using regex, you may also try:
import re
print(re.sub("\n+", "\n", raw))
If instead raw is a file object, you can collapse consecutive duplicate lines (such as runs of blank lines) into one:
from itertools import groupby
with open("<some-file>") as raw:
    data = ''.join(k for k, _ in groupby(raw))
Assuming the lines are empty (only a newline), using Python:
import re
import sys

f = sys.argv[1]
for line in open(f, 'r'):
    if not re.search('^$', line):
        print(line, end='')  # line already ends with a newline
or if you prefer:
egrep -v "^$" <filename>
