how to extract specific data from a csv file with given parameters? - python

I want to extract Neutral words from the given csv file (to a separate .txt file), but I'm fairly new to python and don't know much about file handling. I could not find a neutral words dataset, but after searching here and there, this is what I was able to find.
Here is the GitHub project from which I want to extract the data (just in case anyone needs to know): hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis
Neutral Words
Word Sentiment Score
a 0.0125160264947
the 0.00423728459134
it -0.0294755274737
and 0.0810574365028
an 0.0318918766949
or -0.274298468178
normal -0.0270787859177
So basically I want to extract only those words (text) from csv where the numeric value is 0.something.

Even without using any libraries, this is fairly easy with the csv you're using.
First open the file (I'm going to assume you have the path saved in the variable filename), then read the file with the readlines() function, and then filter out according to the condition you give.
with open(filename, 'r') as csv_file:  # Open the file for reading
    # Read the file line by line and split each line on commas
    rows = [line.strip().split(',') for line in csv_file.readlines()]
# Keep the word from every line whose score has an absolute value below 1
words = [row[0] for row in rows if abs(float(row[1])) < 1]
This is now the accepted answer, so I'm adding a disclaimer. There are numerous reasons why this code should not be applied to other CSVs without thought.
It reads the entire CSV in memory
It does not account for e.g. quoting
It is acceptable for very simple CSVs but the other answers here are better if you cannot be certain that the CSV won't break this code.
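To tie the steps above to the original goal, here is a minimal sketch that uses the csv module and skips any row whose second field is not a number (such as a header). The function name neutral_words, the threshold of 0.1, and the comma delimiter are assumptions made for illustration, not something taken from the linked project, so adjust them to your file.

```python
import csv
import io


def neutral_words(lines, threshold=0.1, delimiter=","):
    """Yield the words whose score lies within +/- threshold of zero."""
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        try:
            score = float(row[1])
        except ValueError:
            continue  # skip rows without a numeric score, e.g. a header
        if abs(score) < threshold:
            yield row[0]


# Hypothetical sample data in the same shape as the question's file
sample = io.StringIO(
    "Word,Sentiment Score\n"
    "a,0.0125160264947\n"
    "the,0.00423728459134\n"
    "or,-0.274298468178\n"
)
words = list(neutral_words(sample))
print(words)  # ['a', 'the']
```

To get the .txt file, pass an open file instead of the sample and write "\n".join(words) to an output file opened with mode 'w'.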

Here is one way to do it with only vanilla libs and not holding the whole file in memory
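import csv
def get_vals(filename):
    # Python 3: open in text mode with newline='' (on Python 2, use mode 'rb')
    with open(filename, newline='') as fin:
        reader = csv.reader(fin)
        for line in reader:
            # csv yields strings, so convert the score before comparing
            if float(line[-1]) <= 0:
                yield line[0]
words = get_vals(filename)
for word in words:
    pass  # do stuff...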

Use pandas like so:
import pandas
df = pandas.read_csv("yourfile.csv")
df.columns = ['word', 'sentiment']
To choose words by sentiment:
positive = df[df['sentiment'] > 0]['word']
negative = df[df['sentiment'] < 0]['word']
neutral = df[df['sentiment'] == 0]['word']

If you don't want to use any additional libraries, you can try with csv module. Note that delimiter='\t' can be different in your case.
import csv
with open('name.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row[1]) > 0.0:
            print(row[0] + ' ' + row[1])

Related

Parse pipe delimited CSV Python [duplicate]

I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes, which then must be escaped too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
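To make the two escaping conventions concrete, here is a small sketch that writes the same row both ways with the csv module and reads it back. The helper names write_row and read_row are made up for this illustration; the formatting parameters (quoting, escapechar) are the standard ones from the csv module.

```python
import csv
import io


def write_row(row, **fmt):
    """Serialize one row with a pipe delimiter and the given csv options."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="|", lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()


def read_row(text, **fmt):
    """Parse one pipe-delimited line with the given csv options."""
    return next(csv.reader(io.StringIO(text), delimiter="|", **fmt))


row = ["a|b", "c"]  # the first field contains the delimiter

# "Microsoft way": wrap the field in quotes
quoted = write_row(row, quoting=csv.QUOTE_MINIMAL)
# "UNIX way": no quotes, escape the delimiter with a backslash
escaped = write_row(row, quoting=csv.QUOTE_NONE, escapechar="\\")

print(repr(quoted))
print(repr(escaped))
```

Each line only parses back to the original row when the reader is given the same options that the writer used, which is why the format has to be known (or guessed) up front.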
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv
csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
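On Python 3 the same idea can be sketched as follows, this time also writing the comma-separated output the question asks for. The function name to_comma_separated and the choice to restrict the sniffer to tab and pipe are assumptions for illustration. Keep in mind that csv.Sniffer is a heuristic and raises csv.Error when it cannot decide.

```python
import csv
import io


def to_comma_separated(text, candidates="\t|"):
    """Guess the delimiter of text and return (delimiter, comma-separated text)."""
    # Only look at the start of the data, and only consider the candidates
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=candidates)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(text), dialect=dialect))
    return dialect.delimiter, out.getvalue()


piped = "a|b|c\n1|2|3\n4|5|6\n"
delimiter, converted = to_comma_separated(piped)
print(repr(delimiter))
print(converted)
```

For real files, read the text from the input file and write the returned string to the output file instead of printing it.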
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is to count the total number of fields in the two record sets (expecting that tab and pipe are not so common in the data itself). Another (if your data is strongly structured and you expect the same number of fields in each line) is to measure the standard deviation of the number of fields per line and take the record set with the smallest standard deviation.
In the following example you find the simpler statistic (total number of fields)
import csv
piperows = []
tabrows = []
# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()
# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()
# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
# count total fields
totfieldspipe = sum(len(row) for row in piperows)
totfieldstab = sum(len(row) for row in tabrows)
if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows
# the var yourrows contains the rows, now just write them in any format you like
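The other statistic mentioned above, the standard deviation of the number of fields per line, could be sketched like this on Python 3 (the statistics module is not available on 2.6). The function name guess_delimiter is an assumption, and so is the extra rule that a candidate must split at least one line: without it, a delimiter that never occurs would win with a perfect spread of zero.

```python
import csv
import io
import statistics


def guess_delimiter(text, candidates=("\t", "|")):
    """Pick the candidate whose per-line field counts vary the least."""
    scored = []
    for delim in candidates:
        counts = [len(row)
                  for row in csv.reader(io.StringIO(text), delimiter=delim)
                  if row]
        # Ignore candidates that never split anything (one field per line)
        if counts and max(counts) > 1:
            scored.append((statistics.pstdev(counts), delim))
    if not scored:
        raise ValueError("none of the candidate delimiters split any line")
    return min(scored)[1]


# Tab-separated data with a stray pipe inside the first field
sample = "x|y\t1\nz\t2\nw\t3\n"
print(repr(guess_delimiter(sample)))
```

Like the field-count version, this is only a heuristic and depends on the data being regular.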
Like this
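from __future__ import with_statement
import csv
import re
# input_path and output_path are assumed to hold the two file names
with open(input_path, "r") as source:
    with open(output_path, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            # drop the line ending, then split on either a tab or a pipe
            writer.writerow(re.split('[\t|]', line.rstrip('\r\n')))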
I would suggest taking some of the example code from the existing answers, or perhaps better use the csv module from python and change it to first assume tab separated, then pipe separated, and produce two output files which are comma separated. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe separated, else assume a tab separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in python but the output file is empty
f= open("test.csv",'r')
import csv
reader = csv.reader(f,delimiter="\t")
names=""
for each_line in reader:
    names=each_line[0]
First, you want to open your files. A good practice is to use the with statement (that, technically speaking, introduces a context manager) so that when your code exits from the with block all the files are automatically closed
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
next you want a loop on the lines of the input file (note the indentation, we are inside the with block), line splitting is automatic when you read a text file with lines separated by newlines…
    for line in inpfile:
each line is a string, but you think of it as two fields separated by white space — this situation is so common that strings have a method to deal with this situation (note again the increasing indent, we are in the for loop block)
        fields = line.split()
by default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc — that said, fields is a list of strings, for your first record it is equal to ['A', '32'] and you want to output just the first field in this list… for this purpose a file object has the .write() method, that writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
auto-splitting each line into a list of strings
taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion…
The situation is different when your data file is complex, has headers, or has quoted strings possibly containing quoted delimiters, etc. In those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages not included in the standard library, e.g., numpy and pandas, but that is another story.
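For comparison, here is roughly what the csv-module version of the same task looks like on Python 3. It is only a sketch: the function name first_column is made up, and the tab delimiter is an assumption taken from the question's own attempt.

```python
import csv
import io


def first_column(src, dst, delimiter="\t"):
    """Copy the first field of every non-empty row from src to dst."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if row:  # csv.reader returns [] for blank lines
            writer.writerow([row[0]])


# In-memory stand-ins for test.csv and the output file
src = io.StringIO("A\t32\nD\t21\nC\t2\nB\t20\n")
dst = io.StringIO()
first_column(src, dst)
print(dst.getvalue())
```

With real files, open both with newline='' and pass the file objects in place of the StringIO buffers.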
This answer reads the CSV file, understanding a column to be demarked by a space character. You have to add the header=None otherwise the first row will be taken to be the header / names of columns.
ss is a slice - the 0th column, taking all rows as denoted by :
The last line writes the slice to a new filename.
import pandas as pd
df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]  # .ix was removed from pandas; .iloc selects by position
ss.to_csv('new_path.csv', sep=' ', index=False, header=False)
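import csv
# Python 3: open csv files in text mode with newline=''
with open("test.csv", newline='') as src, open("output.csv", "w", newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst)
    for e in reader:
        writer.writerow([e[0]])  # writerow expects a sequence of fields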
One option is to create an empty list, append the first column of each row to it, and then write that list to another csv, for example:
import csv
def writetocsv(l):
    # make a plain list copy of the values
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])
adcb_list = []
with open("test.csv", 'r', newline='') as f:
    reader = csv.reader(f, delimiter="\t")
    for each_line in reader:
        adcb_list.append(each_line[0])  # keep only the first column
writetocsv(adcb_list)
hope this works for you :-)

writing the data in text file while converting it to csv

I am very new to Python. I have a .txt file and want to convert it to a .csv file in a format I was given, but I could not manage it. Any help would be appreciated. I will explain it with screenshots.
I have a txt file with the name of bip.txt. and the data inside of it is like this
I want to convert it to csv like this csv file
So far, what I could do is only writing all the data from text file with this code:
import glob
read_files = glob.glob("C:/Users/Emrehana1/Desktop/bip.txt")
with open("C:/Users/Emrehana1/Desktop/Test_Result_Report.csv", "w") as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
So is there a solution to convert it to a csv file in the format I desire? I hope I have explained it clearly.
There's no need to use the glob module if you only have one file and you already know its name. You can just open it. It would have been helpful to quote your data as text, since as an image someone wanting to help you can't just copy and paste your input data.
For each entry in the input file you will have to read multiple lines to collect together the information you need to create an entry in the output file.
One way is to loop over the lines of input until you find one that begins with "test:", then get the next line in the file using next() to create the entry:
The following code will produce the split you need - creating the csv file can be done with the standard library module, and is left as an exercise. I used a different file name, as you can see.
with open("/tmp/blip.txt") as f:
    for line in f:
        if line.startswith("test:"):
            test_name = line.strip().split(None, 1)[1]
            result = next(f)
            if not result.startswith("outcome:"):
                raise ValueError("Test name not followed by outcome for test "+test_name)
            outcome = result.strip().split(None, 1)[1]
            print(test_name, outcome)
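For anyone who wants to check their solution to the exercise, the csv-writing step might look something like the sketch below. The input layout (a "test:" line immediately followed by an "outcome:" line) is the same assumption the code above makes, since the real data was only posted as an image. The function name pairs_to_csv and the column headers are made up for illustration.

```python
import csv
import io


def pairs_to_csv(lines, out):
    """Write one csv row per 'test:' line and the 'outcome:' line after it."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["test", "outcome"])  # hypothetical column headers
    it = iter(lines)
    for line in it:
        if line.startswith("test:"):
            test_name = line.split(":", 1)[1].strip()
            result = next(it, "")  # the line right after the test name
            if not result.startswith("outcome:"):
                raise ValueError(
                    "Test name not followed by outcome for test " + test_name)
            writer.writerow([test_name, result.split(":", 1)[1].strip()])


# In-memory stand-ins for the input and output files
source = io.StringIO(
    "test: login\noutcome: pass\n\ntest: logout\noutcome: fail\n")
target = io.StringIO()
pairs_to_csv(source, target)
print(target.getvalue())
```

Using csv.writer rather than joining strings by hand means that a test name containing a comma is quoted correctly.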
You do not use the glob function to open a file; it searches for file names matching a pattern. You could open the file bip.txt, read each line and collect the values in a list, and once all of the values have been found, join the columns with commas and the rows with newlines and write the result to a csv file, like this:
# set the csv column headers
values = [["test", "outcome"]]
current_row = []
with open("bip.txt", "r") as f:
    for line in f:
        # when a blank line is found, append the row
        if line == "\n" and current_row != []:
            values.append(current_row)
            current_row = []
        if ":" in line:
            # get the value after the colon
            value = line[line.index(":")+1:].strip()
            current_row.append(value)
# append the final row to the list, unless the file ended with a blank line
if current_row:
    values.append(current_row)
# join the columns with a comma and the rows with a new line
csv_result = ""
for row in values:
    csv_result += ",".join(row) + "\n"
# output the csv data to a file
with open("Test_Result_Report.csv", "w") as f:
    f.write(csv_result)

More efficient way to go through .csv file?

I'm trying to parse through a dictionary in a .CSV file, using two lists in separate .txt files so that the script knows what it is looking for. The idea is to find a line in the .CSV file which matches both a Word and an IDNumber, and then pull out a third variable if there is a match. However, the code is running really slowly. Any ideas how I could make it more efficient?
import csv
IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'
WordsOfInterest_ReadIn = open(WordsOfInterest_filename).read().split('\n')
#IDNumberListtoRead = open(IDNumberList_filename).read().split('\n')
for CurrentIDNumber in open(IDNumberList_filename).readlines():
    for CurrentWord in open(WordsOfInterest_filename).readlines():
        FoundCurrent = 0
        with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if ((row['IDNumber'] == CurrentIDNumber) and (row['Word'] == CurrentWord)):
                    FoundCurrent = 1
                    CurrentProportion= row['CurrentProportion']
        if FoundCurrent == 0:
            CurrentProportion=0
        else:
            CurrentProportion=1
            print('found')
First of all, consider loading the file dictionary_individualwords.csv into memory. A Python dictionary is probably the right data structure for this case.
You are opening the CSV file N times, where N = (# lines in IDs.txt) * (# lines in dictionary_WordsOfInterest.txt). If the file is not too large, you can avoid that by saving its content to a dictionary or a list of lists.
In the same way, you open dictionary_WordsOfInterest.txt every time you read a new line from IDs.txt.
Also, it seems that you are looking for every possible combination of the pair (CurrentIDNumber, CurrentWord) from the txt files. So, for example, you can store the IDs in one set and the words in another, and for each row in the csv file, check whether both the ID and the word are in their respective sets.
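A minimal sketch of the two-sets idea, assuming the column names from the question (IDNumber, Word, CurrentProportion). The function name find_proportions and the returned dictionary keyed by (ID, word) are illustrative choices, not part of the original code.

```python
import csv
import io


def find_proportions(csv_lines, ids, words):
    """Return {(id, word): proportion} for rows matching both sets."""
    # Strip line endings so values read with readlines() still match
    ids = {i.strip() for i in ids}
    words = {w.strip() for w in words}
    found = {}
    for row in csv.DictReader(csv_lines):  # the csv is read exactly once
        if row["IDNumber"] in ids and row["Word"] in words:
            found[(row["IDNumber"], row["Word"])] = row["CurrentProportion"]
    return found


# Hypothetical data standing in for the three files
data = io.StringIO(
    "IDNumber,Word,CurrentProportion\n"
    "1,cat,0.5\n"
    "2,dog,0.1\n"
    "1,dog,0.3\n"
)
matches = find_proportions(data, ["1\n", "3\n"], ["cat\n", "dog\n"])
print(matches)
```

Set membership tests take roughly constant time, so the total work grows with the number of csv rows rather than with rows × IDs × words. Any (ID, word) pair missing from the result can be treated as a proportion of 0.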
As you use readlines for the .txt files, you already build an in-memory list from them. You should build those lists first and then parse the csv file only once. Something like:
import csv
IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'
# Build the in-memory lists first, stripping the trailing newlines
with open(IDNumberList_filename) as f:
    numberlist = [line.strip() for line in f]
with open(WordsOfInterest_filename) as f:
    wordlist = [line.strip() for line in f]
# Then parse the csv file only once
with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for CurrentIDNumber in numberlist:
            for CurrentWord in wordlist:
                if row['IDNumber'] == CurrentIDNumber and row['Word'] == CurrentWord:
                    CurrentProportion = row['CurrentProportion']
                    print('found')
Beware: untested

How to remove rows from a csv file when compared to a list in a txt file using Python?

I have a list of 12.000 dictionary entries (the words only, without their definitions) stored in a .txt file.
I have a complete dictionary with 62.000 entries (the words with their definitions) stored in .csv file.
I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows containing the entries that don't appear in the smaller list. In other words, I want to purge this dictionary to only 12.000 entries.
The .txt file is ordered in separate lines like this, line by line:
word1
word2
word3
The .csv file is ordered like this:
ID (column 1) WORD (column 2) MEANING (column 3)
How do I accomplish this using Python?
Good answers so far. If you want to get minimalistic...
import csv
lookup = set(l.strip().lower() for l in open(path_to_file3))
# writerows consumes the generator; a bare map() would be lazy on Python 3
csv.writer(open(path_to_file2, 'w')).writerows(
    row for row in csv.reader(open(path_to_file))
    if row[1].lower() in lookup)
The following will not scale well, but should work for the number of records indicated.
import csv
with open(path_to_file3, 'r') as words_file:
    # strip the line endings so the words match the csv values
    lookup = set(word.strip() for word in words_file)
with open(path_to_file, 'r') as f_in, open(path_to_file2, 'w') as f_out:
    csv_in = csv.reader(f_in)
    csv_out = csv.writer(f_out)
    for line in csv_in:
        if line[1] in lookup:  # column 2 holds the word
            csv_out.writerow(line)
One of the least known facts of current computers is that when you delete a line from a text file and save the file, most of the time the editor does this:
load the file into memory
write a temporary file with the rows you want
close the files and move the temp over the original
So you have to load your wordlist:
import csv
import os
with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  # you said the file was small
Then you open the input file:
with open('input.csv') as i:
    with open('output.csv', 'w') as o:
        output = csv.writer(o)
        for line in csv.reader(i):  # iterate over the CSV line by line
            if line[1] in wordlist:  # keep the row if column 2, the word, is in the list
                output.writerow(line)
os.rename('output.csv', 'input.csv')  # move the filtered file over the original
This is untested, now go do your homework and comment here if you find any bug... :-)
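For reference, the load / write-a-temp-file / move-over-the-original sequence could be sketched on Python 3 as below. The function name keep_listed_rows and its parameters are hypothetical. Putting the word in column 2 follows the question's description, and has_header is an assumption about whether the file starts with a header row.

```python
import csv
import os
import tempfile


def keep_listed_rows(csv_path, wordlist_path, word_column=1, has_header=False):
    """Rewrite csv_path so only rows whose word is in the word list remain."""
    with open(wordlist_path, encoding="utf-8") as f:
        wanted = {line.strip() for line in f if line.strip()}

    # Create the temp file next to the original so the final move stays
    # on the same filesystem.
    directory = os.path.dirname(os.path.abspath(csv_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as dst, \
                open(csv_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            if has_header:
                header = next(reader, None)
                if header is not None:
                    writer.writerow(header)  # always keep the header row
            for row in reader:
                if len(row) > word_column and row[word_column] in wanted:
                    writer.writerow(row)
        os.replace(tmp_path, csv_path)  # move the temp file over the original
    except BaseException:
        # Clean up the partial temp file and leave the original untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call might look like keep_listed_rows('input.csv', 'wordlist.txt', has_header=True). If anything fails before the move, the original file is left as it was.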
I would use pandas for this. The data set's not large, so you can do it in memory with no problem.
import pandas as pd
# words.txt has no header row, so name its single column explicitly
words = pd.read_csv('words.txt', header=None, names=['WORD'])
defs = pd.read_csv('defs.csv')
words.set_index('WORD', inplace=True)
defs.set_index('WORD', inplace=True)
new_defs = words.join(defs)
new_defs.to_csv('new_defs.csv')
You might need to manipulate new_defs to make it look like you want it to, but that's the gist of it.
