Hello, I am not very familiar with programming and found Stack Overflow while researching my task. I want to do natural language processing on a .csv file that looks like this and has about 15,000 rows:
ID | Title | Body
----------------------------------------
1 | Who is Jack? | Jack is a teacher...
2 | Who is Sam? | Sam is a dog....
3 | Who is Sarah?| Sarah is a doctor...
4 | Who is Amy? | Amy is a wrestler...
I want to read the .csv file, do some basic NLP operations, and write the results back to a new file or the same file. After some research, Python and NLTK seem to be the technologies I need (I hope that's right). After tokenizing, I want my .csv file to look like this:
ID | Title | Body
-----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Sam" "is" "a" "dog"....
3 | "Who" "is" "Sarah" "?"| "Sarah" "is" "a" "doctor"...
4 | "Who" "is" "Amy" "?" | "Amy" "is" "a" "wrestler"...
What I have achieved after a day of research and putting pieces together looks like this:
ID | Title | Body
----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Jack" "is" "a" "teacher"...
3 | "Who" "is" "Sarah" "?"| "Jack" "is" "a" "teacher"...
4 | "Who" "is" "Amy" "?" | "Jack" "is" "a" "teacher"...
My first idea was to read a specific cell in the .csv, do an operation, and write it back to the same cell, and then somehow do that automatically on all rows. Obviously I managed to read a cell and tokenize it. But I could not manage to write it back into that specific cell, and I am far away from "do that automatically on all rows". I would appreciate some help if possible.
My code:
import csv
from nltk.tokenize import word_tokenize
############Read CSV File######################
########## ID , Title, Body####################
line_number = 1 #line to read (need some kind of loop here)
column_number = 2 # column to read (need some kind of loop here)
with open('test10in.csv', 'rb') as f:
    reader = csv.reader(f)
    reader = list(reader)
    text = reader[line_number][column_number]
    stringtext = ''.join(text)  # tokenizing only works on strings
    tokenizedtext = (word_tokenize(stringtext))
    print(tokenizedtext)

#############Write back in same cell in new CSV File######
with open('test11out.csv', 'wb') as g:
    writer = csv.writer(g)
    for row in reader:
        row[2] = tokenizedtext
        writer.writerow(row)
I hope I asked the question correctly and someone can help me out.
The pandas library will make all of this much easier.
pd.read_csv() will handle the input much more easily, and you can apply the same function to a column using pd.DataFrame.apply()
Here's a quick example of how the key parts you'll want work. In the .applymap() method, you can replace my lambda function with word_tokenize() to apply that across all elements instead.
In [58]: import pandas as pd
In [59]: pd.read_csv("test.csv")
Out[59]:
0 1
0 wrestler Amy dog is teacher dog dog is
1 is wrestler ? ? Sarah doctor teacher Jack
2 a ? Sam Sarah is dog Sam Sarah
3 Amy a a doctor Amy a Amy Jack
In [60]: df = pd.read_csv("test.csv")
In [61]: df.applymap(lambda x: x.split())
Out[61]:
0 1
0 [wrestler, Amy, dog, is] [teacher, dog, dog, is]
1 [is, wrestler, ?, ?] [Sarah, doctor, teacher, Jack]
2 [a, ?, Sam, Sarah] [is, dog, Sam, Sarah]
3 [Amy, a, a, doctor] [Amy, a, Amy, Jack]
Also see: http://pandas.pydata.org/pandas-docs/stable/basics.html#row-or-column-wise-function-application
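Applied to your actual file, a minimal sketch might look like this (assuming your header row gives the columns ID, Title and Body, and that NLTK's punkt tokenizer data is installed):

import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('test10in.csv')

# Tokenize the two text columns element by element.
df[['Title', 'Body']] = df[['Title', 'Body']].applymap(word_tokenize)

df.to_csv('test11out.csv', index=False)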
You first need to parse your file and then process (tokenize, etc.) each field separately.
If your file really looks like your sample, I wouldn't call it a CSV. You could still parse it with the csv module, which is specifically for reading all sorts of CSV files: add delimiter="|" to the arguments of csv.reader() to separate your rows into cells. (And don't open the file in binary mode.) But your file is easy enough to parse directly:
with open('test10in.csv', encoding="utf-8") as fp:  # Or whatever encoding is right
    content = fp.read()

lines = content.splitlines()
allrows = [[fld.strip() for fld in line.split("|")] for line in lines]

# Headers and data (skipping the dashed separator row):
headers = allrows[0]
rows = allrows[2:]
You can then use nltk.word_tokenize() to tokenize each field of rows, and go on from there.
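For instance, a minimal sketch of that tokenizing step (again assuming NLTK's punkt data is installed):

from nltk.tokenize import word_tokenize

# Tokenize every field of every data row; the headers stay untouched.
tokenized_rows = [[word_tokenize(field) for field in row] for row in rows]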
I'm trying to have my program read a single line formed by words separated by commas. For example if we have:
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
in the input file, the program would need to separate each word on a single line and ditch the commas. After that the program would count frequencies of the words in the input file.
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
for word in seperated:
    freq = seperated.count(word)
    print(word, freq)
The problem with this code is that it prints a count for every occurrence of a word, so a word that appears twice is printed (and counted) twice. The output for this program would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 1
Hello 1
man 2
cat 2
woman 1
dog 2
Cat 1
hey 2
boy 1
The correct output would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 2
Hello 1
woman 1
Cat 1
The question is: how do I make my output look more polished by printing each word once, with its final count instead of the initial one?
This is a common pattern and a core programming skill. You should try collecting and counting words in a dictionary, incrementing each word's count as you encounter it. I'll give you the idea, but it's best you practise the exact implementation yourself. Happy hacking!
(I also recommend pprint, the "pretty print" module from the standard library.)

import pprint
from collections import defaultdict

word_dict = defaultdict(int)  # missing keys start at count 0
for word in seperated:        # 'seperated' is the word list from your code
    word_dict[word] += 1
pprint.pprint(dict(word_dict))
One extra tip: you may want to f.close() your file when you're finished, or open it in a with block so that happens automatically. (I originally suggested lower-casing the words so different capitalisations aren't counted separately, but your expected output shows you want them kept separate, so disregard that.)
Try using a dictionary:

f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')

wordsDict = {}
for word in seperated:
    if word not in wordsDict:
        wordsDict[word] = 1
    else:
        wordsDict[word] = wordsDict[word] + 1

for i in wordsDict:
    print(i, wordsDict[i])
Create a new dictionary, add each word as a key, and store its count as the value:
count_dict = {}
for w in seperated:
    count_dict[w] = seperated.count(w)

for key, value in count_dict.items():
    print(key, value)
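As an aside, the standard library's collections.Counter builds the same mapping in a single pass, without rescanning the list for every word:

from collections import Counter

count_dict = Counter(seperated)  # word -> number of occurrences
for key, value in count_dict.items():
    print(key, value)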
How come the symbol \r makes pandas misbehave when reading a CSV file?
Example:
test = pd.DataFrame(columns = ['id','text'])
test.id = [1,2,3]
test.text = ['Foo\rBar','Bar\rFoo','Foo\r\r\nBar']
test.to_csv('temp.csv',index = False)
test2 = pd.read_csv('temp.csv')
Then the dataframes are as follow:
test:
id text
0 1 Foo\rBar
1 2 Bar\rFoo
2 3 Foo\r\r\nBar
test2:
id text
0 1 Foo
1 Bar NaN
2 2 Bar
3 Foo NaN
4 3 Foo\r\r\nBar
Note that adding a \n to the text prevents it from being split onto another line. Any idea what's going on, and how to prevent this behavior?
Note that it also prevents me from using pandas.to_pickle, as it corrupts the file, yielding a file that produces the following error:
Error! ..\my_pickle.pkl is not UTF-8 encoded
Saving disabled.
See Console for more details.
Try to add lineterminator and encoding parameters:
test = pd.DataFrame(columns = ['id', 'text'])
test.id = [1, 2, 3]
test.text = ['Foo\rBar', 'Bar\rFoo', 'Foo\r\r\nBar']
test.to_csv('temp.csv', index=False, line_terminator='\n', encoding='utf-8')
test2 = pd.read_csv('temp.csv', lineterminator='\n', encoding='utf-8')
test and test2:
id text
0 1 Foo\rBar
1 2 Bar\rFoo
2 3 Foo\r\r\nBar
It works fine for me, but maybe it's a Windows-only problem (I have a MacBook). Also check this issue.
In order to have valid CSV data, all fields containing a newline should be enclosed in double quotes.
The generated csv should look like this:
id text
1 "Foo\rBar"
2 "Bar\rFoo"
3 "Foo\r\r\nBar"
or:
id text
1 "Foo
Bar"
2 "Bar
Foo"
3 "Foo
Bar"
If the reader only treats \n as a newline this will do:
id text
1 Foo\rBar
2 Bar\rFoo
3 "Foo\r\r\nBar"
To read the csv data make sure to tell the reader to parse the fields as quoted (which could be the default).
The parser might try to autodetect the type of newline in your file (could be \n, \r\n or even \r) and maybe that's why you could have unexpected results if there are combinations of \r and \n in unquoted fields.
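As a minimal sketch of that fix using pandas' quoting parameter (whether the reader round-trips an embedded \r exactly may still depend on your pandas version):

import csv
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3],
                     'text': ['Foo\rBar', 'Bar\rFoo', 'Foo\r\r\nBar']})

# Quote every non-numeric field so embedded \r and \n stay inside
# the cell instead of being taken as row breaks.
test.to_csv('temp.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)

test2 = pd.read_csv('temp.csv')  # quoted fields come back as single cells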
I am trying to parse a text document line by line and in doing so I stumbled onto some weird behavior which I believe is caused by the presence of some kind of ankh symbol (☥). I am not able to copy the real symbol here.
In my code I try to determine whether a '+' symbol is present in the first characters of each line. To see if this worked I added a print statement containing a boolean and this string.
The relevant part of my code:
with open(file_path) as input_file:
    content = input_file.readlines()

for line in content:
    plus = '+' in line[0:2]
    print('Plus: {0}, line: {1}'.format(plus, line))
A file I could try to parse:
+------------------------------
row 1 with some content
+------+------+-------+-------
☥+------+------+-------+------
| col 1 | col 2 | col 3 ...
+------+------+-------+-------
|_ valu | val | | dsf |..
|_ valu | valu | ...
What I get as output:
Plus: True, line: +------------------------------
Plus: False, line: row 1 with some content
Plus: True, line: +------+------+-------+-------
♀+------+------+-------+------
Plus: False, line: | col 1 | col 2 | col 3 ...
Plus: True, line: +------+------+-------+-------
Plus: False, line: |_ valu | val | | dsf |..
Plus: False, line: |_ valu | valu | ...
So my question is: why does it print the line containing the symbol without the 'Plus: True/False' prefix? How should I solve this?
Thanks.
What you are seeing is the gender symbol. It is from the original IBM PC character set and is encoded as 0x0c, aka FormFeed, aka Ctrl-L.
If you are parsing text data with these present, they likely were inserted to indicate to a printer to start a new page.
From wikipedia:
Form feed is a page-breaking ASCII control character. It forces the printer to eject the current page and to continue printing at the top of another. Often, it will also cause a carriage return. The form feed character code is defined as 12 (0xC in hexadecimal), and may be represented as control+L or ^L.
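If you just want the check in the question to ignore these characters, a minimal sketch (based on the code in the question) is to strip the form feed from each line first:

with open(file_path) as input_file:
    for line in input_file:
        line = line.replace('\x0c', '')  # drop any form feed (Ctrl-L) characters
        plus = '+' in line[0:2]
        print('Plus: {0}, line: {1}'.format(plus, line))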
I'm new to Python, so I'm really struggling to write a script.
So, what I need is to compare two files. One file contains all the proteins of some database; the other contains only some of those proteins, the ones present in one organism. I need to know which proteins of the database are present in my organism. For that I want to build an output like a matrix, with 0 and 1 marking, for every protein in the database, whether or not it is in my organism.
Does anybody have any idea of how I could do that?
I'm trying to use something like this:
$ cat sorted.a
A
B
C
D
$ cat sorted.b
A
D
$ join sorted.a sorted.b | sed 's/^/1 /' && join -v 1 sorted.a sorted.b | sed 's/^/0 /'
1 A
1 D
0 B
0 C
But I'm not able to use it, because sometimes a protein is present but not on the same line. Here is an example:
1-cysPrx_C
120_Rick_ant
14-03-2003
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2-ph_phosp
2CSK_N
2C_adapt
2Fe-2S_Ferredox
2H-phosphodiest
2HCT
2OG-FeII_Oxy
Comparing with
1-cysPrx_C
14-3-3
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2H-phosphodiest
2OG-FeII_Oxy
2OG-FeII_Oxy_3
2OG-FeII_Oxy_4
2OG-FeII_Oxy_5
2OG-Fe_Oxy_2
2TM
2_5_RNA_ligase2
Does anyone have an idea of how I could do that?
Thanks so far.
The fastest way in Python would be to read your organism file and save each protein name into a set. Then open and iterate through your all-proteins file; for each name, check if it is present in your organism set and print a 0 or 1 accordingly, followed by the name.
Example code if your organism list is called 'prot_list':
with open(all_proteins_file) as f:
    for line in f:
        prot = line.strip()
        if prot in prot_list:
            num = 1
        else:
            num = 0
        print('%i %s' % (num, prot))
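For the first step, a minimal sketch of building prot_list as a set (the organism file name here is an assumption):

with open('organism_proteins.txt') as f:
    prot_list = {line.strip() for line in f}  # a set gives O(1) membership tests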
I am new to database handling with Python.
Using Python, I want to read a raw text file consisting of STUDENT_NAME and STUDENT_MARKS values separated by pipe symbols (example below). I want to push this data into a student table consisting of 2 columns (STUDENT_NAME, STUDENT_MARKS) with the respective data values.
The input data file will look like this (it consists of some thousands of records like this). My input file is a .dat file; it starts with the keyword 'records', each line contains zero or more records (there is no fixed count of records per line), and no other keyword appears anywhere else:
records STUDENT_NAME| jack | STUDENT_MARKS|200| STUDENT_NAME| clark
|STUDENT_MARKS|200| STUDENT_NAME| Ajkir | STUDENT_MARKS|30|
STUDENT_NAME| Aqqm | STUDENT_MARKS|200| STUDENT_NAME| jone |
STUDENT_MARKS|200| STUDENT_NAME| jake | STUDENT_MARKS|100|
Output MySQL table:
STUDENT_NAME| STUDENT_MARKS
jack | 200
clark | 200
.......
Please advise me on how to read the file and push the data in an efficient way. I would be so grateful if someone could give me a script to achieve this.
# import mysql module
import MySQLdb
# import regular expression module
import re

# set file name & location (note we need to create a temporary file because
# the original one is messed up)
original_fyle = open('/some/directory/some/file.csv', 'r')
ready_fyle = open('/some/directory/some/ready_file.csv', 'w')

# initialize & establish connection
con = MySQLdb.connect(host="localhost", user="username", passwd="password", db="database_name")
cur = con.cursor()

# prepare your ready file (the pipe has to be escaped in a regex,
# or 'STUDENT_NAME|' would also match the empty string everywhere)
for line in original_fyle:
    # substitute useless information; this also creates some formatting for
    # the actual loading into mysql
    line = re.sub(r'STUDENT_NAME\|', '\n', line)
    line = re.sub(r'STUDENT_MARKS\|', '', line)
    ready_fyle.write(line)

# close files so everything is flushed to disk before loading
original_fyle.close()
ready_fyle.close()

# create a query
query = 'load data local infile "/some/directory/some/ready_file.csv" into table table_name fields terminated by "|" lines terminated by "\n"'
# run it
cur.execute(query)
# commit just in case
con.commit()
In the spirit of being kind to newcomers, some code to get you started:
# assuming your data is exactly as in the original question
data = '''records STUDENT_NAME| jack | STUDENT_MARKS|200| STUDENT_NAME| clark |STUDENT_MARKS|200| STUDENT_NAME| Ajkir | STUDENT_MARKS|30| STUDENT_NAME| Aqqm | STUDENT_MARKS|200| STUDENT_NAME| jone | STUDENT_MARKS|200| STUDENT_NAME| jake | STUDENT_MARKS|100|'''
data = data.split('|')

# after the split, names sit at positions 1, 5, 9, ... and each mark two positions later
for idx in range(1, len(data), 4):
    name = data[idx].strip()         # need to add code to check for duplicate names
    mark = int(data[idx+2].strip())  # this will crash if it's not a number
    print(name, mark)                # use these values to add to the database
You may want to play with SQLite using this tutorial to learn how to use such databases with Python.
And this tutorial about file input may be useful.
You may want to start with this and then come back with some code.
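If you do try SQLite, here is a minimal sketch combining it with the parsing above (the database file and table names are assumptions):

import sqlite3

data = '''records STUDENT_NAME| jack | STUDENT_MARKS|200| STUDENT_NAME| clark |STUDENT_MARKS|200|'''.split('|')

con = sqlite3.connect('students.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS student '
            '(STUDENT_NAME TEXT, STUDENT_MARKS INTEGER)')

# same position-based walk as in the snippet above
for idx in range(1, len(data), 4):
    cur.execute('INSERT INTO student VALUES (?, ?)',
                (data[idx].strip(), int(data[idx + 2].strip())))

con.commit()
con.close()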