I have data twitter in a CSV file (that I'm mining with a Python API). I get around 1000 lines of data. Now I want to shorten the tweet data using the specific Indonesian words “macet” or “kecelakaan” (in English “traffic” or “accident”) and put the matching rows into a new separate CSV file, just like in Excel using find all.
The sample data twitter is example1.csv and the new file which will be created after the search of the word "macet" or "kecelakaan" is example2.csv. But there is no result.
import re
import csv
with open('example1.csv', 'r') as csvFile:
reader = csv.reader(csvFile)
if re.search(r'macet', reader):
for row in reader:
myData = list(row)
print(row)
newFile = open('example2.csv', 'w')
with newFile:
writer = csv.writer(newFile)
writer.writerows(myData)
print("Writing complete")
I use spyder for environment Python 3.6.
The CSV file is already in the same folder with Spyder. Here is the screen capture image of my CSV twitter data
myCSVtwitterData
updated : Sample of csv file. OS using : Windows
There are a couple of problems with your code.
In your reading loop you are passing a csv.reader object to re.search, but it doesn't know how to search that object. You need to pass it text or byte strings.
The line
myData = list(row)
converts row into a new list and saves it to myData, but it's already a list, so no conversion is necessary. And that line replaces the previous contents of myData, but you actually want to save all the matching rows. However, there's no need to save the rows, you can just write them to the new file as you go.
Anyway, here's a repaired version of your code. From the screen shot it looks like you only want to search the text in column 2 of the input data (which corresponds to column C in your spreadsheet). I've created a regex that searches for the whole words "macet" and "kecelakaan", the "\b" matches at word boundaries so we don't get a match if "macet" or "kecelakaan" is part of a larger word.
import re
import csv
# Make a case-insensitive regex to match the words "macet" or "kecelakaan"
pattern = re.compile(r'\bmacet\b|\bkecelakaan\b', re.I)
with open('example1.csv', 'r', newline='') as csvFile, open('example2.csv', 'w', newline='') as newFile:
reader = csv.reader(csvFile)
writer = csv.writer(newFile)
for row in reader:
# Skip empty rows
if not row:
continue
if pattern.search(row[2]):
print(row)
writer.writerow(row)
print("Writing complete")
I've just made a couple of improvements to that code. It now uses the newline='' arg to open the CSV files, and it skips any empty lines in the input CSV. And the regex now ignores the case when looking for matching words.
Not answering about Python. But if you have a Linux OS, you can do it in one command line :
grep -i "macet" exemple1.csv > exemple2.csv
-i is for ignore case, so it will also match "Macet"
how is it~?
this code visit rows one by one
and find cells that contain a word in word_list
and write the value list on the row
import re
import csv
word_list = ['macet', 'kecelakaan']
with open('example1.csv', 'r') as csvFile, open('example2.csv', 'w') as newFile:
reader = csv.reader(csvFile)
writer = csv.writer(newFile, lineterminator='\n')
for row in reader:
new_row = [content for content in row if any(map(lambda word: word in content, word_list))]
if(new_row != []):
print(new_row)
writer.writerow(new_row)
print("Writing complete")
I am writing a script to write a list with tab separated as below to a csv file. But i am not getting proper output on this.
out_l = ['host\tuptime\tnfsserver\tnfs status\n', 'node1\t2\tnfs_host\tok\n', 'node2\t100\tnfs_host\tna\n', 'node3\t59\tnfs_host\tok\n']
code:
out_f = open('test.csv', 'w')
w = csv.writer(out_f)
for l in out_l:
w.writerow(l)
out_f.close()
The output csv file reads as below.
h,o,s,t, ,s,s,h, , , , , ,s,u,d,o,_,h,o,s,t, , , , , , , ,n,f,s,"
"1,9,2,.,1,6,8,.,1,2,2,.,2,0,1, ,o,k, ,n,f,s,h,o,s,t, ,o,k,"
"1,9,2,.,1,6,8,.,1,2,2,.,2,0,2, ,f,a,i,l,e,d, ,n,a, ,n,a,"
"1,9,2,.,1,6,8,.,1,2,2,.,2,0,3, ,o,k, ,n,f,s,h,o,s,t, ,s,h,o,w,m,o,u,n,t, ,f,a,i,l,e,d,"
"
Also I have checked the csv.writer option like delimiter, dialect=excel, but no luck.
Can some one help to format the output?
With the formatting you have in out_l, you can just write it to a file:
out_l = ['host\tuptime\tnfsserver\tnfs status\n', 'node1\t2\tnfs_host\tok\n', 'node2\t100\tnfs_host\tna\n', 'node3\t59\tnfs_host\tok\n']
with open('test.csv', 'w') as out_f:
for l in out_l:
out_f.write(l)
To properly use csv, out_l should just be lists of the columns and let the csv module do the formatting with tabs and newlines:
import csv
out_l = [['host','uptime','nfsserver','nfs status'],
['node1','2','nfs_host','ok'],
['node2','100','nfs_host','na'],
['node3','59','nfs_host','ok']]
#with open('test.csv', 'wb') as out_f: # Python 2
with open('test.csv', 'w', newline='') as out_f: # Python 3
w = csv.writer(out_f, delimiter='\t') # override for tab delimiter
w.writerows(out_l) # writerows (plural) doesn't need for loop
Note that with will automatically close the file.
See the csv documentation for the correct way to open a file for use with csv.reader or csv.writer.
The csv.Writer.writerow method takes an iterable and writes the values said iterable produces into the csv fields separated by the specified delimeter:
out_f = open('test.csv', 'w')
w = csv.writer(out_f, delimiter='\t') # set tab as delimiter
for l in out_l: # l is string (iterable of chars!)
w.writerow(l.split('\t')) # split to get the correct tokens
out_f.close()
As the strings in your list already contain the necessary tabs, you could just write them directly to the file, no csv tools needed. If you have built/joined the strings in out_l manually, you can omit that step and just pass the original data structure to writerow.
The delimiter parameter
The delimiter parameter controls the delimiter in the output. It has nothing to do with the input out_l.
Why your output is garbled
csv.writer.writerow iterates the input. In your case you are giving it a string (host\tuptime\tnfsserver\tnfs status\n', etc.), therefore the function iterates the string, giving you a sequence of chars.
How to produce the correct output
Give it a list of fields instead of the full string by using str.split(). In your case the string ends with \n, so use str.strip() as well:
import csv
out_l = ['host\tuptime\tnfsserver\tnfs status\n',
'node1\t2\tnfs_host\tok\n',
'node2\t100\tnfs_host\tna\n',
'node3\t59\tnfs_host\tok\n']
out_f = open('test.csv', 'w')
w = csv.writer(out_f)
for l in out_l:
w.writerow(l.strip().split('\t'))
out_f.close()
This should be what you want:
host,uptime,nfsserver,nfs status
node1,2,nfs_host,ok
node2,100,nfs_host,na
node3,59,nfs_host,ok
Reference: https://docs.python.org/3/library/csv.html
Very simple:
with open("test.csv" , 'w') as csv_file:
writer = csv.writer(csv_file, delemeter='\t')
for item in out_l:
writer.writerow([item,])
I had exported a csv from Nokia Suite.
"sms","SENT","","+12345678901","","2015.01.07 23:06","","Text"
Reading from the PythonDoc, I tried
import csv
with open(sourcefile,'r', encoding = 'utf8') as f:
reader = csv.reader(f, delimiter = ',')
for line in reader:
# write entire csv row
with open(filename,'a', encoding = 'utf8', newline='') as t:
a = csv.writer(t, delimiter = ',')
a.writerows(line)
It didn't work, until I put brackets around 'line' as so i.e. [line].
So at the last part I had
a.writerows([line])
Why is that so?
The writerows method accepts a container object. The line object isn't a container. [line] turns it into a list with one item in it.
What you probably want to use instead is writerow.
i just wondering how i can read special field from a CVS File with next structure:
40.0070222,116.2968604,2008-10-28,[["route"], ["sublocality","political"]]
39.9759505,116.3272935,2008-10-29,[["route"], ["establishment"], ["sublocality", "political"]]
the way that on reading cvs files i used to work with:
with open('routes/stayedStoppoints', 'rb') as csvfile:
spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
The problem with that is the first 3 fields no problem i can use:
for row in spamreader:
row[0],row[1],row[2] i can access without problem. but in the last field and i guess that with csv.reader(csvfile, delimiter=',', quotechar='"') split also for each sub-list:
so when i tried to access just show me:
[["route"]
Anyone has a solution to handle the last field has a full list ( list of list indeed)
[["route"], ["sublocality","political"]]
in order to can access to each category.
Thanks
Your format is close to json. You only need to wrap each line in brackets, and to quote the dates.
For each line l just do:
lst=json.loads(re.sub('([0-9]+-[0-9]+-[0-9]+)',r'"\1"','[%s]'%(l)))
results in lst being
[40.0070222, 116.2968604, u'2008-10-28', [[u'route'], [u'sublocality', u'political']]]
You need to import the json parser and regular expressions
import json
import re
edit: you asked how to access the element containing 'route'. the answer is
lst[3][0][0]
'political' is at
lst[3][1][1]
If the strings ('political' and others) may contain strings looking like dates, you should go with the solution by #unutbu
Use line.split(',', 3) to split on just the first 3 commas:
import json
with open(filename, 'rb') as csvfile:
for line in csvfile:
row = line.split(',', 3)
row[3] = json.loads(row[3])
print(row)
yields
['40.0070222', '116.2968604', '2008-10-28', [[u'route'], [u'sublocality', u'political']]]
['39.9759505', '116.3272935', '2008-10-29', [[u'route'], [u'establishment'], [u'sublocality', u'political']]]
That is not a valid CSV file. The csv module won't be able to read this.
If the line structure is always like this (two numbers, a date, and a nested list), you can do this:
import ast
result = []
with open('routes/stayedStoppoints') as infile:
for line in infile:
coord_x, coord_y, datestr, objstr = line.split(",", 3)
result.append([float(coord_x), float(coord_y),
datestr, ast.literal_eval(objstr)])
Result:
>>> result
[[40.0070222, 116.2968604, '2008-10-28', [['route'], ['sublocality', 'political']]],
[39.9759505, 116.3272935, '2008-10-29', [['route'], ['establishment'], ['sublocality', 'political']]]]