Separating List Using a delimiter in Python - python

UPDATED(I'm sorry, it's my first question)
I'm an intern and really new into coding.
In my job, I have to read a file from Azure storage and then insert this data into a database.
To do this, I'm using get_file_to_text().content and storing its value in a variable file as follows:
file = file_service.get_file_to_text('teste','','Retorno.csv').content
and then, I'm using .splitlines() like this:
formFile.append(file.splitlines())
I expected a result like this(each line of my file being a sublist):
[['2017-08-01', 'Zabbix server Sura', 'system.cpu.load[percpu,avg5]', '0.2900', '0.05217361111111111111', '0.1'], ['2017-08-01', 'Zabbix server Sura', 'system.cpu.util[,iowait]' ... ]
But I've got this(One big sublist with all the lines inside):
[['2017-08-01;Zabbix server Sura;system.cpu.load[percpu,avg5];0.2900;0.05217361111111111111;0.1', '2017-08-01;Zabbix server Sura;system.cpu.util[,iowait]; ... ']]
I also tried a .split(';'):
file2 = file.split(';')
But it returns me a list with the values only:
['2017-08-01', 'Zabbix server Sura', 'system.cpu.load[percpu,avg5]', '0.2900', '0.05217361111111111111', '0.1\n2017-08-01', 'Zabbix server Sura', 'system.cpu.util[,iowait]', ...]
What can I do to get the result I expect?
Thanks!
UPDATE (RESOLVED):
I did this and it worked fine.
import csv
from io import StringIO
data = []
azurestorage_text = file_service.get_file_to_text('teste', '', 'Retorno.csv').content
with StringIO(azurestorage_text) as file_obj:
    reader = csv.reader(file_obj, delimiter=';')
    header = next(reader)  # read the first row separately so it is not added to data
    for line in reader:
        data.append(line)

.splitlines() will split the lines in the text input, returning a list of whole lines. In order to parse that into fields (bits between semicolons) you would need to then .split(';') each line, e.g.
lines = text.splitlines()
rows = []
for line in lines:
    rows.append(line.split(';'))
However if you want to split semicolon-separated text like this you should be using csv.reader to parse the data. It is more robust at handling CSV formats, including for example "quoted text". Splitting on semicolons will break if any of the fields in the data have semicolons within them, e.g. "semicolons in quoted; text".
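To make the difference concrete, here is a minimal Python 3 sketch. The sample line is made up for illustration (it is not from the original file): its second field is quoted and contains a semicolon.

```python
import csv
from io import StringIO

# Hypothetical sample line: the second field is quoted and contains a semicolon.
sample = '2017-08-01;"Zabbix server; Sura";0.2900\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive = sample.rstrip('\n').split(';')

# csv.reader honours the quotes and returns the field intact.
parsed = next(csv.reader(StringIO(sample), delimiter=';'))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces, while csv.reader yields the three intended fields.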
csv.reader requires a file-like object to operate on, rather than a string. To pass in a string, you can use StringIO to create a file-like interface to it:
For Python2:
from cStringIO import StringIO
import csv
text = file_service.get_file_to_text('teste','','Retorno.csv').content
file_obj = StringIO(text)
reader = csv.reader(file_obj, delimiter=';')
for row in reader:
    print(row)
For Python3:
from io import StringIO
import csv
text = file_service.get_file_to_text('teste','','Retorno.csv').content
file_obj = StringIO(text)
reader = csv.reader(file_obj, delimiter=';')
for row in reader:
    print(row)
Each row will contain a single line from your file, split into fields on the semicolons (specified by delimiter).
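If the goal is the list of sublists shown in the question, the reader can be passed straight to list(). The following is a minimal Python 3 sketch that runs without Azure: the first line of text is copied from the question, the second is a made-up placeholder, and the variable name rows is an assumption.

```python
import csv
from io import StringIO

# Stand-in for the downloaded file content. The first line comes from the
# question; the second line is a hypothetical placeholder.
text = (
    "2017-08-01;Zabbix server Sura;system.cpu.load[percpu,avg5];0.2900;0.05217361111111111111;0.1\n"
    "2017-08-02;Example host;example.key;1.0;2.0;3.0\n"
)

# list() consumes the reader and gives one sublist per line of the file.
rows = list(csv.reader(StringIO(text), delimiter=';'))

for row in rows:
    print(row)
```

Because the delimiter is a semicolon, the comma inside system.cpu.load[percpu,avg5] stays part of its field.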

Related

Same python code block gives different outputs at different time

I want to create a word dictionary. The dictionary looks like
words_meanings= {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
keys_letter=[]
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Output: rekindle , pesky, verge, maneuver, accountability
Here rekindle, pesky, verge, maneuver and accountability are the keys, and relight, annoying, border, activity and responsibility are the values.
Now I want to create a CSV file and have my code take its input from that file.
The file looks like
rekindle | pesky | verge | maneuver | accountability
relight | annoying| border| activity | responsibility
So far I use this code to load the file and read data from it.
from google.colab import files
uploaded = files.upload()
import pandas as pd
data = pd.read_csv("words.csv")
data.head()
import csv
reader = csv.DictReader(open("words.csv", 'r'))
words_meanings = []
for line in reader:
    words_meanings.append(line)
print(words_meanings)
This is the output of print(words_meanings)
[OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
It looks very odd to me.
keys_letter=[]
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
Here I create an empty list and want to append only the keys. But the output is [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
I am confused. In the first code block the list contained only the keys, but now it contains both the keys and their values. How can I fix this?
I would suggest that you format your csv with your key and value on the same row. Like this
rekindle,relight
pesky,annoying
verge,border
This way the following code will work.
words_meanings = {}
with open(file_name, 'r') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
If you want a list of the keys:
list_of_keys = list(words_meanings.keys())
To add keys and values to the file:
def add_values(key: str, value: str, file_name: str):
    with open(file_name, 'a') as file:
        file.write(f"\n{key},{value}")
key = input("Input the key you want to save: ")
value = input(f"Input the value you want to save to {key}:")
add_values(key, value, file_name)
You run the same block of code, but on different objects, and that gives different results.
First you use a normal dictionary (check type(words_meanings))
words_meanings = {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
and the for-loop gives you the keys of this dictionary.
You could get the same with
keys_letter = list(words_meanings.keys())
or even
keys_letter = list(words_meanings)
Later you use a list with a single dictionary inside it (check type(words_meanings))
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
and the for-loop gives you the elements of this list, not the keys of the dictionary inside it. So you just move the whole dictionary from one list to another.
You could get the same with
keys_letter = words_meanings.copy()
or even with
keys_letter = list(words_meanings)
from collections import OrderedDict
words_meanings = {
"rekindle": "relight",
"pesky":"annoying",
"verge": "border",
"maneuver": "activity",
"accountability":"responsibility",
}
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = list(words_meanings.keys())
keys_letter = list(words_meanings)
print(keys_letter)
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]
print(type(words_meanings))
keys_letter = []
for x in words_meanings:
    keys_letter.append(x)
print(keys_letter)
#keys_letter = words_meanings.copy()
keys_letter = list(words_meanings)
print(keys_letter)
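To collect only the keys from the second structure, one option is to loop over the rows in the list first and then over the keys of each row. This is a minimal sketch using the exact value printed in the question; it is one possible approach, not the only one.

```python
from collections import OrderedDict

# The structure csv.DictReader produced in the question:
# a list holding one dictionary per data row.
words_meanings = [OrderedDict([('\ufeffrekindle', 'relight'), ('pesky', 'annoying')])]

# Outer loop: rows of the list. Inner loop: keys of each row's dictionary.
keys_letter = [key for row in words_meanings for key in row]

print(keys_letter)
```

The first key still carries the '\ufeff' prefix, because that comes from how the file was decoded rather than from the loop.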
The default field separator for the csv module is a comma. Your CSV file uses the pipe or bar symbol |, and the fields also seem to be fixed width. So, you need to specify | as the delimiter to use when creating the CSV reader.
Also, your CSV file is encoded as Big-endian UTF-16 Unicode text (UTF-16-BE). The file contains a byte-order-mark (BOM) but Python is not stripping it off, so you will notice the string '\ufeffrekindle' contains the FEFF UTF-16-BE BOM. That can be dealt with by specifying encoding='utf16' when you open the file.
import csv
with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(f, delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
Running this on your CSV file produces this:
{'rekindle ': 'relight ', 'pesky ': 'annoying', 'verge ': 'border', 'maneuver ': 'activity ', 'accountability': 'responsibility'}
Notice that there is trailing whitespace in the key and values. skipinitialspace=True removed the leading whitespace, but there is no option to remove the trailing whitespace. That can be fixed by exporting the CSV file from Excel without specifying a field width. If that can't be done, then it can be fixed by preprocessing the file using a generator:
import csv
def preprocess_csv(f, delimiter=','):
    # assumes that fields can not contain embedded new lines
    for line in f:
        yield delimiter.join(field.strip() for field in line.split(delimiter))
with open('words.csv', newline='', encoding='utf-16') as f:
    reader = csv.DictReader(preprocess_csv(f, '|'), delimiter='|', skipinitialspace=True)
    for row in reader:
        print(row)
which now outputs the stripped keys and values:
{'rekindle': 'relight', 'pesky': 'annoying', 'verge': 'border', 'maneuver': 'activity', 'accountability': 'responsibility'}
Since no one was able to help me with the answer, I am finally posting it here. I hope this will help others.
import csv
file_name="words.csv"
words_meanings = {}
with open(file_name, newline='', encoding='utf-8-sig') as file:
    for line in file.readlines():
        key, value = line.split(",")
        words_meanings[key] = value.rstrip("\n")
print(words_meanings)
This is the code to transfer a csv to a dictionary. Enjoy!!!
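The encoding='utf-8-sig' argument in the code above is what removes the '\ufeff' prefix seen earlier in the thread. Here is a minimal sketch of the difference, assuming the file is UTF-8 with a byte-order mark; the sample bytes are made up for illustration.

```python
# Hypothetical file content: UTF-8 text preceded by the BOM bytes EF BB BF.
raw = b'\xef\xbb\xbfrekindle,relight\n'

# The plain utf-8 codec keeps the BOM as the character U+FEFF.
plain = raw.decode('utf-8')

# The utf-8-sig codec drops a leading BOM if one is present.
stripped = raw.decode('utf-8-sig')

print(repr(plain))
print(repr(stripped))
```

The same codec names can be passed to open() through its encoding parameter.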

Python how to get the tweet data using specific word in csv file and put it in new csv file

I have Twitter data in a CSV file (which I mined with a Python API), about 1000 lines in total. Now I want to filter the tweets by the specific Indonesian words “macet” or “kecelakaan” (in English “traffic” or “accident”) and put the matching rows into a new, separate CSV file, just like using Find All in Excel.
The sample Twitter data is in example1.csv, and the new file to be created after searching for the word "macet" or "kecelakaan" is example2.csv. But I get no result.
import re
import csv
with open('example1.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    if re.search(r'macet', reader):
        for row in reader:
            myData = list(row)
            print(row)
newFile = open('example2.csv', 'w')
with newFile:
    writer = csv.writer(newFile)
    writer.writerows(myData)
print("Writing complete")
I use Spyder as my environment, with Python 3.6.
The CSV file is already in the same folder as Spyder. Here is a screen capture of my CSV Twitter data:
myCSVtwitterData
Update: sample of the CSV file. OS: Windows.
There are a couple of problems with your code.
In your reading loop you are passing a csv.reader object to re.search, but it doesn't know how to search that object. You need to pass it text or byte strings.
The line
myData = list(row)
converts row into a new list and saves it to myData, but it's already a list, so no conversion is necessary. And that line replaces the previous contents of myData, but you actually want to save all the matching rows. However, there's no need to save the rows, you can just write them to the new file as you go.
Anyway, here's a repaired version of your code. From the screen shot it looks like you only want to search the text in column 2 of the input data (which corresponds to column C in your spreadsheet). I've created a regex that searches for the whole words "macet" and "kecelakaan", the "\b" matches at word boundaries so we don't get a match if "macet" or "kecelakaan" is part of a larger word.
import re
import csv
# Make a case-insensitive regex to match the words "macet" or "kecelakaan"
pattern = re.compile(r'\bmacet\b|\bkecelakaan\b', re.I)
with open('example1.csv', 'r', newline='') as csvFile, open('example2.csv', 'w', newline='') as newFile:
    reader = csv.reader(csvFile)
    writer = csv.writer(newFile)
    for row in reader:
        # Skip empty rows
        if not row:
            continue
        if pattern.search(row[2]):
            print(row)
            writer.writerow(row)
print("Writing complete")
I've just made a couple of improvements to that code. It now uses the newline='' arg to open the CSV files, and it skips any empty lines in the input CSV. And the regex now ignores the case when looking for matching words.
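The effect of the "\b" word boundaries and the re.I flag can be checked in isolation. This is a small sketch with made-up tweet texts (they are not from the original data):

```python
import re

# Same pattern as in the answer: whole words only, case-insensitive.
pattern = re.compile(r'\bmacet\b|\bkecelakaan\b', re.I)

# Hypothetical sample texts for illustration.
samples = [
    "Jalan MACET total",      # whole word, different case
    "kemacetan parah",        # "macet" only appears inside a longer word
    "Ada kecelakaan di tol",  # whole word
]

matches = [bool(pattern.search(text)) for text in samples]
print(matches)  # [True, False, True]
```

The second text is rejected because there is no word boundary on either side of "macet" inside "kemacetan".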
This does not answer the Python question, but if you have a Linux OS, you can do it in one command line:
grep -i "macet" example1.csv > example2.csv
-i ignores case, so it will also match "Macet".
How about this?
This code visits the rows one by one,
finds the cells that contain a word in word_list,
and writes those cells to the output row.
import re
import csv
word_list = ['macet', 'kecelakaan']
with open('example1.csv', 'r') as csvFile, open('example2.csv', 'w') as newFile:
    reader = csv.reader(csvFile)
    writer = csv.writer(newFile, lineterminator='\n')
    for row in reader:
        new_row = [content for content in row if any(map(lambda word: word in content, word_list))]
        if new_row:
            print(new_row)
            writer.writerow(new_row)
print("Writing complete")

Python: writing an entire row to a CSV file. Why does it work this way?

I had exported a csv from Nokia Suite.
"sms","SENT","","+12345678901","","2015.01.07 23:06","","Text"
Reading from the PythonDoc, I tried
import csv
with open(sourcefile,'r', encoding = 'utf8') as f:
    reader = csv.reader(f, delimiter = ',')
    for line in reader:
        # write entire csv row
        with open(filename,'a', encoding = 'utf8', newline='') as t:
            a = csv.writer(t, delimiter = ',')
            a.writerows(line)
It didn't work until I put brackets around 'line', i.e. [line].
So the last part became
a.writerows([line])
Why is that so?
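The writerows method expects an iterable of rows, where each row is itself an iterable of fields. The line object is a single row (a list of strings), so writerows(line) treats each string as a separate row and each character as a field. [line] wraps it in a list that contains exactly one row.
What you probably want to use instead is writerow.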
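The difference between the two methods can be seen in a small sketch that writes to in-memory buffers instead of files. The row is a shortened version of the one in the question, and the buffer names are assumptions for illustration.

```python
import csv
from io import StringIO

line = ["sms", "SENT", "Text"]  # shortened version of the row from the question

# writerows iterates over its argument and writes one row per element,
# so every string becomes its own row and every character a field.
buf_rows = StringIO()
csv.writer(buf_rows).writerows(line)

# writerow writes the whole list as a single row.
buf_row = StringIO()
csv.writer(buf_row).writerow(line)

print(buf_rows.getvalue().splitlines())
print(buf_row.getvalue().splitlines())
```

Calling writerows([line]) gives the same single row as writerow(line).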

Editing a code to create a filter based on a condition and then stripping the condition

SO,
I'm looking for some help changing a bit of code so that it includes an if statement: a filter should only be added if the line contains (BIPL), and (BIPL) should be stripped from the entry that goes into the filters...
1test,tester,testing (BIPL),no,yes
2test,tester,testing,no,yes
3data,datas,datatest (BIPL),yes,no
Current code...
with open('test.csv', 'rb') as old_csv:
    filters = {(row[0].lower(), row[1][:3].upper(), row[2].upper()) for row in csv.reader(old_csv, delimiter=',')}
Effectively the outcome would be as follows, just in a different format.
1test,TES,TESTING
3data,DAT,DATATEST
It should be a simple change but I can't figure it out
csv.reader can accept an iterator as its first argument (not just file handles). So you can define a generator which yields only those lines which contain '(BIPL)' and send that to csv.reader:
import csv
import re
def only_bipl(f):
    for line in f:
        if '(BIPL)' in line:
            yield re.sub(r'\s*\(BIPL\)', '', line)
with open('test.csv', 'rb') as old_csv:
    reader = csv.reader(only_bipl(old_csv), delimiter=',')
    filters = {(row[0].lower(), row[1][:3].upper(), row[2].upper()) for row in reader}
Note the above will yield any line that contains '(BIPL)' anywhere. A better, more targeted alternative would be to match only those lines which contain '(BIPL)' at the end of the third item. You can do that using an if-clause inside the set comprehension:
with open('test.csv', 'rb') as old_csv:
    reader = csv.reader(old_csv, delimiter=',')
    filters = {(row[0].lower(), row[1][:3].upper(), row[2][:-6].strip().upper())
               for row in reader
               if row[2].endswith('(BIPL)')}
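For readers on Python 3, here is a self-contained sketch of the same if-clause idea. It reads the three sample lines from the question out of an in-memory string rather than test.csv; the names sample and marker are assumptions introduced for illustration.

```python
import csv
from io import StringIO

# The three sample lines from the question, held in memory instead of a file.
sample = (
    "1test,tester,testing (BIPL),no,yes\n"
    "2test,tester,testing,no,yes\n"
    "3data,datas,datatest (BIPL),yes,no\n"
)
marker = '(BIPL)'

# Keep only rows whose third field ends with the marker, and cut the
# marker off (plus the space before it) while building the tuple.
filters = {
    (row[0].lower(), row[1][:3].upper(), row[2][:-len(marker)].strip().upper())
    for row in csv.reader(StringIO(sample), delimiter=',')
    if row[2].endswith(marker)
}

print(sorted(filters))
```

Using len(marker) instead of the literal 6 keeps the slice correct if the marker text ever changes.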

Python: Read fields of CSV File with a list of list

I'm wondering how I can read a particular field from a CSV file with the following structure:
40.0070222,116.2968604,2008-10-28,[["route"], ["sublocality","political"]]
39.9759505,116.3272935,2008-10-29,[["route"], ["establishment"], ["sublocality", "political"]]
The way I usually read CSV files is:
with open('routes/stayedStoppoints', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
The first three fields are no problem. Inside
for row in spamreader:
I can access row[0], row[1] and row[2] without problems. But for the last field, I guess that csv.reader(csvfile, delimiter=',', quotechar='"') also splits on the commas inside each sub-list,
so when I try to access it, it just shows me:
[["route"]
Does anyone have a solution for handling the last field as a full list (a list of lists, in fact)
[["route"], ["sublocality","political"]]
so that I can access each category?
Thanks
Your format is close to json. You only need to wrap each line in brackets, and to quote the dates.
For each line l just do:
lst=json.loads(re.sub('([0-9]+-[0-9]+-[0-9]+)',r'"\1"','[%s]'%(l)))
results in lst being
[40.0070222, 116.2968604, u'2008-10-28', [[u'route'], [u'sublocality', u'political']]]
You need to import the json parser and regular expressions
import json
import re
edit: you asked how to access the element containing 'route'. the answer is
lst[3][0][0]
'political' is at
lst[3][1][1]
If the strings ('political' and others) may contain text that looks like a date, you should go with the solution by @unutbu
Use line.split(',', 3) to split on just the first 3 commas:
import json
with open(filename, 'rb') as csvfile:
    for line in csvfile:
        row = line.split(',', 3)
        row[3] = json.loads(row[3])
        print(row)
yields
['40.0070222', '116.2968604', '2008-10-28', [[u'route'], [u'sublocality', u'political']]]
['39.9759505', '116.3272935', '2008-10-29', [[u'route'], [u'establishment'], [u'sublocality', u'political']]]
That is not a valid CSV file. The csv module won't be able to read this.
If the line structure is always like this (two numbers, a date, and a nested list), you can do this:
import ast
result = []
with open('routes/stayedStoppoints') as infile:
    for line in infile:
        coord_x, coord_y, datestr, objstr = line.split(",", 3)
        result.append([float(coord_x), float(coord_y),
                       datestr, ast.literal_eval(objstr)])
Result:
>>> result
[[40.0070222, 116.2968604, '2008-10-28', [['route'], ['sublocality', 'political']]],
[39.9759505, 116.3272935, '2008-10-29', [['route'], ['establishment'], ['sublocality', 'political']]]]
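Once the last field has been parsed, each category can be reached by ordinary list indexing. The following minimal sketch parses the first sample line from the question directly from a string; the names categories and flat are assumptions for illustration.

```python
import ast

# First sample line from the question.
line = '40.0070222,116.2968604,2008-10-28,[["route"], ["sublocality","political"]]\n'

# Split on the first three commas only, so the nested list stays in one piece.
coord_x, coord_y, datestr, objstr = line.split(",", 3)

# Turn the textual list of lists into real Python lists.
categories = ast.literal_eval(objstr.strip())

# Flatten the groups into a single list of category names.
flat = [name for group in categories for name in group]

print(categories[0][0])  # route
print(flat)              # ['route', 'sublocality', 'political']
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file.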
