How to separate comma-separated data from a CSV file? - python

I have opened a CSV file and I want to split each comma-separated string on a line into separate string variables.
For example, given the file:
name,sal,dept
tom,10000,it
the output should be each field in its own string variable.
The file is already open, so I cannot use the "open" API; I have to use "csv.reader", which reads one line at a time.

If the file open for reading is bound to a variable, say fin, and assuming you're using Python 2.6 and you know the file's not empty (it has at least the row with headers):
import csv

rd = csv.reader(fin)
headers = next(rd)
for data in rd:
    ...process data and headers...
In Python 2.5, use headers = rd.next() instead of headers = next(rd).
This version uses a list of fields, data, which is a completely general solution (i.e., you don't need to know in advance how many columns the file has: you access them as data[0], data[1], etc., and the current row has len(data) fields on each iteration of the loop).
If you know the file has exactly three columns and prefer to use separate names for a variable per column, change the loop header to:
for name, sales, department in rd:
The fields returned by the reader (just like the headers) are all strings. If you know, for example, that the second column is an int and want to treat it as such, start the loop with
for data in rd:
    data[1] = int(data[1])
or, if you're using the named-variables variant:
for name, sales, department in rd:
    sales = int(sales)

I don't know if I have properly understood your question. You may want to have a look at the split function described in the Python documentation anyway.
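For instance, a minimal sketch of str.split on one line (assuming no quoted fields containing embedded commas, which csv.reader would handle properly):
line = "tom,10000,it"
name, sal, dept = line.split(",")  # three separate string variables, one per field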

Related

Storing DataFrame output utilizing Pandas to either csv or MySql DB

A question regarding pandas:
Say I created a DataFrame and generated output under separate variables rather than printing them. How would I go about combining them back into another DataFrame correctly, to either send as a CSV and then upload to a DB, or upload directly to a DB?
Everything works fine code-wise; I just haven't seen or don't know the best practice for this. I know we can store things in a list, dict, etc.
What I did was:
#imported all modules
object = df.iloc[0,0]
#For loop magic goes here
#nested for loop
#if conditions are met, do this
result = df.iloc[i, k+1]
print(object, result)
I've also stored them into a separate DataFrame trying:
df2 = pd.DataFrame({'object': object, 'result' : result}, index=[0])
df2.to_csv('output.csv', index=False, mode='a')
The only problem with that is that it appends everything to each row, most likely due to the append and perhaps because it isn't included in the for loop. Which is odd, because the raw output is EXACTLY how I'm trying to get it into a CSV or into a DB.
As I said, though, I'm looking to combine both values back into a DataFrame for speed. I tried concat etc., but no luck, so I was wondering what the correct format would be? Thanks
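One common pattern here (a sketch, not the asker's exact loop; the column names and the SQLAlchemy engine are assumptions) is to accumulate the pairs in a list inside the loop and build a single DataFrame at the end, rather than appending to the CSV row by row:
import pandas as pd

rows = []
# inside the nested loops, instead of print(object, result):
#     rows.append({'object': obj, 'result': result})
df2 = pd.DataFrame(rows, columns=['object', 'result'])
df2.to_csv('output.csv', index=False)  # a single write instead of per-row appends
# or, given a SQLAlchemy engine already created:
# df2.to_sql('results', engine, if_exists='append', index=False)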
So it turns out that after more research and revising, I solved my issue.
Referencing the question below and making personal revisions, this is the basis of what I did:
Empty space in between rows after using writer in python
import csv

# Had to wrap this in a for loop that is not listed, and append to the file
# while clearing it first, to remove the blank line after each row
with open('csvexample.csv', 'w+', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
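For what it's worth, the blank-row fix from the linked question comes down to opening the output file with newline='' in Python 3, e.g. (a minimal sketch with assumed column names):
import csv

with open('output.csv', 'w', newline='') as f:  # newline='' prevents the extra blank row on Windows
    writer = csv.writer(f)
    writer.writerow(['object', 'result'])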
Additional supporting material:
Confused by python file mode "w+"

Labelling and Grouping Postcodes using Python

I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example, "SA31" would be labelled a "HywelDDAPostcode".
I have some code where I read lots of postcodes from a single-column file into a list and compare them with postcodes that are in predefined lists. However, when I output my postcode labels, only the label "UKPostcodes" is output for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false, no matter what. Am I doing the right thing using "in"? Or perhaps it's a file-reading issue? I'm not sure.
The input file is simply a file which contains a list of postcodes ( in reality it has thousands of rows)
The CSV file
Here is my code:
import csv

with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

my_list = []
HywelDDAPostcodes = ["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
                     "SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes = ["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
                     "NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2 = list(dict.fromkeys(NationalPostcodes))
labels = ["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
    #print(postcode)
    if postcode in HywelDDAPostcodes:
        my_list.append(labels[0])
    if postcode in NationalPostcodes2:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])

with open('DiscretisedPostcodes.csv', 'w') as result_file:
    wr = csv.writer(result_file, dialect='excel')
    for item in my_list:
        wr.writerow([item,])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason why your comparison block isn't working is that when you use csv.reader to read your file, each line is added to your_list as a list. So you are making a list of lists, and when you compare those things it doesn't match:
['LL58'] == 'LL58'  # fails
So inspect your_list and see what I mean. You should create an empty your_list before you read the file and append each new reading to it. Then inspect that to make sure it looks good. It would also behoove you to use the strip() method to strip whitespace off each item; I can't recall if csv.reader does that automatically.
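A minimal sketch of that reading pattern (the strip() call is the precaution just mentioned):
import csv

your_list = []
with open('postcodes.csv', newline='') as f:
    for row in csv.reader(f):
        if row:  # skip blank lines
            your_list.append(row[0].strip())  # flatten: take the single column as a plain string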
Also... a better structure for testing for membership is to use sets instead of lists. in will work for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
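For instance, a sketch of the set-based membership test suggested above, reusing the question's labels and my_list (note the elif, which keeps a HywelDDA match from also being labelled UKPostcodes by the dangling else):
hywel_set = set(HywelDDAPostcodes)
national_set = set(NationalPostcodes)  # a {...} set literal works too

for postcode in your_list:
    if postcode in hywel_set:          # O(1) average lookup vs O(n) for a list
        my_list.append(labels[0])
    elif postcode in national_set:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])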
@Jeff H's answer is correct, but for what it's worth here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it will just add additional unnecessary overhead.

# Read the known data from files--this will always be more flexible and maintainable than
# hard-coding them in your code. This is just one possible scheme for doing this; e.g.
# you could also put all of them into a single JSON file.
standard_postcode_files = {
    'HywelDDA': 'hyweldda.csv',
    'NationalPostcodes': 'nationalpostcodes.csv',
    'UKPostcodes': 'ukpostcodes.csv'
}

def read_postcode_file(filename):
    with open(filename) as f:
        # exclude blank lines and strip additional whitespace
        return [line.strip() for line in f if line.strip()]

standard_postcodes = {}
for key, filename in standard_postcode_files.items():
    standard_postcodes[key] = set(read_postcode_file(filename))

# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}

your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]

with open('DiscretisedPostCodes.csv', 'w') as f:
    for label in labels:
        f.write(label + '\n')
I would probably do other things too, like not hard-coding the input filename. If you need to work with multiple columns, using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line I figured it was unnecessary.
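For example, a minimal sketch of taking the input filename from the command line instead (reusing the read_postcode_file helper above):
import sys

input_filename = sys.argv[1] if len(sys.argv) > 1 else 'postcodes.csv'
your_postcodes = read_postcode_file(input_filename)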

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to load the file and do some things to it (rather than having to write out the full file path and name each time), because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run (see the glob sketch after the code below).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
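(As a sketch of the several-hundred-file goal mentioned above: the filenames could be collected with glob and processed in a loop; the pattern below assumes the naming scheme from the download code.)
import glob

for filename in glob.glob('/data examples/ace-magnetometer-*.txt'):
    with open(filename, 'r') as f:
        ...  # process each file as discussed below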
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, has the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.

Parse Specific Text File to CSV Format with Headers

I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so performing the operation in Excel is not feasible.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these set to:
Sequence,Status,Report,Header,Profile
3433,true,223313,,xxxx
0323,true,43838,The,xxxx
5323,true,6541998,,xxxx
Meaning that I would need the header created from each name that appears before an equals "=" symbol. All of the other operations within the file are taken care of, and this will be used to perform a comparative check between files and replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I called the csv library to reduce the job of putting commas and quotes.
import csv
Then I made a function that takes a single line from your log file and outputs a dictionary with the fields passed in the header. If the current line doesn't have a particular field from the header, that field stays filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
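As a quick check (not from the original answer), calling the function on the first sample line with the header that gets extracted below:
header = ['Header', 'Profile', 'Report', 'Sequence', 'Status']
line = 'Sequence=3433;Status=true;Report=223313;Profile=xxxx;'
print(convert_to_dict(line, header))
# {'Header': '', 'Profile': 'xxxx', 'Report': '223313', 'Sequence': '3433', 'Status': 'true'}
# (key order may vary depending on the Python version)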
Since the header and the number of fields can vary between your files, I made a function to extract them. For this, I employed a set, a collection of unique elements that is also unordered, so I converted it to a list and used the sorted function. Don't forget that seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I made the main piece of code, which opens both the log file for reading and the csv file for writing. It extracts and writes the header, and then writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.

Parse csv to dict

I am trying to parse csv financial data from the web into a dict that I can navigate through by key.
I am failing using csv.DictReader.
I have:
import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.next()
csvio = (csv.DictReader(response))
print csvio.fieldnames
print len(csvio.fieldnames)
Edited to reflect changes from answer below.
This almost gets me there, but I need to strip the leading 'Fiscal year...share data' before feeding it to DictReader. How best to do this? I have looked at converting to string and stripping lead chars with str.lstrip() as the docs say here with no luck.
To use a DictReader you need to either specify the field names, or the field names need to be the first row of the csv data (ie. a header row).
In the csv file that your code retrieves, the field names are in the second row of data, not the first. What I've done is to throw out the first line of data before passing the csv file to the DictReader constructor.
In response to your updated question:
Stripping the leading text from the header row probably isn't desirable as this is acting as the field name for the first column of data. Probably better to discard the first 2 rows of data and then supply the desired field names directly to the DictReader. I have updated the example below to reflect this.
import csv
import urllib2

req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.readline()  # This reads (and discards) the first row of data which is not needed.
response.readline()  # This discards the original header row, whose field names we replace below.
myFieldnames = ["firstColName", "TTM", "2012", "2011", "2010", "2009", "2008"]
csvio = csv.DictReader(response, fieldnames=myFieldnames)
print csvio.fieldnames
for row in csvio:
    print row
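As an aside, the answer above targets Python 2 (urllib2, print statements). A rough sketch of the same approach in Python 3, where the response's byte stream must be wrapped for text decoding before csv can read it:
import csv
import io
import urllib.request

url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3'
with urllib.request.urlopen(url) as response:
    text = io.TextIOWrapper(response, encoding='utf-8')  # decode bytes to text for csv
    text.readline()  # discard the first, informational row
    text.readline()  # discard the original header row
    field_names = ["firstColName", "TTM", "2012", "2011", "2010", "2009", "2008"]
    for row in csv.DictReader(text, fieldnames=field_names):
        print(row)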
