I am new to Python and trying to parse data from a file that contains millions of lines. I tried to go old school and parse it with Excel, but it fails. How can I parse the information efficiently and export it into an Excel file so that it is easier for other people to read?
I tried using this code provided by someone else, but no luck so far:
import re
import pandas as pd

def clean_data(filename):
    with open(filename, "r") as inputfile:
        for row in inputfile:
            if re.match("\[", row) is None:
                yield row

with open(clean_file, 'w') as outputfile:
    for row in clean_data(filename):
        outputfile.write(row)
NameError: name 'clean_file' is not defined
It looks like clean_file is not defined, which is probably a problem from copy/pasting the code.
Did you mean to write to a file called "clean_file"? In that case you need to wrap it in quotes: with open("clean_file", 'w')
If you want to work with JSON, I suggest looking into the json package, which has lots of tools for loading and parsing JSON. Otherwise, if the JSON is flat, you can just use the built-in pandas function read_json.
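For reference, here is a minimal sketch that combines the fix with the Excel export the question asks about. The file names are placeholders, the cleaned rows are assumed to be comma-separated, and to_excel needs an Excel writer such as openpyxl installed:

import re
import pandas as pd

def clean_data(filename):
    # skip any line that starts with "[" (the same filter as above)
    with open(filename, "r") as inputfile:
        for row in inputfile:
            if re.match(r"\[", row) is None:
                yield row

# placeholder names; substitute your real input and output paths
filename = "data.log"
clean_file = "cleaned.csv"

with open(clean_file, "w") as outputfile:
    for row in clean_data(filename):
        outputfile.write(row)

# if the cleaned rows are comma-separated, pandas can load them
# and write an .xlsx file for others to open in Excel
df = pd.read_csv(clean_file)
df.to_excel("output.xlsx", index=False)

Keep in mind that a single Excel worksheet holds at most 1,048,576 rows, so a file with many millions of lines may need to be split or aggregated first.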
I don't need the entire code, but I want a push to help me on the way. I've been searching the internet for clues on how to start writing a function like this, but I haven't gotten any further than just the name of the function.
So I haven't got the slightest clue how to start with this; I don't know how to work with text files. Any tips?
These text files are CSV (Comma-Separated Values) files, a simple file format used to store tabular data.
You may explore Python's built-in module called csv.
The following code snippet is an example of loading a .csv file in Python:
import csv

filename = 'us_population.csv'
with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)    # first row holds the column headers
    for row in csvreader:
        print(row)              # each row is a list of strings
I am doing a data migration. The old application's data is exported as one CSV file. We cannot import this CSV file directly into the new application. I need to create a new CSV template that matches the new application and import some data into this new CSV template. I would like to request code that facilitates this requirement.
I'm not exactly sure what template you want to go to. I'm going to assume that you either want to change the number/order of columns or the delimiter.
The simplest thing is to read it line by line and write it:
import csv

with open("Old.csv", 'r', newline='') as readfp, open("new.csv", 'w', newline='') as writefp:
    csvReader = csv.reader(readfp)
    csvWriter = csv.writer(writefp, delimiter=',')
    for line in csvReader:
        # line is a list of strings, so you can reorder it as you wish.
        # As an example, this skips the third column.
        csvWriter.writerow(line[:2] + line[3:])
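If it is the delimiter that differs between the old and new templates rather than the columns, the same pattern applies; here is a sketch assuming the new application expects semicolons:

import csv

with open("Old.csv", 'r', newline='') as readfp, open("new.csv", 'w', newline='') as writefp:
    csvReader = csv.reader(readfp)                   # old file is comma-separated
    csvWriter = csv.writer(writefp, delimiter=';')   # ';' is an assumption; use the new template's delimiter
    for line in csvReader:
        csvWriter.writerow(line)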
If you have pandas installed, this is even simpler:
import pandas as pd

df = pd.read_csv("Old.csv")
df = df.drop(columns=["name_of_bad_col1", "name_of_bad_col2"])  # drop the columns the new template does not need
df.to_csv("new.csv", index=False)  # index=False keeps pandas from adding an extra index column
If you are going the pandas route, make sure to check out the documentation (read_csv, to_csv).
Pandas has the DataFrame.to_json and pd.read_json functions that work for single DataFrames. However, I have been trying to figure out a way to export and import a list of many DataFrames to and from a single JSON file. So far, I have successfully exported the list with this code:
with open('my_file.json', 'w') as outfile:
    outfile.writelines([json.dumps(df.to_dict()) for df in list_of_df])
This creates a JSON file with all the DataFrames converted to dicts. However, when I try to do the reverse to read the file and extract my DataFrames, I get an error. This is the code:
with open('my_file.json', 'r') as outfile:
    list_of_df = [pd.DataFrame.from_dict(json.loads(item)) for item in outfile]
The error I get is:
JSONDecodeError: Extra data
I think the problem is that I have to somehow use the opposite of 'writelines', which is 'readlines', in the code that reads the JSON file, but I do not know how to do it. Any help will be appreciated!
By using writelines your data isn't really a list in the Python sense: writelines does not add newlines, so the dicts all end up concatenated in the file, which is why json.loads complains about extra data. I'd recommend instead writing to your file like this:
with open('my_file.json', 'w') as outfile:
    outfile.write(json.dumps([df.to_dict() for df in list_of_df]))
Which means we can read it back just as simply using:
with open('my_file.json', 'r') as outfile:
    list_of_df = [pd.DataFrame.from_dict(item) for item in json.loads(outfile.read())]
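If you would rather keep one DataFrame per line, closer to the original writelines idea, writing an explicit newline after each record also works; a sketch, assuming list_of_df is the list of DataFrames from the question:

import json
import pandas as pd

# write one JSON object per line (note the explicit newline)
with open('my_file.json', 'w') as outfile:
    for df in list_of_df:
        outfile.write(json.dumps(df.to_dict()) + '\n')

# read each line back into its own DataFrame
with open('my_file.json', 'r') as infile:
    list_of_df = [pd.DataFrame.from_dict(json.loads(line)) for line in infile]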
So I had written a little snippet for Scrapy to search the country on a site by zip code, but it seems like a waste to go through all the nonexistent zip codes, so, first, this is what I had...
def start_requests(self):
    for i in xrange(100000):
        yield self.make_requests_from_url("http://www.example.com/zipcode/%05d/search.php" % i)
The idea is obvious, but I downloaded a CSV with all of the US zip codes in a column - how would I easily use this as a list (or a more efficient structure than a list) in the above example? I have pandas if that would make things easier.
If I'm understanding you correctly, you have a comma-delimited file formatted such that a zip code is present on each row in a particular column (perhaps titled 'ZipCodes').
If there's a header line and different columns and you know the name of the column that contains the zipcodes you could do this:
def start_requests(self, filename, columnname):
    with open(filename) as file:
        headers = file.readline().strip().split(',')
        for line in file:
            zipcode = line.strip().split(',')[headers.index(columnname)]
            # the zip code is read as a string, so convert it before using the %05d format
            yield self.make_requests_from_url("http://www.example.com/zipcode/%05d/search.php" % int(zipcode))
Open file, read lines, grab zip codes, yield ...
for line in open('zipcodes.csv', 'r').readlines():
    zipcode = line.split(',')[columnNumberOfTheZipCodesStartingFrom0]
    yield self.make_requests_from_url("http://foo.com/blah/%s/search.php" % (zipcode,))
Just to round out the array of perfectly good suggestions, here's another. The main idea of this approach is that it doesn't require special libraries like pandas, but it isn't just reading plain file contents either, in which case you would have to re-invent the wheel as far as CSV markup goes (not the hardest thing, but why bother?). If your CSV file is simple enough, it might be easier just to read out the file contents, as suggested by dg99.
Use Python's built-in csv library!
import csv

ziplist = []
with open('zipcodes.csv', 'r', newline='') as csvfile:
    zipreader = csv.reader(csvfile)
    for row in zipreader:
        ziplist.append(row[i])
Notes:
I have row[i] where i is the column index for the zip codes in your CSV file. If the file lists zip+4 codes, you might use row[i][:5]. Interestingly, if you don't know which column the zip codes will be in, but you do know the column header (field name), you can use:
zipreader = csv.DictReader(csvfile)
for zipDict in zipreader:
    ziplist.append(zipDict['Zip Code Column Name Here'])
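Putting the DictReader variant together as a self-contained sketch (the file name and column header are placeholders):

import csv

ziplist = []
with open('zipcodes.csv', 'r', newline='') as csvfile:
    zipreader = csv.DictReader(csvfile)
    for zipDict in zipreader:
        ziplist.append(zipDict['Zip Code Column Name Here'])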
According to this post, getting info back out of a list is just as efficient as a tuple, so this seems like the way to go.
So you want to read a CSV into a list... well, I think this should be easy:
import pandas
colname = ['zip code','city']
zipdata = pandas.read_csv('uszipcodes.csv', names=colname)
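If a plain Python list of zip codes is what the spider needs, the column can be pulled out of the DataFrame; a sketch reusing the names above:

import pandas

colname = ['zip code', 'city']
zipdata = pandas.read_csv('uszipcodes.csv', names=colname)
ziplist = zipdata['zip code'].tolist()   # plain Python list of zip codes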
I hope I understood you right!
Maybe like this?
#!/usr/local/cpython-3.3/bin/python

import csv
import pprint

def gen_zipcodes(file_):
    reader = csv.reader(file_, delimiter='|', quotechar='"')
    for row in reader:
        yield row[0]

def main():
    with open('zipcodes_2006.txt', 'r') as file_:
        zipcodes = list(gen_zipcodes(file_))
        pprint.pprint(zipcodes[:10])

main()
Yesterday I posted the question linked below:
Python CSV Module read and write simultaneously
Several people suggested: "If file b is not extremely large I would suggest using readlines() to get a list of all lines and then iterate over the list and change lines as needed."
I still want to be able to use the functionality of the csv module but do what they have suggested. I am new to Python and still don't quite understand how I should do this.
Could someone please provide me with an example of how I should do this?
Here is a sample that reads a CSV file using a DictReader and uses a DictWriter to write to stdout. The file has a column named PERCENT_CORRECT_FLAG, and this modifies the CSV file to set this field to 0.
#!/usr/bin/env python
from __future__ import with_statement
from __future__ import print_function
from csv import DictReader, DictWriter
import sys

def modify_csv(filename):
    with open(filename) as f:
        reader = DictReader(f)
        writer = DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        for i, s in enumerate(writer.fieldnames):
            print(i, s, file=sys.stdout)
        for row in reader:
            row['PERCENT_CORRECT_FLAG'] = '0'
            writer.writerow(row)

if __name__ == '__main__':
    for filename in sys.argv[1:]:
        modify_csv(filename)
If you do not want to write to stdout, you can open another file for write and use that. Note that if you want to write back to the original file, you have to either:
Read the file into memory and close the file before opening it for write, or
Open a file with a different name for write and rename it after closing it.
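A sketch of the second option, reusing the example above; the temporary file name is an assumption, and os.replace swaps it in once both files are closed:

import os
from csv import DictReader, DictWriter

def modify_csv_in_place(filename):
    tmpname = filename + '.tmp'   # temporary output name (assumption)
    with open(filename, newline='') as src, open(tmpname, 'w', newline='') as dst:
        reader = DictReader(src)
        writer = DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row['PERCENT_CORRECT_FLAG'] = '0'
            writer.writerow(row)
    os.replace(tmpname, filename)  # rename over the original after both files are closed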