I'm new to MapReduce and mrjob, and I am trying to read a csv file that I want to process using mrjob in Python. The file has about 5 columns containing JSON strings (e.g. {}) or arrays of JSON strings (e.g. [{},{}]), some of them nested.
My mapper so far looks as follows:
from mrjob.job import MRJob
import csv
from io import StringIO

class MRWordCount(MRJob):
    def mapper(self, _, line):
        l = StringIO(line)
        reader = csv.reader(l)  # returns an iterator over the parsed rows
        for cols in reader:
            columns = cols
            yield None, columns
I get the error -
_csv.Error: field larger than field limit (131072)
But that seems to happen because my code splits the JSON strings into separate columns as well (because of the commas inside them).
How do I prevent the JSON strings from being split? Am I overlooking something?
Alternatively, are there any other ways I could read this file with mrjob that would make the process simpler or cleaner?
Your JSON strings are not surrounded by quote characters, so every comma inside those fields makes the csv engine think it's a new column.
Take a look at the csv module's quotechar parameter. Change your data so that each JSON field is surrounded by a quoting character (the default is ") and adjust your csv reader accordingly.
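Here is a minimal sketch of that idea, using a made-up input line in which the JSON field is wrapped in double quotes (with inner quotes doubled, per the csv default):

import csv
from io import StringIO

# hypothetical line: the JSON array is one quoted field,
# with embedded double quotes escaped by doubling them
line = '1,foo,"[{""a"": 1}, {""b"": 2}]",bar'

reader = csv.reader(StringIO(line), quotechar='"')
for cols in reader:
    print(cols)  # the JSON field survives as a single column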
I am a complete noob at Python, so I apologize if the solution is obvious.
I am trying to read some .csv field data in Python for processing. Currently I have:
import pandas as pd

data = pd.read_csv('somedata.csv', sep=' |,', engine='python', usecols=range(0, 10), skiprows=155, skipfooter=3)
However, depending on whether the data collection was interrupted, the last few lines of the file may be something like:
#data_end
Run Complete
Or
Run Interrupted
ERROR
A bunch of error codes
Hence I can't just use skipfooter=3. Is there a way for Python to detect the length of the footer and skip it? Thank you.
You can first read the content of your file as a plain text file into a Python list, remove the lines that don't contain the expected number of separators, and then transform the list into an IO stream. This IO stream is then passed on to pd.read_csv as if it were a file object.
The code might look like this:
from io import StringIO
import pandas as pd

# adjust these variables to meet your requirements:
number_of_columns = 11
separator = " |,"        # regex passed to pd.read_csv (space or comma)
count_separator = ","    # literal separator used for counting fields per line

# read the content of the file as plain text:
with open("somedata.csv", "r") as infile:
    raw = infile.readlines()

# drop the rows that don't contain the expected number of separators
# (a row with n columns contains n - 1 separators):
raw = [x for x in raw if x.count(count_separator) == number_of_columns - 1]

# turn the list into an IO stream (after joining the rows into one big string):
stream = StringIO("".join(raw))

# pass the stream as an argument to pd.read_csv():
df = pd.read_csv(stream, sep=separator, engine='python',
                 usecols=range(0, 10), skiprows=155)
If you use Python 2.7, you have to replace the first line, from io import StringIO, with the following two lines:
from __future__ import unicode_literals
from cStringIO import StringIO
This is because StringIO requires a unicode string (which is not the default in Python 2.7), and because the StringIO class lives in a different module in Python 2.7.
I think you simply have to resort to counting the commas on each line and manually finding the last correct one. I'm not aware of a read_csv parameter that automates that.
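A rough sketch of that idea (assuming each real data row contains exactly 10 commas; adjust the count and file name for your data):

with open("somedata.csv") as fh:
    lines = fh.readlines()

# index of the last line that still looks like a data row
last_good = max(i for i, line in enumerate(lines) if line.count(",") == 10)

n_footer = len(lines) - last_good - 1  # pass this value as skipfooter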
First question here so forgive any lapses in the etiquette.
I'm new to Python. I have a small project I'm trying to accomplish, both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately they don't use standard CSV format; they use a strange character to separate data, a ‡. I need it in CSV format in order to import it into another system. So what I need to do is take the data, replace the special character with a comma, and clean the data by removing whitespace (among other minor things, like unrecognized characters) so it's in the CSV form I need for the import.
I want to learn some Python, so I figured I'd write it in Python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request, but I can scale it up once I understand how to retrieve and manipulate the data properly.
My code so far, just trying to read and write two columns from the data:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
File "csvTest.py", line 14, in
f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident from that that I don't know how to manipulate or even read the data properly. I was able to pull back some JSON data and output it, so I know the structure works at its core with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the html. To get the raw file, use:
r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
Then, your data is unicode, not a byte string, so you need to encode it first.
Here is your program, with some changes:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')

# encode to UTF-8 bytes and replace each '\xc2\x87' sequence (the ‡ separator) with a comma
data = r.text.encode('utf-8').replace("\xc2\x87", ",").splitlines()

headers = data.pop(0).split(",")
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for row in data:  # the header row was already removed by pop(0)
        words = row.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n')  # split your data into lines
headers = lines[0].split('‡')
player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')

for line in lines[1:]:  # skip the headers line
    words = line.split('‡')  # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.
I am getting the KeyError below while running my Python script, which imports data from one csv, modifies it, and writes it to another csv.
Code snippet:
import csv

Ty = 'testy'
Tx = 'testx'

ifile = csv.DictReader(open('test.csv'))
cdata = [x for x in ifile]
for row in cdata:
    row['Test'] = row.pop(Ty)
Error seen while executing:
row['Test'] = row.pop(Ty)
KeyError: 'testy'
Any idea?
Thanks
Probably your csv doesn't have a header row where the keys are specified, since you didn't define the key names. In that case the DictReader needs the fieldnames parameter so it can map the column names (header) as keys to the values.
So you should do something like this to read your csv file:
ifile = csv.DictReader(open('test.csv'), fieldnames=['testx', 'testy'])
If you don't want to pass the fieldnames parameter, try to understand where your csv defines its header; see the Wikipedia article:
The first record may be a "header", which contains column names in
each of the fields (there is no reliable way to tell whether a file
does this or not; however, it is uncommon to use characters other than
letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
Alternatively, you can put your 'testx' and 'testy' in the first line of your csv and not pass fieldnames to DictReader.
According to the error message, testy is missing from the first line of test.csv.
Try such content in test.csv
col_name1,col_name2,testy
a,b,c
c,d,e
Note that there should not be any spaces/tabs around the testy.
First off, I am very new to Python. When I started to do this it seemed very simple. However, I am at a complete loss.
I want to take a text file with as many as 90k entries and put each group of data on a single line, separated by ';'. My examples are below. Keep in mind that the groups of data vary in size; they could be two entries or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;
Try something like the following (untested... you can learn a bit of Python by debugging!).
Create a python file "parser.py":
import sys

f = open('filename.txt', 'r')
for line in f:
    txt = line.strip()
    if txt == '':
        # a blank line ends the current group: start a new output line
        sys.stdout.write('\n')
        sys.stdout.flush()
    else:
        sys.stdout.write(txt + ';')
        sys.stdout.flush()
f.close()
and in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.
Assuming the groups are separated by an empty line, you can use the following one-liner:
>>> print "\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')])
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
First the file content (open().read()) is split on empty lines with split('\n\n') to produce a list of blocks; then, in each block ([item ... for item in list]), newlines are replaced with semi-colons; and finally all blocks are printed, separated by newlines ("\n".join(list)).
Note that the above is not safe for production; it is the kind of code you would write for interactive data transformation, not for production-level scripts.
What have you tried? What is the text file for/from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding and creating GUIs, by the way.
Anyway, here are some basic suggestions.
';'.join(group) will put a ";" in between the items of each group, effectively creating one long (semicolon-delimited) string.
group.replace("SPACE CHARACTER", ";") will replace any spaces or a specified character (like a newline) within a group with a semicolon.
There are a lot of other methods, including loading the txt file into a python script, .append() functions, putting the groups into lists, dictionaries, or matrices, etc.
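A tiny sketch of the join idea, assuming each group has already been collected into a list of strings (the names here are made up):

group = ["group2", "data", "data", "data"]
line = ";".join(group) + ";"  # -> "group2;data;data;data;"
print(line)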
These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv
res = defaultdict(list)
cgroup = ''
with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())

with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k] + res[k])
Let me explain why I did things the way I did. First, I used the codecs module to open the data file with an explicit codec, since data should always be treated properly, not by just guessing what it might be. Then I used a defaultdict, which has nice documentation online, because it's more pythonic, at least according to Mr. Hettinger; it is one of the patterns worth learning if you use Python.
Lastly, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To meet the right criteria and get the data into a correct csv format, it is better to use what many eyes have seen instead of reinventing the wheel.
I'm trying to read a csv file in python, so that I can then find the average of the values in one of the columns using numpy.average.
My script looks like this:
import os
import numpy
import csv

listing = os.listdir('/path/to/directory/of/files/i/need')
os.chdir('/path/to/directory/of/files/i/need')
for file in listing[1:]:
    r = csv.reader(open(file, 'rU'))
    for row in r:
        if len(row) < 2: continue
        if float(row[2]) <= 0.05:
            avg = numpy.average(float(row[2]))
            print avg
but I keep getting the error ValueError: invalid literal for float(). The csv reader seems to be reading the numbers as strings and won't let me convert them to floats. Any suggestions?
Judging by the comments, your program is running into problems with the headers.
Two solutions to this are to call r.next(), which skips a line, before your for loop, or to use the DictReader class. The advantage of the DictReader class is that you can treat each row as a dictionary instead of a tuple, which may be more readable in some cases, though you may need to pass the list of headers to it in the constructor.
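Here is a minimal sketch of the header-skipping approach ('somefile.csv' is a placeholder name):

import csv

with open('somefile.csv', 'rU') as fh:
    r = csv.reader(fh)
    next(r)  # consume the header row so float() never sees column names
    for row in r:
        if len(row) < 2:
            continue
        value = float(row[2])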
change:
float(row[2])
to:
float(row[2].strip("'\""))