Python Pandas and Generator to Process Lines in CSV

Hoping that this is an allowable SO question, but I am looking for advice on how to convert the code below, which processes lines in a file to produce a dataframe, into one that uses generators and yield, because this implementation using a list and append is far too slow.
Here is the solution I came up with, but I was really hoping to avoid the slow list-and-append approach. I was hoping for a generator-and-yield solution instead, but I am not yet comfortable working with generators.
Sample lines in file:
"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","
Current Solution:
import csv
import pandas as pd
from numpy import nan

def parse_file(filename):
    newline = []
    with open(filename, 'rb') as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            newline.append([s.strip('"') for s in row[:-1]])
    df = pd.DataFrame(newline)
    df = df.applymap(lambda x: nan if len(x) == 0 else x).astype(object)
    return df
df = parse_file(filename)
Output is just a dataframe with 23 columns and two rows if used against the sample lines above.

The only problem with your file is that each line ends with ,". This confuses the parser. If you can remove the trailing comma and quotation mark, you can use the regular parser.
import pandas as pd
from StringIO import StringIO

with open('example.txt') as myfile:
    data = myfile.read().replace(',"\n', '\n')

pd.read_csv(StringIO(data), header=None)
This is what I get:
0 1 2 3 4 5 6 7 8 9 \
0 USNC3255 27 US NC LANDS END 72305006 KNJM KNCA KNKT T72305006
1 USNC3256 27 US NC LANDSDOWN 72314058 KEHO KAKH KIPJ T72314058
... 13 14 15 16 17 18 19 \
0 ... NCZ095 NaN 545 28594 America/New_York 34.65266 -77.07661
1 ... NCZ068 sc007 517 28150 America/New_York 35.29374 -81.46537
20 21 22
0 7 RDU 893727
1 797 CLT 317845
[2 rows x 23 columns]
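For completeness, here is a minimal sketch of the generator-and-yield variant the asker was after (assuming Python 3; the 'example.txt' filename is reused from above). pd.DataFrame accepts any iterable of rows, so the intermediate list can be dropped, although the constructor still materializes the rows internally, which is why letting read_csv's C parser do the work, as above, is usually the bigger win:
import csv
import numpy as np
import pandas as pd

def parse_lines(filename):
    """Yield one cleaned row at a time instead of appending to a list."""
    with open(filename, newline='') as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # strip the literal quotes, drop the trailing field, turn empty fields into NaN
            yield [s.strip('"') or np.nan for s in row[:-1]]

df = pd.DataFrame(parse_lines('example.txt'))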

Related

split Python DataFrame into k parts with index and iterate over them in a loop

I suppose that someone might have asked this already, but for the life of me I cannot find what I need after some searching; possibly my level of Python is too low.
I saw several questions with answers using globals() and exec(), with comments that it's a bad idea; other answers suggest using dictionaries or lists. At this point I got a bit lost about what to use here, and any help would be very welcome.
What I need is roughly this:
I have a Python DataFrame, say called dftest
I'd like to split dftest into, say, 6 parts of similar size
then I'd like to iterate over them (or possibly parallelise?) and run some steps calling spatial functions that use parameters (param0, param1, ..., param5) over each of the rows of each df to add more columns, preferably exporting each result to a csv (as it takes a long time to complete one part, I wouldn't want to lose the result of each iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat), and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs ', min(dfs[j].index), 'to ', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one by one something like this:
for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]
    dfs[0].at[i, 'anotherColumn'] = baz(y)
    # ... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps dfs[0]:dfs[5] would work... I'm still on the previous step.
PS. I'm using a Jupyter notebook on localhost with Python3.
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the dataframe into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
# n_cpu - count of cores and split chunks
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
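If you would rather stay close to your original np.array_split plan (and keep the per-part CSV checkpoints), here is a rough sketch using the standard library's concurrent.futures. process_chunk and its new column are stand-ins for your own spatial functions (foo.bar, baz), and the toy dftest is rebuilt from the question; note that in a Jupyter notebook a process pool usually needs the worker function to live in an importable module, so a plain loop is the easy fallback:
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(args):
    j, chunk = args
    chunk = chunk.copy()
    # stand-in for the real work: baz(foo.bar(row.type, param=...)) per row
    chunk['anotherColumn'] = chunk['type'].str.len()
    # checkpoint each part (the asker wrote these to /projectPath/dfs0.csv etc.)
    chunk.to_csv(f"dfs{j}.csv")
    return chunk

if __name__ == '__main__':
    # cut-down version of the toy dftest from the question
    dftest = pd.DataFrame({
        'rowid': [1, 2, 3, 4, 5, 6],
        'type': ['Tomt', 'Lägenhet', 'Lägenhet', 'Villa', 'Radhus', 'Villa'],
        'lon': [np.nan, 12.72, np.nan, 17.95, 17.95, 17.95],
        'year': [2021, 2022, 2017, 2016, 2021, 2016],
    })
    dfs = np.array_split(dftest, 6)
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(process_chunk, enumerate(dfs)))
    dfresult = pd.concat(parts)   # works the same for 6 parts or 100
    print(dfresult)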

How to get the data from UCI machine learning repository

I have a small problem with reading the data from this source correctly. I tried to write:
path = 'http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.data'
df = pd.read_table(path)
And then I got something strange.
Then I wrote:
df = pd.read_table(path, sep=',', header=None)
and got an error: ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 19
Could you, please, help me to find the solution?
The file is basically a csv file so you can use read_csv. Use it in combination with skiprows=2 to skip the first non-relevant rows of the file.
import pandas as pd
path = 'http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.data'
df = pd.read_csv(path, skiprows=2, index_col=False)
Output df.head() (transposed here for readability):
                          0          1          2          3          4
REGION-CENTROID-COL   BRICKFACE  BRICKFACE  BRICKFACE  BRICKFACE  BRICKFACE
REGION-CENTROID-ROW         140        188        105         34         39
REGION-PIXEL-COUNT          125        133        139        137        111
SHORT-LINE-DENSITY-5          9          9          9          9          9
SHORT-LINE-DENSITY-2          0          0          0          0          0
VEDGE-MEAN                    0          0          0          0          0
VEDGE-SD               0.277778   0.333333   0.277778        0.5   0.722222
HEDGE-MEAN             0.062963   0.266667   0.107407   0.166667   0.374074
HEDGE-SD               0.666667        0.5   0.833333    1.11111   0.888889
INTENSITY-MEAN         0.311111  0.0777777   0.522222   0.474074   0.429629
RAWRED-MEAN             6.18518    6.66667    6.11111    5.85185    6.03704
RAWBLUE-MEAN            7.33333    8.33333    7.55556    7.77778          7
RAWGREEN-MEAN           7.66667    7.77778    7.22222    6.44444    7.66667
EXRED-MEAN              3.55556    3.88889    3.55556    3.33333    3.44444
EXBLUE-MEAN             3.44444          5    4.33333    5.77778    2.88889
EXGREEN-MEAN            4.44444    3.33333    3.33333    1.77778    4.88889
VALUE-MEAN             -7.88889   -8.33333   -7.66667   -7.55556   -7.77778
SATURATION-MEAN         7.77778    8.44444    7.55556    7.77778    7.88889
HUE-MEAN               0.545635    0.53858   0.532628   0.573633   0.562919
Can you try specifying the encoding, like this:
path = 'http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.data'
df = pd.read_csv(path, encoding='utf8')
If that does not work, try other encodings.
The problem seems to be that the data file contains some meta information that Pandas cannot parse. You need to convert your file to a CSV before it can be read by pandas.
To do this, first download the file to your local machine at some location filepath and remove the lines starting with ;;; as well as the empty lines. Then running pd.read_table(filepath, sep='\t') or pd.read_csv(filepath) should work as expected.
Note that the header argument does not refer to any generic header information that the file may contain. header tells pandas whether the first line of your CSV contains the column names (header=0, the default when names are not supplied) or whether the actual data starts on the first line (header=None).
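If you would rather not edit the file by hand, a hedged alternative is to let read_csv drop the metadata for you, assuming the only non-data lines either start with ';' (the ';;;' lines mentioned above) or are blank:
import pandas as pd

path = 'http://archive.ics.uci.edu/ml/machine-learning-databases/image/segmentation.data'
# lines starting with ';' are treated as comments and ignored;
# skip_blank_lines=True is the default, so the empty lines go too
df = pd.read_csv(path, comment=';', index_col=False)
print(df.head())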

Extra commas at beginning and end of csv rows, how to remove?

So I have a .csv file where each row looks like this:
,11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01,
,11:00:15,4,5.,94.7,0.04,0.5,7,20,0.005,10,49.5,0.04,
when it should look like this:
11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01
11:00:15,4,5.,94.7,0.04,0.5,7,20,0.005,10,49.5,0.04
I think that this is the reason why pandas is not creating data frames properly. What can I do to remove these commas?
The code generating the original csv file is
def tsv2csv():
    # read tab-delimited file
    with open(file_location + tsv_file, 'r') as fin:
        cr = csv.reader(fin, delimiter='\t')
        filecontents = [line for line in cr]
    # write comma-delimited file (comma is the default delimiter)
    # give the exact location of the file
    # "newline=''" at the end of the line stops there being spaces between each row
    with open(new_csv_file, 'w', newline='') as fou:
        cw = csv.writer(fou, quotechar='', quoting=csv.QUOTE_NONE)
        cw.writerows(filecontents)
You can use usecols to specify the columns you want to import as follows:
import pandas as pd
csv_df = pd.read_csv('temp.csv', header=None, usecols=range(1,13))
This will skip the first and last empty columns.
The leading and trailing commas correspond to missing data. When the file is loaded into a dataframe, they come in as NaNs, so all you need to do is get rid of them, either using dropna or by slicing them out:
df = pd.read_csv('file.csv', header=None).dropna(how='all', axis=1)
Or,
df = pd.read_csv('file.csv', header=None).iloc[:, 1:-1]
df
1 2 3 4 5 6 7 8 9 10 11 12
0 11:00:14 4 5.0 93.7 0.01 0.0 7 20 0.001 10 49.3 0.01
1 11:00:15 4 5.0 94.7 0.04 0.5 7 20 0.005 10 49.5 0.04
You can strip any characters at the beginning and end of a string by using strip, passing the characters you want to remove as an argument.
x = ',11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01,'
print x.strip(',')
>11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01
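Following the same idea, a minimal sketch that strips the commas from every line and then hands the cleaned text to pandas (assuming Python 3 and the temp.csv filename used above):
import io
import pandas as pd

with open('temp.csv') as f:
    # drop the newline, then the leading/trailing commas, on every row
    cleaned = '\n'.join(line.strip().strip(',') for line in f)

df = pd.read_csv(io.StringIO(cleaned), header=None)
print(df.head())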
Not sure if it works in your case, but have you tried importing with:
df = pd.read_csv('filename', sep=';')
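If you would rather fix the file at write time instead of at read time, and assuming the stray commas come from empty first and last fields in the tab-delimited source, a hedged tweak to the asker's tsv2csv is to trim those empty fields before writing (tsv_path and csv_path stand in for the asker's file_location + tsv_file and new_csv_file):
import csv

def tsv2csv(tsv_path, csv_path):
    with open(tsv_path, 'r') as fin:
        rows = list(csv.reader(fin, delimiter='\t'))
    with open(csv_path, 'w', newline='') as fou:
        cw = csv.writer(fou, quotechar='', quoting=csv.QUOTE_NONE)
        for row in rows:
            # an empty first/last field is what produces the ",...," lines
            if row and row[0] == '':
                row = row[1:]
            if row and row[-1] == '':
                row = row[:-1]
            cw.writerow(row)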

How to use square brackets as a quote character in Pandas.read_csv

Let's say I have a text file that looks like this:
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:
import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)
I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
Update
There were three primary solutions that were offered: 1) Give a long range of names to the data frame to allow all data to be read in and then post-process the data, 2) Find values in square brackets and put quotes around it, or 3) replace the first n number of commas with semicolons.
Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds. The tests were run on a text file containing 18 columns and more than 208,000 rows.
We can use a simple trick: quote the balanced square brackets with double quotes:
import re
import six
import pandas as pd
data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""
print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))
pd.set_option('display.expand_frame_repr', False)
print(df)
Output:
----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
Item Date Time Location junk
0 1 01/01/2016 13:41 [45.2344:-78.25453] [aaaa,bbb]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242] [0,1,2,3]
2 3 01/10/2016 01:27 [51.2344:-86.24432] [12,13]
3 4 01/30/2016 05:55 [51.2344:-86.24432,41.2342:-81242,55.5555:-81242] [45,55,65]
UPDATE: if you are sure that all square brackets are balanced, we don't have to use RegEx's:
import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)
I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step:
import pandas as pd
import io
import re

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')

with open('path/to/file.txt', 'r') as fi:
    # replace brackets with quotes, pipe into file-like object
    fo = io.StringIO()
    fo.writelines(unicode(re.sub(location_regex, r'"\1"', line)) for line in fi)

# rewind file to the beginning
fo.seek(0)

# read transformed CSV into data frame
df = pd.read_csv(fo)
print df
This gives you a result like
Date_Time Item Location
0 2016-01-01 13:41:00 1 [45.2344:-78.25453]
1 2016-01-03 19:11:00 2 [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00 3 [51.2344:-86.24432]
Edit If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)

with open('path/to/file.csv', 'r') as fi:
    data = unicode(re.sub(location_regex, r'"\1"', fi.read()))

df = pd.read_csv(io.StringIO(data))
If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):
with open('/path/to/file.csv', 'r') as fi:
    data = unicode(fi.read().replace('[', '"').replace(']', '"'))

df = pd.read_csv(io.StringIO(data))
Below are the timing results I got with a 200k-row by 3-column dataset. Each time is averaged over 10 trials.
data frame post-processing (jezrael's solution): 2.19s
line by line regex: 1.36s
bulk regex: 0.39s
bulk string replace: 0.14s
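Those figures come from the answerer's own 200k-row file; if you want to reproduce such a comparison, a rough timeit sketch (timing only the preprocessing step, with 'path/to/file.txt' as a stand-in path) might look like this:
import re
import timeit

FILE = 'path/to/file.txt'   # stand-in path, as used in the question
location_regex = re.compile(r'\[([^\[\]]+)\]')

def bulk_regex():
    with open(FILE) as fi:
        return re.sub(location_regex, r'"\1"', fi.read())

def bulk_replace():
    with open(FILE) as fi:
        return fi.read().replace('[', '"').replace(']', '"')

for fn in (bulk_regex, bulk_replace):
    # average over 10 trials, as in the timings above
    print(fn.__name__, round(timeit.timeit(fn, number=10) / 10, 2), 'seconds')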
I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv:
import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()
    fo = io.StringIO()
    fo.writelines(u"" + line.replace(',', ';', 3) for line in lines)
    fo.seek(0)

df = pd.read_csv(fo, sep=';')
print df
Item Date Time Location
0 1 01/01/2016 13:41 [45.2344:-78.25453]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242]
2 3 01/10/2016 01:27 [51.2344:-86.24432]
Or you can try this more complicated approach, because the main problem is that the separator , between values in the lists is the same as the separator between the other column values.
So you need post-processing:
import pandas as pd
import io
temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
# after testing, replace io.StringIO(temp) with the filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
0 1 2 3 4 \
0 Item Date Time Location NaN
1 1 01/01/2016 13:41 [45.2344:-78.25453] NaN
2 2 01/03/2016 19:11 [43.3423:-79.23423 41.2342:-81242
3 3 01/10/2016 01:27 [51.2344:-86.24432] NaN
5 6 7 8 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 41.2342:-81242] NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
# remove columns that are all NaN
df = df.dropna(how='all', axis=1)
# use the first row as the column names
df.columns = df.iloc[0, :]
# remove the first row
df = df[1:]
# remove the columns' name
df.columns.name = None
#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
Location NaN NaN
1 [45.2344:-78.25453] NaN NaN
2 [43.3423:-79.23423 41.2342:-81242 41.2342:-81242]
3 [51.2344:-86.24432] NaN NaN
#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)
#subset of desired columns
print df[['Item','Date','Time','Location']]
Item Date Time Location
1 1 01/01/2016 13:41 [45.2344:-78.25453]
2 2 01/03/2016 19:11 [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3 3 01/10/2016 01:27 [51.2344:-86.24432]

Pandas error "Can only use .str accessor with string values"

I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np

filename = sys.argv[1]
df = pd.read_csv(filename, header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
return self.construct_accessor(instance)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.
It's happening because your last column is empty, so it gets converted to NaN:
In [417]:
t="""'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 'Name' 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15 16
0 0N# NaN
If you slice your column range so that it stops before that last column, then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[421]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
Alternatively you can just select the cols that are object dtype and run the code (skipping the first col as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([np.object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[428]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
I got this error while working in Eclipse. It turned out that the project interpreter had somehow (after an update, I believe) been reset to Python 2.7. Setting it back to Python 3.6 resolved the issue. It all came with several crashes, restarts and warnings, but after a few minutes of trouble it seems fixed now.
While I know this is not a solution to the problem posed here, I thought it might be useful for others, as I came to this page after searching for this error.
In this case we have to use the str.replace() method on that series, but first we have to convert it to str type. Here the Patient column contains values like 's125', 's45', 's588', 's244', 's125', 's123':
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
df1.Patient = df1.Patient.astype(str)
df1['Patient'] = df1['Patient'].str.replace('s', '').astype(int)
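More generally, the error just means the column is no longer of object (string) dtype, often because an all-NaN column (like the empty trailing field in the question) was inferred as float. A minimal defensive pattern for modern pandas is to cast to str before using the .str accessor and coerce back to numbers afterwards (the toy column below is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': ['65M', '100M', np.nan, '90M']})

# cast to str so .str works even if NaNs changed the dtype,
# then pull out the numeric part and coerce back to floats
extracted = df['score'].astype(str).str.extract(r'(\d+\.?\d*)', expand=False)
df['score_num'] = pd.to_numeric(extracted, errors='coerce')
print(df)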
