Error on loading CSV using fread in pydatatable - python

I have a CSV file containing about 600K observations, and I'm importing it using fread:
DT = dt.fread('C:\\Users\\myamulla\\Desktop\\proyectos_de_py\\7726_analysis\\datasets\\7726export_Jan_23.csv')
It throws the following error:
--------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-3-01684fbecd91> in <module>
----> 1 dt.fread('C:\\Users\\myamulla\\Desktop\\proyectos_de_py\\7726_analysis\\datasets\\7726export_Jan_23.csv')
IOError: Too few fields on line 432815: expected 14 but found only 4 (with sep=','). Set fill=True to ignore this error. <<19731764,2021-01-23 23:30:15,2021-01-23 23:42:20,"Vote for David Borrero, your Republican in HD 105. Potestad betrayed Prez Trump. Borrero is for our values & POTUS Trump.>>
As suggested, I passed the argument fill=True to fread:
DT = dt.fread('C:\\Users\\myamulla\\Desktop\\proyectos_de_py\\7726_analysis\\datasets\\7726export_Jan_23.csv',fill=True)
It executes, but DT is created empty.
How can I resolve this?
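One way to investigate is to list the records that have fewer fields than expected before handing the file to fread. The sketch below uses only Python's standard csv module; the helper name find_short_rows and the sample data are hypothetical. The quoted field in the error message hints at an unbalanced quote or an embedded newline, but that is an assumption the message alone cannot confirm.

```python
import csv
import io


def find_short_rows(lines, expected, limit=5):
    """Return (line_number, field_count) for records with fewer fields than expected.

    `lines` is any iterable of text lines, such as an open file object.
    """
    short = []
    reader = csv.reader(lines)
    for row in reader:
        if len(row) < expected:
            # line_num is the last physical line consumed for this record
            short.append((reader.line_num, len(row)))
            if len(short) >= limit:
                break
    return short


# Hypothetical sample: the last record has only two of the three fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5\n")
problems = find_short_rows(sample, expected=3)
print(problems)
```

To run this against a real file, pass a file object opened with newline="" (and the file's encoding) instead of the StringIO sample.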

Related

I am using the Foursquare API and keep getting an error when a location has no zip/postal code and no results are returned.

My code is this...
results = requests.get(url).json()['response']['groups'][0]['items']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-225-110fa1855079> in <module>
1 london_venues = getNearbyVenues(names=df2['Postcode'],
2 latitudes=df2['Latitude'],
----> 3 longitudes=df2['Longitude']
4 )
<ipython-input-223-c23495b2f972> in getNearbyVenues(names, latitudes, longitudes, radius)
16
17 # make the GET request
---> 18 results = requests.get(url).json()['response']['groups'][0]['items']
19
20 # return only relevant information for each nearby venue
KeyError: 'groups'
I think it is because there is no data returned in some cases - is there a way I can just return no data?
If the response sometimes has no groups key, you can change your line to:
results = requests.get(url).json()['response'].get('groups',[{}])[0].get('items', [])
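This returns an empty list (not None) if groups or items is missing from the response. Note that it still raises an IndexError if groups is present but is an empty list.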
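The same idea can be wrapped in a small helper that also covers an empty groups list. This is a minimal sketch: extract_items is a hypothetical name, and the payloads are made-up dictionaries shaped like the response in the question, not real Foursquare output.

```python
def extract_items(payload):
    """Return the 'items' list of the first group, or [] when anything is missing."""
    # `or [{}]` covers both a missing 'groups' key and an empty 'groups' list
    groups = payload.get("response", {}).get("groups") or [{}]
    return groups[0].get("items", [])


# Hypothetical payloads for illustration only.
full = {"response": {"groups": [{"items": [{"venue": "A"}]}]}}
no_groups = {"response": {}}
empty_groups = {"response": {"groups": []}}

print(extract_items(full))
print(extract_items(no_groups))
print(extract_items(empty_groups))
```

In the original function this would be called as extract_items(requests.get(url).json()).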

Ignoring large returned values in IPython [duplicate]

I'm currently working with pandas and IPython. Since pandas DataFrames are copied when you perform operations on them, my memory usage increases by 500 MB with every cell. I believe this is because the data gets stored in the Out variable, since it doesn't happen with the default Python interpreter.
How do I disable the Out variable?
The first option is to avoid producing output. If you don't really need to see the intermediate results, skip them and put all the computations in a single cell.
If you do need to display the data, you can use the InteractiveShell.cache_size option to set a maximum size for the cache. Setting this value to 0 disables caching.
To do so, create a file called ipython_config.py (or ipython_notebook_config.py) under your ~/.ipython/profile_default directory with the contents:
c = get_config()
c.InteractiveShell.cache_size = 0
After that you'll see:
In [1]: 1
Out[1]: 1
In [2]: Out[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-d74cffe9cfe3> in <module>()
----> 1 Out[1]
KeyError: 1
You can also create different profiles for ipython using the command ipython profile create <name>. This will create a new profile under ~/.ipython/profile_<name> with a default configuration file. You can then launch ipython using the --profile <name> option to load that profile.
Alternatively you can use the %reset out magic to reset the output cache or use the %xdel magic to delete a specific object:
In [1]: 1
Out[1]: 1
In [2]: 2
Out[2]: 2
In [3]: %reset out
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Flushing output cache (2 entries)
In [4]: Out[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-d74cffe9cfe3> in <module>()
----> 1 Out[1]
KeyError: 1
In [5]: 1
Out[5]: 1
In [6]: 2
Out[6]: 2
In [7]: v = Out[5]
In [8]: %xdel v # requires a variable name, so you cannot write %xdel Out[5]
In [9]: Out[5] # xdel removes the value of v from Out and other caches
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-573c4eba9654> in <module>()
----> 1 Out[5]
KeyError: 5

CSV Load Error with Pandas

Can someone help me figure out what this error is telling me? I don't understand why this csv won't load.
Code:
import pandas as pd
import numpy as np
energy = pd.read_csv('Energy Indicators.csv')
GDP = pd.read_csv('world_bank_new.csv')
ScimEn = pd.read_csv('scimagojr-3.csv')
Error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-65661166aab4> in <module>()
10
11
---> 12 answer_one()
<ipython-input-2-65661166aab4> in answer_one()
4 energy = pd.read_csv('Energy Indicators.csv')
5 GDP = pd.read_csv('world_bank_new.csv')
----> 6 ScimEn = pd.read_csv('scimagojr-3.csv')
7
8
The read_csv function takes an encoding option. You're going to need to tell Pandas what the file encoding is. Try encoding = "ISO-8859-1".
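A minimal sketch of the idea, assuming the file is ISO-8859-1 encoded (the traceback alone does not confirm this). The helper read_with_fallback and the sample bytes are hypothetical; only the encoding argument of read_csv comes from the answer above.

```python
import io

import pandas as pd

# Hypothetical sample: "café" encoded as ISO-8859-1, so byte 0xE9 is not valid UTF-8.
raw = "name,score\ncafé,1\n".encode("ISO-8859-1")


def read_with_fallback(data, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order; return (DataFrame, encoding) for the first that works."""
    last_error = ValueError("no encodings supplied")
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


frame, used = read_with_fallback(raw)
print(used)
print(frame)
```

For a file on disk, the direct form is pd.read_csv('scimagojr-3.csv', encoding='ISO-8859-1').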

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

I am new to Python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an Excel file as a DataFrame. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues.)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single-element list. When parse is given a list of sheet names, it returns a dict of DataFrames keyed by sheet name. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
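Indexing with [0] fails with an IndexError when no sheet matches and silently takes the first one when several do. A slightly more defensive variant is sketched below; pick_sheet is a hypothetical helper, and the "Summary" sheet name is made up for illustration.

```python
def pick_sheet(sheet_names, needle):
    """Return the single sheet name containing `needle` (case-insensitive)."""
    matches = [s for s in sheet_names if needle in s.lower()]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one sheet containing {needle!r}, found {matches}"
        )
    return matches[0]  # a plain string, so parse returns a single DataFrame


# Hypothetical sheet list; in the question this would be xl.sheet_names.
names = ["Summary", "Security exposure - 21 day lag"]
sheet = pick_sheet(names, "security exposure")
print(sheet)
```

The result can then be passed to xl.parse(sheet) as in the fix above.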
