Pandas: Error tokenizing data when using glob.glob

I am using the following code to concatenate several files (candidate master files) I have downloaded; they can also be found here:
https://github.com/108michael/ms_thesis/blob/master/cn06.txt
https://github.com/108michael/ms_thesis/blob/master/cn08.txt
https://github.com/108michael/ms_thesis/blob/master/cn10.txt
https://github.com/108michael/ms_thesis/blob/master/cn12.txt
https://github.com/108michael/ms_thesis/blob/master/cn14.txt
import numpy as np
import pandas as pd
import glob

df = pd.concat(
    pd.read_csv(f, header=None,
                names=['feccandid', 'candname', 'party', 'date', 'state',
                       'chamber', 'district', 'incumb.challeng', 'cand_status',
                       '1', '2', '3', '4', '5', '6'],
                usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
    for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt'))
I am getting the following error:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 58, saw 4
Does anyone have a clue on this?

The default delimiter for pd.read_csv is the comma (,). Since all of your candidates have names listed in the format Last, First, pandas reads two columns: everything before the comma and everything after it. In one of the files there are additional commas, leading pandas to assume that there are more columns. That's the parser error.
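To see the failure in isolation, here is a minimal sketch with a hypothetical two-record sample (not taken from the actual FEC files); the second record's extra comma yields more fields than the first line established:

import io
import pandas as pd

sample = io.StringIO('H0AL00016|SMITH, JOHN\nH0AL00032|DOE, JANE, JR.\n')
try:
    pd.read_csv(sample, header=None)  # default sep=',' splits on the name commas
except pd.errors.ParserError as e:   # older pandas raised CParserError
    print(e)  # Expected 2 fields in line 2, saw 3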
To use "|" as the delimiter instead of ",", just change your code to pass the keyword delimiter="|" or sep="|". From the docs, we see that delimiter and sep are aliases of the same keyword.
New code:
df = pd.concat(
    pd.read_csv(f, header=None, delimiter="|",
                names=['feccandid', 'candname', 'party', 'date', 'state',
                       'chamber', 'district', 'incumb.challeng', 'cand_status',
                       '1', '2', '3', '4', '5', '6'],
                usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
    for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt'))

import numpy as np
import pandas as pd
import glob

df = pd.concat(
    pd.read_csv(f, header=None, sep='|',
                names=['feccandid', 'candname', 'party', 'date', 'state',
                       'chamber', 'district', 'incumb.challeng', 'cand_status',
                       '1', '2', '3', '4', '5', '6'],
                usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
    for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt'))
print(len(df))


Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np

data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        # note the leading space in the ' email' column name
        ' email': ['todd@gmail.com', 'chris@gmail.com', np.nan, 'ben@gmail.com', np.nan, np.nan, 'joe@gmail.com', 'rick@gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you (the email column in the sample data is named ' email', with a leading space):

for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can also do it this way:

cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    return [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
            for i in place for j in csv_cols]

[func_current_cols_to_csv(i) for i in cols]

When writing to CSV you can also keep the index; just do not forget to reset it before writing.
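If the dataset grows, a single melt followed by one groupby avoids re-filtering the frame for every value. This is only a sketch built on the sample df above (not benchmarked), and it assumes the column names from the question, including the leading space in ' email':

# reshape to long format: one row per (person, feature column, value)
long = df.melt(id_vars=['name', 'phone', ' email'],
               value_vars=[c for c in df.columns if c.startswith('most')],
               var_name='column', value_name='key').dropna(subset=['key'])

# one pass over the (column, value) pairs
for (column, key), group in long.groupby(['column', 'key']):
    for attr in ('name', 'phone', ' email'):
        group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv', index=False)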

Pandas loc with prefix

I have two data frames and I want, for each line in one data frame, to locate the matching line in the other data frame by a certain column (containing some id). I thought to go over the lines in df1 and use the loc function to find the matching line in df2.
The problem is that some of the ids in df2 have some extra information besides the id itself.
For example:
df1 has the id: 1234,
df2 has the id: 1234-KF
How can I locate this id for example with loc? Can loc somehow match only by prefixes?
Extra information can be removed using, e.g., a regular expression (or substring operations):
import pandas as pd
import re

df1 = pd.DataFrame({
    'id': ['123', '124', '125'],
    'data': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'id': ['123-AA', '124-AA', '125-AA'],
    'data': ['1', '2', '3']
})

# strip everything that is not a digit, then compare row-wise
df2.loc[df2.id.apply(lambda s: re.sub("[^0-9]", "", s)) == df1.id]
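If the two frames are not guaranteed to be aligned row for row, matching by prefix with a merge may be safer. A sketch, assuming the same df1/df2 as above and that the id always comes before the first '-' (the id_prefix column is introduced here just for illustration):

# derive a join key by keeping only the part before the first '-'
df2['id_prefix'] = df2['id'].str.split('-').str[0]

merged = df1.merge(df2, left_on='id', right_on='id_prefix',
                   suffixes=('_df1', '_df2'))
print(merged)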

List as names for names of Pandas Dataframes

I want to make the names of some stock symbols the actual names of pandas dataframes.
import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime

choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    f'{c}'['Price'] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']  # <- raises the TypeError
I'm getting this error:
TypeError: 'str' object does not support item assignment
Is there a way to go about doing this? Perhaps using the name of the stock symbol as the name of the dataframe is not the best convention.
Thank you
You can put them in a data structure such as a dictionary:

import pandas as pd
import pandas_datareader.data as pdr
from datetime import datetime

choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
dataframes = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    dataframes[c] = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
So you will get a structure like the one below:
>>> print(dataframes)
{'ROK': <your_ROK_dataframe_here>,
'HWM': <your_HWM_dataframe_here>,
...
}
Then, you can access a specific dataframe by using dataframes['XXXX'] where XXXX is one of the choices.
You shouldn't be creating variables from strings, as it can get quite messy down the line. If you want to keep your convention, I'd advise storing your dataframes in a dictionary with the stock symbols as keys:
from datetime import datetime

choices = ['ROK', 'HWM', 'PYPL', 'V', 'KIM', 'FISV', 'REG', 'EMN', 'GS', 'TYL']
choices_dict = {}
for c in choices:
    pdr.DataReader(c, data_source='yahoo', start=datetime(2000, 1, 1),
                   end=datetime(2020, 1, 1)).to_csv(f'Data/{c}.csv')
    csv_pd = pd.read_csv(f'Data/{c}.csv', index_col='Date')['Adj Close']
    # convert the series to a one-column frame named 'Price'
    choices_dict[c] = csv_pd.to_frame(name='Price')
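A quick usage sketch (assuming the downloads above succeeded and the frames were built as shown):

rok = choices_dict['ROK']
print(rok['Price'].head())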

How to prevent pandas from removing 'NA' character string when reading a csv?

I am using pandas to read in a CSV file that has product data. One of the columns contains a product line code 'NA'. When I output the data to a new file, the line code 'NA' is no longer part of the dataframe; it has been removed and is now a blank field. How can I stop this from happening?
data = pd.read_csv('C:\\Users\\User\\Desktop\\' + filename, sep=',', quotechar='"',
                   encoding='mbcs', low_memory=False)
My desired dataframe would look like this:
"Line Code" "Product SKU"
AB Product1
AB Product2
AB Product3
NA Product4
NA Product5
NA Product6
MV Product7
MV Product7
MV Product7
Read the dataframe with keep_default_na=False, possibly specifying with na_values the set of values that you want to consider as "genuine" NaNs:

# custom admissible NaN values; 'NA' is not in this list
na_values = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND',
             '-1.#QNAN', '-NaN', '-nan', '1.#IND',
             '1.#QNAN', 'N/A', 'NULL', 'NaN',
             'n/a', 'nan', 'null']

data = pd.read_csv('C:\\Users\\User\\Desktop\\' + filename,
                   sep=',',
                   quotechar='"',
                   encoding='mbcs',
                   low_memory=False,
                   na_values=na_values,    # specify custom NaN values
                   keep_default_na=False)  # and use them
Here's a reproducible example of what could be happening here:

# create a dataframe with 'NA' and write it to file
import pandas as pd

df = pd.DataFrame({'Line Code': ['MV', 'RM', 'NA', 'AB'],
                   'Product SKU': ['Product1', 'Product2', 'Product3', 'Product4']})
df.to_csv("mydf.csv", index=False)

# read it back in, in two different fashions
df_problematic = pd.read_csv("mydf.csv")
df_ok = pd.read_csv("mydf.csv", keep_default_na=False)

In df_problematic, the 'NA' value is interpreted as NaN, which is not what you want (refer to the read_csv docs for options when reading CSV files in pandas and for info about the default list of symbols interpreted as NaNs).
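A quick check of the difference, with the values I would expect from the example above:

print(df_problematic['Line Code'].tolist())  # ['MV', 'RM', nan, 'AB']
print(df_ok['Line Code'].tolist())           # ['MV', 'RM', 'NA', 'AB']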

Python custom method set to new variable changes old variable

I have created a class with two methods, NRG_load and NRG_flat. The first loads a CSV, converts it into a DataFrame and applies some filtering; the second takes this DataFrame and, after creating two columns, it melts the DataFrame to pivot it.
I am trying out these methods with the following code:
nrg105 = eNRG.NRG_load('nrg_105a.tsv')
nrg105_flat = eNRG.NRG_flat(nrg105, '105')
where eNRG is the class, and the second argument '105' is needed to run an if block within the method that creates the aforementioned columns.
The behaviour I cannot explain is that the second line, the one with the NRG_flat method, changes the values of nrg105.
Note that if I only run the NRG_load method, I get the expected DataFrame.
What behaviour am I missing? It's not the first time I have used syntax like this, and I never had problems before, so I don't know where I should look.
Thank you in advance for all of your suggestions.
EDIT: as requested, here is the class' code:
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 16 15:22:21 2019
@author: CAPIZZI Filippo Antonio
"""
import pandas as pd
from FixFilename import FixFilename as ff
from SplitColumn import SplitColumn as sc
from datetime import datetime as ddt


class EurostatNRG:
    # This class includes the modules needed to load and filter
    # the Eurostat NRG files

    # Default countries' list to be used by the functions
    COUNTRIES = [
        'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
        'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
        'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
        'TR', 'UA', 'UK', 'XK'
    ]

    # Default years of analysis
    YEARS = list(range(2005, int(ddt.now().year) - 1))
    # NOTE: the 'datetime' library will call the current year, but since
    # the code is using the 'range' function, the end year will always be
    # current-1 (e.g. if we are in 2019, 'current year' will be 2018).
    # Thus, I have added "-1" because the end year is t-2.

    INDIC_PROD = pd.read_excel(
        './Datasets/VITO/map_nrg.xlsx',
        sheet_name=[
            'nrg105a_indic', 'nrg105a_prod', 'nrg110a_indic', 'nrg110a_prod',
            'nrg110'
        ],
        convert_float=True)

    def NRG_load(dataset, countries=COUNTRIES, years=YEARS, unit='ktoe'):
        # This module will load and refine the NRG dataset,
        # preparing it to be filtered

        # Fix eventual flags
        dataset = ff.fix_flags(dataset)
        # Load the dataset into a DataFrame
        df = pd.read_csv(
            dataset,
            delimiter='\t',
            encoding='utf-8',
            na_values=[':', ': ', ' :'],
            decimal='.')
        # Clean up spaces from the column names
        df.columns = df.columns.str.strip()
        # Remove the mentioned column because it's not needed
        if 'Flag and Footnotes' in df.columns:
            df.drop(columns=['Flag and Footnotes'], inplace=True)
        # Split the first column into separate columns
        df = sc.nrg_split_column(df)
        # Rename the columns
        df.rename(
            columns={
                'country': 'COUNTRY',
                'fuel_code': 'KEY_PRODUCT',
                'nrg_code': 'KEY_INDICATOR',
                'unit': 'UNIT'
            },
            inplace=True)
        # Filter the dataset
        df = EurostatNRG.NRG_filter(
            df, countries=countries, years=years, unit=unit)
        return df

    def NRG_filter(df, countries, years, unit):
        # This module will filter the input DataFrame 'df',
        # showing only the 'countries', 'years' and 'unit' selected

        # First, all of the units not of interest are removed
        df.drop(df[df.UNIT != unit.upper()].index, inplace=True)
        # Then, all of the countries not of interest are filtered out
        df.drop(df[~df['COUNTRY'].isin(countries)].index, inplace=True)
        # Finally, all of the years not of interest are removed,
        # and the columns are rearranged according to the desired output
        main_cols = ['KEY_INDICATOR', 'KEY_PRODUCT', 'UNIT', 'COUNTRY']
        cols = main_cols + [str(y) for y in years if y not in main_cols]
        df = df.reindex(columns=cols)
        return df

    def NRG_flat(df, name):
        # This module prepares the DataFrame to be flattened,
        # then it gives it as output

        # Assign the indicators' and products' names
        if '105' in name:  # 'name' is the name of the dataset
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg105a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_prod'].KEY_PRODUCT.astype(str),
                    EurostatNRG.INDIC_PROD['nrg105a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        elif '110' in name:
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg110a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_prod'].KEY_PRODUCT.astype(str),
                    EurostatNRG.INDIC_PROD['nrg110a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        # Delete the columns 'KEY_INDICATOR' and 'KEY_PRODUCT', and
        # rearrange the columns in the desired order
        df.drop(columns=['KEY_INDICATOR', 'KEY_PRODUCT'], inplace=True)
        main_cols = ['INDICATOR', 'PRODUCT', 'UNIT', 'COUNTRY']
        year_cols = [y for y in df.columns if y not in main_cols]
        cols = main_cols + year_cols
        df = df.reindex(columns=cols)
        # Pivot the DataFrame to have it in flat format
        df = df.melt(
            id_vars=df.columns[:4], var_name='YEAR', value_name='VALUE')
        # Convert the 'VALUE' column into float numbers
        df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='float')
        # Drop rows that have no indicators (it means they are not in
        # the Excel file with the products of interest)
        df.dropna(subset=['INDICATOR', 'PRODUCT'], inplace=True)
        return df
EDIT 2: if this could help, this is the error I receive when using the EurostatNRG class in IPython:
[autoreload of EurostatNRG failed: Traceback (most recent call last):
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 244, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    update_generic(old_obj, new_obj)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 331, in update_generic
    update(a, b)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 279, in update_class
    if (old_obj == new_obj) is True:
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().]
I managed to find the culprit.
In the NRG_flat method, the lines

df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
...
df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)

modify the caller's df DataFrame in place, so I had to change them to use the pandas assign method, which returns a new DataFrame:

df = df.assign(INDICATOR=df.KEY_INDICATOR.map(indic_dic))
...
df = df.assign(PRODUCT=df.KEY_PRODUCT.map(prod_dic))

I no longer get the error.
Thank you for replying!
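For anyone hitting the same behaviour, here is a minimal sketch (toy data, not the Eurostat files) of why this happens: the method receives a reference to the caller's DataFrame, so adding columns with df[...] = ... mutates that same object, while assign returns a new DataFrame and leaves the original untouched.

import pandas as pd

def flat_inplace(df):
    df['NEW'] = df['A'] * 2  # mutates the caller's frame
    return df

def flat_assign(df):
    return df.assign(NEW=df['A'] * 2)  # returns a new frame

orig = pd.DataFrame({'A': [1, 2, 3]})
flat_inplace(orig)
print(orig.columns.tolist())   # ['A', 'NEW']: the original changed

orig2 = pd.DataFrame({'A': [1, 2, 3]})
flat_assign(orig2)
print(orig2.columns.tolist())  # ['A']: the original is untouched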
