Same DataFrame.reindex code - different output - python

Good afternoon everyone,
I want to filter out from a DataFrame the columns that I am not interested in.
To do that - and since the columns could change based on user input (which I will not show here) - I am using the following code within my offshore_filter function:
# Note: 'df' is my DataFrame, with different country codes as rows and years as columns' headers
import datetime as d
import pandas as pd
COUNTRIES = [
    'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
    'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
    'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
    'TR', 'UA', 'UK', 'XK'
]
YEARS = list(range(2005, d.datetime.now().year))
def offshore_filter(df, countries=COUNTRIES, years=YEARS):
    # This function filters out the countries
    # and the years not needed in the analysis

    # Filter out all of the countries not of interest
    df.drop(df[~df['country'].isin(countries)].index, inplace=True)

    # Filter out all of the years not of interest
    columns_to_keep = ['country', 'country_name'] + [i for i in years]
    temp = df.reindex(columns=columns_to_keep)
    df = temp  # This step to avoid the copy vs view complication
    return df
When I pass a years list of integers, the code works well and filters the DataFrame by taking only the columns in the years list.
However, if the DataFrame's column headers are strings (e.g. '2018' instead of 2018), changing [i for i in years] into [str(i) for i in years] doesn't work, and I get columns of NaNs (as the reindex documentation warns for labels that are not found).
Can you help me spot why?
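For reference, reindex matches labels exactly, so a type mismatch between the labels you ask for and the labels the columns actually carry produces exactly these all-NaN columns. A minimal sketch with a made-up toy frame (not the real data):

```python
import pandas as pd

# Toy frame standing in for the real data: note the column header
# is the *string* '2018', not the integer 2018
df = pd.DataFrame({'country': ['AT', 'BE'],
                   'country_name': ['Austria', 'Belgium'],
                   '2018': [1.0, 2.0]})

# reindex matches labels exactly, so an int label that is not present
# produces an all-NaN column...
bad = df.reindex(columns=['country', 'country_name', 2018])
print(bad[2018].isna().all())     # True: 2018 (int) is not a column

# ...while the matching string label keeps the data
good = df.reindex(columns=['country', 'country_name', '2018'])
print(good['2018'].isna().any())  # False

# Inspect the real labels together with their types
print([(c, type(c).__name__) for c in df.columns])
```

If [str(i) for i in years] still yields NaNs on the real frame, printing the labels with their types as above usually reveals the culprit, e.g. hidden whitespace such as ' 2018'.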

Related

KeyError: Float64Index when running drop_duplicates

I have a dataframe with duplicates in the t column; however, when I run the drop_duplicates function the following error is returned. Could someone please explain how to fix this?
print(df.columns)
Index(['t', 'ax', 'ay', 'az', 'gx', 'gy', 'gz', 'mx', 'my', 'mz', 'movement',
'hand', 'trial_1', 'trial_2', 'trial'],
dtype='object')
df.drop_duplicates(subset=left_data.loc[:, 't'], keep='first', inplace=True, ignore_index=True)
KeyError: Int64Index
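No answer is attached here, but a likely cause, guessing from the traceback: subset expects column labels, not the column's values, so left_data.loc[:, 't'] hands pandas the values of t, which it then tries (and fails) to look up as column names. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with duplicates in 't' (stand-in for the real data)
df = pd.DataFrame({'t': [1, 1, 2], 'ax': [0.1, 0.1, 0.2]})

# subset takes column *labels*; pass the label 't', not df.loc[:, 't']
df.drop_duplicates(subset='t', keep='first', inplace=True, ignore_index=True)
print(len(df))  # 2: one of the duplicated t=1 rows was dropped
```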

Common techniques / tips for messy pdf parsing?

I'm working on converting hundreds of PDF files (500+ pages each) into CSV data files. The PDFs have valuable university class grade data that cannot be found elsewhere (these are public record).
I am attempting to use Python and PyPDF to parse through them and extract the data. I will attach my current progress below:
PDF example Image
The format is for the most part predictable. I cannot blindly parse; I will have to define some parsing rules to extract my data.
For example: I know when I see the string "Listing" it is followed by the class code.
Is there any advice or are there techniques for doing this? Is this approaching the bounds of natural language processing?
['Miami', 'PlanHonorsCross', 'ListingACE', '113', 'B', 'Yang', 'Eun', 'Chong', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '1523010101000000000003.38%7.135.714.321.40.07.10.07.10.07.10.00.00.00.00.00.00.00.00.00.00.0ACE', '113', 'C', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '6030120000000100000003.63%46.20.023.10.07.715.40.00.00.00.00.00.00.07.70.00.00.00.00.00.00.0ACE', '113', 'D', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '5211310000000000000003.59%38.515.47.77.723.17.70.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0ACE', '113', 'F', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '6132000000000100000003.81%46.27.723.115.40.00.00.00.00.00.00.00.00.07.70.00.00.00.00.00.00.0Course', 'Total', 'ACE', '113A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA188910960202000200000003.44%27.312.113.615.213.69.10.03.00.03.00.00.00.03.00.00.00.00.00.00.00.0ACE', '310E', 'A', 'Marcus', 'Felice', 'J.', 'American', 'FilmNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '1312441012100100000002.84%4.814.34.89.519.019.04.80.04.89.54.80.00.04.80.00.00.00.00.00.00.0Course', 'Total', 'ACE', '310EA+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA1312441012100100000002.84%4.814.34.89.519.019.04.80.04.89.54.80.00.04.80.00.00.00.00.00.00.0Run:02/12/16', '#', '15:00MIAMI', 'UNIVERSITY', '-', 'OXFORD,', 'OHIO', 'Page:', '16Program:', 'SZRGRDT.SQRGrade', 'Distribution', 'by', 'Campus', 'and', 'by', 'DepartmentSections', 'of', '10Office', 'of', 'the', 'RegistrarSpring', 'Semester,', '2014-15Oxford', 'CampusGrade', 'key:', 'I', '=', 'Incomplete,', 'X', '=', 'Credit/No', 'Credit,', 'Y', '=', 'Research/Credit/No,', 'P', '=', 'Pass/Fail,', 'S', '=', 'Satisfactory', 'Progress']
Above is the text output, which corresponds to the pdf screenshot.
If you want to select specific elements in your data as a list, then try:
data = [...]  # the token list from your output
data[3]
which will return '113', for example. It will probably be cleaner to use a pandas DataFrame; it depends on whether the data is in the same position each time. Selecting specific grade types can be done, except when the counts go into double digits, where it gets tricky.
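Building on the "Listing" rule mentioned in the question, a hedged sketch of rule-based token parsing; the token list below is a shortened stand-in for the real output, and note that "Listing" arrives fused to the class code ('ListingACE'):

```python
import re

# Shortened stand-in for the PyPDF token list from the question
tokens = ['Miami', 'PlanHonorsCross', 'ListingACE', '113', 'B', 'Yang']

records = []
for i, tok in enumerate(tokens):
    # Rule from the question: "Listing" is followed by the class code;
    # here the department prefix is fused onto 'Listing' itself, and
    # the course number is the next token
    m = re.match(r'Listing([A-Z]+)$', tok)
    if m and i + 1 < len(tokens):
        records.append((m.group(1), tokens[i + 1]))

print(records)  # [('ACE', '113')]
```

The same pattern-plus-lookahead approach extends to the other anchors in the dump (e.g. 'Course', 'Total'); no natural language processing is needed while the layout stays this regular.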

Read dynamically CSV files

How do I read CSV files dynamically in Python when the suffix of the file names changes?
Example:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
for n in uf:
    {n} = pd.read_csv('Basico_{n}.csv', encoding='latin1', sep=';', header=0)
The {} is not recognized inside the for-loop.
I want to read files with the different suffix names from the list items and create different DataFrames with the same rules.
You have two main issues:
{n} = is invalid syntax. You can't assign to a dynamically built variable name without messing with locals() or globals(), and doing so is almost always a bad idea anyway, because it's much more difficult to programmatically access names that are effectively hard-coded. If the list of names is dynamic, you end up reaching into globals() to get at them, and that leads to bugs.
'Basico_{n}.csv' is missing the f of f-strings: n will not be interpolated into the string unless you mark it as an f-string by prepending f.
Instead:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
dfs = {} # Create a dict to store the names
for n in uf:
    dfs[n] = pd.read_csv(f'Basico_{n}.csv', encoding='latin1', sep=';', header=0)
Note that f'Basico_{n}.csv' will only work for Python >= 3.6.
For older versions, try str.format (again storing into the dfs dict rather than a dynamic variable name):
dfs[n] = pd.read_csv('Basico_{}.csv'.format(n), encoding='latin1', sep=';', header=0)
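For completeness, a self-contained sketch of the dict approach end to end; the demo writes two tiny CSVs into a temporary directory first, so the file contents and suffixes here are made up:

```python
import os
import tempfile
import pandas as pd

# Write two small demo CSVs, then read them back into a dict keyed
# by suffix -- no dynamic variable names needed
with tempfile.TemporaryDirectory() as tmp:
    for n in ['AC', 'AL']:
        pd.DataFrame({'x': [1, 2]}).to_csv(
            os.path.join(tmp, f'Basico_{n}.csv'), sep=';', index=False)

    dfs = {n: pd.read_csv(os.path.join(tmp, f'Basico_{n}.csv'),
                          sep=';', header=0)
           for n in ['AC', 'AL']}

print(sorted(dfs))     # ['AC', 'AL']
print(len(dfs['AC']))  # 2
```

Each frame is then reachable as dfs['AC'], dfs['AL'], and so on, which also makes it trivial to loop over all of them afterwards.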

Python pygal code - World map

Could you please help me with this code:
import pygal
from pygal.maps.world import World
worldmap_chart = pygal.maps.world.World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
'mk', 'ml', 'mm', 'mn', 'mo',
'mr', 'mt', 'mu', 'mv', 'mw',
'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render()
I use Spyder. Python 3.6
The problem is that the map does not show up in the IPython console. Also, on the second line of the code, I get a yellow triangle note that says: 'pygal.maps.world.World' imported but unused. Maybe this is the reason why the map does not show up.
Otherwise, if it helps, in the IPython console I get only this: runfile('C:/Users/Nikki/.spyder-py3/untitled0.py', wdir='C:/Users/Nikki/.spyder-py3')
Could you please help me to fix this.
Thanks,
Nikki
Putting this here to help others trying to use pygal to create maps.
Following on from what @Carolos said, you can also easily export the maps as HTML, like this:
import pygal
from pygal.maps.world import World
worldmap_chart = World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
'mk', 'ml', 'mm', 'mn', 'mo',
'mr', 'mt', 'mu', 'mv', 'mw',
'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render_to_file('mymap.html')

Plot a histogram of text values

Let's say I have a list of text values (i.e., names), and I want to plot a histogram of those values, with the xticks labeled with those names.
import matplotlib.pyplot as plt
listofnames = ['Al', 'Ca', 'Re', 'Ma', 'Al', 'Ma', 'Ma', 'Re', 'Ca']
a,b,c = plt.hist(listofnames)
First of all, this code gives an error
TypeError: cannot perform reduce with flexible type
which I don't get in my complete program (with a list of >2k names and no more than 12 distinct names). I haven't been able to see why this simple example list gives an error while the complete one doesn't.
But the actual point is: I can do the histogram, but the bins are not labeled with the names. How could I do that?
Thanks
Use the xticks function (arange is numpy's, so import numpy first):
import numpy as np
plt.xticks(np.arange(5), ('Tom', 'Dick', 'Harry', 'Sally', 'Sue'))
Complete example (by the way, your code doesn't work for me either, but instead of your error I get TypeError: len() of unsized object, so I'm building the histogram manually here):
import collections
import matplotlib.pyplot as plt

listofnames = ['Al', 'Ca', 'Re', 'Ma', 'Al', 'Ma', 'Ma', 'Re', 'Ca']
x = collections.Counter(listofnames)
l = range(len(x.keys()))
plt.bar(l, x.values(), align='center')
plt.xticks(l, x.keys())
plt.show()
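If pandas is already in the mix, value_counts plus Series.plot.bar is a shorter route to the same labeled bars (a sketch, assuming pandas and matplotlib are installed):

```python
import pandas as pd

listofnames = ['Al', 'Ca', 'Re', 'Ma', 'Al', 'Ma', 'Ma', 'Re', 'Ca']

# value_counts computes the per-name frequencies directly
counts = pd.Series(listofnames).value_counts()
print(counts['Ma'])  # 3

# Series.plot.bar labels the x ticks with the names automatically
ax = counts.plot.bar()
ax.set_ylabel('count')
```

This sidesteps plt.hist entirely, which is what trips over the string values in the first place.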
