Python pygal code - World map

Could you please help me with this code:
import pygal
from pygal.maps.world import World

worldmap_chart = pygal.maps.world.World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
                                   'mk', 'ml', 'mm', 'mn', 'mo',
                                   'mr', 'mt', 'mu', 'mv', 'mw',
                                   'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render()
I use Spyder with Python 3.6.
The problem is that the map does not show up in the IPython console, and on the second line of the code I get a yellow warning triangle that says: 'pygal.maps.world.World' imported but unused. Maybe this is the reason why the map does not show up.
Otherwise, if it helps, all I get in the IPython console is this: runfile('C:/Users/Nikki/.spyder-py3/untitled0.py', wdir='C:/Users/Nikki/.spyder-py3')
Could you please help me fix this?
Thanks,
Nikki

Putting this here to help others trying to use pygal to create maps.
Yes, following on from what @Carolos said, you can also easily export the maps as HTML, like this:
import pygal
from pygal.maps.world import World
worldmap_chart = World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
                                   'mk', 'ml', 'mm', 'mn', 'mo',
                                   'mr', 'mt', 'mu', 'mv', 'mw',
                                   'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render_to_file('mymap.html')

Related

KeyError: Float64Index when running drop_duplicates

I have a dataframe with duplicates in the t column; however, when I run drop_duplicates the following error is returned. Could someone please explain how to fix this?
print(df.columns)
Index(['t', 'ax', 'ay', 'az', 'gx', 'gy', 'gz', 'mx', 'my', 'mz', 'movement',
'hand', 'trial_1', 'trial_2', 'trial'],
dtype='object')
df.drop_duplicates(subset=left_data.loc[:, 't'], keep='first', inplace=True, ignore_index=True)
KeyError: Float64Index
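For reference, subset expects column labels, not the column's values, so passing left_data.loc[:, 't'] (a Series) makes pandas look up each of its values as a label, which is the likely source of the KeyError. A minimal sketch of the intended call, using made-up data:

```python
import pandas as pd

# Toy data standing in for the real dataframe
df = pd.DataFrame({'t': [1, 1, 2], 'ax': [0.1, 0.2, 0.3]})

# subset takes column *labels*, not the column's values
deduped = df.drop_duplicates(subset=['t'], keep='first', ignore_index=True)
print(deduped)
```

The first of each pair of duplicate t values is kept, and ignore_index=True renumbers the rows from 0.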

Common techniques / tips for messy pdf parsing?

I'm working on converting hundreds of PDF files (500+ pages each) into CSV data files. The PDFs contain valuable university class grade data that cannot be found elsewhere (these are public record).
I am attempting to use Python and PyPDF to parse through it and extract data. I will attach my current progress below:
[PDF example image]
The format is for the most part predictable, but I cannot blindly parse; I will have to define some parsing rules to extract my data.
For example, I know that when I see the string "Listing" it is followed by the class code.
Is there any advice on, or are there techniques for, doing this? Is this approaching the bounds of natural language processing?
['Miami', 'PlanHonorsCross', 'ListingACE', '113', 'B', 'Yang', 'Eun', 'Chong', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '1523010101000000000003.38%7.135.714.321.40.07.10.07.10.07.10.00.00.00.00.00.00.00.00.00.00.0ACE', '113', 'C', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '6030120000000100000003.63%46.20.023.10.07.715.40.00.00.00.00.00.00.07.70.00.00.00.00.00.00.0ACE', '113', 'D', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '5211310000000000000003.59%38.515.47.77.723.17.70.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0ACE', '113', 'F', 'Duffield', 'Ebru', 'D.', 'Reading', 'Writing', 'Acad', 'ContextsNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '6132000000000100000003.81%46.27.723.115.40.00.00.00.00.00.00.00.00.07.70.00.00.00.00.00.00.0Course', 'Total', 'ACE', '113A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA188910960202000200000003.44%27.312.113.615.213.69.10.03.00.03.00.00.00.03.00.00.00.00.00.00.00.0ACE', '310E', 'A', 'Marcus', 'Felice', 'J.', 'American', 'FilmNN', 'A+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA', '1312441012100100000002.84%4.814.34.89.519.019.04.80.04.89.54.80.00.04.80.00.00.00.00.00.00.0Course', 'Total', 'ACE', '310EA+AA-B+BB-C+CC-D+DD-FWWPWFIXYPSAvg', 'GPA1312441012100100000002.84%4.814.34.89.519.019.04.80.04.89.54.80.00.04.80.00.00.00.00.00.00.0Run:02/12/16', '#', '15:00MIAMI', 'UNIVERSITY', '-', 'OXFORD,', 'OHIO', 'Page:', '16Program:', 'SZRGRDT.SQRGrade', 'Distribution', 'by', 'Campus', 'and', 'by', 'DepartmentSections', 'of', '10Office', 'of', 'the', 'RegistrarSpring', 'Semester,', '2014-15Oxford', 'CampusGrade', 'key:', 'I', '=', 'Incomplete,', 'X', '=', 'Credit/No', 'Credit,', 'Y', '=', 'Research/Credit/No,', 'P', '=', 'Pass/Fail,', 'S', '=', 'Satisfactory', 'Progress']
Above is the text output, which corresponds to the pdf screenshot.
If you want to select specific elements in your data as a list, then try:
data = ['your list here']
data[3]
This will return the value '113' as the output, for example. It will probably be cleaner to use a pandas DataFrame; it largely depends on whether the data is in the same position each time. Selecting specific grade types can be done, except when they go into double digits, where it gets tricky.
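Building on the "Listing" rule from the question, one option is to scan the token list for that marker and build records from the tokens that follow. A rough sketch, assuming (as in the output above) the department code is fused onto 'Listing' and the class number and section come next:

```python
# A short excerpt of the token list shown above
tokens = ['Miami', 'PlanHonorsCross', 'ListingACE', '113', 'B', 'Yang']

records = []
for i, tok in enumerate(tokens):
    if tok.startswith('Listing') and len(tok) > len('Listing'):
        records.append({
            'dept': tok[len('Listing'):],  # e.g. 'ACE'
            'number': tokens[i + 1],       # e.g. '113'
            'section': tokens[i + 2],      # e.g. 'B'
        })
print(records)
```

A list of dicts like this drops straight into pandas via pd.DataFrame(records) if a tabular view is wanted later.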

Read CSV files dynamically

How do I read CSV files dynamically in Python when the file name suffix changes?
Example:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
for n in uf:
    {n} = pd.read_csv('Basico_{n}.csv', encoding='latin1', sep=';', header=0)
The {} is not recognized inside the for loop.
I want to read the files whose name suffixes come from the list items and create a different DataFrame for each under the same rules.
You have two main issues:
{n} = is invalid syntax. You can't assign to a dynamically-built variable name without messing with locals or globals, and doing so is almost always a bad idea anyway, because names that are hard-coded in that way are much more difficult to access programmatically. If the list of names is dynamic, you need to start reaching into globals() to get at them, and this leads to bugs.
'Basico_{n}.csv' is missing the f of f-strings. n will not be interpolated into the string unless you mark it as an f-string by prepending f.
Instead:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
dfs = {} # Create a dict to store the names
for n in uf:
    dfs[n] = pd.read_csv(f'Basico_{n}.csv', encoding='latin1', sep=';', header=0)
Note that f'Basico_{n}.csv' will only work for Python >= 3.6. On older versions, use str.format instead:
dfs[n] = pd.read_csv('Basico_{}.csv'.format(n), encoding='latin1', sep=';', header=0)
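For completeness, here is the dict-of-DataFrames pattern end to end; the temporary directory and the tiny semicolon-separated files below are throwaway stand-ins for the real Basico_*.csv files:

```python
import os
import tempfile

import pandas as pd

uf = ['AC', 'AL']  # shortened list for the example

with tempfile.TemporaryDirectory() as tmp:
    # Write throwaway files standing in for the real Basico_*.csv files
    for n in uf:
        with open(os.path.join(tmp, f'Basico_{n}.csv'), 'w') as fh:
            fh.write('a;b\n1;2\n')

    # One DataFrame per suffix, keyed by the suffix itself
    dfs = {n: pd.read_csv(os.path.join(tmp, f'Basico_{n}.csv'), sep=';')
           for n in uf}

print(sorted(dfs))
```

Each frame is then reachable as dfs['AC'], dfs['AL'], and so on, which is the programmatic access that dynamically-created variable names would have prevented.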

Same DataFrame.reindex code - different output

Good afternoon everyone,
I want to filter out from a DataFrame the columns that I am not interested in.
To do that - and since the columns could change based on user input (that I will not show here) - I am using the following code within my offshore_filter function:
# Note: 'df' is my DataFrame, with different country codes as rows and years as columns' headers
import datetime as d
import pandas as pd
COUNTRIES = [
    'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
    'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
    'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
    'TR', 'UA', 'UK', 'XK'
]
YEARS = list(range(2005, int(d.datetime.now().year)))

def offshore_filter(df, countries=COUNTRIES, years=YEARS):
    # This function is specific for filtering out the countries
    # and the years not needed in the analysis

    # Filter out all of the countries not of interest
    df.drop(df[~df['country'].isin(countries)].index, inplace=True)

    # Filter out all of the years not of interest
    columns_to_keep = ['country', 'country_name'] + [i for i in years]
    temp = df.reindex(columns=columns_to_keep)
    df = temp  # This step to avoid the copy vs view complication
    return df
When I pass a years list of integers, the code works well and filters the DataFrame by taking only the columns in the years list.
However, if the DataFrame's column headers are strings (e.g. '2018' instead of 2018), changing [i for i in years] into [str(i) for i in years] doesn't work, and I get columns of NaNs (as the reindex documentation says happens for labels that are not present).
Can you help me spot why?
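For what it's worth, reindex only keeps labels that compare equal, so the string '2018' never matches the integer 2018 and produces an all-NaN column. If [str(i) for i in years] still yields NaNs, the headers are probably not exactly the strings expected (e.g. stray whitespace), which print(repr(df.columns.tolist())) would reveal. A toy illustration of the mismatch:

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0]], columns=[2018, 2019])  # integer headers

# String labels do not match integer headers -> all-NaN column
print(df.reindex(columns=['2018']))

# Normalizing the headers first makes the string labels match
df.columns = df.columns.astype(str)
print(df.reindex(columns=['2018']))  # now returns the real data
```

Converting either the headers or the requested labels to a single common type is the usual fix.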

Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts

The title is more completely: Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts (variable number of tagged parts and variable number of occurrences of tags).
I know more about address parsing than Python, which is probably the underlying source of the problem; how to do this might be obvious. The usaddress library intentionally returns results in this manner, which is presumably useful.
I'm using usaddress which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods," and seems to work very well. Here is the usaddress source and website.
So I run it on a file like:
2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD
4239 SW HWY 101, UNIT 19
1315 NE HARBOR RIDGE
4850 SE 51ST ST
1501 SE EAST DEVILS LAKE RD
1525 NE REGATTA WAY
6458 NE MAST AVE
4009 SW HWY 101
814 SW 9TH ST
1665 SALMON RIVER HWY
3500 NE WEST DEVILS LAKE RD, UNIT 18
1912 NE 56TH DR
3334 NE SURF AVE
2734 SW DUNE CT
2558 NE 33RD ST
2600 NE 33RD ST
5617 NW JETTY AVE
I want to convert those results into something more like a table (CSV or database eventually).
I was not sure what datatypes are returned. Reading the docs tells me that the tag method returns a tuple containing an OrderedDict with tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list and a tuple (apparently with tags). Searching for how to convert a Python list with tagged parts to a table was unsuccessful.
Searching for how to convert a tuple containing an OrderedDict doesn't turn up much. This is the closest that I found. I also found that pandas is good at various formatting tasks, although it was not clear to me how to apply pandas to this. Many of the closest questions I've found, like the opposite question or one with named tuples, have very low scores.
I also made some exploratory attempts to see if it would just work (below). I was able to see a few ways to access the data, and using zip from this Matrix Transpose question got a little closer to a table, since the data and named tags are now separate, although not uniform. Is there a way to take these results, in tagged lists or tuples containing an OrderedDict with tagged parts, to a table? Is there a fairly direct way from the returned results?
Here is the parse method:
## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')
## Read the first line
line = f.readline()
## If the file is not empty, keep reading one line at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## Read the next line
    line = f.readline()
## Close the file
f.close()
And the results produced (notice that when there are multiple parts to a tag it is repeated).
[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
Here is the tag method:
## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')
## Read the first line
line = f.readline()
## If the file is not empty, keep reading one line at a time
## until the file is empty
while line:
    ## Try the tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## Read the next line
    line = f.readline()
## Close the file
f.close()
produces the following output which better handles the combining of multiple parts with the same tag:
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]
Just use the csv.DictWriter class with your tag method:
from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()

# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops;
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys())  # keep track of all field names we see

# newline='' keeps the csv module from doubling line endings on Windows
with open('address_sample.csv', 'w', newline='') as out_file:
    writer = DictWriter(out_file, fieldnames=fields)
    writer.writeheader()
    writer.writerows(tagged_lines)
Note that this is inefficient for large files as it holds the entire contents of your input in memory at once; the only reason for that is that the set of fieldnames (i.e. csv column headers) is unknown beforehand.
If you know the full set you could just do it in one streaming pass, writing tagged output as you read each line. Alternatively, you could do one pass over the file to generate the set of headers, and then a second pass to do the conversion.
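On the last point, sorting the collected field names also gives a stable column order (iterating a set is otherwise unordered). A stdlib-only sketch of the collect-then-write pattern, with made-up dicts standing in for usaddress.tag(...) results:

```python
import csv
import io

# Hypothetical rows standing in for usaddress.tag(...) results
rows = [
    {'AddressNumber': '2244', 'StreetName': '29TH'},
    {'AddressNumber': '4239', 'OccupancyIdentifier': '19'},
]

# Collect every field name seen, then sort for a stable header order
fields = sorted({key for row in rows for key in row})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, restval='')
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

restval='' fills the cells for fields a given row doesn't have, so rows with different tag sets still line up under one header.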
