Hi everyone, I am having problems parsing information from a query I made to the MapQuest API. I am trying to parse data from my geocode_data column and place it into separate columns. Specifically, I am trying to extract the following address components from the geocode data below: street, postalCode, adminArea5 (city), adminArea3 (state), adminArea1 (country), and the displayLatLng lat/lng values.
'providedLocation': {'latLng': {'lat': 52.38330319, 'lng': 4.7959011}}, 'locations': [{'adminArea6Type': 'Neighborhood', 'street': '25 Philip Vingboonsstraat', 'adminArea4Type': 'County', 'adminArea3Type': 'State', 'displayLatLng': {'lat': 52.383324, 'lng': 4.795784}, 'adminArea3': 'Noord-Holland', 'adminArea1Type': 'Country', 'linkId': '0', 'adminArea4': 'MRA', 'dragPoint': False, 'mapUrl': 'http://www.mapquestapi.com/staticmap/v4/getmap?key=Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp&type=map&size=225,160&pois=purple-1,52.3833236,4.7957837,0,0,|&center=52.3833236,4.7957837&zoom=15&rand=-152222465', 'type': 's', 'postalCode': '1067BG', 'latLng': {'lat': 52.383324, 'lng': 4.795784}, 'adminArea5': 'Amsterdam', 'adminArea6': 'Amsterdam', 'geocodeQuality': 'ADDRESS', 'unknownInput': '', 'adminArea5Type': 'City', 'geocodeQualityCode': 'L1AAA', 'adminArea1': 'NL', 'sideOfStreet': 'N'}]}
I have tried building my code, but I keep getting KeyErrors. Can anyone fix my code so that I can extract the different address components for my study? Thanks! My code works until the locations part towards the end; then I get a KeyError.
import pandas as pd
import json
import requests

df = pd.read_csv('/Users/albertgonzalobautista/Desktop/Testing11.csv')
df['geocode_data'] = ''
df['address'] = ''
df['st_pr_mn'] = ''

def reverseGeocode(latlng):
    result = {}
    url = 'http://www.mapquestapi.com/geocoding/v1/reverse?key={1}&location={0}'
    apikey = 'Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp'
    request = url.format(latlng, apikey)
    data = json.loads(requests.get(request).text)
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

for i, row in df.iterrows():
    df['geocode_data'][i] = reverseGeocode(df['lat'][i].astype(str) + ',' + df['lon'][i].astype(str))

for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['locations']:
            print(row['locations'])
            df['st_pr_mn'][i] = row['adminArea3']
First of all, according to your if condition, 'locations' is a key in row['geocode_data'], so you should use row['geocode_data']['locations'], not row['locations']; this is most probably why you are getting the KeyError.
Then, according to the JSON you have given in the OP, it seems the 'locations' key stores a list, so iterate over each element (as you are doing now) and get the required field from component, not row. Example -
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['geocode_data']['locations']:
            print(row['geocode_data']['locations'])
            df['st_pr_mn'][i] = component['adminArea3']
Though this would overwrite df['st_pr_mn'][i] with a new value of component['adminArea3'] for every dictionary in the list at row['geocode_data']['locations']. If there is only one element in the list, that is fine; otherwise you would have to decide how to store the multiple values, perhaps in a list.
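For the multiple-locations case, one way to collect every adminArea3 value into a list per row could look like this. This is a minimal sketch on made-up sample data (the column and key names follow the question, the sample values are invented):

```python
import pandas as pd

# Made-up sample: two rows, the second with two candidate locations
df = pd.DataFrame({'geocode_data': [
    {'locations': [{'adminArea3': 'Noord-Holland'}]},
    {'locations': [{'adminArea3': 'Utrecht'}, {'adminArea3': 'Gelderland'}]},
]})

def states(geocode):
    # Guard against rows where the reverse geocode returned nothing
    if not isinstance(geocode, dict) or 'locations' not in geocode:
        return []
    return [loc.get('adminArea3') for loc in geocode['locations']]

df['st_pr_mn'] = df['geocode_data'].apply(states)
print(df['st_pr_mn'].tolist())  # [['Noord-Holland'], ['Utrecht', 'Gelderland']]
```

Using apply also avoids the chained-assignment pattern (`df['col'][i] = ...`) that pandas warns about.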
I have a dataframe with a column called "Spl" with the values below. I am trying to extract the values next to the 'name': strings (some rows have multiple values), but the new column is generated with the memory location of a function instead. I used the code below. Any help on how to extract the values after the 'name': string is much appreciated.
Column values:
'name': 'Chirotherapie', 'name': 'Innen Medizin'
'name': 'Manuelle Medizin'
'name': 'Akupunktur', 'name': 'Chirotherapie', 'name': 'Innen Medizin'
Code:
df['Spl'] = lambda x: len(x['Spl'].str.split("'name':"))
Output:
<function <lambda> at 0x0000027BF8F68940>
Simply do:
df['Spl']=df['Spl'].str.split("'name':").str.len()
Or just count the occurrences:
df['Spl'] = df['Spl'].str.count("'name':")+1
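If the goal is the actual names rather than a count, a regex with str.findall could work. A sketch, assuming the quoting style shown in the sample values:

```python
import pandas as pd

df = pd.DataFrame({'Spl': [
    "'name': 'Chirotherapie', 'name': 'Innen Medizin'",
    "'name': 'Manuelle Medizin'",
]})

# Capture whatever sits in quotes right after each 'name':
df['names'] = df['Spl'].str.findall(r"'name':\s*'([^']+)'")
print(df['names'].tolist())  # [['Chirotherapie', 'Innen Medizin'], ['Manuelle Medizin']]
```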
I have a column in a pandas data frame that contains strings in the following format, for example:
fullyRandom=true+mapSizeDividedBy64=51048
mapSizeDividedBy16000=9756+fullyRandom=false
qType=MpmcArrayQueue+qCapacity=822398+burstSize=664
count=11087+mySeed=2+maxLength=9490
capacity=27281
capacity=79882
We can read the first row, for example, as two parameters separated by '+'; each parameter has a value, and '=' separates the parameter from its value.
As output, I am asking whether there is a Python script that either extracts the parameters so that we retrieve a list of unique parameters like the following:
[fullyRandom, mapSizeDividedBy64, mapSizeDividedBy16000, qType, qCapacity, burstSize, count, mySeed, maxLength, capacity]
Notice from the previous list that it contains only the unique parameter names, without their values.
Or, if it is not too difficult, an extended pandas data frame in which this column is parsed and converted into many columns, one column per parameter, storing its value.
Try this, it will store the values in a list.
data = []
with open('<your text file>', 'r') as file:
    content = file.readlines()
    for row in content:
        if '+' in row:
            sub_row = row.strip('\n').split('+')
            for r in sub_row:
                data.append(r)
        else:
            data.append(row.strip('\n'))
print(data)
Output:
['fullyRandom=true', 'mapSizeDividedBy64=51048', 'mapSizeDividedBy16000=9756', 'fullyRandom=false', 'qType=MpmcArrayQueue', 'qCapacity=822398', 'burstSize=664', 'count=11087', 'mySeed=2', 'maxLength=9490', 'capacity=27281', 'capacity=79882']
To convert to a list of dicts that could be used in pandas:
dict_list = []
for item in data:
    entry = {
        item.split('=')[0]: item.split('=')[1]
    }
    dict_list.append(entry)
print(dict_list)
Output:
[{'fullyRandom': 'true'}, {'mapSizeDividedBy64': '51048'}, {'mapSizeDividedBy16000': '9756'}, {'fullyRandom': 'false'}, {'qType': 'MpmcArrayQueue'}, {'qCapacity': '822398'}, {'burstSize': '664'}, {'count': '11087'}, {'mySeed': '2'}, {'maxLength': '9490'}, {'capacity': '27281'}, {'capacity': '79882'}]
To get just the parameter names, collect the left-hand side of each split into its own list:
name_list = []
for item in data:
    name_list.append(item.split('=')[0])
print(name_list)
Output:
['fullyRandom', 'mapSizeDividedBy64', 'mapSizeDividedBy16000', 'fullyRandom', 'qType', 'qCapacity', 'burstSize', 'count', 'mySeed', 'maxLength', 'capacity', 'capacity']
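For the second request, one column per parameter, a pandas sketch could look like this. The column name 'raw' and the two sample rows are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'raw': [
    'fullyRandom=true+mapSizeDividedBy64=51048',
    'capacity=27281',
]})

# Split each row into key=value pairs, turn each into a dict,
# then expand the dicts into columns (missing parameters become NaN)
pairs = df['raw'].str.split('+').apply(
    lambda parts: dict(p.split('=', 1) for p in parts)
)
wide = pd.DataFrame(pairs.tolist())
print(wide)
```

Rows that lack a given parameter simply get NaN in that column, which keeps the frame rectangular.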
My goal here is to clean up address data from individual CSV files using dictionaries for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions, and street types, each in their own column. I used the following code to do the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need separate dictionaries, as some columns need specific typo fixes. For example, housenum for the house-number column, stdir for the street direction, and so on, each with their column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}

stdir = {
    'NULL': ''}
I have no idea how to proceed; I feel it's something simple, or that I would need pandas, but I am unsure how to continue. I would appreciate any help! Also, is there any way to group several typos together with one corrected value? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd

df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file

# let's say you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}

df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for other columns of interest.
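On the "unhashable type" error: a list cannot be a dict key, but you can build the mapping from a list with a dict comprehension, so several typos share one correction. A sketch with made-up typo values:

```python
import pandas as pd

# Several spellings that should all collapse to the same corrected value
typos = ['Stret', 'Steet', 'Streeet']
missad = {typo: 'Street' for typo in typos}

df = pd.DataFrame({'sttype': ['Stret', 'Ave', 'Streeet']})
df['sttype'] = df['sttype'].replace(missad)
print(df['sttype'].tolist())  # ['Street', 'Ave', 'Street']
```

The same pattern works for each column-specific dictionary (housenum, stdir, and so on).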
I am using python-docx to extract particular table data from a Word file.
I have a Word file with multiple tables; this is the particular table among them,
and the retrieved data needs to be arranged like this.
Challenges:
Can I find a particular table in a Word file using python-docx?
Can I achieve my requirement using python-docx?
This is not a complete answer, but it should point you in the right direction, and is based on some similar task I have been working on.
I ran the following code in Python 3.6 in a Jupyter notebook, but it should also work in plain Python.
First we start by importing the docx Document module and pointing to the document we want to work with.
from docx.api import Document
document = Document(<your path to doc>)
We create a list of tables and print how many tables there are in it, and we create a list to hold all the tabular data.
tables = document.tables
print(len(tables))
big_data = []
Next we loop through the tables:
for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    # print(data)
    big_data.append(data)
print(big_data)
By looping through all the tables, we read the data, creating a list of lists. Each individual list represents a table, and within that we have dictionaries per row. Each dictionary contains a key / value pair. The key is the column heading from the table and value is the cell contents for that row's data for that column.
So, that is half of your problem. The next part would be to use python-docx to create a new table in your output document - and to fill it with the appropriate content from the list / list / dictionary data.
In the example I have been working on this is the final table in the document.
When I run the routine above, this is my output:
[[{'Version': '1', 'Changes': 'Local Outcome Improvement Plan ', 'Page Number': '1-34 and 42-61', 'Approved By': 'CPA Board\n', 'Date ': '22 August 2016'},
{'Version': '2', 'Changes': 'People are resilient, included and supported when in need section added ', 'Page Number': '35-41', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'},
{'Version': '2', 'Changes': 'Updated governance and accountability structure following approval of the Final Report for the Review of CPA Infrastructure', 'Page Number': '59', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}]]
I would like to expand on a previously asked question:
Nested For Loop with Unequal Entities
In that question, I requested a method to extract the location's type (Hospital, Urgent Care, etc) in addition to the location's name (WELLSTAR ATLANTA MEDICAL CENTER, WELLSTAR ATLANTA MEDICAL CENTER SOUTH, etc).
The answer suggested utilizing a for loop and dictionary to collect the values and keys. The code snippet appears below:
from pprint import pprint
import requests
from bs4 import BeautifulSoup

url = "https://www.wellstar.org/locations/pages/default.aspx"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

d = {}
for row in soup.select(".WS_Content > .WS_LeftContent > table > tr"):
    title = row.h3.get_text(strip=True)
    d[title] = [item.get_text(strip=True) for item in row.select(".PurpleBackgroundHeading a")]

pprint(d)
I would like to extend the existing solution to include the entity's address matched with the appropriate key-value combination. If the best solution is to utilize something other than a dictionary, I'm open to that suggestion as well.
Let's say you have a dict my_dict and you want to add the value 2 under the key my_key. Simply do:
my_dict['my_key'] = 2
Let's say you have a dict d = {'Name': 'Zara', 'Age': 7} and now you want to add another value,
'Sex' = 'female'
You can use the built-in update method.
d.update({'Sex': 'female' })
print("Value : %s" % d)
Value : {'Age': 7, 'Name': 'Zara', 'Sex': 'female'}
Reference: https://www.tutorialspoint.com/python/dictionary_update.htm
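Back to the original scraping question: extending the dictionary approach to pair each location name with its address might look like this. This is an offline sketch on a simplified, made-up HTML snippet; the real page's markup may differ, and the "address" class selector is an assumption:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure described in the question
html = """
<table><tr>
  <td><h3>Hospitals</h3>
    <div class="PurpleBackgroundHeading">
      <a>WELLSTAR ATLANTA MEDICAL CENTER</a>
    </div>
    <p class="address">303 Parkway Dr NE, Atlanta, GA</p>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
d = {}
for row in soup.select("table > tr"):
    title = row.h3.get_text(strip=True)
    names = [a.get_text(strip=True) for a in row.select(".PurpleBackgroundHeading a")]
    addresses = [p.get_text(strip=True) for p in row.select(".address")]
    # Pair each name with its address so the key maps to (name, address) tuples
    d[title] = list(zip(names, addresses))
print(d)
```

The zip keeps names and addresses matched positionally, which only holds if the page lists exactly one address per name.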