I am using python-docx to extract the data from one particular table in a Word file.
I have a Word file with multiple tables. This is the particular table among them
and the retrieved data needs to be arranged like this.
Challenges:
Can I find a particular table in a Word file using python-docx?
Can I achieve my requirement using python-docx?
This is not a complete answer, but it should point you in the right direction. It is based on a similar task I have been working on.
I ran the following code in Python 3.6 in a Jupyter notebook, but it should work in plain Python as well.
First we start by importing Document from docx and pointing it at the document we want to work with.
from docx.api import Document
document = Document(<your path to doc>)
We get the list of tables and print how many tables it contains. We also create a list to hold all the tabular data.
tables = document.tables
print (len(tables))
big_data = []
Next we loop through the tables:
for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            # the first row holds the column headings
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
        #print (data)
    big_data.append(data)
print(big_data)
By looping through all the tables, we read the data, creating a list of lists. Each individual list represents a table, and within that we have dictionaries per row. Each dictionary contains a key / value pair. The key is the column heading from the table and value is the cell contents for that row's data for that column.
So, that is half of your problem. The next part would be to use python-docx to create a new table in your output document - and to fill it with the appropriate content from the list / list / dictionary data.
In the example I have been working on this is the final table in the document.
When I run the routine above, this is my output:
[{'Version': '1', 'Changes': 'Local Outcome Improvement Plan ', 'Page Number': '1-34 and 42-61', 'Approved By': 'CPA Board\n', 'Date ': '22 August 2016'},
{'Version': '2', 'Changes': 'People are resilient, included and supported when in need section added ', 'Page Number': '35-41', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'},
{'Version': '2', 'Changes': 'Updated governance and accountability structure following approval of the Final Report for the Review of CPA Infrastructure', 'Page Number': '59', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}]]
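As for finding one particular table: a minimal sketch, assuming the table you want can be recognised by its column headings. It works on the big_data structure built above. The function name find_table and the sample data are made up for illustration, and headings are stripped because cell text often carries trailing spaces (like 'Date ' in my output).

```python
def find_table(big_data, required_headings):
    """Return the first table whose rows contain all required headings, else None."""
    required = {heading.strip() for heading in required_headings}
    for table in big_data:
        # A table with only a header row ends up as an empty list and is skipped.
        if table and required <= {key.strip() for key in table[0]}:
            return table
    return None


# Hypothetical sample in the same list / list / dictionary shape as big_data.
sample_big_data = [
    [{'Name': 'a', 'Value': '1'}],
    [{'Version': '1', 'Changes': 'x', 'Date ': '22 August 2016'}],
]

version_table = find_table(sample_big_data, ['Version', 'Date'])
print(version_table)
```

If several tables share the same headings this only returns the first one, so you may need an extra check on the cell contents.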
I've created a generator object and want to write its records out to a CSV file so I can upload it to an external tool. At the minute the generator returns the records as separate dictionaries, but there don't appear to be any commas separating the records/dictionaries, and when I write them out to a txt file and load that back into the script it comes back as a <class 'str'>.
The generator object is declared as:
matches =
{'type_of_reference': 'JOUR', 'title': 'Ranking evidence in substance use and addiction', 'secondary_title': 'International Journal of Drug Policy', 'alternate_title1': 'Int. J. Drug Policy', 'volume': '83', 'year': '2020', 'doi': '10.1016/j.drugpo.2020.102840'}
{'type_of_reference': 'JOUR', 'title': 'Methods used in the selection of instruments for outcomes included in core outcome sets have improved since the publication of the COSMIN/COMET guideline', 'secondary_title': 'Journal of Clinical Epidemiology', 'alternate_title1': 'J. Clin. Epidemiol.', 'volume': '125', 'start_page': '64', 'end_page': '75', 'year': '2020', 'doi': '10.1016/j.jclinepi.2020.05.021',}
This is the result of the following generator function, which compares each record's "doi" key against a set of DOIs from another file.
def match_record():
    with open(filename_ris) as f:
        ris_records = readris(f)
        for entry in ris_records:
            if entry['doi'] in doi_match:
                yield entry
I've written the contents of the matches generator to a txt file with the following code, to check that the correct records have been kept.
with open('output.txt', 'w') as f:
    for x in matches:
        f.write(str(x))
What I have is neither a list of dictionaries nor dictionaries separated by commas, so I'm a bit confused about how to read/load it into pandas effectively. I want to load it into pandas to drop certain series [keys] and then write it out as a CSV once completed.
I'm reading it in using pd.read_csv, which just returns the key: value pairs of all the separate records as column headers. That is no surprise, but I don't know what to do before this step.
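One possible approach, sketched under assumptions: skip the txt file entirely and build the DataFrame straight from the generator, since a generator of dictionaries can be turned into a list of dictionaries with list(). The match_record below is a simplified stand-in that takes its inputs as arguments, the second record is invented test data, and the dropped column is just an example.

```python
import io

import pandas as pd


def match_record(records, doi_match):
    """Yield only the records whose DOI is in doi_match (stand-in for the real generator)."""
    for entry in records:
        # .get avoids a KeyError for records that have no 'doi' key at all.
        if entry.get('doi') in doi_match:
            yield entry


records = [
    {'type_of_reference': 'JOUR',
     'title': 'Ranking evidence in substance use and addiction',
     'year': '2020',
     'doi': '10.1016/j.drugpo.2020.102840'},
    # Invented record that should be filtered out.
    {'type_of_reference': 'JOUR', 'title': 'Unmatched record',
     'year': '2019', 'doi': '10.0000/example'},
]
doi_match = {'10.1016/j.drugpo.2020.102840'}

# list() consumes the generator; each dictionary becomes one row.
df = pd.DataFrame(list(match_record(records, doi_match)))
df = df.drop(columns=['type_of_reference'])

# Written to an in-memory buffer here; pass a filename to to_csv for a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Records with different keys are fine: pandas fills the missing cells with NaN.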
My source dictionary is something like this (only a few keys shown as an example):
{'aws_resource_name': 'abcd', 'resource_type': 'instance', 'policies': ['LAB_TEMP']}
What I am trying to get is the list-like values as plain strings in the JSON output.
info = []
Account_Name = acc_name
for resource in result[acc_name]["resources"]:
    if Hostname == resource["aws_resource_name"]:
        print(resource)
        #Policy =(resource["policies"])
        Policy = resource['policies']
        info.append({"Account Name": Account_Name, "policy Name": Policy})
print(info)
Current output:
[{'Account Name': 'xxxxxx', 'policy Name': ['LAB_TEMP']}]
expected output:
[{'Account Name': 'xxxxxx', 'policy Name': 'LAB_TEMP'}]
The problem is that some of the values in the source dict are lists, and I need to convert them into strings before I finally print to JSON.
If in your use case the policies list will either hold just one value, or the first value is the one you need, then you just need to add index 0 to your code, i.e. Policy = resource['policies'][0]
If the list can also be empty at times, you would need to add a check for that as well.
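A minimal sketch of such a check, assuming you would rather join all the values than keep only the first. The helper name policies_to_string and the ", " separator are my own choices, not something from your code.

```python
def policies_to_string(policies, separator=", "):
    """Turn a list of policy names into one string; pass plain strings through."""
    if isinstance(policies, str):
        return policies
    # An empty list (or None) becomes an empty string instead of raising IndexError.
    return separator.join(str(policy) for policy in (policies or []))


resource = {'aws_resource_name': 'abcd', 'resource_type': 'instance', 'policies': ['LAB_TEMP']}

info = [{"Account Name": "xxxxxx", "policy Name": policies_to_string(resource['policies'])}]
print(info)
```

With a single-element list this gives the same string as resource['policies'][0], but it does not fail when the list is empty.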
My goal here is to clean up address data from individual CSV files using a dictionary for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions and street type each have their own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    # apply every replacement from the dictionary that was passed in
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need separate dictionaries, as some columns need specific typo fixes. For example, housenum for the Housenumbers column, stdir for the street direction and so on, each with its column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed. I feel it's something simple, or that I would need pandas, but I am unsure how to continue. Would appreciate any help! Also, is there any way to group several typos together under one corrected value? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd
df = pd.read_csv(filename, index_col=False)  # use pandas to read in the CSV file
# say you want to apply corrections to the 'columnforcorrection' column of this dataframe
correctiondict = {
    'one': 1,
    'two': 2
}
df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
Then use the same idea for the other columns of interest.
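To cover the per-column dictionaries and the grouped typos from the question, here is a rough sketch. It assumes the column names housenum and stdir and uses made-up sample rows. Tuples are used as keys because lists are unhashable, and a small helper (expand, my own name) flattens each group into the one-typo-per-key form that Series.replace accepts.

```python
import pandas as pd


def expand(grouped):
    """Flatten {(typo1, typo2): fix} into {typo1: fix, typo2: fix}."""
    return {typo: fix for typos, fix in grouped.items() for typo in typos}


# One dictionary of corrections per column; tuples group several typos under one fix.
corrections = {
    'housenum': expand({('One', 'one', 'ONE'): '1', ('Two', 'two'): '2'}),
    'stdir': {'NULL': ''},
}

# Made-up sample data standing in for pd.read_csv('original.csv').
df = pd.DataFrame({'housenum': ['One', 'two', '3'], 'stdir': ['N', 'NULL', 'S']})

for column, mapping in corrections.items():
    df[column] = df[column].replace(mapping)

print(df)
```

Note that replace with a dictionary matches whole cell values, not substrings inside a cell, which is usually what you want for one-value-per-column data like this.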
I'm working on a project comparing different sorting algorithms. I already have a data-generating script, which can time everything. I need this data to fit in a table (I'm using OriginPro 8) like this one:
But what should I write in the Python script so that, when I import the .csv file, it looks exactly like this table?
Right now I have this structure:
{'bubble_sort': {'BEST': {'COMP': 999000, 'PERM': 0, 'TIME': 1072.061538696289},
'RND': {'COMP': 999000,
'PERM': 249853,
'TIME': 1731.0991287231445},
'WORST': {'COMP': 999000,
'PERM': 499500,
'TIME': 2358.1347465515137}},
'hoare_sort': {'BEST': {'COMP': 10975, 'PERM': 0, 'TIME': 14.000654220581055}, #and so on
And this code to save it:
import csv

def write_csv_in_file(fn, data):
    # newline='' stops the csv module from writing blank lines on Windows
    with open(fn + ".csv", 'w', newline='') as file:
        writer = csv.writer(file)
        for key, value in data.items():
            writer.writerow([key, value])
After importing, I get this table:
And it is far from the layout I need.
What I want is this:
Let's say that this data was collected on a best-case array of length 100. Then the 1st row of the first table should hold the values from ['bubble_sort']['BEST']['TIME'], ['hoare_sort']['BEST']['TIME'] and so on. Then I'd make the same tables for the worst-case scenario (["WORST"]) and the random one (["RND"]), and then repeat everything for the number of comparisons (["COMP"]) and permutations done (["PERM"]).
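A possible way to get that layout, sketched under the assumption that the results for each array length are kept in a dictionary keyed by length (results_by_length is a hypothetical name, as is write_metric_table). Each call writes one table for one case and one metric, with one column per algorithm and one row per array length.

```python
import csv
import io


def write_metric_table(file, results_by_length, case, metric):
    """Write one CSV table: a row per array length, a column per algorithm."""
    algorithms = sorted({name for data in results_by_length.values() for name in data})
    writer = csv.writer(file)
    writer.writerow(['length'] + algorithms)
    for length in sorted(results_by_length):
        data = results_by_length[length]
        writer.writerow([length] + [data[name][case][metric] for name in algorithms])


# Values taken from the structure above; the length 100 is the assumed array size.
results_by_length = {
    100: {
        'bubble_sort': {'BEST': {'COMP': 999000, 'PERM': 0, 'TIME': 1072.061538696289}},
        'hoare_sort': {'BEST': {'COMP': 10975, 'PERM': 0, 'TIME': 14.000654220581055}},
    },
}

# An in-memory buffer is used here; with a real file, open it with newline=''.
buffer = io.StringIO()
write_metric_table(buffer, results_by_length, 'BEST', 'TIME')
csv_text = buffer.getvalue()
print(csv_text)
```

Looping over the cases ("BEST", "RND", "WORST") and the metrics ("TIME", "COMP", "PERM") and writing each combination to its own file would then give the full set of tables.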
Hi everyone, I am having problems parsing information out of a query I made to the MapQuest API. I am trying to parse the data in my geocode_data column and place it into separate columns. Specifically, I am trying to extract the following address components from the geocode data below: street, postalCode, adminArea5, adminArea3, adminArea1, and the lat and lng values of displayLatLng.
{'providedLocation': {'latLng': {'lat': 52.38330319, 'lng': 4.7959011}}, 'locations': [{'adminArea6Type': 'Neighborhood', 'street': '25 Philip Vingboonsstraat', 'adminArea4Type': 'County', 'adminArea3Type': 'State', 'displayLatLng': {'lat': 52.383324, 'lng': 4.795784}, 'adminArea3': 'Noord-Holland', 'adminArea1Type': 'Country', 'linkId': '0', 'adminArea4': 'MRA', 'dragPoint': False, 'mapUrl': 'http://www.mapquestapi.com/staticmap/v4/getmap?key=Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp&type=map&size=225,160&pois=purple-1,52.3833236,4.7957837,0,0,|&center=52.3833236,4.7957837&zoom=15&rand=-152222465', 'type': 's', 'postalCode': '1067BG', 'latLng': {'lat': 52.383324, 'lng': 4.795784}, 'adminArea5': 'Amsterdam', 'adminArea6': 'Amsterdam', 'geocodeQuality': 'ADDRESS', 'unknownInput': '', 'adminArea5Type': 'City', 'geocodeQualityCode': 'L1AAA', 'adminArea1': 'NL', 'sideOfStreet': 'N'}]}
I have tried building my code but I keep getting KeyErrors. Can anyone fix my code so that I am able to extract the different address components for my study? Thanks! My code works up to the locations part towards the end; then I get a KeyError.
import pandas as pd
import json
import requests

df = pd.read_csv('/Users/albertgonzalobautista/Desktop/Testing11.csv')
df['geocode_data'] = ''
df['address'] = ''
df['st_pr_mn'] = ' '

def reverseGeocode(latlng):
    result = {}
    url = 'http://www.mapquestapi.com/geocoding/v1/reverse?key={1}&location={0}'
    apikey = 'Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp'
    request = url.format(latlng, apikey)
    data = json.loads(requests.get(request).text)
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

for i, row in df.iterrows():
    df['geocode_data'][i] = reverseGeocode(df['lat'][i].astype(str) + ',' + df['lon'][i].astype(str))

for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['locations']:
            print (row['locations'])
            df['st_pr_mn'][i] = row['adminArea3']
First of all, according to your if condition, locations is a key in row['geocode_data'], so you should use row['geocode_data']['locations'], not row['locations']. This is most probably the reason you are getting the KeyError.
Then, according to the JSON you have given in the question, the locations key stores a list, so iterate over each element (as you are doing now) and read the required value from component, not from row. Example -
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['geocode_data']['locations']:
            print (row['geocode_data']['locations'])
            df['st_pr_mn'][i] = component['adminArea3']
Note that this overwrites df['st_pr_mn'][i] with a new value of component['adminArea3'] for every dictionary in the row['geocode_data']['locations'] list. If there is only one element in the list, that is fine; otherwise you would have to decide how to store the multiple values, maybe in a list.
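If you want several address components at once, something along these lines might work. It is only a sketch: it assumes the first entry in locations is the one you want, the output column names in FIELDS are my own, and the sample rows are a trimmed copy of the data in the question plus an empty result.

```python
import pandas as pd

# Output column name -> key in the MapQuest location dictionary.
FIELDS = {
    'street': 'street',
    'postal_code': 'postalCode',
    'city': 'adminArea5',
    'state': 'adminArea3',
    'country': 'adminArea1',
}


def extract_address(geocode_data):
    """Pick the chosen fields from the first location; missing values become None."""
    locations = geocode_data.get('locations', []) if isinstance(geocode_data, dict) else []
    first = locations[0] if locations else {}
    return {column: first.get(key) for column, key in FIELDS.items()}


# Trimmed sample: one geocoded row and one row where the lookup returned nothing.
df = pd.DataFrame({'geocode_data': [
    {'locations': [{'street': '25 Philip Vingboonsstraat', 'postalCode': '1067BG',
                    'adminArea5': 'Amsterdam', 'adminArea3': 'Noord-Holland',
                    'adminArea1': 'NL'}]},
    {},
]})

# Build the new columns in one go and attach them by index.
address = pd.DataFrame([extract_address(d) for d in df['geocode_data']], index=df.index)
df = df.join(address)
print(df[['street', 'postal_code', 'city', 'state', 'country']])
```

Building the extra columns in one DataFrame and joining them also avoids assigning cell by cell inside iterrows.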