Importing xml file into Access database with defined id - python

I am struggling to import an enormous amount of data from XML files into Access.
The problem I am facing is that the files I want to import contain an id on the first row:
<vin id="11111111111111111">
<description>Mazda3 L 2.0l MZR 150 PS 4T 5AG AL-EDITION TRA-P</description>
<type>BL</type>
<typeapproval>e11*2001/116*0262*07</typeapproval>
<variant>B2F</variant>
<version>7EU</version>
<series>Mazda3</series>
<body>L</body>
<engine>2.0l MZR 150 PS</engine>
<grade>AL-EDITION</grade>
<transmission>5AG</transmission>
<colourtype>Mica</colourtype>
<extcolourcode>34K</extcolourcode>
<extcolourcodedescription>Crystal White Pearl</extcolourcodedescription>
<intcolourcode>BU4</intcolourcode>
<intcolourcodedescription>Black</intcolourcodedescription>
<registrationdate>2012-07-20</registrationdate>
<productiondate>2011-11-30</productiondate>
</vin>
so the result of my import is all the lines except for the VIN number of the vehicle, which is actually defined as the id.
I was trying to manually replace characters like "> etc. to get rid of that id, but I have dozens of files and hundreds of thousands of records in each file, so it is quite a pain...
so I thought about concatenating all the files together with a script in python:
import os

dirName = r'C:\Users\dawid\Desktop\DE_DATA\Mazda_DE\VINs_DE\Mazda\xml'
out_file = r'C:\Users\dawid\Desktop\DE_DATA\Mazda_DE\VINs_DE\Mazda\Output.xml'

def getListOfFiles(dirName):
    # create a list of file and sub directory
    # names in the given directory
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then recurse into it
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles

listOfFileOut = getListOfFiles(dirName)

with open(out_file, 'w', encoding='ANSI') as outfile:
    for fname in listOfFileOut:
        with open(fname, encoding='ANSI') as infile:
            for line in infile:
                outfile.write(line)
print("Done")
But this completely destroyed the structure of the XML file and I cannot import it anymore.
Could anyone suggest whether it's possible to use Python to get rid of all those ids, so that I can import the whole database into Access?
Thank you in advance.

Try this.
from simplified_scrapy import utils, SimplifiedDoc, req
dirName = r'C:\Users\dawid\Desktop\DE_DATA\Mazda_DE\VINs_DE\Mazda\xml'
listFile = utils.getSubFile(dirName, end='.xml')
for f in listFile:
    doc = SimplifiedDoc(utils.getFileContent(f, encoding='ANSI'))
    doc.replaceReg('<vin[^>]*>', '<vin>')
    print(doc.html)
    # utils.saveFile(f, doc.html, encoding='ANSI')  # write back to the original file
Result:
<vin>
<description>Mazda3 L 2.0l MZR 150 PS 4T 5AG AL-EDITION TRA-P</description>
<type>BL</type>
<typeapproval>e11*2001/116*0262*07</typeapproval>
<variant>B2F</variant>
<version>7EU</version>
...
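If installing a third-party package is not an option, the same replacement can be done with the standard library alone. A minimal sketch, assuming the directory layout from the question (run it on copies of the files first, since it rewrites them in place):

```python
import os
import re

def strip_vin_ids(xml_text):
    # replace any <vin ...> opening tag that carries attributes with a bare <vin>
    return re.sub(r'<vin\b[^>]*>', '<vin>', xml_text)

dirName = r'C:\Users\dawid\Desktop\DE_DATA\Mazda_DE\VINs_DE\Mazda\xml'

for root, dirs, files in os.walk(dirName):
    for name in files:
        if name.lower().endswith('.xml'):
            path = os.path.join(root, name)
            with open(path, encoding='ANSI') as fh:
                cleaned = strip_vin_ids(fh.read())
            with open(path, 'w', encoding='ANSI') as fh:
                fh.write(cleaned)
```

After that, the files should import into Access with a plain `<vin>` element and no attribute to trip over.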

Related

How to get the filename in directory with the max number in the filename python?

I have some xml files in a folder, for example 'assests/2020/2010.xml', 'assests/2020/20005.xml', 'assests/2020/20999.xml' etc. I want to get the filename with the max value in the '2020' folder. For the three files above the output should be 20999.xml.
I am trying as following:
import glob
import os
list_of_files = glob.glob('assets/2020/*')
# latest_file = max(list_of_files, key=os.path.getctime)
# print (latest_file)
I couldn't find the logic to get the required file.
Here is the resource that has the best answer to my query, but I couldn't build my logic from it.
You can use pathlib to glob for the xml files and access the Path object attributes like .name and .stem:
from pathlib import Path
list_of_files = Path('assets/2020/').glob('*.xml')
print(max((Path(fn).name for fn in list_of_files), key=lambda fn: int(Path(fn).stem)))
Output:
20999.xml
I can't test it out right now, but you may try this:
files = []
for filename in list_of_files:
    filename = str(filename)
    filename = filename.replace('.xml', '')  # assuming it's not printing your complete directory path
    filename = int(filename)
    files += [filename]
print(files)
This should get you your filenames in integer format and now you should be able to sort them in descending order and get the first item of the sorted list.
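Put together, that approach might look like this (assuming the names come back without the directory prefix):

```python
list_of_files = ['2010.xml', '20005.xml', '20999.xml']

# strip the extension, convert to int, sort in descending order
numbers = sorted((int(name.replace('.xml', '')) for name in list_of_files), reverse=True)

# the first item of the sorted list is the largest number
largest = str(numbers[0]) + '.xml'
print(largest)  # 20999.xml
```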
Use re to search for the appropriate endings in your file paths. If found, use re again to extract the number.
import re

list_of_files = [
    'assests/2020/2010.xml',
    'assests/2020/20005.xml',
    'assests/2020/20999.xml'
]
highest_nr = -1
highest_nr_file = ''
for f in list_of_files:
    re_result = re.findall(r'\d+\.xml$', f)
    if re_result:
        nr = int(re.findall(r'\d+', re_result[0])[0])
        if nr > highest_nr:
            highest_nr = nr
            highest_nr_file = f
print(highest_nr_file)
Result
assests/2020/20999.xml
You can also try this way.
import os, re

path = "assests/2020/"
files = [
    "assests/2020/2010.xml",
    "assests/2020/20005.xml",
    "assests/2020/20999.xml"
]
n = [int(re.findall(r'\d+\.xml$', file)[0].split('.')[0]) for file in files]
output = str(max(n)) + ".xml"
print("Biggest max file name of .xml file is ", os.path.join(path, output))
Output:
Biggest max file name of .xml file is assests/2020/20999.xml
import glob

xmlFiles = []
# this will store the numeric stems of all the xml files in your directory
for file in glob.glob("*.xml"):
    xmlFiles.append(file[:-4])
# this will print the maximum one (compared as integers, not strings)
print(max(xmlFiles, key=int) + ".xml")

I have a folder with many .tar.gz files. In python I want to go into each file unzip or compress and find text file that has string I want to extract?

I have a main folder with many .tar.gz compressed files, so I need to unzip twice to get to a data file with text; then I am extracting a certain string from the text. I am having trouble unzipping to get to the file with the text, then moving to the next file and doing the same, saving the results in a dataframe.
import os
import tarfile
for i in os.listdir(r'\user\project gz'):
    tar = (i, "r:gz")
    for m in tar.getmembers():
        f = tar.extractfile(member)
        if f is not None:
            content = f.read()
            text = re.findall(r"\name\s", content)
            df = pd.Dataframe(text)
            print(df)
I guess you want to find out which files contain the string \name\s in \user\project gz\*.tar.gz?
A solution is
import os
import re
import tarfile
import pandas as pd

row = []
value = []
base = r'\\user\\project gz'
for filename in os.listdir(base):
    if filename.endswith('.tar.gz'):
        tar = tarfile.open(os.path.join(base, filename))
        for text_file in tar.getmembers():
            f = tar.extractfile(text_file)
            if f is not None:
                content = f.read().decode()
                if re.findall(r"\\name\\s", content):
                    row.append(text_file.name)
                    value.append(content)
        tar.close()
df = pd.DataFrame(value, columns=['nametag'], index=row)
print(df)

Python - Iterating over all text files recursively

I am creating a text parser with python 3.6. I have a file layout like below:
(The real file structure I will be using is much more extensive than this.)
-Directory(main folder)
    -amerigroup.txt
    -bcbs.txt
    -childfolder
        -medicare.txt
I need to extract text into 2 different lists (going through and appending to my ever-growing lists). Whenever I run my current code, I can't seem to get my program to open up my medicare.txt file to read and extract the information. I get an error stating that there is no such file or directory: 'medicare.txt'.
My goal is to get the data from the 3 files and extract it in one go. How do I get the amerigroup and bcbs data then go into the childfolder and get medicare.txt, then repeat that for all branches of my file path?
I am simply trying to open and close my text files in this code snippet. Here's what I have so far:
import re
import os
import pandas as pd
#change active directory
os.chdir(r'\\company\Files\HomeDrive\user\My Documents\claimstest')
#rootdir = r'\\company\Files\HomeDrive\user\My Documents\claimstest'
#set up Regular Expression objects to parse X12
claimidRegex = re.compile(r'(CLM\*)(\d+)')
dxRegex = re.compile(r'(ABK:)(\w\d+)(\*|~)(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?')
claimids = []
dxinfo = []
for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        cid = []
        dx = []
        if name.lower().endswith(exten):
            data = open(name, 'r')
            data.close()
Thank you so much for taking your time to assist me on this!
edit: I have tried using walk to no avail so far. My most recent attempt (I tried using txtfile_full_path as well--did not work):
for dirpath, dirnames, filename in os.walk(base_dir):
    for filename in filename:
        #defining file type
        txtfile = open(filename, "r")
        txtfile_full_path = os.path.join(dirpath, filename)
        print(filename)
edit2 for anyone interested. This was my final solution to the problem:
import re
import os
import pandas as pd
#change active directory
os.chdir(r'\\company\Files\HomeDrive\user\My Documents\claimstest')
base_dir = (r'\\company\Files\HomeDrive\user\My Documents\claimstest')
#set up Regular Expression objects to parse X12
claimidRegex = re.compile(r'(CLM\*)(\d+)')
dxRegex = re.compile(r'(ABK:)(\w\d+)(\*|~)(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?(ABF:)?(\w\d+)?(\*|~)?')
claimids = []
dxinfo = []
for dirpath, dirnames, filenames in os.walk(base_dir):
    for filename in filenames:
        txtfile_full_path = os.path.join(dirpath, filename)
        x12 = open(txtfile_full_path, 'r')
        for i in x12:
            match = claimidRegex.findall(i)
            for word in match:
                claimids.append(word[1])
        x12.seek(0)
        for i in x12:
            match = dxRegex.findall(i)
            for word in match:
                dxinfo.append(word)
        x12.close()
datadic = dict(zip(claimids, dxinfo))
You need to pass the full path to open. Just creating a string variable somewhere won't do anything for you! So the following should avoid your error:
txt_list = []
for dirpath, dirnames, filenames in os.walk(base_dir):
    for filename in filenames:
        # create full path
        txtfile_full_path = os.path.join(dirpath, filename)
        with open(txtfile_full_path) as f:
            txt_list.append(f.read())
It should be easy enough to integrate the segregation based on your regexes now...
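For instance, the segregation could be wired in like this; a sketch reusing a shortened form of the question's patterns for illustration (the full dxRegex from the question drops in unchanged):

```python
import os
import re

# shortened versions of the question's patterns, for illustration
claimidRegex = re.compile(r'(CLM\*)(\d+)')
dxRegex = re.compile(r'(ABK:)(\w\d+)')

claimids = []
dxinfo = []
base_dir = r'\\company\Files\HomeDrive\user\My Documents\claimstest'

for dirpath, dirnames, filenames in os.walk(base_dir):
    for filename in filenames:
        txtfile_full_path = os.path.join(dirpath, filename)
        with open(txtfile_full_path) as f:
            content = f.read()
        # segregate the matches into the two lists
        claimids.extend(m[1] for m in claimidRegex.findall(content))
        dxinfo.extend(dxRegex.findall(content))
```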

Copy certain files from one folder to another using python

I am trying to copy only certain files from one folder to another. The filenames are in the attribute table of a shapefile.
I am successful up to writing the filenames into a .csv file and listing the column containing the filenames to be transferred. I am stuck after that on how to read those filenames to copy the files to another folder. I have read about using shutil.copy/move but am not sure how to use it. Any help is appreciated. Below is my script:
import arcpy
import csv
import os
import sys
import os.path
import shutil
from collections import defaultdict
fc = 'C:\\work_Data\\Export_Output.shp'
CSVFile = 'C:\\wokk_Data\\Export_Output.csv'
src = 'C:\\UC_Training_Areas'
dst = 'C:\\MOSAIC_Files'
fields = [f.name for f in arcpy.ListFields(fc) if f.type <> 'Geometry']
for i, f in enumerate(fields):
    if f in ['FID', "Area", 'Category', 'SHAPE_Area']:
        fields.remove(f)
with open(CSVFile, 'w') as f:
    f.write(','.join(fields) + '\n')
    with arcpy.da.SearchCursor(fc, fields) as cursor:
        for row in cursor:
            f.write(','.join([str(r) for r in row]) + '\n')
columns = defaultdict(list)
with open(CSVFile) as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k, v) in row.items():
            columns[k].append(v)
print(columns['label'])
Given the file names in columns['label'] (it is a list, so loop over it), you can use the following to copy each file:
for label in columns['label']:
    srcpath = os.path.join(src, label)
    dstpath = os.path.join(dst, label)
    shutil.copyfile(srcpath, dstpath)
Here is the script I used to solve my problem:
import os
import arcpy
import os.path
import shutil

featureclass = "C:\\work_Data\\Export_Output.shp"
src = "C:\\Data\\UC_Training_Areas"
dst = "C:\\Data\\Script"
rows = arcpy.SearchCursor(featureclass)
row = rows.next()
while row:
    print row.Label
    shutil.move(os.path.join(src, str(row.Label)), dst)
    row = rows.next()
Think of it this way: source and destination. Assume you want to copy a file from your Pictures folder to an Image folder located somewhere on your machine, where X is your user name and Z is the file name:
import os
import shutil
import glob

source = "C:/Users/X/Pictures/test/Z.jpg"
dest = "C:/Users/Public/Image"
if os.path.exists(dest):
    print("this folder already exists in this dir")
else:
    os.mkdir(dest)
for file in glob.iglob(source):
    shutil.copy(file, dest)
print("done")

Python Want to move/Create a text file into a folder with the same name

I have a series of text files named with a series of numbers (20040719.txt) that I need edited and placed into folders with the same name as the text file (but without the .txt in the folder name). I am able to do my edits and create the folders with the correct names, but can't seem to get the edited files into their corresponding folders. There are no errors, so my question is: how can I do this type of file move?
here is what I have so far
import glob
import os
import shutil

list_of_files = glob.glob("f:/Python scripts/Tests2/*.txt")
root_path = 'f:/Python scripts/Tests2/'
for file_name in list_of_files:
    folders = [file_name.replace('.txt', 'D')]
    for folder in folders:
        os.mkdir(os.path.join(root_path, folder))
    input = open(file_name, 'r')
    output = open(file_name.replace('.txt', 't2.txt'), "w")
    for line in input:
        str = line.strip(" dd/mm/yyyy hh:mm:ss kA\t")
        str = str.replace("date", "ddmmyyyy_hhmmss")
        str = str.replace("lat. long. amp.", " lat long ka")
        output.write(str)
    input.close()
    output.close()
list_of_folders = glob.glob("f:/Python scripts/Tests2/*D")
list_of_t2txt = glob.glob("f:/Python scripts/Tests2/*t2.txt")
for Folder_Name in list_of_folders:
    for t2txt_Name in list_of_t2txt:
        if t2txt_Name.replace('*t2.txt', '*D') == Folder_Name:
            shutil.move(t2txt_Name, Folder_Name)
The final 'if' statement was a trial to see if I could do it that way.
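One thing to note: str.replace treats '*t2.txt' as a literal string, not a wildcard, so that final comparison can never be true. An alternative sketch that derives each target folder directly from the t2 file's own name (assuming the naming scheme built earlier in the script):

```python
import glob
import os
import shutil

list_of_t2txt = glob.glob("f:/Python scripts/Tests2/*t2.txt")
for t2txt_Name in list_of_t2txt:
    # e.g. '.../20040719t2.txt' -> '.../20040719D' (the folder created earlier)
    Folder_Name = t2txt_Name.replace('t2.txt', 'D')
    if os.path.isdir(Folder_Name):
        shutil.move(t2txt_Name, Folder_Name)
```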
