How to import data from xls files with different headers - Python

I'm trying to collect data from multiple xls files.
The column names and sizes change from file to file, so each file has a different header.
Column titles differ, and the same data field can be split across separate columns.
For example, what I need is: Reference, Name, Qt, Price, Amount.
Example files:
'A' + 'B' = Reference ('1'/'1') / 'C' = Reference / 'D' = Quantity ...
'A' = Reference ('1.1') / 'B' = Reference / 'C' = Nothing / 'D' = Quantity ...
'A' = Reference + Name / 'C' = Quantity ...
What is the best practice for importing such a dataset?
Python? Machine learning?
Thank you

Wow! This sounds like a really ugly design. OK, I am assuming at least SOME of the field names are the same; otherwise you are not actually doing anything useful here.
(Before and After screenshots of the 'Transposed Fields' sheet, showing the field names before and after alignment, are omitted here.)
Run this VBA code in Excel to align the field names, as shown above.
Sub CompareRowDifferences1()
    Dim sht As Worksheet
    Dim i As Long, LastColumn As Long

    Set sht = ThisWorkbook.Worksheets("Transposed Fields")
    LastColumn = sht.Cells.SpecialCells(xlLastCell).Column

    With sht
        For i = 1 To LastColumn
            ' Where the field names in rows 1 and 2 differ, shift row 2
            ' to the right and fill the gap with a NULL placeholder
            If StrComp(.Cells(1, i), .Cells(2, i), vbBinaryCompare) <> 0 Then
                .Cells(2, i).Insert Shift:=xlToRight
                .Cells(2, i).Value2 = "NULL AS " & .Cells(1, i).Value2
            End If
        Next i
    End With
End Sub
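If you would rather stay in Python, here is a hedged sketch of the usual pandas approach: read each file, rename the varying headers to one canonical set via a synonym map, and concatenate. All file names, header variants, and the split-reference column names below are invented for illustration and must be adapted to the real files.

import pandas as pd

# Hypothetical header variants mapped to the canonical field names
# the question asks for; extend this per real file.
SYNONYMS = {
    'Ref': 'Reference', 'Ref.': 'Reference', 'Part No': 'Reference',
    'Qty': 'Qt', 'Quantity': 'Qt',
    'Unit Price': 'Price',
    'Total': 'Amount', 'Amt': 'Amount',
}
WANTED = ['Reference', 'Name', 'Qt', 'Price', 'Amount']

frames = []
for path in ['file_a.xls', 'file_b.xls']:  # invented example paths
    df = pd.read_excel(path)
    df = df.rename(columns=SYNONYMS)
    # If the reference is split over two columns, glue it back together
    # ('RefPart1'/'RefPart2' are invented names for the split columns)
    if 'Reference' not in df.columns and {'RefPart1', 'RefPart2'} <= set(df.columns):
        df['Reference'] = df['RefPart1'].astype(str) + '.' + df['RefPart2'].astype(str)
    frames.append(df.reindex(columns=WANTED))  # missing fields become NaN

data = pd.concat(frames, ignore_index=True)
print(data)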


VBA Excel - Fill columns based on values from another column and a condition

I have solved this problem in Python, but I would like it in VBA so that anybody can just cut/paste it into their workbooks and use it, since most of the people I work alongside are not Python-literate, or are novices in the most liberal sense of the word when it comes to programming.
The columns of interest (see below) are B, C, D. Column B represents levels of separation away from the top order (column A). If any value in col B is 1, col D of that row == A (John). When BX > 1 (X is any row number), DX takes the column C value of the nearest row above whose B value is exactly 1 less than BX. For instance, D4 == C3, but D8 == C7, D9 == C6 and so on.
In Python, I solved it like this (I know it's not very elegant):
import pandas as pd

df = pd.read_csv('workbook.csv', header=0)
levels = df['col2'].to_list()
child = df['col3'].to_list()
parents = []
index_container = {}

for idx in range(len(levels)):
    # remember the most recent name seen at this level
    index_container[str(levels[idx])] = child[idx]
    if levels[idx] == 1:
        parents.append('John')
    elif levels[idx] > 1:
        # parent is the last name seen one level up
        parents.append(index_container[str(levels[idx] - 1)])

df['Parents'] = parents
This works great; the problem is I don't know VBA. I'm reading the docs in the meantime, but I'm not sure how it will go. How do I write this as a VBA script?
We use Office 2019, if that makes a difference.
The following code is rather simple: it creates an array of names. Column B gives the index into the array, column C the name.
The code loops over all rows, fills the name into the array, and then looks up the name at the previous index.
Sub xy()
    Const MaxNames = 10
    Dim wb As Workbook, lastRow As Long
    Set wb = Workbooks.Open("workbook.csv")

    Dim names(1 To MaxNames) As String
    With wb.Sheets(1)
        names(1) = .Cells(2, 1)           ' top-order name from A2 ("John")
        lastRow = .Cells(.Rows.Count, 1).End(xlUp).Row

        Dim row As Long
        For row = 2 To lastRow
            Dim index As Long
            index = .Cells(row, 2)        ' level from column B
            names(index) = .Cells(row, 3) ' remember this level's name (column C)
            If index > 1 Then index = index - 1
            .Cells(row, 4) = names(index) ' parent name goes into column D
        Next
    End With
End Sub
Note that I simply open the CSV file as a workbook. Depending on the content of your CSV file and the settings in your Excel, this might or might not work. If you have problems (e.g. a whole line of data is written into one cell), you need to figure out which method of reading the data fits your needs - there is lots of information on SO and elsewhere about that.

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if category != label:
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas DataFrames and compare them, similar to the example below. Whatever dataset you're working on, using the pandas module (alongside any other relevant modules) and transforming the data into lists and DataFrames is the first step to working with it, IMO.
I've taken the liberty and the time/effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths at all in this example, so that's good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
import re

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')  # dropped the s from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')

print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()  # it was Reference_sequences

# map each ID to its sequence
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # in the original example it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

# search every unknown sequence against every reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1  # 1-based match position
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contained integers, and according to your specifications (as best I can at the moment). It's my first attempt without web scraping, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV in Python using pandas, converts it to DataFrames, works on specific columns only in those DataFrames, makes new result columns, prints the results alongside the original data in the terminal, and saves to a new CSV. It's as messy as my Python is, but it works! Personally (and professionally) speaking it's a milestone for me, and I will hopefully be working on it at a later date to improve its readability, scope, functionality and abilities.
# Work in progress (although it does work and does a job) - I'm a
# self-teaching newbie on my weekends. You can see how to convert your
# columns and rows into lists with pandas DataFrames, do calculations
# with them in Python, and get your results back out to a new CSV.
# It's a start on how you can answer your question going forward.
# IT'S FOR HER TO DO MUCH MORE AND BETTER ON, BUT IT DOES IN BASIC TERMS WHAT SHE ASKED FOR.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace=True)
# A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc
print(A.columns)

MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

# map each category value to its label
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

print("Given Dataframe :\n", A)
# element-wise difference of the two numeric columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
# YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULATIONS HERE AND ADD THEM TO THE OUTPUT

# write the original data plus the new result column to a new CSV
A.to_csv('some_name5523.csv', index=False)
Yes, I know it's by no means perfect at all, but I wanted to give you the heads-up about pandas and DataFrames for doing what you want moving forward.
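For the specific accuracy check in the question (comparing column 8 against column 10 and counting matches), here is a much shorter hedged sketch; the file name and the 0-based column positions are assumptions to be adjusted:

import pandas as pd

df = pd.read_csv('data.csv')  # assumed file name
category = df.iloc[:, 8]      # column 8 (assumed 0-based position)
label = df.iloc[:, 10]        # column 10

correct = int((category == label).sum())
totalChecked = len(df)
difference = totalChecked - correct
print(f'{correct}/{totalChecked} correct '
      f'({correct / totalChecked:.1%} accuracy, {difference} different)')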

df.replace() changes not being written to the text or csv file

When I use:
df = df.replace(oldvalue, newvalue)
it replaces the values, but when I try to write the new dataframe to either a text file or a CSV file, the file does not update and still contains the original output from before the replace.
I am getting the data from two files and trying to add them together. Right now I am trying to change the formatting to match the original formatting.
I have tried altering the placement of the replacement, as well as editing my df.replace command numerous times to include regex=True, to_replace, value=, and other small things. Below is a small sampling of code.
drdf['adg'] = adgvals #adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    for indv in valuex:
        valuex = valuex.replace('444.000', '444.0')
    for difindv in valuex:
        fourspace = '    '
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.replace(difindv, indv1)
            drdf = drdf.replace(to_replace=valuex, value=value1)

#Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')
It should be replacing the values (for example, 40 with '40    ', i.e. 40 followed by four spaces). It does this within the Spyder interface, but the changes do not make it into the files that are being created.
Did you try:
df.replace(old, new, inplace=True)
inplace essentially puts the new value 'in place' of the old in some cases. However, I do not claim to know all the inner technical workings of inplace.
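A minimal sketch of the round trip with a toy DataFrame (the column name and values are made up): the key point is to either keep the frame that replace() returns or pass inplace=True, before writing the file.

import pandas as pd

df = pd.DataFrame({'val': ['444.000', '40', '7']})  # toy data

# Option 1: keep the frame that replace() returns
df = df.replace('444.000', '444.0')

# Option 2: modify the frame in place, no reassignment needed
df.replace('40', '40    ', inplace=True)

# Only now write the file; it reflects both replacements
df.to_csv('out.csv', index=False)
print(open('out.csv').read())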
This is how I would do it with map:
drdf['adg'] = adgvals #adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    # map takes a dict; note that values absent from the dict become NaN
    valuex = valuex.map({'444.000': '444.0'})
    for difindv in valuex:
        fourspace = '    '
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.map({difindv: indv1})
            drdf = drdf.replace(valuex, value1)

#Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')

Equivalent of arcpy.Statistics_analysis using NumPy (or other)

I am having a problem (I think memory related) when trying to do an arcpy.Statistics_analysis on an approximately 40 million row table. I am trying to count the number of non-null values in various columns of the table per category (e.g. there are x non-null values in column 1 for category A). After this, I need to join the statistics results to the input table.
Is there a way of doing this using numpy (or something else)?
The code I currently have is like this:
arcpy.Statistics_analysis(input_layer, output_layer, "'Column1' COUNT; 'Column2' COUNT; 'Column3' COUNT", "Categories")
I am very much a novice with arcpy/numpy, so any help is much appreciated!
You can convert a table to a NumPy array using the function arcpy.da.TableToNumPyArray, and then convert the array to a pandas.DataFrame object.
Here is an example of code (I assume you are working with a feature class because you use the term null values; if you work with a shapefile you will need to change the code, as null values are not supported there and are replaced with a single-space string ' '):
import arcpy
import pandas as pd

# Change these values
gdb_path = 'path/to/your/geodatabase.gdb'
table_name = 'your_table_name'
cat_field = 'Categorie'
fields = ['Column1', 'column2', 'Column3', 'Column4']

# Do not change
null_value = -9999
input_table = gdb_path + '\\' + table_name

# Convert to pandas DataFrame
array = arcpy.da.TableToNumPyArray(input_table,
                                   [cat_field] + fields,
                                   skip_nulls=False,
                                   null_value=null_value)
df = pd.DataFrame(array)

# Count the number of non-null values per field and category
not_null_count = {field: {cat: 0 for cat in df[cat_field].unique()}
                  for field in fields}
for cat in df[cat_field].unique():
    _df = df.loc[df[cat_field] == cat]
    len_cat = len(_df)
    for field in fields:
        counts = _df[field].value_counts()
        try:  # if your field contains integers or floats
            null_count = counts[null_value]
        except KeyError:
            try:  # if it contains text (string)
                null_count = counts[str(null_value)]
            except KeyError:  # there is no null value
                null_count = 0
        not_null_count[field][cat] = len_cat - null_count
Concerning joining the results to the input table: without more information it's complicated to give you an exact answer that will meet your expectations (there are multiple columns, so it's unclear which value you want to add).
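As an aside, once the data is in the DataFrame above, the counting loop can be collapsed into a single groupby; a hedged sketch, assuming the -9999 placeholders are first turned back into real NaN values:

import numpy as np

# Turn the -9999 placeholders (numeric and string forms) back into NaN,
# then count non-null values per category; count() skips NaN by design.
not_null_df = (df.replace([null_value, str(null_value)], np.nan)
                 .groupby(cat_field)[fields]
                 .count())
print(not_null_df)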
EDIT:
Here is some additional code following your clarifications:
# Create a copy of the table
copy_name = ''  # name of the copied table
copy_path = gdb_path + '\\' + copy_name
arcpy.Copy_management(input_table, copy_path)

# Divide the copied data by the summary counts
# This step doesn't need the dict (not_null_count) converted to a table
with arcpy.da.UpdateCursor(copy_path, [cat_field] + fields) as cur:
    for row in cur:
        category = row[0]
        for i, fld in enumerate(fields):
            row[i + 1] /= not_null_count[fld][category]
        cur.updateRow(row)

# Save the summary table as a csv file (if needed)
df_summary = pd.DataFrame(not_null_count)
df_summary.index.name = 'Food Area'  # or any name
df_summary.to_csv('path/to/file.csv')  # change path

# Summary to ArcMap table (also if needed)
arcpy.TableToTable_conversion('path/to/file.csv',
                              gdb_path,
                              'name_of_your_new_table')

Code segment in Python keeps changing Dataframe content despite not being referenced

I have the following DataFrame in Python, where 'data' = the full dataset composed of 2 columns of strings, 'Description' and 'Category'.
'dataTrain' is a subset of 'data'.
'catBag' is a set of all the words used in the 'Description' of rows of a specific 'Category'.
'catDict' is a set of all the words used in the 'Description' of rows of all the other categories.
'catUnique' holds all the words that are unique to a specific category.
The nested loop replaces the 'Description' text with only the words that are unique to the row's category.
classNames = sorted(list(set(dataTrain['Category'])))
catUnique = [[] for _ in range(len(classNames))]
dataTemp = dataTrain
for i in range(len(classNames)):
    catBag = set()
    data2 = dataTrain.loc[data['Category'] == classNames[i]]
    data2['Description'].str.lower().str.split().apply(catBag.update)
    catDict = set()
    data3 = dataTrain.loc[data['Category'] != classNames[i]]
    data3['Description'].str.lower().str.split().apply(catDict.update)
    catUnique[i] = list(catBag - catDict)
    for j in range(len(data2)):
        if len(catUnique[i]) > 0:
            data22 = data2
            dataTemp.at[data22.index[j], 'Description'] = " ".join(
                list(set(data22.at[data22.index[j], 'Description'].lower().split())
                     & set(catUnique[i])))
However, running this code updates dataTrain's 'Description' text despite dataTrain never being assigned to. Even when I change it so that dataTrain isn't used as an input, it still gets updated.
This issue means that more words are missing from 'data3', as non-unique words are stripped from previously processed categories.
I think it's to do with the data2['Description'].str.lower().str.spl... lines, but I'm not sure how to fix it.
In your last line, you are updating dataTemp, which is the same object as dataTrain.
In order to make a copy of dataTrain, use
dataTemp = dataTrain.copy()
In Python, dataTemp = dataTrain only creates a new variable that references the same object.
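A minimal sketch of the difference, using a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

alias = df            # same object: writes go through to df
alias.at[0, 'x'] = 99
print(df.at[0, 'x'])  # 99

indep = df.copy()     # independent copy: df stays untouched
indep.at[1, 'x'] = -1
print(df.at[1, 'x'])  # still 2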
