Implement pandas groupby method on a dataframe with certain conditions - python

I am working with about 1000 XML files. I have written a script that loops through the folder containing these XML files, and so far I have achieved the following:
Created a list with all the paths of the XML files.
Read the files and extracted the values I need to work with.
Built a new dataframe which consists of only the two columns I need to work with.
Here is the full code:
import glob
import pandas as pd

# Empty list to store path of xml files
path_list = []
# Lists to collect the aggregated time and classname per file
time_sum = []
testcase = []

# Function to iterate folder and store path of xml files.
# Can be modified to take the path as an argument via command line if required
def calc_time(path):
    for path in glob.iglob(f'{path}/*.xml'):
        path_list.append(path)
    try:
        for file in path_list:
            xml_df = pd.read_xml(file, xpath=".//testcase")
            # Get the classname values from the XML file
            testcase_v = xml_df.at[0, 'classname']
            testcase.append(testcase_v)
            # Get the aggregate time value of all instances of the classname
            time_sum_test = xml_df['time'].sum()
            time_sum.append(time_sum_test)
        new_df = pd.DataFrame({'testcase': testcase, 'time': time_sum})
    except Exception as ex:
        msg_template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = msg_template.format(type(ex).__name__, ex.args)
        print(message)

calc_time('assignment-1/data')
Now I need to group these values according to the following condition:
Distribute the classnames by their time into 5 groups, so that the total time for each group is approximately the same.
The new_df looks like this:
'TestMensaSynthesis': 0.49499999999999994,
'SyncVehiclesTest': 0.303,
'CallsPromotionEligibilityTask': 3.722,
'TestSambaSafetyMvrOverCustomer': 8.546,
'TestScheduledRentalPricingEstimateAPI': 1.6360000000000001,
'TestBulkImportWithHWRegistration': 0.7819999999999999,
'calendars.tests.test_intervals.TestTimeInterval': 0.006,
The dataframe has more than 1000 lines containing the classname and time.
I need to add a groupby statement which will make 5 groups of these classes so that the total times of the groups are approximately equal to each other.
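For illustration only (this is not from the original post), a hedged sketch of one way to get roughly equal total times is a greedy partition rather than a plain groupby: sort the classes by time in descending order and always assign the next class to the group with the smallest running total.

import heapq

def split_into_groups(df, n_groups=5):
    # one bucket of classnames per group, plus a heap of (running_total, group_index)
    groups = [[] for _ in range(n_groups)]
    heap = [(0.0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    for _, row in df.sort_values('time', ascending=False).iterrows():
        total, i = heapq.heappop(heap)          # group with the smallest total so far
        groups[i].append(row['testcase'])
        heapq.heappush(heap, (total + row['time'], i))
    return groups

# groups = split_into_groups(new_df)   # assumes new_df has 'testcase' and 'time' columns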

Related

How to combine a Pandas dataframe with a Tensor dataset?

I have a Tensor dataset that is a list of file names and a Pandas dataframe that contains metadata for each file.
filename_ds = tf.data.Dataset.list_files(path + "/*.bmp")
metadata_df = pandas.read_csv(path + "/metadata.csv")
File names contain an idx that references a line in the metadata dataframe, like "3_data.bmp" where 3 is the idx. I hoped to call filename_ds.map(combine_data).
It appears not to be as simple as parsing the file name and doing a dataframe lookup. The following fails because filename is a Tensor, and since I'm running this inside a Dataset.map() call, I do not have access to eager-only methods like .numpy(), so I cannot get a string value from the filename to do my regex and dataframe lookup.
def combine_data(filename):
    idx = re.findall("(\d+)_data.bmp", filename)[0]
    val = metadata_df.loc[metadata_df["idx"] == idx]["test-col"]
    ...
New to Tensorflow, and I suspect I'm going about this in an odd way. What would be the correct way to go about this? I could list my files and concatenate a dataset for each file, but I'm wondering if I'm just missing the "Tensorflow way" of doing it.
One way to iterate is through as_numpy_iterator():
dataset_list = list(filename_ds.as_numpy_iterator())
for each_file in dataset_list:
    file_name = each_file.decode('utf-8')  # this will contain the abs path /user/me/so/file_1.png
    try:
        idx = re.findall("(\d+).*.png", file_name)[0]  # changed for my case
    except:
        print("Exception==>")
    print(f"File:{file_name},idx:{idx}")

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns, to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas dataframes and compare them, similarly to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes is, imo, the first step to working with it.
I've taken the liberty and the time to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example, so that's good. I've tested the code below (Python 3.8) and it runs successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contain integers, and according to your specifications as best I can at the moment. It's my first attempt, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV in Python using the pandas module, converts it to dataframes, works on specific columns only in those dataframes, makes new columns for the results, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works! Personally (and professionally) speaking it's a milestone for me, and I will hopefully work on it later to improve its readability, scope, functionality and abilities.
# This is a work in progress (although it does work and does a job). There are redundant lines of code in it, even among the lines not hashed out, because I'm a self-teaching newbie on my weekends. You can see how to convert your columns and rows into lists with pandas dataframes, start to do calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward; it's a basis to do much more and better on, but in basic terms it does what was asked for.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now'?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats)  # good for comparing whole columns as a block; part of a later attempt to compare text fields with integer fields, doesn't affect the program
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re  # for string matching & comparison; redundant in this example, but you'll need it to compare tables of strings
#filename = 'CATS&LABS64.csv'   # redundant now that the export happens below
#csvfile = open(filename, 'a')  # from the first attempt at exporting results; redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULTAIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = (print("\nDifference of score1 and score2 :\n", A))
result2 = print(A) and print(result)
def result22(result2):
for aSentence in result2:
df = pd.DataFrame(result2)
print(str())
return df
print(result2)
print(result22) # printing out the function itself 'produces nothing but its name of course
output_df = DataFrame((result2),A)
output_df.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect at all, but I wanted to give you a heads-up about pandas and dataframes for doing what you want moving forward.
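For the original question's comparison specifically, a more direct pandas sketch (the file name and the 0-based column positions 8 and 10 are assumptions taken from the pseudo code) could be:

import pandas as pd

df = pd.read_csv('data.csv')      # hypothetical file name
category = df.iloc[:, 8]          # the column called Column[8] in the question
label = df.iloc[:, 10]            # the column called Column[10] in the question

totalChecked = len(df)
correct = int((category == label).sum())
difference = totalChecked - correct
print(f"correct: {correct}, different: {difference}, "
      f"accuracy: {correct / totalChecked:.1%}")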

How to filter multiple dataframes and append a string to the save filenames?

The reason I'm trying to accomplish this is to use lots of variable names to create lots of new variable names containing the names of the original variables.
For example, I have several pandas data frames of inventory items in each location.
I want to create new data frames containing only the negative inventory items, with '_neg' appended to the original variable names (inventory locations).
I want to be able to do this with a for loop, something like this:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
tank1 = pd.read_excel('tank1onhand.xls')
tank2 = pd.read_excel('tank2onhand.xls')
all_stock_sites = [warehouse,retail,shed3,tank1,tank2]
all_neg_stock_sites = []
for site in all_stock_sites:
    string_value_of_new_site = (pseudo code):'site-->string_value_of_site' + '_neg'
    string_value_of_new_site = site[site.OnHand < 0]
    all_neg_stock_sites.append(string_value_of_new_site)
to create something like this
# create new dataframes for each stock site's negative 'OnHand' values
warehouse_neg = warehouse[warehouse.OnHand < 0]
retail_neg = retail[retail.OnHand < 0]
shed3_neg = shed3[shed3.OnHand < 0]
tank1_neg = tank1[tank1.OnHand < 0]
tank2_neg = tank2[tank2.OnHand < 0]
Without having to type out all 500 different stock site locations and appending '_neg' by hand.
My recommendation would be to not use variable names as the "keys" to the data, but rather assign them proper names, in a tuple or dict.
So instead of:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
You would have:
sites = {}
sites['warehouse'] = pd.read_excel('warehouse.xls')
sites['retail'] = pd.read_excel('retailonhand.xls')
sites['shed3'] = pd.read_excel('shed3onhand.xls')
...etc
Then you could create the negative keys like so:
sites_neg = {}
for site_name, site in sites.items():
    neg_key = site_name + '_neg'
    sites_neg[neg_key] = site[site.OnHand < 0]
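To avoid typing out 500 read_excel calls in the first place, a small hedged sketch of the same idea driven by the files themselves (the 'stock_sites' folder name and the .xls pattern are assumptions) could be:

from pathlib import Path
import pandas as pd

# file stems ('warehouse', 'retailonhand', ...) become the dict keys
sites = {p.stem: pd.read_excel(p) for p in Path('stock_sites').glob('*.xls')}
sites_neg = {f'{name}_neg': df[df.OnHand < 0] for name, df in sites.items()}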
Use rglob from the pathlib module to create a list of existing files
Python 3's pathlib Module: Taming the File System
.parent
.stem
.suffix
Use f-strings to update the file names
PEP 498 - Literal String Interpolation
Iterate through each file:
Create a dataframe
Filter the dataframe. An error will occur if the column doesn't exist (e.g. AttributeError: 'DataFrame' object has no attribute 'OnHand'), so we put the code in a try-except block. The continue statement continues with the next iteration of the loop.
Check that the dataframe is not empty. If it's not empty then...
Add the dataframe to a dictionary for additional processing, if desired.
Save the dataframe as a new file with _neg added to the file name
from pathlib import Path
import pandas as pd
# set path to top file directory
d = Path(r'e:\PythonProjects\stack_overflow\stock_sites')
# get all xls files
files = list(d.rglob('*.xls'))
# create, filter and save dict of dataframe
df_dict = dict()
for file in files:
    # create dataframe
    df = pd.read_excel(file)
    try:
        # filter df and add to dict
        df = df[df.OnHand < 0]
    except AttributeError as e:
        print(f'{file} caused:\n{e}\n')
        continue
    if not df.empty:
        df_dict[f'{file.stem}_neg'] = df
        # save to new file
        new_path = file.parent / f'{file.stem}_neg{file.suffix}'
        df.to_excel(new_path, index=False)
print(df_dict.keys())
>>> dict_keys(['retailonhand_neg', 'shed3onhand_neg', 'tank1onhand_neg', 'tank2onhand_neg', 'warehouse_neg'])
# access individual dataframes as you would any dict
df_dict['retailonhand_neg']

python pandas HDF5Store append new dataframe with long string columns fails

For a given path, I process many gigabytes of files inside it, and yield a dataframe for every processed file.
For every dataframe that is yielded, which includes two string columns of varying size, I want to dump it to disk using the very efficient HDF5 format. The error is raised when the HDFStore.append procedure is called, on the 4th or 5th iteration.
I use the following routine (simplified) to build the dataframes:
def build_data_frames(path):
    data = df({'headline': [],
               'content': [],
               'publication': [],
               'file_ref': []},
              columns=['publication', 'file_ref', 'headline', 'content'])
    for curdir, subdirs, filenames in os.walk(path):
        for file in filenames:
            if zipfile.is_zipfile(os.path.join(curdir, file)):
                with zf(os.path.join(curdir, file), 'r') as arch:
                    for arch_file_name in arch.namelist():
                        if re.search('A[r|d]\d+.xml', arch_file_name) is not None:
                            xml_file_ref = arch.open(arch_file_name, 'r')
                            xml_file = xml_file_ref.read()
                            metadata = XML2MetaData(xml_file)
                            headlineTokens, contentTokens = XML2TokensParser(xml_file)
                            rows = [{'headline': " ".join(headlineTokens),
                                     'content': " ".join(contentTokens)}]
                            rows[0].update(metadata)
                            data = data.append(df(rows,
                                                  columns=['publication',
                                                           'file_ref',
                                                           'headline',
                                                           'content']),
                                               ignore_index=True)
                arch.close()
            yield data
Then I use the following method to write these dataframes to disk:
def extract_data(path):
    hdf_fname = extract_name(path)
    hdf_fname += ".h5"
    data_store = HDFStore(hdf_fname)
    for dataframe in build_data_frames(path):
        data_store.append('df', dataframe, data_columns=True)
        ## passing min_itemsize doesn't work either
        ## data_store.append('df', dataframe, min_itemsize=8000)
        ## trying the "alternative" command didn't help
        ## dataframe.to_hdf(hdf_fname, 'df', format='table', append=True,
        ##                  min_itemsize=80000)
    data_store.close()
->
%time load_data(publications_path)
And the ValueError I get is:
...
ValueError: Trying to store a string with len [5761] in [values_block_0]
column but this column has a limit of [4430]!
Consider using min_itemsize to preset the sizes on these columns
I tried all the options, went through all the documentation relevant to this task, and tried all the tricks I saw on the Internet, yet I have no idea why it happens.
I use pandas ver: 0.17.0
Appreciate your help very much!
Have you seen this post? stackoverflow
data_store.append('df', dataframe, min_itemsize={'string': 5761})
Change 'string' to the name of your column (min_itemsize takes column names, or the special key 'values', mapped to the maximum string width you expect).
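Applied to the question's own dataframes, a hedged sketch (the widths are guesses and should be set to the longest strings you expect) would be:

data_store.append('df', dataframe, data_columns=True,
                  min_itemsize={'headline': 2000, 'content': 10000})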

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read from a file and stored the data into artists_tags with column names.
Now this file has multiple columns, and I need to generate a new data structure which has 2 columns from artists_tags as they are, plus the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter

def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df

def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df

# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119

# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print(a)
    top_tags = all_tags.ArtistID, all_tags.ArtistName, a
    return top_tags

top_tags = calculate_top_tag(artists_tags)

# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
# (sort_values and .loc are the current names for the older .sort and .ix)
tag_counts = tag_counts.sum().sort_values('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
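Building on the same idea, a hedged sketch that also keeps ArtistName, so the result matches the (artist id, artist name, top tag) structure the question asks for (column names taken from the question's read_csv call):

top_per_artist = (artists_tags
                  .groupby(['ArtistID', 'ArtistName', 'Tag'], as_index=False)['Count']
                  .sum()
                  .sort_values('Count', ascending=False)
                  .groupby('ArtistID', as_index=False)
                  .first())
print(top_per_artist[top_per_artist.ArtistID == '5b11f4ce-a62d-471e-81fc-a69a8278c7da'])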
