How to filter multiple dataframes and append a string to the save filenames? - python

The reason I'm trying to accomplish this is to use lots of variable names to create lots of new variable names containing the names of the original variables.
For example, I have several pandas data frames of inventory items in each location.
I want to create new data frames containing only the negative inventory items, with '_neg' appended to the original variable names (inventory locations).
I want to be able to do this with a for loop something like this:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
tank1 = pd.read_excel('tank1onhand.xls')
tank2 = pd.read_excel('tank2onhand.xls')
all_stock_sites = [warehouse,retail,shed3,tank1,tank2]
all_neg_stock_sites = []
for site in all_stock_sites:
    # (pseudocode) string_value_of_new_site = 'string_value_of_site' + '_neg'
    string_value_of_new_site = site[site.OnHand < 0]
    all_neg_stock_sites.append(string_value_of_new_site)
to create something like this
# create new dataframes for each stock site's negative 'OnHand' values
warehouse_neg = warehouse[warehouse.OnHand < 0]
retail_neg = retail[retail.OnHand < 0]
shed3_neg = shed3[shed3.OnHand < 0]
tank1_neg = tank1[tank1.OnHand < 0]
tank2_neg = tank2[tank2.OnHand < 0]
Without having to type out all 500 different stock site locations and append '_neg' by hand.

My recommendation would be to not use variable names as the "keys" to the data, but rather assign them proper names, in a tuple or dict.
So instead of:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
You would have:
sites = {}
sites['warehouse'] = pd.read_excel('warehouse.xls')
sites['retail'] = pd.read_excel('retailonhand.xls')
sites['shed3'] = pd.read_excel('shed3onhand.xls')
...etc
Then you could create the negative keys like so:
sites_neg = {}
for site_name, site in sites.items():
    neg_key = site_name + '_neg'
    sites_neg[neg_key] = site[site.OnHand < 0]
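Each filtered frame is then available by key rather than by a separately typed variable name, for example:
sites_neg['warehouse_neg']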

Use rglob from the pathlib module to create a list of existing files
Python 3's pathlib Module: Taming the File System
.parent
.stem
.suffix
Use f-strings to update the file names
PEP 498 - Literal String Interpolation
Iterate through each file:
Create a dataframe
Filter the dataframe. An error will occur if the column doesn't exist (e.g. AttributeError: 'DataFrame' object has no attribute 'OnHand'), so we put the code in a try-except block. The continue statement continues with the next iteration of the loop.
Check that the dataframe is not empty. If it's not empty then...
Add the dataframe to a dictionary for additional processing, if desired.
Save the dataframe as a new file with _neg added to the file name
from pathlib import Path
import pandas as pd
# set path to top file directory
d = Path(r'e:\PythonProjects\stack_overflow\stock_sites')
# get all xls files
files = list(d.rglob('*.xls'))
# create, filter and save dict of dataframe
df_dict = dict()
for file in files:
    # create dataframe
    df = pd.read_excel(file)
    try:
        # filter df and add to dict
        df = df[df.OnHand < 0]
    except AttributeError as e:
        print(f'{file} caused:\n{e}\n')
        continue
    if not df.empty:
        df_dict[f'{file.stem}_neg'] = df
        # save to new file
        new_path = file.parent / f'{file.stem}_neg{file.suffix}'
        df.to_excel(new_path, index=False)
print(df_dict.keys())
>>> dict_keys(['retailonhand_neg', 'shed3onhand_neg', 'tank1onhand_neg', 'tank2onhand_neg', 'warehouse_neg'])
# access individual dataframes as you would any dict
df_dict['retailonhand_neg']

Related

Implement pandas groupby method on a dataframe with certain conditions

I am working with about 1000 XML files. I have written a script where the program loops through the folder containing these XML files and I have achieved the following:
Created a list with all the paths of the XML files
Read the files and extract the values I need to work with.
I have a new dataframe which consists of the only two columns I need to work with.
Here is the full code:
import glob
import pandas as pd
# Empty list to store path of xml files
path_list = []
# Function to iterate folder and store path of xml files.
# Can be modified to take the path as an argument via command line if required
time_sum = []
testcase = []
def calc_time(path):
    for path in glob.iglob(f'{path}/*.xml'):
        path_list.append(path)
    try:
        for file in path_list:
            xml_df = pd.read_xml(file, xpath=".//testcase")
            # Get the classname values from the XML file
            testcase_v = xml_df.at[0, 'classname']
            testcase.append(testcase_v)
            # Get the aggregate time value of all instances of the classname
            time_sum_test = xml_df['time'].sum()
            time_sum.append(time_sum_test)
        new_df = pd.DataFrame({'testcase': testcase, 'time': time_sum})
    except Exception as ex:
        msg_template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = msg_template.format(type(ex).__name__, ex.args)
        print(message)
calc_time('assignment-1/data')
Now I need to group these values on the following condition.
Equally distribute the classnames by their time into 5 groups, so that the total time for each group will be approximately the same.
The new_df looks like this:
'TestMensaSynthesis': 0.49499999999999994,
'SyncVehiclesTest': 0.303,
'CallsPromotionEligibilityTask': 3.722,
'TestSambaSafetyMvrOverCustomer': 8.546,
'TestScheduledRentalPricingEstimateAPI': 1.6360000000000001,
'TestBulkImportWithHWRegistration': 0.7819999999999999,
'calendars.tests.test_intervals.TestTimeInterval': 0.006,
The dataframe has more than 1000 lines containing the classname and time.
I need to add a groupby statement which will make 5 groups of these classes such that the total times of the groups are approximately equal to each other.
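As a hedged sketch (not from the original question or any posted answer), a greedy longest-processing-time split can get the totals approximately equal: sort the classes by time in descending order, then always assign the next class to whichever of the 5 groups currently has the smallest running total. The column names 'testcase' and 'time' are taken from new_df above.
import pandas as pd
def split_into_groups(df, n_groups=5):
    # sort by time descending, then always add the next class
    # to the group with the smallest running total
    groups = {i: [] for i in range(n_groups)}
    totals = {i: 0.0 for i in range(n_groups)}
    for _, row in df.sort_values('time', ascending=False).iterrows():
        g = min(totals, key=totals.get)
        groups[g].append(row['testcase'])
        totals[g] += row['time']
    return groups, totals
groups, totals = split_into_groups(new_df)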

I have multiple lists and I want to filter by the most current

I have the following AWS bucket schema:
In my Python code, it returns a list of the buckets with their dates.
I need to keep only the most up-to-date of the two main buckets:
I am starting in Python, this is my code:
str_of_ints = [7100, 7144]
for get_in_scenarioid in str_of_ints:
    resultado = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_scenarioid + '/')
    #print(resultado)
    sub_prefix = [val['Prefix'] for val in resultado['CommonPrefixes']]
    for get_in_sub_prefix in sub_prefix:
        resultado2 = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_sub_prefix)  # +'/')
        #print(resultado2)
        get_key_and_last_modified = [val['Key'] for val in resultado2['Contents']] + int([val['LastModified'].strftime('%Y-%m-%d %H:%M:%S') for val in resultado2['Contents']])
        print(get_key_and_last_modified)
I would recommend converting your array into a pandas DataFrame and using groupby:
import pandas as pd
df = pd.DataFrame([["a",1],["a",2],["a",3],["b",2],["b",4]], columns=["lbl","val"])
df.groupby(['lbl'], sort=False)['val'].max()
lbl
a 3
b 4
In your case you would also have to split your label into 2 parts first; better to keep them in separate columns.
Update:
Once you split your label into bucket and sub_bucket, you can return the max values like this:
dfg = df.groupby("main_bucket")
dfm = dfg.max()
res = dfm.reset_index()
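For illustration only (the key layout, column names, and sample values below are assumptions, not taken from the question), the S3 key could be split on '/' into main_bucket and sub_bucket columns before taking the most recent entry per bucket:
import pandas as pd
# hypothetical keys and timestamps, standing in for what s3.list_objects returns
keys = ['7100/sub_a/file1.csv', '7100/sub_b/file2.csv', '7144/sub_a/file3.csv']
dates = ['2021-05-01 10:00:00', '2021-06-01 09:30:00', '2021-04-15 08:00:00']
df = pd.DataFrame({'key': keys, 'last_modified': pd.to_datetime(dates)})
# split the key into main_bucket and sub_bucket columns
parts = df['key'].str.split('/', expand=True)
df['main_bucket'] = parts[0]
df['sub_bucket'] = parts[1]
# keep the most recent object per main bucket
res = df.loc[df.groupby('main_bucket')['last_modified'].idxmax()]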

How can I load my data in one line using Python and Jupyter

How can I load, or create a loop to load, my data files that share the same name but with different numbers, {a}dhs{b} (e.g. 1dhs2 or 3dhs4), using NumPy or pygeostat, and then extract the value of the variable to use it? (I have 10,000 files.) Thanks :)
df1_1 = gs.DataFile('1dhs1.out')
df1_2 = gs.DataFile('1dhs2.out')
...
df3_2 = gs.DataFile('3dhs2.out')
value1_1 = df1_1['value']
value1_2 = df1_2['value']
value1_3 = df1_3['value']
value1_4 = df1_4['value']
...
value4_3 = df1_3['value']
value4_4 = df1_4['value']
Something like this?
import os
import re
# Regular expression for the expected file name: <first number>dhs<second number>.out
FILENAME_REGEX = re.compile(r'(\d+)dhs(\d+).out')
folder_path = 'some path' # path of the folder containing the data files
values = {} # dictionary to collect extracted values
# Loop over all the files in the folder_path
for f in next(os.walk(folder_path))[2]:
    # if the name of the file does not match the template, go to the next one
    m = FILENAME_REGEX.match(f)
    if m is None:
        continue
    # extract relevant numbers from the file name
    first_int, second_int = list(map(int, m.groups()))
    # store the values in a two-level dictionary, using the extracted numbers as keys
    values.setdefault(first_int, {})[second_int] = gs.DataFile(os.path.join(folder_path, f))['value']
# Print values extracted from 3dhs2.out
print(values[3][2])

Merge all h5 files using h5py

I am a novice at coding. Can someone help with a script in Python, using h5py, that reads all the directories and sub-directories to merge multiple h5 files into a single h5 file?
What you need is a list of all datasets in the file. I think that the notion of a recursive function is what is needed here. This would allow you to extract all 'datasets' from a group, but when one of them appears to be a group itself, recursively do the same thing until all datasets are found. For example:
/
|- dataset1
|- group
|  |- dataset2
|  |- dataset3
|- dataset4
Your function should in pseudo-code look like:
def getdatasets(key, file):
    out = []
    for name in file[key]:
        path = join(key, name)
        if file[path] is dataset: out += [path]
        else: out += getdatasets(path, file)
    return out
For our example:
/dataset1 is a dataset: add path to output, giving
out = ['/dataset1']
/group is not a dataset: call getdatasets('/group',file)
/group/dataset2 is a dataset: add path to output, giving
nested_out = ['/group/dataset2']
/group/dataset3 is a dataset: add path to output, giving
nested_out = ['/group/dataset2', '/group/dataset3']
This is added to what we already had:
out = ['/dataset1', '/group/dataset2', '/group/dataset3']
/dataset4 is a dataset: add path to output, giving
out = ['/dataset1', '/group/dataset2', '/group/dataset3', '/dataset4']
This list can be used to copy all data to another file.
To make a simple clone you could do the following.
import h5py
import numpy as np
# function to return a list of paths to each dataset
def getdatasets(key, archive):
    if key[-1] != '/': key += '/'
    out = []
    for name in archive[key]:
        path = key + name
        if isinstance(archive[path], h5py.Dataset):
            out += [path]
        else:
            out += getdatasets(path, archive)
    return out
# open HDF5-files
data = h5py.File('old.hdf5','r')
new_data = h5py.File('new.hdf5','w')
# read as many datasets as possible from the old HDF5-file
datasets = getdatasets('/',data)
# get the group-names from the lists of datasets
groups = list(set([i[::-1].split('/',1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i)>0]
# sort groups based on depth
idx = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]
# create all groups that contain datasets that will be copied
for group in groups:
    new_data.create_group(group)
# copy datasets
for path in datasets:
    # - get group name
    group = path[::-1].split('/', 1)[1][::-1]
    # - minimum group name
    if len(group) == 0: group = '/'
    # - copy data
    data.copy(path, new_data[group])
Further customizations are, of course, possible depending on what you want. You describe some combination of files. In that case you would have to open the output file in append mode,
new_data = h5py.File('new.hdf5', 'a')
and probably add something to the path.
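As an aside that is not part of the original answer, h5py's visititems can collect the dataset paths without writing the recursion by hand; a minimal sketch using the same file names as above:
import h5py
def getdatasets_visit(archive):
    # collect the full path of every dataset by recursively visiting all objects
    paths = []
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            paths.append('/' + name)
    archive.visititems(visitor)
    return paths
with h5py.File('old.hdf5', 'r') as data, h5py.File('new.hdf5', 'w') as new_data:
    for path in getdatasets_visit(data):
        group = path.rsplit('/', 1)[0] or '/'
        if group != '/':
            new_data.require_group(group)  # creates missing parent groups
        data.copy(path, new_data[group])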

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read from a file and stored it into artists_tags with column names.
Now this file has multiple columns and I need to generate a new data structure which has 2 columns from artists_tags as they are, and the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df
def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print(a)
    top_tags = all_tags.ArtistID, all_tags.ArtistName, a
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts.sum().sort_values('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
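A hedged alternative under the same column assumptions (ArtistID, Tag, Count from the question's read_csv call), using idxmax to pick the highest-count tag per artist without the global sort:
summed = artists_tags.groupby(['ArtistID', 'Tag'], as_index=False)['Count'].sum()
top_tags = summed.loc[summed.groupby('ArtistID')['Count'].idxmax()].set_index('ArtistID')
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# expected: 'Grunge' (per the answer above)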
