The reason I'm trying to accomplish this is to use lots of variable names to create lots of new variable names containing the names of the original variables.
For example, I have several pandas data frames of inventory items in each location.
I want to create new data frames containing only the negative inventory items, with '_neg' appended to the original variable names (inventory locations).
I want to be able to do this with a for loop, something like this:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
tank1 = pd.read_excel('tank1onhand.xls')
tank2 = pd.read_excel('tank2onhand.xls')
all_stock_sites = [warehouse, retail, shed3, tank1, tank2]
all_neg_stock_sites = []
for site in all_stock_sites:
    # (pseudocode) turn the variable name of site into a string and append '_neg'
    string_value_of_new_site = 'string_value_of_site' + '_neg'
    string_value_of_new_site = site[site.OnHand < 0]
    all_neg_stock_sites.append(string_value_of_new_site)
to create something like this
# create new dataframes for each stock site's negative 'OnHand' values
warehouse_neg = warehouse[warehouse.OnHand < 0]
retail_neg = retail[retail.OnHand < 0]
shed3_neg = shed3[shed3.OnHand < 0]
tank1_neg = tank1[tank1.OnHand < 0]
tank2_neg = tank2[tank2.OnHand < 0]
Without having to type out all 500 different stock site locations and appending '_neg' by hand.
My recommendation would be to not use variable names as the "keys" to the data, but rather assign them proper names, in a tuple or dict.
So instead of:
warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
You would have:
sites = {}
sites['warehouse'] = pd.read_excel('warehouse.xls')
sites['retail'] = pd.read_excel('retailonhand.xls')
sites['shed3'] = pd.read_excel('shed3onhand.xls')
...etc
Then you could create the negative keys like so:
sites_neg = {}
for site_name, site in sites.items():
    neg_key = site_name + '_neg'
    sites_neg[neg_key] = site[site.OnHand < 0]
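As a quick check of what this produces, here is the same loop run on toy frames (made-up data standing in for the real spreadsheets):

```python
import pandas as pd

# toy stand-ins for the real spreadsheets
sites = {
    'warehouse': pd.DataFrame({'OnHand': [5, -2, 0]}),
    'retail': pd.DataFrame({'OnHand': [-1, 3]}),
}

sites_neg = {}
for site_name, site in sites.items():
    sites_neg[site_name + '_neg'] = site[site.OnHand < 0]

print(list(sites_neg))  # ['warehouse_neg', 'retail_neg']
```

The '_neg' frames are then reachable by name, e.g. sites_neg['warehouse_neg'], with no dynamically created variables.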
Use rglob from the pathlib module to create a list of existing files
Python 3's pathlib Module: Taming the File System
.parent
.stem
.suffix
Use f-strings to update the file names
PEP 498 - Literal String Interpolation
Iterate through each file:
Create a dataframe
Filter the dataframe. An error occurs if the column doesn't exist (e.g. AttributeError: 'DataFrame' object has no attribute 'OnHand'), so the filtering goes in a try-except block; the continue statement moves on to the next iteration of the loop.
Check that the dataframe is not empty. If it's not empty then...
Add the dataframe to a dictionary for additional processing, if desired.
Save the dataframe as a new file with _neg added to the file name
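Those Path attributes and the f-string rename, shown in isolation (the path below is made up):

```python
from pathlib import Path

p = Path('stock_sites/warehouse.xls')
print(p.parent)  # stock_sites
print(p.stem)    # warehouse
print(p.suffix)  # .xls
# f-string building the new file name
print(f'{p.stem}_neg{p.suffix}')  # warehouse_neg.xls
```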
from pathlib import Path
import pandas as pd
# set path to top file directory
d = Path(r'e:\PythonProjects\stack_overflow\stock_sites')
# get all xls files
files = list(d.rglob('*.xls'))
# create, filter and save dict of dataframe
df_dict = dict()
for file in files:
    # create dataframe
    df = pd.read_excel(file)
    try:
        # filter df and add to dict
        df = df[df.OnHand < 0]
    except AttributeError as e:
        print(f'{file} caused:\n{e}\n')
        continue
    if not df.empty:
        df_dict[f'{file.stem}_neg'] = df
        # save to new file
        new_path = file.parent / f'{file.stem}_neg{file.suffix}'
        df.to_excel(new_path, index=False)
print(df_dict.keys())
>>> dict_keys(['retailonhand_neg', 'shed3onhand_neg', 'tank1onhand_neg', 'tank2onhand_neg', 'warehouse_neg'])
# access individual dataframes as you would any dict
df_dict['retailonhand_neg']
I am working with about 1000 XML files. I have written a script where the program loops through the folder containing these XML files and I have achieved the following:
Created a list with all the paths of the XML files
Read the files and extract the values I need to work with.
I have a new dataframe which consists of the only two columns I need to work with.
Here is the full code :
import glob
import pandas as pd
# Empty list to store path of xml files
path_list = []
# Function to iterate folder and store path of xml files.
# Can be modified to take the path as an argument via command line if required
time_sum = []
testcase = []
def calc_time(path):
    for path in glob.iglob(f'{path}/*.xml'):
        path_list.append(path)
    try:
        for file in path_list:
            xml_df = pd.read_xml(file, xpath=".//testcase")
            # Get the classname values from the XML file
            testcase_v = xml_df.at[0, 'classname']
            testcase.append(testcase_v)
            # Get the aggregate time value of all instances of the classname
            time_sum_test = xml_df['time'].sum()
            time_sum.append(time_sum_test)
        new_df = pd.DataFrame({'testcase': testcase, 'time': time_sum})
    except Exception as ex:
        msg_template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = msg_template.format(type(ex).__name__, ex.args)
        print(message)

calc_time('assignment-1/data')
Now I need to group these values on the following condition.
Equally distribute the classnames by their time into 5 groups, so that the total time for each group will be approximately the same.
The new_df looks like this:
'TestMensaSynthesis': 0.49499999999999994,
'SyncVehiclesTest': 0.303,
'CallsPromotionEligibilityTask': 3.722,
'TestSambaSafetyMvrOverCustomer': 8.546,
'TestScheduledRentalPricingEstimateAPI': 1.6360000000000001,
'TestBulkImportWithHWRegistration': 0.7819999999999999,
'calendars.tests.test_intervals.TestTimeInterval': 0.006,
The dataframe has more than 1000 lines containing the classname and time.
I need to add a groupby statement which will make 5 groups of these classes and the total time of these groups will be
approximately equal to each other.
I have the following AWS bucket schema:
In my python code, it returns a list of the buckets with their dates.
I need to keep only the most up-to-date of the two main buckets:
I am starting in Python, this is my code:
str_of_ints = [7100, 7144]
for get_in_scenarioid in str_of_ints:
    resultado = s3.list_objects(Bucket=source, Delimiter='/', Prefix=str(get_in_scenarioid) + '/')
    #print(resultado)
    sub_prefix = [val['Prefix'] for val in resultado['CommonPrefixes']]
    for get_in_sub_prefix in sub_prefix:
        resultado2 = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_sub_prefix)  # +'/')
        #print(resultado2)
        get_key_and_last_modified = [(val['Key'], val['LastModified'].strftime('%Y-%m-%d %H:%M:%S'))
                                     for val in resultado2['Contents']]
        print(get_key_and_last_modified)
I would recommend converting your array into a pandas DataFrame and using groupby:
import pandas as pd
df = pd.DataFrame([["a",1],["a",2],["a",3],["b",2],["b",4]], columns=["lbl","val"])
df.groupby(['lbl'], sort=False)['val'].max()
lbl
a 3
b 4
In your case you would also have to split your label into two parts first; it's better to keep them in separate columns.
Update:
Once you split your label into bucket and sub_bucket, you can return the max values like this:
dfg = df.groupby("main_bucket")
dfm = dfg.max()
res = dfm.reset_index()
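A sketch of that split step, assuming '/'-separated keys (the column names and values here are invented):

```python
import pandas as pd

# invented keys: <main bucket>/<sub bucket>/<file>
df = pd.DataFrame({
    'key': ['7100/a/file1', '7100/b/file2', '7144/a/file3'],
    'last_modified': ['2020-01-01', '2020-03-01', '2020-02-01'],
})

# split the key into separate main_bucket and sub_bucket columns
parts = df['key'].str.split('/', expand=True)
df['main_bucket'] = parts[0]
df['sub_bucket'] = parts[1]

# most recent modification per main bucket
res = df.groupby('main_bucket')['last_modified'].max().reset_index()
print(res)
```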
How can I load, or create a loop to load, my data files named with the same pattern but different numbers, {a}dhs{b} (e.g. 1dhs2 or 3dhs4), using NumPy or pygeostat, and then extract the value of the variable to use it? I have 10,000 files. Thanks :)
df1_1 = gs.DataFile('1dhs1.out')
df1_2 = gs.DataFile('1dhs2.out')
...
df3_2 = gs.DataFile('3dhs2.out')
value1_1 = df1_1['value']
value1_2 = df1_2['value']
value1_3 = df1_3['value']
value1_4 = df1_4['value']
...
value4_3 = df4_3['value']
value4_4 = df4_4['value']
Something like this?
import os
import re
# Regular expression for the expected file name: <first number>dhs<second number>.out
FILENAME_REGEX = re.compile(r'(\d+)dhs(\d+)\.out')
folder_path = 'some path' # path of the folder containing the data files
values = {} # dictionary to collect extracted values
# Loop over all the files in the folder_path
for f in next(os.walk(folder_path))[2]:
    # if the name of the file does not match the template, go to the next one
    m = FILENAME_REGEX.match(f)
    if m is None:
        continue
    # extract relevant numbers from the file name
    first_int, second_int = list(map(int, m.groups()))
    # store the values in a two-level dictionary, using the extracted numbers as keys
    values.setdefault(first_int, {})[second_int] = gs.DataFile(os.path.join(folder_path, f))['value']
# Print values extracted from 3dhs2.out
print(values[3][2])
I am a novice at coding. Can someone help with a script in Python, using h5py, that reads all the directories and sub-directories and merges multiple h5 files into a single h5 file?
What you need is a list of all datasets in the file. I think that the notion of a recursive function is what is needed here. This would allow you to extract all 'datasets' from a group, but when one of them appears to be a group itself, recursively do the same thing until all datasets are found. For example:
/
|- dataset1
|- group1
|  |- dataset2
|  |- dataset3
|- dataset4
Your function should in pseudo-code look like:
def getdatasets(key, file):
    out = []
    for name in file[key]:
        path = join(key, name)
        if file[path] is dataset: out += [path]
        else: out += getdatasets(path, file)
    return out
For our example:
/dataset1 is a dataset: add the path to the output, giving
out = ['/dataset1']
/group1 is not a dataset: call getdatasets('/group1', file)
/group1/dataset2 is a dataset: add the path to the output, giving
nested_out = ['/group1/dataset2']
/group1/dataset3 is a dataset: add the path to the output, giving
nested_out = ['/group1/dataset2', '/group1/dataset3']
This is added to what we already had:
out = ['/dataset1', '/group1/dataset2', '/group1/dataset3']
/dataset4 is a dataset: add the path to the output, giving
out = ['/dataset1', '/group1/dataset2', '/group1/dataset3', '/dataset4']
This list can be used to copy all data to another file.
To make a simple clone you could do the following.
import h5py
import numpy as np
# function to return a list of paths to each dataset
def getdatasets(key, archive):
    if key[-1] != '/': key += '/'
    out = []
    for name in archive[key]:
        path = key + name
        if isinstance(archive[path], h5py.Dataset):
            out += [path]
        else:
            out += getdatasets(path, archive)
    return out
# open HDF5-files
data = h5py.File('old.hdf5','r')
new_data = h5py.File('new.hdf5','w')
# read as many datasets as possible from the old HDF5 file
datasets = getdatasets('/',data)
# get the group-names from the lists of datasets
groups = list(set([i[::-1].split('/',1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i)>0]
# sort groups based on depth
idx = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]
# create all groups that contain dataset that will be copied
for group in groups:
    new_data.create_group(group)
# copy datasets
for path in datasets:
    # - get group name
    group = path[::-1].split('/',1)[1][::-1]
    # - minimum group name
    if len(group) == 0: group = '/'
    # - copy data
    data.copy(path, new_data[group])
Further customizations are, of course, possible depending on what you want. You describe some combination of files; in that case you would have to open the output file in append mode:
new_data = h5py.File('new.hdf5','a')
and probably add something to the path.
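For example, to merge several files without name clashes, each source file's contents could go under a group named after the file. The sketch below fabricates two tiny input files first so it is self-contained; the file and dataset names are placeholders:

```python
import h5py
import numpy as np

# fabricate two tiny source files (placeholders for the real ones)
with h5py.File('part1.hdf5', 'w') as f:
    f['d1'] = np.arange(3)
with h5py.File('part2.hdf5', 'w') as f:
    f.create_group('g')['d2'] = np.ones(2)

sources = ['part1.hdf5', 'part2.hdf5']
with h5py.File('merged.hdf5', 'a') as new_data:
    for fname in sources:
        with h5py.File(fname, 'r') as data:
            # one group per source file keeps dataset names from colliding
            dest = new_data.require_group(fname.rsplit('.', 1)[0])
            for name in data:
                # Group.copy is recursive, so nested groups come over whole
                data.copy(name, dest)
```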
I read from a file and stored it into artists_tags with column names.
Now this file has multiple columns, and I need to generate a new data structure that has two columns from artists_tags as they are, plus the most frequent value from the 'Tag' column as the third column.
Here is what I have written as of now:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df

def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print(a)
    top_tags = all_tags.ArtistID, all_tags.ArtistName, a
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method, calculate_top_tag, I don't understand how to choose the most frequent value from the 'Tag' column and put it in as the third column of top_tags before returning it.
I am new to Python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value in a list, but they seem to return the entire column and not one particular value. I know this is some trivial syntax issue, but after having searched for a long time I still cannot figure it out.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts.sum().sort_values('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
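On current pandas, where sort and .ix have been removed, the same lookup can be written a bit more succinctly with idxmax over the summed counts (toy rows below stand in for the real files):

```python
import pandas as pd

# toy rows shaped like artists_tags
artists_tags = pd.DataFrame({
    'ArtistID': ['nirvana', 'nirvana', 'abba'],
    'ArtistName': ['Nirvana', 'Nirvana', 'ABBA'],
    'Tag': ['Grunge', 'Rock', 'Pop'],
    'Count': [30, 10, 20],
})

# sum the counts per (artist, tag) pair
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'], as_index=False)['Count'].sum()
# keep the row with the largest summed Count per artist
top_tags = tag_counts.loc[tag_counts.groupby('ArtistID')['Count'].idxmax()].set_index('ArtistID')
print(top_tags.loc['nirvana', 'Tag'])  # Grunge
```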