How do I get the name of a DataFrame and print it as a string?
Example:
boston (the variable name assigned to a CSV file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % boston)
You can name the dataframe with the following, and then call the name wherever you like:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
>>>
Ones
Sometimes df.name doesn't work, and you might get this error message:
'DataFrame' object has no attribute 'name'
Try the function below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
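For example, with the boston dataframe from the question (note this lookup only sees global variables, and it returns the first name found if several variables reference the same object):

boston = pd.read_csv('boston.csv')
print(get_df_name(boston))  # prints: boston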
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that attributes attached to pandas objects may not serialize, so pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
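For instance, a small sketch of how the label then stands in for the "name" (the CSV file here is hypothetical):

import pandas as pd

dfs = {'boston': pd.read_csv('boston.csv')}  # hypothetical file
for name, frame in dfs.items():
    print('The winner is team A based on the %s table.' % name)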
From what I understand, DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0 0.541
1 -1.175
2 0.129
3 0.043
4 -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
df
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
Attributes are retained through some operations.
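A short sketch of the attrs round trip; whether attrs survive a given operation depends on the pandas version, so treat propagation as best-effort:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.attrs['name'] = 'My name'
subset = df[df['a'] > 1]     # attrs are carried along here
print(subset.attrs['name'])  # My name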
Here is a sample function; note the `df.name = file` assignment on the sixth line of the code below:
def df_list():
    filename_list = current_stage_files(PATH)
    df_list = []
    for file in filename_list:
        df = pd.read_csv(PATH + file)
        df.name = file
        df_list.append(df)
    return df_list
I am working on a module for feature analysis and I had the same need: I wanted to generate a report with the name of the pandas.DataFrame being analyzed. To solve this, I used the same solution presented by @scohe001 and @LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect

def aux_retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
    print('--------- Feature Analyzer ----------')
    # aux_retrieve_name returns a list of matching names; take the first
    print('Dataframe name: "{}"'.format(aux_retrieve_name(df)[0]))
    print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB
I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that would take the names of my DFs and export them to CSV files that would be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(
                lambda col: pd.to_datetime(col, format='%Y%m'))
If I don't set 'x' as a DataFrame, it doesn't work but that's not my problem.
As you can see, I have set a list variable which I would like to increment with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet so... There it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you are creating it / reading data into it).
Then you can access the name like a normal attribute:
import pandas as pd

df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
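Alternatively, since df.name is a custom attribute that some pandas operations silently drop, a dictionary keyed by name is a sturdier way to drive the export loop (a sketch; the input file names are hypothetical):

import pandas as pd

dfs = {}
for path in ['factures.csv', 'clients.csv']:  # hypothetical inputs
    dfs[path.split('.')[0]] = pd.read_csv(path)

for name, frame in dfs.items():
    frame.to_csv(f'{name}.csv', encoding='utf-8')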
I'm running a for loop using pandas that checks whether another DataFrame with the same name has been created. If it has, the values are appended to the corresponding columns. If it has not, the df is created and the values appended to the named columns.
dflistglobal = []

####
# For loop that generates a, b, and c variables every time it runs.
####

###
# The following code runs inside the for loop, so every time it runs it
# generates a, b, and c, then checks whether a df has been created with a
# specific name. If yes, it should append the values to that "listname";
# if not, it should create a new df named "listname". The name changes
# every time the code runs, and it can repeat during this for loop.
###
if listname not in dflistglobal:
    dflistglobal.append(listname)
    listname = pd.DataFrame(columns=['a', 'b', 'c'])
    listname = listname.append({'a': a, 'b': b, 'c': c}, ignore_index=True)
else:
    listname = listname.append({'a': a, 'b': b, 'c': c}, ignore_index=True)
I am getting the following error:
File "test.py", line 150, in <module>
functiontest(image, results, list)
File "test.py", line 68, in funtiontest
listname = listname.append({'a':a, 'b':b, 'c':c}, ignore_index=True)
AttributeError: 'str' object has no attribute 'append'
The initial if statement runs fine, but the else statement causes problems.
Solved this issue by not using pandas dataframes. Inside the for loop I generated a unique identifier for each listname, then appended a, b, c, and listname to a single list. At the end you end up with one large df that can be filtered with the groupby function.
Not sure if this will be helpful for anyone, but avoiding creating many pandas dfs and using a list instead was the best approach.
That error tells you that listname is a string (and you cannot append a DataFrame to a string).
You may want to check if somewhere in your code you are adding a string to your list dflistglobal.
EDIT: Possible solution
I'm not sure how you are naming your DataFrames, and I don't see how you can access them afterwards.
Instead of using a list, you can store your DataFrames inside a dictionary, e.g. df_dict = {"name": df}. This lets you easily access each DataFrame by name.
import pandas as pd
import random

df_dict = {}

# For loop
for _ in range(10):
    # Logic to get your variables (example)
    a = random.randint(1, 10)
    b = random.randint(1, 10)
    c = random.randint(1, 10)
    # Logic to get your DataFrame name (example)
    df_name = f"dataframe{random.randint(1, 10)}"
    if df_name not in df_dict:
        # DataFrame with the same name does not exist yet
        new_df = pd.DataFrame(columns=['a', 'b', 'c'])
        new_df = new_df.append({'a': a, 'b': b, 'c': c}, ignore_index=True)
        df_dict[df_name] = new_df
    else:
        # DataFrame with the same name already exists
        updated_df = df_dict[df_name].append({'a': a, 'b': b, 'c': c}, ignore_index=True)
        df_dict[df_name] = updated_df
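Note that DataFrame.append was removed in pandas 2.0; on current versions both branches above can be collapsed into one pd.concat call, for example:

# Works whether or not df_name is already in df_dict
row = pd.DataFrame([{'a': a, 'b': b, 'c': c}])
existing = df_dict.get(df_name, pd.DataFrame(columns=['a', 'b', 'c']))
df_dict[df_name] = pd.concat([existing, row], ignore_index=True)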
Also, for more info, you may want to visit this question
I hope it was clear and it helps.
I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
from bson import ObjectId
Here's a minimal example to show the situation:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return ObjectIdType()
And was able to get a working pyarrow array for my oid column:
objectid_type = ObjectIdType()
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
pa.ExtensionArray.from_storage(objectid_type, storage_array)
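For completeness: to round-trip such data, the extension type also has to be registered with pyarrow, otherwise files read back only expose the raw storage type:

# Register once per process so Arrow can reconstruct the extension
# type from the metadata stored alongside the data.
pa.register_extension_type(ObjectIdType())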
Now where I’m stuck, and cannot find any good solution on the internet, is how to save my df to parquet, letting it interpret which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet file from dataframes and restore them while transparently converting the types ?
I tried to create a pyarrow.Table object, and append columns to it after preprocessing, but it doesn’t work as table.append_column takes binary columns and not pyarrow.Arrays, plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.items():
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )
    elif isinstance(values.iloc[0], ...):
        ...
    else:
        arr = pa.array(values)
    table.append_column(arr, col)  # FAILS (wrong type)
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I'm getting lost in pyarrow's docs on whether I should use ExtensionType, serialization, or something else to write these functions. Any pointer would be appreciated.
Side note: I do not need parquet per se; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.
I think it is probably because ObjectId is not a type pyarrow knows how to convert, hence the exception during Arrow type inference.
I tried the example you provided, cast the oid values to strings during dataframe creation, and it worked.
Check the steps below:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1
output:
name oid
0 alice ObjectId('5e9992543bfddb58073803e7')
1 bob ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DF with the columns converted to compatible types, using a switch-case pattern to choose what type to convert column to (or whether to leave it as is).
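For instance, here is a minimal sketch of that idea, lining up with the parquetize/unparquetize pseudocode from the question; the conversion mapping and the hex encoding are illustrative assumptions, not pyarrow API:

from bson import ObjectId
import pandas as pd

# Hypothetical mapping: column name -> (to_storage, from_storage) converters.
my_custom_types_conversions = {
    'oid': (lambda v: v.binary.hex(), lambda s: ObjectId(bytes.fromhex(s))),
}

def parquetize(df, path, conversions):
    out = df.copy()
    for col, (to_storage, _) in conversions.items():
        if col in out.columns:
            out[col] = out[col].map(to_storage)
    out.to_parquet(path)

def unparquetize(path, conversions):
    df = pd.read_parquet(path)
    for col, (_, from_storage) in conversions.items():
        if col in df.columns:
            df[col] = df[col].map(from_storage)
    return df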
I have a dataset, where the second column looks like this.
FileName
892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm
2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm
1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm
Now, I want to extract the name before ".exe.asm" from the column and append it to a new list for all the rows of my dataset. I tried the following code:
import pandas as pd
df = pd.read_csv("dataset1.csv")
exekey = []
for row in df.iterrows():
    exekey.append(row[1].split('.'))
exekey
This execution gave me the following error:
AttributeError: 'Series' object has no attribute 'split'
I am not able to do it. Please help
Split the filename on '.' and take the first element by indexing.
import pandas as pd

df = pd.DataFrame({'FileName': ['892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm',
                                '2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm',
                                '1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm']})
exekey = [i.split(".")[0] for i in df['FileName']]
print(exekey)
Alternate way:
exekey2 = df['FileName'].apply(lambda x: x.split(".")[0]).tolist()
Output:
['892e7c8382943342a29a6ae5a55f2272532d8e04', '2d42c1b2c33a440d165683eeeec341ebf61218a1', '1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']
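Pandas also provides vectorized string methods, so the same list can be produced without an explicit Python-level loop:

exekey3 = df['FileName'].str.split('.').str[0].tolist()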
You can use map like this to split on . and take index 0,
df['FileName'].map(lambda f : f.split('.')[0])
# Output
0 892e7c8382943342a29a6ae5a55f2272532d8e04
1 2d42c1b2c33a440d165683eeeec341ebf61218a1
2 1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed
Name: FileName, dtype: object
If you want to get a list of names you can do,
df['FileName'].map(lambda f : f.split('.')[0]).values.tolist()
# Output : ['892e7c8382943342a29a6ae5a55f2272532d8e04',
'2d42c1b2c33a440d165683eeeec341ebf61218a1',
'1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']
I have just discovered pandas and am impressed by its capabilities.
I am having difficulties understanding how to work with a DataFrame that has a MultiIndex.
I have two questions :
(1) Exporting the DataFrame
Here is my problem:
This dataset
import pandas as pd
from io import StringIO

d1 = StringIO(
"""Gender,Employed,Region,Degree
m,yes,east,ba
m,yes,north,ba
f,yes,south,ba
f,no,east,ba
f,no,east,bsc
m,no,north,bsc
m,yes,south,ma
f,yes,west,phd
m,no,west,phd
m,yes,west,phd """
)
df = pd.read_csv(d1)
# Frequencies tables
tab1 = pd.crosstab(df.Gender, df.Region)
tab2 = pd.crosstab(df.Gender, [df.Region, df.Degree])
tab3 = pd.crosstab([df.Gender, df.Employed], [df.Region, df.Degree])
# Now we export the datasets
tab1.to_excel('H:/test_tab1.xlsx') # OK
tab2.to_excel('H:/test_tab2.xlsx') # fails
tab3.to_excel('H:/test_tab3.xlsx') # fails
One work-around I could think of is to change the columns (the way R does):
def NewColumns(DFwithMultiIndex):
    NewCol = []
    for item in DFwithMultiIndex.columns:
        NewCol.append('-'.join(item))
    return NewCol
# New columns
tab2.columns = NewColumns(tab2)
tab3.columns = NewColumns(tab3)
# New export
tab2.to_excel('H:/test_tab2.xlsx') # OK
tab3.to_excel('H:/test_tab3.xlsx') # OK
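As an aside, the same flattening can be written in one line with Index.map, which behaves identically here:

tab2.columns = tab2.columns.map('-'.join)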
My question is: is there a more efficient way to do this in pandas that I missed in the documentation?
(2) Selecting columns
This new structure does not allow selecting columns on a given variable (the advantage of hierarchical indexing in the first place). How can I select columns containing a given string (e.g. '-ba')?
P.S.: I have seen this question, which is related, but I have not understood the reply proposed.
This looks like a bug in to_excel, for the moment as a workaround I would recommend using to_csv (which seems not to show this issue).
I added this as an issue on github.
To answer the second question, if you really need to use to_excel...
You can use filter to select only those columns whose name includes '-ba' (wrapping it in list() for Python 3):
In [21]: list(filter(lambda x: '-ba' in x, tab2.columns))
Out[21]: ['east-ba', 'north-ba', 'south-ba']
In [22]: tab2[list(filter(lambda x: '-ba' in x, tab2.columns))]
Out[22]:
east-ba north-ba south-ba
Gender
f 1 0 1
m 1 1 0
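Alternatively, if you keep the original MultiIndex columns instead of flattening them, DataFrame.xs can select every Degree == 'ba' column directly (a sketch using tab2 as built by pd.crosstab above, before the columns were renamed):

# All columns whose 'Degree' level equals 'ba'
tab2.xs('ba', axis=1, level='Degree')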