I have a DataFrame I receive from a crawler that I am importing into a database for long-term storage.
The problem I am running into is that many of the dataframes have column names containing uppercase letters and whitespace.
I have a fix for it, but I was wondering if it can be done any cleaner than this:
def clean_columns(dataframe):
    for column in dataframe:
        dataframe.rename(columns={column: column.lower().replace(" ", "_")},
                         inplace=True)
    return dataframe
print(dataframe.columns)
Index(['Daily Foo', 'Weekly Bar'], dtype='object')
dataframe = clean_columns(dataframe)
print(dataframe.columns)
Index(['daily_foo', 'weekly_bar'], dtype='object')
You can do it via the columns attribute:
df.columns = df.columns.str.lower().str.replace(' ', '_')
or via the rename() method:
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
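For example, a minimal sketch of the vectorized approach on a toy frame with the column names from the question:

import pandas as pd

df = pd.DataFrame(columns=['Daily Foo', 'Weekly Bar'])

# Vectorized string methods on the column Index, no loop or rename() needed
df.columns = df.columns.str.lower().str.replace(' ', '_')
print(df.columns)  # Index(['daily_foo', 'weekly_bar'], dtype='object')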
I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DFs and exports them to CSV files that will be easy for me to deal with in Power BI.
I've started with a function that takes my DFs and cleans the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = pd.to_datetime(x['periode_facture'], format='%Y%m')
If I don't cast 'x' as a DataFrame, it doesn't work, but that's not my problem.
As you can see, I have set up a list variable which I would like to fill with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so... there it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframes first, using df.name (probably when you are creating them / reading data into them).
Then you can access the name like a normal attribute:
import pandas as pd

df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can then use:
df.to_csv(f'{df.name}.csv', encoding='utf-8')
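Putting it together for several frames, a sketch (the file names here are hypothetical):

import pandas as pd

# Hypothetical input files; substitute your crawler's output
for path in ['foo.csv', 'bar.csv']:
    df = pd.read_csv(path)
    df.name = path.rsplit('.', 1)[0]  # 'foo', 'bar'
    df.to_csv(f'{df.name}_export.csv', encoding='utf-8')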
I am writing the pandas mainTable dataframe to mainTable.csv, but after the file is written the name of the index column is missing.
Why does this happen, since I have specified index=True?
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True)
mainTable = pd.read_csv('mainTable.csv')
print(mainTable.columns)
MacBook-Pro:Completed iliebogdanbarbulescu$ python map.py
Index(['Unnamed: 0', 'name', 'iso_a3', 'geometry', 'iso_code', 'continent'], dtype='object')
Save with index_label='Index_name', since by default index_label=None.
See the documentation for pandas' .to_csv() method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True, index_label='Index_name')
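A sketch of the full round trip; passing index_col on the way back in restores the index instead of producing an 'Unnamed: 0' column:

mainTable.to_csv('mainTable.csv', index=True, index_label='Index_name')

# Read it back, telling pandas which column holds the index
mainTable = pd.read_csv('mainTable.csv', index_col='Index_name')
print(mainTable.columns)  # 'Unnamed: 0' is gone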
I am new to Spark.
I have a DataFrame, and I used the following command to group it by 'userid':
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
It gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
326033430 is the group key.
My question is: how can I convert this RDD back to a DataFrame structure? If I cannot do that, how can I get values from the Row objects?
Thank you.
You should just:

from pyspark.sql.functions import collect_list

high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('col'))

and in the .agg method pass whatever you want to do with the rest of the data.
Follow this link : http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
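A minimal, self-contained sketch of the same idea (the toy schema and values are assumptions based on the Row shown in the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# Toy rows mimicking the question's schema
df = spark.createDataFrame(
    [(326033430, '114233866', 1127.0),
     (326033430, '114233867', 2150.0)],
    ['userid', 'poiid', 'outmoney'])

# Stay in the DataFrame API: filter, group, then collect the grouped values
high_volume = (df.filter(df.outmoney >= 1000)
                 .groupBy('userid')
                 .agg(collect_list('poiid').alias('poiids')))
high_volume.show()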
How do I get the name of a DataFrame and print it as a string?
Example:
boston (var name assigned to a csv file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % boston)
You can name the dataframe with the following, and then call the name wherever you like:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'

print(df.name)
>>>
Ones
Sometimes df.name doesn't work; you might get the error message:
'DataFrame' object has no attribute 'name'
Try the function below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
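For example, with the boston frame from the question (assuming boston.csv exists and everything lives in one script, since the lookup goes through globals()):

import pandas as pd

boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % get_df_name(boston))
# The winner is team A based on the boston table.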
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
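The label then works anywhere you would have wanted the "name", e.g. on export:

import pandas as pd

dfs = {'boston': pd.read_csv('boston.csv')}  # hypothetical label and file

for name, frame in dfs.items():
    frame.to_csv(f'{name}.csv', encoding='utf-8')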
From here, what I understand is that DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0    0.541
1   -1.175
2    0.129
3    0.043
4   -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
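The columns of a DataFrame are themselves Series, with name set to the column label, which is why the attribute lives at that level:

In [30]: df = pd.DataFrame({'a': [1, 2]})

In [31]: df['a'].name
Out[31]: 'a'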
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
df
DataFrames don't have names, but there is an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name"  # can be retrieved later
Attributes are retained through some operations.
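A minimal sketch (attrs is experimental, and which operations propagate it depends on your pandas version):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.attrs['name'] = 'My name'

subset = df.head(2)              # attrs survive some operations
print(subset.attrs.get('name'))  # 'My name' on recent pandas versions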
Here is a sample function:
`df.name = file`: sixth line in the code below
def df_list():
    filename_list = current_stage_files(PATH)
    df_list = []
    for file in filename_list:
        df = pd.read_csv(PATH + file)
        df.name = file
        df_list.append(df)
    return df_list
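A possible follow-up use, assuming current_stage_files and PATH are defined as in the snippet:

for df in df_list():
    print(df.name)  # e.g. 'sales.csv'
    df.to_csv(f'cleaned_{df.name}', encoding='utf-8')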
I am working on a module for feature analysis and I had the same need as yours, as I would like to generate a report with the name of the pandas.DataFrame being analyzed. To solve this, I used the same solution presented by #scohe001 and #LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect

def aux_retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term, since I intend to call it from another function:
def header_generator(df):
    print('--------- Feature Analyzer ----------')
    print('Dataframe name: "{}"'.format(aux_retrieve_name(df)[0]))
    print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
    return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB
I have just discovered pandas and am impressed by its capabilities.
I am having difficulties understanding how to work with a DataFrame with a MultiIndex.
I have two questions :
(1) Exporting the DataFrame
Here is my problem. Take this dataset:
import pandas as pd
from io import StringIO

d1 = StringIO(
"""Gender,Employed,Region,Degree
m,yes,east,ba
m,yes,north,ba
f,yes,south,ba
f,no,east,ba
f,no,east,bsc
m,no,north,bsc
m,yes,south,ma
f,yes,west,phd
m,no,west,phd
m,yes,west,phd"""
)
df = pd.read_csv(d1)
# Frequencies tables
tab1 = pd.crosstab(df.Gender, df.Region)
tab2 = pd.crosstab(df.Gender, [df.Region, df.Degree])
tab3 = pd.crosstab([df.Gender, df.Employed], [df.Region, df.Degree])
# Now we export the datasets
tab1.to_excel('H:/test_tab1.xlsx') # OK
tab2.to_excel('H:/test_tab2.xlsx') # fails
tab3.to_excel('H:/test_tab3.xlsx') # fails
One work-around I could think of is to change the columns (the way R does):
def NewColumns(DFwithMultiIndex):
    NewCol = []
    for item in DFwithMultiIndex.columns:
        NewCol.append('-'.join(item))
    return NewCol
# New columns
tab2.columns = NewColumns(tab2)
tab3.columns = NewColumns(tab3)
# New export
tab2.to_excel('H:/test_tab2.xlsx') # OK
tab3.to_excel('H:/test_tab3.xlsx') # OK
My question is: is there a more efficient way to do this in pandas that I missed in the documentation?
(2) Selecting columns
This new structure does not allow selecting columns on a given variable (the advantage of hierarchical indexing in the first place). How can I select columns containing a given string (e.g. '-ba')?
P.S.: I have seen this question, which is related, but I have not understood the proposed reply.
This looks like a bug in to_excel; for the moment, as a workaround, I would recommend using to_csv (which does not seem to show this issue).
I added this as an issue on GitHub.
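As an aside, the flattening step itself can be written more compactly with Index.map, equivalent to the NewColumns helper above:

# Joins each ('Region', 'Degree') tuple into a single 'Region-Degree' string
tab2.columns = tab2.columns.map('-'.join)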
To answer the second question, if you really need to use to_excel...
You can use filter to select only those columns which include '-ba':
In [21]: list(filter(lambda x: '-ba' in x, tab2.columns))
Out[21]: ['east-ba', 'north-ba', 'south-ba']

In [22]: tab2[list(filter(lambda x: '-ba' in x, tab2.columns))]
Out[22]:
        east-ba  north-ba  south-ba
Gender
f             1         0         1
m             1         1         0
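And if you keep the original MultiIndex (i.e., before the flattening workaround), you can select by level value directly with the standard xs method, a sketch:

# All columns whose 'Degree' level equals 'ba', no string matching needed
tab2.xs('ba', axis=1, level='Degree')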