Can't Create pandas DataFrame in python (Wrong Shape) - python

I'm trying to create the following data frame
new_df = pd.DataFrame(data = percentage_default, columns =
df['purpose'].unique())
The variables I'm using are as follows
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
df['purpose'].unique = array(['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational'], dtype=object)
When I try to create this data frame I get the following error:
Shape of passed values is (1, 7), indices imply (7, 7)
To me it seemed like the shape of the values and indices were the same. Could someone explain what I'm missing here?
Thanks!

You're creating a dataframe from a list. Calling pd.DataFrame(your_list), where your_list is a simple homogeneous list, creates one row for every element in that list. For your input:
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
pandas will create a dataframe like this:
Column
0.15238817285822592
0.11568938193343899
0.16602316602316602
0.17011128775834658
0.2778675282714055
0.11212814645308924
0.20116618075801748
Because of this, your dataframe has only one column. You're passing seven column names for that single column, so the shapes don't match and pandas raises the error.
If you wish to create a dataframe from a list with multiple columns, you need to nest more lists or tuples inside your original list. Each nested tuple/list will become a row in the dataframe, and each element in the nested tuple/list will become a new column. See this:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)] # nested tuple
We have one nested tuple in this list, so our dataframe will have 1 row with n columns, where n is the number of elements in the nested tuple (7). We can then pass your 7 column names:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)]
col_names = ['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational']
new_df = pd.DataFrame(percentage_default, columns = col_names)
print(new_df)
debt_consolidation credit_card all_other home_improvement \
0 0.152388 0.115689 0.166023 0.170111
small_business major_purchase educational
0 0.277868 0.112128 0.201166
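A quick way to see the difference is to compare the shapes pandas infers from a flat list and from a nested one. The sketch below uses shortened, illustrative values and only three of the column names; wrapping the flat list in another list is one simple way to turn it into a single row.

```python
import pandas as pd

# Shortened illustrative values; the list in the question has 7 entries.
percentage_default = [0.152, 0.116, 0.166]
col_names = ["debt_consolidation", "credit_card", "all_other"]

# A flat list becomes one column with one row per element.
flat_df = pd.DataFrame(percentage_default)
print(flat_df.shape)  # (3, 1)

# Wrapping the list in another list makes it a single row,
# so the number of columns now matches the number of names.
row_df = pd.DataFrame([percentage_default], columns=col_names)
print(row_df.shape)  # (1, 3)
```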

Try restructuring your data as a dictionary that maps each column name to its value:
percentage_default = {
'debt_consolidation': 0.15238817285822592,
'credit_card': 0.11568938193343899,
...
}
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
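One caveat with the dictionary form: when every value is a scalar, pandas cannot infer how many rows to build, so the dictionary has to be wrapped in a list (or an index has to be supplied). A minimal sketch, using only the two entries that are spelled out in the snippet:

```python
import pandas as pd

percentage_default = {
    "debt_consolidation": 0.15238817285822592,
    "credit_card": 0.11568938193343899,
}

# A list containing one dict yields one row; the keys become column names.
new_df = pd.DataFrame([percentage_default])
print(new_df.shape)  # (1, 2)
```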

Related

convert list from data series into a suitable data frame

I have a list of data series that looks something like this:
list = np.array([[0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332], [0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332], [0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332]])
I need to convert it to a pandas DataFrame that has the dimension 3x3, so that each data series gets one row and one column.
With the following code, you only get a DataFrame of shape (3, 10):
df = pd.DataFrame(list)
Try this. I would not recommend using a built-in name like "list" as a variable name.
Shadowing it might cause confusion, if not errors.
df = pd.DataFrame(list).transpose()
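To check what the transpose does to the shape, here is a small sketch with hypothetical stand-in data (3 series of 10 values each) and a variable name that does not shadow the built-in:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 3 series of 10 values each.
series_data = np.arange(30, dtype=float).reshape(3, 10)

df = pd.DataFrame(series_data)
print(df.shape)    # (3, 10): one row per series

df_t = df.T        # .T is shorthand for .transpose()
print(df_t.shape)  # (10, 3): one column per series
```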

Creating a new column in Dataframe based on multiple lists

I'm trying to create a new column 'BroadCategory' within a dataframe based on whether values within another column called 'Venue Category' within the data occur in specific lists. I have 5 lists that I am using to fill in the values in the new column
For example:
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Bar),'Bar','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Museum_ArtGallery),'Museum/Art Gallery','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Public_Transport),'Public Transport','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Restaurant_FoodVenue),'Restaurant/Food Venue','Other')
I ultimately want the values in VenueCategory column occurring in the list Bar to be labeled 'Bar' and those occurring in the list Museum_ArtGallery to be labeled 'Museum_ArtGallery', etc. My code above doesn't accomplish this.
I tried this in order to keep the values I had previously filled but it's still overwriting the values I had filled in based on my previous conditions:
df['BroadCategory'] = np.where(df[df.VenueCategory!='Other'].isin(Entertainment_Venue),'Entertainment Venue','Other')
How can I fill the column BroadCategory with the specific values based on whether the values in the VenueCategory column occur in the specified lists Bar, Restaurant, Public_Transport, Museum_ArtGallery, etc?
Suppose your data looks like this:
df=pd.DataFrame({'VenueCategory':['drink','wine','MOMA','MTA','sushi','Hudson']})
Bar=['drink','wine','alcohol']
Museum_ArtGallery=['MOMA','MCM']
Public_Transport=['MTA','MBTA']
Restaurant_FoodVenue=['sushi','chicken']
prepare a dictionary:
from collections import defaultdict
d=defaultdict(lambda:'other')
d.update({x:'Bar' for x in Bar})
d.update({x:'Museum_ArtGallery' for x in Museum_ArtGallery})
d.update({x:'Public_Transport' for x in Public_Transport})
d.update({x:'Restaurant_FoodVenue' for x in Restaurant_FoodVenue})
build new column and print result:
df['BroadCategory']=df['VenueCategory'].apply(lambda x:d[x])
df
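A variation on the same idea, sketched with a plain dict instead of a defaultdict: Series.map leaves unmatched values as NaN, and fillna supplies the fallback label. The `categories` and `lookup` names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"VenueCategory": ["drink", "wine", "MOMA", "MTA", "sushi", "Hudson"]})

# Hypothetical container: broad category -> list of venue values.
categories = {
    "Bar": ["drink", "wine", "alcohol"],
    "Museum_ArtGallery": ["MOMA", "MCM"],
    "Public_Transport": ["MTA", "MBTA"],
    "Restaurant_FoodVenue": ["sushi", "chicken"],
}

# Invert the mapping so each venue value points to its broad category.
lookup = {venue: label for label, venues in categories.items() for venue in venues}

# Values missing from the lookup become NaN, then fall back to "Other".
df["BroadCategory"] = df["VenueCategory"].map(lookup).fillna("Other")
print(df)
```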
venue_list = [
    ['Bar', Bar],
    ['Museum_ArtGallery', Museum_ArtGallery],
    # etc.
]
venue_lookup = pd.concat([
    pd.DataFrame({
        'BroadCategory': venue[0],
        'VenueCategory': venue[1]}) for venue in venue_list]
)
pd.merge(df, venue_lookup, how='left', on='VenueCategory')
Your solution is already close. To avoid overwriting previously set values, select only the matching rows and assign the new value to that subset.
To do that, first initialize the new column BroadCategory to 'Other'. Then, for each category, select the matching rows with a Boolean mask from the .isin() function, as you are doing now, and assign through .loc so the value is written to the original dataframe rather than to a temporary copy. The code looks like this:
df['BroadCategory'] = 'Other'
df.loc[df['VenueCategory'].isin(Bar), 'BroadCategory'] = 'Bar'
df.loc[df['VenueCategory'].isin(Museum_ArtGallery), 'BroadCategory'] = 'Museum/Art Gallery'
df.loc[df['VenueCategory'].isin(Public_Transport), 'BroadCategory'] = 'Public Transport'
df.loc[df['VenueCategory'].isin(Restaurant_FoodVenue), 'BroadCategory'] = 'Restaurant/Food Venue'
df.loc[df['VenueCategory'].isin(Entertainment_Venue), 'BroadCategory'] = 'Entertainment Venue'
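If you would rather build the column in a single step, NumPy's np.select is another option. This is a different technique from the masked assignment above, sketched here with small made-up lists: each condition is paired with a label, the first matching condition wins, and rows matching none receive the default.

```python
import numpy as np
import pandas as pd

# Small made-up sample; the lists stand in for the ones in the question.
df = pd.DataFrame({"VenueCategory": ["drink", "MOMA", "MTA", "Hudson"]})
Bar = ["drink", "wine"]
Museum_ArtGallery = ["MOMA", "MCM"]
Public_Transport = ["MTA", "MBTA"]

conditions = [
    df["VenueCategory"].isin(Bar),
    df["VenueCategory"].isin(Museum_ArtGallery),
    df["VenueCategory"].isin(Public_Transport),
]
labels = ["Bar", "Museum/Art Gallery", "Public Transport"]

# The first matching condition wins; rows matching none get the default.
df["BroadCategory"] = np.select(conditions, labels, default="Other")
print(df)
```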

How to drop element from a list inside a pandas column in Python?

I have a column in a dataframe whose values are lists. My dataframe column is:
[],
['NORM'],
['NORM'],
['NORM'],
['NORM'],
['MI', 'STTC'],
As you can see, I have an empty list and also a list with two elements. How can I change the list with two elements to keep just one of them (I don't care which one)?
I tried df.column.explode(), but this just adds more rows and I don't want more rows; I just need to take one of the elements.
Thank you so much
You can use Series.map with a custom mapping function which maps the elements of column according to desired requirements:
df['col'] = df['col'].map(lambda l: l[:1])
Result:
# print(df['col'])
0 []
1 [NORM]
2 [NORM]
3 [NORM]
4 [NORM]
5 [MI]
Here i, j is the location of the cell you need to access; this gives the first element of the list:
list_ = df.loc[i, j]
if len(list_) > 0:
    print(list_[0])
As you store lists in a pandas column, I assume that you are not concerned about vectorization. So you could just use a list comprehension:
df[col] = [i[:1] for i in df[col]]
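A minimal check of the slicing approach on a small made-up column: `l[:1]` keeps at most the first element and, unlike `l[0]`, does not fail on an empty list.

```python
import pandas as pd

# Made-up sample with an empty list, a one-item list, and a two-item list.
df = pd.DataFrame({"col": [[], ["NORM"], ["MI", "STTC"]]})

# Slicing keeps at most the first element and is safe on empty lists.
df["col"] = df["col"].map(lambda l: l[:1])
print(df["col"].tolist())  # [[], ['NORM'], ['MI']]
```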

Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')

I want to apply the .nunique() function to a full dataFrame.
In the following screenshot, we can see that it contains 130 features. [Screenshot: shape and columns of the dataframe]
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total/data.shape[0]*100)
    return pd.concat([total, percent], axis=1, keys=['Total','Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
The code fails at the first line with the following error, which I don't know how to solve ("unhashable type : 'list'", 'occured at index columns'):
[Image: trace of the error]
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
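If the dataframe already exists with list values, one way to apply this is to convert the offending column to tuples before counting. A minimal sketch with made-up column names (`id`, `tags`):

```python
import pandas as pd

# Made-up data: the "tags" column holds lists, which are unhashable.
df = pd.DataFrame({"id": [0, 1, 2], "tags": [[1, 2], [2, 3], [1, 2]]})

# Convert each list to a tuple so the values can be hashed and counted.
df["tags"] = df["tags"].map(tuple)

counts = df.nunique()
print(counts["id"], counts["tags"])  # 3 2
```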
To get nunique or unique for a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: It does no harm if the column holds both lists and plain strings; explode leaves non-list values unchanged. Nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I do not want to explode the items,
# If the col values are lists of strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# Use a lambda with str() if the list items are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases (this needs import json).
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from a dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation in nested JSON results). Here is sample code to identify the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:  # unhashable values such as lists or dicts
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns for which .nunique() can be calculated
Columns that raise an error when running .nunique()
Then just calculate .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series).
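As a rough illustration of that last step, the sketch below expands a column of dicts into separate columns with .apply(pd.Series) and then counts unique values. The column names (`id`, `meta`) and the data are made up, and this assumes every cell in the column is a flat dict.

```python
import pandas as pd

# Made-up data: "meta" holds dicts, as often happens with nested JSON.
df = pd.DataFrame({
    "id": [1, 2],
    "meta": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
})

# Each dict becomes a row of a new dataframe; its keys become columns.
expanded = df["meta"].apply(pd.Series)

# Replace the dict column with the expanded columns.
flat = pd.concat([df.drop(columns="meta"), expanded], axis=1)
print(flat.nunique())
```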

For Looping error in pyspark

I am facing the following problem:
I have a list which I need to compare with the elements of a column in a dataframe(acc_name). I am using the following looping function but it only returns me 1 record when it should provide me 30.
Using Pyspark
bs_list = ['AC_E11','AC_E12','AC_E13','AC_E135','AC_E14','AC_E15','AC_E155','AC_E157',
    'AC_E16','AC_E163','AC_E165','AC_E17','AC_E175','AC_E180','AC_E185', 'AC_E215','AC_E22','AC_E225','AC_E23','AC_E23112','AC_E235','AC_E245','AC_E258','AC_E25','AC_E26','AC_E265','AC_E27','AC_E275','AC_E31','AC_E39','AC_E29']
for i in bs_list:
    bs_acc1 = (acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name,acc.acc_description)
    )
The bs_list elements are a subset of the acc_name column. I am trying to create a new DF with the following 2 columns: acc_name and acc_description. It should only contain the rows whose acc_name is present in bs_list.
Please let me know where I am going wrong?
That's because every time the loop filters on i, it creates a new dataframe bs_acc1 that replaces the previous one. So it ends up showing only 1 row, the one belonging to the last value in bs_list, i.e. the row for 'AC_E29'.
One way to do it is to repeatedly union the dataframe with itself, so previous results also remain in the dataframe, like this:
# create an empty dataframe; give it a schema appropriate to your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name,acc.acc_description)
    )
A better way is to avoid the loop altogether:
from pyspark.sql.functions import *
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
You can also transform bs_list into a dataframe with a column acc_name and then join it to the acc dataframe.
from pyspark.sql import Row
bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(**{'acc_name':x})).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
