Assign two-dimensional array to pandas dataframe - python

I want to split a two-dimensional array into its rows (one-dimensional arrays) and assign each row to the matching row of a pandas data frame. I need help with this.
This is my pandas data frame.
Id Dept Date
100 Healthcare 2007-01-03
100 Healthcare 2007-01-10
100 Healthcare 2007-01-17
The two-dimensional array looks like this:
array([[10, 20, 30],
[40, 50, 60],
[70, 80, 90]])
The expected output is:
Id Dept Date vect
100 Healthcare 2007-01-03 [10, 20, 30]
100 Healthcare 2007-01-10 [40, 50, 60]
100 Healthcare 2007-01-17 [70, 80, 90]

You can achieve that by converting the array to a list of lists with tolist:
df['vect']=ary.tolist()
df
Out[281]:
Id Dept Date vect
0 100 Healthcare 2007-01-03 [10, 20, 30]
1 100 Healthcare 2007-01-10 [40, 50, 60]
2 100 Healthcare 2007-01-17 [70, 80, 90]
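For reference, here is a self-contained sketch of the tolist approach. The data frame and the array are rebuilt from the question, and the variable name ary follows the answer above.

```python
import numpy as np
import pandas as pd

# Data frame and array rebuilt from the question.
df = pd.DataFrame({
    "Id": [100, 100, 100],
    "Dept": ["Healthcare"] * 3,
    "Date": ["2007-01-03", "2007-01-10", "2007-01-17"],
})
ary = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# tolist() turns the 2-D array into a list of row lists, so each
# data frame row receives one inner list.
df["vect"] = ary.tolist()
print(df)
```

This only works when the array has as many rows as the data frame; otherwise pandas raises a length-mismatch error.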

Related

Pandas groupby using range and type

I have a Dataframe where I have "room_type" and "review_scores_rating" as labels
The dataframe looks like this
room_type review_scores_rating
0 Private room 98.0
1 Private room 89.0
2 Entire home/apt 100.0
3 Private room 99.0
4 Private room 97.0
I already use groupby so I also have this dataframe
review_scores_rating
room_type
Entire home/apt 11930
Hotel room 97
Private room 3116
Shared room 44
I want to create a dataframe where I have as columns the different room types and each row counts how many are in for different ranges of the rating
I was able to get to this point
count
review_scores_rating
(19.92, 30.0] 24
(30.0, 40.0] 23
(40.0, 50.0] 9
(50.0, 60.0] 97
(60.0, 70.0] 74
(70.0, 80.0] 486
(80.0, 90.0] 1701
(90.0, 100.0] 12773
But I don't know how to make it count not only by range of the score but also by room type, so that I can know, for example, how many private rooms have a review score rating between 30 and 40.
You can use a crosstab with cut:
pd.crosstab(pd.cut(df['review_scores_rating'], bins=range(0, 101, 10)),
df['room_type'])
Output:
room_type Entire home/apt Private room
review_scores_rating
(80, 90] 0 1
(90, 100] 1 3
Or groupby.count:
df.groupby(['room_type', pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))]).count()
Output:
review_scores_rating
room_type review_scores_rating
Entire home/apt (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 0
(90, 100] 1
Private room (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 1
(90, 100] 3
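A runnable sketch of the crosstab variant, using only the five sample rows shown in the question (so the counts are far smaller than in the real data). The name rating_bin is introduced here for illustration.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "review_scores_rating": [98.0, 89.0, 100.0, 99.0, 97.0],
})

# Bin the ratings into right-closed intervals (0, 10], (10, 20], ..., (90, 100].
rating_bin = pd.cut(df["review_scores_rating"], bins=range(0, 101, 10))

# Rows are rating ranges, columns are room types, cells are counts.
counts = pd.crosstab(rating_bin, df["room_type"])
print(counts)
```

Because the intervals are open on the left, a rating of exactly 0 would not fall into any bin unless include_lowest=True is passed to pd.cut.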

percentage of cells to row total python

Hi, I have a dataset that looks much like the data frame below:
#Table1 :
print("Table1: Current Table")
data = [['ALFA', 35, 47, 67, 44, 193],
['Bravo', 51, 52, 16, 8, 127],
['Charlie', 59, 75, 2, 14, 150],
['Delta', 59, 75, 2, 34, 170],
['Echo', 59, 75, 2, 14, 150],
['Foxtrot', 40, 43, 26, 27, 136],
['Golf', 35, 31, 22, 13, 101],
['Hotel', 89, 58, 24, 34, 205]]
df = pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Total'])
#df.loc[:,'Total'] = df.sum(axis=1)
print(df)
I would like to get the percentage of each cell against its row total (given in the 'Total' column), so that it looks like this:
#Table2 :
print('Table2: Expected Outcome')
data2 = [['ALFA',18.1, 24.4, 34.7, 22.8, 193],
['Bravo',40.2, 40.9, 12.6, 6.3, 127],
['Charlie',39.3, 50.0, 1.3, 9.3, 150],
['Delta',34.7, 44.1, 1.2, 20.0, 170],
['Echo',39.3, 50.0, 1.3, 9.3, 150],
['Foxtrot',29.4, 31.6, 19.1, 19.9, 136],
['Golf',34.7, 30.7, 21.8, 12.9, 101],
['Hotel',43.4, 28.3, 11.7, 16.6, 205]]
df2 = pd.DataFrame(data2, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Total']) #.round(decimals=1)
#df.loc[:,'Total'] = df.sum(axis=1)
print(df2)
I don't really mind whether the Total column changes, is recalculated, or has to be dropped in the process, but for completeness' sake it would be good to have a 'Total' column alongside the cells' percentages.
Use fast vectorized division with DataFrame.div on all columns except Objects and Total:
c = df.columns.difference(['Objects','Total'])
df[c] = df[c].div(df['Total'], axis=0).mul(100)
print(df)
Objects Column1 Column2 Column3 Column4 Total
0 ALFA 18.134715 24.352332 34.715026 22.797927 193
1 Bravo 40.157480 40.944882 12.598425 6.299213 127
2 Charlie 39.333333 50.000000 1.333333 9.333333 150
3 Delta 34.705882 44.117647 1.176471 20.000000 170
4 Echo 39.333333 50.000000 1.333333 9.333333 150
5 Foxtrot 29.411765 31.617647 19.117647 19.852941 136
6 Golf 34.653465 30.693069 21.782178 12.871287 101
7 Hotel 43.414634 28.292683 11.707317 16.585366 205
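You can also try apply, dividing each row by its own sum. Note that this yields fractions rather than percentages; multiply by 100 to match the expected table: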
df[['Column1', 'Column2', 'Column3', 'Column4']] = df[['Column1', 'Column2', 'Column3', 'Column4']].apply(lambda x: x/x.sum(), axis=1)
Output :
Table1: Current Table
Objects Column1 Column2 Column3 Column4 Total
0 ALFA 0.181347 0.243523 0.347150 0.227979 193
1 Bravo 0.401575 0.409449 0.125984 0.062992 127
2 Charlie 0.393333 0.500000 0.013333 0.093333 150
3 Delta 0.347059 0.441176 0.011765 0.200000 170
4 Echo 0.393333 0.500000 0.013333 0.093333 150
5 Foxtrot 0.294118 0.316176 0.191176 0.198529 136
6 Golf 0.346535 0.306931 0.217822 0.128713 101
7 Hotel 0.434146 0.282927 0.117073 0.165854 205
Create a new dataframe from the same data. Loop through all columns except the first (Objects) and the last (Total) using df.columns[1:-1] and compute the percentage.
df1=pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Total'])
for col in df.columns[1:-1]:
    df1[col]=(df[col]*100/df.Total)
df1
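A compact sketch of the row-percentage calculation on the first two rows of the question's data, rounded to one decimal to match the expected table. The names value_cols and pct are introduced here for illustration.

```python
import pandas as pd

# First two rows of the question's data.
data = [["ALFA", 35, 47, 67, 44, 193],
        ["Bravo", 51, 52, 16, 8, 127]]
df = pd.DataFrame(
    data,
    columns=["Objects", "Column1", "Column2", "Column3", "Column4", "Total"],
)

value_cols = ["Column1", "Column2", "Column3", "Column4"]

# Work on a copy so the original counts stay available.
pct = df.copy()
# axis=0 aligns the Total series with the row index, so every row is
# divided by its own total.
pct[value_cols] = df[value_cols].div(df["Total"], axis=0).mul(100).round(1)
print(pct)
```

Rounding each cell separately means a row may not add up to exactly 100.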

Select for one gender, age group and marital status using groupby() and display a frequency distribution using value_counts()

Groupby Help Needed
I would like to output a frequency table showing marital status ("DMDMARTLx") for women ("RIAGENDR" == 2) in all age groups. The code below returns the age group stratification and counts the marital status correctly, but I am struggling to understand why I can't select just the women from the dataframe.
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 30, 40, 50, 60, 70, 80])
da["RIAGENDRf"] = da["RIAGENDR"] == 2
da.groupby(['agegrp', 'RIAGENDRf']) ['DMDMARTLx'].value_counts()
agegrp RIAGENDRf DMDMARTLx
(18, 30] False Never Married 262
Married 104
Living w Partner 95
Separated 7
Divorced 2
Widowed 2
True Never Married 259
Married 158
Living w Partner 114
Divorced 11
Separated 11
...
If I use [] to select the data where da["RIAGENDR"] == 2, I get a ValueError:
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 30, 40, 50, 60, 70, 80])
da["RIAGENDRf"] = [da["RIAGENDR"]==2]
da.groupby(['agegrp', 'RIAGENDRf']) ['DMDMARTLx'].value_counts()
ValueError: Length of values does not match length of index
Does this mean I need to fill or drop missing data?
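The ValueError has nothing to do with missing data: wrapping the comparison in [] creates a list with a single element (the whole boolean Series), so its length does not match the index. If you want a dataframe containing only the women, filter the rows with the boolean mask: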
da_women = da[da["RIAGENDR"] == 2]
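Putting the filter and the frequency table together might look like the sketch below. The six rows are made-up stand-ins for the survey data (only the column names come from the question), and observed=True is passed so that empty age groups are left out.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names are from the question.
da = pd.DataFrame({
    "RIDAGEYR": [22, 25, 35, 45, 28, 33],
    "RIAGENDR": [2, 1, 2, 2, 2, 1],
    "DMDMARTLx": ["Never Married", "Married", "Married",
                  "Divorced", "Never Married", "Married"],
})
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 30, 40, 50, 60, 70, 80])

# Keep only the rows where the mask is True (women), then group.
da_women = da[da["RIAGENDR"] == 2]
counts = da_women.groupby("agegrp", observed=True)["DMDMARTLx"].value_counts()
print(counts)
```

Filtering first means the gender no longer needs to be a grouping key, so the result has only the age group and marital status levels.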

Matplotlib error plotting interval bins for discretized values from pandas dataframe

An error is raised when I try to plot an interval.
I created intervals for my age column, and now I want to show on a chart how the age intervals compare to the revenue.
My code:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
client_id sales revenue birth age sex tranche
0 c_1 39 558.18 1955 66 m (60, 70]
1 c_10 58 1353.60 1956 65 m (60, 70]
2 c_100 8 254.85 1992 29 m (20, 30]
3 c_1000 125 2261.89 1966 55 f (50, 60]
4 c_1001 102 1812.86 1982 39 m (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error appears, ending with
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use an interval for plotting?
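You'll need to add labels. (I tried to convert the intervals to str using .astype(str), but that does not seem to work in 3.9.)
If you do the following, it will work just fine. pd.cut needs exactly one label per bin, i.e. one fewer than the number of bin edges, and the labels have to be applied to clients (which has the age column) before the groupby:
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]  # '10-20', '20-30', ..., '90-100'
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)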
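A small sketch of the labelling step on the five sample rows from the question. It stops short of plotting and only prepares the grouped frame; the final astype(str) is an extra precaution added here, not part of the answer above.

```python
import pandas as pd

# Ages and revenues of the five sample clients from the question.
clients = pd.DataFrame({
    "age": [66, 65, 29, 55, 39],
    "revenue": [558.18, 1353.60, 254.85, 2261.89, 1812.86],
})

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# One label per bin: pair each edge with the next one.
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]
clients["tranche"] = pd.cut(clients["age"], bins, labels=labels)

# Sum revenue per age bracket, keeping only brackets that occur.
df = clients.groupby("tranche", observed=True)[["revenue"]].sum().reset_index()
# Plain strings instead of a categorical column.
df["tranche"] = df["tranche"].astype(str)
print(df)
```

The resulting df.tranche column holds ordinary strings, which matplotlib treats as categories, so it can be passed to plt.scatter(df.tranche, df.revenue) in place of the Interval objects.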

pandas - How to convert aggregated data to dictionary

Here is a snippet of the CSV file I am working:
ID SN Age Gender Item ID Item Name Price
0, Lisim78, 20, Male, 108, Extraction of Quickblade, 3.53
1, Lisovynya38, 40, Male, 143, Frenzied Scimitar, 1.56
2, Ithergue48, 24, Male, 92, Final Critic, 4.88
3, Chamassasya86,24, Male, 100, Blindscythe, 3.27
4, Iskosia90, 23, Male, 131, Fury, 1.44
5, Yalae81, 22, Male, 81, Dreamkiss, 3.61
6, Itheria73, 36, Male, 169, Interrogator, 2.18
7, Iskjaskst81, 20, Male, 162, Abyssal Shard, 2.67
8, Undjask33, 22, Male, 21, Souleater, 1.1
9, Chanosian48, 35, Other, 136, Ghastly, 3.58
10, Inguron55, 23, Male, 95, Singed Onyx, 4.74
I want to get the count of the most profitable items, where profitable items are determined by taking the sum of the prices of the most frequently purchased items.
This is what I tried:
profitableCount = df.groupby('Item ID').agg({'Price': ['count', 'sum']})
And the output looks like this:
Price
count sum
Item ID
0 4 5.12
1 3 9.78
2 6 14.88
3 6 14.94
4 5 8.50
5 4 16.32
6 2 7.40
7 7 9.31
8 3 11.79
9 4 10.92
10 4 7.16
I want to extract the 'count' and 'sum' columns and put them in a dictionary but I can't seem to drop the 'Item ID' column (Item ID seems to be the index). How do I do this? Please help!!!
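A dictionary consists of a series of {key: value} pairs. In the output you provided there is no obvious key/value pairing: the counts repeat (4 and 6 each occur more than once), so they cannot serve as unique keys. As (count, sum) pairs, the data would look like this:
[(4, 5.12), (3, 9.78), (6, 14.88), (6, 14.94), (5, 8.50), (4, 16.32),
(2, 7.40), (7, 9.31), (3, 11.79), (4, 10.92), (4, 7.16)]
Alternatively, you can create two lists, profitableCount[('Price', 'count')].tolist() and profitableCount[('Price', 'sum')].tolist(),
and combine them into a list of tuples: list(zip(list1, list2))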
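If a dictionary keyed by Item ID is acceptable, one possible sketch is shown below. The four purchases are an illustrative subset (the repeated purchase of item 108 is an assumption made so that one count exceeds 1), and the names stats, by_item and pairs are introduced here.

```python
import pandas as pd

# Illustrative purchases; the duplicate of item 108 is an assumption.
df = pd.DataFrame({
    "Item ID": [108, 143, 108, 92],
    "Price": [3.53, 1.56, 3.53, 4.88],
})

profitableCount = df.groupby("Item ID").agg({"Price": ["count", "sum"]})

# Selecting the top-level 'Price' column leaves plain 'count' and 'sum' columns.
stats = profitableCount["Price"]

# One inner dict per Item ID: {item_id: {'count': ..., 'sum': ...}}
by_item = stats.to_dict("index")

# Or drop the Item ID entirely and keep (count, sum) tuples.
pairs = list(zip(stats["count"].tolist(), stats["sum"].tolist()))
print(by_item)
print(pairs)
```

Keeping Item ID as the key avoids the duplicate-key problem, since each item appears exactly once in the grouped result.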
