How do I convert this dictionary to a dataframe in python?

I have a dictionary of Student information in this format. I cannot change this, it is the output from another program I am trying to use.
student_info_dict = {
    "Student_1_Name": "Alice",
    "Student_1_Age": 23,
    "Student_1_Phone_Number": 1111,
    "Student_1_before_after": (120, 109),
    "Student_2_Name": "Bob",
    "Student_2_Age": 56,
    "Student_2_Phone_Number": 1234,
    "Student_2_before_after": (115, 107),
    "Student_3_Name": "Casie",
    "Student_3_Age": 47,
    "Student_3_Phone_Number": 4567,
    "Student_3_before_after": (180, 140),
    "Student_4_Name": "Donna",
    "Student_4_Age": 33,
    "Student_4_Phone_Number": 6789,
    "Student_4_before_after": (150, 138),
}
The keys of my dictionary increment by 1 to indicate the next student's information. How do I convert this to a DataFrame that looks like this:
    Name  Age  Phone_Number  Before_and_After
0  Alice   23          1111        (120, 109)
1    Bob   56          1234        (115, 107)
2  Casie   47          4567        (180, 140)
3  Donna   33          6789        (150, 138)

Use:
# create a Series from the dictionary
s = pd.Series(student_info_dict)
# split the index on '_' (at most two splits), expanding to a MultiIndex
s.index = s.index.str.split('_', n=2, expand=True)
# remove the first level ('Student') and reshape to a DataFrame
df = s.droplevel(0).unstack()
print (df)
Age Name Phone_Number before_after
1 23 Alice 1111 (120, 109)
2 56 Bob 1234 (115, 107)
3 47 Casie 4567 (180, 140)
4 33 Donna 6789 (150, 138)
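If you also want the exact column names and 0-based index shown in the question, you can rename and reorder after the unstack. A sketch (the rename target Before_and_After comes from the desired output above; shown here on a trimmed two-student dictionary):

```python
import pandas as pd

student_info_dict = {
    "Student_1_Name": "Alice", "Student_1_Age": 23,
    "Student_1_Phone_Number": 1111, "Student_1_before_after": (120, 109),
    "Student_2_Name": "Bob", "Student_2_Age": 56,
    "Student_2_Phone_Number": 1234, "Student_2_before_after": (115, 107),
}

s = pd.Series(student_info_dict)
s.index = s.index.str.split('_', n=2, expand=True)
df = s.droplevel(0).unstack()

# rename the tuple column, switch to a 0-based RangeIndex, reorder columns
df = (df.rename(columns={'before_after': 'Before_and_After'})
        .reset_index(drop=True)
        [['Name', 'Age', 'Phone_Number', 'Before_and_After']])
```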

You can use a simple dictionary comprehension feeding the Series constructor, getting the Student number and the field name as keys, then unstacking to DataFrame:
df = (pd.Series({tuple(k.split('_', 2)[1:]): v
                 for k, v in student_info_dict.items()})
        .unstack(1)
      )
output:
Age Name Phone_Number before_after
1 23 Alice 1111 (120, 109)
2 56 Bob 1234 (115, 107)
3 47 Casie 4567 (180, 140)
4 33 Donna 6789 (150, 138)

Use this code:
import pandas as pd

name = []
age = []
pn = []
baa = []
for key, value in student_info_dict.items():
    if 'Name' in key:
        name.append(value)
    elif 'Age' in key:
        age.append(value)
    elif 'Phone' in key:
        pn.append(value)
    elif 'before' in key:
        baa.append(value)
df = pd.DataFrame({'Name': name, 'Age': age, 'Phone_Number': pn, 'before_after': baa})

Related

Pandas groupby using range and type

I have a DataFrame with "room_type" and "review_scores_rating" as columns.
The DataFrame looks like this:
room_type review_scores_rating
0 Private room 98.0
1 Private room 89.0
2 Entire home/apt 100.0
3 Private room 99.0
4 Private room 97.0
I already use groupby so I also have this dataframe
review_scores_rating
room_type
Entire home/apt 11930
Hotel room 97
Private room 3116
Shared room 44
I want to create a DataFrame that has the different room types as columns, where each row counts how many listings fall into different ranges of the rating.
I was able to get to this point
count
review_scores_rating
(19.92, 30.0] 24
(30.0, 40.0] 23
(40.0, 50.0] 9
(50.0, 60.0] 97
(60.0, 70.0] 74
(70.0, 80.0] 486
(80.0, 90.0] 1701
(90.0, 100.0] 12773
But I don't know how to make it count not only by rating range but also by room type, so I can know, for example, how many private rooms have a review score rating between 30 and 40.
You can use a crosstab with cut:
pd.crosstab(pd.cut(df['review_scores_rating'], bins=range(0, 101, 10)),
            df['room_type'])
Output:
room_type Entire home/apt Private room
review_scores_rating
(80, 90] 0 1
(90, 100] 1 3
Or groupby.count:
df.groupby(['room_type', pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))]).count()
Output:
review_scores_rating
room_type review_scores_rating
Entire home/apt (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 0
(90, 100] 1
Private room (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 1
(90, 100] 3
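If you would rather keep the groupby route but get the same wide layout as the crosstab (room types as columns), you can unstack the room_type level. A sketch on made-up rows mirroring the sample above:

```python
import pandas as pd

df = pd.DataFrame({
    'room_type': ['Private room', 'Private room', 'Entire home/apt',
                  'Private room', 'Private room'],
    'review_scores_rating': [98.0, 89.0, 100.0, 99.0, 97.0],
})

bins = pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))

# count rows per (room_type, rating range), then move room_type to columns
wide = df.groupby(['room_type', bins]).size().unstack(0)
```

`size()` counts every row in each group, so it does not depend on which column you select, and `unstack(0)` pivots the room types into columns.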

How do I join two columns into another separate column in Pandas?

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to add two columns which are Latitude and Longitude and put it into a column called Location.
For example:
First row in Latitude will have a value of 41.864073 and the first row of Longitude will have a value of -87.706819.
I would like the 'Location' column to display 41.864073, -87.706819.
please and thank you.
Setup
df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be better than using apply
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
list variant
Though I'd still suggest tuple
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]
I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3,4]], columns=['lon', 'lat'])
>>> df
>>>
lon lat
0 1 2
1 3 4
>>>
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
>>>
lon lat Location
0 1 2 (1, 2)
1 3 4 (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)
Data from Pir
df['New']=tuple(zip(*df[['lat','lon']].values.T))
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd

Dic = {'Lattitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Lattitude'].astype(str) + ', ' + DF['Longitude'].astype(str)

Mean of grouped data

I have data in a DataFrame regarding salaries of employees. Each employee also has data stored about their sex, discipline, years since earning their PhD, and years working for their current employer. An example of the data follows.
rank dsc phd srv sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 Asst B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 Assoc B 6 6 Male 97000
7 Prof B 30 23 Male 175000
8 Prof B 45 45 Male 147765
9 Prof B 21 20 Male 119250
10 Prof B 18 18 Female 129000
What I want to access is the mean salary of all employees grouped by both sex and ten-year ranges of service. For example: males with 0-10 years of service, females with 0-10 years of service, males with 11-20 years of service, etc. I can get the mean salary over ranges of years of service, without separating by sex, by doing:
serviceSalary = data.groupby(pd.cut(data['yrs.service'],
                                    np.arange(0, 70, 10)))['salary'].mean()
What further can I do to add a third grouping to this variable?
You can groupby multiple columns with a list as the first argument, so instead of just one:
In [11]: df.groupby(pd.cut(df['srv'], np.arange(0, 70, 10)))['salary'].mean()
Out[11]:
srv
(0, 10] 88375.0
(10, 20] 140300.0
(20, 30] 175000.0
(30, 40] 115000.0
(40, 50] 144632.5
(50, 60] NaN
Name: salary, dtype: float64
You can pass 'sex' too:
In [12]: df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)), 'sex'])['salary'].mean()
Out[12]:
srv sex
(0, 10] Male 88375.000000
(10, 20] Female 129000.000000
Male 144066.666667
(20, 30] Male 175000.000000
(30, 40] Male 115000.000000
(40, 50] Male 144632.500000
Name: salary, dtype: float64
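To make that two-level result easier to scan, you can unstack sex into columns. A sketch on a few made-up rows shaped like the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'srv': [3, 18, 16, 18],
    'sex': ['Male', 'Male', 'Male', 'Female'],
    'salary': [79750, 139750, 173200, 129000],
})

# mean salary per (ten-year service range, sex)
means = df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)),
                    'sex'])['salary'].mean()

# one row per service range, one column per sex
table = means.unstack('sex')
```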

Order Dataframe Index based on a second Dataframe

I have two DataFrames in Python, but the columns to be used as the index (CodeNumber) are not in the same order. They need to be ordered the same way. Here is the code:
#generating DataFrames:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3)
data_SKU4 = pd.DataFrame(data=d4)
Then I set CodeNumber as the index:
data_SKU3.set_index('CodeNumber', inplace = True)
data_SKU4.set_index('CodeNumber', inplace = True)
If we print the resulting DataFrames, note that data_SKU3 has Code Numbers in the order 1234, 1235, 111, 101, while data_SKU4 has 1235, 1234, 101, 111.
Is there a way to order the Code Numbers so both DataFrames are in the same order?
You can also sort values by CodeNumber on each dataframe by calling .sort_values(by = 'CodeNumber') before setting them as index:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3).sort_values(by = 'CodeNumber')
data_SKU4 = pd.DataFrame(data=d4).sort_values(by = 'CodeNumber')
data_SKU3.set_index('CodeNumber', inplace = True)
data_SKU4.set_index('CodeNumber', inplace = True)
Use sort_index if both indices contain the same values:
data_SKU3 = data_SKU3.set_index('CodeNumber').sort_index()
data_SKU4 = data_SKU4.set_index('CodeNumber').sort_index()
print (data_SKU3)
Date Weight
CodeNumber
101 20120720 24
111 20180119 41
1234 20150808 26
1235 20141201 32
print (data_SKU4)
Date Weight
CodeNumber
101 20180219 47
111 20130720 3
1234 20151201 25
1235 20160808 28
Another approach is to use reindex with the other DataFrame's index values, but this requires unique index values that differ only in ordering:
data_SKU3 = data_SKU3.set_index('CodeNumber')
data_SKU4 = data_SKU4.set_index('CodeNumber').reindex(index=data_SKU3.index)
print (data_SKU3)
Date Weight
CodeNumber
1234 20150808 26
1235 20141201 32
111 20180119 41
101 20120720 24
print (data_SKU4)
Date Weight
CodeNumber
1234 20151201 25
1235 20160808 28
111 20130720 3
101 20180219 47
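As one more option (an alternative not shown in the answers above), DataFrame.align reindexes both frames to a shared index in a single call. A sketch on the question's data, with the Date column omitted for brevity:

```python
import pandas as pd

d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Weight': [28, 25, 47, 3]}
a = pd.DataFrame(d3).set_index('CodeNumber')
b = pd.DataFrame(d4).set_index('CodeNumber')

# the default outer join reindexes both frames to the sorted union of indexes
a2, b2 = a.align(b)
```

Since both frames here contain the same CodeNumber values, no NaNs are introduced and the two results simply share one sorted index.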

create a bigram from a column in pandas df

I have this test table in a pandas DataFrame:
Leaf_category_id session_id product_id
0 111 1 987
3 111 4 987
4 111 1 741
1 222 2 654
2 333 3 321
This is an extension of my previous question, which was answered by jazrael.
After getting the values in the product_id column as follows (just an assumption, a little different from the output of my previous question):
|product_id |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |
and so on.
I want to create a new column in which each value in a row is paired as a bigram with the next value, and the last number in the row is combined with the first one. For example:
|product_id |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)
Ignore the ** (I'll explain later why I starred those entries).
The code to achieve the bigrams is:
for i in df.Leaf_category_id.unique():
    print(df[df.Leaf_category_id == i].groupby('session_id')['product_id']
          .apply(lambda x: list(zip(x, x[1:]))).reset_index())
From this df, I want to take the Bigram column and add one more column named frequency, which gives the frequency with which each bigram occurs.
Note*: (987,741) and (741,987) are to be considered the same; the duplicate entry should be removed, and thus the frequency of (987,741) should be 2.
The same goes for (34,12): it occurs two times, so its frequency should be 2.
|Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)
The final result should be:
|Bigram | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1
I am hoping to find an answer here. Please help me; I have elaborated as much as possible.
Try this code:
from itertools import combinations
import pandas as pd
df = pd.read_csv("data.csv", index_col=0)  # DataFrame.from_csv has been removed from pandas
#consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index()
df1=pd.DataFrame(grouped_consecutive_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]
For combinations (all possible bigrams):
from itertools import combinations
import pandas as pd
df = pd.read_csv("data.csv", index_col=0)  # DataFrame.from_csv has been removed from pandas
#combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index()
df1=pd.DataFrame(grouped_combination_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]
where data.csv contains
Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12
The resultant bigram_frequency_consecutive will be
Bigram freq
0 (12, 34) 2
1 (12, 324) 1
2 (12, 654) 1
3 (12, 987) 1
4 (32, 342) 1
5 (34, 87) 1
6 (34, 741) 1
7 (87, 741) 1
8 (111, 741) 1
9 (123, 987) 1
10 (321, 321) 1
11 (321, 741) 1
12 (324, 465) 1
13 (324, 654) 1
14 (324, 987) 1
15 (342, 465) 1
16 (654, 862) 1
17 (741, 987) 2
18 (987, 1232) 1
The resultant bigram_frequency_combinations will be
Bigram freq
0 (12, 32) 1
1 (12, 34) 2
2 (12, 87) 1
3 (12, 111) 1
4 (12, 123) 1
5 (12, 324) 1
6 (12, 342) 1
7 (12, 465) 1
8 (12, 654) 1
9 (12, 741) 2
10 (12, 987) 2
11 (32, 324) 1
12 (32, 342) 1
13 (32, 465) 1
14 (32, 654) 1
15 (34, 87) 1
16 (34, 111) 1
17 (34, 123) 1
18 (34, 741) 2
19 (34, 987) 2
20 (87, 111) 1
21 (87, 741) 1
22 (87, 987) 1
23 (111, 741) 1
24 (111, 987) 1
25 (123, 741) 1
26 (123, 987) 1
27 (321, 321) 1
28 (321, 324) 2
29 (321, 654) 2
30 (321, 741) 2
31 (321, 862) 2
32 (321, 987) 2
33 (324, 342) 1
34 (324, 465) 1
35 (324, 654) 2
36 (324, 741) 1
37 (324, 862) 1
38 (324, 987) 1
39 (342, 465) 1
40 (342, 654) 1
41 (465, 654) 1
42 (654, 741) 1
43 (654, 862) 1
44 (654, 987) 1
45 (741, 862) 1
46 (741, 987) 3
47 (862, 987) 1
48 (987, 1232) 1
In the above case it groups by both Leaf_category_id and session_id.
We are going to pull the values out of product_id, create bigrams whose elements are sorted (so reversed duplicates collapse), count them to get the frequency, and then populate a data frame.
from collections import Counter
# assuming your data frame is called 'df' and each product_id cell holds a list of ids
bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_set)
df_freq = pd.DataFrame(list(freq_dict.items()), columns=['bigram', 'freq'])
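A minimal, self-contained run of that approach on made-up lists (hypothetical data; this assumes each row of product_id is already a list of ids, as in the examples above):

```python
from collections import Counter

import pandas as pd

# hypothetical sample: each row's product_id is already a list of ids
rows = [[111, 987, 741, 34, 12], [321, 741, 987]]

# consecutive pairs per row, sorted so (987, 741) and (741, 987) collapse
bigrams = [list(zip(x, x[1:])) for x in rows]
pairs = [tuple(sorted(p)) for row in bigrams for p in row]

df_freq = pd.DataFrame(sorted(Counter(pairs).items()),
                       columns=['bigram', 'freq'])
```

Here (741, 987) appears once in each row (as (987, 741) and (741, 987)), so it ends up with a frequency of 2, matching the deduplication requirement in the question.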
