I have a df with raw survey data similar to the following, with 12,000 rows and forty questions. All responses are categorical:
import pandas as pd
df = pd.DataFrame({'Age': ['20-30', '20-30', '30-45', '20-30', '30-45', '20-30'],
                   'Gender': ['M', 'F', 'F', 'F', 'M', 'F'],
                   'Income': ['20-30k', '30-40k', '40k+', '40k+', '40k+', '20-30k'],
                   'Question1': ['Good', 'Bad', 'OK', 'OK', 'Bad', 'Bad'],
                   'Question2': ['Happy', 'Unhappy', 'Very_Unhappy', 'Very_Unhappy', 'Very_Unhappy', 'Happy']})
I want to categorize the responses to each question according to Age, Gender and Income, to produce a frequency (%) table for each question, with one block of columns per demographic (as in the screenshot I posted).
Crosstab produces too many categories, i.e. it breaks the results down by Income and, within Income, by Age, etc. So I'm not sure how best to go about this. I'm sure this is an easy problem, but I'm new to Python, so any help would be appreciated.
As you said, using crosstab with all the columns breaks the result down by each column. You can instead build individual crosstabs and then concat:
pd.concat([pd.crosstab(df.Question1, df.Gender), pd.crosstab(df.Question1, df.Income), pd.crosstab(df.Question1, df.Age)], axis = 1)
F M 20-30k 30-40k 40k+ 20-30 30-45
Question1
Bad 2 1 1 1 1 2 1
Good 0 1 1 0 0 1 0
OK 2 0 0 0 2 1 1
Edit: to get an additional level in the columns:
age = pd.crosstab(df.Question1, df.Age)
age.columns = pd.MultiIndex.from_product([['Age'], age.columns])
gender = pd.crosstab(df.Question1, df.Gender)
gender.columns = pd.MultiIndex.from_product([['Gender'], gender.columns])
income = pd.crosstab(df.Question1, df.Income)
income.columns = pd.MultiIndex.from_product([['Income'], income.columns])
pd.concat([age, gender, income], axis = 1)
Age Gender Income
20-30 30-45 F M 20-30k 30-40k 40k+
Question1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
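Since the question asks for percentages rather than raw counts, note that pd.crosstab also takes a normalize argument. A minimal sketch (the pct_crosstab helper name is just illustrative) that normalizes within each demographic column:

def pct_crosstab(col, label):
    # percentage of each response within each demographic group
    t = pd.crosstab(df.Question1, col, normalize='columns') * 100
    t.columns = pd.MultiIndex.from_product([[label], t.columns])
    return t

pd.concat([pct_crosstab(df.Age, 'Age'),
           pct_crosstab(df.Gender, 'Gender'),
           pct_crosstab(df.Income, 'Income')], axis=1).round(1)

Each column then sums to 100, i.e. the share of each response within that demographic group.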
You can melt before crosstab (the positional axis argument to drop is gone in modern pandas, so use columns=):
s = (df.drop(columns='Question2')
       .melt(['Age', 'Gender', 'Income'])
       .drop(columns='variable')
       .rename(columns={'value': 'v1'})
       .melt('v1'))
pd.crosstab(s.v1,[s.variable,s.value])
Out[235]:
variable Age Gender Income
value 20-30 30-45 F M 20-30k 30-40k 40k+
v1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
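With forty question columns, the same melt-and-crosstab pattern can be wrapped in a loop. A sketch, assuming every non-demographic column is a question (demo_cols, question_cols and tables are illustrative names):

demo_cols = ['Age', 'Gender', 'Income']
question_cols = [c for c in df.columns if c not in demo_cols]

tables = {}
for q in question_cols:
    # one (demographic, value) pair per response to question q
    s = df[demo_cols + [q]].melt(id_vars=[q])
    tables[q] = pd.crosstab(s[q], [s['variable'], s['value']])

tables['Question1']  # same layout as the table above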
A dataframe df has a column:
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode the column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which makes the dataframe look as follows:
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it turns each whole comma-separated string into a single column name, instead of making each word a separate column:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try explode followed by pivot_table:
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'], columns=df_e['data_words'],
                       values='id', aggfunc='count', fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
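A slightly more compact equivalent on the exploded frame, if preferred, is a plain crosstab (it also fills missing combinations with 0):

pd.crosstab(df_e['id'], df_e['data_words'])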
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
One possible approach is to combine your ','.join step with str.get_dummies:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2; input data borrowed from Celius Stingher's answer.
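If you need the indicator columns alongside the id column, the result can be joined back on the index, for example:

df_out = df[['id']].join(new_df)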
I encountered the SettingWithCopyWarning in Python. I searched online, but it seems that none of the solutions work for me.
The input data is like this:
id genre
0 1 Drama, Romance
1 2 Action, Drama
2 3 Action, Comedy
3 4 Thriller
The expected outcome should be:
id Drama Romance Action Comedy Thriller
0 1 1 1 0 0 0
1 2 1 0 1 0 0
2 3 0 0 1 1 0
3 4 0 0 0 0 1
I want to get a subset of the input data, add some columns, modify the added columns, and return the subset. Basically, I DO NOT want to modify the original data; I just want to get a subset, which should be a brand-new dataframe:
# the function to deal with the genre
def genre(data):
    subset = data[['id', 'genre']]
    for i, row in subset.iterrows():
        if isinstance(row['genre'], float):
            continue
        genreList = row['genre'].split(', ')
        for genre in genreList:
            if genre in list(subset):
                subset.loc[i][genre] = 1
            else:
                subset.loc[:][genre] = 0
                subset.loc[i][genre] = 1
    return subset
I tried many ways, but none of them gets rid of the SettingWithCopyWarning:
subset = data[['A', 'B']].copy()
subset = data.filter(['A', 'B'], axis=1)
subset = pd.DataFrame(data[['A', 'B']])
subset = data.copy().drop(columns=['C', 'D'])
subset = pd.DataFrame({'id': list(data.id), 'genre': list(data.genre)})
Does anyone have any idea how to fix this? Or is there a way to suppress the warning?
Thanks in advance!!
Iteration is not needed, and neither is subsetting. You can use str.get_dummies.
df.drop(columns='genre').join(df['genre'].str.get_dummies(sep=', '))
id Action Comedy Drama Romance Thriller
0 1 0 0 1 1 0
1 2 1 0 1 0 0
2 3 1 1 0 0 0
3 4 0 0 0 0 1
The result is a new DataFrame; you can assign it to a new name (df2 = ...).
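For what it's worth, the warning in the original function most likely comes from chained indexing rather than from how the subset was created: subset.loc[i][genre] first materializes the row with .loc[i] (often a copy) and then assigns into that intermediate object, so pandas cannot guarantee the write reaches subset. A single .loc call with a (row, column) label pair avoids it:

subset.loc[i][genre] = 1   # chained assignment: triggers SettingWithCopyWarning
subset.loc[i, genre] = 1   # one indexer: writes into subset directly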
I am new to Python. I have a column in an MS Excel file in which four tags are used: LOC, ORG, PER and MISC. The given data looks like this:
1 LOC/Thai Buddhist temple;
2 PER/louis;
3 ORG/WikiLeaks;LOC/Southern Ocean;
4 ORG/queen;
5 PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6
7 PER/Thomas Watson;
...................
...................
.............# continues up to 2,000 rows
I want a result that shows, for each row, which tags are present: if a tag is present, put "1" in its column, and if it is not present, put "0". I want four new columns (LOC, ORG, PER and MISC, shown below) as the 2nd, 3rd, 4th and 5th columns, with the given data as the first column. The file contains almost 2,815 rows, and every row has a different combination of these LOC/ORG/PER/MISC tags.
My goal is to count, from the new columns, the total number of LOC, ORG, PER and MISC tags.
The result will be like this:
given data LOC ORG PER MISC
1 LOC/Thai Buddhist temple; 1 0 0 0 #here only LOC is present
2 PER/louis; 0 0 1 0 #here only PER is present
3 ORG/WikiLeaks;LOC/Southern Ocean; 1 1 0 0 #here LOC and ORG is present
4 PER/Eli Wallach;MISC/The Good; 0 0 1 1 #here PER and MISC is present
5 .................................................
6 0 0 0 0 #here no tag is present
7 .....................................................
.......................................................
..................................continue up to 2815 rows....
I am a beginner in Python, so I have tried my best to search for a solution, but I cannot find any program related to my problem; that's why I am posting here. Any help would be kindly appreciated.
I assume you have successfully read the data from Excel and created a dataframe in Python using pandas (to read the Excel file, use df1 = pd.read_excel("File/path/name.xls", header=0), or header=None if there is no header row).
Here is the layout of your dataframe df1
Colnum | Tagstring
1 |LOC/Thai Buddhist temple;
2 |PER/louis;
3 |ORG/WikiLeaks;LOC/Southern Ocean;
4 |ORG/queen;
5 |PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6 |PER/Thomas Watson;
Now, there are a couple of ways to search for text in a string. I will demonstrate the find function.
Syntax: str.find(sub, beg=0, end=len(string)) returns the lowest index where sub is found, or -1 if it is not found.

str1 = "LOC"
str2 = "PER"
str3 = "ORG"
str4 = "MISC"

# .str.find returns -1 when the substring is absent, so >= 0 means "present"
df1["LOC"] = (df1["Tagstring"].str.find(str1) >= 0).astype('int')
df1["PER"] = (df1["Tagstring"].str.find(str2) >= 0).astype('int')
df1["ORG"] = (df1["Tagstring"].str.find(str3) >= 0).astype('int')
df1["MISC"] = (df1["Tagstring"].str.find(str4) >= 0).astype('int')
If you have read your data into df, then you can do:
pd.concat([df, pd.DataFrame({i: df.Tagstring.str.contains(i).astype(int)
                             for i in 'LOC ORG PER MISC'.split()})], axis=1)
Out[716]:
Tagstring LOC ORG PER MISC
Colnum
1 LOC/Thai Buddhist temple; 1 0 0 0
2 PER/louis; 0 0 1 0
3 ORG/WikiLeaks;LOC/Southern Ocean; 1 1 0 0
4 ORG/queen; 0 1 0 0
5 PER/Sanchez;PER/Eli Wallach;MISC/The Good, The... 0 0 1 1
6 PER/Thomas Watson; 0 0 1 0
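Since the stated goal is the total count of each tag, the new columns can then simply be summed. Assuming the concatenated result above is stored, e.g. as out:

out = pd.concat([df, pd.DataFrame({i: df.Tagstring.str.contains(i).astype(int)
                                   for i in 'LOC ORG PER MISC'.split()})], axis=1)
out[['LOC', 'ORG', 'PER', 'MISC']].sum()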
I have a Pandas script that counts the number of readmissions to hospital within 30 days based on a few conditions. I wonder if it could be vectorized to improve performance. I've experimented with df.rolling().apply, but so far without luck.
Here's a table with contrived data to illustrate:
ID VISIT_NO ARRIVED LEFT HAD_A_MASSAGE BROUGHT_A_FRIEND
1 1 29/02/1996 01/03/1996 0 1
1 2 01/12/1996 04/12/1996 1 0
2 1 20/09/1996 21/09/1996 1 0
3 1 27/06/1996 28/06/1996 1 0
3 2 04/07/1996 06/07/1996 0 1
3 3 16/07/1996 18/07/1996 0 1
4 1 21/02/1996 23/02/1996 0 1
4 2 29/04/1996 30/04/1996 1 0
4 3 02/05/1996 02/05/1996 0 1
4 4 02/05/1996 03/05/1996 0 1
5 1 03/10/1996 05/10/1996 1 0
5 2 07/10/1996 08/10/1996 0 1
5 3 10/10/1996 11/10/1996 0 1
First, I create a dictionary with IDs:
ids = massage_df[massage_df['HAD_A_MASSAGE'] == 1]['ID']
id_dict = {id:0 for id in ids}
Everybody in this table has had a massage, but in my real dataset, not all people are so lucky.
Next, I run this bit of code:
from pandas.tseries.offsets import DateOffset

for grp, df in massage_df.groupby(['ID']):
    date_from = df.loc[df[df['HAD_A_MASSAGE'] == 1].index, 'LEFT']
    date_to = date_from + DateOffset(days=30)
    mask = ((date_from.values[0] < df['ARRIVED']) &
            (df['ARRIVED'] <= date_to.values[0]) &
            (df['BROUGHT_A_FRIEND'] == 1))
    if len(df[mask]) > 0:
        id_dict[df['ID'].iloc[0]] = len(df[mask])
Basically, I want to count the number of times when someone originally came in for a massage (single or with a friend) and then came back within 30 days with a friend. The expected results for this table would be a total of 6 readmissions for IDs 3, 4 and 5.
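No answer was recorded here, but one possible vectorized sketch is to pair each massage visit with the same ID's friend visits via a merge, then filter on the 30-day window. Note the assumptions: the date columns are day-first strings as in the sample table, and (unlike the loop's .values[0], which only looks at the first massage visit) every massage visit is paired, which only matters if an ID has more than one:

import pandas as pd

# assumes ARRIVED/LEFT are day-first date strings as in the sample table
for col in ['ARRIVED', 'LEFT']:
    massage_df[col] = pd.to_datetime(massage_df[col], dayfirst=True)

massages = massage_df.loc[massage_df['HAD_A_MASSAGE'] == 1, ['ID', 'LEFT']]
friends = massage_df.loc[massage_df['BROUGHT_A_FRIEND'] == 1, ['ID', 'ARRIVED']]

# every (massage visit, friend visit) pair for the same ID
pairs = massages.merge(friends, on='ID')

within_30 = ((pairs['ARRIVED'] > pairs['LEFT']) &
             (pairs['ARRIVED'] <= pairs['LEFT'] + pd.DateOffset(days=30)))
readmissions = pairs.loc[within_30].groupby('ID').size()
# for the sample data this gives 2 each for IDs 3, 4 and 5 (6 in total)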
I have a df with badminton scores. Each set of a game for a team is on a row, and the score at each point is in the columns, like so:
0 0 1 1 2 3 4
0 1 2 3 3 4 4
I want to obtain only 0 and 1, marking when a point is scored, like so (to analyse whether there is any pattern in the points):
0 0 1 0 1 1 1
0 1 1 1 0 1 0
I was thinking of using df.itertuples() with iloc and conditions, assigning 1 in the new dataframe if the next score = score + 1, and 0 if the score is unchanged. But I don't know how to iterate through the generated tuples, nor how to generate my new df with the 0s and 1s in the right locations.
Hope that is clear; thanks for your help. Oh, also: any suggestions for analysing the patterns after that?
You just need diff (if you need to convert it back, try cumsum):
df.diff(axis=1).fillna(0).astype(int)
Out[1382]:
1 2 3 4 5 6 7
0 0 0 1 0 1 1 1
1 0 1 1 1 0 1 0
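As the answer notes, cumsum inverts the operation; a quick round-trip check, assuming each row starts from 0 as in the sample:

out = df.diff(axis=1).fillna(0).astype(int)
out.cumsum(axis=1)  # recovers the original running scores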