My data includes a few variables holding answers to multi-answer questions. These are stored as comma-separated strings and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])
index  a      b
1      1,3    1
2      3      1,2
3      1,3,2  1
4      3,1    2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero-counts, this would be just fine:
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I've gotten the counts for each of these variables individually, using split and value_counts:
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a  b
1  1    3
3  1    1
   2    1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all codes using the df_meta dataframe, but some of the actual variables have 300-400 codes, so crossing 2-3 of them that way is very slow. If groupby or another function can do it, it should be much faster.
First we create your starting dataframe.
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes. Make temporary dataframes of all combinations and add them to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Grouping by the 'a' and 'b' columns and taking size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
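If you also need the zero-count rows from the ideal output, one option is to reindex the grouped result against the full cartesian product of codes. A minimal sketch, assuming the category lists from df_meta as given in the question:
full_index = pd.MultiIndex.from_product([['1', '2', '3', '4'], ['1', '2']],
                                        names=['a', 'b'])  # codes from df_meta
counts = dfn.groupby(['a', 'b']).size()
# Missing combinations get a count of 0
print(counts.reindex(full_index, fill_value=0).reset_index(name='count'))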
So suppose I have a dataframe like:
A B
0 1 1
1 2 4
2 3 9
I want to have one long dataframe where there are three columns row, col, value like:
row col value
0 0 A 1
1 1 A 2
2 2 A 3
3 0 B 1
4 1 B 4
5 2 B 9
Basically, this turns the 2D array into 1D while remembering the row and column of each entry, so the resulting dataframe would be of shape (n*m, 3).
How is this possible with Pandas?
Actually the order of entries in the resulting dataframe isn't important for me.
Use melt:
df = df.reset_index()
df.melt(id_vars=['index'], value_vars=['A','B'])
It should give you what you want. Let me know if it works.
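If you want the exact row/col/value headers from the question, a rename on top of the melt result should do it:
out = df.reset_index().melt(id_vars=['index'], value_vars=['A', 'B'])
# Rename the default melt columns to the requested headers
out = out.rename(columns={'index': 'row', 'variable': 'col'})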
I have a dataframe like this:
user_id order_id
0 a 1
1 a 2
2 a 3
3 b 4
4 c 5
Now I want to add a column to show whether the user of each order has multiple orders:
user_id order_id repetitive
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 0
4 c 5 0
Since a has three orders, the tag is 1. I know value_counts can calculate the counts, but it only shows them after a groupby. I want to combine the result with the original dataframe. How can I achieve this?
Use groupby and transform to get your counts while maintaining the same structure. Select the order_id column first so transform returns a Series that assigns cleanly to the new column:
df['repetitive'] = df.groupby('user_id')['order_id'].transform('count').gt(1).astype(int)
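A quick end-to-end run with the question's data (reconstructed here as a sketch):
import pandas as pd

df = pd.DataFrame({'user_id': ['a', 'a', 'a', 'b', 'c'],
                   'order_id': [1, 2, 3, 4, 5]})
# Count orders per user, broadcast back to each row, then flag counts > 1
df['repetitive'] = df.groupby('user_id')['order_id'].transform('count').gt(1).astype(int)
print(df)  # rows for user a get 1, users b and c get 0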
I'm having trouble converting a pandas dataframe into the format I need in order to analyze it further. The current data comes from a survey where we asked people to rank their preferred means of communication (1=highest, 4=lowest). Every row is a respondent.
The current dataframe:
A B C D
0 1 2 4 3
1 2 3 1 4
2 2 1 4 3
3 2 1 4 3
4 1 3 4 2
...
For data analysis I want to transform this into the following dataframe, where every row is a means of communication and the columns count how often it was ranked in each spot.
1st 2nd 3rd 4th
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
I have tried applying custom functions to the original dataframe, and I have tried .groupby and .T, but I don't seem to get closer to the result I actually want.
This is the function I wrote, but I can't figure out how to apply it correctly to get the desired result.
def count_values_rank(column, rank):
    total_count_n1 = 0
    for i in column:
        if i == rank:
            total_count_n1 += 1
    return total_count_n1
Running this piece of code on a single column of my dataframe gets the desired result, but I'm having trouble writing it so I can apply it to the whole dataframe. The line below returns 2.
count_values_rank(df.iloc[:,0],'1')
It is probably a really obvious solution, but I'm having trouble seeing the easiest way to solve this.
Thanks a lot!
Use melt with crosstab:
pd.crosstab(df.melt().variable,df.melt().value).add_suffix('st')
Out[107]:
value 1st 2st 3st 4st
variable
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
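Note that add_suffix('st') produces 2st/3st rather than true ordinals. If the exact 1st/2nd/3rd/4th headers matter, one option is a rename on top (a sketch assuming the ranks are stored as strings, as the question's count_values_rank(df.iloc[:,0],'1') call suggests):
m = df.melt()
out = pd.crosstab(m.variable, m.value)
# Map rank codes to proper ordinal labels (use int keys if the ranks are ints)
out = out.rename(columns={'1': '1st', '2': '2nd', '3': '3rd', '4': '4th'})
print(out)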
Simple and practical question, yet I can't find a solution.
The questions I took a look were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but an entire row.
My problem is this: I pick a row of a dataframe, say df1, so I have a series.
Now I have another dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to copy that series into all of those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting df1[3] into rows 2 and 3 of df2, for example. So, something like this non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; after the comma, the : indicates we want to set all column values. We then assign the series, but call its .values attribute so that it's a numpy array. Otherwise you will get a ValueError from the shape mismatch, as you're intending to overwrite 2 rows with a single row, and if it's a Series it won't align as you desire:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
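If the target rows come from a condition rather than explicit labels, as in the question's attempt, the same assignment works with a boolean mask; note that & (with parenthesized comparisons) replaces the plain and that made the original snippet fail:
# Combine the conditions with the bitwise & operator, not the keyword `and`
mask = (df2.index >= 2) & (df2.index <= 3)
df2.loc[mask, :] = df1.loc[3].values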
Suppose you have to copy certain rows and columns from one dataframe to another; do this:
df2 = df.loc[x:y, a:b]  # x and y are the row bounds and a and b are
                        # the column bounds that you have to select
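For instance, with the df1 from above (a sketch; label slices in .loc are inclusive on both ends):
block = df1.loc[2:3, 'A':'B']  # rows 2-3, columns 'A' through 'B'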