Problems transforming a pandas dataframe - python

I am having trouble converting a pandas dataframe into the format I need in order to analyze it further. The data comes from a survey in which we asked people to rank their preferred means of communication (1=highest, 4=lowest). Every row is a respondent.
The current dataframe:
   A  B  C  D
0  1  2  4  3
1  2  3  1  4
2  2  1  4  3
3  2  1  4  3
4  1  3  4  2
...
For data analysis I want to transform this into the following dataframe, where every row is a different means of communication and each column counts how often respondents ranked it in that position.
   1st  2nd  3rd  4th
A    2    3    0    0
B    2    1    2    0
C    1    0    0    4
D    0    1    3    1
I have tried applying self-defined functions to the original dataframe, and I have tried .groupby and .T, but I don't seem to come any closer to the result I actually want.
This is the function I wrote but I can't figure out how to apply it correctly to give me the desired result.
def count_values_rank(column, rank):
    total_count_n1 = 0
    for i in column:
        if i == rank:
            total_count_n1 += 1
    return total_count_n1
Running this piece of code on a single column of my dataframe gets the desired result, but I am having trouble writing it so that I can apply it to the whole dataframe and get the result I am looking for. The line of code below would return 2.
count_values_rank(df.iloc[:,0],'1')
There is probably a really obvious solution, but I am having trouble seeing the easiest way to solve this.
Thanks a lot!

melt with crosstab
pd.crosstab(df.melt().variable,df.melt().value).add_suffix('st')
Out[107]:
value 1st 2st 3st 4st
variable
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
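For a version that can be run end to end, here is a minimal sketch. It rebuilds the five sample rows by hand and assumes the ranks are stored as integers. The ordinal column labels and the df.apply variant are illustrative additions, not part of the answer above.

```python
import pandas as pd

# The five sample rows from the question, assuming integer ranks.
df = pd.DataFrame({
    "A": [1, 2, 2, 2, 1],
    "B": [2, 3, 1, 1, 3],
    "C": [4, 1, 4, 4, 4],
    "D": [3, 4, 3, 3, 2],
})

# Long format: one row per (column name, rank) pair.
melted = df.melt()  # default column names: "variable" and "value"

# Count how often each option received each rank.
counts = pd.crosstab(melted["variable"], melted["value"])

# Hypothetical labels, chosen to give proper ordinals instead of a fixed suffix.
ordinals = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th"}
counts = counts.rename(columns=ordinals)
print(counts)

# Alternative without melt: count values per column, then transpose.
by_apply = (
    df.apply(pd.Series.value_counts)
      .sort_index()
      .fillna(0)
      .astype(int)
      .T
      .rename(columns=ordinals)
)
print(by_apply)
```

If the ranks in the real data are strings (as the '1' in the question's function call suggests), the keys of the ordinals mapping would need to be strings as well.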

Related

Sliding minimum value in a pandas column

I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third variable ("localMin") which will store the minimum value of the column "points" at each point in the dataframe as compared with all previous values in the "points" column for each personID (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
# running minimum of 'points' within each personID, in row order
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
    personID  points  localMin
0          A       3         3
1          A       4         3
2          A       2         2
3          A       6         2
4          A       1         1
5          A       2         1
6          B       4         4
7          B       3         3
8          B       1         1
9          B       2         1
10         B       6         1
11         B       1         1

Python Pandas - How to get group by counts by values from multiple columns with multiple values

My data includes a few variables holding data from multi-answer questions. These are stored as string (comma separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
                  columns=['a', 'b'])
index a b
1 1,3 1
2 3 1,2
3 1,3,2 1
4 3,1 2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero-counts, this would be just fine
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I got the counts for each of these variables individually, by using split and value_counts
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure how to group by them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a  b
1  1    3
3  1    1
   2    1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
Alternatively, I can iterate through all codes using the df_meta dataframe, but some of the actual variables have 300-400 codes and it's very slow when I try to cross 2-3 of them. If it's possible to use groupby or another function, it should work much faster.
First we make your dataframe to start with.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],
                            ["3,1","2,1"]]), columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes. Make temporary dataframes of all combinations and add them to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            # split(expand=True) pads shorter rows with missing values; skip those
            if pd.notna(i) and pd.notna(j):
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Groupby with 'a' and 'b' columns and size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
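The same counts can also be produced without explicit Python loops, which may help with the 300-400 code variables mentioned in the question. The following is a sketch, assuming a pandas version that provides DataFrame.explode (0.25 or later). The code lists are written out by hand here, whereas in the question they live in df_meta.

```python
import pandas as pd

df = pd.DataFrame({"a": ["1,3", "3", "1,3,2", "3,1"],
                   "b": ["1", "1,2", "1", "2,1"]})

# Turn each comma-separated string into a list of codes.
lists = df.assign(a=df["a"].str.split(","), b=df["b"].str.split(","))

# One row per (a, b) pair within each respondent; the index is reset
# between the two explodes so that row labels stay unique.
pairs = lists.explode("a").reset_index(drop=True).explode("b")

# Frequency of each observed combination.
counts = pairs.groupby(["a", "b"]).size()

# Optional: add the zero-count combinations from the known code lists.
all_codes = pd.MultiIndex.from_product(
    [["1", "2", "3", "4"], ["1", "2"]], names=["a", "b"]
)
full = counts.reindex(all_codes, fill_value=0).reset_index(name="count")
print(full)
```

Dropping the reindex step gives the shorter table without zero counts.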

Multiple Condition Apply Function that iterates over itself

So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduced an artificial column which is only there to help us get the cartesian product of the distances
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column
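from geopy import distance  # geopy >= 2.0 removed vincenty; distance.geodesic replaces it there
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, axis=1)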
Now we have the cartesian product of the original rows, which means each city is also compared with itself. We account for that in the next step by subtracting 1. We group by Index_x and count the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum((x.distance < 20))) - 1
df.drop(columns='join')
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
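A row-wise .apply over the full cartesian product can get slow for larger frames, so here is a vectorized sketch of the same count. Note that it swaps in the haversine (great-circle) formula for geopy's ellipsoidal distance, so results can differ slightly near the 20-mile boundary. The column names lat, lon and date and the helper count_within are hypothetical, and the counting is done per date as the question asks.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation


def count_within(lat, lon, max_miles=20.0):
    """For each point, count the other points closer than max_miles."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    # Haversine formula on all pairs at once.
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    # Subtract 1 because every point is within range of itself.
    return (dist < max_miles).sum(axis=1) - 1


points = pd.DataFrame({
    "lat": [43.3948, 22.8483, 44.9584, 22.39458, 33.9383, 22.849],
    "lon": [-23.9483, -34.3948, -14.4938, -55.34924, -23.4938, -34.397],
    "date": ["1/1/90"] * 6,
})

# Only compare locations that share the same date.
points["num_close"] = 0
for _, idx in points.groupby("date").groups.items():
    points.loc[idx, "num_close"] = count_within(points.loc[idx, "lat"],
                                                points.loc[idx, "lon"])
print(points)
```

The pairwise matrix needs memory proportional to the square of the group size, which should be fine for a few thousand locations per date but not for much more.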

Is there a way to have column values in a DataFrame update automatically when other columns are updated

Is there a way to have a column in a Dataframe update automatically when an original entry in the Dataframe is modified? Suppose I have the following:
dataset=({'A':[1,2,3]})
df=pd.DataFrame(dataset)
df['B']=df['A'].cumprod()
In: df
Out[257]:
A B
0 1 1
1 2 2
2 3 6
If I change a value in A
df.iloc[1,0]=4
Column B does not change.
In: df
Out[260]:
A B
0 1 1
1 4 2
2 3 6
I am wondering whether there is some way to define B so that I would get:
Out[260]:
A B
0 1 1
1 4 4
2 3 12
I saw an answer on a thread from 2013 that said this functionality would be added, but I can't seem to find any other documentation on it. I have tried using pd.eval, but that doesn't seem to have this functionality.
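A possible workaround, offered as a sketch rather than a definitive answer: the assignment df['B'] = df['A'].cumprod() stores the computed values, not a formula, so B has to be recomputed after A changes. Assuming your pandas version has no built-in formula columns, a small helper (the name with_cumprod is hypothetical) keeps the derivation in one place.

```python
import pandas as pd


def with_cumprod(frame):
    """Return a copy of frame with B recomputed from the current A."""
    return frame.assign(B=frame["A"].cumprod())


df = with_cumprod(pd.DataFrame({"A": [1, 2, 3]}))
print(df)

# Modify A, then rebuild the derived column.
df.iloc[1, 0] = 4
df = with_cumprod(df)
print(df)
```

After the second call, B is the cumulative product of the updated A (1, 4, 3).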

Applying operations on groups without aggregating

I want to apply an operation on multiple groups of a data frame and then fill all values of that group with the result. Let's take mean and np.cumsum as an example and the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
a b
0 1 1
1 3 1
2 2 2
3 4 2
Now I want to group the dataframe by b, then take the mean of a in each group, then apply np.cumsum to the means, and then replace all values of a by the (group dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
a
b
1 2
2 5
But what I want to get is
a b
0 2 1
1 2 1
2 5 2
3 5 2
Any ideas how this can be solved in a nice way?
You can use map by Series:
df1 = df.groupby("b").mean().cumsum()
print (df1)
a
b
1 2
2 5
df['a'] = df['b'].map(df1['a'])
print (df)
a b
0 2 1
1 2 1
2 5 2
3 5 2
