Let's say I have a dataframe in long form like this:
import numpy as np
import pandas as pd

data = {'name': ["A","A","A","B","B","B","C","C","C"],
        'time': ["1pm","1pm","1pm","2pm","2pm","2pm","3pm","3pm","3pm"],
        'idx': [1,2,3,1,2,3,1,2,3],
        'var1': [34,234,645,3,23,65,34,24,25],
        'var2': [1,35,2,65,2,1,7,3,8]}
df = pd.DataFrame(data)
I pivot it:
df_piv = df.pivot_table(values=['var1','var2'], index=['name','time'],
                        columns='idx', aggfunc=np.sum)
After performing some operations on the data in the pivoted dataframe, I would like to get it back into long form. How would I do that?
I tried several pandas functions, including melt, to_records, reset_index and swaplevel, in multiple combinations. None of them had the desired outcome.
Use DataFrame.stack with DataFrame.reset_index:
df = df_piv.stack().reset_index()
print(df)
name time idx var1 var2
0 A 1pm 1 34 1
1 A 1pm 2 234 35
2 A 1pm 3 645 2
3 B 2pm 1 3 65
4 B 2pm 2 23 2
5 B 2pm 3 65 1
6 C 3pm 1 34 7
7 C 3pm 2 24 3
8 C 3pm 3 25 8
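For reference, the whole round trip (pivot, then stack back) can be run as-is with the data from the question:

```python
import numpy as np
import pandas as pd

# rebuild the example frame from the question
data = {'name': ["A","A","A","B","B","B","C","C","C"],
        'time': ["1pm","1pm","1pm","2pm","2pm","2pm","3pm","3pm","3pm"],
        'idx': [1,2,3,1,2,3,1,2,3],
        'var1': [34,234,645,3,23,65,34,24,25],
        'var2': [1,35,2,65,2,1,7,3,8]}
df = pd.DataFrame(data)

# wide form: columns become a MultiIndex of (variable, idx)
df_piv = df.pivot_table(values=['var1','var2'], index=['name','time'],
                        columns='idx', aggfunc=np.sum)

# long form again: stack the 'idx' level back into rows
df_long = df_piv.stack().reset_index()
```

Because no aggregation actually collapses rows here, `df_long` contains the same nine rows as the original `df`, just with sorted index levels.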
I currently have the following dataframe:
No  ID       Sub_No  Weight
1   a23mcsk  2       30
2   smcd302  3       60
3   a23mcsk  1       24
4   smcd302  2       45
5   a23mcsk  3       18
6   smcd302  1       12
I want to be able to sort this dataframe first by 'ID' and then by 'Sub_No'. Is there a way I can do this in Python using pandas?
Expected Result:
No  ID       Sub_No  Weight
3   a23mcsk  1       24
1   a23mcsk  2       30
5   a23mcsk  3       18
6   smcd302  1       12
4   smcd302  2       45
2   smcd302  3       60
Use a helper column here for correct sorting by the numeric values extracted from ID, together with Sub_No:
df['new'] = df['ID'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(by=['new', 'Sub_No']).drop('new', axis=1)
Another idea with natural sorting:
import natsort as ns
df['new'] = pd.Categorical(df['ID'], ordered=True,
                           categories=ns.natsorted(df['ID'].unique()))
df = df.sort_values(by=['new', 'Sub_No']).drop('new', axis=1)
Use:
df.sort_values(by=['ID', 'Sub_No'])
You can use
df.sort_values(by=['ID', 'Sub_No'], ascending=True)
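On this sample data the plain lexicographic sort already yields the expected order; a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2, 3, 4, 5, 6],
                   'ID': ['a23mcsk', 'smcd302', 'a23mcsk',
                          'smcd302', 'a23mcsk', 'smcd302'],
                   'Sub_No': [2, 3, 1, 2, 3, 1],
                   'Weight': [30, 60, 24, 45, 18, 12]})

# lexicographic sort on ID, then numeric sort on Sub_No
out = df.sort_values(by=['ID', 'Sub_No'])
```

The helper-column or natsort variants only matter when the string sort of ID would disagree with its numeric part (e.g. 'a9x' vs 'a10x').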
I have the following dataframe:
import pandas as pd
data = [['tom', 10, 2, 'c', 6], ['tom', 16, 3, 'a', 8],
        ['tom', 22, 2, 'a', 10], ['matt', 10, 1, 'c', 11]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Col a', 'Col c', 'Category', 'Value'])
df
How can I create a new column called Calculation, where the Category determines which column gets used in the calculation?
For example, if Category == 'a' I would like the calculation to be df['Value'] - df['Col a'].
My expected output should be:
Name Col a Col c Category Value Calculation
0 tom 10 2 c 6 4
1 tom 16 3 a 8 -8
2 tom 22 2 a 10 -12
3 matt 10 1 c 11 10
I have lots of different columns (maybe 10 possible calculations), so I'm happy to hard-code them.
Any help on this would be much appreciated!
You can use DataFrame.lookup (available up to pandas 1.x; it was removed in pandas 2.0) to get the values from the columns based on the corresponding categories, then subtract them from the Value column:
df['Calc'] = df['Value'] - df.lookup(df.index, df['Category'].radd('Col '))
Name Col a Col c Category Value Calc
0 tom 10 2 c 6 4
1 tom 16 3 a 8 -8
2 tom 22 2 a 10 -12
3 matt 10 1 c 11 10
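Since DataFrame.lookup no longer exists in pandas 2.0+, here is a sketch of an equivalent using numpy fancy indexing on the candidate columns (same sample data; the intermediate names `value_cols`, `idx` and `picked` are mine):

```python
import numpy as np
import pandas as pd

data = [['tom', 10, 2, 'c', 6], ['tom', 16, 3, 'a', 8],
        ['tom', 22, 2, 'a', 10], ['matt', 10, 1, 'c', 11]]
df = pd.DataFrame(data, columns=['Name', 'Col a', 'Col c', 'Category', 'Value'])

value_cols = df.filter(like='Col ')                        # just the 'Col *' columns
idx = value_cols.columns.get_indexer('Col ' + df['Category'])
picked = value_cols.to_numpy()[np.arange(len(df)), idx]    # one value per row
df['Calc'] = df['Value'] - picked
```

`get_indexer` maps each row's target column name to a positional index, so one fancy-indexing step picks a different column per row without any Python-level loop.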
I currently have a dataframe that is countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row that has the Mean or Sum of values for that respective column/row, but is this the most efficient way?
How would I then sort the DF after I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as whole however, that are the intended focus. I apologize for the confusion.
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
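Run on the sample frame, this reorders both axes as intended; a quick end-to-end check:

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2], 'B': [0, 3, 1],
                   'C': [10, 13, 8], 'D': [16, 22, 14]},
                  index=['USA', 'CHN', 'UK'])

# order rows and columns by their maximum value, descending
rows_index = df.max(axis=1).sort_values(ascending=False).index
col_index = df.max().sort_values(ascending=False).index
new_df = df.loc[rows_index, col_index]
```

Because only the labels are reordered, each value stays attached to its original row and column, which is what keeps the country index intact.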
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy (note that the row labels must be reordered along with the rows):
arr = df.to_numpy()
order = np.max(arr, axis=1).argsort()[::-1]   # row order by max value, descending
arr = np.sort(arr[order], axis=1)[:, ::-1]    # sort within each row, descending
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
      A   B  C  D
CHN  22  13  3  2
USA  16  10  4  0
UK   14   8  2  1
Since each row is sorted independently here, the original column labels no longer describe the values beneath them.
I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works, however it is going to be applied to a big data set. Would there be a way to make this more efficient, e.g. through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then, use pd.merge_asof, to merge your dataframe with the count of calls by each individual in a period of 7 days. To get this, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = pd.merge_asof(
    df.sort_values('Day'),
    df.sort_values('Day')
      .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
      .size()
      .to_frame('ContactsIN7Days')
      .reset_index(),
    on='Day',
    by='PersonID',
    direction='nearest')
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
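An alternative vectorized sketch that avoids the 7-day-bin approximation of the Grouper: per person, binary-search the sorted call days for the window [day-7, day+7]. The helper name `counts_within_7` is mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CallID': [6, 3, 1, 5, 2, 7],
                   'Day': [2, 14, 8, 1, 12, 100],
                   'PersonID': [3, 2, 1, 3, 2, 3]})

def counts_within_7(days):
    # for each call day, count that person's calls in [day - 7, day + 7]
    d = np.sort(days.to_numpy())
    lo = np.searchsorted(d, days.to_numpy() - 7, side='left')
    hi = np.searchsorted(d, days.to_numpy() + 7, side='right')
    return hi - lo

df['ContactsIN7Days'] = df.groupby('PersonID')['Day'].transform(counts_within_7)
```

This keeps the original row order and, unlike the row-wise apply, does O(n log n) work per person instead of scanning the whole frame for every row.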
I am trying to build a new pandas dataframe from data in an existing dataframe, where each new row also depends on the previously calculated row of the new dataframe.
As an example, here are two dataframes of the same size:
df1 = pd.DataFrame(np.random.randint(0,10, size = (5, 4)), columns=['1', '2', '3', '4'])
df2 = pd.DataFrame(np.zeros(df1.shape), index=df1.index, columns=df1.columns)
Then I created a list that serves as the starting row for my second dataframe df2:
L = [2,5,6,7]
df2.loc[0] = L
Then, for the remaining rows of df2, I want to take the value from the previous row of df2 and add the corresponding value of df1:
for i in df2.loc[1:]:
df2.ix[i] = df2.ix[i-1] + df1
As an example my dataframes should look like this:
>>> df1
1 2 3 4
0 4 6 0 6
1 7 0 7 9
2 9 1 9 9
3 5 2 3 6
4 0 3 2 9
>>> df2
1 2 3 4
0 2 5 6 7
1 9 5 13 16
2 18 6 22 25
3 23 8 25 31
4 23 11 27 40
I know there is something wrong with how the indexes are referenced in the for loop, but I cannot figure out how the argument must be formulated. I would be very thankful for any help on this.
This is a simple cumsum:
df2 = df1.copy()
df2.loc[0] = [2,5,6,7]
desired_df = df2.cumsum()
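Checked against the example frames from the question:

```python
import pandas as pd

# the df1 shown in the question
df1 = pd.DataFrame([[4, 6, 0, 6], [7, 0, 7, 9], [9, 1, 9, 9],
                    [5, 2, 3, 6], [0, 3, 2, 9]],
                   columns=['1', '2', '3', '4'])

# replace the first row with the starting values, then accumulate down the rows
df2 = df1.copy()
df2.loc[0] = [2, 5, 6, 7]
desired_df = df2.cumsum()
```

The trick is that replacing row 0 before the cumulative sum makes the starting list the base of the accumulation, so each row equals the previous result plus the matching row of df1.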