Transform pandas timeseries into timeseries with non-date index - python

I'm trying to generate a timeseries from a dataframe, but the solutions I've found here don't really address my specific problem. I have a dataframe which is a series of id's which iterate from 1 to n, then repeat, like this:
key  ID  Var_1
0    1   1
0    2   1
0    3   2
1    1   3
1    2   2
1    3   1
I want to reshape it into a timeseries in which ID is the index and each value of key becomes its own column:
ID  Var_1_0  Var_2_0
1   1        3
2   1        2
3   2        1
I have tried the stack() method, but it doesn't generate the result I want. Generating an index from ID seems to be the right approach, but ID is not a proper date, so I'm not sure how to proceed. Pointers much appreciated.

Try this:
import pandas as pd
df = pd.DataFrame([[0,1,1], [0,2,1], [0,3,2], [1,1,3], [1,2,2], [1,3,1]], columns=('key', 'ID', 'Var_1'))
Use the pivot function:
df2 = df.pivot(index='ID', columns='key', values='Var_1')
You can rename the columns by:
df2.columns = ('Var_1_0', 'Var_2_0')
Result:
Out:
    Var_1_0  Var_2_0
ID
1         1        3
2         1        2
3         2        1
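For reference, the same reshape works with set_index and unstack; a minimal equivalent sketch using the same df as above:
# Stack the (ID, key) pairs into a MultiIndex, then unstack 'key'
# so each key value becomes its own column.
df3 = df.set_index(['ID', 'key'])['Var_1'].unstack('key')
df3.columns = ('Var_1_0', 'Var_2_0')  # same renaming step as above
print(df3)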

Related

Count number of occurrences in DataFrame per column

I have a sample dataframe whereby all numbers are userID:
from  to
1     3
1     2
2     3
How do I count the number of occurrences in each column, sum them up based on the same values, and display them in the following format in a new dataframe?
UserID  Occurences
1       2
2       2
3       2
Thank you.
IIUC, you can stack then value_counts
out = (df.stack().value_counts()
         .to_frame('Occurences')
         .rename_axis('UserID')
         .reset_index())
print(out)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
Use DataFrame.melt with GroupBy.size:
df = df.melt(value_name='UserID').groupby('UserID').size().reset_index(name='Occurences')
print (df)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
The pd.Series.value_counts method may be used to count the instances of each UserID in the columns 'from' and 'to', and pd.concat can be used to combine the results. At the end, create a dataframe from the resulting series using the pd.DataFrame.reset_index method:
import pandas as pd
data_frame = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})
occur = pd.concat([data_frame['from'].value_counts(), data_frame['to'].value_counts()])
result_df = occur.reset_index()
result_df.columns = ['UserID', 'occur']
result_df = result_df.groupby(['UserID'])['occur'].sum().reset_index()
print(result_df)
   UserID  occur
0       1      2
1       2      2
2       3      2
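Equivalently, you can concatenate the two columns first and count once, which avoids the extra groupby; a minimal sketch using the same data_frame as above:
# Stack 'from' and 'to' into one Series, then count each UserID once.
counts = pd.concat([data_frame['from'], data_frame['to']]).value_counts()
print(counts.rename_axis('UserID').reset_index(name='Occurences'))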

Pandas dataframe: merge rows into 1 row and sum a column

I have a pandas dataframe that contains a user id and the ad clicks (if any) by that user across several days:
df =pd.DataFrame([['A',0], ['A',1], ['A',0], ['B',0], ['B',0], ['B',0], ['B',1], ['B',1], ['B',1]],columns=['user_id', 'click_count'])
Out[8]:
  user_id  click_count
0       A            0
1       A            1
2       A            0
3       B            0
4       B            0
5       B            0
6       B            1
7       B            1
8       B            1
I would like to convert this dataframe into a dataframe with one row per user, where 'click_cnt' is the sum of click_count across all rows for that user in the original dataframe, i.e.
Out[18]:
  user_id  click_cnt
0       A          1
1       B          3
What you're after is the function groupby:
df = df.groupby('user_id', as_index=False).sum()
Adding the flag as_index=False will add the keys as a separate column instead of using them for the new index.
groupby is super useful - have a read through the documentation for more info.
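For comparison, a minimal sketch of both as_index forms, using the original df from the question; note the summed column keeps the name click_count, so a rename is needed to match the click_cnt header in the desired output:
# Default as_index=True: user_id becomes the index of the result.
print(df.groupby('user_id')['click_count'].sum())
# as_index=False keeps user_id as a regular column; rename to match.
out = (df.groupby('user_id', as_index=False)['click_count'].sum()
         .rename(columns={'click_count': 'click_cnt'}))
print(out)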

Summarize Pandas DataFrame by Column Values

I have a Pandas DataFrame in which each column is a binary indicator 1/0. It has 4 columns (and 7 rows). I would like to produce a DataFrame that groups the rows that are identical, with a last (5th) column showing the number of rows that fit that category. Please see the sample below:
df = pd.DataFrame([[0,1,1,0],
                   [0,1,1,0],
                   [0,0,0,1],
                   [0,0,0,1],
                   [1,1,1,0],
                   [1,1,1,1],
                   [1,1,1,0]])
res = pd.DataFrame([[0,1,1,0,2],
                    [0,0,0,1,2],
                    [1,1,1,0,2],
                    [1,1,1,1,1]])
I need to create the "res" DataFrame from df.
This is groupby + size
df.groupby(list(df)).size().to_frame('size').reset_index()
Out[612]:
   0  1  2  3  size
0  0  0  0  1     2
1  0  1  1  0     2
2  1  1  1  0     2
3  1  1  1  1     1
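On pandas 1.1+, DataFrame.value_counts gives the same counts directly; a minimal sketch (note it sorts by count rather than by the group key, so the row order can differ):
print(df.value_counts().reset_index(name='size'))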

Copy pandas DataFrame row to multiple other rows

Simple and practical question, yet I can't find a solution.
The questions I took a look at were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is: I pick a row from a dataframe, say df1, so I have a series. Now I have this other dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to replicate that series across all those rows.
df1:
Index/Col  A  B  C
1          0  0  0
2          0  0  0
3          1  2  3
4          0  0  0
df2:
Index/Col  A  B  C
1          0  0  0
2          0  0  0
3          0  0  0
4          0  0  0
What I want to accomplish is inserting df1[3] into rows df2[2] and df2[3], for example. So something like this non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col  A  B  C
1          0  0  0
2          1  2  3
3          1  2  3
4          0  0  0
Use loc and pass a list of the index labels of interest; after the following comma, the : indicates we want to set all column values. We then assign the series, but call the .values attribute so that it's a numpy array. Otherwise you will get a ValueError from the shape mismatch, since you're intending to overwrite 2 rows with a single row, and as a Series it won't align as you desire:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
   A  B  C
1  0  0  0
2  1  2  3
3  1  2  3
4  0  0  0
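If you'd rather keep a boolean condition like the one in the question, combine the comparisons with & instead of and (the Python and keyword can't be applied element-wise to an index, which is why the original attempt fails); a minimal sketch:
mask = (df2.index >= 2) & (df2.index <= 3)
df2.loc[mask, :] = df1.loc[3].values  # .values broadcasts the row as before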
Suppose you have to copy certain rows and columns from one dataframe to another dataframe; you can do this:
df2 = df.loc[x:y, a:b]  # x and y are the row bounds, a and b the column bounds you have to select

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe that I'm trying to drop rows from based on a criterion across select columns. If the values in all of these select columns are zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
   a  b  c
0  1  1  1
1  0  2  2
2  0  0  3
3  2  0  4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest] != 0], which overwrites your original df and fills the positions where the condition is not met with NaN values.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that a row is kept when it has at least a single non-NaN value. We can then use loc with the index of this result to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
EDIT
As pointed out by @DSM, you can achieve this simply by using any, passing axis=1 to test the condition across the columns of interest, and using this to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
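An equivalent way to state the condition is to drop a row only when all of the columns of interest are zero, i.e. negate an all instead of asserting an any; a minimal sketch with the same t:
print(t[~(t[cols_of_interest] == 0).all(axis=1)])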
