Reshape dataframe to have the same index as another dataframe - python

I have two dataframes:
dayData
power_comparison final_average_delta_power calculated_power
1 0.0 0.0 0
2 0.0 0.0 0
3 0.0 0.0 0
4 0.0 0.0 0
5 0.0 0.0 0
7 0.0 0.0 0
and
historicPower
power
0 0.0
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
I'm trying to reindex the historicPower dataframe so that it has the same index as the dayData dataframe (so in this example it would look like):
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
In reality the dataframes will be a lot larger, with different shapes.

I think you can use reindex, provided the index of historicPower has no duplicates:
historicPower = historicPower.reindex(dayData.index)
print (historicPower)
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
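As a minimal, self-contained sketch of the reindex approach: the frames below are rebuilt by hand from the sample data in the question, and the names aligned and partial are only illustrative. It also shows what seems worth knowing before running this on larger data: labels that are missing from historicPower come back as NaN rows rather than raising an error.

```python
import pandas as pd

# Sample frames rebuilt from the question (values typed in by hand)
dayData = pd.DataFrame(
    {"power_comparison": 0.0, "final_average_delta_power": 0.0, "calculated_power": 0},
    index=[1, 2, 3, 4, 5, 7],
)
historicPower = pd.DataFrame(
    {"power": [0.0, 0.0, 0.0, -1.0, 0.0, 1.0, 0.0]},
    index=[0, 1, 2, 3, 4, 5, 7],
)

# Keep only the rows whose labels appear in dayData, in dayData's order
aligned = historicPower.reindex(dayData.index)
print(aligned)

# Labels that do not exist in historicPower (6 here) become NaN rows
partial = historicPower.reindex([5, 6, 7])
print(partial)
```

If NaN rows are not wanted, reindex also accepts a fill_value argument.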

Related

How can I assign the words from a specific column as a label to a new dataframe

Hi Friend I'm new here 😊,
I want to build a matrix from the most repeated words in a specific column A and add it to my data frame, using the words themselves as the column labels.
What I have:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
What is my goal:
I want to do:
1- Split the strings and count the words in a specific column
2- Make a zero matrix
3- Label the new matrix with the words found in step 1 (my problem)
4- Search every row: 1 if the word is found, else 0
The data frame I currently get as a result:
A word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
What I did:
import pandas as pd
import numpy as np
# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df
# 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
# 3- Count the separated words and their frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)
A Count
0 is 4
1 This 3
2 orange 1
3 That 1
4 yellow 1
5 Orange 1
6 are 1
7 a 1
8 an 1
9 These 1
10 green 1
11 pen 1
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(df_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
What I wanted:
The words in "A" from step 3 should be the labels of the new matrix instead of 0 1 2 ...
A word_count char_count is This orange etc.
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
So I changed your code a little, your step 3 looks like this:
# 3- Count the separated words and their frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
The big change is the variable name in list_word_count=list(df_word_count["A"]): the list no longer overwrites the df_word_count data frame.
The rest of the code looks like this with the new variable:
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
Only the variable name differs. I then add a seventh step:
# 7- Rename the columns from a list
# This creates a list of the words you wanted
l = list(df_word_count["A"])
# This list contains only the words found in column A, but the result you
# showed also has the columns A, word_count and char_count, so we insert
# those names at the beginning of the list
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
I get this:
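For reference, the renaming step can also be folded into the construction of the matrix. The sketch below is a compact variant of the same idea, not the code from the answer above: it builds one named column per word directly, so no rename is needed afterwards. The names words and features are only illustrative, and the order of words that occur equally often is not guaranteed.

```python
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)

# Word and character counts per row
df["word_count"] = df["A"].str.split().str.len()
df["char_count"] = df["A"].str.len()

# Distinct words, most frequent first
words = df["A"].str.split().explode().value_counts().index.tolist()

# One column per word, holding how often that word occurs in each row
features = pd.DataFrame(
    {w: df["A"].apply(lambda s, w=w: float(s.split().count(w))) for w in words},
    index=df.index,
)

Complete_data = pd.concat([df, features], axis=1)
print(Complete_data.columns.tolist())
```

Because the dictionary keys become the column labels, the words appear as headers directly after A, word_count and char_count.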

Reset the counter when column has non zero value

I have the dataframe with a column.
A
0.0
0.0
0.0
12.0
0.0
0.0
34.0
0.0
0.0
0.0
0.0
11.0
I want output like this, with a counter column. The counter should restart after each non-zero value: on the row after every non-zero value, the counter is initialized again and then increments.
A Counter
0.0 1
0.0 2
0.0 3
12.0 4
0.0 1
0.0 2
34.0 3
0.0 1
0.0 2
0.0 3
0.0 4
11.0 5
Let us use cumsum to create the groupby key; the [::-1] reverses the order, so that each non-zero row closes its group:
df['Counter'] = df.A.groupby(df.A.ne(0)[::-1].cumsum()).cumcount()+1
Out[442]:
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 4
11 5
dtype: int64
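To make the grouping key easier to follow, here is a small runnable sketch on the sample column. It computes the counter with the reversed cumsum from above and with a forward-looking variant that shifts the non-zero flag down by one row; the variant and the names is_nonzero and block_id are my own illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.0, 0.0, 0.0, 12.0, 0.0, 0.0, 34.0,
                         0.0, 0.0, 0.0, 0.0, 11.0]})

# Approach from the answer: reversed cumsum as the groupby key
reversed_key = df.A.ne(0)[::-1].cumsum()
counter_reversed = df.A.groupby(reversed_key).cumcount() + 1

# Forward variant: a new block starts on the row AFTER a non-zero value
is_nonzero = df["A"].ne(0)
block_id = is_nonzero.shift(fill_value=False).cumsum()
df["Counter"] = df.groupby(block_id).cumcount() + 1

print(df)
```

Both keys put every non-zero row at the end of its block, which is why the counter keeps incrementing on that row and restarts at 1 on the next one.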

How to cope around memory overflow with Pivot table?

I have two medium-sized datasets which look like:
books_df.head()
ISBN Book-Title Book-Author
0 0195153448 Classical Mythology Mark P. O. Morford
1 0002005018 Clara Callan Richard Bruce Wright
2 0060973129 Decision in Normandy Carlo D'Este
3 0374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata
4 0393045218 The Mummies of Urumchi E. J. W. Barber
and
ratings_df.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
And I wanna get a pivot table like this:
ISBN 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
User-ID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I've tried:
R_df = ratings_df.pivot(index = 'User-ID', columns ='ISBN', values = 'Book-Rating').fillna(0) # Memory overflow
which failed with:
MemoryError:
and this:
R_df = q_data.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()
which failed with the same error.
I want to use it for singular value decomposition and matrix factorization.
Any ideas?
The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
One option is to use pandas Sparse functionality, since your data here is (very) sparse:
In [11]: df
Out[11]:
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')
In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN 0155061224 034545104X 0446520802 052165615X 0521795028
User-ID
276725 0 0 0 0 0
276726 5 0 0 0 0
276727 0 0 0 0 0
276729 0 0 0 3 6
In [14]: _.dtypes
Out[14]:
ISBN
0155061224 Sparse[int64, 0]
034545104X Sparse[int64, 0]
0446520802 Sparse[int64, 0]
052165615X Sparse[int64, 0]
0521795028 Sparse[int64, 0]
dtype: object
My understanding is that you can then use this with scipy e.g. for SVD:
In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
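As a rough sketch of the scipy side, the user/ISBN matrix can also be built as a sparse matrix directly from integer codes, without ever creating the dense pivot, and then handed to a truncated SVD. This is an alternative I am assuming fits the use case, not the code from the answer above: it relies on pd.factorize, scipy.sparse.csr_matrix and scipy.sparse.linalg.svds, uses the five sample rows from the question, and assumes each (User-ID, ISBN) pair occurs at most once (csr_matrix sums duplicates rather than averaging them).

```python
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Sample rows from the question
ratings_df = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# Map users and ISBNs to consecutive integer row/column positions
user_codes, users = pd.factorize(ratings_df["User-ID"], sort=True)
isbn_codes, isbns = pd.factorize(ratings_df["ISBN"], sort=True)

# Sparse users x books matrix; svds needs a floating point dtype
R = csr_matrix(
    (ratings_df["Book-Rating"].to_numpy(dtype=float), (user_codes, isbn_codes)),
    shape=(len(users), len(isbns)),
)
R.eliminate_zeros()  # drop explicitly stored 0 ratings

# Truncated SVD; k must be smaller than min(R.shape)
U, s, Vt = svds(R, k=2)
print(R.shape, R.nnz, s)
```

The users and isbns arrays keep the mapping from matrix positions back to the original labels.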

Pandas: create new (sub-level) columns in a multi-index dataframe and assign values

Consider a DataFrame like the following one:
import pandas as pd
import numpy as np
a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a,b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3,4]), columns=mi)
first a b
second i ii i ii
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
I would like to create new columns iii for all first-level columns and assign the value of a new array (of matching size). I tried the following, to no avail.
A.loc[:,pd.IndexSlice[:,'iii']] = np.arange(6).reshape(3,-1)
The result should look like this:
a b
i ii iii i ii iii
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 2.0 0.0 0.0 3.0
2 0.0 0.0 4.0 0.0 0.0 5.0
Since the columns are a MultiIndex, I recommend creating the additional columns as a separate DataFrame and then concatenating it back:
appenddf=pd.DataFrame(np.arange(6).reshape(3,-1),
                      index=A.index,
                      columns=pd.MultiIndex.from_product([A.columns.levels[0],['iii']]))
appenddf
a b
iii iii
0 0 1
1 2 3
2 4 5
A=pd.concat([A,appenddf],axis=1).sort_index(level=0,axis=1)
A
first a b
second i ii iii i ii iii
0 0.0 0.0 0 0.0 0.0 1
1 0.0 0.0 2 0.0 0.0 3
2 0.0 0.0 4 0.0 0.0 5
Another workable solution:
for i,x in enumerate(A.columns.levels[0]):
    A[x,'iii']=np.arange(6).reshape(3,-1)[:,i]
A
first a b a b
second i ii i ii iii iii
0 0.0 0.0 0.0 0.0 0 1
1 0.0 0.0 0.0 0.0 2 3
2 0.0 0.0 0.0 0.0 4 5
# here I did not add `sort_index`
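Putting the concat approach together, a self-contained sketch might look like the following. The names new_values, new_cols and extra are illustrative, and passing names=A.columns.names is my addition so that the level names first and second are set explicitly on the new columns.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "b"], ["i", "ii"]], names=["first", "second"])
A = pd.DataFrame(np.zeros([3, 4]), columns=mi)

# One new column per first-level label, all with second-level label 'iii'
new_values = np.arange(6).reshape(3, -1)
new_cols = pd.MultiIndex.from_product(
    [A.columns.levels[0], ["iii"]], names=A.columns.names
)
extra = pd.DataFrame(new_values, index=A.index, columns=new_cols)

# Append the new columns, then sort so 'iii' sits next to 'i' and 'ii'
A = pd.concat([A, extra], axis=1).sort_index(axis=1)
print(A)
```

Without the sort_index call the new columns stay at the right-hand edge, as in the loop version above.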

pandas count the number of value different from zero between two zeros

I have the following dataframe
0 0 0
1 0 0
1 1 0
1 1 1
1 1 1
0 0 0
0 1 0
0 1 0
0 0 0
How do you get a dataframe which looks like this?
0 0 0
4 0 0
4 3 0
4 3 2
4 3 2
0 0 0
0 2 0
0 2 0
0 0 0
Thank you for your help.
You may need a for loop here, combined with transform: use cumsum to create the group key, then assign the counts back to the non-zero positions of your original df:
for x in df.columns:
    df.loc[df[x]!=0,x]=df[x].groupby(df[x].eq(0).cumsum()[df[x]!=0]).transform('count')
df
Out[229]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
Or without a for loop:
s=df.stack().sort_index(level=1)
s2=s.groupby([s.index.get_level_values(1),s.eq(0).cumsum()]).transform('count').sub(1).unstack()
df=df.mask(df!=0).combine_first(s2)
df
Out[255]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
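A more explicit way to read the same idea is to label each consecutive run of zero or non-zero values in a column and then write the run length back onto the non-zero rows. The sketch below is my own variant under that reading, not the code from the answer above; the helper name run_lengths is hypothetical and the frame is typed in from the question with column labels 1, 2, 3.

```python
import pandas as pd

df = pd.DataFrame({
    1: [0, 1, 1, 1, 1, 0, 0, 0, 0],
    2: [0, 0, 1, 1, 1, 0, 1, 1, 0],
    3: [0, 0, 0, 1, 1, 0, 0, 0, 0],
})

def run_lengths(col):
    """Replace each non-zero value by the length of its consecutive non-zero run."""
    nonzero = col.ne(0)
    # A new run id starts whenever the zero/non-zero state changes
    run_id = nonzero.ne(nonzero.shift()).cumsum()
    # Length of each run, repeated on every row of that run
    sizes = col.groupby(run_id).transform("size")
    # Keep the lengths on non-zero rows only; zero rows stay 0
    return sizes.where(nonzero, 0)

result = df.apply(run_lengths)
print(result)
```

Since each column is processed independently, df.apply takes the place of the explicit loop over df.columns.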
