Creating a dataframe in pandas by multiplying two series together - python

Say I have two series in pandas, series A and series B. How do I create a dataframe in which all of those values are multiplied together, i.e. with series A down the left-hand side and series B along the top? Basically the same concept as this, where series A would be the yellow on the left and series B the yellow along the top, and all the values in between would be filled in by multiplication:
http://www.google.co.uk/imgres?imgurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables/times-table-12x12.gif&imgrefurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables.htm&h=533&w=720&sz=58&tbnid=9B8R_kpUloA4NM:&tbnh=90&tbnw=122&zoom=1&usg=__meqZT9kIAMJ5b8BenRzF0l-CUqY=&docid=j9BT8tUCNtg--M&sa=X&ei=bkBpUpOWOI2p0AWYnIHwBQ&ved=0CE0Q9QEwBg
Sorry, should probably have added that my two series are not the same length. I'm getting an error now that 'matrices are not aligned' so I assume that's the problem.

You can use matrix multiplication with dot, but first you have to convert each Series to a DataFrame (because the dot method on a Series implements the dot product):
>>> B = pd.Series(range(1, 5))
>>> A = pd.Series(range(1, 5))
>>> dfA = pd.DataFrame(A)
>>> dfB = pd.DataFrame(B)
>>> dfA.dot(dfB.T)
   0  1   2   3
0  1  2   3   4
1  2  4   6   8
2  3  6   9  12
3  4  8  12  16
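If the two Series are different lengths (as in the edit to the question), a minimal sketch using NumPy's outer product builds the same multiplication table directly; A and B here are just example data, not the asker's series:
import numpy as np
import pandas as pd

A = pd.Series([1, 2, 3, 4, 5])
B = pd.Series([10, 20, 30])

# np.outer computes every pairwise product: table[i, j] = A[i] * B[j]
table = pd.DataFrame(np.outer(A, B), index=A.index, columns=B.index)
print(table)
Because np.outer works on the raw values, no index alignment is involved, so the 'matrices are not aligned' error cannot occur.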

You can create a DataFrame by multiplying two series of unequal length, broadcasting each value of one series (the rows) against the other series (the columns). For example:
>>> row = pd.Series(np.arange(1, 6), index=np.arange(1, 6))
>>> col = pd.Series(np.arange(1, 4), index=np.arange(1, 4))
>>> row.apply(lambda r: r * col)
   1   2   3
1  1   2   3
2  2   4   6
3  3   6   9
4  4   8  12
5  5  10  15

First create a DataFrame of 1's. Then broadcast multiply along each axis in turn.
>>> from pandas import Series, DataFrame
>>> s1 = Series([1, 2, 3, 4, 5])
>>> s2 = Series([10, 20, 30])
>>> df = DataFrame(1, index=s1.index, columns=s2.index)
>>> df
   0  1  2
0  1  1  1
1  1  1  1
2  1  1  1
3  1  1  1
4  1  1  1
>>> df.multiply(s1, axis='index') * s2
    0    1    2
0  10   20   30
1  20   40   60
2  30   60   90
3  40   80  120
4  50  100  150
You need to use df.multiply in order to specify that the series will line up with the row index. You can use the normal multiplication operator * with s2 because matching on columns is the default way of doing multiplication between a DataFrame and a Series.
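As a rough one-expression variant of the same idea (just a sketch, spelling out both axes rather than relying on the default column matching):
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30])

# Start from a frame of ones, broadcast s1 down the rows, then s2 across the columns
table = (pd.DataFrame(1, index=s1.index, columns=s2.index)
           .multiply(s1, axis='index')
           .multiply(s2, axis='columns'))
print(table)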

So I think this may get you most of the way there if you have two series of different lengths. This seems like a very manual process but I cannot think of another way using pandas or NumPy functions.
>>> a = Series([1, 3, 3, 5, 5])
>>> b = Series([5, 10])
First convert your row values a into a DataFrame, then add a copy of this Series as a new column for each value in your column series b:
>>> result = DataFrame(a)
>>> for i in range(len(b)):
...     result[i] = a
...
>>> result
   0  1
0  1  1
1  3  3
2  3  3
3  5  5
4  5  5
You can then broadcast your Series b over your DataFrame result:
>>> result = result.mul(b)
>>> result
    0   1
0   5  10
1  15  30
2  15  30
3  25  50
4  25  50
In the example I have chosen you will end up with duplicate index labels, because of the repeated values in the initial Series. I would recommend keeping the index as unique identifiers; otherwise selecting an index label that is assigned to more than one row will return more than one row. If you must, you can relabel your rows and columns like this:
>>> result.columns = b
>>> result = result.set_index(a)
>>> result
    5   10
1   5   10
3  15   30
3  15   30
5  25   50
5  25   50
Example of duplicate indexing:
>>> result.loc[3]
    5   10
3  15   30
3  15   30

In order to use the DataFrame.dot method, you need to transpose one of the series:
>>> a = pd.Series([1, 2, 3, 4])
>>> b = pd.Series([10, 20, 30])
>>> a.to_frame().dot(b.to_frame().transpose())
    0   1    2
0  10  20   30
1  20  40   60
2  30  60   90
3  40  80  120
Also make sure the two Series have the same name (or are both unnamed), because DataFrame.dot aligns the column labels of the left frame with the index labels of the right frame.
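A small sketch of that naming detail (the name 'v' is arbitrary, chosen only for the example):
import pandas as pd

a = pd.Series([1, 2, 3, 4], name='v')
b = pd.Series([10, 20, 30], name='v')

# a.to_frame() has the single column 'v' and b.to_frame().T has the single index 'v',
# so the labels line up and dot returns the 4x3 multiplication table
print(a.to_frame().dot(b.to_frame().T))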

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1  Column 2
5         3
3         3
4         3
7         2
8         2
2         2
5         4
4         4
9         4
8         2
2         2
3         2
5         2
Please take note that if the number of rows is not divisible by n, the last values are incorporated into the last group. So in this example n=4 for the end of the dataframe.
Thanking you in advance!
I do not know any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together on their own, instead of being grouped with the 3 previous values as required here.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
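A more compact alternative, not from the answer above but a sketch of the same idea: clamp the group labels so that a short trailing group is merged into the previous full group, then take a single transform('min') (this assumes the frame has at least n rows):
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Label of the last full group; any leftover rows are clamped onto it
last_full = len(df) // n - 1
groups = np.minimum(df.index // n, last_full)
df['new_col'] = df.groupby(groups)['col1'].transform('min')
print(df)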

Creating a DataFrame with repeating values

I'm trying to create a dataframe in Pandas that has two variables ("date" and "time_of_day"), where "date" is 120 observations long, covering 30 days (each day has four observations: 1,1,1,1; 2,2,2,2; etc.), and the second variable "time_of_day" repeats the values 1,2,3,4 thirty times.
The closest I found to this question was here: How to create a series of numbers using Pandas in Python, which got me the below code, but I'm receiving an error that it must be a 1-dimensional array.
df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})
So the final dataframe would look something like:
date  time_of_day
1     1
1     2
1     3
1     4
2     1
2     2
2     3
2     4
Thanks much!
You need np.repeat for the date column and np.tile for the time_of_day column:
df = pd.DataFrame({'date': np.repeat(range(1,31),4),
'time_of_day': np.tile([1, 2, 3, 4],30)})
print(df.head(10))
date time_of_day
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
Or you could use pd.MultiIndex.from_product, with the same result:
df = (
    pd.MultiIndex.from_product([range(1, 31), range(1, 5)],
                               names=['date', 'time_of_day'])
      .to_frame(index=False)
)
Or product from itertools:
from itertools import product
df = pd.DataFrame(product(range(1,31), range(1,5)), columns=['date','time_of_day'])
Or use the newer how='cross' option of merge:
out = pd.DataFrame(range(1,31)).merge(pd.DataFrame([1, 2, 3, 4]),how='cross')
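In that last version the two frames both carry the default column label 0, so the result comes back with suffixed column names rather than date and time_of_day. A sketch with the intended names (how='cross' requires pandas 1.2 or later):
import pandas as pd

dates = pd.DataFrame({'date': range(1, 31)})
times = pd.DataFrame({'time_of_day': [1, 2, 3, 4]})

# Cartesian product of the two frames: one row per (date, time_of_day) pair
out = dates.merge(times, how='cross')
print(out.head(8))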

How to find the maximum value of a column with pandas?

I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30th-32nd columns (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of the dataframe where it occurs.
print(Max_kVA_df.iloc[:, 30:33].max())
Hi, you can refer to this example:
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],
'col2':[4,5,6,7,8],
'col3':[2,3,4,5,7]
})
print(df)
#print(df.iloc[:,0:3].max())# Mention range of the columns which you want, In your case change 0:3 to 30:33, here 33 will be excluded
ser=df.iloc[:,0:3].max()
print(ser.max())
Output
8
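The question also asks for the dataframe index of the maximum; a sketch using idxmax on top of the same kind of column slice (using the sample columns 0:3 here, swap in 30:33 for the real table):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})

sub = df.iloc[:, 0:3]        # use df.iloc[:, 30:33] for the 30th-32nd columns
col = sub.max().idxmax()     # column holding the overall maximum
row = sub[col].idxmax()      # row index where that maximum occurs
print(row, col, sub.loc[row, col])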
Select values by position and use np.max:
Sample, for the maximum over the first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc this way:
print(df.iloc[[30, 31], 2].max())

How to get dataframe of unique ids

I'm trying to group the following dataframe by unique binId, and then parse the resulting rows based on 'z' and pick the row with the highest value of 'z'. Here is my dataframe.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
I tried the following code, which gives the required answer,
def f(x):
    tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
    return tp
and then,
binids= pd.Series(df.binId.unique())
print(binids.apply(f))
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is the faster way of doing this?
Use idxmax for indices of max and select by loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or, faster, use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4
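If the printed column order and a fresh 0-based index matter (as in the asker's expected output), a small follow-up sketch on top of the drop_duplicates version:
import pandas as pd

df = pd.DataFrame({'ID': ['1','2','3','4','5','6'],
                   'binId': ['1','2','2','1','1','3'],
                   'x': [1,4,5,6,3,4],
                   'y': [11,24,35,16,23,34],
                   'z': [1,4,5,2,3,4]})

# Keep the row with the largest z per binId, reorder the columns, and renumber the index
df1 = (df.sort_values(['binId', 'z'])
         .drop_duplicates('binId', keep='last')
         .loc[:, ['binId', 'ID', 'x', 'y', 'z']]
         .reset_index(drop=True))
print(df1)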

How to find the top 10 pairs that appear the most in a pandas dataframe in python

I have a pandas dataframe in python with columns 'a', 'b', 'c'. The distinct 'a','b' pairs each repeat multiple times. 'c' is changing all the time. I want to find the 10 'a','b' pairs that appear the most and put them in a dataframe, but I don't know how. Any help is appreciated.
I'm not entirely sure I follow you, but assuming you mean you have a DataFrame looking something like
>>> N = 1000
>>> df = pd.DataFrame(np.random.randint(0, 10, (N, 3)), columns="A B C".split())
>>> df.head()
A B C
0 7 4 5
1 5 1 3
2 8 9 8
3 2 3 0
4 2 3 0
and you simply want to count (A, B) pairs, that's easy enough:
>>> df.groupby(["A", "B"]).size().sort_values().iloc[-10:]
A B
6 1 13
1 0 14
4 0 14
7 2 14
1 6 15
8 2 15
1 8 16
2 6 16
6 4 16
7 4 16
dtype: int64
That can be broken down into four parts:
groupby, which groups the data by (A, B) tuples
size, which computes the size of each group
sort_values, which returns the Series sorted by value
iloc, which lets us select the last 10 entries in the Series by position
That results in a Series, but you could make a DataFrame out of it simply by passing it to pd.DataFrame.
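For example, a sketch of that last step, turning the counts Series into a DataFrame with a named count column:
import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame(np.random.randint(0, 10, (N, 3)), columns="A B C".split())

# Count each (A, B) pair, keep the 10 most frequent, and flatten back into columns
top10 = df.groupby(["A", "B"]).size().sort_values().iloc[-10:]
top10_df = top10.reset_index(name="count")
print(top10_df)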
