Combine multiple Pandas series with identical column names, but different indices - python

I have many pandas series structured more or less as follows.
s1              s2              s3              s4
Date  val1      Date  val1      Date  val2      Date  val2
Jan   10        Apr   25        Jan   14        Apr   11
Feb   11        May   18        Feb   17        May   7
Mar   8         Jun   15        Mar   16        Jun   21
I would like to combine these series into a single data frame, with structure as follows:
Date  val1  val2
Jan   10    14
Feb   11    17
Mar   8     16
Apr   25    11
May   18    7
Jun   15    21
In an attempt to combine them, I have tried using pd.concat to create this single data frame. However, I have not been able to do so. The result of pd.concat(series, axis=1) (where series is the list [s1, s2, s3, s4]) is:
Date  val1  val1  val2  val2
Jan   10    nan   14    nan
Feb   11    nan   17    nan
Mar   8     nan   16    nan
Apr   nan   25    nan   11
May   nan   18    nan   7
Jun   nan   15    nan   21
And pd.concat(series, axis=0) simply creates a single series, ignoring the column names.
Is there a parameter in concat that will yield my desired result? Or is there some other function that can collapse the incorrect, nan-filled data frame into a frame with non-repeated columns and no nans?

One way to do this is to group by Date and take the first non-null value in each group:
(pd.concat([s1, s2, s3, s4])
   .groupby('Date', as_index=False, sort=False).first()
)
Output:
  Date  val1  val2
0  Jan    10    14
1  Feb    11    17
2  Mar     8    16
3  Apr    25    11
4  May    18     7
5  Jun    15    21
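For reference, a minimal self-contained sketch of this answer, assuming each "series" is really a two-column frame with a Date column (which the groupby('Date') implies):

import pandas as pd

# Hypothetical reconstruction of the four inputs
s1 = pd.DataFrame({'Date': ['Jan', 'Feb', 'Mar'], 'val1': [10, 11, 8]})
s2 = pd.DataFrame({'Date': ['Apr', 'May', 'Jun'], 'val1': [25, 18, 15]})
s3 = pd.DataFrame({'Date': ['Jan', 'Feb', 'Mar'], 'val2': [14, 17, 16]})
s4 = pd.DataFrame({'Date': ['Apr', 'May', 'Jun'], 'val2': [11, 7, 21]})

# Stack everything, then collapse duplicate Dates; first() returns the
# first non-null value per column within each Date group, so the NaNs
# introduced by the stacking disappear
out = (pd.concat([s1, s2, s3, s4])
         .groupby('Date', as_index=False, sort=False).first())
print(out)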

Related

Pandas MultiIndex: iterate rows and add specific values to create a new variable

I have a pandas data frame with a MultiIndex (id and datetime) and one column named X1.
                       X1
id        datetime
a1ssjdldf 2019 Jul 10   2
          2019 Jul 11  22
          2019 Jul 12  21
r2dffs    2019 Jul 10  14
          2019 Jul 11  13
          2019 Jul 12  11
I want to create a new variable X2 where the corresponding value is the difference between the X1 value of the same row and the X1 value of the previous row. But every time it sees a new id the corresponding value has to be restarted from zero.
For example:
                       X1  X2
id        datetime
a1ssjdldf 2019 Jul 10   2   0
          2019 Jul 11  22  20
          2019 Jul 12  21  -1
r2dffs    2019 Jul 10  14   0
          2019 Jul 11  13  -1
          2019 Jul 12  11  -2
Use DataFrameGroupBy.diff, grouping by the first index level, and replace the missing values with Series.fillna:
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0, downcast='int')
print(df)
                       X1  X2
id        datetime
a1ssjdldf 2019 Jul 10   2   0
          2019 Jul 11  22  20
          2019 Jul 12  21  -1
r2dffs    2019 Jul 10  14   0
          2019 Jul 11  13  -1
          2019 Jul 12  11  -2
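A self-contained sketch of the same idea. Note the downcast argument to fillna is deprecated in recent pandas, so an explicit astype is the safer spelling:

import pandas as pd

# Hypothetical reconstruction of the MultiIndex frame from the question
idx = pd.MultiIndex.from_product(
    [['a1ssjdldf', 'r2dffs'],
     ['2019 Jul 10', '2019 Jul 11', '2019 Jul 12']],
    names=['id', 'datetime'])
df = pd.DataFrame({'X1': [2, 22, 21, 14, 13, 11]}, index=idx)

# diff within each id (the first index level); the first row of each
# group becomes NaN, which we replace with 0 before casting back to int
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0).astype(int)
print(df)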

How do I find the count of a particular column [Model], based on another column [SoldDate] using pandas?

I have a dataframe with 3 columns: SoldDate, Model and TotalSoldCount. How do I create a new column, 'CountSoldbyMonth', which will give the count of each of the many models sold monthly? A sample describing the problem is given below.
The 'CountSoldbyMonth' should always be less than or equal to the 'TotalSoldCount'.
I am new to Python.
Date    Model  TotalSoldCount
Jan 19  A      4
Jan 19  A      4
Jan 19  A      4
Jan 19  B      6
Jan 19  C      2
Jan 19  C      2
Feb 19  A      4
Feb 19  B      6
Feb 19  B      6
Feb 19  B      6
Mar 19  B      6
Mar 19  B      6
The new df should look like this.
Date    Model  TotalSoldCount  CountSoldbyMonth
Jan 19  A      4               3
Jan 19  A      4               3
Jan 19  A      4               3
Jan 19  B      6               1
Jan 19  C      2               2
Jan 19  C      2               2
Feb 19  A      4               1
Feb 19  B      6               3
Feb 19  B      6               3
Feb 19  B      6               3
Mar 19  B      6               2
Mar 19  B      6               2
I tried doing
df['CountSoldbyMonth'] = df.groupby(['date','model']).totalsoldcount.transform('sum')
but it is generating a different value.
Suppose you have this data set:
    date    model  totalsoldcount
0   Jan 19  A      110
1   Jan 19  A      110
2   Jan 19  A      110
3   Jan 19  B      50
4   Jan 19  C      70
5   Jan 19  C      70
6   Feb 19  A      110
7   Feb 19  B      50
8   Feb 19  B      50
9   Feb 19  B      50
10  Mar 19  B      50
11  Mar 19  B      50
And you want to define a new column, countsoldbymonth. You can group by the date and model columns, sum totalsoldcount with a transform, and then create the new column:
s['countsoldbymonth'] = s.groupby([
    'date',
    'model'
]).totalsoldcount.transform('sum')
print(s)
    date    model  totalsoldcount  countsoldbymonth
0   Jan 19  A      110             330
1   Jan 19  A      110             330
2   Jan 19  A      110             330
3   Jan 19  B      50              50
4   Jan 19  C      70              140
5   Jan 19  C      70              140
6   Feb 19  A      110             110
7   Feb 19  B      50              150
8   Feb 19  B      50              150
9   Feb 19  B      50              150
10  Mar 19  B      50              100
11  Mar 19  B      50              100
Or, if you just want to see the sums without creating a new column you can use sum instead of transform like this:
print(s.groupby([
    'date',
    'model'
]).totalsoldcount.sum())
date    model
Feb 19  A      110
        B      150
Jan 19  A      330
        B       50
        C      140
Mar 19  B      100
Edit
If you just want to know how many sales were made in each month, you can do the same groupby but use count instead of sum:
df['CountSoldByMonth'] = df.groupby([
    'Date',
    'Model'
]).TotalSoldCount.transform('count')
print(df)
    Date    Model  TotalSoldCount  CountSoldByMonth
0   Jan 19  A      4               3
1   Jan 19  A      4               3
2   Jan 19  A      4               3
3   Jan 19  B      6               1
4   Jan 19  C      2               2
5   Jan 19  C      2               2
6   Feb 19  A      4               1
7   Feb 19  B      6               3
8   Feb 19  B      6               3
9   Feb 19  B      6               3
10  Mar 19  B      6               2
11  Mar 19  B      6               2
It's easier to help if you give code that lets the user experiment. In this case, taking your dataframe (df) and doing the following should work:
df['CountSoldbyMonth'] = df.groupby(['Date','Model'])['TotalSoldCount'].transform('sum')
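As the Edit above shows, it is 'count' (not 'sum') that reproduces the desired CountSoldbyMonth. A self-contained sketch, rebuilding the question's small example:

import pandas as pd

# Reconstruction of the question's sample data
df = pd.DataFrame({
    'Date': ['Jan 19'] * 6 + ['Feb 19'] * 4 + ['Mar 19'] * 2,
    'Model': ['A', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'B', 'B', 'B', 'B'],
    'TotalSoldCount': [4, 4, 4, 6, 2, 2, 4, 6, 6, 6, 6, 6],
})

# 'count' counts the rows in each (Date, Model) group, i.e. how many
# units of that model were sold that month; 'sum' would add up
# TotalSoldCount instead, which is why the first attempt was off
df['CountSoldbyMonth'] = (df.groupby(['Date', 'Model'])['TotalSoldCount']
                            .transform('count'))
print(df)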

How to add a column with the time to a pandas dataframe (created from a JSON)?

I retrieve data (JSON format) from a software API and transform it into a dataframe in order to write it to a CSV (pandas library). I would like to add a column with the time: "time" on the header row and, for example, "Fri Mar 29 09:16:02 2019" on the following ones. Any idea how to achieve this?
I managed to add the time, but only on the first row of my dataframe.
import json
import pandas as pd
import time
import urllib.request

url = 'http://localhost:47800/api/v1/bacnet/devices/0/objects?properties=present-value&properties=object-name'
req = urllib.request.Request(url)
r = urllib.request.urlopen(req).read()
data = json.loads(r.decode('utf-8'))

# note: this rebinds the name 'time', shadowing the time module
time = time.asctime(time.localtime(time.time()))
result = pd.io.json.json_normalize(data['objects'])
result_tri = result.reindex(columns=[time, 'object-name', 'present-value'])
Current result
  Fri Mar 29 09:47:36 2019  object-name        present-value
0 NaN                       Température_1 0    660.0
1 NaN                       Humidité_1 1       497.0
2 NaN                       Pression_1 2       497.0
3 NaN                       Vitesse_Vent 3     497.0
4 NaN                       Luminosité 4       497.0
5 NaN                       Etat_Pompe 3       0.0
6 NaN                       Greisch_Simulator  NaN
7 NaN                       networkPort 30800  NaN
Desired result
  Time                      object-name        present-value
0 Fri Mar 29 09:47:36 2019  Température_1 0    660.0
1 Fri Mar 29 09:47:36 2019  Humidité_1 1       497.0
2 Fri Mar 29 09:47:36 2019  Pression_1 2       497.0
3 Fri Mar 29 09:47:36 2019  Vitesse_Vent 3     497.0
4 Fri Mar 29 09:47:36 2019  Luminosité 4       497.0
5 Fri Mar 29 09:47:36 2019  Etat_Pompe 3       0.0
6 Fri Mar 29 09:47:36 2019  Greisch_Simulator  NaN
7 Fri Mar 29 09:47:36 2019  networkPort 30800  NaN
Use:
result_tri = result.reindex(columns=['Time', 'object-name', 'present-value'])
result_tri['Time'] = time
You can add the new column to your df directly.
When you write
result_tri = result.reindex(columns=[time, 'object-name', 'present-value'])
you are actually doing
result_tri = result.reindex(columns=["Fri Mar 29 09:47:36 2019", 'object-name', 'present-value'])
because time is a variable in your method and gets replaced with the value you assigned to it.
You just need to do:
result = pd.io.json.json_normalize(data['objects'])
result['Time'] = time.asctime(time.localtime(time.time()))
result = result.reindex(columns=['Time', 'object-name', 'present-value'])
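A runnable sketch of that answer, with a hypothetical payload standing in for the API response; note that pd.io.json.json_normalize is deprecated in newer pandas in favour of pd.json_normalize:

import time
import pandas as pd

# Hypothetical stand-in for the JSON returned by the API
data = {'objects': [
    {'object-name': 'Température_1', 'present-value': 660.0},
    {'object-name': 'Humidité_1', 'present-value': 497.0},
]}

result = pd.json_normalize(data['objects'])

# assigning a scalar broadcasts it to every row of the column
result['Time'] = time.asctime(time.localtime(time.time()))
result = result.reindex(columns=['Time', 'object-name', 'present-value'])
print(result)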

Moving average for months over years

I am new to pandas and would appreciate guidance with the following problem. I have a dataframe that looks like the following:
In [88]: df.head()
Out[88]:
      Jan  Feb  Mar  Apr  May  Jun  ...  Dec
Year                                ...
1758   13   15   14    5    5    5  ...   12
1759   11   10    7    4    3    6  ...   11
1760   19   15   18    5   13    6  ...   11
1761   14   16   14    9    9   11  ...   10
1762   13   12   12    8    5    3  ...   11
I need to compute moving average per month in the following way:
moving_average(Mar 1761) = value(Mar 1761) / sum(values from Sep 1760 through Aug 1761)
If I am using the rolling average function of pandas, how do I code the logic to inspect predecessor or successor row for a particular point?
The easiest approach is to reshape the data to a long format using .stack, which can be passed straight into a rolling mean.
In [34]: pd.rolling_mean(df.stack(), window=12)
Out[34]:
Year
1758  Jan         NaN
      Feb         NaN
      Mar         NaN
      Apr         NaN
      May         NaN
      Jun         NaN
      Jul         NaN
      Aug         NaN
      Sep         NaN
      Oct         NaN
      Nov         NaN
      Dec    0.035038
1759  Jan   -0.076660
      Feb   -0.153907
      Mar   -0.286818
      Apr   -0.306684
      May   -0.159371
      Jun   -0.230627
      Jul   -0.175845
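pd.rolling_mean was removed from pandas long ago; the modern spelling is the .rolling accessor. A sketch with a hypothetical two-year frame (the Jul–Nov values are made up to fill the grid):

import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df = pd.DataFrame(
    [[13, 15, 14, 5, 5, 5, 8, 9, 10, 11, 12, 12],
     [11, 10, 7, 4, 3, 6, 7, 8, 9, 10, 11, 11]],
    index=pd.Index([1758, 1759], name='Year'), columns=months)

# stack() turns the Year x Month grid into one long Series, so a
# 12-entry window spans exactly twelve consecutive months
print(df.stack().rolling(window=12).mean())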

Adding a column to the end of a column within the same DataFrame

I currently have a dataframe which I scraped from the internet using Beautiful Soup. However, it is set up as a grid rather than a continuous list: months for rows and years for columns.
I am trying to make it one continuous column, as this data will be plotted against other data, e.g. births vs deaths.
An example of the df I currently have is below:
         2010      2011      2013      2014
Jan  1.474071 -0.064034  0.781836 -1.282782
Feb -1.071357  0.441153  0.583787  2.353925
Mar  0.221471 -0.744471  1.729689  0.758527
Apr -0.964980 -0.845696  1.846883 -1.340896
May -1.328865  1.682706  0.888782 -1.717693
Jun  0.228440  0.901805  0.520260  1.171216
Jul -1.197071 -1.066969 -0.858447 -0.303421
Aug  0.306996 -0.028665  1.574159  0.384316
Sep -0.014805 -0.284319 -1.461665  0.650776
Oct  1.588931  0.476720 -0.242861  0.473424
Nov -0.014805 -0.284319 -1.461665  0.650776
Dec  0.964980 -0.845696  1.846883 -1.340896
However, when I try append (with ignore_index) I get:
df[["2010"]].append(df[["2011"]], ignore_index=True)

        2010      2011
0   1.474071       NaN
1  -1.071357       NaN
2   0.221471       NaN
3  -0.964980       NaN
4  -1.328865       NaN
5   0.228440       NaN
6  -1.197071       NaN
7   0.306996       NaN
8  -0.014805       NaN
9   1.588931       NaN
10 -0.014805       NaN
11  0.964980       NaN
12       NaN -0.064034
13       NaN  0.441153
14       NaN -0.744471
15       NaN -0.845696
16       NaN  1.682706
However, I am trying to get the whole dataset into one continuous column, e.g.:
0   1.474071
1  -1.071357
2   0.221471
3  -0.964980
4  -1.328865
5   0.228440
6  -1.197071
7   0.306996
8  -0.014805
9   1.588931
10 -0.014805
11  0.964980
12 -0.064034
13  0.441153
14 -0.744471
15 -0.845696
16  1.682706
How do I get all four columns into one single column?
Another way to do this is to unstack the DataFrame. Then reset the index to the default integer index with reset_index(drop=True):
df.unstack().reset_index(drop=True)
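A minimal sketch of that approach on a hypothetical two-by-two slice of the grid:

import pandas as pd

# Hypothetical miniature of the month x year grid
df = pd.DataFrame({'2010': [1.474071, -1.071357],
                   '2011': [-0.064034, 0.441153]},
                  index=['Jan', 'Feb'])

# unstack yields one Series ordered column by column (2010 Jan,
# 2010 Feb, 2011 Jan, ...); reset_index(drop=True) discards the
# (year, month) labels in favour of a plain integer index
flat = df.unstack().reset_index(drop=True)
print(flat)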
You can also create a list of the columns, calling squeeze to anonymise the data so it doesn't try to align on column names, and then call concat on this list. Passing ignore_index=True creates a new index; otherwise you'd get the month names repeated as index values:
In [228]:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Out[228]:
0 1.474071
1 -1.071357
2 0.221471
3 -0.964980
4 -1.328865
5 0.228440
6 -1.197071
7 0.306996
8 -0.014805
9 1.588931
10 -0.014805
11 0.964980
12 -0.064034
13 0.441153
14 -0.744471
15 -0.845696
16 1.682706
17 0.901805
18 -1.066969
19 -0.028665
20 -0.284319
21 0.476720
22 -0.284319
23 -0.845696
24 0.781836
25 0.583787
26 1.729689
27 1.846883
28 0.888782
29 0.520260
30 -0.858447
31 1.574159
32 -1.461665
33 -0.242861
34 -1.461665
35 1.846883
36 -1.282782
37 2.353925
38 0.758527
39 -1.340896
40 -1.717693
41 1.171216
42 -0.303421
43 0.384316
44 0.650776
45 0.473424
46 0.650776
47 -1.340896
dtype: float64
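For completeness, a hedged alternative using melt, which also flattens the frame column by column (the month index is simply dropped):

import pandas as pd

# Hypothetical miniature of the grid, as above
df = pd.DataFrame({'2010': [1.474071, -1.071357],
                   '2011': [-0.064034, 0.441153]},
                  index=['Jan', 'Feb'])

# melt ignores the index and stacks every column into a single
# 'value' column, in column order
flat = df.melt()['value']
print(flat)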
