Pandas / SQLite DataFrame plot - python

I'm trying to plot data from SQLite but I can't get it to work :-/
p2 = sql.read_sql('select DT_COMPUTE_FORCAST,VALUE_DEMANDE,VALUE_FORCAST from PCE', cnx)
# DataFrame p2 shows the data
  DT_COMPUTE_FORCAST  VALUE_DEMANDE  VALUE_FORCAST
0   27/06/2014 06:00          5.128          5.324
1   27/06/2014 07:00          5.779          5.334
2   27/06/2014 08:00          5.539          5.354
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'],'Forcast' :p2['VALUE_FORCAST']},index=p2['DT_COMPUTE_FORCAST'])
df.plot(title='Title Here')
My chart shows, but with no values. Could you give me a hint?!
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20109 entries, 27/06/2014 06:00 to 11/05/2015 05:00
Data columns (total 2 columns):
Demande 0 non-null float64
Forcast 0 non-null float64
dtypes: float64(2)
memory usage: 392.8+ KB
Is the following statement correct, or am I missing something?
df = pd.DataFrame({'Demande' : p2['VALUE_DEMANDE'],'Forcast' : p2['VALUE_FORCAST']},index=p2['DT_COMPUTE_FORCAST'])

I think what happens here is that because you pass the data from p2 while using one of its columns as the index, the index values no longer align, so you end up with all-NaN columns (hence the 0 non-null entries in df.info()). You can get around this by assigning the index after the df creation:
df = pd.DataFrame({'Demande': p2['VALUE_DEMANDE'], 'Forcast': p2['VALUE_FORCAST']})
and then
df.index = p2['DT_COMPUTE_FORCAST']
Example:
In [160]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df
Out[160]:
   a  b
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
In [161]:
df1 = pd.DataFrame({'a_copy':df['a']}, index=df['b'])
df1
Out[161]:
   a_copy
b
a     NaN
b     NaN
c     NaN
d     NaN
e     NaN
Another way to get around this is to access the .values attribute, so that the data is anonymous and there is no index to align:
In [162]:
df1 = pd.DataFrame({'a_copy':df['a'].values}, index=df['b'])
df1
Out[162]:
   a_copy
b
a       0
b       1
c       2
d       3
e       4
So the following should work:
df = pd.DataFrame({'Demande': p2['VALUE_DEMANDE'].values, 'Forcast': p2['VALUE_FORCAST'].values}, index=p2['DT_COMPUTE_FORCAST'])
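One further hedged suggestion, beyond what the answer covers: DT_COMPUTE_FORCAST comes out of SQLite as strings, so the x-axis will treat the timestamps as opaque labels. Parsing them with pd.to_datetime (assuming the day-first format shown in the sample rows) gives the plot a proper time axis:

import pandas as pd

# Sketch, assuming the dates are strings in day-first 'dd/mm/YYYY HH:MM' form
df = pd.DataFrame({'Demande': p2['VALUE_DEMANDE'].values,
                   'Forcast': p2['VALUE_FORCAST'].values},
                  index=pd.to_datetime(p2['DT_COMPUTE_FORCAST'],
                                       format='%d/%m/%Y %H:%M'))
df.plot(title='Title Here')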

Related

How to join two pandas dataframes based on a date in df1 being >= date in df2

I have a large data frame with key IDs, states, start dates and other characteristics. I have another data frame with states, a start date and a "1" to signify a flag.
I want to join the two, based on the state and the date in df1 being greater than or equal to the date in df2.
Take the example below. df1 is the table of states, start dates, and a 1 for the flag. df2 is a dataframe that needs those flags wherever the corresponding date in df1 is >= the start date in df2. The end result is df3: only the observations whose state matches and whose df1 date is >= the df2 start date get the flag.
import pandas as pd

dict1 = {'date': ['2020-01-01', '2020-02-15', '2020-02-04', '2020-03-17',
                  '2020-06-15'],
         'state': ['AL', 'FL', 'MD', 'NC', 'SC'],
         'flag': [1, 1, 1, 1, 1]}
df1 = pd.DataFrame(dict1)
df1['date'] = pd.to_datetime(df1['date'])

dict2 = {'state': ['AL', 'FL', 'MD', 'NC', 'SC'],
         'keyid': ['001', '002', '003', '004', '005'],
         'start_date': ['2020-02-01', '2020-01-15', '2020-01-30', '2020-05-18',
                        '2020-05-16']}
df2 = pd.DataFrame(dict2)
df2['start_date'] = pd.to_datetime(df2['start_date'])

df3 = df2.copy()  # copy, so building the desired output doesn't mutate df2
df3['flag'] = [0, 1, 1, 0, 1]
How do I get to df3 programmatically? My actual df1 has one row per state, and my actual df2 has over a million observations with different dates.
Use merge_asof to merge on greater-or-equal datetimes via the parameter direction='forward':
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
df2['need'] = [0, 1, 1, 0, 1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2,
                   df1,
                   left_on='start_date',
                   right_on='date',
                   by='state',
                   direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
  state keyid start_date  need       date  flag
0    FL   002 2020-01-15     1 2020-02-15     1
1    MD   003 2020-01-30     1 2020-02-04     1
2    AL   001 2020-02-01     0        NaT     0
3    SC   005 2020-05-16     1 2020-06-15     1
4    NC   004 2020-05-18     0        NaT     0
You can also rename the column, to avoid an extra date column appearing in the output DataFrame:
df2['need'] = [0, 1, 1, 0, 1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2,
                   df1.rename(columns={'date':'start_date'}),
                   on='start_date',
                   by='state',
                   direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
  state keyid start_date  need  flag
0    FL   002 2020-01-15     1     1
1    MD   003 2020-01-30     1     1
2    AL   001 2020-02-01     0     0
3    SC   005 2020-05-16     1     1
4    NC   004 2020-05-18     0     0
Use df.merge and numpy.where:
In [29]: import numpy as np

In [30]: df3 = df2.merge(df1)[['state', 'keyid', 'start_date', 'date']]

In [31]: df3['flag'] = np.where(df3['start_date'].ge(df3['date']), 0, 1)

In [33]: df3.drop(columns='date', inplace=True)

In [34]: df3
Out[34]:
  state keyid start_date  flag
0    AL   001 2020-02-01     0
1    FL   002 2020-01-15     1
2    MD   003 2020-01-30     1
3    NC   004 2020-05-18     0
4    SC   005 2020-05-16     1
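A hedged caveat on this merge approach: the bare df2.merge(df1) joins on all shared columns (here just state) and only stays one-row-per-key because each state appears exactly once in df1. If that assumption matters, pd.merge's validate argument can enforce it; a sketch, not part of the original answer:

# Fail loudly if df1 ever gains duplicate states (many-to-one check)
df3 = df2.merge(df1, on='state', validate='m:1')[['state', 'keyid', 'start_date', 'date']]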

Groupby sum, index vs. column results

For the following dataframe:
df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data'])
print(df)
  group  data
0     a     5
1     a    10
2     b   100
3     b    30
When grouping by the column and assigning the result to a new column, we get:
df['new'] = df.groupby('group')['data'].sum()
print(df)
  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN
However if we reset the df to the original data and move the group column to the index,
df.set_index('group', inplace=True)
print(df)
       data
group
a         5
a        10
b       100
b        30
and then group and sum, we get:
df['new'] = df.groupby('group')['data'].sum()
print(df)
       data  new
group
a         5   15
a        10   15
b       100  130
b        30  130
Why does grouping by the column not set the values in the new column, while grouping on the index does?
Better here is to use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment aligns correctly:
df['new'] = df.groupby('group')['data'].transform('sum')
The reason: when you assign a new Series, its values are aligned by index. If the indexes differ, you get NaNs:
print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64
The index values are different, so you get NaNs:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
RangeIndex(start=0, stop=4, step=1)
df.set_index('group', inplace=True)
print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64
Now the indexes can align, because the values match:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
You're not getting what you want because df.groupby('group')['data'].sum() returns an aggregated result with group as the index:
group
a     15
b    130
Name: data, dtype: int64
where the indexes clearly do not align with the original RangeIndex.
If you want this to work you'll have to use transform, which returns a Series of the transformed values with the same axis length as the caller:
df['new'] = df.groupby('group')['data'].transform('sum')
  group  data  new
0     a     5   15
1     a    10   15
2     b   100  130
3     b    30  130

How to select and order columns in a dataframe using an array in Python

I have a fairly large dataframe, df2 (~50,000 rows x 2,000 columns). The column headings are sample names. Separately, I have a dataframe, df1, with the list of samples I want to include in my analysis as its index. I want to use the list of samples from the df1 index to select only the columns of df2 for those samples, discarding the rest. I also want to preserve the sample order from the df1 index.
Example data:
# df1
data1 = {'Sample': ['Sample_A', 'Sample_D', 'Sample_E'],
         'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
         'Year': [2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
df1.set_index('Sample')

# df2
data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2)
df2.set_index('Num')
First I generate the list of samples I want from the index of df1, e.g.
samples = df1['Sample'].tolist()
'samples' is then,
['Sample_A', 'Sample_D', 'Sample_E']
And using 'samples', my desired output dataframe, df3, should look like:
index    Sample_A  Sample_D
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0
But if I use
df3 = df2[samples]
Then I get the error message:
"['Sample_E'] not in index"
So how do I ignore samples that are not found in df2 to avoid this error message?
UPDATE
The solution that worked -
# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# 2. Only include samples that are found in df2 as well
final_samples = list(set(df2.columns) & set(samples))
# 3. Make a new df with the columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
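One hedged caveat on that update: a set intersection does not preserve the sample order from df1, which the question asked for. A plain list comprehension keeps the order; this is a sketch on top of the accepted fix, not part of the original update:

# Keep only the samples present in df2, preserving their order in df1
final_samples = [s for s in samples if s in df2.columns]
df3 = df2.loc[:, final_samples]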
Try it like this:
df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
It is possible to select all of the rows and only some columns by using a single colon for the rows:
>>> df.loc[:, ['Sample_A','Sample_D']]
Applied to the dataset you provided:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
...          'Sample_A': [0,1,0,0,1],
...          'Sample_B': [0,0,1,0,0],
...          'Sample_C': [1,0,0,0,1],
...          'Sample_D': [0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
         Sample_A  Sample_D
Num
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0
=====================================
>>> df3 = df2.loc[:, samples]
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
Or (note that newer pandas versions raise a KeyError when .loc is given missing labels, so reindex is the robust option):
>>> df3 = df2.reindex(columns=samples)
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
You can do it this way. The columns list is in the order you actually want.
import pandas as pd

data = {'index': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
        'Sample_A': [0,1,0,0,1],
        'Sample_B': [0,0,1,0,0],
        'Sample_C': [1,0,0,0,1],
        'Sample_D': [0,0,1,1,0]}
df = pd.DataFrame(data)
df.set_index('index')
df1 = df[['index'] + ['Sample_A','Sample_D']]
output:
     index  Sample_A  Sample_D
0  Value_1         0         0
1  Value_2         1         0
2  Value_3         0         1
3  Value_4         0         1
4  Value_5         1         0
But to ignore the missing columns, keep only the columns that actually belong to the df you're analysing:
samples = ['index', 'Sample_A', 'Sample_D', 'Extra_Sample']
final_samples = list(set(df1.columns) & set(samples))
Now you can pass final_samples, which contains only the columns that exist in the df:
df3 = df2[final_samples]

How to combine two pandas dataframes value by value

I have 2 dataframes: players (only has PlayerID) and dates (only has a date). I want a new dataframe that contains each date for each player. In my case, the players df has about 2600 rows and the dates df has 1100 rows. I used 2 for loops to do this, but it is really slow. Is there a way to do it faster with some function? Thanks!
my loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    idx = row.Index
    pl = players.at[idx, 'PlayerID']
    for i in dates.itertuples():
        idd = i.Index
        dt = dates.at[idd, 0]
        new = {'PlayerID': [pl], 'Date': [dt]}
        new = pd.DataFrame(new)
        player_elo = player_elo.append(new)
If you add a key that is the same for every row of each df, you can produce the cartesian product you are looking for using pd.merge():
import pandas as pd

players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'], ['12/13/2012'], ['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])

players['key'] = 1
dates['key'] = 1

print(pd.merge(players, dates, on='key')[['PlayerID', 'Date']])
Output
  PlayerID       Date
0        A 2012-12-12
1        A 2012-12-13
2        A 2012-12-14
3        B 2012-12-12
4        B 2012-12-13
5        B 2012-12-14
6        C 2012-12-12
7        C 2012-12-13
8        C 2012-12-14
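As a hedged side note beyond the original answer: pandas 1.2 added a cross join directly, which avoids the dummy key entirely. A minimal sketch under that version assumption:

import pandas as pd

# Cartesian product without a dummy key (requires pandas >= 1.2)
players = pd.DataFrame({'PlayerID': ['A', 'B', 'C']})
dates = pd.DataFrame({'Date': pd.to_datetime(['12/12/2012', '12/13/2012', '12/14/2012'])})
player_elo = players.merge(dates, how='cross')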

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where the Series's index matches the DataFrame's columns using pd.concat, but it gives me a surprise:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
     a    b    0
a  NaN  NaN  1.0
b  NaN  NaN  2.0
What I expected is of course:
df.append(sr)
Out[14]:
   a  b
1  1  2
It really surprises me that pd.concat is not index-column aware. So is it true that if I want to concat a Series as a new row to a DataFrame, I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
   a  b
1  1  2
Detail:
print (sr.to_frame(1).T)
   a  b
1  1  2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
   a  b
1  1  2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})
df.loc[1] = sr
df will be like:
   a  b
1  1  2
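If you instead want to keep the extra entry as a new column, the to_frame().T route above does, since concat aligns on columns and adds the ones it hasn't seen. This matters more now that DataFrame.append has been deprecated and removed in pandas 2.0, leaving concat as the supported route. A quick sketch:

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3}, name=1)

# concat aligns on columns, so 'c' is kept as a brand-new column
out = pd.concat([df, sr.to_frame().T])
print(out)
#    a  b  c
# 1  1  2  3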
