copy dataframe to postgres table with column that has default value - python

I have the following PostgreSQL table stock, whose structure is shown below; the column insert_time has a default value of now().
| column      | pk  | type      |
+-------------+-----+-----------+
| id          | yes | int       |
| type        | yes | enum      |
| c_date      |     | date      |
| qty         |     | int       |
| insert_time |     | timestamp |
I was trying to copy the following df:
| id  | type | date       | qty  |
+-----+------+------------+------+
| 001 | CB04 | 2015-01-01 | 700  |
| 155 | AB01 | 2015-01-01 | 500  |
| 300 | AB01 | 2015-01-01 | 1500 |
I was using psycopg2 to upload the df to the table stock:
cur.copy_from(df, stock, null='', sep=',')
conn.commit()
I am getting this error:
DataError: missing data for column "insert_time"
CONTEXT: COPY stock, line 1: "001,CB04,2015-01-01,700"
I was expecting that, with the psycopg2 copy_from function, my PostgreSQL table would auto-populate the insert time for each row:
| id  | type | date       | qty  | insert_time         |
+-----+------+------------+------+---------------------+
| 001 | CB04 | 2015-01-01 | 700  | 2018-07-25 12:00:00 |
| 155 | AB01 | 2015-01-01 | 500  | 2018-07-25 12:00:00 |
| 300 | AB01 | 2015-01-01 | 1500 | 2018-07-25 12:00:00 |

Without a column list, COPY expects a value for every column of the table, including insert_time. You can specify the columns like this, so the omitted insert_time falls back to its default:
cur.copy_from(df, stock, null='', sep=',', columns=('id', 'type', 'c_date', 'qty'))
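Note that copy_from reads from a file-like object rather than a DataFrame, so a minimal sketch of the whole round trip might look like the following (the connection string and the in-memory CSV buffer are assumptions, not part of the original question):

import io

import pandas as pd
import psycopg2

# DataFrame from the question.
df = pd.DataFrame({'id': ['001', '155', '300'],
                   'type': ['CB04', 'AB01', 'AB01'],
                   'c_date': ['2015-01-01', '2015-01-01', '2015-01-01'],
                   'qty': [700, 500, 1500]})

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()

# copy_from expects a file-like object, so serialize the DataFrame
# to an in-memory CSV buffer first.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# List only the columns present in the DataFrame; insert_time is left out
# so PostgreSQL fills it with its DEFAULT now().
cur.copy_from(buf, 'stock', sep=',', null='',
              columns=('id', 'type', 'c_date', 'qty'))
conn.commit()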

Related

Extract UTC format for datetime object in a new Python column

Consider the following pandas DataFrame:
| ID | date |
|--------------|---------------------------------------|
| 0 | 2022-03-02 18:00:20+01:00 |
| 0 | 2022-03-12 17:08:30+01:00 |
| 1 | 2022-04-23 12:11:50+01:00 |
| 1 | 2022-04-04 10:15:11+01:00 |
| 2 | 2022-04-07 08:24:19+02:00 |
| 3 | 2022-04-11 02:33:22+02:00 |
I want to separate the date column into two columns, one for the date in the format "yyyy-mm-dd" and one for the time in the format "hh:mm:ss+tmz".
That is, I want to get the following resulting DataFrame:
| ID | date_only | time_only |
|--------------|-------------------------|----------------|
| 0 | 2022-03-02 | 18:00:20+01:00 |
| 0 | 2022-03-12 | 17:08:30+01:00 |
| 1 | 2022-04-23 | 12:11:50+01:00 |
| 1 | 2022-04-04 | 10:15:11+01:00 |
| 2 | 2022-04-07 | 08:24:19+02:00 |
| 3 | 2022-04-11 | 02:33:22+02:00 |
Right now I am using the following code, but it does not return the time with the UTC offset +hh:mm.
df['date_only'] = df['date'].apply(lambda a: a.date())
df['time_only'] = df['date'].apply(lambda a: a.time())
| ID | date_only |time_only |
|--------------|-------------------------|----------|
| 0 | 2022-03-02 | 18:00:20 |
| 0 | 2022-03-12 | 17:08:30 |
| ... | ... | ... |
| 3 | 2022-04-11 | 02:33:22 |
I hope you can help me, thank you in advance.
Convert the column to datetimes, then extract the dates with Series.dt.date and the times with their timezones with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
Or convert the values to strings, split them on the space, and select the second part:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].astype(str).str.split().str[1]
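A small self-contained sketch of the second approach, with sample rows reconstructed from the question (note that strftime's %z usually renders the offset without a colon, e.g. +0100, so the string-split variant matches the requested hh:mm:ss+hh:mm form more closely):

import pandas as pd

# Sample rows reconstructed from the question.
df = pd.DataFrame({
    'ID':   [0, 0, 1],
    'date': ['2022-03-02 18:00:20+01:00',
             '2022-03-12 17:08:30+01:00',
             '2022-04-23 12:11:50+01:00'],
})

df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
# The string form keeps the +hh:mm offset, e.g. '18:00:20+01:00'.
df['time_only'] = df['date'].astype(str).str.split().str[1]
print(df[['ID', 'date_only', 'time_only']])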

Accessing nested columns

I have the following DataFrame containing stock data from several finance institutions.
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Bank Ticker |                    BAC                     |                     C                      |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Stock Info  | Open  | High  | Low   | Close | Volume     | Open  | High  | Low   | Close | Volume     |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| Date        |       |       |       |       |            |       |       |       |       |            |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
| 2015-12-31  | 52.07 | 52.39 | 51.75 | 51.75 | 11274831.0 | 17.01 | 17.07 | 16.83 | 16.83 | 47106760.0 |
| 2015-12-30  | 52.84 | 52.94 | 52.25 | 52.30 | 8763137.0  | 17.20 | 17.24 | 17.04 | 17.05 | 35035518.0 |
+-------------+-------+-------+-------+-------+------------+-------+-------+-------+-------+------------+
The values are nested columns within each ticker (not sure if such a structure exists in Python or if it's some sort of MultiIndex), with Bank Ticker and Stock Info being the column level names.
I need to access a certain column, like 'Close', from all institutions. I managed to achieve the result with the following for loop, which obtains the maximum close values:
for x in tickers:
    print(x, bank_stocks[x, 'Close'].max())
BAC 54.9
C 60.34
GS 247.92
JPM 70.08
MS 89.3
WFM 73.0
Since I'm not an advanced user, I was wondering if there's a better way to get the result, using pandas itself.
Try:
df.loc[:, (slice(None), 'Close')].droplevel(level=1, axis=1).max()
or
df.loc(axis=1)[:, 'Close'].droplevel(level=1, axis=1).max()
or
df.loc[:, pd.IndexSlice[:, 'Close']].droplevel(level=1, axis=1).max()
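An alternative is a cross-section with DataFrame.xs. A sketch assuming the column levels are named 'Bank Ticker' and 'Stock Info' as in the question (the numbers below are made up):

import numpy as np
import pandas as pd

# Toy MultiIndex frame shaped like the question's data (values are made up).
cols = pd.MultiIndex.from_product(
    [['BAC', 'C'], ['Open', 'High', 'Low', 'Close', 'Volume']],
    names=['Bank Ticker', 'Stock Info'])
bank_stocks = pd.DataFrame(np.random.rand(2, 10),
                           index=pd.to_datetime(['2015-12-30', '2015-12-31']),
                           columns=cols)

# Cross-section of every 'Close' column, then the per-ticker maximum.
print(bank_stocks.xs('Close', axis=1, level='Stock Info').max())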

How to group a Pandas DataFrame by url without the query string?

I have a Pandas DataFrame that is structured like this:
+-------+------------+------------------------------------+----------+
| index | Date | path | Count |
+-------+------------+------------------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 2893 |
| 2 | 2020-06-10 | about/v1/?status=active?name=craig | 264 |
| 3 | 2020-06-09 | about/v1/?status=active?name=craig | 182 |
+-------+------------+------------------------------------+----------+
How do I group by the path (without the query string) and the date, so that the table looks like this?
+-------+------------+-------------------------+----------+
| index | Date | path | Count |
+-------+------------+-------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 3157 |
| 3 | 2020-06-09 | about/v1/?status=active | 182 |
+-------+------------+-------------------------+----------+
Replace the name=craig section, then group on the Date and path columns:
result = (df.assign(path=df.path.str.replace(r"\?name=.*", "", regex=True))
            .drop("index", axis=1)
            .groupby(["Date", "path"], sort=False)
            .sum()
         )
result
                                    Count
Date       path
2020-06-10 about/v1/                10865
           about/v1/?status=active   3157
2020-06-09 about/v1/?status=active    182
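For reference, a self-contained reproduction of the answer with the sample data from the question (regex=True is passed explicitly because newer pandas versions default str.replace to literal matching):

import pandas as pd

df = pd.DataFrame({
    'index': [0, 1, 2, 3],
    'Date':  ['2020-06-10', '2020-06-10', '2020-06-10', '2020-06-09'],
    'path':  ['about/v1/',
              'about/v1/?status=active',
              'about/v1/?status=active?name=craig',
              'about/v1/?status=active?name=craig'],
    'Count': [10865, 2893, 264, 182],
})

result = (df.assign(path=df.path.str.replace(r"\?name=.*", "", regex=True))
            .drop("index", axis=1)
            .groupby(["Date", "path"], sort=False)
            .sum())
print(result)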

How to apply an aggregate function to all columns of a pivot table in Pandas

A pivot table is counting the monthly occurrences of a phenomenon. Here's the simplified sample data followed by the pivot:
+--------+------------+------------+
| ad_id | entreprise | date |
+--------+------------+------------+
| 172788 | A | 2020-01-28 |
| 172931 | A | 2020-01-26 |
| 172793 | B | 2020-01-26 |
| 172768 | C | 2020-01-19 |
| 173219 | C | 2020-01-14 |
| 173213 | D | 2020-01-13 |
+--------+------------+------------+
My pivot_table code is the following:
my_pivot_table = pd.pivot_table(df[(df['date'] >= some_date) & (df['date'] <= some_other_date)],
                                values=['ad_id'], index=['entreprise'],
                                columns=['year', 'month'], aggfunc=['count'])
The resulting table looks like this:
+------------+---------+----------+-----+----------+
|            |                2018                 |
+------------+---------+----------+-----+----------+
| entreprise | january | february | ... | december |
| A          | 12      | 10       | ... | 8        |
| B          | 24      | 12       | ... | 3        |
| ...        | ...     | ...      | ... | ...      |
| D          | 31      | 18       | ... | 24       |
+------------+---------+----------+-----+----------+
Now, I would like to add a column that gives me the monthly average, and perform other operations such as comparing last month's count to the monthly average of, say, the last 12 months...
I tried to fiddle with the aggfunc parameter of the pivot_table, as well as trying to add an average column to the original dataframe, but without success.
Thanks in advance!
Because pivot_table returns a DataFrame with a MultiIndex in the columns, you can use:
df1 = df.mean(axis=1, level=0)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['mean']])
Or:
df2 = df.mean(axis=1, level=1)
df2.columns = pd.MultiIndex.from_product([['all_years'], df2.columns])
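As a hedged follow-up (not part of the original answer): on recent pandas versions the level argument of DataFrame.mean has been removed, so an equivalent is to group the transposed table by a named column level. A self-contained toy sketch, with dates tweaked from the question's sample so the pivot spans two months:

import pandas as pd

# Toy data shaped like the question's sample (two dates moved to February).
df = pd.DataFrame({
    'ad_id':      [172788, 172931, 172793, 172768, 173219, 173213],
    'entreprise': ['A', 'A', 'B', 'C', 'C', 'D'],
    'date':       pd.to_datetime(['2020-01-28', '2020-02-26', '2020-01-26',
                                  '2020-01-19', '2020-02-14', '2020-01-13']),
})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

my_pivot_table = pd.pivot_table(df, values=['ad_id'], index=['entreprise'],
                                columns=['year', 'month'], aggfunc=['count'])

# DataFrame.mean(axis=1, level=...) no longer exists, so transpose, group by
# the 'year' column level, take the mean, and transpose back. This gives the
# monthly average per entreprise for each year.
df1 = my_pivot_table.T.groupby(level='year').mean().T
df1.columns = pd.MultiIndex.from_product([df1.columns, ['mean']])
print(df1)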

Resampling Pandas time series data while keeping only the valid numbers for each row

I have a dataframe which has a list of web pages with hourly traffic summed by unix hour.
Pivoted, it looks like this:
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| unix hour | 394533 | 394534 | 394535 | 394536 | 394537 | 394538 | 394539 | 394540 | 394541 | 394542 | 394543 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| page | | | | | | | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | | | | | | | |
| 3563667 | | | | 3481 | 2840 | 2421 | | | | | |
| 3579922 | | | | | | | 1816 | 1947 | 1878 | 2013 | 1718 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Instead of having the time be actually over time, I would like to centralize so that it looks like this:
+---------+------+------+------+------+------+
| hour | 1 | 2 | 3 | 4 | 5 |
+---------+------+------+------+------+------+
| page | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | |
| 3563667 | 3481 | 2840 | 2421 | | |
| 3579922 | 1816 | 1947 | 1878 | 2013 | 1718 |
+---------+------+------+------+------+------+
What would be the best way to do this in pandas?
*Note - I realize having the hours as columns isn't ideal, but my full data set has 7k pages over a span of only 72 hours, so to me, pages as the index and hours as the columns makes the most sense.
Assuming the data is stored as float:
In [191]: print(df.dtypes)
394533 float64
394534 float64
394535 float64
394536 float64
394537 float64
394538 float64
394539 float64
394540 float64
394541 float64
394542 float64
394543 float64
dtype: object
We will just do:
In [192]: print(df.apply(lambda x: pd.Series(data=x[np.isfinite(x)].values), axis=1))
             0     1     2     3     4
page
3530765   5791  6017  5302   NaN   NaN
3563667   3481  2840  2421   NaN   NaN
3579922   1816  1947  1878  2013  1718
The idea is to take the valid numbers from each row and put them into a Series without the original unix-hour index; the resulting columns therefore become 0, 1, 2, .... If you must, you can easily turn them into 1, 2, 3, ... with df2.columns = df2.columns + 1, assuming the result is assigned to df2.
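For completeness, a self-contained sketch of the same idea on toy data shaped like the question (dropna is an equivalent way to keep only each row's valid numbers):

import numpy as np
import pandas as pd

# Toy frame: pages as the index, unix hours as columns, gaps as NaN.
df = pd.DataFrame(
    [[5791,   6017,   5302,   np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 3481,   2840,   2421],
     [np.nan, np.nan, np.nan, np.nan, 1816,   1947]],
    index=pd.Index([3530765, 3563667, 3579922], name='page'),
    columns=[394533, 394534, 394535, 394536, 394537, 394538])

# Keep each row's non-NaN values and re-index them 0, 1, 2, ...
df2 = df.apply(lambda row: pd.Series(row.dropna().values), axis=1)
df2.columns = df2.columns + 1   # start the hour count at 1 instead of 0
print(df2)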
