How to group a Pandas DataFrame by url without the query string? - python

I have a Pandas DataFrame that is structured like this:
+-------+------------+------------------------------------+----------+
| index | Date | path | Count |
+-------+------------+------------------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 2893 |
| 2 | 2020-06-10 | about/v1/?status=active?name=craig | 264 |
| 3 | 2020-06-09 | about/v1/?status=active?name=craig | 182 |
+-------+------------+------------------------------------+----------+
How do I group by the path, and the date without the query string so that the table looks like this?
+-------+------------+-------------------------+----------+
| index | Date | path | Count |
+-------+------------+-------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 3157 |
| 3 | 2020-06-09 | about/v1/?status=active | 182 |
+-------+------------+-------------------------+----------+

Replace the name=craig section, then group by the Date and path columns:
result = (df.assign(path=df.path.str.replace(r"\?name=.*", "", regex=True))
            .drop("index", axis=1)
            .groupby(["Date", "path"], sort=False)
            .sum()
         )
result
Count
Date path
2020-06-10 about/v1/ 10865
about/v1/?status=active 3157
2020-06-09 about/v1/?status=active 182
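If you actually want to drop the entire query string (as the title suggests), one possible sketch is to split the path on the first "?" before grouping. Note this is an assumption about intent, since the expected output above keeps ?status=active:
# Sketch: strip everything after the first "?" (assumes the whole query
# string should be removed, unlike the expected output shown above)
result_no_query = (df.assign(path=df.path.str.split("?").str[0])
                     .drop("index", axis=1)
                     .groupby(["Date", "path"], sort=False)
                     .sum())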

Related

Extract utc format for datetime object in a new Python column

Given the following pandas DataFrame:
| ID | date |
|--------------|---------------------------------------|
| 0 | 2022-03-02 18:00:20+01:00 |
| 0 | 2022-03-12 17:08:30+01:00 |
| 1 | 2022-04-23 12:11:50+01:00 |
| 1 | 2022-04-04 10:15:11+01:00 |
| 2 | 2022-04-07 08:24:19+02:00 |
| 3 | 2022-04-11 02:33:22+02:00 |
I want to separate the date column into two columns, one for the date in the format "yyyy-mm-dd" and one for the time in the format "hh:mm:ss+tmz".
That is, I want to get the following resulting DataFrame:
| ID | date_only | time_only |
|--------------|-------------------------|----------------|
| 0 | 2022-03-02 | 18:00:20+01:00 |
| 0 | 2022-03-12 | 17:08:30+01:00 |
| 1 | 2022-04-23 | 12:11:50+01:00 |
| 1 | 2022-04-04 | 10:15:11+01:00 |
| 2 | 2022-04-07 | 08:24:19+02:00 |
| 3 | 2022-04-11 | 02:33:22+02:00 |
Right now I am using the following code, but it does not return the time with the UTC offset +hh:mm.
df['date_only'] = df['date'].apply(lambda a: a.date())
df['time_only'] = df['date'].apply(lambda a: a.time())
| ID | date_only |time_only |
|--------------|-------------------------|----------|
| 0 | 2022-03-02 | 18:00:20 |
| 0 | 2022-03-12 | 17:08:30 |
| ... | ... | ... |
| 3 | 2022-04-11 | 02:33:22 |
I hope you can help me, thank you in advance.
Convert the column to datetimes, then extract the dates with Series.dt.date and the times with their timezone offsets via Series.dt.strftime:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
Or convert the values to strings, split on the space, and select the second part:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].astype(str).str.split().str[1]
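If the date column is stored as plain strings rather than datetimes (an assumption, not stated in the question), a minimal sketch can skip parsing entirely and split on the space:
# Sketch assuming df['date'] holds strings like '2022-03-02 18:00:20+01:00'
df[['date_only', 'time_only']] = df['date'].astype(str).str.split(' ', n=1, expand=True)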

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and fill it with values stored in a different DataFrame, by first matching the values of columns that both DataFrames share. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 with the values of the dates stored in the dates column in df1 after matching the teams and the weeks columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
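One way to sketch this (my suggestion, since the keys team and week appear in both frames) is a left merge that carries the dates column from df1 into df2:
# Sketch: bring dates from df1 into df2 by matching on team and week.
# drop_duplicates guards against df1 having several rows per (team, week);
# this assumes the date is the same for all of a team's rows in a given week.
lookup = df1[['team', 'week', 'dates']].drop_duplicates(subset=['team', 'week'])
df2 = df2.merge(lookup, on=['team', 'week'], how='left')
This avoids the nested loops entirely and scales to large frames, since merge is vectorized.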

copy dataframe to postgres table with column that has default value

I have the following PostgreSQL table stock; the structure is shown below, and the column insert_time has a default value of now():
| column | pk | type |
+-------------+-----+-----------+
| id | yes | int |
| type | yes | enum |
| c_date | | date |
| qty | | int |
| insert_time | | timestamp |
I was trying to copy the following df:
| id | type | date | qty |
+-----+------+------------+------+
| 001 | CB04 | 2015-01-01 | 700 |
| 155 | AB01 | 2015-01-01 | 500 |
| 300 | AB01 | 2015-01-01 | 1500 |
I was using psycopg to upload the df to the table stock
cur.copy_from(df, stock, null='', sep=',')
conn.commit()
I am getting this error:
DataError: missing data for column "insert_time"
CONTEXT: COPY stock, line 1: "001,CB04,2015-01-01,700"
I was expecting that, with the psycopg copy_from function, my PostgreSQL table would auto-populate the rows along with the insert time.
| id | type | date | qty | insert_time |
+-----+------+------------+------+---------------------+
| 001 | CB04 | 2015-01-01 | 700 | 2018-07-25 12:00:00 |
| 155 | AB01 | 2015-01-01 | 500 | 2018-07-25 12:00:00 |
| 300 | AB01 | 2015-01-01 | 1500 | 2018-07-25 12:00:00 |
You can specify columns like this:
cur.copy_from(df, stock, null='', sep=',', columns=('id', 'type', 'c_date', 'qty'))
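copy_from expects a file-like object rather than a DataFrame, so a fuller sketch (assuming psycopg2 and a target table named stock) writes the frame to an in-memory CSV buffer first, then names only the columns being supplied so insert_time falls back to its default:
import io

# Sketch: serialize the DataFrame to CSV in memory, then COPY only the
# columns we actually provide so insert_time uses its now() default
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)
cur.copy_from(buf, 'stock', sep=',', null='',
              columns=('id', 'type', 'c_date', 'qty'))
conn.commit()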

Multiindex Roll-up Indicator

How do you roll up a multi-index by Date & ID and create indicators?
+--------+-----+------+-------------+
| Date | ID | Flag | Action Type |
+--------+-----+------+-------------+
| 201712 | 123 | - | Delete |
| 201712 | 456 | + | Add |
| 201712 | 123 | + | Add |
| 201801 | 123 | + | Change |
+--------+-----+------+-------------+
with an output of:
+--------+-----+------+--------------+
| Date | ID | Flag | Action Type |
+--------+-----+------+--------------+
| 201712 | 123 | * | Add & Delete |
| 201712 | 456 | + | Add |
| 201801 | 123 | + | Added Chg |
+--------+-----+------+--------------+
You can use groupby and join:
s=df.groupby(['Date','ID'],as_index=False).agg('&'.join)
s.Flag.str.len().gt(1)
Out[285]:
0 True
1 False
2 False
Name: Flag, dtype: bool
s.loc[s.Flag.str.len().gt(1),'Flag']='*'
s
Out[287]:
Date ID Flag Actiontype
0 201712 123 * Delete&Add
1 201712 456 + Add
2 201801 123 + Change
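A self-contained variant (my own sketch, not the answer above) uses a per-column aggregation so the action types are joined with ' & ' as in the desired output:
import pandas as pd

# Sketch: reproduce the roll-up with explicit per-column aggregations
df = pd.DataFrame({
    'Date': [201712, 201712, 201712, 201801],
    'ID': [123, 456, 123, 123],
    'Flag': ['-', '+', '+', '+'],
    'Action Type': ['Delete', 'Add', 'Add', 'Change'],
})

out = (df.groupby(['Date', 'ID'], as_index=False)
         .agg({'Flag': '&'.join, 'Action Type': ' & '.join}))
# Any group that collapsed more than one row gets the '*' flag
out.loc[out['Flag'].str.len() > 1, 'Flag'] = '*'
print(out)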

Resampling Pandas time series data with keeping only valid numbers for each row

I have a dataframe which has a list of web pages with summed hourly traffic by unix hour.
Pivoted, it looks like this:
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| unix hour | 394533 | 394534 | 394535 | 394536 | 394537 | 394538 | 394539 | 394540 | 394541 | 394542 | 394543 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| page | | | | | | | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | | | | | | | |
| 3563667 | | | | 3481 | 2840 | 2421 | | | | | |
| 3579922 | | | | | | | 1816 | 1947 | 1878 | 2013 | 1718 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Instead of the columns being actual unix hours, I would like to left-align each page's values so that it looks like this:
+---------+------+------+------+------+------+
| hour | 1 | 2 | 3 | 4 | 5 |
+---------+------+------+------+------+------+
| page | | | | | |
| 3530765 | 5791 | 6017 | 5302 | | |
| 3563667 | 3481 | 2840 | 2421 | | |
| 3579922 | 1816 | 1947 | 1878 | 2013 | 1718 |
+---------+------+------+------+------+------+
What would be the best way to do this in pandas?
*Note - I realize having the hours as columns isn't ideal, but my full data set has 7k pages over a span of only 72 hours, so to me, pages as the index and hours as the columns makes the most sense.
Assuming the data is stored as float:
In [191]:
print df.dtypes
394533 float64
394534 float64
394535 float64
394536 float64
394537 float64
394538 float64
394539 float64
394540 float64
394541 float64
394542 float64
394543 float64
dtype: object
We will just do:
In [192]:
print df.apply(lambda x: pd.Series(data=x[np.isfinite(x)].values), 1)
0 1 2 3 4
page
3530765 5791 6017 5302 NaN NaN
3563667 3481 2840 2421 NaN NaN
3579922 1816 1947 1878 2013 1718
The idea is to get the valid numbers of each row and put them into a Series, but without the original unix hour as the index. The resulting columns therefore become 0, 1, 2, ...; if you need 1, 2, 3, ... instead, you can shift them with df2.columns = df2.columns + 1, assuming the result is assigned to df2.
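An equivalent, slightly more compact sketch (my own variant of the same idea) drops the NaNs per row instead of testing with np.isfinite, then relabels the columns to start at 1:
# Sketch: left-align each row's non-null values and number the hours from 1
df2 = df.apply(lambda row: pd.Series(row.dropna().values), axis=1)
df2.columns = df2.columns + 1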
