String split to unique columns depending on the name of the series - python

I am getting started with pandas and I have one column like this:
0 | Layer 7 Data
-------------------------------------------
1 | HTTP Request Method: GET, HTTP URI: /ucp/
2 | HTTP Return Code: 200, HTTP User-Agent: Mozilla/5.0
3 | HTTP Return Code: 401, HTTP Request Method: POST
After I split the string and expand it into different columns with df = df["Layer 7 Data"].str.split(",", expand=True), I get columns like this:
0 | 0 | 1
------------------------------------------------------------
1 | HTTP Request Method: GET | HTTP URI: /ucp/
2 | HTTP Return Code: 200 | HTTP User-Agent: Mozilla/5.0
3 | HTTP Return Code: 401 | HTTP Request Method: POST
However, I want a separate column for each unique key, with Null in a cell when that row has no matching value:
0 | 0 | 1 | 2 |3
---------------------------------------------------------------------------------------
1 | HTTP Request Method: GET | HTTP URI: /ucp/ |Null | Null
2 | Null | Null | HTTP Return Code: 200 | HTTP User-Agent: Mozilla/5.0
3 | HTTP Request Method: POST | Null | HTTP Return Code: 401 | Null
Thank you very much!

Use a nested list comprehension (a dict comprehension inside a list comprehension):
L = [{y.split(': ', 1)[0]:y for y in x.split(", ")} for x in df["Layer 7 Data"]]
df = pd.DataFrame(L, index=df.index)
print (df)
HTTP Request Method HTTP URI HTTP Return Code \
0 HTTP Request Method: GET HTTP URI: /ucp/ NaN
1 NaN NaN HTTP Return Code: 200
2 HTTP Request Method: POST NaN HTTP Return Code: 401
HTTP User-Agent
0 NaN
1 HTTP User-Agent: Mozilla/5.0
2 NaN
Or:
L = [dict([y.split(': ', 1) for y in x.split(", ")]) for x in df["Layer 7 Data"]]
df = pd.DataFrame(L, index=df.index)
print (df)
HTTP Request Method HTTP URI HTTP Return Code HTTP User-Agent
0 GET /ucp/ NaN NaN
1 NaN NaN 200 Mozilla/5.0
2 POST NaN 401 NaN

You can use str.extractall to extract the key/values and pivot the DataFrame:
out = (df['Layer 7 Data']
         .str.extractall(r'\s*([^,:]+):\s*([^:,]+)')
         .droplevel(1)
         .pivot(columns=0, values=1)
         .rename_axis(columns=None))
Output:
HTTP Request Method HTTP Return Code HTTP URI HTTP User-Agent
1 GET NaN /ucp/ NaN
2 NaN 200 NaN Mozilla/5.0
3 POST 401 NaN NaN
Intermediate output of extractall:
df['Layer 7 Data'].str.extractall(r'\s*([^,:]+):\s*([^:,]+)')
0 1
match
1 0 HTTP Request Method GET
1 HTTP URI /ucp/
2 0 HTTP Return Code 200
1 HTTP User-Agent Mozilla/5.0
3 0 HTTP Return Code 401
1 HTTP Request Method POST
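For reference, here is a minimal, self-contained sketch of the dict-per-row idea; the sample DataFrame below is reconstructed from the question (the index values 1-3 are assumed):
import pandas as pd

# Sample frame reconstructed from the question (index values assumed).
df = pd.DataFrame({
    "Layer 7 Data": [
        "HTTP Request Method: GET, HTTP URI: /ucp/",
        "HTTP Return Code: 200, HTTP User-Agent: Mozilla/5.0",
        "HTTP Return Code: 401, HTTP Request Method: POST",
    ]
}, index=[1, 2, 3])

# One dict per row, keyed by the text before ": "; the DataFrame
# constructor aligns the keys into columns and fills the gaps with NaN.
records = [{pair.split(": ", 1)[0]: pair for pair in row.split(", ")}
           for row in df["Layer 7 Data"]]
out = pd.DataFrame(records, index=df.index)
print(out)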

Related

How do you identify which IDs have an increasing value over time in another column in a Python dataframe?

Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
ID column contains the ID of a particular person.
Value column contains the value of their transaction.
Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID with the increasing value of transactions over time?
I'm looking for some way to extract ID 1 as my desired ID with an increasing value of transactions, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 because its trend of transactions is declining over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by value or by date (this assumes the date column has been parsed to datetime, so that sorting by date is chronological):
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
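A more compact variant of the same idea (a sketch, assuming the date column parses with the day-month-year format shown in the question): sort each id's rows by date and test whether value is strictly increasing, requiring more than one transaction so that id=2 is excluded.
import pandas as pd

# Parse the dates so the chronological sort is correct (format assumed
# from the sample data, e.g. "1-Feb-19").
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")

# For each id: the values in date order must be non-decreasing and unique
# (i.e. strictly increasing), and there must be at least two rows.
increasing = (df.sort_values("date")
                .groupby("id")["value"]
                .apply(lambda s: len(s) > 1
                                 and s.is_monotonic_increasing
                                 and s.is_unique))
print(increasing[increasing].index.tolist())   # expected: [1]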
Another approach: label each per-id change with np.where on the diff of value, then keep only the last label per id:
import numpy as np

df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
                       np.where(x.diff() < 0, 'decrease', '--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming this won't happen
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
                       np.where(x.diff() < 0, 'decrease', '--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether 'decrease' appears in a row, or whether only '--' exists, and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase

Cleaning up URL column in pandas dataframe

I have a csv (or a dataframe) with content as follows:
date | URLs | Count
-----------------------------------------------------------------------
17-mar-2014 | www.example.com/abcdef&=randstring | 20
10-mar-2016 | www.example.com/xyzabc | 12
14-apr-2015 | www.example.com/abcdef | 11
12-mar-2016 | www.example.com/abcdef/randstring | 30
15-mar-2016 | www.example.com/abcdef | 10
17-feb-2016 | www.example.com/xyzabc&=randstring | 15
17-mar-2016 | www.example.com/abcdef&=someotherrandstring | 12
I want to clean up the column 'URLs' where I want to convert www.example.com/abcdef&=randstring or www.example.com/abcdef/randstring to just www.example.com/abcdef, and so on, for all the rows.
I tried to play around with the urlparse library and parse the URLs to combine just urlparse(url).netloc with urlparse(url).path/query/params. But it turned out to be inefficient, as every URL leads to completely different path/query/params.
Is there any work around for this using pandas? Any hints/ suggestions are much appreciated.
I think this is more of a regex problem than a pandas one; try using pandas.apply to transform the column.
import pandas as pd
import re

def clear_url(origin_url):
    # Capture the host plus the first alphabetic path segment.
    p = re.compile(r'(www\.example\.com/[a-zA-Z]*)')
    r = p.search(origin_url)
    if r:
        return r.group(1)
    else:
        return origin_url

d = [
    {'id': 1, 'url': 'www.example.com/abcdef&=randstring'},
    {'id': 2, 'url': 'www.example.com/abcdef'},
    {'id': 3, 'url': 'www.example.com/xyzabc&=randstring'}
]

df = pd.DataFrame(d)
print('origin_df')
print(df)

df['url'] = df['url'].apply(clear_url)
print('new_df')
print(df)
Output:
origin_df
id url
0 1 www.example.com/abcdef&=randstring
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc&=randstring
new_df
id url
0 1 www.example.com/abcdef
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc
I think you can use extract with a regex - capture the a-zA-Z string between www. and .com, followed by / and another a-zA-Z string:
print(df.URLs.str.extract(r'(www\.[a-zA-Z]*\.com/[a-zA-Z]*)', expand=False))
0 www.example.com/abcdef
1 www.example.com/xyzabc
2 www.example.com/abcdef
3 www.example.com/abcdef
4 www.example.com/abcdef
5 www.example.com/xyzabc
6 www.example.com/abcdef
Name: URLs, dtype: object
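Since the question mentions urlparse, here is a hedged sketch of a standard-library route with urllib.parse.urlsplit: keep only the host and the first path segment, trimming anything after "&". It assumes the URLs have no scheme prefix, as in the question's sample.
from urllib.parse import urlsplit

def base_url(url):
    # Prepend "//" so urlsplit treats the scheme-less host as netloc.
    parts = urlsplit("//" + url)
    # First path segment, with any "&=..." suffix trimmed.
    first_segment = parts.path.lstrip("/").split("/")[0].split("&")[0]
    return parts.netloc + "/" + first_segment

df["URLs"] = df["URLs"].apply(base_url)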

Python pandas - construct multivariate pivot table to display count of NaNs and non-NaNs

I have a dataset based on different weather stations for several variables (Temperature, Pressure, etc.),
stationID | Time | Temperature | Pressure |...
----------+------+-------------+----------+
123 | 1 | 30 | 1010.5 |
123 | 2 | 31 | 1009.0 |
202 | 1 | 24 | NaN |
202 | 2 | 24.3 | NaN |
202 | 3 | NaN | 1000.3 |
...
and I would like to create a pivot table that would show the number of NaNs and non-NaNs per weather station, such that:
stationID | nanStatus | Temperature | Pressure |...
----------+-----------+-------------+----------+
123 | NaN | 0 | 0 |
| nonNaN | 2 | 2 |
202 | NaN | 1 | 2 |
| nonNaN | 2 | 1 |
...
Below I show what I have done so far, which works (in a cumbersome way) for Temperature. But how can I get the same for both variables, as shown above?
import pandas as pd
import numpy as np
df = pd.DataFrame({'stationID':[123,123,202,202,202], 'Time':[1,2,1,2,3],'Temperature':[30,31,24,24.3,np.nan],'Pressure':[1010.5,1009.0,np.nan,np.nan,1000.3]})
dfnull = df.isnull()
dfnull['stationID'] = df['stationID']
dfnull['tempValue'] = df['Temperature']
dfnull.pivot_table(values=["tempValue"], index=["stationID","Temperature"], aggfunc=len,fill_value=0)
The output is:
----------------------------------
tempValue
stationID | Temperature
123 | False 2
202 | False 2
| True 1
UPDATE: thanks to @root:
In [16]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(int).stack(level=1)
Out[16]:
Temperature Pressure
stationID
123 nans 0 0
notnans 2 2
202 nans 1 2
notnans 2 1
Original answer:
In [12]: %paste
def nans(s):
    return s.isnull().sum()

def notnans(s):
    return s.notnull().sum()
## -- End pasted text --
In [37]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(np.int8)
Out[37]:
Temperature Pressure
nans notnans nans notnans
stationID
123 0 2 0 2
202 1 2 2 1
I'll admit this is not the prettiest solution, but it works. First define two temporary columns TempNaN and PresNaN:
df['TempNaN'] = df['Temperature'].apply(lambda x: 'NaN' if x!=x else 'NonNaN')
df['PresNaN'] = df['Pressure'].apply(lambda x: 'NaN' if x!=x else 'NonNaN')
Then define your results DataFrame using a MultiIndex:
Results = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(
        list(zip(*[sorted(list(df['stationID'].unique()) * 2),
                   ['NaN', 'NonNaN'] * df['stationID'].nunique()])),
        names=['stationID', 'NaNStatus']))
Store your computations in the results DataFrame:
Results['Temperature'] = df.groupby(['stationID','TempNaN'])['Temperature'].apply(lambda x: x.shape[0])
Results['Pressure'] = df.groupby(['stationID','PresNaN'])['Pressure'].apply(lambda x: x.shape[0])
And fill the blank values with zero:
Results.fillna(value=0,inplace=True)
You can loop over the columns if that is easier. For example:
Results = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(
        list(zip(*[sorted(list(df['stationID'].unique()) * 2),
                   ['NaN', 'NonNaN'] * df['stationID'].nunique()])),
        names=['stationID', 'NaNStatus']))

for col in ['Temperature', 'Pressure']:
    df[col + 'NaN'] = df[col].apply(lambda x: 'NaN' if x != x else 'NonNaN')
    Results[col] = df.groupby(['stationID', col + 'NaN'])[col].apply(lambda x: x.shape[0])
    df.drop([col + 'NaN'], axis=1, inplace=True)

Results.fillna(value=0, inplace=True)
d = {'stationID': [], 'nanStatus': [], 'Temperature': [], 'Pressure': []}
for station_id, data in df.groupby(['stationID']):
    temp_nans = data.isnull().Temperature.mean() * data.isnull().Temperature.count()
    pres_nans = data.isnull().Pressure.mean() * data.isnull().Pressure.count()

    d['stationID'].append(station_id)
    d['nanStatus'].append('NaN')
    d['Temperature'].append(temp_nans)
    d['Pressure'].append(pres_nans)

    d['stationID'].append(station_id)
    d['nanStatus'].append('nonNaN')
    d['Temperature'].append(data.isnull().Temperature.count() - temp_nans)
    d['Pressure'].append(data.isnull().Pressure.count() - pres_nans)

df2 = pd.DataFrame.from_dict(d)
print(df2)
The result is:
Pressure Temperature nanStatus stationID
0 0.0 0.0 NaN 123
1 2.0 2.0 nonNaN 123
2 2.0 1.0 NaN 202
3 1.0 2.0 nonNaN 202
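Another compact way to build the same table, shown here as a hedged sketch: count missing and present values separately, then stack the two frames under a nanStatus level with pd.concat. The sample frame is the one from the question.
import numpy as np
import pandas as pd

df = pd.DataFrame({'stationID': [123, 123, 202, 202, 202],
                   'Time': [1, 2, 1, 2, 3],
                   'Temperature': [30, 31, 24, 24.3, np.nan],
                   'Pressure': [1010.5, 1009.0, np.nan, np.nan, 1000.3]})

cols = ['Temperature', 'Pressure']
nan_counts = df[cols].isna().groupby(df['stationID']).sum()   # NaNs per station
nonnan_counts = df.groupby('stationID')[cols].count()         # non-NaNs per station

# Stack the two count frames under a new 'nanStatus' index level.
out = (pd.concat({'NaN': nan_counts, 'nonNaN': nonnan_counts},
                 names=['nanStatus', 'stationID'])
         .swaplevel()
         .sort_index())
print(out)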

Use pandas groupby.size() results for arithmetical operation

I am stuck on the following problem, which I unfortunately cannot resolve by myself or with the help of similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user may have several entries in this data frame:
|   | userID | col2 | col3 |
+---+--------+------+------+
| 1 | 1      | a    | b    |
| 2 | 1      | c    | d    |
| 3 | 2      | a    | a    |
| 4 | 3      | d    | e    |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I then want to use for another simple calculation, like a division.
But as I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the upper dataframe is my main data frame, so to speak. Besides this frame I have a second frame that looks like this:
|   | userID | value | value/appearances |
+---+--------+-------+-------------------+
| 1 | 1      | 10    | 10 / 2 = 5        |
| 3 | 2      | 20    | 20 / 1 = 20       |
| 4 | 3      | 30    | 30 / 1 = 30       |
So I basically want in the column 'value/appearances' to have the result of the number in the value column divided by the number of appearances of this certain user in the main dataframe. For user with ID=1 this would be 10/2, as this user has a value of 10 and has 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID' and call transform on the grouped column, passing 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, the result aligns with the df index, so it introduces NaN for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
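To finish the question's actual goal (value divided by the number of appearances in the main frame), here is a hedged sketch; the second frame's name summary and its contents are assumptions based on the example in the question.
import pandas as pd

# Main frame and the second frame from the question (values assumed as shown).
df = pd.DataFrame({'userID': [1, 1, 2, 3],
                   'col2': ['a', 'c', 'a', 'd'],
                   'col3': ['b', 'd', 'a', 'e']})
summary = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]})

# Rows per user in the main frame, mapped onto the second frame by userID.
appearances = df['userID'].value_counts()
summary['value/appearances'] = summary['value'] / summary['userID'].map(appearances)
print(summary)   # 10/2=5, 20/1=20, 30/1=30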

How to exclude a single value from Groupby method using Pandas

I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby output looks like this:
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing the Excel file):
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0.
Compare against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.
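Applied to the question's data, GroupBy.filter would look roughly like this (a sketch with a small assumed stand-in frame; the boolean mask from the first answer does the same job more simply):
import pandas as pd

# Small assumed stand-in for the question's data, with one row whose
# Method key is the filled-in 0.
df = pd.DataFrame({'Team': ['Team 1', 'Team 1', 'Team 4', 'Team 4'],
                   'Method': ['Automated', 'Manual', 0, 'Hybrid']})

# Keep only the groups whose Method key is not 0, then count rows per group.
filtered = (df.groupby(['Team', 'Method'], sort=False)
              .filter(lambda g: g['Method'].iloc[0] != 0))
b = filtered.groupby(['Team', 'Method']).size()
print(b)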
