I have the csv (or the dataframe) with the content as follows:
date | URLs | Count
-----------------------------------------------------------------------
17-mar-2014 | www.example.com/abcdef&=randstring | 20
10-mar-2016 | www.example.com/xyzabc | 12
14-apr-2015 | www.example.com/abcdef | 11
12-mar-2016 | www.example.com/abcdef/randstring | 30
15-mar-2016 | www.example.com/abcdef | 10
17-feb-2016 | www.example.com/xyzabc&=randstring | 15
17-mar-2016 | www.example.com/abcdef&=someotherrandstring | 12
I want to clean up the 'URLs' column so that www.example.com/abcdef&=randstring or www.example.com/abcdef/randstring becomes just www.example.com/abcdef, and so on, for all the rows.
I tried to play around with the urlparse library, parsing each URL and recombining just urlparse(url).netloc with urlparse(url).path/query/params. But it turned out to be inefficient, as every URL leads to a completely different path/query/params.
Is there any workaround for this using pandas? Any hints/suggestions are much appreciated.
I think this is more of a regex problem than a pandas one; try using pandas.apply to transform the column.
import pandas as pd
import re

def clear_url(origin_url):
    # Keep only the host plus the first alphabetic path segment.
    p = re.compile(r'(www\.example\.com/[a-zA-Z]*)')
    r = p.search(origin_url)
    if r:
        return r.group(1)
    else:
        return origin_url

d = [
    {'id': 1, 'url': 'www.example.com/abcdef&=randstring'},
    {'id': 2, 'url': 'www.example.com/abcdef'},
    {'id': 3, 'url': 'www.example.com/xyzabc&=randstring'}
]

df = pd.DataFrame(d)
print('origin_df')
print(df)

df['url'] = df['url'].apply(clear_url)
print('new_df')
print(df)
Output:
origin_df
id url
0 1 www.example.com/abcdef&=randstring
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc&=randstring
new_df
id url
0 1 www.example.com/abcdef
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc
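Applied to the question's dataframe, the same function can simply be mapped over its URL column (a sketch; the column name 'URLs' is taken from the question):
df['URLs'] = df['URLs'].apply(clear_url)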
I think you can use extract with a regex - capture the a-z/A-Z string between www. and .com, followed by / and another such string:
print(df.URLs.str.extract(r'(www\.[a-zA-Z]*\.com/[a-zA-Z]*)', expand=False))
0 www.example.com/abcdef
1 www.example.com/xyzabc
2 www.example.com/abcdef
3 www.example.com/abcdef
4 www.example.com/abcdef
5 www.example.com/xyzabc
6 www.example.com/abcdef
Name: URLs, dtype: object
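Another sketch, under the same assumption that the wanted path segment is purely alphabetic, uses str.replace with a back-reference: the prefix is kept and whatever trails it is dropped.
import pandas as pd

df = pd.DataFrame({'URLs': ['www.example.com/abcdef&=randstring',
                            'www.example.com/abcdef/randstring',
                            'www.example.com/xyzabc']})
# Capture host plus first alphabetic path segment, throw away the rest.
df['URLs'] = df['URLs'].str.replace(r'^(www\.example\.com/[A-Za-z]+).*$',
                                    r'\1', regex=True)
print(df)
#                      URLs
# 0  www.example.com/abcdef
# 1  www.example.com/abcdef
# 2  www.example.com/xyzabc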
Related
I have two csv files with 200 columns each. The two files have exactly the same number of rows and columns, all numeric. I want to compare each column separately.
The idea is to compare the column 1 values of file "a" to the column 1 values of file "b", check the difference, and so on for all the numbers in the column (there are 100 rows), and then write out in how many cases the difference was more than 3.
I would like to repeat the same for all the columns. I know it should probably be a double for loop, but I have no idea how to write it.
Thanks in advance!
import pandas as pd

dk = pd.read_csv('C:/Users/D/1_top_a.csv', sep=',', header=None)
dk = dk.dropna(how='all')
dk = dk.dropna(how='all', axis=1)
print(dk)

dl = pd.read_csv('C:/Users/D/1_top_b.csv', sep=',', header=None)
dl = dl.dropna(how='all')
dl = dl.dropna(how='all', axis=1)
print(dl)

rows = dk.shape[0]
print(rows)

# for i ...   <- this is as far as I got with the loop
print(dk._get_value(0, 0))
import pandas as pd

df1 = pd.DataFrame(dict(cola=[1, 2, 3, 4], colb=[4, 5, 6, 7]))
df2 = pd.DataFrame(dict(cola=[1, 2, 4, 5], colb=[9, 7, 8, 9]))

for label, content in df1.items():
    diff = df1[label].compare(df2[label])
    if diff.shape[0] >= 3:
        print(f'Found {diff.shape[0]} diffs in {label}')
        print(diff.to_markdown())
Out:
Found 4 diffs in colb
| | self | other |
|---:|-------:|--------:|
| 0 | 4 | 9 |
| 1 | 5 | 7 |
| 2 | 6 | 8 |
| 3 | 7 | 9 |
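If the goal is literally "per column, in how many rows is the difference more than 3", and both frames are numeric with identical shape, a loop-free sketch is also possible (the threshold name is an assumption):
import pandas as pd

df1 = pd.DataFrame(dict(cola=[1, 2, 3, 4], colb=[4, 5, 6, 7]))
df2 = pd.DataFrame(dict(cola=[1, 2, 4, 5], colb=[9, 7, 8, 9]))

threshold = 3
# Element-wise absolute difference, then count per column how many values exceed the threshold.
counts = (df1 - df2).abs().gt(threshold).sum()
print(counts)
# cola    0
# colb    1
# dtype: int64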
I am trying to clean my data frame, but I just want to remove special characters from just one column (please refer to the table below).
df1
|    A    | B  | C  |
|---------|----|----|
| Ags(1)  | 5  | 4  |
| Cdmx(2) | 6  | 6  |
| Leon(4) | 90 | 45 |
What I want to remove is just the numbers and special characters in column A.
This is what I tried:
df = re.sub('[^A-Za-z0-9]+', '', df1["A"])
>> TypeError: expected string or bytes-like object
I would try to use a lambda with the apply function on the wanted column.
df1["A"] = df1["A"].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', x))
You can also use .str.extract() to keep the part you want (vs replace, which eliminates the part you don't want):
from io import StringIO
import pandas as pd
data = ''' A B C
Ags(1) 5 4
Cdmx(2) 6 6
Leon(4) 90 45
'''
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')
df['A'] = df['A'].str.extract(r'(\w+)', expand=False)
print(df)
A B C
0 Ags 5 4
1 Cdmx 6 6
2 Leon 90 45
I am new to Python and Stack Overflow, so please bear with me. I have a large data file of roughly 140k rows stored as a csv. The file is split into sections based on age groups, i.e. 16-24, 24-50 etc. At every break there are information lines about the age and ethnicity of the subjects.
After loading the csv into pandas, I tried to break the dataframe up into several smaller ones by splitting on the information lines of the age groups using iloc. Now I have a list of dataframes. I can access each dataframe in the list, no problem; however (I guess due to the information lines) pandas displays all information in one column. Is there a way to format the output and make pandas display the column headers, and put the information lines into a header above the column headers? I'm sorry if this is not very clear; please feel free to suggest any edits.
The data in the csv looks something like this:
0 Some information
1 Some information
2 Some information
3
4
5 a | b | c | d |
6 a | 1 | 1 | 1 |
7 a | 1 | 1 | 1 |
8 a | 1 | 1 | 1 |
9
10 Some information
11 Some information
12 Some information
13
14
15 a | b | c | d |
16 a | 1 | 1 | 1 |
17 a | 1 | 1 | 1 |
18 a | 1 | 1 | 1 |
I used iloc to break this up on the information lines by row index.
l = [36065, 43278, 50491, 57704,
     64917, 72130, 79343, 86556,
     93769, 100982, 108195, 115408,
     122621, 129834, 137047]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [mydata_df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
When accessing, I used: df1_df = list_of_dfs[1]
The output is currently as follows:
0
--------------------
1 a,b,c
2 a,1,1,
I hope this makes sense, please suggest edits and I'll do my best to explain.
You can try df[0].str.split(',', expand=True), which expands your dataframe by splitting each row on the commas. You can then assign the new column names, since it will give numeric column names [0, 1, 2, 3, ...].
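A minimal sketch of that idea on a frame like the one in the question (the sub-frame and its contents here are assumptions based on the sample above):
import pandas as pd

# One of the sub-frames from list_of_dfs: a single text column of comma-separated values.
sub = pd.DataFrame({0: ['a,b,c,d', 'a,1,1,1', 'a,1,1,1']})

expanded = sub[0].str.split(',', expand=True)   # one column per comma-separated field
expanded.columns = expanded.iloc[0]             # promote the first row to the header
expanded = expanded.iloc[1:].reset_index(drop=True)
expanded.columns.name = None                    # drop the leftover index label
print(expanded)
#    a  b  c  d
# 0  a  1  1  1
# 1  a  1  1  1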
I have run into the following problem, which I am stuck on and unfortunately cannot resolve by myself or with similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I got a Dataframe with several columns and one column that indicates the ID of a user. It might happen that the same user has several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried to use df.groupby('userID').size(), which in turn I want to use for another simple calculation, like a division or whatever.
But as I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the upper dataframe is my main data frame, so to say. Besides this frame I have a second frame looking like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So in the column 'value/appearances' I basically want the number in the 'value' column divided by the number of appearances of that user in the main dataframe. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, this aligns on the df index, so it introduces NaN for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
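If one prefers to start from groupby().size() anyway, the result can still be brought back safely by mapping it over the key column instead of relying on index alignment. A sketch for the question's second frame (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'userID': [1, 1, 2, 3],
                   'col2': ['a', 'c', 'a', 'd'],
                   'col3': ['b', 'd', 'a', 'e']})
df2 = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]})

counts = df.groupby('userID').size()            # appearances per user in the main frame
df2['value/appearances'] = df2['value'] / df2['userID'].map(counts)
print(df2)
#    userID  value  value/appearances
# 0       1     10                5.0
# 1       2     20               20.0
# 2       3     30               30.0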
I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing the Excel file):
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0.
Compare against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.
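Applied to the question's frame, a sketch of that idea (grouping on 'Method' and dropping the group whose key is the 0 filler; the sample data here is an assumption):
import pandas as pd

df = pd.DataFrame({'Team': ['Team 1', 'Team 1', 'Team 4', 'Team 4'],
                   'Method': ['Automated', 'Manual', 0, 'Hybrid']})

# Keep only the rows whose Method group is not the 0 filler value.
# sort=False avoids comparing the mixed str/int group keys.
kept = df.groupby('Method', sort=False).filter(lambda g: (g['Method'] != 0).all())
print(kept.groupby(['Team', 'Method']).size())
# Team    Method
# Team 1  Automated    1
#         Manual       1
# Team 4  Hybrid       1
# dtype: int64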