This question already has answers here:
splitting at underscore in python and storing the first value
(5 answers)
Get last "column" after .str.split() operation on column in pandas DataFrame
(5 answers)
Closed 2 years ago.
I have a dataframe of email addresses, and I want to find which email providers are used most often (e.g. gmail.com, yahoo.com, etc.). I used the following code
dfEmail=Ecom['Email']
I have the following data
0 pdunlap#yahoo.com
1 anthony41#reed.com
2 amymiller#morales-harrison.com
3 brent16#olson-robinson.info
4 christopherwright#gmail.com
...
9995 iscott#wade-garner.com
9996 mary85#hotmail.com
9997 tyler16#gmail.com
9998 elizabethmoore#reid.net
9999 rachelford#vaughn.com
Name: Email, Length: 10000, dtype: object
I want to split these email addresses at "#" and get only the names of the email providers.
I tried the following
dfEmailSplit=dfEmail.str.split('#')
dfEmailSplit[500][1]
This gave me the following result:
'gmail.com'
How do I do this for all the email addresses?
import pandas as pd

data = {'email': ['pdunlap#yahoo.com', 'anthony41#reed.com', 'amymiller#morales-harrison.com']}
df = pd.DataFrame(data)

# Split each address at '#' and keep the part after it (the provider/domain)
tlds = {'tlds': [x.split('#')[1] for x in df['email']]}
df = pd.DataFrame(tlds)
print(df)
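Alternatively, a vectorized sketch using the .str accessor, assuming the full Series is Ecom['Email'] as in the question (with '#' as the separator, matching the sample data); value_counts() then ranks the most used providers directly:
# Vectorized: split on '#' and keep the part after it for every row
providers = Ecom['Email'].str.split('#').str[1]
# Count how often each provider appears, most common first
print(providers.value_counts().head())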
This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 10 months ago.
I'm trying to set a column based on the value of other columns in my dataframe, but I'm having a hard time with the syntax. I can best describe this with an example:
Say you have a dataframe with the columns "Computer", "IP1", "IP2", "Signal", "Connected":
data = {'Computer': ['cp1', 'cp2'], 'IP1': [51.20, 51.21], 'IP2': [52.20, 52.21], 'Signal': ['IN', 'OUT']}
df = pd.DataFrame(data)
df['Connected'] = np.nan
Here's what I've tried:
for i in df['Signal']:
    if i == 'IN':
        df['Connected'] = df['IP2']
    else:
        df['Connected'] = df['IP1']
But this doesn't give me the correct output.
What I would like as an output is: for every row where Signal is 'IN', Connected takes the value of IP2,
and for every row where Signal is 'OUT', it takes the value of IP1.
I hope this makes sense. Thank you
Use mask with the right condition:
df['Connected'] = df['IP1'].mask(df['Signal'] == 'IN', df['IP2'])
df
Out[20]:
Computer IP1 IP2 Signal Connected
0 cp1 51.20 52.20 IN 52.20
1 cp2 51.21 52.21 OUT 51.21
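An alternative sketch with np.where (assuming numpy is imported as np): it picks IP2 where Signal is 'IN' and IP1 otherwise, matching the mask version above.
import numpy as np
# Vectorized choice: IP2 where Signal == 'IN', otherwise IP1
df['Connected'] = np.where(df['Signal'] == 'IN', df['IP2'], df['IP1'])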
This question already has an answer here:
Pandas | Group by with all the values of the group as comma separated
(1 answer)
Closed 11 months ago.
I have a data frame as below -
df_add = pd.DataFrame({
'doc_id':[100,101,102,103],
'last_name':['Mallesham','Mallesham','Samba','Bhavik'],
'first_name':['Yamulla','Yamulla','Anil','Yamulla'],
'dob':['06-03-1900','06-03-1900','20-09-2020','09-16-2020']
})
Here doc_id 100 and 101 are duplicate rows when considering last name, first name and DOB.
My requirement is to roll up 101 into 100 as follows:
doc_id should be filled in as 100;101, with a semicolon separator.
In a second case:
If I consider only the last_name and first_name combination, it should roll up in the same way, since persons with the same name might have different DOBs.
You need to change doc_id to str in order to use the str.cat function:
df_add["doc_id"] = df_add["doc_id"].astype(str)
resultant_df = df_add.groupby(["first_name", "last_name", "dob"])["doc_id"].apply(lambda x: x.str.cat(sep=','))
print(resultant_df.reset_index())
  first_name  last_name         dob   doc_id
0       Anil      Samba  20-09-2020      102
1    Yamulla     Bhavik  09-16-2020      103
2    Yamulla  Mallesham  06-03-1900  100,101
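Since the question asks for a semicolon separator, the same sketch with sep=';' (only the separator changed) would produce 100;101 instead:
# Same groupby, but join the ids with ';' as the question requests
resultant_df = df_add.groupby(["first_name", "last_name", "dob"])["doc_id"].apply(lambda x: x.str.cat(sep=';'))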
This question already has answers here:
How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
(3 answers)
Pandas: convert categories to numbers
(6 answers)
Convert pandas series from string to unique int ids [duplicate]
(2 answers)
Closed 1 year ago.
I have a dataframe with a column consisting of author names, where sometimes the name of an author repeats. My problem is: I want to assign a unique number to each author name in a corresponding parallel column (for simplicity, assume that this numbering follows the progression of whole numbers, starting with 0, then 1, 2, 3, and so on).
I can do this using nested for loops, but with 57,000 records and roughly 500 unique authors, it is taking way too long. Is there a quicker way to do this?
For example,
Original DataFrame contains:
**Author**
Name 1
Name 2
Name 1
Name 3
I want another column added next to it, such that:
**Author** **AuthorID**
Name 1 1
Name 2 2
Name 1 1
Name 3 3
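A vectorized sketch with pd.factorize avoids the nested loops entirely; it assigns the same integer code to every occurrence of the same label in one pass (assuming the column is named Author as in the example; factorize starts counting at 0, so add 1 to match the sample IDs):
import pandas as pd

df = pd.DataFrame({'Author': ['Name 1', 'Name 2', 'Name 1', 'Name 3']})
# factorize returns an array of integer codes plus the array of unique labels
codes, uniques = pd.factorize(df['Author'])
df['AuthorID'] = codes + 1  # shift so IDs start at 1, matching the sample output
print(df)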
This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How do I filter this dataframe based on the length of the lists in the subjects column?
So for example, if I only want to have rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Also, using loc does not help, that gives me the same error.
Thanks in advance!
Use the string accessor to work with lists:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
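If you prefer not to rely on the .str accessor for list-valued columns, an equivalent sketch with map(len) on the same dataframe gives the same rows:
# Compute the length of each list explicitly and filter on it
df[df['subjects'].map(len) >= 2]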
This question already has answers here:
How to reverse the order of first and last name in a Pandas Series
(4 answers)
Closed 4 years ago.
I need to swap a list of names that is in the format FirstName LastName and is stored in a single dataframe column, using Python.
Below is the sample format:
~Adam Smith
The above needs to change into
~Smith Adam
Is there any single-line function available in Python?
Could anyone help with this?
Using apply
import pandas as pd
df = pd.DataFrame({"names": ["Adam Smith", "Greg Rogers"]})
df["names"] = df["names"].apply(lambda x: " ".join(reversed(x.split())))
print(df)
Output:
names
0 Smith Adam
1 Rogers Greg
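A fully vectorized sketch with the .str accessor on the same dataframe does the split-reverse-join without apply:
# Split on whitespace, reverse each resulting list, then join back with a space
df["names"] = df["names"].str.split().str[::-1].str.join(" ")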