I am working on an assignment with two CSV files. The first file contains my whole data (shown below):
First File:
0 ID Name Suburb State Postcode Email Lat Lon
0 0 1 Hurstville Clinic Hurstville NSW 1493 hurstville#myclinic.com.au -33.975869 151.088939
1 1 2 Sydney Centre Clinic Sydney NSW 2000 sydney#myclinic.com.au -33.867139 151.207114
2 2 3 Auburn Clinic Auburn NSW 2144 auburn#myclinic.com.au -33.849322 151.033421
3 3 4 Riverwood Clinic Riverwood NSW 2210 riverwood#myclinic.com.au -33.949859 151.052469
The second file contains the data which I have to substitute into the Email column of the first file.
I used a regex to convert the second file's entries into HTML links.
This is what I've done to clean my data:
def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(lambda x: x if '#' in str(x) else str(x) + '#myclinic.com.au')
    return df.to_csv('temp1.csv')
Second file output:
Email
which is not correct. The above function is omitting everything before the spaces in the Email column, and is also omitting any row which has a space before the # in the email address.
This is what I have to do:
a) Clean the data of file one (two unwanted columns). There are also some spaces in the Email column, because some addresses have spaces in the name, and they cannot be read by the final function.
b) The end output that I am getting is not the output that I want; it is omitting 10 rows.
This is what I've done in file2:
emails = re.findall(r'\S+#\S+', text)
for x in range(0, len(emails)):
    # wrap each address in an HTML mailto link
    emails[x] = '<a href="mailto:%s">%s</a>' % (emails[x], emails[x])
emails.insert(0, 'Email')
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in emails:
        writer.writerow([val])
Here, text is a string which contains my whole CSV data. Instead of reading the CSV file, I just copied the whole content into my Python file.
Final Output
ID Name Suburb State Postcode Email_Str Lat Lon
0 1 Hurstville Clinic Hurstville NSW 1493 -33.975869 151.088939
1 2 Sydney Centre Clinic Sydney NSW 2000 -33.867139 151.207114
2 3 Auburn Clinic Auburn NSW 2144 -33.849322 151.033421
3 4 Riverwood Clinic Riverwood NSW 2210 -33.949859 151.052469
4 6 Harrington Clinic Harrington NSW 2427 -31.872153 152.689811
5 9 Benolong Clinic Benolong NSW 2830 -32.413736 148.63938
6 11 Preston Clinic Preston VIC 3072 -37.738736 145.000515
7 13 Douglas Clinic Douglas VIC 3409 -37.842988 144.892631
8 14 Mildura Clinic Mildura VIC 3500 -34.181714 142.163072
9 15 Broadford Clinic Broadford VIC 3658 -37.203001 145.050171
10 16 Officer Clinic Officer VIC 3809 -38.063056 145.40958
11 18 Langsborough Clinic Langsborough VIC 3971 -38.651487 146.675098
12 19 Brisbane Centre Clinic Brisbane QLD 4000 -27.46758 153.027892
13 20 Robertson Clinic Robertson QLD 4109 -27.565733 153.057213
14 22 Ipswich Clinic Ipswich QLD 4305 -27.614604 152.760876
15 24 Caboolture Clinic Caboolture QLD 4510 -27.085007 152.951707
16 25 Booie Clinic Booie QLD 4610 -26.498426 151.935421
17 26 Rockhampton Clinic Rockhampton QLD 4700 -23.378941 150.512323
18 28 Cairns Clinic Cairns QLD 4870 -16.925397 145.775178
19 29 Adelaide Centre Clinic Adelaide SA 5000 -34.92577 138.599732
My data is missing after the final merge.
As you can see, it is missing a lot of rows.
Please help me.
Not sure what the question is but it seems that you're trying to convert the email text into an email link. You can do this:
df['Email'] = df['Email'].apply(lambda x: '<a href="mailto:{0}">{0}</a>'.format(x))
Looks like you are trying to merge two data frames on the 'Email' column such that after the merge you get the email string from DF2 joined onto DF1.
First we need to create a second column within DF2 that has an email pattern matching DF1 (Email_Str).
Next, we merge DF1 and DF2 on the common Email column.
Further, we remove the unwanted column.
Finally, we rearrange the columns into the desired sequence.
Working code below (using your data):
import pandas as pd
import re
df1 = pd.read_csv('file1.txt', sep=",", engine="python")
df2 = pd.read_csv('file2.txt', sep=",", engine="python")
def get_email(x):
    return ''.join(re.findall(r'"([^"]*)"', x))
df2.columns = ['Email_Str']
df2['Email']=df2['Email_Str'].apply(get_email)
df2 = df2[['Email','Email_Str']]
df3=pd.merge(df1,df2,on='Email').drop(['Email'], axis=1)
df3 = df3[[u'ID', u'Name', u'Suburb', u'State', u'Postcode', u'Email_Str', u'Lat', u'Lon']]
Result:
>>> df3
ID Name Suburb State Postcode \
0 1 Hurstville Clinic Hurstville NSW 1493
1 2 Sydney Centre Clinic Sydney NSW 2000
2 3 Auburn Clinic Auburn NSW 2144
3 4 Riverwood Clinic Riverwood NSW 2210
Email_Str Lat Lon
0 -33.975869 151.088939
1 -33.867139 151.207114
2 -33.849322 151.033421
3 -33.949859 151.052469
>>>
Hope this is what you are looking for.
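On the original complaint about missing rows: pandas' merge has an indicator flag that shows which keys failed to match. A small self-contained sketch with made-up frames (one email carries a stray space, mimicking the asker's data; the frame contents here are hypothetical):

```python
import pandas as pd

# hypothetical frames: the second email in df1 has a stray space, so it will not match df2
df1 = pd.DataFrame({'ID': [1, 2], 'Email': ['a#myclinic.com.au', 'b #myclinic.com.au']})
df2 = pd.DataFrame({'Email': ['a#myclinic.com.au', 'b#myclinic.com.au']})

# how='left' keeps every df1 row; the _merge column marks which rows found a partner
check = df1.merge(df2, on='Email', how='left', indicator=True)
missing = check[check['_merge'] == 'left_only']
print(missing)  # the rows an inner merge would silently drop
```

Running this kind of check on the real data makes it obvious whether the dropped rows are a key-mismatch problem rather than a merge-logic problem.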
I figured out what the problem is.
My original data has spaces in the Email column.
So can anyone update my regex function?
def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(lambda x: x if '#' in str(x) else str(x) + '#myclinic.com.au')
    return df.to_csv('temp1.csv')
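A minimal sketch of an updated function, assuming the stray spaces should simply be removed before the '#' check (the function and file names follow the original post; the helper fix_email is added for clarity):

```python
import pandas as pd

def fix_email(x):
    # drop stray spaces first, then append the domain only when '#' is missing
    x = str(x).replace(' ', '')
    return x if '#' in x else x + '#myclinic.com.au'

def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(fix_email)
    return df.to_csv('temp1.csv', index=False)
```

With the spaces gone, addresses like 'hurst ville#myclinic.com.au' collapse to a single token and survive the later \S+#\S+ regex.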
I have a big dataset about news reading, and I'm trying to clean it. I created a checklist of cities that I want to keep (the dataset has all the cities). How can I drop rows based on that checklist? For example, if my checklist (a list) contains all the French cities, how can I drop every other city?
To picture the data frame (I have 1.5m rows btw):
City Age
0 Paris 25-34
1 Lyon 45-54
2 Kiev 35-44
3 Berlin 25-34
4 New York 25-34
5 Paris 65+
6 Toulouse 35-44
7 Nice 55-64
8 Hannover 45-54
9 Lille 35-44
10 Edinburgh 65+
11 Moscow 25-34
You can do this using pandas.DataFrame.isin. It returns boolean values indicating whether each element is inside the list x. You can then use the boolean values to take the subset of the df with rows that return True by doing df[df['City'].isin(x)]. Following is my solution:
import pandas as pd
x = ['Paris' , 'Marseille']
df = pd.DataFrame(data={'City':['Paris', 'London', 'New York', 'Marseille'],
'Age':[1, 2, 3, 4]})
print(df)
df = df[df['City'].isin(x)]
print(df)
Output:
>>> City Age
0 Paris 1
1 London 2
2 New York 3
3 Marseille 4
City Age
0 Paris 1
3 Marseille 4
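Since the original question asked about dropping rows rather than keeping them, the same mask can be inverted with ~:

```python
import pandas as pd

x = ['Paris', 'Marseille']
df = pd.DataFrame(data={'City': ['Paris', 'London', 'New York', 'Marseille'],
                        'Age': [1, 2, 3, 4]})

# ~ negates the boolean mask, so this keeps every row whose City is NOT in the checklist
dropped = df[~df['City'].isin(x)]
print(dropped)  # London and New York remain
```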
Let's say I had this sample of a mixed dataset:
df:
Property Name Date of entry Old data Updated data
City Jim 1/7/2021 Jacksonville Miami
State Jack 1/8/2021 TX CA
Zip Joe 2/2/2021 11111 22222
Address Harry 2/3/2021 123 lane 123 street
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Order count Tammy 3/4/2021 2 3
I'd like to group all this data, starting with Property, and have it look like this:
grouped:
Property Name Date of entry Old data Updated Data
City names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
State names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
grouped = pd.DataFrame(df.groupby(['Property', 'Name', 'Date of entry', 'Old data', 'Updated data'])
                         .size(), columns=['Count'])
grouped
grouped
and I get a TypeError saying: '<' not supported between instances of 'int' and 'datetime.datetime'
Is there some sort of formatting that I need to do to the df['Old data'] & df['Updated data'] columns to allow them to be added to the groupby?
added data types:
Property: Object
Name: Object
Date of entry: datetime
Old data: Object
Updated data: Object
*I modified your initial data to get a better view of the output.
You can try with pivot_table instead of groupby:
df.pivot_table(index = ['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x)
Output:
Old data Updated data
Property Name Date of entry
Address Harry 2/3/2021 123 lane 123 street
Lisa 2/3/2021 123 lane 123 street
City Jack 1/8/2021 TX Miami
Jim 1/7/2021 Jacksonville Miami
Tammy 1/8/2021 TX Miami
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Order count Jack 3/4/2021 2 3
Tammy 3/4/2021 2 3
State Jack 1/8/2021 TX CA
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Zip Joe 2/2/2021 11111 22222
The whole code:
import pandas as pd
from io import StringIO
txt = '''Property Name Date of entry Old data Updated data
City Jim 1/7/2021 Jacksonville Miami
City Jack 1/8/2021 TX Miami
State Jack 1/8/2021 TX CA
Zip Joe 2/2/2021 11111 22222
Order count Jack 3/4/2021 2 3
Address Harry 2/3/2021 123 lane 123 street
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Address Lisa 2/3/2021 123 lane 123 street
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Order count Tammy 3/4/2021 2 3
City Tammy 1/8/2021 TX Miami
'''
df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s{2,}', engine='python')
print(df.pivot_table(index = ['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x))
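If you would rather keep the original groupby approach, the TypeError most likely comes from pandas trying to sort mixed-type group keys (ints next to dates in the same column); casting the offending columns to strings is one way around it. A minimal sketch under that assumption, with a tiny made-up frame:

```python
import pandas as pd

# hypothetical sample: 'Old data' mixes ints and strings, like the real data mixes ints and dates
df = pd.DataFrame({'Property': ['Zip', 'Order count'],
                   'Name': ['Joe', 'Tammy'],
                   'Old data': [11111, '2'],
                   'Updated data': [22222, '3']})

# casting everything to str makes all group keys comparable, so groupby can sort them
cols = ['Property', 'Name', 'Old data', 'Updated data']
grouped = pd.DataFrame(df.astype(str).groupby(cols).size(), columns=['Count'])
print(grouped)
```

The trade-off is that the grouped index holds strings afterwards, which is usually fine for a display-oriented summary like this one.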
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['David','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence include USA or not, and attach the split columns from Residence ( USA and nonUSA ) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The index of the original data is unique and is not changed by the code below for either DataFrame, so you can use concat to join them together and then attach the result to the original with DataFrame.join, or with concat using axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# column renaming is done after dropna to avoid an error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
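To actually produce the two separate frames the question asked for, a boolean mask on the new Country column works. This sketch assumes the simplified single-split approach above and uses a reduced two-row sample for brevity:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '3'],
                   'Residence': ['USA;CA;Los Angeles;Los Angeles', 'Canada;ON']})

# split once, attach all four columns, then filter on Country
df[['Country', 'State', 'County', 'City']] = df['Residence'].str.split(';', expand=True)
is_usa = df['Country'] == 'USA'
USAdata = df[is_usa]
nonUSAdata = df[~is_usa].drop(columns=['County', 'City'])  # those columns are all-empty here
```

The same mask-and-drop pattern scales to the full frame, including the 'NA' residence row, which simply ends up in neither-USA territory unless filtered separately.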
Goal: if the name in df2 in row i is a substring or an exact match of a name in df1 in some row N, and the state and district columns of row N in df1 match the respective state and district columns of df2 row i, combine the rows.
Break down of data frame inputs:
df1 is a time-series style data frame.
df2 is a regular data frame.
df1 and df2 do not have the same length.
df1 names contain initials, titles, and even weird character encodings.
df2 Names are just a combination of First Name, Space and Last Name.
My attempts have centered around taking into account names, districts, and state.
My approaches have tried to take into account that names in df1 have initials, second names, titles, etc., whereas df2 has simply first and last names. I tried to use str.contains('[A-Za-z]') to account for this difference.
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
# Name District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
# Attempt code
df3 = df1.merge(df2, left_on = (df1.State, df1.District,df1.CandidateName.str.contains('[A-Za-z]')), right_on=(df2.State, df2.District,df2.Name.str.contains('[A-Za-z]')))
I included merging on District and State in order to reduce redundancies and inaccuracies. When I removed district and state from left_on and right_on, not only did the output df3 increase in size, it also contained a lot of wrong matches.
Examples include CandidateName and Name being two different people:
Theodorick A. Bland sharing the same row as Jasson Lewis Sr.
Some of the row results with the Attempt Code above are as follows:
Header
key_0 key_1 key_2 CandidateName District_x Party_x State_x District_y Name Party_y State_y
Row 6, index 4
MN 2 True Jason Lewis 2 Democrat MN 2 Jasson Lewis Sr. Republican MN
Row 11, index 3
3 VA 10 True Barbara Comstock 10 VA 10 Barbara Comstock Democrat VA
We can use difflib for this to create an artificial key column to merge on. We call this column Name, like the one in df2:
import difflib
# fall back to '' when difflib finds no close match (empty candidate names would otherwise raise IndexError)
df1['Name'] = df1['CandidateName'].apply(lambda x: next(iter(difflib.get_close_matches(x, df2['Name'])), ''))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Theodorick Bland VA 9 Theodorick Bland
2 Aedanus Rutherford Burke SC 2 Aedanus Burke
3 Aedanus Burke SC 2 Aedanus Burke
4 Jason Lewis MN 2 Jason Lewis
5 Jason Initial Lewis MN 2 Democrat Jason Lewis
6 Barbara Comstock VA 10 Democrat Barbara Comstock
Explanation: difflib.get_close_matches looks for similar strings in df2. This is what the new Name column in df1 looks like:
print(df1)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
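For reference, a small standalone sketch of how difflib.get_close_matches behaves, including that it returns an empty list when nothing clears the similarity cutoff (0.6 by default):

```python
import difflib

candidates = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis']

# returns up to n matches whose similarity exceeds cutoff, best first
print(difflib.get_close_matches('Theodorick A. Bland', candidates))  # ['Theodorick Bland']

# an empty query is similar to nothing, so the result is an empty list
print(difflib.get_close_matches('', candidates))  # []
```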
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gains new columns with the count of each ethnicity per company, such as American: 2, Mexican: 5, and so on, so that later on I can calculate a diversity score.
The columns in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group using groupby with size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: To order by sale you need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
# note: assignment realigns by index, so chaining sort_values here would not reorder the frame
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
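If the goal is to actually reorder the frame by sales: assigning a sorted Series back to a column realigns by index, so the frame itself must be sorted instead. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': list('ABCDEF'),
                   'sale': ['100M', '200M', '5M', '40M', '10B', '2B']})

# expand M/B suffixes to zeros, parse as float
d = {'M': '0' * 6, 'B': '0' * 9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)

# sort the whole frame on the numeric column, not just the column itself
df = df.sort_values('a', ascending=False)
print(df['Name'].tolist())  # ['E', 'F', 'B', 'A', 'D', 'C']
```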