I have the following 2 data frames, both taken from Excel files:
df1 = 10000 rows (the master list that has all the unique supplier #s)
df2 = 670 rows
I am loading an Excel file (df2) that has zip, address, and state, and I want to match that info and then add on the supplier # from df1, so that I end up with one file that's still 670 rows but now has the supplier number column.
Since there was no unique key between the two dataframes, I thought I could make one to merge on by joining 3 columns together - zip, address, and state - joined with a "-". Maybe this is too risky for a match? df1 has a ton of duplicate addresses, zips, and states, so I couldn't do something like joining just zip and state.
df1 =
(10000 rows)
(unique)
supplier_num ZIP ADDRESS STATE CCCjoin
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
...
df2 =
(670 rows)
ZIP ADDRESS STATE CCCjoin
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
...
OUTPUT:
df3 =
(670 rows)
ZIP ADDRESS STATE Unique Key supplier_num (unique)
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7100005
2 45482 140th circle CT 45482-140th circle-CT 7100006
3 35482 140th circle CT 35482-140th circle-CT 7100007
4 35483 13th road VT 35483-13th road-VT 7100008
...
670 15483 13 baker road CA 15483-13 baker road-CA 7100009
I've looked around on here and found some helpful tricks, and I think I've made some progress. Here is some code that I tried:
# cumcount numbers repeated CCCjoin values within each frame,
# so duplicate keys can pair up one-to-one in the merge
df1['g'] = df1.groupby('CCCjoin').cumcount()
df2['g'] = df2.groupby('CCCjoin').cumcount()
then I merge:
merged_table = pd.merge(df1, df2, on=['CCCjoin', 'g'], how='inner').drop('g', axis=1)
This sort of works: I get a match of 293 rows, and I cross-checked the supplier numbers against the addresses and they line up.
What am I missing to get the remaining 377 matches? Thanks in advance!
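One way to see where the remaining 377 rows fall through is to normalize the key on both sides before merging and then merge with how='left' from df2, so unmatched rows stay visible instead of being dropped. This is only a sketch of that idea; the normalization steps (zip as a clean string, lower-cased address, collapsed whitespace) are guesses at why keys built from Excel data might not line up:

import pandas as pd

def make_key(df):
    # build the composite key from normalized pieces: zip as a clean string,
    # address lower-cased with collapsed whitespace, state upper-cased
    zipc = df['ZIP'].astype(str).str.strip().str.split('.').str[0]
    addr = df['ADDRESS'].astype(str).str.lower().str.strip().str.replace(r'\s+', ' ', regex=True)
    state = df['STATE'].astype(str).str.upper().str.strip()
    return zipc + '-' + addr + '-' + state

df1['CCCjoin'] = make_key(df1)
df2['CCCjoin'] = make_key(df2)

df1['g'] = df1.groupby('CCCjoin').cumcount()
df2['g'] = df2.groupby('CCCjoin').cumcount()

# a left merge keeps all 670 df2 rows; rows with NaN supplier_num
# are the ones that still have no counterpart in df1
df3 = pd.merge(df2, df1, on=['CCCjoin', 'g'], how='left').drop('g', axis=1)
print(df3['supplier_num'].isna().sum(), 'rows still unmatched')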
A column in my dataframe has the following campaign contribution data formatted in one of two ways:
JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789
And
JANE M. DONOR
1234 W ROAD ST
CITY, STATE 56789
I want to split this column into two. Column one should be the name of the donor. Column two should be the address.
Currently, I'm using the following regex code to try and accomplish this:
url = ("http://www.voterfocus.com/CampaignFinance/candidate_pr.php?op=rp&e=8&c=munmiamibeach&ca=64&sdc=116&rellevel=4&dhc=774&committee=N")
dfs = pd.read_html(url)
df = dfs[0]
df['Contributor'].str.split(r'\d\d?', expand=True)
But instead of splitting after the first match and quitting - as I intend - the regex seems to continue matching and splitting. My output should look like this:
Col1 Col2
JOHN A. DONOR 1234 W ROAD ST CITY, STATE 56789
It may be much simpler than that. You can use the string methods. For example, I think this is the behavior you want:
import pandas as pd
s = """JOHN A. DONOR
1234 W ROAD ST
CITY, STATE 56789"""
df = pd.DataFrame([s], columns=["donors"])
df.donors.str.split("\n", n=1, expand=True)
output:
0 1
0 JOHN A. DONOR 1234 W ROAD ST\nCITY, STATE 56789
Splitting solution
You can use
df['Contributor'].str.split(r'(?<=\D)(?=\d)', expand=True, n=1)
The (?<=\D)(?=\d) regex finds a location between a non-digit char (\D) and a digit char (\d), splits the string there and only performs this operation once (due to n=1).
Alternative solution
You can match and capture the names up to the first number, and then capture all the text that remains, starting with the first digit, using
df['Contributor'].str.extract(r'(?P<Name>\D*)(?P<Address>\d.*)', expand=True)
# => Name # Address
# 0 Contributor CHRISTIAN ULVERT 1742 W FLAGLER STMIAMI, FL 33135
# 1 Contributor Roger Thomson 4271 Alton Miami Beach , FL 33140
# 2 Contributor Steven Silverstein 691 West 247th Street Bronx , NY 10471
# 3 Contributor Cathy Raduns 691 West 247th Street Bronx, NY 10471
# 4 Contributor Asher Raduns-Silverstein 691 West 247th StreetBRONX, NY 10471
The (?P<Name>\D*)(?P<Address>\d.*) pattern means
(?P<Name>\D*) - Group "Name": zero or more chars other than digits
(?P<Address>\d.*) - Group "Address": a digit and then zero or more chars other than line break chars.
If there are line breaks in the string, add (?s) at the start of the pattern, i.e. r'(?s)(?P<Name>\D*)(?P<Address>\d.*)'.
See the regex demo.
I have a dataframe with multiple columns, and for each row I want to check whether the value in the Words column is present anywhere in the other columns.
For example, for the word "ANAND" it will check the whole Area, City, City2 and States columns, and if it is found, the new Result column for that row would be set to 'Place'. If it's not present, it'll be None.
You can see the example given below. I have tried multiple approaches, but the function takes 30+ minutes to run, and at the end I get zero matches, which isn't right. Can anybody please help?
Input:
Words Area City City2 States
0 ANAND Whitefield Abohar Achalpur Maharashtra
1 Gujarat Koramangala Adilabad Achhnera Uttar Pradesh
2 PARDESHI Electronic City Agartala Adalaj Gujarat
3 PARDESHI Anand Chowk Agra Adilabad Telangana
Output:
Words Area City City2 States Result
0 ANAND Whitefield Abohar Achalpur Maharashtra None
1 Gujarat Koramangala Adilabad Achhnera Uttar Pradesh Place
2 PARDESHI Electronic City Agartala Adalaj Gujarat None
3 PARDESHI Anand Chowk Agra Adilabad Telangana None
Edit: - Data added, picture removed.
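A vectorized sketch of the lookup described above, assuming an exact (case-sensitive) match of the Words value against the full cell values of the other columns, which is what the expected output implies (ANAND does not match 'Anand Chowk'):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Words':  ['ANAND', 'Gujarat', 'PARDESHI', 'PARDESHI'],
    'Area':   ['Whitefield', 'Koramangala', 'Electronic City', 'Anand Chowk'],
    'City':   ['Abohar', 'Adilabad', 'Agartala', 'Agra'],
    'City2':  ['Achalpur', 'Achhnera', 'Adalaj', 'Adilabad'],
    'States': ['Maharashtra', 'Uttar Pradesh', 'Gujarat', 'Telangana'],
})

# pool every value from the place columns into one set, then test each word
# once instead of scanning the whole frame row by row
place_cols = ['Area', 'City', 'City2', 'States']
places = set(df[place_cols].to_numpy().ravel())

df['Result'] = np.where(df['Words'].isin(places), 'Place', None)
print(df)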
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['Carl','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence includes USA and those whose Residence doesn't, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just posted everything I have so far, but I'm curious if there's a cleaner/smarter way to do this.)
The index in the original data is unique and is not changed by the following code in either DataFrame, so you can use concat to put the two pieces back together and then add them to the original with DataFrame.join or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
#changed order to avoid error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
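To finish the split asked for in the question (USAdata and nonUSAdata), a possible last step is to filter on the recovered Country column. This is only a sketch, and it assumes the first variant above, where the row without country information ends up with NaN in Country:

# rows whose Residence started with USA keep all four address parts
USAdata = df[df['Country'] == 'USA'].drop(columns=['Residence'])

# remaining rows with a known country keep only Country and State
nonUSAdata = (df[df['Country'].notna() & (df['Country'] != 'USA')]
                .drop(columns=['Residence', 'County', 'City']))

print(USAdata)
print(nonUSAdata)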
I have many Excel files that are in different formats. Some of them look like the following, which is the normal case with a single header and can be read into pandas directly.
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Miluakee WI 45458
and some of them are in various formats with multiple headers,
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
This is what it looks like in Excel (I am pasting an image here to show how it actually looks).
As you can see above, there are three different levels of headers. Every file is guaranteed to have a row that starts with First Column.
For an individual file like this, I can read it as below, which is just fine.
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=2)
However, I need a final data frame like this (appended with the files that have only one header):
First Column Second Column Address City State Zip
0 House The Clair 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean it up by keeping only one header row. Had I known the index position of the row that I need as the header, I could simply do something like in this post.
However, since some of those files can have multiple headers (I showed 2 extra headers in the example above; some have 4) in different formats, I want to iterate through each file and set the row that starts with First Column as the header at the beginning of the file.
Additionally, I want to drop the rows in the middle of the file that contain First Column.
After I create cleaned files with headers starting at First Column, I can append each data frame and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
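One way to approach this is to read each file without a header, locate the row that contains First Column, promote it to the header, and keep only the real data rows. The sketch below assumes the files sit in one folder ('mypath' is a placeholder) and that the leading "#" column is numeric only on data rows:

import pandas as pd
from pathlib import Path

frames = []
for path in Path(r'mypath').glob('*.xlsx'):
    raw = pd.read_excel(path, header=None)
    # find the first row that contains "First Column" and promote it to the header
    is_header = raw.apply(
        lambda r: r.astype(str).str.strip().eq('First Column').any(), axis=1)
    header_idx = is_header.idxmax()
    raw.columns = raw.iloc[header_idx]
    body = raw.iloc[header_idx + 1:]
    # keep only data rows: the "#" column is numeric there, while repeated
    # headers and "Table ..." caption rows are not
    body = body[pd.to_numeric(body.iloc[:, 0], errors='coerce').notna()]
    frames.append(body)

combined = pd.concat(frames, ignore_index=True)
print(combined)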
I am working on an assignment where I have two CSV files. The first file contains my whole data (shown below).
First File:
0 ID Name Suburb State Postcode Email Lat Lon
0 0 1 Hurstville Clinic Hurstville NSW 1493 hurstville#myclinic.com.au -33.975869 151.088939
1 1 2 Sydney Centre Clinic Sydney NSW 2000 sydney#myclinic.com.au -33.867139 151.207114
2 2 3 Auburn Clinic Auburn NSW 2144 auburn#myclinic.com.au -33.849322 151.033421
3 3 4 Riverwood Clinic Riverwood NSW 2210 riverwood#myclinic.com.au -33.949859 151.052469
The second file contains the data which I have to replace the first file's Email column with.
I used regex to convert the second file into HTML links.
This is what I've done to clean my data:
def clean(filename):
    df = pd.read_csv(filename)
    # append the clinic domain to any address that does not already contain '#'
    df['Email'] = df['Email'].apply(lambda x: x if '#' in str(x) else str(x) + '#myclinic.com.au')
    return df.to_csv('temp1.csv')
Second File output
Email
which is not correct. The above function is omitting everything before the spaces in the Email column and also omitting any row which has a space before the # in the Email column.
This is what I have to do:
a) Clean the data of file one (two unwanted columns). There are some spaces in the Email column, as some addresses have spaces in the name, and those cannot be read in the final function.
b) The end output that I am getting is not the output that I want; it is omitting 10 rows.
This is what I've done in file2:
emails = re.findall(r'\S+#\S+', text)
for x in range(0, len(emails)):
    # wrap each address in an HTML link (the exact tag was stripped from the
    # post, so this anchor format is a guess)
    emails[x] = '<a href="%s">%s</a>' % (emails[x], emails[x])
emails.insert(0, 'Email')
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in emails:
        writer.writerow([val])
Here, text is a dictionary which contains my whole CSV data. Instead of reading the CSV file, I just copied the whole content into my Python file.
Final Output
ID Name Suburb State Postcode Email_Str Lat Lon
0 1 Hurstville Clinic Hurstville NSW 1493 -33.975869 151.088939
1 2 Sydney Centre Clinic Sydney NSW 2000 -33.867139 151.207114
2 3 Auburn Clinic Auburn NSW 2144 -33.849322 151.033421
3 4 Riverwood Clinic Riverwood NSW 2210 -33.949859 151.052469
4 6 Harrington Clinic Harrington NSW 2427 -31.872153 152.689811
5 9 Benolong Clinic Benolong NSW 2830 -32.413736 148.63938
6 11 Preston Clinic Preston VIC 3072 -37.738736 145.000515
7 13 Douglas Clinic Douglas VIC 3409 -37.842988 144.892631
8 14 Mildura Clinic Mildura VIC 3500 -34.181714 142.163072
9 15 Broadford Clinic Broadford VIC 3658 -37.203001 145.050171
10 16 Officer Clinic Officer VIC 3809 -38.063056 145.40958
11 18 Langsborough Clinic Langsborough VIC 3971 -38.651487 146.675098
12 19 Brisbane Centre Clinic Brisbane QLD 4000 -27.46758 153.027892
13 20 Robertson Clinic Robertson QLD 4109 -27.565733 153.057213
14 22 Ipswich Clinic Ipswich QLD 4305 -27.614604 152.760876
15 24 Caboolture Clinic Caboolture QLD 4510 -27.085007 152.951707
16 25 Booie Clinic Booie QLD 4610 -26.498426 151.935421
17 26 Rockhampton Clinic Rockhampton QLD 4700 -23.378941 150.512323
18 28 Cairns Clinic Cairns QLD 4870 -16.925397 145.775178
19 29 Adelaide Centre Clinic Adelaide SA 5000 -34.92577 138.599732
My data is missing after the final merge.
As you can see, it is missing a lot of rows.
Please help me.
Not sure what the question is, but it seems that you're trying to convert the email text into an email link. You can do this:
df['Email'] = df['Email'].apply(lambda x: '<a href="%s">%s</a>' % (x, x))  # the anchor format is a guess; the tag was stripped from the post
Looks like you are trying to merge two data frames on the Email column so that after the merge you get the Email string from DF2 merged with DF1.
First, we need to create a second column within DF2 that has an email pattern matching DF1 (Email_Str).
Next, we merge DF1 and DF2 on the common Email column.
Further, we remove the unwanted column.
Finally, we rearrange the columns to the desired sequence.
Working code below (using your data):
import pandas as pd
import re
df1 = pd.read_csv('file1.txt', sep=",", engine="python")
df2 = pd.read_csv('file2.txt', sep=",", engine="python")
def get_email(x):
    return ''.join(re.findall(r'"([^"]*)"', x))

df2.columns = ['Email_Str']
df2['Email'] = df2['Email_Str'].apply(get_email)
df2 = df2[['Email', 'Email_Str']]
df3 = pd.merge(df1, df2, on='Email').drop(['Email'], axis=1)
df3 = df3[[u'ID', u'Name', u'Suburb', u'State', u'Postcode',
           u'Email_Str', u'Lat', u'Lon']]
Result:
>>> df3
ID Name Suburb State Postcode \
0 1 Hurstville Clinic Hurstville NSW 1493
1 2 Sydney Centre Clinic Sydney NSW 2000
2 3 Auburn Clinic Auburn NSW 2144
3 4 Riverwood Clinic Riverwood NSW 2210
Email_Str Lat Lon
0 -33.975869 151.088939
1 -33.867139 151.207114
2 -33.849322 151.033421
3 -33.949859 151.052469
>>>
Hope this is what you are looking for.
I figured out what the problem is.
My original data has spaces in the Email column.
So can anyone update my regex function?
def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(lambda x: x if '#' in str(x) else str(x) + '#myclinic.com.au')
    return df.to_csv('temp1.csv')
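One possible tweak (keeping the '#' convention used in the original function) is to strip the spaces from the Email column before appending the domain; the rest of the function stays the same:

import pandas as pd

def clean(filename):
    df = pd.read_csv(filename)
    # drop the stray spaces first, then append the domain where it is missing
    df['Email'] = df['Email'].astype(str).str.replace(' ', '', regex=False)
    df['Email'] = df['Email'].apply(lambda x: x if '#' in x else x + '#myclinic.com.au')
    return df.to_csv('temp1.csv')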