I have a pandas DataFrame of essentially 2 columns and 9000 rows
CompanyName | CompanyAddress
and the address is in the form
Line1, Line2, ..LineN, PostCode
i.e. a varying number of comma-separated items in a string (dtype 'object'), and I want to pull out just the post code, i.e. the item after the last comma in the field
I've tried the dot-notation string manipulation suggestions (possibly badly):
df_address['CompanyAddress'] = df_address['CompanyAddress'].str.rsplit(', ')
which just wrapped '[ ]' around the fields. I had no success trying to isolate the last component of any split-up/partitioned string, and maxsplit kept kicking up errors.
I had a small degree of success following EdChum's comment on Pandas split Column into multiple columns by comma:
pd.concat([df_address[['CompanyName']], df_address['CompanyAddress'].str.rsplit(', ', expand=True)], axis=1)
However, whilst this isolates the post code, it creates multiple columns, and the post code lands anywhere in columns 3-6 depending on the address length... equally no good.
It feels incredibly close, please advise.
EmployerName Address
0 FAUCET INN LIMITED [Union, 88-90 George Street, London, W1U 8PA]
1 CITIBANK N.A [Citigroup Centre,, Canary Wharf, Canada Squar...
2 AGENCY 2000 LIMITED [Sovereign House, 15 Towcester Road, Old Strat...
3 Transform Trust [Unit 11 Castlebridge Office Village, Kirtley ...
4 R & R.C.BOND (WHOLESALE) LIMITED [One General Street, Pocklington Industrial Es...
5 MARKS & SPENCER FINANCIAL SERVICES PLC [Marks & Spencer Financial, Services Kings Mea...
Given the DataFrame,
df = pd.DataFrame({'Name': ['ABC'], 'Address': ['Line1, Line2, LineN, PostCode']})
Address Name
0 Line1, Line2, LineN, PostCode ABC
If you need only the post code, you can extract it using rsplit and re-assign it to the Address column. That saves you the concat step.
df['Address'] = df['Address'].str.rsplit(',').str[-1].str.strip()  # strip() drops the space left after the comma
You get
Address Name
0 PostCode ABC
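Applied to the 9000-row frame from the question, the same idea becomes a one-liner; a minimal sketch (df_address and its column names come from the question, and the extra strip() drops the space left after the last comma):

df_address['PostCode'] = (
    df_address['CompanyAddress']
        .str.rsplit(',', n=1)   # split once, from the right
        .str[-1]                # keep the piece after the last comma
        .str.strip()            # drop the leading space
)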
Edit: Given that you have a DataFrame with the address values wrapped in lists,
df = pd.DataFrame({'Name': ['FAUCET INN LIMITED'], 'Address': [['Union, 88-90 George Street, London, W1U 8PA']]})
Address Name
0 [Union, 88-90 George Street, London, W1U 8PA] FAUCET INN LIMITED
You can get the last element using
df['Address'] = df['Address'].apply(lambda x: x[0].split(',')[-1].strip())
You get
Address Name
0 W1U 8PA FAUCET INN LIMITED
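An equivalent vectorized form, as a sketch (assuming every Address cell is a one-element list, as in the edit above; .str[0] pulls the string out of each list):

df['Address'] = df['Address'].str[0].str.rsplit(',', n=1).str[-1].str.strip()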
Just rsplit the existing column into 2 columns - the existing one and a new one. Or two new ones if you want to keep the existing column intact.
df[['Address', 'PostCode']] = df['Address'].str.rsplit(', ', n=1, expand=True)
Edit: Since OP's Address column is a list with 1 string in it, here is a solution for that specifically:
df[['Address', 'PostCode']] = df['Address'].map(lambda x: x[0]).str.rsplit(', ', n=1, expand=True)
rsplit returns a list; try rsplit(',')[-1] to get the last element of the source line
I have a *.txt file coming from a SQL query organised in rows.
I'm reading it with the pandas library through:
df = pd.read_csv('./my_file_path/my_file.txt', sep='\n', header=0)
df.rename(columns={list(df.columns)[0]: 'cols'}, inplace=True)
The output is rows with the information separated by spaces in a fixed-width structure (dots are meant to be spaces):
name................address........country..........age
0 Bob.Hope............Broadway.......United.States....101
1 Richard.Donner......Park.Avenue....United.States.....76
2 Oscar.Meyer.........Friedrichshain.Germany...........47
I tried to create a dictionary to get the info with list comprehensions:
col_dict = {'name':    [df.cols[i][0:20].strip() for i in range(0, len(df.cols))],
            'address': [df.cols[i][21:36].strip() for i in range(0, len(df.cols))],
            'country': [df.cols[i][36:52].strip() for i in range(0, len(df.cols))],
            'age':     [df.cols[i][53:].strip() for i in range(0, len(df.cols))],
            }
This script runs well and builds a dictionary I can use as the basis for a DataFrame. But I was asking myself whether there is another way to make the script more pythonic, looping directly through a dictionary of the column names and avoiding the repetition of the same code for every column (the actual dataset is much wider).
The question is how I can store the string slice indices to use them later with the column names and parse everything at once.
You can read it directly with pandas:
df = pd.read_csv('./my_file_path/my_file.txt', delim_whitespace=True)
If you know that the space between the columns is going to be at least 2 spaces, you can do it this way:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s{2,}', engine='python')  # a regex separator needs the python engine
In your case, the file is fixed width so you need to use a different method:
from io import StringIO

df = pd.read_fwf(StringIO(my_text), widths=[20, 15, 16, 10], skiprows=1)  # my_text: the raw file contents as a string
The pandas.read_fwf method is what you are looking for.
df = pd.read_fwf('data.txt')
data.txt
name address country age
Bob Hope Broadway United States 101
Richard Donner Park Avenue United States 76
Oscar Meyer Friedrichshain Germany 47
df

   name            address         country        age
0  Bob Hope        Broadway        United States  101
1  Richard Donner  Park Avenue     United States   76
2  Oscar Meyer     Friedrichshain  Germany         47
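To come back to the dictionary-driven slicing the question asked about: a minimal sketch (assuming df.cols holds the raw lines as in the question; col_slices is an illustrative name) that stores each column's character slice once and loops over the dict, instead of repeating the comprehension per column:

import pandas as pd

# the slice boundaries are the ones from the question's col_dict
col_slices = {'name': slice(0, 20), 'address': slice(21, 36),
              'country': slice(36, 52), 'age': slice(53, None)}

# one dict comprehension instead of one hand-written list per column
col_dict = {name: [line[sl].strip() for line in df.cols]
            for name, sl in col_slices.items()}
df_parsed = pd.DataFrame(col_dict)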
I am facing a problem applying fuzzy logic for data cleansing in Python. My data looks something like this:
data = pd.DataFrame({'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd', 'Accenture USA',
                                  'Ernst & young', ' EY', 'Tata Consultancy Services', 'Deloitte Uk'],
                     'Count': ['140', '120', '50', '45', '30', '20', '10', '5']})
data
I am using fuzzy logic to compare the values in the data frame. The final output should have a third column with results like this:
data_out = pd.DataFrame({'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd', 'Accenture USA',
                                      'Ernst & young', ' EY', 'Tata Consultancy Services', 'Deloitte Uk'],
                         'New_Column': ['Deloitte', 'Accenture', 'Accenture', 'Accenture',
                                        'Ernst & young', 'Ernst & young', 'Tata Consultancy Services', 'Deloitte']})
data_out
So, as you can see, I want the less frequent values to get an entry in a new column holding the most frequent value of their group. That is where fuzzy logic is helpful.
Most of your duplicate companies can be detected quite easily using fuzzy string matching. However, the replacement Ernst & young <-> EY is not really similar at all, which is why I am going to ignore that replacement here. This solution uses my library RapidFuzz, but you could implement something similar using FuzzyWuzzy as well (with a little more code, since it does not have the extractIndices processor).
import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_column(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]

    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
                           for employer in employers]

    deduped_employers = employers.copy()
    replaced = []

    for i, (employer, processed_employer) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue

        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)

        for c, _ in duplicates:
            deduped_employers[i+c+1] = employer
            # by replacing the element with an empty string the index from
            # extractIndices stays correct, but it can be skipped a lot
            # faster, since the compared strings will have very different
            # lengths
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)

    data['New_Column'] = deduped_employers

data = pd.DataFrame({
    'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd',
                 'Accenture USA', 'Ernst & young', ' EY',
                 'Tata Consultancy Services', 'Deloitte Uk'],
    'Count': ['140', '120', '50', '45', '30', '20', '10', '5']})

add_deduped_employer_column(data)
print(data)
which results in the following dataframe:
Employer Count New_Column
0 Deloitte 140 Deloitte
1 Accenture 120 Accenture
2 Accenture Solutions Ltd 50 Accenture
3 Accenture USA 45 Accenture
4 Ernst & young 30 Ernst & young
5 EY 20 EY
6 Tata Consultancy Services 10 Tata Consultancy Services
7 Deloitte Uk 5 Deloitte
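If the Count column should be aggregated once duplicates share one name, a small follow-up sketch (an assumption, not part of the question; note Count holds strings in the question's frame, so it is cast first):

data['Count'] = data['Count'].astype(int)
print(data.groupby('New_Column', sort=False)['Count'].sum())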
I have not used fuzzy matching, but I can assist as follows.
Data
df = pd.DataFrame({'Employer': ['Accenture', 'Accenture Solutions Ltd', 'Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df
You did not explain why Tata remains with the full name, hence I assume it is special and mask it:

m = df.Employer.str.contains('Tata')

I then use np.where to strip everything after the first word for the rest:

import numpy as np

df['New_Column'] = np.where(m, df['Employer'],
                            df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))
df
Output

                    Employer                 New_Column
0                  Accenture                  Accenture
1    Accenture Solutions Ltd                  Accenture
2              Accenture USA                  Accenture
3                    hjk USA                        hjk
4  Tata Consultancy Services  Tata Consultancy Services
This is a question that has concerned me for a long time. I have log files that I want to convert to csv. My problem is that the empty fields have been omitted from the log files. I want to end up with a csv file containing all fields.
Right now I'm parsing the log files and writing them to xml, because one of the nice features of Microsoft Excel is that when you open an xml file with a varying number of elements, Excel shows all elements as separate columns.
Last week I came up with the idea that this might be possible with Pandas, but I cannot find a good example to get this done.
Does someone have a good idea how I can get this done?
Updated
I can't share the actual logs here. Below is a fictional sample:
Sample 1:
First : John Last : Doe Address : Main Street Email : j_doe#notvalid.gov Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030
First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female State : TX City : KEMPNER Country : US Phone : 832-600-8133 Bank_Account : 0123456789
regex:
matches = re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)
Sample 2:
:1: John :2: Doe :3: Main Street :4: j_doe#notvalid.gov :5: male :6: TX :7: San Antonio :8: US :9: 210-354-4030
:1: Carolyn :2: Wysong :3: 1496 Hewes Avenue :5: female :6: TX :7: KEMPNER :8: US :9: 832-600-8133 :10: 0123456789
regex:
matches = re.findall(r':(\d+): (.*?) ', line, re.IGNORECASE)
Allow me to concentrate on your first example. Your regex only matches the first word of each field, but let's keep it like that for now as I'm sure you can easily fix that.
You can create a pandas DataFrame to store your parsed data: for each line, run your regex, convert the matches to a dictionary and load it into a pandas Series. Collecting the Series in a list and building the DataFrame in one go avoids appending row by row (DataFrame.append was removed in pandas 2.0). Pandas is smart enough to fill missing data with NaN.

import re
import pandas as pd

rows = []
for l in lines:
    matches = re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)
    rows.append(pd.Series(dict(matches)))
df = pd.DataFrame(rows)
>>> print(df)
     First    Last Address               Email     Sex State     City Country         Phone
0     John     Doe    Main  j_doe#notvalid.gov    male    TX      San      US           NaN
1  Carolyn  Wysong    1496                 NaN  female    TX  KEMPNER      US  832-600-8133
I'm not sure the dict step is needed, maybe there's a pandas way to directly parse your list of tuples.
Then you can easily convert it to csv, you will retain all your columns with empty fields where appropriate.
df.to_csv("result.csv", index=False)
>>> !cat result.csv
First,Last,Address,Email,Sex,State,City,Country,Phone
John,Doe,Main,j_doe#notvalid.gov,male,TX,San,US,
Carolyn,Wysong,1496,,female,TX,KEMPNER,US,832-600-8133
About performance on big files: if you know all the field names in advance, you can initialize the DataFrame with a columns argument and run the parsing and csv saving one chunk at a time. IIRC there's a mode parameter for to_csv that allows appending to an existing file; a sketch of that follows.
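A minimal sketch of that chunked variant, assuming the full column list is known up front and that read_chunks() is a hypothetical helper yielding lists of log lines:

import re
import pandas as pd

# assumption: the complete set of field names is known in advance
COLUMNS = ['First', 'Last', 'Address', 'Email', 'Sex',
           'State', 'City', 'Country', 'Phone', 'Bank_Account']

first = True
for chunk in read_chunks('my.log'):  # hypothetical chunk reader
    rows = [dict(re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE))
            for line in chunk]
    df = pd.DataFrame(rows, columns=COLUMNS)
    # write the header only for the first chunk, then append (mode='a')
    df.to_csv('result.csv', mode='w' if first else 'a',
              header=first, index=False)
    first = False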
I got this in my data frame
name : john,
address : Milton Kings,
phone : 43133241
Concern:
customer complaint about the services is so suck
thank you
How can I process the above to remove the lines of text in the data frame that contain ':'? My objective is to keep only lines like the following.
customer complaint about the services is so suck
Kindly help.
One thing you can do is separate the part of each sentence after the ':'. You can do this by creating a series from your data frame.
Let's say c is your series.
c = pd.Series(df['column'])
s = [c[i].split(':')[1] for i in range(len(c))]
By doing this you will be able to separate each sentence from the part before the colon.
Assuming you want to keep the second part of the sentences, you can use the applymap
method to solve your problem.
import pandas as pd
#Reproduce the dataframe
l = ["name : john",
"address : Milton Kings",
"phone : 43133241",
"Concern : customer complaint about the services is so suck" ]
df = pd.DataFrame(l)
#split on each element of the dataframe, and keep the second part
df.applymap(lambda x: x.split(":")[1].strip())
input :
0
0 name : john
1 address : Milton Kings
2 phone : 43133241
3 Concern : customer complaint about the services is so suck
output :
0
0 john
1 Milton Kings
2 43133241
3 customer complaint about the services is so suck
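For the question as literally asked (dropping every line that contains ':'), a boolean mask may be simpler than splitting; a minimal sketch, assuming the lines sit in a single column named 0 as in the answer above (note it also keeps other colon-free lines such as 'thank you'):

import pandas as pd

df = pd.DataFrame(['name : john,', 'address : Milton Kings,',
                   'phone : 43133241', 'Concern:',
                   'customer complaint about the services is so suck',
                   'thank you'])

# keep only the rows whose text contains no colon
complaints = df[~df[0].str.contains(':')]
print(complaints)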
I have 2 csv files with different column orders. E.g. the first file starts with the 10-digit mobile number column, while that column is fourth in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
File 1:
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
File 2:
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that new files may arrive with a totally different column sequence. In that case I need to extract the 10-digit mobile number and the 6-digit pincode columns. I need to write code that will guess the city column if it matches any entry in a given city list. The new files are expected to have relevant column headings, but the headings may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It has been suggested that I use sed to move the 10-digit column to the beginning. But I also need to rearrange the text columns. E.g. if a column matches the entries in the following list, then it is undoubtedly the model column:
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the above list, then it is the "model" column and should be at position 3, after mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) #the columns will be ordered alphabetically
To get the desired order,
result_df = combined[['mobile', 'pincode', 'model', 'Name', 'Address', 'Location', 'pincode', 'date']]
and then result_df.to_csv('output.csv', index=False) to export to a csv file.
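One wrinkle visible in the question's samples: the two files capitalise their headers differently ('Model' vs 'model', 'Date' vs 'date'), which would give concat two separate columns for the same field. A sketch that normalises the headers first (an assumption on top of the answer above, keeping pincode once):

# lower-case and strip the headers so both files agree before concat
dataframe1.columns = dataframe1.columns.str.strip().str.lower()
dataframe2.columns = dataframe2.columns.str.strip().str.lower()
combined = pd.concat([dataframe1, dataframe2], ignore_index=True)
result_df = combined[['mobile', 'pincode', 'model', 'name',
                      'address', 'location', 'date']]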
For the second one, you can do something like this (assuming you have loaded a csv file into df as above):

match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz',
                              'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']

for c in df:
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
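The same fraction-of-matches idea extends to the 10-digit mobile and 6-digit pincode columns from the second task, with a regex full-match instead of a membership test; a sketch reusing the answer's 10% threshold:

# rename columns whose values mostly look like mobile numbers or pincodes;
# list(...) snapshots the column names before renaming inside the loop
for c in list(df.columns):
    as_text = df[c].astype(str).str.strip()
    if as_text.str.fullmatch(r'\d{10}').mean() > 0.1:
        df.rename(columns={c: 'mobile'}, inplace=True)
    elif as_text.str.fullmatch(r'\d{6}').mean() > 0.1:
        df.rename(columns={c: 'pincode'}, inplace=True)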