How to remove lines in pandas data frame based on specific character

I have this in my data frame:
name : john,
address : Milton Kings,
phone : 43133241
Concern:
customer complaint about the services is so suck
thank you
How can I process the above to remove only the lines of text in the data frame that contain ':'? My objective is to keep only the following line:
customer complaint about the services is so suck
Kindly help.

One thing you can do is take the part of each string that follows the ':'. You can do this by creating a series from your data frame.
Let's say c is your series.
c = df['column']
# text after the first colon; rows without a colon become NaN instead of raising IndexError
s = c.str.split(':', n=1).str[1]
By doing this you will be able to separate your sentence from the colon.

Assuming you want to keep the second part of each sentence, you can use the applymap
method (renamed to DataFrame.map in pandas 2.1) to solve your problem.
import pandas as pd
#Reproduce the dataframe
l = ["name : john",
"address : Milton Kings",
"phone : 43133241",
"Concern : customer complaint about the services is so suck" ]
df = pd.DataFrame(l)
#split on each element of the dataframe, and keep the second part
df.applymap(lambda x: x.split(":")[1])
input :
0
0 name : john
1 address : Milton Kings
2 phone : 43133241
3 Concern : customer complaint about the services is so suck
output :
0
0 john
1 Milton Kings
2 43133241
3 customer complaint about the services is so suck
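If the goal is instead to drop every row that contains a colon, as the question literally asks, a boolean mask is one option. This is only a minimal sketch: the column name text is an assumption, and the rows are copied from the question.

```python
import pandas as pd

# Rows copied from the question; the column name "text" is an assumption.
df = pd.DataFrame({"text": [
    "name : john,",
    "address : Milton Kings,",
    "phone : 43133241",
    "Concern:",
    "customer complaint about the services is so suck",
    "thank you",
]})

# Keep only rows without a colon (regex=False makes this a plain substring test).
no_colon = df[~df["text"].str.contains(":", regex=False)]
print(no_colon["text"].tolist())
```

Note that "thank you" also survives this filter, since it has no colon either; dropping it would need an extra rule.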

Related

Add new column to existing dataframe from substring of existing column

I have a df that looks similar to this:
|email|first_name|last_name|id|group_email|
|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|
I parse out the group_code, the substring after the 3rd hyphen. I now want to add this substring back into the dataframe for each entry, so the df will look like this:
|email|first_name|last_name|id|group_email|group_code|
|-|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|rate|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|factor|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|drive|
How can I go about doing this?

Let's try
df['group_code'] = (df['group_email'].str.extract('(-[^#]*){3}')[0]
.str.lstrip('-'))
print(df)
email first_name last_name id group_email group_code
0 drew#mail.com drew barry 5 san-red-gate-rate#mail.com rate
1 nate#mail.com nate lewis 3 san-blue-gate-factor#mail.com factor
2 chris#mail.com chris ryan 4 san-red-wheels-drive#mail.com drive
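A regex-free variant may be easier to read. The sketch below assumes every group_email has at least three hyphens before the '#'; rows that do not would get NaN.

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({"group_email": [
    "san-red-gate-rate#mail.com",
    "san-blue-gate-factor#mail.com",
    "san-red-wheels-drive#mail.com",
]})

# Part before '#', then everything after the 3rd hyphen.
local_part = df["group_email"].str.split("#", n=1).str[0]
df["group_code"] = local_part.str.split("-", n=3).str[3]
print(df)
```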

Python categorize data in excel based on key words from another excel sheet

I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
|Service|Experience|
|-|-|
|fast|bad|
|slow|easy|
Data Sheet
|Review #|Location|Review|
|-|-|-|
|1|New York|"The service was fast!"|
|2|Texas|"Overall it was a bad experience for me"|
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")

One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then, to get the matches, use str.extractall, aggregate them into a summary, and join the result back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively, for an existence test, group on level=0 and use any:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
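The patterns above insert the keywords into a regular expression as-is. A slightly more defensive sketch of the str.contains loop is shown here; it assumes keywords should match whole words only and case-insensitively, which the question does not state. The frames are rebuilt from the question's example rather than read from Excel.

```python
import re

import pandas as pd

# Frames rebuilt from the example in the question.
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review #": [1, 2],
    "Location": ["New York", "Texas"],
    "Review": ["The service was fast!", "Overall it was a bad experience for me"],
})

for col in cat.columns:
    # Escape each keyword so characters such as '+' or '.' are taken literally.
    keywords = [re.escape(str(k)) for k in cat[col].dropna()]
    # \b keeps "fast" from matching inside "breakfast"; (?:...) avoids capture groups.
    pattern = r"\b(?:" + "|".join(keywords) + r")\b"
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, regex=True)

print(reviews)
```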

Parsing log files and write to csv (different number of fields)

This is a question that has bothered me for a long time. I have log files that I want to convert to csv. My problem is that the empty fields have been omitted in the log files. I want to end up with a csv file containing all fields.
Currently I parse the log files and write them to XML, because one of the nice features of Microsoft Excel is that when you open an XML file with a different number of elements, Excel shows you all elements as separate columns.
Last week I came up with the idea that this might be possible with Pandas, but I cannot find a good example to get this done.
Does anyone have a good idea how I can get this done?
Updated
I can't share the actual logs here. Below a fictional sample:
Sample 1:
First : John Last : Doe Address : Main Street Email : j_doe#notvalid.gov Sex : male State : TX City : San Antonio Country : US Phone : 210-354-4030
First : Carolyn Last : Wysong Address : 1496 Hewes Avenue Sex : female State : TX City : KEMPNER Country : US Phone : 832-600-8133 Bank_Account : 0123456789
regex :
matches = re.findall(r'(\w+) : (.*?) ', line, re.IGNORECASE)
Sample 2:
:1: John :2: Doe :3: Main Street :4: j_doe#notvalid.gov :5: male :6: TX :7: San Antonio :8: US :9: 210-354-4030
:1: Carolyn :2: Wysong :3: 1496 Hewes Avenue :5: female :6: TX :7: KEMPNER :8: US :9: 832-600-8133 :10: 0123456789
regex:
matches = re.findall(r':(\d+): (.*?) ', line, re.IGNORECASE)

Allow me to concentrate on your first example. Your regex only matches the first word of each field, but let's keep it like that for now as I'm sure you can easily fix that.
You can run your regexp on each line, convert the matches to a dictionary and collect the dictionaries in a list. Then you build a pandas DataFrame from that list. Pandas is smart enough to fill missing data with NaN.
rows = []
for l in lines:
    matches = re.findall(r'(\w+) : (.*?) ', l, re.IGNORECASE)
    rows.append(dict(matches))
df = pd.DataFrame(rows)
>>> print(df)
First Last Address Email Sex State City Country Phone
0 John Doe Main j_doe#notvalid.gov male TX San US NaN
1 Carolyn Wysong 1496 NaN female TX KEMPNER US 832-600-8133
I'm not sure the dict step is needed, maybe there's a pandas way to directly parse your list of tuples.
Then you can easily convert it to csv; you will retain all your columns, with empty fields where appropriate.
df.to_csv("result.csv", index=False)
>>> !cat result.csv
First,Last,Address,Email,Sex,State,City,Country,Phone
John,Doe,Main,j_doe#notvalid.gov,male,TX,San,US,
Carolyn,Wysong,1496,,female,TX,KEMPNER,US,832-600-8133
Regarding performance on big files: if you know all the field names in advance, you can initialize the dataframe with a columns argument and run the parsing and csv saving one chunk at a time. IIRC there's a mode parameter for to_csv that should allow you to append to an existing file.
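The chunked idea could be sketched roughly as follows. The FIELDS list, the helper names parse_line and write_in_chunks, and the lookahead pattern are all assumptions made for illustration; the pattern is a variant of the asker's regex that lets a value run until the next "key : " or the end of the line, so multi-word values are kept.

```python
import re

import pandas as pd

# Assumed: the complete list of field names is known in advance.
FIELDS = ["First", "Last", "Address", "Email", "Sex", "State",
          "City", "Country", "Phone", "Bank_Account"]

# Assumed pattern: a value ends where the next " key : " starts, or at end of line.
FIELD_RE = re.compile(r"(\w+) : (.*?)(?= \w+ : |$)")


def parse_line(line):
    """Turn one log line into a dict of field name -> value."""
    return dict(FIELD_RE.findall(line.strip()))


def write_in_chunks(lines, path, chunk_size=1000):
    """Parse the lines chunk by chunk and append each chunk to one CSV file."""
    for start in range(0, len(lines), chunk_size):
        rows = [parse_line(line) for line in lines[start:start + chunk_size]]
        # reindex gives every chunk the same columns in the same order.
        chunk = pd.DataFrame(rows).reindex(columns=FIELDS)
        first = start == 0
        # Write the header once, then append the later chunks without it.
        chunk.to_csv(path, mode="w" if first else "a", header=first, index=False)
```

Calling write_in_chunks(lines, "result.csv") would then produce one file with all ten columns, leaving a field empty wherever a line omitted it.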

Extract last term after comma into new column

I have a pandas dataframe which is essentially 2 columns and 9000 rows
CompanyName | CompanyAddress
and the address is in the form
Line1, Line2, ..LineN, PostCode
i.e. basically different numbers of comma-separated items in a string (or dtype 'object'), and I want to just pull out the post code i.e. the item after the last comma in the field
I've tried the Dot notation string manipulation suggestions (possibly badly):
df_address['CompanyAddress'] = df_address['CompanyAddress'].str.rsplit(', ')
which just put '[ ]' around the fields - I had no success trying to isolate the last component of any split-up/partitioned string, with maxsplit kicking up errors.
I had a small degree of success following EdChum's comment on "Pandas split Column into multiple columns by comma":
pd.concat([df_address[['CompanyName']], df_address['CompanyAddress'].str.rsplit(', ', expand=True)], axis=1)
However, whilst isolating the Postcode, this just creates multiple columns and the post code is in columns 3-6... equally no good.
It feels incredibly close, please advise.
EmployerName Address
0 FAUCET INN LIMITED [Union, 88-90 George Street, London, W1U 8PA]
1 CITIBANK N.A [Citigroup Centre,, Canary Wharf, Canada Squar...
2 AGENCY 2000 LIMITED [Sovereign House, 15 Towcester Road, Old Strat...
3 Transform Trust [Unit 11 Castlebridge Office Village, Kirtley ...
4 R & R.C.BOND (WHOLESALE) LIMITED [One General Street, Pocklington Industrial Es...
5 MARKS & SPENCER FINANCIAL SERVICES PLC [Marks & Spencer Financial, Services Kings Mea...

Given the DataFrame,
df = pd.DataFrame({'Name': ['ABC'], 'Address': ['Line1, Line2, LineN, PostCode']})
Address Name
0 Line1, Line2, LineN, PostCode ABC
If you need only the post code, you can extract it using rsplit and re-assign it to the column Address. It will save you the concat step.
df['Address'] = df['Address'].str.rsplit(',', n=1).str[-1].str.strip()
You get
Address Name
0 PostCode ABC
Edit: Given that you have a dataframe with address values in a list
df = pd.DataFrame({'Name': ['FAUCET INN LIMITED'], 'Address': [['Union, 88-90 George Street, London, W1U 8PA']]})
Address Name
0 [Union, 88-90 George Street, London, W1U 8PA] FAUCET INN LIMITED
You can get last element using
df['Address'] = df['Address'].apply(lambda x: x[0].split(',')[-1].strip())
You get
Address Name
0 W1U 8PA FAUCET INN LIMITED

Just rsplit the existing column into 2 columns: the existing one and a new one. Or two new ones if you want to keep the existing column intact.
df[['Address', 'PostCode']] = df['Address'].str.rsplit(', ', n=1, expand=True)
Edit: Since OP's Address column is a list with 1 string in it, here is a solution for that specifically:
df[['Address', 'PostCode']] = df['Address'].map(lambda x: x[0]).str.rsplit(', ', n=1, expand=True)

rsplit returns a list; try rsplit(',')[-1] to get the last element of the source line.
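For addresses with varying numbers of comma-separated parts, str.rpartition is another option, since it always splits at the last separator. A minimal sketch follows; the second address is hypothetical, and the new column names PostCode and AddressLines are assumptions.

```python
import pandas as pd

# First address is from the question; the second is a hypothetical example.
df = pd.DataFrame({
    "CompanyName": ["FAUCET INN LIMITED", "EXAMPLE CO"],
    "CompanyAddress": [
        "Union, 88-90 George Street, London, W1U 8PA",
        "1 Example Road, Leeds, LS1 1AA",
    ],
})

# rpartition yields three columns: text before the last comma, the comma, text after it.
parts = df["CompanyAddress"].str.rpartition(",")
df["PostCode"] = parts[2].str.strip()
df["AddressLines"] = parts[0].str.strip()
print(df[["CompanyName", "PostCode"]])
```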

Rearrange CSV data

I have 2 csv files with different column orders. For example, the first file starts with the 10-digit mobile numbers, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that new files may arrive with a totally different column sequence. In that case I need to extract the 10-digit mobile number and 6-digit pincode columns. I also need code that will guess the city column if it matches any entry in a given city list. The new files are expected to have relevant column headings, but a heading may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It was suggested that I use sed to move the 10-digit column to the beginning. But I also need to rearrange the text columns. For example, if a column matches the entries in the following list, then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the above list, then it is a "model" column and should be at position 3, after mobile and pincode.

For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2])  # stacks the rows; columns are matched by name
To get the desired order (the names must match the file headers exactly),
result_df = combined[['mobile', 'pincode', 'Model', 'Name', 'Address', 'Location', 'Date']]
and then result_df.to_csv('output.csv', index=False) to export to a csv file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
    # fraction of values in this column that are known model names
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
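The same idea can be extended to the other columns. The sketch below is one possible approach, not a definitive one: the helper names, the CITIES set, the 0.8 similarity cutoff and the order of the checks are all assumptions, and it expects the file to be loaded with dtype=str so that numbers stay as text. Columns are first guessed from their content (10-digit mobile, 6-digit pincode, known models, known cities) and otherwise from their heading, using a substring test and then difflib for near-misses.

```python
import difflib

import pandas as pd

# Target order from the question; CITIES is an illustrative assumption.
CANONICAL = ["mobile", "pincode", "Model", "Name", "Address", "Location", "Date"]
CITIES = {"Mumbai", "Hyderabad", "Chennai", "Bangalore"}
MODELS = {"Crz", "Bmw 320 D", "Benz", "Mercedes_E 350 Cdi", "Toyota_Corolla He 1.8"}


def match_header(header):
    """Map a loosely named heading to a canonical name, or return None."""
    lowered = header.lower()
    by_lower = {name.lower(): name for name in CANONICAL}
    for low, name in by_lower.items():
        if low in lowered:  # e.g. "customer address" contains "address"
            return name
    close = difflib.get_close_matches(lowered, list(by_lower), n=1, cutoff=0.8)
    return by_lower[close[0]] if close else None


def guess_column(header, series, threshold=0.1):
    """Guess a canonical name from the column's values, else from its heading."""
    values = series.dropna().astype(str).str.strip()
    if not values.empty:
        if values.str.fullmatch(r"\d{10}").mean() > threshold:
            return "mobile"
        if values.str.fullmatch(r"\d{6}").mean() > threshold:
            return "pincode"
        if values.isin(MODELS).mean() > threshold:
            return "Model"
        if values.isin(CITIES).mean() > threshold:
            return "Location"
    return match_header(header)


def normalise(df):
    """Rename recognised columns and return them in the canonical order."""
    mapping = {}
    for col in df.columns:
        guess = guess_column(col, df[col])
        if guess and guess not in mapping.values():
            mapping[col] = guess
    renamed = df.rename(columns=mapping)
    return renamed[[name for name in CANONICAL if name in renamed.columns]]
```

Columns that cannot be recognised are dropped by normalise in this sketch; keeping them at the end would be a small change to the last line.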
