Extract information / clean data from a CSV file using Python

I have a dataset provided as properties.csv (4000 rows and 6 columns). The CSV file includes many features; some of these features are numerical and some are nominal (they contain text). Suppose the features in this dataset are:
id
F1
F2
F3
F4
Price
Examples of the content of each feature:
id (row 1 to 3 in CSV File) ---> 44525
44859
45465
F1 (row 1 to 3 in CSV File) ---> "Stunning 6 bedroom villa in the heart of the
Golden Mile, Marbella"
"Villa for sale in Rocio de Nagüeles, Marbella
Golden Mile"
"One level 5 bedroom villa for sale in
Nagüeles"
F2 (row 1 to 3 in CSV File) ---> "Fireplace, Elevator, Terrace, Mountain view,
Freight Elevator, Air conditioning, Patio,
Guest toilet, Garden, Balcony, Sea/lake view,
Built-in kitchen"
"Mountain view"
"Elevator, Terrace, Alarm system, Mountain
view, Swimming pool, Air conditioning,
Basement, Sea/lake view"
F3 (row 1 to 3 in CSV File) - contains numerical values ---> 0
0
0
F4 (row 1 to 3 in CSV File) - contains numerical values ---> 393
640
4903
Price (row 1 to 3 in CSV File) - contains numerical values ---> 4400000
2400000
1900000
In F1, I am looking to do the following:
1- Extract the type of the property ('apartment', 'house' or 'villa') and put it in a separate feature (independent variable) called "Type" in the CSV file. After that, I want to separate the properties into groups (apartments group, houses group, villas group) and calculate the mean price of each type group.
2- Extract the location of each property (locations can be: Alenquer, Quinta da Marinha, Golden Mile, Nagüeles) and put it in a separate feature (independent variable) called "Location" in the CSV file.
I am a beginner in NLP. I tried to write this code to extract the information "Apartment" from F1, but it does not work properly:
import pandas as pd
from pandas import DataFrame
import re

properties = pd.read_csv(r'C:/Users/User/Desktop/properties.csv')

# Extract "Apartment" from F1
Title = DataFrame(properties, columns=['F1'])
for line in F1:
    # return list of apartments in that line
    x = re.findall("\apartment", line)
    # if a match is found
    if len(x) != 0:
        print(x)
I need your help to fix this code, and to know what I should do to extract the other types ('house' and 'villa') from F1.
After that, I want to create a property dataset in this format and save it as a CSV file:
id
Location (Information extracted from F1)
Type (information extracted from F1, grouped as 'apartments', 'houses', 'villas')
F1
F2
F3
F4
Price
In case F1 does not contain the type of some properties (blank field, no text), what should I do to deal with those blank fields in F1 and infer the type of those properties from the other features?

Here is a solution:
import pandas as pd
import re

df = pd.read_csv('appt_info.csv', delimiter=';')

def extract_housing_type(text):
    # Do a regular expression search for the pattern
    match = re.search('(apartment|house|villa)s?', text, flags=re.I)
    if match is not None:
        return match.group(0)  # return the value of the match
    return 'Unknown'  # return a default value if there is no match

df['Type'] = df.F1.apply(extract_housing_type)  # assign the output to a new column
This should give you a dataframe that looks like this:
id F1 \
0 44525 Stunning 6 bedroom villa in the heart of the G...
1 44859 Villa for sale in Rocio de Nagüeles, Marbella G...
2 45465 One level 5 bedroom villa for sale in Nagüeles
F2 F3 F4 Price Type
0 Fireplace, Elevator, Terrace, Mountain view, F... 0 393 4400000 villa
1 Mountain view 0 640 2400000 Villa
2 Elevator, Terrace, Alarm system, Mountain view... 0 4903 1900000 villa
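The same idea extends to the other parts of the question. Below is a minimal sketch, not the only way to do it, assuming the column names from the question; the four locations are the ones listed above, and rows where F1 is blank fall back to 'Unknown':
import pandas as pd
import re

df = pd.read_csv('properties.csv')

def extract_housing_type(text):
    # guard against blank/missing F1 values
    if not isinstance(text, str) or not text.strip():
        return 'Unknown'
    match = re.search('(apartment|house|villa)', text, flags=re.I)
    return match.group(0).lower() if match else 'Unknown'

def extract_location(text):
    # known locations from the question; the first one found wins
    locations = ['Alenquer', 'Quinta da Marinha', 'Golden Mile', 'Nagüeles']
    if isinstance(text, str):
        for loc in locations:
            if loc.lower() in text.lower():
                return loc
    return 'Unknown'

df['Type'] = df['F1'].apply(extract_housing_type)
df['Location'] = df['F1'].apply(extract_location)

# mean price per type group ('Villa' and 'villa' are normalized by .lower())
print(df.groupby('Type')['Price'].mean())

# save in the requested column order
cols = ['id', 'Location', 'Type', 'F1', 'F2', 'F3', 'F4', 'Price']
df[cols].to_csv('properties_clean.csv', index=False)
For the blank rows, one simple option afterwards is to assign the most frequent type, e.g. df.loc[df['Type'] == 'Unknown', 'Type'] = df['Type'].mode()[0]; a smarter option is to run the same search over F2.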

Related

CSV: find best match/closest value by several parameters in 2 different CSV files?

I have code that searches for the closest value between 2 CSV files. It reads a CSV file called "common_list" containing a small database that looks like this:
common_name  common_Price  common_Offnet  common_Traffic
name1        1300          250            13000
name2        1800          350            18000
The code puts these CSV rows into a list and then creates NumPy arrays.
import numpy as np
import pandas as pd

common_list = pd.read_csv("common_list.csv")
common_list_offnet = common_list["common_Offnet"].to_list()
common_list_traffic = common_list["common_Traffic"].to_list()
array_offnet = np.array(common_list_offnet)
array_traffic = np.array(common_list_traffic)
array = np.column_stack((array_offnet, array_traffic))
We use this CSV file as a database for available cell phone plans (name of the plan, price, offnet calls, and internet traffic).
Then, the code reads another CSV file called "By_ARPU" with 100k+ rows with users and how they use their cell phone plans (how much money they spend (the price of plan), how much offnet calls, and traffic). The headers of this CSV file look like this:
User ID
ARPU_AVERAGE
Offnet Calls
Traffic (MB)
where ARPU_AVERAGE corresponds to the amount of money users spend (the price they pay). The code finds the closest value between the CSV files by 2 parameters: Offnet calls and Traffic (MB).
csv_data = pd.read_csv("By_ARPU.csv")
data = csv_data[['Offnet Calls', 'Traffic (MB)']]
data = data.to_numpy()
sol = []
for target in data:
    dist = np.sqrt(np.square(array[:, np.newaxis] - target).sum(axis=2))
    idx = np.argmin(dist)
    sol.append(idx)

csv_data["Suggested Plan [SP]"] = common_list['common_name'][sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'][sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'][sol].values
csv_data.to_csv('4.7 final.csv', index=False, header=True)
It finds the closest value in the database and shows the name and the corresponding offnet calls and traffic. For example, if in the file "By_ARPU" the values for Offnet Calls and Traffic (MB) were 250 and 13000 respectively, it will show the name of the closest entry from "common_list", which is name1.
I want to create additional code for the same search but with 3 parameters. You can see that the first database "common_list" has 3 parameters: common_Price, common_Offnet and common_Traffic. In the previous code, we found the closest value by 2 of them.
The corresponding columns in the two CSV files are: "common_Offnet" from common_list - "Offnet Calls" from By_ARPU, and "common_Traffic" from common_list - "Traffic (MB)" from By_ARPU.
And I want:
Find the closest value by 3 parameters: Price, Offnet Calls and Traffic. The column which corresponds to the price in the "By_ARPU" file is called "ARPU_AVERAGE".
Please help to modify the code to find the closest value by searching over those 3 parameters instead of 2.
Input data:
>>> plans
common_name common_Price common_Offnet common_Traffic
0 plan1 1300 250 13000
1 plan2 1800 350 18000
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB)
0 Louis 1300 250 13000 # plan1 for sure (check 1)
1 Paul 1800 350 18000 # plan2, same values (check 2)
2 Alex 1500 260 14000 # plan1, probably
Create your matching function:
def rmse(user, plans):
    u = user[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values.astype(float)
    p = plans[['common_Price', 'common_Offnet', 'common_Traffic']].values
    plan = np.sqrt(np.square(np.subtract(p, u)).mean(axis=1)).argmin()
    return plans.iloc[plan]['common_name']

df['Best Plan'] = df.apply(rmse, axis="columns", plans=plans)
Output:
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB) Best Plan
0 Louis 1300 250 13000 plan1
1 Paul 1800 350 18000 plan2
2 Alex 1500 260 14000 plan1
Edit: Full code with your variable names:
import numpy as np
import pandas as pd

common_list = pd.read_csv("common_list.csv")
csv_data = pd.read_csv("By_ARPU.csv")

array = common_list[['common_Price', 'common_Offnet', 'common_Traffic']].values
data = csv_data[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values

find_the_best_plan = lambda target: np.sqrt(np.square(np.subtract(array, target)).mean(axis=1)).argmin()
sol = np.apply_along_axis(find_the_best_plan, 1, data)

csv_data["Suggested Plan [SP]"] = common_list['common_name'].iloc[sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'].iloc[sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'].iloc[sol].values

Save full dataframe into txt file

I want to save the dataframe into a txt file.
#create the dataframe of the timetable ('timeline' is a dict defined earlier in the script)
import pandas as pd

timeline_table = pd.DataFrame.from_dict(timeline, orient='index', columns=['events'])
a = timeline_table.reset_index()
r = a.rename(columns={"index": "years"})
final_table = r.shift()[1:]
print(final_table)

#save the timeline/timetable into a new file
path = "data.txt"
with open(path, mode='w') as output:
    output.write(f'{final_table}')
The output is like this:
1 1000 About thirty countries maintain about seventy ...
2 1270 from which derived the Old French pole antarti...
3 1391 from which derived the Old French pole antarti...
4 1773 European maps continued to show this hypotheti...
5 1775 European maps continued to show this hypotheti...
.. ... ...
57 2014 In 1985, three British scientists working on d...
58 2015 With the ban of CFCs in the Montreal Protocol ...
59 2018 which operated its own scientific station—Worl...
I want all the rows containing sentences corresponding to the extracted years to be fully displayed in the output file.
Is there any way to do that?
Set pandas' column-width display option to unlimited before printing or writing, so long cells are not truncated (recent pandas versions use None here; the older -1 is deprecated):
pd.set_option('display.max_colwidth', None)
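That said, writing the frame's string representation to a text file is fragile. If the file only needs to hold the full data rather than a pretty-printed table, exporting it directly avoids truncation entirely; a minimal sketch, assuming final_table from the question:
# tab-separated export keeps every cell intact, no display options needed
final_table.to_csv("data.txt", sep="\t", index=False)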

Best Python looping system for merging pandas DataFrame rows for export

I am a self-teaching data science student, currently doing my first big Python portfolio project in several steps, the first of which is using pandas to work with IMDb [Internet Movie Database]'s rather oddly structured .tsv files in an effort to create a fully searchable big data repository of all IMDb data (the officially supported searches and even APIs like OMDB (Open Movie Database) don't allow for the kinds of detailed queries I need to do for the larger project).
The structure of IMDb's public files is that they include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive tsv files. I've confirmed that pandas can, in fact, read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven tsv files into a single DataFrame object which can then be exported to (preferably) a SQL database or even a huge spreadsheet/another TSV file but larger.
Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, which in one file is labeled "titleId", a string. In every other file it is labeled "tconst", also a string. I'm going to need to rename titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
#set pandas formatting parameters
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)

#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv", sep='\t')

#temporary hack - print the entire dataframe as a test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other tsv files to attempt to locally reconstruct the imdb database. This means that I don't want to have duplicate tconst strings, but rather to end up with new information about a tconst entry (like a film) appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
The IMDb files are actually highly structured, and looping is always a bad approach for merging data. The approach taken here:
- structured data sourcing: wget is used rather than manually downloading the files
- the files are large, so work with a subset for modelling purposes; a handful of popular movies and actors are used as the driver
- some columns in the tsv files are actually sub-tables (comma-separated lists). Treat them as such; a reference entity nmi is built to do this
- there are other associative relationships as well: primaryProfession, genres
- finally, join (merge) everything together from OMDB and IMDb, taking the first row where many items associate to a title
- the data is currently left as tsv; it would clearly be very simple to put it into a database using the to_sql() method (a minimal sketch follows the transform step below). The main point is sourcing and transformation, aka ETL, which has become an unfashionable term. This could be further supplemented with web scraping; I looked at Box Office Mojo, but that would require selenium to scrape as it is dynamic HTML.
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget, gzip
from pathlib import Path
import numpy as np

# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data; on the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs = {}

# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}

# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9]))
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")

# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])

# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:, ["nconst","knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles": "tconst"}).drop_duplicates()

# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()

for k in [k for k in files.keys() if k not in ["name.basics", "omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
        if k == "title.akas":
            dfs[k] = dfs[k].rename(columns={"titleId": "tconst"})
        # subset titles to those we have names for
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")

dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"

if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
    dfs[omdbk] = dfs[omdbk].astype({c: "Int64" for c in dfs[omdbk].columns}, errors="ignore")

k = "title.basics"
# limited to 1000 API calls a day, so only fetch what has not been fetched already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}  # apikey: your own OMDb API key, defined elsewhere
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])

dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purposes - just pick the first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569", "tt0944947"])
demo = (dfs[omdbk].loc[mask]
        .rename(columns={c: f"OMDB{c}" for c in dfs[omdbk].columns})
        .rename(columns={"OMDBimdbID": "tconst"})
        .merge(dfs["title.basics"], on="tconst")
        .merge(dfs["title.ratings"], on="tconst")
        .merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
               left_on="tconst", right_on="parentTconst", how="left", suffixes=("", "_ep"))
        .merge(dfs["nmi"]
               .merge(dfs["name.basics"], on="nconst")
               .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("", "_name"))
        ).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc@._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department

How to extract specific codes from string in separate columns?

I have data in the following format.
Data
Data Sample Excel
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I Know the regular expression to be used to match this which is
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes represented like Z01.2, E11, etc. The Above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in python code to do the above task.
The problems that I am trying to solve are:
Count the total number of codes assigned across all patients.
Count the total number of UNIQUE codes assigned (since multiple patients might have the same code assigned).
Generate data code-wise, i.e. if I select code Z01.2, I want to extract the data of patients (maybe PATID, MOBILE NUMBER, ANY OTHER COLUMN, or ALL) who have been assigned this code.
Thanks in advance.
Using Python Pandas as follows.
Code
import pandas as pd
import re

df = pd.read_csv("data.csv", delimiter='\t')

pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")

all_codes = df['CODES'].sum()  # concatenates the per-row lists into one list
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')

# Select patients with code Z01.2
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2')]

# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported the data as tab-delimited CSV:
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1@gmail.com F 51 Y
1 22 Sac 98766 ab1@gmail.com F 24 Y
2 33 Tac 98767 ab1@gmail.com M 43 Y
3 44 Lac 98768 ab1@gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before the dot; otherwise it would be a regex wildcard and match any character (it makes no difference on the data supplied).
pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES'] -- each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length'] -- corresponds to the length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50  ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50  ', 'E11  ', 'E78.5', 'I10  ', 'E55  ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10  ', 'J06.9', 'K60.1', 'E11  ', 'Z24.0', 'Z10.0', 'E55  ', 'E78.5', 'Z20.9', 'C50  '}
Count: 14
Select patients by code (i.e. Z01.2)
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
Show PATID, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768
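One caveat on the selection step: str.contains treats its argument as a regular expression by default, so the dot in 'Z01.2' matches any character and could in principle match an unintended code. Passing regex=False makes it a literal substring test:
# literal match instead of regex, so the dot is taken verbatim
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2', regex=False)]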

Rearrange CSV data

I have 2 CSV files with different column sequences. For example, the first file starts with 10-digit mobile numbers, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that the new files may have a totally different column sequence. In that case I need to extract the 10-digit mobile number and 6-digit pincode columns. I need to write code that will guess the city column if it matches any entry in a given city list. The new files are expected to have relevant column headings, but the heading may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It was suggested that I use sed to move the 10-digit column to the beginning. But I also need to rearrange the text columns. For example, if a column matches the entries in the following list, then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If any column matches 10% of its entries against the above list, then it is a "model" column and should be at position 3, followed by mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) #the columns will be ordered alphabetically
To get desired order,
result_df = combined[['mobile', 'pincode', 'model', 'Name', 'Address', 'Location', 'pincode', 'date']]
and then result_df.to_csv('output.csv', index=False) to export to a csv file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
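In the same spirit, the mobile and pincode columns from the second task can be guessed by content rather than by heading. A minimal sketch, assuming the entries are plain digit strings and reusing the 10% threshold from above (guess_numeric_column is a hypothetical helper, not an existing API):
import re

def guess_numeric_column(df, n_digits, threshold=0.1):
    # return the first column where more than `threshold` of the entries
    # consist of exactly n_digits digits; None if no column qualifies
    pat = re.compile(r'\d{%d}' % n_digits)
    for c in df:
        hits = df[c].astype(str).map(lambda v: bool(pat.fullmatch(v.strip()))).sum()
        if hits / len(df) > threshold:
            return c
    return None

mobile_col = guess_numeric_column(df, 10)   # 10-digit mobile numbers
pincode_col = guess_numeric_column(df, 6)   # 6-digit pincodes
df = df.rename(columns={mobile_col: 'mobile', pincode_col: 'pincode'})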
