I have code that searches for the closest value between 2 CSV files. It reads a CSV file called "common_list" containing a small database, which looks like this:
common_name,common_Price,common_Offnet,common_Traffic
name1,1300,250,13000
name2,1800,350,18000
The code puts these CSV rows into a list and then creates NumPy arrays.
import pandas as pd
import numpy as np

common_list = pd.read_csv("common_list.csv")
common_list_offnet = common_list["common_Offnet"].to_list()
common_list_traffic = common_list["common_Traffic"].to_list()
array_offnet = np.array(common_list_offnet)
array_traffic = np.array(common_list_traffic)
array = np.column_stack((array_offnet,array_traffic))
We use this CSV file as a database for available cell phone plans (name of the plan, price, offnet calls, and internet traffic).
Then, the code reads another CSV file called "By_ARPU" with 100k+ rows with users and how they use their cell phone plans (how much money they spend (the price of plan), how much offnet calls, and traffic). The headers of this CSV file look like this:
User ID
ARPU_AVERAGE
Offnet Calls
Traffic (MB)
where ARPU_AVERAGE corresponds to the amount of money users spend (the price they pay). The code finds the closest value between the CSV files by 2 parameters: Offnet calls and Traffic (MB).
csv_data = pd.read_csv("By_ARPU.csv")
data = csv_data[['Offnet Calls', 'Traffic (MB)']]
data = data.to_numpy()
sol = []
for target in data:
    dist = np.sqrt(np.square(array[:, np.newaxis] - target).sum(axis=2))
    idx = np.argmin(dist)
    sol.append(idx)
csv_data["Suggested Plan [SP]"] = common_list['common_name'][sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'][sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'][sol].values
csv_data.to_csv('4.7 final.csv', index=False, header=True)
It finds the closest value in the database and shows the name and the corresponding offnet calls and traffic. For example, if in the file "By_ARPU" the values for Offnet Calls and Traffic (MB) were 250 and 13000 respectively, it will show the name of the closest match from "common_list", which is name1.
I wanted to create additional code for the same search but with 3 parameters. You can see that the first database, "common_list", has 3 parameters: common_Price, common_Offnet and common_Traffic. In the previous code, we found the closest value using only 2 of them.
The corresponding columns across the two CSV files were: "common_Offnet" from common_list matches "Offnet Calls" from By_ARPU, and "common_Traffic" from common_list matches "Traffic (MB)" from By_ARPU.
And I want to:
Find the closest value by 3 parameters: Price, Offnet Calls and Traffic. The column which corresponds to the price in the "By_ARPU" file is called "ARPU_AVERAGE".
Please help me modify the code to find the closest value by searching on those 3 parameters instead of 2.
Input data:
>>> plans
common_name common_Price common_Offnet common_Traffic
0 plan1 1300 250 13000
1 plan2 1800 350 18000
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB)
0 Louis 1300 250 13000 # plan1 for sure (check 1)
1 Paul 1800 350 18000 # plan2, same values (check 2)
2 Alex 1500 260 14000 # plan1, probably
Create your matching function:
def rmse(user, plans):
    u = user[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values.astype(float)
    p = plans[['common_Price', 'common_Offnet', 'common_Traffic']].values
    plan = np.sqrt(np.square(np.subtract(p, u)).mean(axis=1)).argmin()
    return plans.iloc[plan]['common_name']
df['Best Plan'] = df.apply(rmse, axis="columns", plans=plans)
Output:
>>> df
User ID ARPU_AVERAGE Offnet Calls Traffic (MB) Best Plan
0 Louis 1300 250 13000 plan1
1 Paul 1800 350 18000 plan2
2 Alex 1500 260 14000 plan1
Edit: Full code with your variable names:
import pandas as pd
import numpy as np

common_list = pd.read_csv("common_list.csv")
csv_data = pd.read_csv("By_ARPU.csv")

array = common_list[['common_Price', 'common_Offnet', 'common_Traffic']].values
data = csv_data[['ARPU_AVERAGE', 'Offnet Calls', 'Traffic (MB)']].values

find_the_best_plan = lambda target: np.sqrt(np.square(np.subtract(array, target)).mean(axis=1)).argmin()
sol = np.apply_along_axis(find_the_best_plan, 1, data)
csv_data["Suggested Plan [SP]"] = common_list['common_name'].iloc[sol].values
csv_data["SP: Offnet Minutes"] = common_list['common_Offnet'].iloc[sol].values
csv_data["SP: Traffic"] = common_list['common_Traffic'].iloc[sol].values
I am a self-taught data science student, currently doing my first big Python portfolio project in several steps. The first step is using pandas to work with IMDb [Internet Movie Database]'s rather oddly structured .tsv files in an effort to create a fully searchable big-data repository of all IMDb data (the officially supported searches, and even APIs like OMDb [Open Movie Database], don't allow the kinds of detailed queries I need for the larger project).
The structure of IMDb's public files is that they include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive tsv files. I've confirmed that pandas can, in fact, read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven tsv files into a single DataFrame object which can then be exported to (preferably) a SQL database, or even a huge spreadsheet or another, larger TSV file.
Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, a string. In one file this column is labelled "titleId"; in every other file it is labelled "tconst". I'm going to need to rename "titleId" to "tconst" when I read that file in; this is one of several challenges I haven't got to yet.
import pandas as pd

#set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)
#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv",sep='\t')
#temporary hack - print the entire dataframe as test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other tsv files to attempt to locally reconstruct the imdb database. This means that I don't want to have duplicate tconst strings, but rather to end up with new information about a tconst entry (like a film) appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
The IMDb files are actually highly structured. Looping is always a bad approach to merging data.
structured data sourcing - I used wget rather than sourcing the files manually
the files are large, so work with a subset for modelling purposes. I just used popular movies and actors as the driver
some columns in the tsv files are actually sub-tables. Treat them as such. I build a reference entity nmi to do this
there are other associative relationships there as well: primaryProfession, genres
finally, join (merge) everything together from OMDb and IMDb, taking the first row where many items associate to a title
I have left the data as tsv; clearly it would be very simple to put it into a database using the to_sql() method. The main point is sourcing and transformation, aka ETL, which has become an unfashionable term. This can be further supplemented with web scraping. I looked at Box Office Mojo, however scraping it would require selenium as it's dynamic HTML.
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np
# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data. On the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs={}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}
# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
    if subsetdata:
        # manage down size of nmi
        # the ten knownForTitles checks collapse into one regex over the same ten ids
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains("|".join(tm["tconst"][:10])))
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])
# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:,["nconst","knownForTitles"]]
.assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
.explode("knownForTitles")
).rename(columns={"knownForTitles":"tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
        if k == "title.akas": dfs[k] = dfs[k].rename(columns={"titleId":"tconst"})
        # subset titles to those we have names for
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")
dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
dfs[omdbk] = dfs[omdbk].astype({c:"Int64" for c in dfs[omdbk].columns}, errors="ignore")
k = "title.basics"
# limited to 1000 API calls a day, so only fetch if have not done already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
# tt0109830 movie Forrest Gump
# http://www.omdbapi.com/?i=tt3896198&apikey=xxx
params={"apikey":apikey,"i":tconst,"plot":"full"}
res = requests.get("http://www.omdbapi.com/", params=params)
if res.status_code!=200:
print("breached API limit")
break
else:
dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
.rename(columns={c:f"OMDB{c}" for c in dfs[omdbk].columns})
.rename(columns={"OMDBimdbID":"tconst"})
.merge(dfs["title.basics"], on="tconst")
.merge(dfs["title.ratings"], on="tconst")
.merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
left_on="tconst", right_on="parentTconst", how="left", suffixes=("","_ep"))
.merge(dfs["nmi"]
.merge(dfs["name.basics"], on="nconst")
.groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("","_name"))
).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc@._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department
I have data in the following format.
[screenshot: sample patient data in Excel; the resulting DataFrame is shown in the answer below]
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I know the regular expression to match these, which is
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes, represented like Z01.2, E11, etc. The above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in Python code to do the above task.
The problems I am trying to solve are:
Count the total number of codes assigned across all patients.
Count the total number of unique codes assigned (since multiple patients might have the same code assigned).
Generate data code-wise - i.e. if I select code Z01.2, I want to extract the data (maybe PATID, MOBILE NUMBER, or any other column, or all of them) for the patients who have been assigned this code.
Thanks in advance.
Using Python Pandas as follows.
Code
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")
all_codes = df['CODES'].sum()
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')
# Select patients with code Z01.2 (regex=False so the dot is literal)
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2', regex=False)]
# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported data as tab-delimited CSV
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t'
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1@gmail.com F 51 Y
1 22 Sac 98766 ab1@gmail.com F 24 Y
2 33 Tac 98767 ab1@gmail.com M 43 Y
3 44 Lac 98768 ab1@gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before the .; otherwise it is a regex wildcard and matches any character (it makes no difference on the data supplied).
pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES']--each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length']--corresponds to the length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50 ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50 ', 'E11 ', 'E78.5', 'I10 ', 'E55 ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10 ', 'J06.9', 'K60.1', 'E11 ', 'Z24.0', 'Z10.0', 'E55 ', 'E78.5', 'Z20.9', 'C50 '}
Count: 14
Select patients by code (i.e. Z01.2)
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2', regex=False)]
Show PATID, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768
I have 2 csv files with different sequence of columns. For e.g. the first file starts with 10 digits mobile numbers while that column is at number 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
First file:
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Second file:
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that new incoming files may have a totally different column sequence. In that case I need to identify the 10-digit mobile number column and the 6-digit pincode column. I also need to write code that guesses the city column by checking whether its entries match a given city list. The new files are expected to have relevant column headings, but a heading may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
I have been advised to use sed to move the 10-digit column to the beginning. But I also need to rearrange the text columns. For example, if a column matches the entries in the following list, then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the above list, then it is a "model" column and should be at position number 3, after mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2])  # columns are aligned by name
To get the desired order (the names must match the actual headers, including case):
result_df = combined[['mobile', 'pincode', 'Model', 'Name', 'Address', 'Location', 'Date']]
and then result_df.to_csv('output.csv', index=False) to export to a csv file.
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.