I probably have a stupid question for you guys, but I cannot find a working solution to my problem so far. I have a data frame provided via automatic input and I am transforming the data as follows (not really relevant to my question):
import pandas as pd
import numpy as np
n = int(input())
matrix =[]
for _ in range(n + 1):               # read the header plus the expected number of rows
    new_row = input()                # get new row input
    new_row = new_row.split(",")     # make it a list
    matrix.append(new_row)           # update the matrix
mat = pd.DataFrame(data=matrix[1:], index=None, columns=matrix[0])
mat.iloc[:,1] = pd.to_numeric(mat.iloc[:,1])
mat.iloc[:,2] = pd.to_numeric(mat.iloc[:,2])
mat.iloc[:,1] = round(mat.iloc[:,1] / mat.iloc[:,2])
mat2 = mat[['state', 'population']].head(5)
mat2['population'] = mat2['population'].astype(int)
mat2 = mat2.sort_values(by=['population'], ascending=False)
mat2 = mat2.to_string(index=False, header=False)
print(mat2)
The output I am getting is:
  New York  354
   Florida  331
California  240
  Illinois  217
     Texas  109
Nicely formatted and so on; however, I need to retrieve my data as a string, like:
New York 354
Florida 331
California 240
Illinois 217
Texas 109
I have already tried changing the ending of my code to:
#mat2 = mat2.to_string(index=False, header=False)
print(mat2.iloc[1,:])
to retrieve e.g. the first row, but then the console returns:
state Florida
population 331
Name: 2, dtype: object
How can I simply access the data from my cells and format it as a string?
Thanks!
After mat2 = mat2.to_string(index=False, header=False), mat2 becomes a string you can transform to your liking. For instance, you could do:
>>> lines = mat2.split('\n')
>>> without_format = [line.strip() for line in lines]
>>> without_format
['New York 354',
'Florida 331',
'California 240',
'Illinois 217',
'Texas 109']
Here, .strip() removes any whitespace before or after each line.
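Alternatively, if you would rather build those strings yourself instead of parsing the to_string() output, you can do it straight from the DataFrame before converting it (a minimal sketch, assuming mat2 is still the two-column frame from above):
# build one "state population" string per row directly from the frame
lines = [f"{row.state} {row.population}" for row in mat2.itertuples(index=False)]
print('\n'.join(lines))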
I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process
### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")
#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")
### Variable for accuracy scoring
accuracy = 0
for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]
    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ratio of the match into one tuple - just the value and ratio.
Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big.
You can pull the CSV text from the links in the comments, run it, and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other, the execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib
# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)
# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o is not None:
        return o[1]
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())
# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)
# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()
# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(
    lambda x: difflib.SequenceMatcher(None, x['ref_address'], x['matched_address']).ratio(), axis=1)
# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',
                         'fuzzywuzzy_score', 'difflib_score']]
# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552
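If you still want a single accuracy figure like the one in your original loop, you can derive it from compare_df afterwards (a small sketch; fuzzywuzzy scores run from 0 to 100, so each row contributes at most 1):
# divide the 0-100 fuzzywuzzy score by 100 so each row scores at most 1
compare_df['row_accuracy'] = compare_df['fuzzywuzzy_score'] / 100
accuracy = compare_df['row_accuracy'].sum()
print(accuracy)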
I'm using tabula in order to concatenate all the tables in the following PDF file into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd
allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)
new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: check
But the issue I am facing is that from page 91 onward the data is not formatted correctly in the Excel file.
I've debugged the pages individually and I couldn't figure out why they are formatted wrongly, especially since they use the same format.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for two pages, 90 and 91.
Starting from row 48 you will see the difference here:
you will notice that the name and address are placed into one cell, and the city and state are placed into one cell as well.
I dug into the source code and found that it has a columns option, so you can manually define column boundaries. When you set columns, you have to use guess=False.
tabula-py uses the tabula-java program, and in its documentation I found that it needs values in percentages or points (not pixels); a PDF point is 1/72 of an inch. So I used the Inkscape program to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd
# display all columns in dataframe
pd.set_option('display.width', None)
columns = [210, 350, 420, 450] # boundaries in points
#columns = ['210,350,420,450'] # boundaries in points
pages = '90-92'
#pages = [90,91,92]
#pages = list(range(90,93))
#pages = 'all' # read all pages
tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  guess=False)
df = pd.concat(tables).reset_index(drop=True)
#df.rename(columns=df.iloc[0], inplace=True) # convert first row to headers
#df.drop(df.index[0], inplace=True) # remove first row with headers
# display
#for x in range(0, len(df), 20):
# print(df.iloc[x:x+20])
# print('----------')
print(df.iloc[45:50])
#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
0 1 2 3 4
45 JARRARD, GARY 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
46 JARRARD, GARY 2219 COLORADO BLVD DENTON TX (940) 380-1661
47 MASON HARRISON, RATLIFF ENTERPRISES 1815 W. UNIVERSITY DRIVE DENTON TX (940) 387-5431
48 MASON HARRISON, RATLIFF ENTERPRISES 109 N. LOOP #288 DENTON TX (940) 484-2904
49 MASON HARRISON, RATLIFF ENTERPRISES 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
EDIT:
It may also need the area option (also in points) to skip headers, or you will have to remove the first row on the first page.
I didn't check all rows, but it may need some changes in the column boundaries.
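For example, a rough sketch of passing area as well; the [top, left, bottom, right] values below are only placeholders that you would measure yourself (e.g. in Inkscape):
# area is [top, left, bottom, right] in points; the numbers here are placeholders
tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  area=[80, 0, 800, 595],
                  columns=columns,
                  guess=False)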
EDIT:
A few rows cause problems - probably because the text in the City column is too long.
col3 = df.iloc[:,3]
print(df[ col3.str.contains(' ') ])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
I want to return a joined string of all the keywords from find = ['gold', 'mining', 'silver', 'steel'] that appear in the following text, but it turns out it just prints the first one that appears.
One of the rows in output.csv:
desc
"The **2014 Orkney earthquake** occurred at 12:22:33 SAST on 5 August, with the
epicentre near Orkney, a gold mining town in the Klerksdorp district in the
North West province of South Africa. The shock was assigned a magnitude of 5.5
on the Richter scale by the Council for Geoscience (CGS) in South Africa,
making it the biggest earthquake in South Africa since the 1969 Tulbagh
earthquake, which had a magnitude of 6.3 on the Richter scale. The United
States Geological Survey (USGS) estimated a focal depth of 5.0 km (3.1 mi).
The CGS reported 84 aftershocks on 5 August and 31 aftershocks on 6 August,
with a magnitude of 1.0 to 3.8 on the Richter scale. According to the CGS, the
earthquake is the biggest mining-related earthquake in South African history."
output: gold
expected output: gold, mining
Here is what I have done:
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold','mining','silver','steel']
for chunk in reader:
    chunk.columns = ['desc']
    def process(x):
        for s in find:
            if s in x['desc']:
                print('', s)
                return s
        return ''
    chunk['place'] = chunk.apply(lambda x: (process(x)), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
How to join the result?
EDIT
def preprocess_patetnt(name):
    reader = pd.read_csv('output.csv', chunksize=1000)
    sname = sorted(name, key=len, reverse=True)
    for chunk in reader:
        chunk.columns = ['row', 'desc']
        chunk['place'] = chunk["desc"].str.findall("|".join(sname)).apply(set)
        chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
        print(chunk)

place = pd.read_csv('country.csv', chunksize=13000000, error_bad_lines=False)
for chunk in place:
    chunk.columns = ['name']
    preprocess_patetnt(chunk["name"])
country.csv is a list of country names like the following:
country.csv
and here is output.csv
output.csv
But it gives me this error: re.error: bad character range á-K at position 77230
Your process function returns as soon as it gets the first hit. Instead, you should store all your hits and return them as one string. Use a list comprehension for these kinds of loops and the str.join(iterable) method to join the list into a string (I'm guessing here that sname is actually find).
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold','mining','silver','steel']
for chunk in reader:
    chunk.columns = ['desc']
    def process(x):
        return ','.join([s for s in find if s in x['desc']])
    chunk['place'] = chunk.apply(lambda x: (process(x)), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
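As for the re.error in your EDIT: str.findall builds a regular expression from the country names, so characters such as '-' or '[' inside a name can be read as regex syntax. Escaping each name first should avoid the bad character range error (a sketch based on your EDIT, assuming sname is your sorted list of names):
import re
# escape regex metacharacters in each name before building the alternation pattern
pattern = "|".join(re.escape(s) for s in sname)
chunk['place'] = chunk["desc"].str.findall(pattern).apply(set)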
happy coding!
I intend to merge two data frames, Chicago crime and Redfin real estate data, but the Redfin data was collected by neighborhood in Chicago while the crime data was collected by community area. To do so, I found a neighborhood map of Chicago and roughly figured out how to assign neighborhoods to community areas. The structure of the two dataframes is a bit different, so I did a few manipulation steps on them. Here are the details of my attempt:
Example data snippet
Here is the public gist where you can view the example data snippet.
Here is the neighborhood mapping that I collected from an online source.
My solution
Here is my first mapping solution:
code_pairs_neighborhoods = [[p[0], p[1]] for p in [pair.strip().split('\t') for pair in neighborhood_Map.strip().split('\n')]]
neighborhood_name_dic = {k[0]:k[1] for k in code_pairs_neighborhoods} #neighborhood -> community area
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
Redfin['neighborhood'] = Redfin['Region'].map(neighborhood_name_dic)
final_df = pd.merge(chicago_crime, Redfin, on='neighborhood')
But this solution didn't find the correct mapping and neighborhood becomes NaN, which is wrong.
Second mapping attempt:
Without using the neighborhood mapping, I intuitively came up with this solution:
chicago_crime['community_name']=[[y.split() for y in x] for x in chicago_crime['community_name']]
Redfin['Region']= [[j.split() for j in i] for i in Redfin['Region']]
idx, datavalue = [], []
for i, dv in enumerate(chicago_crime['community_name']):
    for d in dv:
        if d in Redfin['Region'][i]:
            if i not in idx:
                idx.append(i)
                datavalue.append(d)
chicago_crime['merge_ref'] = datavalue
Redfin['merge_ref'] = datavalue
final_df= pd.merge(chicago_crime[['community_area','community_name','merge_ref']], Redfin, on='merge_ref')
But this solution gave me errors: ValueError: Length of values does not match length of index and AttributeError: 'list' object has no attribute 'split'.
How can I make this work? Based on the neighborhood mapping, how can I get the correct mapping for both the Redfin data and the Chicago crime data, and end up with the right merged dataframe? Any thoughts? Thanks in advance.
Update:
I put all of my solution, including the dataset, in this GitHub repository: all solution and data on github
Ok, here's what I found:
There is a unicode character in the first line of neighborhood_Map that you probably want to remove: Cabrini\xe2\x80\x93Green -> Cabrini Green
switch the key and value in neighborhood_name_dic around, since you want to map the existing 'Rogers Park' to the neighborhood 'East Rogers Park', like so:
neighborhood_name_dic = {k[1]:k[0] for k in code_pairs_neighborhoods}
We still don't know from your code how you're reading in your Redfin data, but I presume you'll have to remove the Chicago, IL - part in the Region column somewhere before you can merge?
Update: So I think I managed to understand your code (again, please try to clean up these things a bit before posting), and I think that Redfin is equal to house_df there. So instead of the line that says:
house_df=house_df.set_index('Region',drop=False)
I would suggest creating a neighborhood column:
house_df['neighborhood'] = house_df['Region'].map(lambda x: x.replace('Chicago, IL - ', ''))
and then you can merge on:
crime_finalDF = pd.merge(chicago_crime, house_df, left_on='neighborhood', right_on='neighborhood')
To test it, try:
mask=(crime_finalDF['neighborhood']==u'Sheridan Park')
print(crime_finalDF[['robbery','neighborhood', u'2018-06-01 00:00:00']][mask])
which yields:
robbery neighborhood 2018-06-01 00:00:00
0 140.0 Sheridan Park 239.0
1 122.0 Sheridan Park 239.0
2 102.0 Sheridan Park 239.0
3 113.0 Sheridan Park 239.0
4 139.0 Sheridan Park 239.0
so a successful join of both datasets (I think).
Update 2, regarding the success of the merge().
This is how I read in and cleaned up your xlsx file:
house_df = pd.read_excel("./real_eastate_data_main.xlsx",)
house_df = house_df.replace({'-': None})
house_df.columns=house_df.columns.astype(str)
house_df = house_df[house_df['Region'] != 'Chicago, IL']
house_df = house_df[house_df['Region'] != 'Chicago, IL metro area']
house_df['neighborhood'] = house_df['Region'].str.split(' - ')## note the surrounding spaces
house_df['neighborhood'] = house_df['neighborhood'].map(lambda x: list(x)[-1])
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
## Lakeview and Humboldt park not defined in neighborhood_name_dic
# print( chicago_crime[['community_name','neighborhood']][pd.isnull(chicago_crime['neighborhood'])] )
chicago_crime = chicago_crime[~pd.isnull(chicago_crime['neighborhood'])] ## remove them
Now we turn to finding all unique neighborhoods in both df's
cc=sorted(chicago_crime['neighborhood'].unique())
ho=sorted(house_df['neighborhood'].unique())
print(30*u"-"+u" chicago_crime: "+30*u"-")
print(len(cc),cc)
print(30*u"-"+u" house_df: "+30*u"-")
print(len(ho),ho)
print(60*"-")
# print('\n'.join(cc))
set1 = set(cc)
set2 = set(ho)
missing = list(sorted(set1 - set2))
added = list(sorted(set2 - set1))
print('These {0} are missing in house_df: {1}'.format(len(missing),missing))
print(60*"-")
print('These {0} are only in house_df: {1}'.format(len(added),added))
Which reveals that 29 are missing in house_df (e.g. 'East Pilsen') and 132 are found only within house_df (e.g. 'Albany Park'), i.e. we can "inner join" only 46 entries.
Now you have to decide how you want to continue. It's best if you first read this about the way merging works (e.g. understand the Venn diagrams posted there), and then you can improve your code accordingly. Or clean up your data manually beforehand; sometimes there isn't a fully automatic solution!
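If it helps, one quick way to see how the two frames overlap before you pick a join type is an outer merge with an indicator column (a small sketch using the frames built above):
# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
check = pd.merge(chicago_crime, house_df, on='neighborhood', how='outer', indicator=True)
print(check['_merge'].value_counts())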
I have a (large) directory CSV with columns [0:3] = Phone Number, Name, City, State.
I created a random sample of 20,000 entries, but it was, of course, weighted drastically to more populated states and cities.
How would I write a python code (using CSV or Pandas - please no linecache) that would equally prioritize/weight each unique city and each state (individually, not as a pair), and also limit each unique city to 3 picks?
TRICKIER idea: How would I write a python code such that for each random line that gets picked, it checks whether that city has been picked before. If that city has been picked before, it ignores it and picks a random line again, reducing the number of considered previous picks for that city by one. So, say that it randomly picks San Antonio, which has been picked twice before. The script ignores this pick, places it back into the list, reduces the number of currently considered previous San Antonio picks, then randomly chooses a line again. IF it picks a line from San Antonio again, then it repeats the previous process, now reducing considered San Antonio picks to 0. So it would have to pick San Antonio three times in a row to add another line from San Antonio. For future picks, it would have to pick San Antonio four times in a row, plus one for each additional pick.
I don't know how well the second option would work to "scatter" my random picks - it's just an idea, and it looks like a fun way to learn more pythonese. Any other ideas along the same line of thought would be greatly appreciated. Insights into statistical sampling and sample scattering would also be welcome.
I may be misunderstanding exactly what you're trying to do; I think what you're wanting is a bit more complex, but hopefully this example gives you some food for thought.
You probably want to make use of existing libraries for your sampling; all in all, you can do this in just a few lines with pandas:
# Group by city, state
groups = df.groupby(['state', 'city'])
# Then get a result with n from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)
num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
As a more complete example, with some randomly generated data using the picka library:
import numpy as np
import pandas as pd
import picka
# Generate some realistic random data using "picka"
num = 200
names = [picka.name() for _ in range(num)]
phones = [picka.phone_number() for _ in range(num)]
# Let's limit it to a smaller number of cities and states...
cities = np.random.choice(['Springfield', 'Houston', 'Dallas'], num)
states = np.random.choice(['IL', 'TX', 'TN', 'CA'], num)
df = pd.DataFrame(dict(name=names, phone=phones, city=cities, state=states))
# Group by city, state
groups = df.groupby(['state', 'city'])
# Then get a result with n from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)
num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
print(sample)
This results in:
city name phone state
state city
CA Dallas 72 Dallas Sarina 133-258-6775 CA
46 Dallas Dusty 799-563-7043 CA
Houston 158 Houston Artie 591-835-3043 CA
195 Houston Federico 899-315-1205 CA
Springfield 66 Springfield Ollie 326-076-1329 CA
53 Springfield Li 702-555-6594 CA
IL Dallas 154 Dallas Lou 146-404-9668 IL
39 Dallas Ollie 399-256-7836 IL
Houston 190 Houston Scarlett 278-499-6901 IL
89 Houston Rhonda 619-966-3691 IL
Springfield 119 Springfield Jae 180-444-0253 IL
130 Springfield Tawna 630-953-5200 IL
TN Dallas 25 Dallas Frank 475-964-0279 TN
50 Dallas Kiara 764-240-4802 TN
Houston 95 Houston Britney 661-490-5178 TN
107 Houston Tommie 648-945-5608 TN
Springfield 55 Springfield Kecia 891-643-2644 TN
55 Springfield Kecia 891-643-2644 TN
TX Dallas 116 Dallas Mara 636-589-0435 TX
98 Dallas Lajuana 759-788-4742 TX
Houston 103 Houston Casey 600-522-2874 TX
140 Houston Rachal 762-082-9017 TX
Springfield 197 Springfield Staci 021-981-7593 TX
168 Springfield Sherrill 754-736-8409 TX
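If you also want the question's limit of at most three picks per unique city, with no repeated rows, one possible variant is to sample without replacement and cap at the group size (a sketch using the same df as above):
# take up to 3 distinct rows per city (fewer if a city has fewer rows)
capped = df.groupby('city', group_keys=False).apply(lambda g: g.sample(min(len(g), 3)))
print(capped)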
Assuming that your trickier idea is the one you're actually looking for, here's an implementation that would take care of it. It doesn't use pandas, which might be a mistake, but I didn't see that as a strict requirement on your question and I figured this would be more straightforward:
import collections
import csv

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    reader = csv.reader(open(input_file), delimiter=",", quotechar="\"")
    # Shuffles your entries as well as removing duplicate entries
    sample_set = set(tuple(row) for row in reader)
    while len(samples) < n:
        added_samples = sampling_run(sample_set, city_counter)
        # Add selected samples to universal sample list
        samples.update(added_samples)
        # Remove only those samples which have been successfully selected
        sample_set = sample_set.difference(added_samples)
    return samples

def sampling_run(master_set, city_counter):
    city_ticker = 0
    current_city = ''
    samples_selected = set()
    for entry in master_set:
        city = entry[2]  # columns are phone, name, city, state
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        if city_ticker > city_counter[city]:
            samples_selected.add(entry)   # add the whole row tuple, not its fields
            city_counter[city] += 1       # record that this city has been picked again
    return samples_selected
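A quick usage sketch (the CSV filename here is just a placeholder for your directory file):
# draw 20,000 throttled picks from the directory file
picked = random_city_sample(20000, input_file='directory.csv')
print(len(picked))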
Though this does mean that if you have a very sparse CSV, there might be issues; changing the iteration to a random sample gets around that, but I'm not sure whether you want to or not:
import random

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    reader = csv.reader(open(input_file), delimiter=",", quotechar="\"")
    # Shuffles your entries as well as removing duplicate entries
    sample_set = set(tuple(row) for row in reader)
    city_ticker = 0
    current_city = ''
    while len(samples) < n:
        # draw one random row (a tuple) from the remaining pool
        entry = random.sample(list(sample_set), 1)[0]
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        if city_ticker > city_counter[city]:
            samples.add(entry)
            city_counter[city] += 1
            sample_set.remove(entry)
    return samples
I hope that helps! Let me know if you have any more questions.