(Re)Weighting Random CSV Samples in Python

I have a (large) CSV phone directory with columns [0:3] = Phone Number, Name, City, State.
I created a random sample of 20,000 entries, but it was, of course, weighted drastically to more populated states and cities.
How would I write Python code (using csv or pandas; please no linecache) that would equally prioritize/weight each unique city and each state (individually, not as a pair), and also limit each unique city to 3 picks?
TRICKIER idea: How would I write Python code such that, for each random line that gets picked, it checks whether that city has been picked before. If that city has been picked before, it ignores it and picks a random line again, reducing the number of considered previous picks for that city by one. So, say it randomly picks San Antonio, which has been picked twice before. The script ignores this pick, places it back into the list, reduces the number of currently considered previous San Antonio picks, then randomly chooses a line again. If it picks a line from San Antonio again, it repeats the previous process, now reducing considered San Antonio picks to 0. So it would have to pick San Antonio three times in a row to add another line from San Antonio. For future picks, it would have to pick San Antonio four times in a row, plus one for each additional pick.
I don't know how well the second option would work to "scatter" my random picks - it's just an idea, and it looks like a fun way to learn more pythonese. Any other ideas along the same line of thought would be greatly appreciated. Insights into statistical sampling and sample scattering would also be welcome.

I may be misunderstanding exactly what you're trying to do, and what you want may be a bit more complex than this, but hopefully this example gives you some food for thought. You'll probably want to make use of existing libraries for your sampling; all in all, you can do this in just a few lines with pandas:
# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n rows from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
As a more complete example, with some randomly generated data using the picka library:
import numpy as np
import pandas as pd
import picka

# Generate some realistic random data using "picka"
num = 200
names = [picka.name() for _ in range(num)]
phones = [picka.phone_number() for _ in range(num)]

# Let's limit it to a smaller number of cities and states...
cities = np.random.choice(['Springfield', 'Houston', 'Dallas'], num)
states = np.random.choice(['IL', 'TX', 'TN', 'CA'], num)

df = pd.DataFrame(dict(name=names, phone=phones, city=cities, state=states))

# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n rows from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
print(sample)
This results in:
                              city      name         phone state
state city
CA    Dallas       72       Dallas    Sarina  133-258-6775    CA
                   46       Dallas     Dusty  799-563-7043    CA
      Houston     158      Houston     Artie  591-835-3043    CA
                  195      Houston  Federico  899-315-1205    CA
      Springfield  66  Springfield     Ollie  326-076-1329    CA
                   53  Springfield        Li  702-555-6594    CA
IL    Dallas      154       Dallas       Lou  146-404-9668    IL
                   39       Dallas     Ollie  399-256-7836    IL
      Houston     190      Houston  Scarlett  278-499-6901    IL
                   89      Houston    Rhonda  619-966-3691    IL
      Springfield 119  Springfield       Jae  180-444-0253    IL
                  130  Springfield     Tawna  630-953-5200    IL
TN    Dallas       25       Dallas     Frank  475-964-0279    TN
                   50       Dallas     Kiara  764-240-4802    TN
      Houston      95      Houston   Britney  661-490-5178    TN
                  107      Houston    Tommie  648-945-5608    TN
      Springfield  55  Springfield     Kecia  891-643-2644    TN
                   55  Springfield     Kecia  891-643-2644    TN
TX    Dallas      116       Dallas      Mara  636-589-0435    TX
                   98       Dallas   Lajuana  759-788-4742    TX
      Houston     103      Houston     Casey  600-522-2874    TX
                  140      Houston    Rachal  762-082-9017    TX
      Springfield 197  Springfield     Staci  021-981-7593    TX
                  168  Springfield  Sherrill  754-736-8409    TX
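One caveat: replace=True in choose_n samples with replacement, which is why the same Springfield/TN row (Kecia) appears twice above. If you also want the question's "at most 3 picks per unique city" limit, a variation along these lines might be closer; it's a minimal sketch reusing df and np from the example above, and choose_up_to_n is just a hypothetical helper name:

def choose_up_to_n(x, n=3):
    # Take at most n distinct rows per (state, city) group,
    # sampling without replacement so no row repeats
    k = min(len(x), n)
    idx = np.random.choice(np.arange(len(x)), k, replace=False)
    return x.take(idx)

capped = df.groupby(['state', 'city']).apply(choose_up_to_n)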

Assuming that your trickier idea is the one you're actually looking for, here's an implementation that takes care of it. It doesn't use pandas, which might be a mistake, but I didn't see pandas as a strict requirement in your question, and I figured this would be more straightforward:
import csv
import collections

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    with open(input_file) as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # Converting to a set of tuples removes duplicate entries
        # (iteration order also becomes arbitrary, i.e. roughly "shuffled")
        sample_set = set(tuple(row) for row in reader)
    while len(samples) < n:
        added_samples = sampling_run(sample_set, city_counter)
        # Add selected samples to the universal sample set
        samples.update(added_samples)
        # Remove only those samples which have been successfully selected
        sample_set = sample_set.difference(added_samples)
    return samples

def sampling_run(master_set, city_counter):
    city_ticker = 0
    current_city = ''
    samples_selected = set()
    for entry in master_set:
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        # Accept the entry only once its city has come up more times
        # in a row than it has already been picked overall
        if city_ticker > city_counter[city]:
            samples_selected.add(entry)  # add the whole row, not its fields
            city_counter[city] += 1
    return samples_selected
This does mean that if you have a very sparse CSV there might be issues. Changing the iteration to a random sample gets around that, but I'm not sure whether you want that or not:
import csv
import random
import collections

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    city_ticker = 0
    current_city = ''
    with open(input_file) as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # Converting to a set of tuples removes duplicate entries
        sample_set = set(tuple(row) for row in reader)
    while len(samples) < n:
        # random.choice needs a sequence, so convert the set for the draw
        entry = random.choice(tuple(sample_set))
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        # Accept only once the city has come up more times in a row
        # than it has already been picked
        if city_ticker > city_counter[city]:
            samples.add(entry)
            city_counter[city] += 1
            sample_set.remove(entry)
    return samples
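A quick usage sketch (the file name and sample size are placeholders; note the function will loop forever if n exceeds the number of usable rows, so pick n accordingly):

picked = random_city_sample(20000, input_file='my_csv.csv')
print(len(picked))            # 20000 selected rows
for row in list(picked)[:5]:
    print(row)                # peek at a few (phone, name, city, state) tuples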
I hope that helps! Let me know if you have any more questions.

Related

Need help searching a pandas dataframe for a substring from another column

I have a dataframe (called combos) that looks like this (where Player_Title is an index column, and Team, Opp are regular column headers):
Player_Title         Team  Opp
QB Kirk Cousins      MIN   # HOU
WR Adam Thielen      MIN   # HOU
WR Justin Jefferson  MIN   # HOU
RB Alvin Kamara      NO    # DET
RB Myles Gaskin      MIA   vs SEA
WR Brandin Cooks     HOU   vs MIN
TE Logan Thomas      WAS   vs BAL
RB Kenyan Drake      ARI   # CAR
DST Vikings          MIN   # HOU
I am trying to write a conditional statement that checks whether or not the "Team" for the DST row appears anywhere in the Opp column. If it does, return True. In this example it should return True because the Opp of WR Brandin Cooks is vs MIN.
I used combo['Team'][-1] to find the value of Team, and I used combo['Opp'][:-1] to find the list of opponents I am trying to search through for MIN. Then I plugged those into a lambda function that didn't find the matching substring. Ideally this would return True/False so that I can use it in an if/else statement, but I haven't figured out how to do that.
combo['C'] = combo.apply(lambda x: x['Team'][-1] in x['Team'][:-1], axis=1)
What about:
combo['Team'].iloc[-1] in combo['Opp'][:-1].str[-3:].values
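For instance, a minimal reproduction of that check, using a hypothetical subset of your frame with the DST row last (as in your data):

import pandas as pd

combo = pd.DataFrame({
    'Team': ['MIN', 'HOU', 'ARI', 'MIN'],
    'Opp':  ['# HOU', 'vs MIN', '# CAR', '# HOU'],
}, index=['QB Kirk Cousins', 'WR Brandin Cooks', 'RB Kenyan Drake', 'DST Vikings'])

# Take the DST row's Team (the last row) and test it against the
# last three characters of every other row's Opp value
found = combo['Team'].iloc[-1] in combo['Opp'][:-1].str[-3:].values
print(found)  # True, because WR Brandin Cooks has Opp 'vs MIN'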

How to retrieve rows in dataframe as strings?

I have a probably stupid question for you guys, but I cannot find a working solution to my problem so far. I have a data frame provided via automatic input, and I am transforming this data as such (not really relevant for my question):
import pandas as pd
import numpy as np

n = int(input())
matrix = []
for n in range(n+1):  # loop as long as the expected number of rows to be input
    new_row = input()  # get new row input
    new_row = list(new_row.split(","))  # make it a list
    matrix.append(new_row)  # update the matrix

mat = pd.DataFrame(data=matrix[1:], index=None, columns=matrix[0])
mat.iloc[:, 1] = pd.to_numeric(mat.iloc[:, 1])
mat.iloc[:, 2] = pd.to_numeric(mat.iloc[:, 2])
mat.iloc[:, 1] = round(mat.iloc[:, 1] / mat.iloc[:, 2])
mat2 = mat[['state', 'population']].head(5)
mat2['population'] = mat2['population'].astype(int)
mat2 = mat2.sort_values(by=['population'], ascending=False)
mat2 = mat2.to_string(index=False, header=False)
print(mat2)
the answer I am getting is equal to:
  New York  354
   Florida  331
California  240
  Illinois  217
     Texas  109
Nicely formatted, etc.; however, I need to retrieve my data in string format as:
New York 354
Florida 331
California 240
Illinois 217
Texas 109
I have already tried changing the ending of my code to:
#mat2 = mat2.to_string(index=False, header=False)
print(mat2.iloc[1,:])
to retrieve e.g. one row, but then the console returns:
state Florida
population 331
Name: 2, dtype: object
How can I simply access the data from my cells and format it in string format?
Thanks!
After mat2 = mat2.to_string(index=False, header=False), mat2 becomes a string you can transform to your liking. For instance, you could do:
>>> lines = mat2.split('\n')
>>> without_format = [line.strip() for line in lines]
>>> without_format
['New York 354',
'Florida 331',
'California 240',
'Illinois 217',
'Texas 109']
Where .strip() will remove any whitespace before or after the string.
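Alternatively, if you only need individual rows as plain strings, you can skip the to_string conversion entirely and join the cell values yourself; a minimal sketch, assuming mat2 is still a DataFrame at that point:

row = mat2.iloc[0]                    # first row, as a pandas Series
as_string = ' '.join(map(str, row))   # e.g. 'New York 354'
print(as_string)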

concat pdf tables into one excel table using python

I'm using tabula in order to concatenate all the tables in the following PDF file into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd

allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)

new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: see the linked screenshot.
The issue I'm facing is that from page 91 onward the data isn't formatted correctly within the Excel file.
I've debugged the pages individually and couldn't figure out why they're formatted wrongly, especially since they're in the same format.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for two pages, 90 and 91.
Starting from row 48 you will see the difference: name and address are placed into one cell, and city and state are placed into one cell as well.
I dug into the source code and found the option columns, which lets you manually define column boundaries. When you set columns you have to use guess=False.
tabula-py uses the program tabula-java, and in its documentation I found that it needs values in percents or points (not pixels), so I used the program inkscape to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd

# display all columns in the dataframe
pd.set_option('display.width', None)

columns = [210, 350, 420, 450]  # boundaries in points
#columns = ['210,350,420,450']  # boundaries in points

pages = '90-92'
#pages = [90,91,92]
#pages = list(range(90,93))
#pages = 'all'  # read all pages

tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  guess=False)

df = pd.concat(tables).reset_index(drop=True)

#df.rename(columns=df.iloc[0], inplace=True)  # convert first row to headers
#df.drop(df.index[0], inplace=True)  # remove first row with headers

# display
#for x in range(0, len(df), 20):
#    print(df.iloc[x:x+20])
#    print('----------')

print(df.iloc[45:50])

#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
                                      0                         1       2   3               4
45                        JARRARD, GARY      930 FORT WORTH DRIVE  DENTON  TX  (940) 565-6548
46                        JARRARD, GARY        2219 COLORADO BLVD  DENTON  TX  (940) 380-1661
47  MASON HARRISON, RATLIFF ENTERPRISES  1815 W. UNIVERSITY DRIVE  DENTON  TX  (940) 387-5431
48  MASON HARRISON, RATLIFF ENTERPRISES          109 N. LOOP #288  DENTON  TX  (940) 484-2904
49  MASON HARRISON, RATLIFF ENTERPRISES      930 FORT WORTH DRIVE  DENTON  TX  (940) 565-6548
EDIT:
It may also need the option area (also in points) to skip headers, or you will have to remove the first row on the first page.
I didn't check all rows, but it may need some changes to the column boundaries.
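A sketch of what that could look like; the area values below are hypothetical placeholders you would measure yourself (e.g. in inkscape), given as [top, left, bottom, right] in points:

tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  area=[80, 20, 790, 580],  # hypothetical boundaries in points
                  guess=False)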
EDIT:
A few rows cause problems, probably because the text in the City column is too long.
col3 = df.iloc[:,3]
print(df[ col3.str.contains(' ') ])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830

Trying to arrange list within list but can't arrange numbers properly

So right now I'm working with a file containing a bunch of lists within a list. I'm trying to arrange them by the top 5 and bottom 5 states for total number of participants.
import csv
from operator import itemgetter  # not really using this right now

crimefile = open('APExam.txt', 'r')
reader = csv.reader(crimefile)
allRows = [row for row in reader]

L = sorted(allRows, key=lambda x: x[1])
for item in L:
    print(item[0], ','.join(map(str, item[1:])))
Which prints something like this:
State Total #,% passed,%female
Wyoming 0,0,0
New Hampshire 101,76,12
Utah 103,54,4
Mass 1067,67,18
Montana 11,55,0
Iowa 118,80,9
Alabama 126,79,17
Georgia 1261,51,18
Florida 1521,44,20
Illinois 1559,69,13
New Jersey 1582,74,15
Maine 161,67,16
This prints the file in a way that is nearly what I'm looking for, but the total number of participants isn't sorted from largest to smallest; it is sorted lexicographically, character by character. How do I change it to look more like:
New Jersey 1582,74,15
Illinois 1559,69,13
Florida 1521,44,20
Georgia 1261,51,18
Etc...
First time asking a question here, any help is appreciated! :)
Also, I'm trying not to use the .sort() or find() functions or the "in" operator, such as "if 6 in [5, 4, 7, 6] ...".
Edit: By changing L to
L = sorted(allRows, key=lambda x: int(x[1]), reverse=True)
I've gotten to the point where I have the list going in descending order:
State Total #,% passed,%female
California 4964,76,22
Texas 3979,62,23
New York 1858,69,20
Virginia 1655,60,19
Maryland 1629,66,20
New Jersey 1582,74,15
Illinois 1559,69,13
Florida 1521,44,20
Now I can't quite seem to figure out how to only take the top 5 and bottom 5 based on these totals...
You goofed up the sorting. CSV files contain text, regardless of what that text may happen to be.
L = sorted(allRows,key=lambda x: int(x[1]))
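For the top 5 / bottom 5 part, plain slicing on the numerically sorted list is enough. A minimal sketch; it splits the header row off first, since int('Total #') would raise a ValueError during the sort:

header, rows = allRows[0], allRows[1:]  # separate the header row
L = sorted(rows, key=lambda x: int(x[1]), reverse=True)

top5 = L[:5]      # five largest totals
bottom5 = L[-5:]  # five smallest totals

for item in top5 + bottom5:
    print(item[0], ','.join(item[1:]))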

What's the best way to compare two files & update the values of the first file from second file in Python?

I apologize for the n00b question. I'm a newbie to Python and I'm using Python 2.6. I have two files, and I need to compare them and update the values in the 1st file from the 2nd file.
My first file is as below,
SeqNo City State
1 Chicago IL
2 Boston MA
3 New York NY
4 Los Angeles CA
5 Seattle WA
My second file is as below,
SeqNo City State NewSeqNo
5 Seattle WA 1
1 Chicago IL 2
4 Los Angeles CA 3
2 Boston MA 4
3 New York NY 5
How do I update the SeqNo in the first file with the value of NewSeqNo from the second file?
For example, the output of the first file should be,
NewSeqNo City State
2 Chicago IL
4 Boston MA
5 New York NY
3 Los Angeles CA
1 Seattle WA
I need to achieve this using Python & any help would be greatly appreciated.
Open the second file. Use csv.reader to handle tokenizing each line.
Build up a mapping of oldseq->newseq, using a dict.
import csv

lookup = {}
with open('secondfile') as f:
    reader = csv.reader(f)
    for line in reader:
        oldseq, city, state, newseq = line
        lookup[oldseq] = newseq
Now open up your first file. Use the same strategy, but replace your SeqNo with the value in your mapping dict.
# Note: a single 'with' statement holding two context managers needs
# Python 2.7+; nesting them keeps this Python 2.6 compatible.
with open('firstfile') as f:
    with open('outfile', 'w') as w:
        reader = csv.reader(f)
        writer = csv.writer(w)
        for line in reader:
            seq, city, state = line
            if seq in lookup:
                seq = lookup[seq]
            writer.writerow([seq, city, state])
That's the gist of it. You will have to deal with some small things that I didn't address, like skipping the header row(s), and renaming 'outfile' to 'firstfile' (ie overwriting the old file with the temporary file) once you're done with the operation. It's technically possible to avoid creating a temporary file and directly write into your file as you're iterating over it, but I advise against it for reasons I won't delve into here.
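For the header-row detail specifically, here's a sketch of one way to handle it, sticking to Python 2.6-compatible constructs and reusing the placeholder file names above:

import csv
import os

f = open('secondfile')
reader = csv.reader(f)
next(reader)  # skip the 'SeqNo City State NewSeqNo' header row
lookup = dict((row[0], row[3]) for row in reader)  # 2.6 has no dict comprehensions
f.close()

# ...build 'outfile' exactly as above, then swap it into place
os.rename('outfile', 'firstfile')  # on Windows, remove 'firstfile' first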
