I noticed pandas is smart when using read_excel / read_csv: it skips empty rows. So if my input has a blank row like
Col1, Col2

Value1, Value2
It just works, but is there a way to get the actual # of skipped rows? (In this case 1)
I want to tie the dataframe row numbers back to the raw input file's row numbers.
You could pass skip_blank_lines=False and import the entire file, including the empty lines. Then you can detect them, count them, and filter them out:
import pandas as pd

def custom_read(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    # A blank line becomes an all-NaN row, so a row is non-empty
    # if any of its values is non-null
    non_empty = df.notnull().any(axis=1)
    print('Skipped {} blank lines'.format(sum(~non_empty)))
    return df.loc[non_empty, :]
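Because skip_blank_lines=False keeps one dataframe row per physical data line, the filtered frame's index still lines up with the raw file: index 0 is the line right after the header. A minimal sketch of tying rows back to one-based file line numbers, assuming a hypothetical data.csv:

import pandas as pd

df = pd.read_csv('data.csv', skip_blank_lines=False)
non_empty = df.notnull().any(axis=1)
clean = df.loc[non_empty]
# The header is physical line 1, so dataframe index 0 is file line 2;
# the preserved index therefore maps each row back to its raw line number
clean = clean.assign(source_line=clean.index + 2)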
You can also use csv.reader to import your file row by row and only keep the non-empty rows:
import csv
import pandas as pd

def custom_read2(f_name):
    with open(f_name) as f:
        cont = []
        empty_counts = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row) > 0:  # csv.reader yields [] for a blank line
                cont.append(row)
            else:
                empty_counts += 1
        print('Skipped {} blank lines'.format(empty_counts))
    return pd.DataFrame(cont)
As far as I can tell, at most one row at a time will occupy your memory this way. That may be useful if you happen to have large files with many blank lines, but I am pretty sure option 1 will always be the better option in practice.
CSV is of the format (the first two rows carry only a value, the last two a key and a value):

######
20000
SKY,5
EZJ,8
I want to make a dictionary that when printed will return:
{
    'Date': '######',
    'Cash': '20000',
    'SKY': '5',
    'EZJ': '8'
}
so far I have
import csv
import pprint

portfolio = {}

def loadPortfolio(fname):
    try:
        with open(fname, "rt") as f:
            reader = csv.reader(f)
            for row in reader:
                key = row[0]
                portfolio[key] = row[1:]
    except OSError:  # the original try block was missing its except clause
        print("Could not read", fname)

pprint.pprint(portfolio)
I basically want to know how to supply keys for only the first two rows, while the last two rows take their keys from the CSV.
Based on this question and its answers, you can retrieve the number of tokens per row with len(row).
You can use another variable to keep track of the current row number if you need to.
Your for loop should look like this:
row_number = 0
for row in reader:
    key = ''  # default key if no specific one is chosen below
    if len(row) > 1:
        key = row[0]
        portfolio[key] = row[1:]
    else:
        key = 'CUSTOM_KEY'  # depends on the current row_number
        portfolio[key] = row[0]  # a one-token row only carries a value
    row_number += 1
REMARK: This will work not only for the first rows but for every row with fewer than 2 tokens. If you do not choose a specific key for every row with only 1 token, the default value of key will be used instead (in this case an empty str '').
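For the sample file above, a minimal sketch of this idea that keys the two one-token rows as 'Date' and 'Cash' (the key list and the file name portfolio.csv are assumptions):

import csv
import pprint

def load_portfolio(fname):
    default_keys = ['Date', 'Cash']  # assumed keys for the one-token rows, in file order
    portfolio = {}
    with open(fname, newline='') as f:
        for row_number, row in enumerate(csv.reader(f)):
            if not row:
                continue  # skip blank lines
            if len(row) > 1:
                portfolio[row[0]] = row[1]
            else:
                portfolio[default_keys[row_number]] = row[0]
    return portfolio

pprint.pprint(load_portfolio('portfolio.csv'))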
import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")

for row in creader:
    tn = ...       # current phone number
    crednum = ...  # number of rows with that phone number
    for row in preader:
        purnum = ...  # number of rows with that phone number
        if crednum != 2 * purnum:
            out.write(str(tn) + "\n")

cred.close()
pur.close()
out.close()
For both files I am only looking at the first (0th) column, which holds phone numbers. The files are sorted by phone number, so any duplicates are next to each other. I need to know how many rows in the cred file share a given phone number, and then how many rows in the pur file have that same phone number. I need to do this for every group of duplicate phone numbers, comparing across both files.
Example:
Credits File
TN,STUFF,THINGS
2476,hseqer,trjar
2476,sthrtj,esreet
3654,rstrhh,trwtr
Purchases File
TN,STUFF,THINGS
2476,hseher,trjdr
3566,sthztj,esrhet
3654,rstjhh,trjtr
What I would need to know from this example is that there are 2 instances of 2476 in the credits file versus 1 in the purchases file, and that there is 1 instance of 3654 in the credits file versus 1 in the purchases file. I need to check every phone number in the cred file and get its number of occurrences in both files; if there are phone numbers in the pur file that are not in the cred file, I don't need to count them. (But if there are 2 of a number in cred and none in pur, purnum should come back as 0.) Note that the real files are 5,000 KB and 13,000 KB in size and have tens of thousands of lines.
I'm a serious newbie to Python, so I'm not sure of the best way to go about this. Looping in Python is definitely different from what I'm used to (I mostly use C++).
I will edit to add anything needed, so please let me know if anything needs clarification. This isn't like any project I've had before, so the explanation may not be ideal.
EDIT: I think I skipped an important factor because it was included in my sample code. I need those counts only to compare them, not necessarily to print them. If crednum != 2*purnum, I want to print that phone number, and only that phone number; otherwise I don't need to see it in the output file. I'll never need to print the counts themselves, just use them to decide which phone numbers to print.
import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")

def x(reader):  # function takes in a reader
    dictionary = {}  # a dict is a Python data type of key-value pairs
    for row in reader:  # for each row in the reader
        number = row[0]  # take the first element in the row (the number)
        if number == 'TN':  # skip the header
            continue
        number = int(number)  # convert it now ('TN' cannot be converted, which is why we do it after)
        if number in dictionary:  # if the number appears already
            dictionary[number] = dictionary[number] + 1  # increment it
        else:
            dictionary[number] = 1  # else store it in the dictionary as 1
    return dictionary  # return the dictionary

def assertDoubles(credits, purchases):
    outstr = ''
    for key in credits:
        crednum = credits[key]
        # .get(..., 0) covers numbers that never appear in the purchases file
        if crednum != 2 * purchases.get(key, 0):
            outstr += str(key) + '\n'
            print(key)
    out.write(outstr)

credits = x(creader)
purchases = x(preader)
assertDoubles(credits, purchases)
#print(credits)
#print('-------')
#print(purchases)

cred.close()
pur.close()
out.close()
I wrote some code. It essentially stores the number you're checking for duplicates as a key in a dictionary. The value stored with it is the number of occurrences of that number within the file. It skips the first line (headers).
Output is the following:
{2476: 2, 3654: 1}
-------
{2476: 1, 3654: 1, 3566: 1}
The new code above simply outputs:
3654
EDIT: I updated the code to fix what you are referring to.
Since you're not interested in new entries, all you need is to run through the first file and collect all the entries in its first column (counting them in the process), then run through the second file, check whether each of its first-column entries was collected in the first step, and if so count it as well. You cannot avoid reading all the lines of both files, but you can use a hashmap (dict) for blazingly fast lookups afterwards, so:
import csv
import collections

c_phones = collections.defaultdict(int)  # initiate a 'counter' dict to save us some typing

with open("AllCredits.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        c_phones[row[0]] += 1  # increase the count of the current phone
Now that you have the counts of all the phone numbers from the first file stored in the c_phones dictionary, you should clone it but reset the counters, so you can count the occurrences of these numbers in the second CSV file:
p_phones = {key: 0 for key in c_phones}  # reset the phone counter for purchases

with open("AllPurchases.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        if row[0] in p_phones:  # we're only interested in phones from both files
            p_phones[row[0]] += 1  # increase the counter
And now that you have both dictionaries with both counts, you can easily iterate over them to print out the counts:
for key in c_phones:
    print("{:<15} Credits: {:<4} Purchases: {:<4}".format(key, c_phones[key], p_phones[key]))
Which, with your example data, will yield:
3654 Credits: 1 Purchases: 1
2476 Credits: 2 Purchases: 1
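And since the stated goal is to write out only the numbers where the credits count is not exactly double the purchases count, a minimal follow-up sketch using the same two dictionaries:

with open("output.txt", "w") as out:
    for key, crednum in c_phones.items():
        # p_phones was built from c_phones, so every key is present
        if crednum != 2 * p_phones[key]:
            out.write("{}\n".format(key))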
To help with my understanding, I've broken this problem into smaller, more manageable tasks:
Read phone numbers from the first column of two sorted csv files.
Find duplicate numbers that appear in both lists of phone numbers.
Reading the phone numbers is a reusable function, so let's separate it:
import csv

def read_phone_numbers(file_path):
    file_obj = open(file_path, 'r')
    phone_numbers = []
    reader = csv.reader(file_obj)
    next(reader)  # skip the header row so 'TN' isn't counted as a number
    for row in reader:
        phone_numbers.append(row[0])
    file_obj.close()
    return phone_numbers
For the task of finding duplicates, a set() is a useful tool. From the Python docs:
A set is an unordered collection with no duplicate elements.
def find_duplicates(credit_nums, purchase_nums):
    phone_numbers = set(credit_nums)  # the unique credit numbers
    duplicates = []
    for phone_number in phone_numbers:
        credit_count = credit_nums.count(phone_number)
        purchase_count = purchase_nums.count(phone_number)
        if credit_count > 0 and purchase_count > 0:
            duplicates.append({
                'phone_number': phone_number,
                'credit_count': credit_count,
                'purchase_count': purchase_count,
            })
    return duplicates
And to put it all together:
def main(credit_csv_path, purchase_csv_path, out_csv_path):
    credit_nums = read_phone_numbers(credit_csv_path)
    purchase_nums = read_phone_numbers(purchase_csv_path)
    duplicates = find_duplicates(credit_nums, purchase_nums)
    with open(out_csv_path, 'w') as file_obj:
        writer = csv.DictWriter(
            file_obj,
            fieldnames=['phone_number', 'credit_count', 'purchase_count'],
        )
        writer.writerows(duplicates)
If you need to process files that are hundreds of times larger, you can look into the collections.Counter class, which counts everything in a single pass instead of calling list.count once per number.
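For reference, a minimal sketch of that counting step with collections.Counter, reusing read_phone_numbers and the file names from the question:

from collections import Counter

credit_counts = Counter(read_phone_numbers('AllCredits.csv'))
purchase_counts = Counter(read_phone_numbers('AllPurchases.csv'))
# A Counter returns 0 for a missing key, so numbers that never
# appear in the purchases file naturally count as 0
flagged = [tn for tn, n in credit_counts.items() if n != 2 * purchase_counts[tn]]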
The way I understand your situation is that you have two files, namely cred and pur.
Now, for each TN in cred, find whether the same TN exists in pur: return the count if it exists, or 0 if it doesn't.
You can use pandas, and the algorithm can be as below:
Aggregate pur by TN and count
For each row in cred, get that count, else 0
Below is an example:
import pandas as pd

# read the csv
# I create my own here, based on the description in your question
cred = pd.DataFrame(
    dict(
        TN=[2476, 2476, 3654],
        STUFF=['hseqer', 'sthrtj', 'rstrhh'],
        THINGS=['trjar', 'esreet', 'trwtr']
    ),
    columns=['TN', 'STUFF', 'THINGS']
)
pur = pd.DataFrame(
    dict(
        TN=[2476, 3566, 3654, 2476],
        STUFF=['hseher', 'sthztj', 'rstjhh', 'hseher'],
        THINGS=['trjdr', 'esrhet', 'trjtr', 'trjdr']
    ),
    columns=['TN', 'STUFF', 'THINGS']
)
dfpur = pur.groupby('TN').TN.count()  # agg and count (step 1)

# step 2
count = []
for row, tnval in enumerate(cred.TN):
    if cred.at[row, 'TN'] in dfpur.index:
        count.append(dfpur[tnval])
    else:
        count.append(0)
There you go! You have your counts in the count list.
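As a side note, the loop in step 2 can also be replaced by a vectorized lookup over the same frames; a minimal sketch:

# Map each TN in cred onto the aggregated counts; numbers that never
# appear in pur come back as NaN, which we fill with 0
count = cred['TN'].map(dfpur).fillna(0).astype(int).tolist()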
The code is supposed to find duplicates by comparing FirstName, LastName, and Email. All duplicates should be written to the Dupes.csv file, and all uniques should be written to Deduplicated.csv, but this is currently not happening.
Example:
If row A shows up in Original.csv 10 times, the code writes A1 to deduplicated.csv and writes A2-A10 to dupes.csv.
This is incorrect. A1-A10 should ALL be written to the dupes.csv file, leaving only unique rows in deduplicated.csv.
Another strange behavior is that A2-A10 are all getting written to dupes.csv TWICE!
I would really appreciate any and all feedback as this is my first professional python script and I'm feeling pretty disheartened.
Here is my code:
import csv

def read_csv(filename):
    the_file = open(filename, 'r', encoding='latin1')
    the_reader = csv.reader(the_file, dialect='excel')
    table = []
    #As long as the table row has values we will add it to the table
    for row in the_reader:
        if len(row) > 0:
            table.append(tuple(row))
    the_file.close()
    return table

def create_file(table, filename):
    join_file = open(filename, 'w+', encoding='latin1')
    for row in table:
        line = ""
        #build up the new row - don't comma on last item so add last item separately
        for i in range(len(row) - 1):
            line += row[i] + ","
        line += row[-1]
        #adds the string to the new file
        join_file.write(line + '\n')
    join_file.close()

def main():
    original = read_csv('Contact.csv')
    print('finished read')
    #hold duplicate values
    dupes = []
    #holds all of the values without duplicates
    dedup = set()
    #pairs to know if we have seen a match before
    pairs = set()
    for row in original:
        #if row in dupes:
            #dupes.append(row)
        if (row[4], row[5], row[19]) in pairs:
            dupes.append(row)
        else:
            pairs.add((row[4], row[5], row[19]))
            dedup.add(row)
    print('finished first parse')
    #go through and add in one more of each duplicate
    seen = set()
    for row in dupes:
        if row in seen:
            continue
        else:
            dupes.append(row)
            seen.add(row)
    print('writing files')
    create_file(dupes, 'duplicate_leads.csv')
    create_file(dedup, 'deduplicated_leads.csv')

if __name__ == '__main__':
    main()
You should look into the pandas module for this; it will be extremely fast, and much easier than rolling your own.
import pandas as pd

x = pd.read_csv('Contact.csv')
#use the names of the columns you want to check
duplicates = x.duplicated(['row4', 'row5', 'row19'], keep=False)
x[duplicates].to_csv('duplicates.csv')  #write duplicates
x[~duplicates].to_csv('uniques.csv')  #write uniques
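Applied to the question, and assuming the CSV headers are literally FirstName, LastName, and Email (plus the latin1 encoding from the original script), that would be:

import pandas as pd

x = pd.read_csv('Contact.csv', encoding='latin1')
# keep=False marks every member of a duplicate group, so A1-A10
# all land in the duplicates file, as required
duplicates = x.duplicated(['FirstName', 'LastName', 'Email'], keep=False)
x[duplicates].to_csv('duplicate_leads.csv', index=False)
x[~duplicates].to_csv('deduplicated_leads.csv', index=False)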
I am trying to determine the type of data contained in each column of a .csv file so that I can make CREATE TABLE statements for MySQL. The program makes a list of all the column headers, then grabs the first row of data, determines each data type, and appends it to the column header for proper syntax. For example:
ID  Number  Decimal  Word
0   17      4.8      Joe
That would produce something like CREATE TABLE table_name (ID int, Number int, Decimal float, Word varchar());.
The problem is that in some of the .csv files the first row contains a NULL value that is read as an empty string, which messes up this process. My goal is to search each row until one is found that contains no NULL values, and to use that one when forming the statement. This is what I have done so far, except it sometimes still returns rows that contain empty strings:
def notNull(p):  # where p is a .csv file that has been read in another function
    tempCol = next(p)
    tempRow = next(p)
    col = tempCol[:-1]
    row = tempRow[:-1]
    if any('' in row for row in p):
        tempRow = next(p)
        row = tempRow[:-1]
    else:
        rowNN = row
    return rowNN
Note: The .csv file reading is done in a different function; this function simply uses the already-read .csv file as input p. Also, each row ends with a ',' that is treated as an extra empty string, so I slice the last value off of each row before checking it for empty strings.
Question: What is wrong with the function I created that causes it to not always return a row without empty strings? I feel it is because the loop is not repeating as it should, but I am not quite sure how to fix this issue.
I cannot really decipher your code. This is what I would do to get only the rows without an empty string:
import csv

def g(name):
    with open(name, 'r') as f:  # use the argument instead of a hardcoded file name
        r = csv.reader(f)
        next(r)  # skip the header row
        for row in r:
            if row and row[-1] == '':
                row = row[:-1]  # drop the empty field from the trailing comma
            if '' not in row:
                yield row
for row in g('file.csv'):
    print('row without empty values: {}'.format(row))
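Since the goal is a CREATE TABLE statement, you only need the first fully populated row; a rough sketch of pulling it from the generator and inferring types (the int/float/varchar mapping is an assumption based on your example):

def infer_sql_type(value):
    # Try int first, then float, otherwise fall back to varchar
    try:
        int(value)
        return 'int'
    except ValueError:
        pass
    try:
        float(value)
        return 'float'
    except ValueError:
        return 'varchar(255)'

first_full_row = next(g('file.csv'))
print([infer_sql_type(v) for v in first_full_row])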