Python CSV row count using column name

I have a CSV file with 'n' columns. I need to get the row count of
each column, keyed by column name, and produce a dictionary in the following format:
csv_dict = {'col_a': 10, 'col_b': 20, 'col_c': 30}
where 10, 20 and 30 are the row counts of col_a, col_b and col_c respectively.
I obtained a list of columns using the fieldnames attribute of DictReader.
Now I need the row count of every column in my list.
This is what I tried:
for row in csv.DictReader(filename):
    col_count = sum(1 for row['col_a'] in re) + 1
This just gets the row count of column a. How do I get the row counts of all the columns in my list
and put them in a dictionary in the format above? Any help appreciated. Thanks and regards.

You can try this. First, save the following as FileName.csv:
Name,age,DOB
abhijeet,17,17/09/1990
raj,17,7/09/1990
ramesh,17,17/09/1990
rani,21,17/09/1990
mohan,21,17/09/1990
nil,25,17/09/1990
The following Python code collects the non-empty values of each column:

import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('FileName.csv') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # a row reads as {column1: value1, column2: value2, ...}
        for (k, v) in row.items():  # go over each column name and value
            if not v == '':
                columns[k].append(v)  # append the value to the list for column k

print(len(columns['Name']))  # print the length of the specified column
print(len(columns['age']))
print(len(columns['DOB']))
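To get the exact dictionary the question asks for, a dict comprehension over columns is a minimal sketch (assuming the columns defaultdict built above):

# build {column_name: count of non-empty rows} from the columns dict above
csv_dict = {k: len(v) for k, v in columns.items()}
print(csv_dict)  # for the sample file: {'Name': 6, 'age': 6, 'DOB': 6}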

I would use pandas!
import pandas as pd

# FULLNAME = path/filename.extension of the CSV file to read
data = pd.read_csv(FULLNAME, header=0)
# count empty values per column
nan_values = data.isnull().sum()
# multiply by -1
ds = nan_values.multiply(-1)
# add the total number of rows in the CSV
filled_rows = ds.add(len(data))
# create a dict from the resulting Series
csv_dict = filled_rows.to_dict()
If you want to preserve column name order, use an OrderedDict:
from collections import OrderedDict

csv_dict_ordered = OrderedDict()
for idx in filled_rows.index:
    csv_dict_ordered[idx] = filled_rows[idx]
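A shorter route, assuming the same data DataFrame as above: pandas' count() already returns the number of non-null cells per column, so the steps above collapse to one line (and in Python 3.7+ a plain dict preserves column order anyway).

# non-null count per column, straight to a dict
csv_dict = data.count().to_dict()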

Related

How to get the values occurring only once in first column of a csv file using python

I am new to Python. I'm trying to read a CSV with 700 lines (including a header) and get a list of the unique values in the first column.
Sample CSV:
SKU;PRICE;SUPPLIER
X100;100;ABC
X100;120;ADD
X101;110;ABV
X102;100;ABC
X102;105;ABV
X100;119;ABG
I used the example here
How to create a list in Python with the unique values of a CSV file?
so I did the following:
import csv
mainlist = []
with open('final_csv.csv', 'r', encoding='utf-8') as csvf:
    rows = csv.reader(csvf, delimiter=";")
    for row in rows:
        if row[0] not in rows:
            mainlist.append(row[0])
print(mainlist)
I noticed while debugging that rows holds 1 line, not 700, and I only get the
['SKU'] field. What did I do wrong?
Thank you.
A solution using pandas: call the unique method on the correct column. This returns a pandas Series with the unique values in that column, which you then convert to a list using the tolist method.
An example on the SKU column is below.
import pandas as pd
df = pd.read_csv('final_csv.csv', sep=";")
sku_unique = df['SKU'].unique().tolist()
If you don't know or care about the column name, you can use iloc with the column number. Note that the numbering starts at 0:
df.iloc[:,0].unique().tolist()
If the question intends to get only the values occurring exactly once, you can use the value_counts method. This creates a Series whose index holds the values of SKU and whose values are their counts; you then convert the matching part of the index to a list. Using the first example:
import pandas as pd
df = pd.read_csv('final_csv.csv', sep=";")
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
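For the sample CSV above, X100 appears three times, X102 twice and X101 once, so sku_single_counts would be ['X101'].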
If you want the unique values of the first column, you could modify your code to use a set instead of a list. Maybe like this:
import collections
import csv
filename = 'final_csv.csv'
sku_list = []
with open(filename, 'r', encoding='utf-8') as f:
    csv_reader = csv.reader(f, delimiter=";")
    for i, row in enumerate(csv_reader):
        if i == 0:
            # skip the header
            continue
        try:
            sku = row[0]
            sku_list.append(sku)
        except IndexError:
            pass
print('All SKUs:')
print(sku_list)
sku_set = set(sku_list)
print('SKUs after removing duplicates:')
print(sku_set)
c = collections.Counter(sku_list)
sku_list_2 = [k for k, v in c.items() if v == 1]
print('SKUs that appear only once:')
print(sku_list_2)
with open('output.csv', 'w') as f:
    for sku in sorted(sku_set):
        f.write('{}\n'.format(sku))
A solution using neither pandas nor csv:
lines = open('file.csv', 'r').read().splitlines()[1:]
col0 = [v.split(';')[0] for v in lines]
uniques = list(filter(lambda x: col0.count(x) == 1, col0))
or, using map (but less readable):
col0 = list(map(lambda line: line.split(';')[0], open('file.csv', 'r').read().splitlines()[1:]))
uniques = list(filter(lambda x: col0.count(x) == 1, col0))

Accessing the values of collections.defaultdict

I have a CSV file that I want to read column-wise. For that I have this code:
from collections import defaultdict
from csv import DictReader
columnwise_table = defaultdict(list)
with open("Weird_stuff.csv",'rU') as f:
reader = DictReader(f)
for row in reader:
for col,dat in row.items():
columnwise_table[col].append(dat)
#print(columnwise_table.items()) # this gives me everything
print(type(columnwise_table[2]) # I'm look for smt like this
My question is: how can I get all the elements of only one specific column? I'm not using conda, and the matrix is big (2400x980).
UPDATE
I have 980 columns and over 2000 rows. I need to work with the file column by column, say the 1st column: feature1, the 2nd column: j_ss01, the 50th column: Abs2, and so on.
Since I can't access the dict using the column names, I would like to use an index instead. Is this possible?
import csv
import collections

col_values = collections.defaultdict(list)
with open('Weird_stuff.csv', 'rU') as f:
    reader = csv.reader(f)
    # skip field names
    next(reader)
    for row in reader:
        for col, value in enumerate(row):
            col_values[col].append(value)

# for each numbered column you want...
col_index = 33  # for example
print(col_values[col_index])
If you know the columns you want in advance, only storing those columns could save you some space...
cols = {1, 5, 6, 234}
...
        for col, value in enumerate(row):
            if col in cols:
                col_values[col].append(value)
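If you later want to tie positions back to header names, here is one hedged sketch reusing the header row the code above skips (the column names are the ones mentioned in the update):

import csv

with open('Weird_stuff.csv', 'rU') as f:
    header = next(csv.reader(f))

# e.g. name_to_index['Abs2'] -> 49, usable as a key into col_values
name_to_index = {name: i for i, name in enumerate(header)}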
By iterating on row.items, you get all columns.
If you want only one specific column via index number, use csv.reader and column index instead.
from csv import reader

col_values = []
# column index number to get values from
col = 1
with open("Weird_stuff.csv", 'rU') as f:
    csv_reader = reader(f)  # renamed so it does not shadow the imported reader
    for row in csv_reader:
        col_val = row[col]
        col_values.append(col_val)

# contains only values from column index <col>
print(col_values)
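The same loop condenses to a list comprehension; a minimal equivalent sketch, assuming the same file and column index:

from csv import reader

col = 1
with open("Weird_stuff.csv", 'rU') as f:
    col_values = [row[col] for row in reader(f)]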

How do I read specific rows AND columns in CSV file while printing values within a certain range

Hi, I am having some problems looking at specific rows and columns in a CSV file.
My current goal is to look at 3 different columns out of the several that are there. On top of that, I want to look at the data values (e.g. 0.26) and keep the ones that are between 0.21 and 0.31 in a specific column. My issue is that I don't know how to do both of those at the same time. I keep getting errors telling me I can't use '<=' with float and str.
Here's my code:
import csv
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
with open('C:\\Users\\AdamStoer\\Documents\\practicedata.csv') as f:
    reader = csv.DictReader(f, delimiter=',')  # read rows into a dictionary format
    for row in reader:
        for columns['pitch'] in row:
            for v in columns['pitch']:
                p = float(v)
                if p <= 0.5:
                    columns['pitch'].append(v)
print(columns['pitch'])
This code was working before for the last part:
for row in reader:  # read a row as {column1: value1, column2: value2, ...}
    for (k, v) in row.items():  # go over each column name and value
        columns[k].append(v)  # append the value into the appropriate list
                              # based on column name k
print(columns['pitch'])
Looks like you're confusing a couple of things. If you know the specific column you want (pitch), you do not have to loop over all the columns in each row. You can access it directly, like so:
for row in reader:
    p = float(row['pitch'])
    if p <= 0.5:
        print(p)
It's hard for me to tell what output you want, but here's an example that looks at just the pitch in each row and, if it is a match, appends all the target values for that row to the columns dictionary.
targets = ('pitch', 'roll', 'yaw')
columns = defaultdict(list)
for row in reader:
    p = float(row['pitch'])
    if p >= 0.21 and p <= 0.31:
        for target in targets:
            columns[target].append(row[target])
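As a side note, the range test reads more naturally as a chained comparison, an equivalent variant of the condition above:

    if 0.21 <= p <= 0.31: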

How can I merge CSV rows that have the same value in the first cell?

This is the file: https://drive.google.com/file/d/0B5v-nJeoVouHc25wTGdqaDV1WW8/view?usp=sharing
As you can see, there are duplicates in the first column, but if I were to combine the duplicate rows, no data would get overridden in the other columns. Is there any way I can combine the rows with duplicate values in the first column?
For example, turn "1,A,A,," and "1,,,T,T" into "1,A,A,T,T".
Plain Python:
import csv
reader = csv.reader(open('combined.csv'))
result = {}
for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values
How this magic works:
iterate over rows in the CSV file
for every record, we check if there was a record with the same index before
if this is the first time we see this index, just copy the row values
if this is a duplicate, assign row values to empty cells.
The last step is done via the or trick: an empty CSV cell reads as the empty string '', which is falsy, so '' or value returns value, while value or anything returns value. Hence result[idx][i] or v returns the existing value if it is not empty, and the new row value otherwise.
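A quick check with the rows from the question (inline lists standing in for combined.csv):

rows = [['1', 'A', 'A', '', ''], ['1', '', '', 'T', 'T']]
result = {}
for row in rows:
    idx, values = row[0], row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values
print(result)  # {'1': ['A', 'A', 'T', 'T']}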
To write this out without losing the original row order, record each index as you read it, then iterate and output the corresponding result entries:
indices = []
for row in reader:
    # ...
    indices.append(idx)

writer = csv.writer(open('outfile.csv', 'w'))
for idx in indices:
    writer.writerow([idx] + result[idx])

Write last three entries per name in a file

I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows for her. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing, thank you so much. The final code, which seems to have been deleted from this post, is:
import collections
with open("Class2.txt", mode="r",encoding="utf-8") as fp:
count = collections.defaultdict(int)
rev = reversed(fp.readlines())
rev_out = []
for line in rev:
name, value = line.split(',')
if count[name] >= 3:
continue
count[name] += 1
rev_out.append((name, value))
out = list(reversed(rev_out))
print (out)
Since this looks like csv data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with the row so that they can be written out maintaining the same order as the input. Use a bound deque to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

# a bound deque keeps only the last three (line_number, row) pairs per name
by_name = defaultdict(lambda: deque(maxlen=3))

with open('my_data.csv') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# sort all kept rows by line number, then discard the numbers
rows = [row for i, row in sorted(item for value in by_name.values() for item in value)]

with open('out_data.csv', 'w') as f_out:
    csv.writer(f_out).writerows(rows)
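For the sample input above, out_data.csv would contain the five expected rows in order: John,5; Sarah,7; Sarah,8; John,4; Sarah,2 (the first Sarah row is dropped).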
