CSV in python getting wrong number of rows - python

I have a csv file that I want to work with using python. For that I'm run this code :
import csv
import collections
col_values = collections.defaultdict(list)
with open('list.csv', 'rU',) as f:
reader = csv.reader(f)
data = list(reader)
row_count =len(data)
print(" Number of rows ", row_count)
As a result I get 4357 but the file has only 2432 rows, I've tried to change the delimiter in the reader() function, it didn't change the result.
So my question is, does anybody has an explanation why do I get this value ?
thanks in advance
UPDATE
since the number of column is also too big , here is the output of the last row and the start of non existing rows for one columns
opening the file with excel the end looks like :
I hope it helps

try using pandas.
import pandas as pd
df = pd.read_csv('list.csv')
df.count()
check whether you are getting proper rows now

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows in two columns that("Length" and "Protein Name") from them. How do I specify the condition for this purpose.
import csv #importing the csv module or method
#opening a new csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')
type(file)
#reading the csv file
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
#extracting rows from the csv file
rows = []
for row in csvreader:
rows.append(row)
print(rows)
I am a beginner in python bioinformatic data analysis and I haven't tried any extensive methods. I don't how to proceed from here. I have done the work opening and reading the csv file. I have also extracted the column headers. But I don't know how to proceed from here. Please help.
try this :
csvreader= csvreader[csvreader["columnName"].str.contains("string to delete") == False]
It will be better to read scv in pandas since you have lots of rows. That will be the smart decision to make. And also set your conditional variables which you will use to perform the operation. If this do not help. I will suggest you provide a sample data of your scv file.
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protain name"
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a scv file if you want:
df.to_csv("'C:\\Users\\Admin\\Downloads\\new_csv.csv'", index=False)

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because the , the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file,'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
## remove obfuscating characters
for c in '"{}\n':
line = line.replace(c,'')
line = line.split(',')
## extract values to two lists
id_vals.append(line[0][3:])
currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:','value']
replace_str = ''
for i in find_str:
my_csv_text = re.sub(i, replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json
for row in df.iteritems():
print(json.loads(row.value)["id"])
# OR
print(eval(row.value)["id"])

Read CSV with comma delimiter (sorting issue while importing csv)

I am trying to open a csv file by skipping first 5 rows. The data is not getting aligned in dataframe. See screenshot of file
PO = pd.DataFrame()
PO = pd.read_table(acct.csv',sep='\t',skiprows=5,skip_blank_lines=True)
PO
try to set it after import datewise as below.
First sort your data with proper import as it is sticked to the index values. see data image again and data as well. So, when you have proper separator / delimiter you can do following.
do = pd.read_csv('check_test.csv', "r", delimiter='\t', skiprows=range(1, 7),skip_blank_lines=True, encoding="utf8")
d01 = do.iloc[:,1:7]
d02 = d01.sort_values('Date,Reference,Debit')
This is sorting the values into the way you want.

Reading rows in CSV file and appending a list creates a list of lists for each value

I am copying list output data from a DataCamp course so I can recreate the exercise in Visual Studio Code or Jupyter Notebook. From DataCamp Python Interactive window, I type the name of the list, highlight the output and paste it into a new file in VSCode. I use find and replace to delete all the commas and spaces and now have 142 numeric values, and I Save As life_exp.csv. Looks like this:
43.828
76.423
72.301
42.731
75.32
81.235
79.829
75.635
64.062
79.441
When I read the file into VSCode using either Pandas read_csv or csv.reader and use values.tolist() with Pandas or a for loop to append an existing, blank list, both cases provide me with a list of lists which then does not display the data correctly when I try to create matplotlib histograms.
I used NotePad to save the data as well as a .csv and both ways of saving the data produce the same issue.
import matplotlib.pyplot as plt
import csv
life_exp = []
with open ('C:\data\life_exp.csv', 'rt') as life_expcsv:
exp_read = csv.reader(life_expcsv, delimiter = '\n')
for row in exp_read:
life_exp.append(row)
And
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
life_exp = life_exp_df.values.tolist()
When you print life_exp after importing using csv, you get:
[['43.828'],
['76.423'],
['72.301'],
['42.731'],
['75.32'],
['81.235'],
['79.829'],
['75.635'],
['64.062'],
['79.441'],
['56.728'],
….
And when you print life_exp after importing using pandas read_csv, you get the same thing, but at least now it's not a string:
[[43.828],
[76.423],
[72.301],
[42.731],
[75.32],
[81.235],
[79.829],
[75.635],
[64.062],
[79.441],
[56.728],
…
and when you call plt.hist(life_exp) on either version of the list, you get each value as bin of 1.
I just want to read each value in the csv file and put each value into a simple Python list.
I have spent days scouring stackoverflow thinking someone has done this, but I can't seem to find an answer. I am very new to Python, so your help greatly appreciated.
Try:
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
# Select the values of your first column as a list
life_exp = life_exp_df.iloc[:, 0].tolist()
instead of:
life_exp = life_exp_df.values.tolist()
With csv reader, it will parse the line into a list using the delimiter you provide. In this case, you provide \n as the delimiter but it will still take that single item and return it as a list.
When you append each row, you are essentially appending that list to another list. The simplest work-around is to index into row to extract that value
with open ('C:\data\life_exp.csv', 'rt') as life_expcsv:
exp_read = csv.reader(life_expcsv, delimiter = '\n')
for row in exp_read:
life_exp.append(row[0])
However, if your data is not guaranteed to be formatted the way you have provided, you will need to handle that a bit differently:
with open ('C:\data\life_exp.csv', 'rt') as life_expcsv:
exp_read = csv.reader(life_expcsv, delimiter = '\n')
for row in exp_read:
for number in row:
life_exp.append(number)
A bit cleaner with list comprehension:
with open ('C:\data\life_exp.csv', 'rt') as life_expcsv:
exp_read = csv.reader(life_expcsv, delimiter = '\n')
[life_exp.append(number) for row in exp_read for number in row]

Merging column values into one column without comma on python (without pandas)

Question on how to merge different column values into out without comma on python...
My task as like this.
A big csv file data has following rows
s,0,6,8,9,2,-,3,6,2,8,7,1,0,n,.,c,s,v
s,0,5,9,6,0,-,3,6,7,0,1,6,0,n,.,c,s,v
s,1,9,0,5,5,-,3,6,1,5,5,8,6,n,.,c,s,v
s,2,8,0,7,9,-,3,2,5,1,8,2,7,n,.,c,s,v
s,0,0,5,6,5,-,3,3,4,0,5,7,0,n,.,c,s,v
s,3,0,3,4,8,-,3,5,9,1,2,2,6,n,.,c,s,v
s,0,3,8,8,9,-,3,7,3,1,0,2,5,n,.,c,s,v
I want to make this look like follow:
06892
05960
19055
28079
00565
30348
03889
I attempted following code without success.
import csv, os
with open ('/Desktop/case.csv','r') as h:
reader = csv.reader(h)
for row in reader:
k = row[1:6]
print(k)
When I did this, following results come up.
0,6,8,9,2
0,5,9,6,0
1,9,0,5,5
2,8,0,7,9
0,0,5,6,5
3,0,3,4,8
0,3,8,8,9
How to make this look like my desired output, i.e. without commas?
Use join:
from io import StringIO
import csv
txtfile = StringIO("""s,0,6,8,9,2,-,3,6,2,8,7,1,0,n,.,c,s,v
s,0,5,9,6,0,-,3,6,7,0,1,6,0,n,.,c,s,v
s,1,9,0,5,5,-,3,6,1,5,5,8,6,n,.,c,s,v
s,2,8,0,7,9,-,3,2,5,1,8,2,7,n,.,c,s,v
s,0,0,5,6,5,-,3,3,4,0,5,7,0,n,.,c,s,v
s,3,0,3,4,8,-,3,5,9,1,2,2,6,n,.,c,s,v
s,0,3,8,8,9,-,3,7,3,1,0,2,5,n,.,c,s,v""")
reader = csv.reader(txtfile)
for row in reader:
k = row[1:6]
print(''.join(k))
Output:
06892
05960
19055
28079
00565
30348
03889

Categories