I'm new to Python and I'm having trouble removing unwanted rows from a CSV file. For instance, I have 3 columns and a lot of rows:
A B C
hi 1.0 5
hello 2.0 6
ok 3.0 7
I loaded the data using numpy (instead of csv)
import numpy as np
a= np.loadtxt('data.csv' , delimiter= ',' , skiprows= 1)
I want to introduce a range for the 2nd column
b=np.arange(0, 2.1,0.1)
I don't have any idea how I should use that piece of code.
What I want as a final output is the following:
A B C
hi 1.0 5
hello 2.0 6
The last row would be removed, since I chose a range for the 2nd column that only goes up to 2.0. I have no idea how to accomplish this.
Try with Pandas:
import pandas as pd
a = pd.read_csv('data.csv', index_col=0) # column A will be the index.
a
B C
A
hi 1 5
hello 2 6
ok 3 7
For every value of B up to 2 :
a[a.B <= 2]
B C
A
hi 1 5
hello 2 6
Details :
a.B
A
hi 1
hello 2
ok 3
Name: B, dtype: float64
a.B <= 2
A
hi True
hello True
ok False
Name: B, dtype: bool
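If the range also needs a lower bound, as in the question's 0 to 2.0, one option is Series.between, which includes both ends by default. This is only a minimal sketch: it assumes a comma-separated file like the sample, and the in-memory `sample` text is a stand-in for data.csv.

```python
import io

import pandas as pd

# Stand-in for data.csv (assumed comma-separated, as in the question's loadtxt call).
sample = io.StringIO("A,B,C\nhi,1.0,5\nhello,2.0,6\nok,3.0,7\n")
a = pd.read_csv(sample, index_col=0)  # column A becomes the index

# Keep rows whose B value lies in [0, 2.0]; both bounds are included by default.
filtered = a[a["B"].between(0, 2.0)]
print(filtered)
```

Calling filtered.to_csv('filtered.csv') afterwards would write the remaining rows back out, if a file is what you need.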
You can do it using logical (boolean) indexing. With x being the loaded numeric array, build a mask from the 2nd column:
index = (x[:, 1] <= 2.0)
Then
x = x[index]
keeps only the rows that satisfy this condition.
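Here is a small self-contained sketch of that masking idea. The array is made up for illustration: a numeric id stands in for the text column A, because a plain float array cannot hold strings like 'hi'.

```python
import numpy as np

# Hypothetical numeric data: [id, B, C]; the id replaces the text column A.
x = np.array([
    [0, 1.0, 5],
    [1, 2.0, 6],
    [2, 3.0, 7],
])

# Combine a lower and an upper bound on the 2nd column into one boolean mask.
index = (x[:, 1] >= 0.0) & (x[:, 1] <= 2.0)

# Boolean indexing keeps only the rows where the mask is True.
filtered = x[index]
print(filtered)
```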
You can just use the csv module. N.B. the following expects the CSV fields to be comma-separated, not tab-separated (as your sample suggests).
import csv
with open('data.csv') as data:
    reader = csv.reader(data)  # or csv.reader(data, delimiter='\t') for tabs
    field_names = next(reader)
    filtered_rows = [row for row in reader if 0 <= float(row[1]) <= 2.0]
>>> field_names
['A', 'B', 'C']
>>> filtered_rows
[['hi', '1.0', '5'], ['hello', '2.0', '6']]
>>> filtered_rows.insert(0, field_names)
>>> filtered_rows
[['A', 'B', 'C'], ['hi', '1.0', '5'], ['hello', '2.0', '6']]
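To get an actual file out of this, the rows can be handed to csv.writer. A minimal sketch, assuming the header has already been inserted as above; `write_rows` is a hypothetical helper name and an in-memory buffer stands in for the output file:

```python
import csv
import io


def write_rows(rows, out):
    """Write a list of rows (header first) to an open text stream as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


# Rows as produced above, header included.
filtered_rows = [['A', 'B', 'C'], ['hi', '1.0', '5'], ['hello', '2.0', '6']]

buffer = io.StringIO()  # stand-in for open('filtered.csv', 'w', newline='')
write_rows(filtered_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```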
If you require that values be exact tenths within the required range, then you can do this:
import csv
import numpy as np
allowed_values = np.arange(0, 2.1, 0.1)
with open('data.csv') as data:
    reader = csv.reader(data)
    field_names = next(reader)
    filtered_rows = [row for row in reader if float(row[1]) in allowed_values]
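One caveat with the `in` test: np.arange builds its values by floating-point arithmetic, so some tenths (0.3, for example) may not compare exactly equal to the parsed value. A tolerance-based check is a safer sketch. The helper name `is_allowed` and the tolerance 1e-9 are my own assumptions, not part of the question.

```python
import numpy as np

allowed_values = np.arange(0, 2.1, 0.1)


def is_allowed(value, allowed=allowed_values, tol=1e-9):
    """Return True if value is within tol of any entry in allowed."""
    # Absolute tolerance only, so values near zero behave the same as the rest.
    return bool(np.isclose(allowed, value, rtol=0.0, atol=tol).any())


rows = [['hi', '0.3', '5'], ['hello', '2.0', '6'], ['ok', '3.0', '7'], ['no', '0.35', '8']]
filtered_rows = [row for row in rows if is_allowed(float(row[1]))]
print(filtered_rows)
```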
Edit after updated requirements
With extra constraints on column "C", e.g. value must be >= 6.
import csv
import numpy as np
allowed_values_B = np.arange(0, 2.1, 0.1)
def accept_row(row):
    return (float(row[1]) in allowed_values_B) and (int(row[2]) >= 6)
with open('data.csv') as data:
    reader = csv.reader(data)
    field_names = next(reader)
    filtered_rows = [row for row in reader if accept_row(row)]
>>> filtered_rows
[['hello', '2.0', '6']]
Here is an image of my CSV file:
import csv
f = open("datatest.csv")
reader = csv.reader(f)
dataListed = [row for row in reader]
rc = csv.writer(f)
column1 = []
for row in dataListed:
    column1.append(row[0])
column2 = []
for row in dataListed:
    column2.append(row[1])
divide = []
for row in dataListed:
    divide = row[1] / row[2]
print(divide)
Why does the "divide" list not work? Everything else works as it should, but I always get an error for that part that says something about strings, and when I try to convert row[1] and row[2] to float, it breaks too! Help is greatly appreciated.
I am a pure beginner. Thanks
There are several issues in your code.
Firstly, your dataListed looks like this:
[['lis1', 'lis2'], ['1', '2'], ['2', '7'], ['3', '9'], ['10', '10']]
For the first (header) row, you are trying to divide two strings, like so:
divide = 'lis1'/'lis2' - TypeError: unsupported operand type(s) for /: 'str' and 'str'
so you need to remove the first element (the header row) from the list, and convert the remaining values from strings to numbers before dividing.
Secondly,
divide = row[1] / row[2]
Each row has only 2 elements and list indices start at 0, so the valid indices are 0 and 1 (after converting both values to numbers):
divide = row[0] / row[1]
Complete code after correction:
import csv
f = open(r"Tomas.csv")
reader = csv.reader(f)
dataListed = [row for row in reader]
f.close()  # the file is only read, so no csv.writer is needed
column1 = []
for row in dataListed:
    column1.append(row[0])
column2 = []
for row in dataListed:
    column2.append(row[1])
dataListed.pop(0)  # drop the header row ['lis1', 'lis2']
divide = []
for row in dataListed:
    re = int(row[1]) / int(row[0])  # lis2 / lis1, converted from strings
    divide.append(re)
print(divide)
Gives #
[2.0, 3.5, 3.0, 1.0]
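The same result can be had a bit more compactly by skipping the header with next() and converting while dividing. This is just a sketch: the in-memory `sample` is a stand-in for the CSV file and assumes it is comma-separated with exactly two columns.

```python
import csv
import io

# Stand-in for the CSV file from the question (assumed layout).
sample = io.StringIO("lis1,lis2\n1,2\n2,7\n3,9\n10,10\n")

reader = csv.reader(sample)
next(reader)  # skip the header row so only numeric strings remain

# csv always yields strings, so convert before dividing (lis2 / lis1).
divide = [float(second) / float(first) for first, second in reader]
print(divide)  # [2.0, 3.5, 3.0, 1.0]
```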
Have you considered using other libraries, Thomas?
Using pandas makes this very easy.
Say your CSV looks like this:
lis1 lis2
0 1 2
1 2 7
2 3 9
3 10 10
Then
import pandas as pd
df = pd.read_csv(r"Thomas.csv")
df['new_list_after_Divison'] = (df['lis2']/df['lis1'])
print(df)
Gives #
lis1 lis2 new_list_after_Divison
0 1 2 2.0
1 2 7 3.5
2 3 9 3.0
3 10 10 1.0
What is the best way to take this string:
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
1
2
3
4
a
b
c
d
and transform to a CSV containing 6 columns?
Desired output
Is a CSV which will be imported into Pandas:
1,a,1,a,1,a
2,b,2,b,2,b
etc..
Updated desired output as per comments to 6 rows.
Updated. I can get the first row like this if I assign the string to a variable l:
l.split()[0::4]
['1', 'a', '1', 'a', '1', 'a']
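with open('data.txt', 'r') as f:
    data = f.read().splitlines()  # splitlines() avoids a trailing empty string
with open('data.csv', 'a') as out:  # 'out' rather than 'csv', to avoid shadowing the csv module
    for i in range(4):
        d = list()
        for j in range(i, len(data), 4):
            d.append(data[j])
        out.write(','.join(d) + "\n")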
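The inner loop can also be written with the slice from the question, taking every 4th item starting at offset i. A minimal sketch, with the question's string rebuilt in memory and a buffer standing in for the output file:

```python
import csv
import io

# Rebuild the question's string: the block 1,2,3,4,a,b,c,d repeated three times.
text = "\n".join(["1", "2", "3", "4", "a", "b", "c", "d"] * 3)
lines = text.split()

# Row i takes every 4th item starting at offset i, giving 4 rows of 6 columns.
rows = [lines[i::4] for i in range(4)]

out = io.StringIO()  # stand-in for open('data.csv', 'w', newline='')
csv.writer(out, lineterminator="\n").writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```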
Even though Art's answer is accepted, here is another way using pandas. You wouldn't need to export the data prior to importing with pandas if you use something like this.
import pandas as pd
myFile = "lines_to_read2.txt"
mycolumns = 4
columnNames = ['col1', 'col2', 'col3', 'col4']
rows = []  # DataFrame.append was removed in pandas 2.0, so collect the rows first
thisItem = list()
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        thisItem.append(thisLine.strip('\n, " "'))
        if len(thisItem) == mycolumns:
            rows.append(thisItem)
            thisItem = list()
myData = pd.DataFrame(rows, columns=columnNames)
myData.to_csv('lines_as_csv_file.csv', index=False)
print(myData) # Full Table
I have a large space separated input file input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
A B
1 3 4
How do I get it to print this output (C = A*d[B]):
A B C
1 3 4 9
You can use a generator to work on the chunks one at a time:
Code:
def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        rows_idx = chunk['A'] > 2
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
import pandas as pd
df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
A B C
1 3 4 9.0
2 4 4 12.0
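A variation on the same generator idea uses Series.map for the dictionary lookup instead of a row-wise apply. Treat it as a sketch: the function name `filter_and_extend` and the chunk size of 2 are assumptions chosen for the demo, and it expects every kept B value to be a key of the dictionary.

```python
import io

import pandas as pd


def filter_and_extend(source, factors, chunksize=2):
    """Yield filtered chunks with a new column C = A * factors[B]."""
    reader = pd.read_csv(source, sep=r"\s+", comment="#", chunksize=chunksize)
    for chunk in reader:
        kept = chunk[chunk["A"] > 2].copy()  # copy so adding a column is safe
        kept["C"] = kept["A"] * kept["B"].map(factors)  # vectorised lookup
        yield kept


data = io.StringIO("## Header\n# More header here\nA B\n1 2\n3 4\n4 4\n")
df = pd.concat(filter_and_extend(data, {2: 2, 4: 3}))
print(df)
```

Only one chunk is held in memory at a time; the concat at the end collects just the rows that survived the filter.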
I would like to create a list for every column in a txt file.
The file looks like this:
NAME S1 S2 S3 S4
A 1 4 3 1
B 2 1 2 6
C 2 1 3 5
PROBLEM 1 . How do I dynamically make the number of lists that fit the number of columns, such that I can fill them? In some files I will have 4 columns, others I will have 6 or 8...
PROBLEM 2. What is a pythonic way to iterate through each column and make a list of the values like this:
list_s1 = [1,2,2]
list_s2 = [4,1,1]
etc.
Right now I have read in the txt file and I have each individual line. As input I give the number of NAMES in a file (here HOW_MANY_SAMPLES = 4)
def parse_textFile(file):
    list_names = []
    with open(file) as f:
        header = f.next()
        head_list = header.rstrip("\r\n").split("\t")
        for i in f:
            e = i.rstrip("\r\n").split("\t")
            list_names.append(e)
    for i in range(1, HOW_MANY_SAMPLES):
        l+i = []
        l+i.append([a[i] for a in list_names])
I need a dynamic way of creating and filling the number of lists that correspond to the amount of columns in my table.
By using pandas you can create a list of lists or a dict to get what you are looking for.
Create a dataframe from your file, then iterate through each column and add it to a list or dict.
from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"
import pandas as pd
TESTDATA = StringIO("""NAME S1 S2 S3 S4
A 1 4 3 1
B 2 1 2 6
C 2 1 3 5""")
columns = []
c_dic = {}
df = pd.read_csv(TESTDATA, sep=" ", engine='python')
for column in df:
    columns.append(df[column].tolist())
    c_dic[column] = df[column].tolist()
Then you will have a list of lists for all the columns
for x in columns:
    print(x)
Returns
['A', 'B', 'C']
[1, 2, 2]
[4, 1, 1]
[3, 2, 3]
[1, 6, 5]
and
for k, v in c_dic.items():
    print(k, v)
returns
NAME ['A', 'B', 'C']
S1 [1, 2, 2]
S2 [4, 1, 1]
S3 [3, 2, 3]
S4 [1, 6, 5]
if you need to keep track of both the column names and the data.
Problem 1:
You can use len(head_list) instead of having to specify HOW_MANY_SAMPLES.
You can also try using Python's CSV module and setting the delimiter to a space or a tab instead of a comma.
See this answer to a similar StackOverflow question.
Problem 2:
Once you have a list representing each row, you can use zip to create lists representing each column:
See this answer.
With the CSV module, you can follow this suggestion, which is another way to invert the data from row-based lists to column-based lists.
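As a rough illustration of the zip idea from Problem 2: once each line is split into a list, zip(*rows) regroups the values by column, however many columns there are. The in-memory `text` stands in for the file, and whitespace-separated fields are an assumption.

```python
# Stand-in for the contents of the text file (assumed whitespace-separated).
text = """NAME S1 S2 S3 S4
A 1 4 3 1
B 2 1 2 6
C 2 1 3 5"""

rows = [line.split() for line in text.splitlines()]
header, *records = rows  # first row holds the column names

# zip(*records) transposes rows into columns; pair each with its header name.
columns = {name: list(values) for name, values in zip(header, zip(*records))}

print(columns["S1"])  # ['1', '2', '2']
print(columns["S2"])  # ['4', '1', '1']
```

The values stay strings here; something like [int(v) for v in columns["S1"]] would convert a numeric column.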
Sample:
import csv
# open the file; newline='' lets the csv module handle line endings
# (the old 'rU' mode was removed in Python 3.11)
with open('data.txt', newline='') as infile:
    # register a dialect that skips extra whitespace
    csv.register_dialect('ignorespaces', delimiter=' ', skipinitialspace=True)
    # read the file as a dictionary for each row ({header : value})
    reader = csv.DictReader(infile, dialect='ignorespaces')
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                if (header):
                    data[header].append(value)
            except KeyError:
                data[header] = [value]
for column in data.keys():
    print (column + ": " + str(data[column]))
this yields:
NAME: ['A', 'B', 'C']
S1: ['1', '2', '2']
S2: ['4', '1', '1']
S3: ['3', '2', '3']
S4: ['1', '6', '5']
Is there an efficient way to store each column of a tab-delimited file in a separate dictionary using python?
A sample input file: (Real input file contains thousands of lines and hundreds of columns. Number of columns is not fixed, it changes frequently.)
A B C
1 4 7
2 5 8
3 6 9
I need to print values in column A:
for cell in mydict["A"]:
    print(cell)
and to print values in the same row:
for i in range(1, numrows):
    for key in keysOfMydict:
        print(mydict[key][i])
The simplest way is to use DictReader from the csv module:
import csv
with open('somefile.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = list(reader)  # If your file is not large, you can
                         # consume it entirely
    # If your file is large, you might want to
    # step over each row:
    #for row in reader:
    #    print(row['A'])
for row in rows:
    print(row['A'])
@Marius made a good point: you might be looking to collect all columns separately by their header.
If that's the case, you'll have to adjust your reading logic a bit:
from collections import defaultdict
by_column = defaultdict(list)
for row in rows:
    for k, v in row.items():  # .iteritems() in Python 2
        by_column[k].append(v)
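With the columns collected like that, both things the question asks for (one column, or all values in one row) become simple lookups. A self-contained sketch, where the in-memory `sample` stands in for the tab-delimited file and `row_values` is a hypothetical helper:

```python
import csv
import io
from collections import defaultdict

# Stand-in for the tab-delimited input file from the question.
sample = io.StringIO("A\tB\tC\n1\t4\t7\n2\t5\t8\n3\t6\t9\n")

by_column = defaultdict(list)
for row in csv.DictReader(sample, delimiter="\t"):
    for key, value in row.items():
        by_column[key].append(value)


def row_values(columns, i):
    """Return the i-th value of every column, in header order."""
    return [columns[key][i] for key in columns]


print(by_column["A"])            # ['1', '2', '3']
print(row_values(by_column, 1))  # ['2', '5', '8']
```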
Another option is pandas:
>>> import pandas as pd
>>> i = pd.read_csv('foo.csv', sep=' ')
>>> i
A B C
0 1 4 7
1 2 5 8
2 3 6 9
>>> i['A']
0 1
1 2
2 3
Name: A, dtype: int64
Not sure this is relevant, but you can do this using rpy2.
from rpy2 import robjects
dframe = robjects.DataFrame.from_csvfile('/your/csv/file.csv', sep=' ')
d = dict([(k, list(v)) for k, v in dframe.items()])
output:
{'A': [1, 2, 3], 'C': [7, 8, 9], 'B': [4, 5, 6]}