How to delete unwanted rows from a CSV in Python?

I have a csv file abc.csv with two columns, Name and ID, and a list of ids, l_id. I want to delete all rows from the csv whose ID is not in l_id. I have tried the following:
l_id=[18850080, 535553, 19292162, 1728035, 1179719, 19194894, 22817838, 19997487, 19728145, 1457232, 13560402, 18855476, 7151442, 18955830, 11294262, 18506072, 1360698]
Name ID
0 2069 19277993
1 625050 19277900
2 1939496 19277793
3 2606806 19275471
4 3438546 19273652
5 4211111 7151442
6 4353024 19200001
7 5175848 11294262
8 5300858 1360698
9 5636006 535553
10 5729989 19277800
11 6045513 19277320
12 6160486 19277458
13 6540851 19276852
14 6752008 19277363
15 7643395 19997487
16 7644736 19292162
17 7712083 19292100
18 7768516 19292169
19 7809273 1360698
import csv

with open('abc.csv', 'r') as inp, open('abc_edit.csv', 'w') as out:
    writer = csv.writer(out)
    for line in inp:
        if not set(line.split()).isdisjoint(set(l_id)):
            writer.writerow(line)

Use the pandas isin function:

import pandas as pd

df = pd.read_csv('csv_name')
# Keep only the rows whose ID exists in l_id
df = df.loc[df['ID'].isin(l_id)]
df.to_csv('abc_edit.csv', index=False)

You can convert l_id to a set first, and use a generator expression in the writerows method of the CSV writer to keep only the rows whose id is in the set:

import csv

ids = set(l_id)
with open('abc.csv', 'r', newline='') as inp, open('abc_edit.csv', 'w', newline='') as out:
    reader = csv.reader(inp)
    writer = csv.writer(out)
    writer.writerow(next(reader))  # write the header unconditionally
    # IDs read from the file are strings, so convert before the membership test
    writer.writerows((name, id_) for name, id_ in reader if int(id_) in ids)
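To sanity-check the filtering logic, here is a quick run with a few rows inlined in place of abc.csv (values taken from the question's data):

```python
# Inline sample rows standing in for abc.csv (Name, ID), taken from the question
l_id = [7151442, 11294262, 1360698, 535553, 19997487, 19292162]
rows = [
    ["2069", "19277993"],
    ["4211111", "7151442"],
    ["5300858", "1360698"],
    ["5636006", "535553"],
]

ids = set(l_id)
# IDs read from a csv are strings, so convert before the membership test
kept = [row for row in rows if int(row[1]) in ids]
```

Only the three rows whose ID appears in l_id survive the filter.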

Related

Search for a common value (patient ID) in column 1 and, if all of that ID's values in the other column (pathologies) are null, delete the rows of those common IDs.

PATIENT_ID PATHOLOGIES
12 null
12 null
3 patho1
3 null
5 patho2
2 patho1
12 null
As you can see, patient ID 12 is always null, while the others can be null or have pathologies.
If the same ID is always null, I want to delete all of its related rows across all columns.
Note: I have 2 million IDs, so I need code that searches through them (Python, CSV).
To remove all patients with only "null" values you can use this example:

import csv
from itertools import groupby

with open("input.csv", "r", newline="") as f_in:
    reader = csv.reader(f_in)
    next(reader)  # skip header
    out = []
    # Sort so rows with the same patient ID are adjacent, then group by ID
    for id_, g in groupby(sorted(reader), lambda k: k[0]):
        g = list(g)
        # Drop this patient entirely if every pathology is "null"
        if all(pathology == "null" for _, pathology in g):
            continue
        out.extend(g)

with open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["PATIENT_ID", "PATHOLOGIES"])
    writer.writerows(out)
This creates output.csv:

PATIENT_ID,PATHOLOGIES
2,patho1
3,null
3,patho1
5,patho2
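With 2 million IDs, a pandas alternative may also be worth considering: a grouped transform marks the patients to keep in one pass. A sketch with the question's rows inlined in place of input.csv (keep_default_na=False stops pandas from parsing the literal string "null" as NaN):

```python
import io
import pandas as pd

# The question's rows, inlined in place of reading input.csv
csv_text = """PATIENT_ID,PATHOLOGIES
12,null
12,null
3,patho1
3,null
5,patho2
2,patho1
12,null
"""
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# True for every row of a patient that has at least one non-"null" pathology
keep = df.groupby("PATIENT_ID")["PATHOLOGIES"].transform(lambda s: (s != "null").any())
result = df[keep]
```

Patient 12's rows are all dropped; the remaining frame can then be written with result.to_csv.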

How can I calculate the sum of a column (taking only specific rows of it) in Python using a csv file?

Level ColumntoSum
1 4
2 10
1 3
2 23
1 15
2 2
So imagine this is my CSV file. It contains two columns, Level and ColumntoSum; Level is [1, 2, 1, 2, 1, 2] and ColumntoSum has a number next to each level.
What I need is to calculate the sum of ColumntoSum for Level=1 alone and for Level=2 alone, then save the results to another CSV file in this form (the second column holds the sum for each level):
Level Column
1 Sum1
2 Sum2
After reading your CSV file with pandas:

import pandas as pd

df = pd.read_csv('name_of_your_file.csv')

You can use the pandas groupby() function to group the rows by Level and the sum() function to calculate the sum of each group, as shown below:

df = df.groupby('Level').sum()
display(df)

OUTPUT:

       ColumntoSum
Level
1               22
2               35

Then save the result to a CSV file:

df.to_csv('out.csv', index=True)
df.groupby(['Level']).sum()

This will generate your sums.
Try this:

import csv

with open('data.csv', newline='') as fp:
    reader = csv.DictReader(fp)
    res = {}
    for row in reader:
        res.setdefault(row['Level'], []).append(int(row['ColumntoSum']))

with open('output.csv', 'w', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerow(('Level', 'Column'))
    for k, v in res.items():
        writer.writerow((k, sum(v)))
Using pandas
import pandas as pd
df = pd.read_csv('data.csv')
df = df.groupby('Level', as_index=False)['ColumntoSum'].sum().rename(columns={'ColumntoSum': 'Column'})
print(df)
Output:
Level Column
0 1 22
1 2 35
This can be done using itertools.groupby to group the rows by level and then summing each group. The operator.itemgetter function can be used to extract column values efficiently.
import csv
import itertools
import operator

# Define functions to fetch the columns we want
levelgetter = operator.itemgetter(0)
col2sumgetter = operator.itemgetter(1)

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    # Skip the header row
    next(reader)
    # Sort the rows by level (required for groupby)
    sorted_rows = sorted(reader, key=levelgetter)
    # Loop over the groups and sum the values
    for level, group in itertools.groupby(sorted_rows, key=levelgetter):
        total = sum(int(col2sumgetter(row)) for row in group)
        print(level, total)
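The question also asks to save the sums to a second CSV rather than print them; a sketch of that final step with csv.writer, using the question's rows inlined (and io.StringIO standing in for the output file):

```python
import csv
import io
import itertools
import operator

# The question's rows, inlined in place of data.csv (header already stripped)
rows = [["1", "4"], ["2", "10"], ["1", "3"], ["2", "23"], ["1", "15"], ["2", "2"]]

sort_key = operator.itemgetter(0)
sums = [
    (level, sum(int(r[1]) for r in group))
    for level, group in itertools.groupby(sorted(rows, key=sort_key), key=sort_key)
]

# Write the per-level sums in the Level/Column layout the question asks for
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(("Level", "Column"))
writer.writerows(sums)
```

Replacing the StringIO with `open('output.csv', 'w', newline='')` writes the same rows to disk.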

Python: merge csv data with differing headers

I have a bunch of software output files that I have manipulated into csv-like text files. I have probably done this the hard way, because I am not too familiar with the Python libraries.
The next step is to gather all this data into one single csv file. The files have different headers, or are sorted differently.
Lets say this is file A:
A | B | C | D | id
0 2 3 2 "A"
...
and this is file B:
B | A | Z | D | id
4 6 1 0 "B"
...
I want the append.csv file to look like:
A | B | C | D | Z | id
0 2 3 2 "A"
6 4 0 1 "B"
...
How can I do this, elegantly? Thank you for all answers.
You can use pandas to read CSV files into DataFrames and use the concat method, then write the result to CSV:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df = pd.concat([df1, df2], axis=0, ignore_index=True)
df.to_csv("file.csv", index=False)
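It may help to see that concat aligns columns by name and fills any column missing from a file with NaN. A small sketch with inline frames standing in for the question's two files:

```python
import pandas as pd

# Inline frames standing in for file1.csv and file2.csv from the question
df1 = pd.DataFrame({"A": [0], "B": [2], "C": [3], "D": [2], "id": ["A"]})
df2 = pd.DataFrame({"B": [4], "A": [6], "Z": [1], "D": [0], "id": ["B"]})

# Columns are matched by name regardless of order; C and Z get NaN where absent
df = pd.concat([df1, df2], axis=0, ignore_index=True)
```

The result has the union of the columns (A, B, C, D, id, Z), with file B's values placed under the right headers even though its columns were ordered differently.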
The csv module in the standard library provides tools you can use to do this. The DictReader class produces a mapping of column name to value for each row in a csv file; the DictWriter class will write such mappings to a csv file.
DictWriter must be provided with a list of column names, but does not require all column names to be present in each row mapping.
import csv

list_of_files = ['1.csv', '2.csv']

# Collect the column names.
all_headers = set()
for file_ in list_of_files:
    with open(file_, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)
        all_headers.update(headers)
all_headers = sorted(all_headers)

# Generate the output file.
with open('append.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=all_headers)
    writer.writeheader()
    for file_ in list_of_files:
        with open(file_, newline='') as f:
            reader = csv.DictReader(f)
            writer.writerows(reader)
$ cat append.csv
A,B,C,D,Z,id
0,2,3,2,,A
6,4,,0,1,B

Merge tsv files into one csv by extracting particular columns, naming each column after its file name

I have multiple tsv files in a folder. From each file I have to extract the 1st column, which is the abundance, and the 5th column, which is the ID; the columns have no headers. I have to merge these columns from each file into one file, using each file's name as the column header. I also have to check whether all the IDs are present; if an ID is missing from a file, its value should be zero.
One of the sample files File_Name1 looks like:
0.11 31 31 U 0 unclassified
99.89 29001 0 - 1 root
99.89 29001 0 - 131567 cellular organisms
99.89 29001 64 D 2 Bacteria
59.94 17401 270 - 1783272 Terrabacteria group
53.47 15522 8 P 1239 Firmicutes
52.10 15127 998 C 186801 Clostridia
37.83 10982 494 O 186802 Clostridiales
20.61 5983 89 F 186803 Lachnospiraceae
16.95 4922 8 G 1506553 Lachnoclostridium
14.53 4219 0 S 84030 [Clostridium] saccharolyticum
Similarly I have multiple files. The file I want is like :
ID File_Name1 File_Name2
186802 16.95 37.88
1506553 20.61 0
84030 14.53 0.05
I have tried something like this:

import glob
import csv

directory = r"C:\kraken\kraken_13266"
txt_files = glob.glob(directory + r"\*.kraken")
for txt_file in txt_files:
    with open(txt_file, "rt") as input_file, open('output.csv', "wt") as out_file:
        in_txt = csv.reader(input_file, delimiter='\t')
        for line in in_txt:
            firstcolumns = line[:1]
            lastcolumns = line[-2].strip().split(",")
            allcolumns = firstcolumns + lastcolumns

I'm stuck at this point. How should I proceed further?
The following should do what you are trying to do:

from collections import defaultdict
import csv
import glob
import os

ids = defaultdict(dict)  # e.g. {186802: {'File_Name1': 16.95, 'File_Name2': 37.88}}

kraken_files = glob.glob('*.kraken')
for kraken_filename in kraken_files:
    with open(kraken_filename, 'r', newline='') as f_input:
        csv_input = csv.reader(f_input, delimiter='\t')
        file_name = os.path.splitext(kraken_filename)[0]
        for row in csv_input:
            ids[int(row[4])].update({file_name: float(row[0])})

with open('output.csv', 'w', newline='') as f_output:
    fieldnames = ['ID'] + [os.path.splitext(filename)[0] for filename in kraken_files]
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames, restval=0)
    csv_output.writeheader()
    for id_ in sorted(ids.keys()):
        id_values = ids[id_]
        id_values['ID'] = id_
        csv_output.writerow(id_values)

You will need to read all of the files before you can write the output file. A dictionary keyed by ID stores, for each ID, an inner dictionary mapping each file that contains that ID to its abundance.
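The restval=0 argument is what supplies the zero when an ID does not occur in some file. A minimal illustration, with hypothetical field names matching the question's desired output and io.StringIO standing in for the output file:

```python
import csv
import io

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ID", "File_Name1", "File_Name2"], restval=0)
writer.writeheader()
# The File_Name2 key is absent from this row, so restval fills in the 0
writer.writerow({"ID": 1506553, "File_Name1": 20.61})
```

Any fieldname missing from a row dict is written as 0 instead of raising an error or leaving a blank.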

Adding in-between column in csv Python

I work with csv files, and it seems Python provides a lot of flexibility for handling them.
I found several questions related to my issue, but I cannot figure out how to combine the solutions effectively...
My starting point CSV file looks like this (note there is only 1 column in the 'header' row):
FILE1
Z1 20 44 3
Z1 21 44 5
Z1 21 44 8
Z1 22 45 10
What I want to do is add a column between columns #1 and #2 and keep the rest unchanged. This new column has the same number of rows as the other columns, but contains the same integer for all entries (10 in my example below). Another important point is that I don't really know the number of rows, so I might have to count them somehow first(?) My output should then look like:
FILE1
Z1 10 20 44 3
Z1 10 21 44 5
Z1 10 21 44 8
Z1 10 22 45 10
Is there a simple solution to this?
I think the easiest solution would be to just read each row and write a corresponding new row (with the inserted value) to a new file:

import csv

with open('input.csv', 'r', newline='') as infile:
    with open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter=' ')
        writer = csv.writer(outfile, delimiter=' ')
        writer.writerow(next(reader))  # copy the single-column header row unchanged
        for row in reader:
            new_row = [row[0], 10]
            new_row += row[1:]
            writer.writerow(new_row)

This might not make sense if you're doing anything else with the data besides this bulk processing, though. In that case you'd want to look into a library such as pandas.
Use pandas to import the csv file as a DataFrame named df, then use df.insert(idx, col_name, value), where idx is the index of the newly created column, col_name is the name you assign to it, and value is the value (scalar or list) you wish to assign to the column. See below for an illustration:
import pandas as pd
prices = pd.read_csv('C:\\Users\\abdou.seck\\Documents\\prices.csv')
prices
## Output
Shares Number Prices
0 AAP 100 100.67
1 MSFT 50 56.50
2 SAN 200 19.18
3 GOOG 300 500.34
prices.insert(3, 'Total', prices['Number']*prices['Prices'])
prices
## Output:
Shares Number Prices Total
0 AAP 100 100.67 10067
1 MSFT 50 56.50 2825
2 SAN 200 19.18 3836
3 GOOG 300 500.34 150102
Hope this helps.
Read the header first, then initialize the reader; write the header first, then initialize the writer:

import csv

with open("in.csv", "r", newline="") as in_file:
    header = in_file.readline()
    csv_file_in = csv.reader(in_file, delimiter=" ")
    with open("out.csv", "w", newline="") as out_file:
        out_file.write(header)
        csv_file_out = csv.writer(out_file, delimiter=" ")
        for row in csv_file_in:
            csv_file_out.writerow([row[0], 10] + row[1:])
Pull the data into a list, insert the new value into each row at the desired spot, and re-write the data:

import csv

data_to_add = 10
new_column_index = 1  # 0-based index

with open('FILE1.csv', 'r', newline='') as f:
    csv_r = csv.reader(f, delimiter=' ')
    data = [line for line in csv_r]

# Skip the single-column header row when inserting
for row in data[1:]:
    row.insert(new_column_index, data_to_add)

with open('FILE1.csv', 'w', newline='') as f:
    csv_w = csv.writer(f, delimiter=' ')
    csv_w.writerows(data)
Here's how I might do it with pandas:

import pandas as pd

with open("in.csv") as input_file:
    header = input_file.readline()
    # header=None: the real header was already consumed by readline(),
    # so the first data row must not be treated as column names
    data = pd.read_csv(input_file, sep=" ", header=None)

data.insert(1, "New Data", 10)

with open("out.csv", "w") as output_file:
    output_file.write(header)
    data.to_csv(output_file, sep=" ", index=False, header=False)
