Extracting data using fieldnames - python

I have parsed and extracted lines and field names from a CSV data file using:
reader = csv.DictReader(open('Sourcefile.txt','rt'), delimiter = '\t')
fn = reader.fieldnames
How do I access the parsed data in any line using the fieldname? For example fieldnames could be AA, BB, CC, DD. How do obtain the value for DD in line 5, or AA in line 3?

As long as the file is not too big just convert the reader into a list:
data = list(reader)
Now access the column AA in row 1:
data[0]['AA']
or column for field name 3 in row 3:
data[2][fn[2]]

An easy way is to store all of the data in your file into a container. An alternative to the other answer is to use a pandas dataframe:
>>import pandas as pd
>>from io import StringIO # just to make a fake in memory file
>>s = StringIO('AA\tBB\tCC\tDD\n1\t2\t3\t4\n11\t12\t13\t14\n') # fake data
>>data = pd.read_table(s)
>>data.loc[0,'AA'] # access row 0 (the first row) and column AA
1
Edit: This does use a different package not included with the standard library, but I often find it easier to use pandas than csv when I am constantly reading and manipulating csv files.

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows in two columns that("Length" and "Protein Name") from them. How do I specify the condition for this purpose.
import csv #importing the csv module or method
#opening a new csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')
type(file)
#reading the csv file
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
#extracting rows from the csv file
rows = []
for row in csvreader:
rows.append(row)
print(rows)
I am a beginner in python bioinformatic data analysis and I haven't tried any extensive methods. I don't how to proceed from here. I have done the work opening and reading the csv file. I have also extracted the column headers. But I don't know how to proceed from here. Please help.
try this :
csvreader= csvreader[csvreader["columnName"].str.contains("string to delete") == False]
It will be better to read scv in pandas since you have lots of rows. That will be the smart decision to make. And also set your conditional variables which you will use to perform the operation. If this do not help. I will suggest you provide a sample data of your scv file.
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protain name"
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a scv file if you want:
df.to_csv("'C:\\Users\\Admin\\Downloads\\new_csv.csv'", index=False)

Join large set of CSV files where the header is the timestamp for the file

I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.
Each file is in a simple pattern with timestamp that corresponds to a period of time that represent the data in the each CSV file.
Ex.
file1.csv
2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
file2.csv
2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.
Like this:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....
Any simple tools for this, or Python routines?
Thanks for the reply on my first question here!
I will also contribute with a solution for this problem.
I did some read up on Pandas after I found something similar to what I wanted to do. I found that the transform method was very easy to use, and put together this snippet of Python code instead.
import pandas as pd
import os
folder = 'in'
df_out = pd.DataFrame()
for filename in os.scandir(folder):
if filename.is_file():
print('Working on file' + filename.path)
df = pd.read_csv(filename.path, encoding='utf-16', sep=';',header =[0])
# Transpose data with timestamp header to columns
df_tranposed = df.T
df_out = df_out.append(df_tranposed)
df_out.to_csv('output.csv')
Try the following Pandas approach:
import pandas as pd
import glob
import os
dfs = []
for csv_filename in glob.glob('./file*.csv'):
print('Working on file', csv_filename)
# Read the CSV file, assume no header and two columns
df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
# Transpose from the 2nd row (skip the timestamp)
df_transposed = df[1:].T
# Allocate the column names from the first row and 'Timestamp'
df_transposed.columns = df_transposed.iloc[0] + 'Timestamp'
# Copy the timestamp into the transposed dataframe as a datetime value
df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
# Remove the first row (containing the names)
df_transposed = df_transposed[1:]
dfs.append(df_transposed)
# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)
Alternatively, it could be done using standard Python:
from datetime import datetime
import csv
import glob
data = []
fieldnames = set()
for fn in glob.glob('file*.csv'):
with open(fn) as f_input:
csv_input = csv.reader(f_input, delimiter=';')
timestamp = next(csv_input)[0]
row = {'Timestamp' : timestamp}
for device, error_code in csv_input:
row[device] = error_code
fieldnames.add(device)
data.append(row)
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
csv_output.writeheader()
csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))
This gives output.csv as:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0
How does this work
First iterate over all .csv files in a given folder.
For each file open it using a csv.reader()
Read the header row as a special case, storing the value as a Timestamp entry in a dictionary row.
For each row, store additional key value entries in the row dictionary.
Keep a note of each device name using a set.
Append the complete row into a data list.
It is now possible to create an output.csv file. The full list of columns names can be assigned as fieldnames and a csv.DictWriter() used.
Write the header.
Use writerows() to write all the data rows sorted by timestamp. To do this convert each row's Timestamp entry into a datetime value for sorting.
This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode.

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because the , the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file,'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
## remove obfuscating characters
for c in '"{}\n':
line = line.replace(c,'')
line = line.split(',')
## extract values to two lists
id_vals.append(line[0][3:])
currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:','value']
replace_str = ''
for i in find_str:
my_csv_text = re.sub(i, replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json
for row in df.iteritems():
print(json.loads(row.value)["id"])
# OR
print(eval(row.value)["id"])

Read CSV with comma delimiter (sorting issue while importing csv)

I am trying to open a csv file by skipping first 5 rows. The data is not getting aligned in dataframe. See screenshot of file
PO = pd.DataFrame()
PO = pd.read_table(acct.csv',sep='\t',skiprows=5,skip_blank_lines=True)
PO
try to set it after import datewise as below.
First sort your data with proper import as it is sticked to the index values. see data image again and data as well. So, when you have proper separator / delimiter you can do following.
do = pd.read_csv('check_test.csv', "r", delimiter='\t', skiprows=range(1, 7),skip_blank_lines=True, encoding="utf8")
d01 = do.iloc[:,1:7]
d02 = d01.sort_values('Date,Reference,Debit')
This is sorting the values into the way you want.

Python Pandas performing operation on each row of CSV file

I have a 1million line CSV file. I want to do call a lookup function on each row's 1'st column, and append its result as a new column in the same CSV (if possible).
What I want is this is something like this:
for each row in dataframe
string=row[1]
result=lookupFunction(string)
row.append[string]
I Know i could do it using python's CSV library by opening my CSV, read each row, do my operation, write results to a new CSV.
This is my code using Python's CSV library
with open(rawfile, 'r') as f:
with open(newFile, 'a') as csvfile:
csvwritter = csv.writer(csvfile, delimiter=' ')
for line in f:
#do operation
However I really want to do it with Pandas because it would be something new to me.
This is what my data looks like
77,#oshkosh # tannersville pa,,PA,US
82,#osithesakcom ca,,CA,US
88,#osp open records or,,OR,US
89,#ospbco tel ord in,,IN,US
98,#ospwmnwithn return in,,IN,US
99,#ospwmnwithn tel ord in,,IN,US
100,#osram sylvania inc ma,,MA,US
106,#osteria giotto montclair nj,,NJ,US
Any help and guidance will be appreciated it. THanks
here is a simple example of adding 2 columns to a new column from you csv file
import pandas as pd
df = pd.read_csv("yourpath/yourfile.csv")
df['newcol'] = df['col1'] + df['col2']
create df and csv
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2], B=[3, 4]))
df.to_csv('test_add_column.csv')
read csv into dfromcsv
dfromcsv = pd.read_csv('test_add_column.csv', index_col=0)
create new column
dfromcsv['C'] = df['A'] * df['B']
dfromcsv
write csv
dfromcsv.to_csv('test_add_column.csv')
read it again
dfromcsv2 = pd.read_csv('test_add_column.csv', index_col=0)
dfromcsv2

Categories