Creating a single output file from 3 csv files using Python

I have 3 CSV files. Names are below
AD.csv
ID.csv
MD.csv
AD.csv:
A.Net      ATVS
A&E HD     60 Days In
AXSTV      60 Days : Watch Along
BET HD     Behind Bars: Rookie Year
Bloomberg  Biggie: The Life of Notorious B.I.G.
ID.csv:
I.Net    ITvs
AETVHD   60 Days In
AXSTV    60 Days : Watch Along
BETHD    Behind Bars: Rookie Year
BLOOMHD  Dog the Bounty Hunter
MD.csv:
A.Net      I.Net
A&E HD     AETVHD
AXSTV      AXSTV
BET HD     BETHD
Bloomberg  BLOOMHD
In MD.csv, 'A.Net' maps to 'I.Net', which means I have to map the data in 'ATVS' to 'ITvs' wherever MD.csv says 'A.Net' = 'I.Net'.
I am new to writing Python scripts; can anyone help me with this mapping? Here is my attempt:
import csv

with open('E:/ad.csv', 'r') as lookuplist, \
        open('E:/id.csv', 'r') as csvinput, \
        open('vlookupout', 'w', newline='') as output:
    # a csv.reader can only be iterated once, so load the lookup file into a list
    atvs_rows = list(csv.reader(lookuplist))
    reader2 = csv.reader(csvinput)
    writer = csv.writer(output)
    for itvs in reader2:
        for atvs in atvs_rows:
            if itvs[0] == atvs[0]:
                itvs.extend(atvs[1:])
        writer.writerow(itvs)
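For reference, this kind of three-file lookup can also be done with the csv module alone. A minimal sketch, assuming the files were saved as ordinary comma-separated CSVs with the header rows shown above (the pandas answers below handle the whitespace-separated variant):

import csv

# build the A.Net -> I.Net mapping from MD.csv
with open('E:/md.csv', newline='') as f:
    a_to_i = {row['A.Net']: row['I.Net'] for row in csv.DictReader(f)}

# index the ID.csv rows by their I.Net value
with open('E:/id.csv', newline='') as f:
    id_rows = {row['I.Net']: row for row in csv.DictReader(f)}

# walk AD.csv, hop through the mapping, and write the combined rows
with open('E:/ad.csv', newline='') as f, \
        open('E:/vlookupout.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['A.Net', 'ATVS', 'ITvs'])
    for row in csv.DictReader(f):
        match = id_rows.get(a_to_i.get(row['A.Net'], ''), {})
        writer.writerow([row['A.Net'], row['ATVS'], match.get('ITvs', '')])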

If you don't have any dependency constraints, use DataFrames from the pandas library.
Using DataFrames, you can simply read and load the CSVs as tables:
import pandas as pd

ad = pd.read_csv('E:/ad.csv')
id_df = pd.read_csv('E:/id.csv')  # 'id' would shadow the Python built-in
... and perform joins/merges/aggregations on them. For example, once ad carries an I.Net column (mapped in via MD.csv):
result = pd.merge(ad,
                  id_df[['I.Net', 'ITvs']],
                  on='I.Net')
It'll be much easier and more flexible for your requirement.

You can do this using Pandas.
import pandas as pd

# read in the CSVs (the columns are separated by two or more spaces)
ad_df = pd.read_csv('AD.csv', sep=r'\s\s+', engine='python')
id_df = pd.read_csv('ID.csv', sep=r'\s\s+', engine='python')
md_df = pd.read_csv('MD.csv', sep=r'\s\s+', engine='python')
# combine the CSVs using MD.csv as the mapping
result = pd.merge(ad_df, md_df[['A.Net', 'I.Net']], on='A.Net')
result = pd.merge(result, id_df[['I.Net', 'ITvs']], on='I.Net')
# in case you want to drop 'I.Net', add:
result.drop('I.Net', axis=1, inplace=True)
# export to csv:
result.to_csv('result.csv', index=False)
Note: your CSVs have some inconsistencies in the header names; I used the names in my script exactly as provided.
As noted in my comment, your csv separation looks off. I made one small change to the data by adding an extra space between "BLOOMHD" and "Dog the...".

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows based on the values in two columns ("Length" and "Protein Name"). How do I specify the condition for this purpose?
import csv

# open the csv file
with open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r') as file:
    csvreader = csv.reader(file)
    # the first row is the header
    header = next(csvreader)
    print(header)
    # extract the remaining rows from the csv file
    rows = list(csvreader)
print(rows)
I am a beginner in Python bioinformatic data analysis and I haven't tried any extensive methods. I have managed to open and read the CSV file and extract the column headers, but I don't know how to proceed from here. Please help.
Try this (note that it operates on a pandas DataFrame, not on a csv.reader object):
df = df[df["columnName"].str.contains("string to delete") == False]
It will be better to read the CSV with pandas since you have lots of rows; that is the smart decision to make. Also set the conditional variables you will use to perform the operation. If this does not help, I suggest you provide sample data from your CSV file.
import pandas as pd

df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protein name"
# keep rows whose Length exceeds the threshold and whose Protein Name differs
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a CSV file if you want:
df.to_csv('C:\\Users\\Admin\\Downloads\\new_csv.csv', index=False)

Read CSV with comma delimiter (sorting issue while importing csv)

I am trying to open a csv file, skipping the first 5 rows, but the data is not getting aligned in the DataFrame. See the screenshot of the file.
import pandas as pd

PO = pd.read_table('acct.csv', sep='\t', skiprows=5, skip_blank_lines=True)
PO
Try sorting it by date after the import, as below. First import your data with the proper separator/delimiter; otherwise everything gets stuck to the index values. Once you have the proper separator/delimiter, you can do the following:
import pandas as pd

do = pd.read_csv('check_test.csv', delimiter='\t', skiprows=range(1, 7),
                 skip_blank_lines=True, encoding='utf8')
d01 = do.iloc[:, 1:7]
d02 = d01.sort_values(['Date', 'Reference', 'Debit'])
This sorts the values into the shape you want.
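Since the title mentions a comma delimiter: the same pattern applies with sep=','. A hypothetical sketch (the column names are assumed from the answer above):

import pandas as pd

# same approach for a comma-separated file: skip the junk rows,
# then sort by the columns of interest
po = pd.read_csv('acct.csv', sep=',', skiprows=5, skip_blank_lines=True)
po = po.sort_values(['Date', 'Reference', 'Debit'])
print(po.head())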

Operations on a very large csv with pandas

I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script to read the csv and create the frequencies of WORD by group so the output is like:
group  freqW1  freqW2
A      1       0
B      1       0
C      0       1
Then I do some other operations on the values. The problem is that I now have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because the returned 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
import pandas as pd
from collections import Counter

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
with open("csv_out.txt", "w", encoding='utf-8') as outfile:
    df1.to_csv(outfile, sep=",")
You can specify a chunksize option in the read_csv call; see the pandas documentation for details.
Alternatively, you could use the Python csv library, create your own csv reader or DictReader, and then use it to read in data in whatever chunk size you choose.
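A sketch of that csv-module approach for this particular counting task, assuming the four-column layout from the question (so only the per-group counters ever live in memory):

import csv
from collections import defaultdict

freq = defaultdict(int)    # rows per group
word1 = defaultdict(int)   # rows per group containing WORD1
word2 = defaultdict(int)   # rows per group containing WORD2

with open('csvfile.txt', newline='', encoding='utf-8') as f:
    for group, val1, val2, text in csv.reader(f):
        freq[group] += 1
        if 'WORD1' in text:
            word1[group] += 1
        if 'WORD2' in text:
            word2[group] += 1

with open('csv_out.txt', 'w', encoding='utf-8') as out:
    out.write('group,count,freqW1,freqW2\n')
    for g in sorted(freq):
        out.write(f'{g},{freq[g]},{word1[g]},{word2[g]}\n')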
Okay, I misunderstood the chunk parameter. I solved it by doing this:
import pandas as pd
from collections import Counter

frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1000000)
for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    # accumulate the per-chunk counts into the running totals
    frame = frame.add(df1, fill_value=0)
with open("csv_out.txt", "w", encoding='utf-8') as outfile:
    frame.to_csv(outfile, sep=",")

renaming the header when using dictreader

I'm looking for the best way to rename my header using DictReader/DictWriter, adding to the other steps I've already done.
This is what I am trying to do to the source data example below:
1. Remove the first 2 lines
2. Reorder the columns (header & data) to 2, 1, 3 versus the source file
3. Rename the header to ASXCode, CompanyName, GICS
Where I'm at:
If I use reader = csv.reader(inf), the first lines are removed and the columns reordered, but, as expected, there is no header rename.
Alternatively, when I run the DictReader line reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS')), I receive the error 'dict contains fields not in fieldnames:' followed by the first row of data rather than the header.
I'm a bit stuck on how to get around this, so any tips are appreciated.
Source Data example
ASX listed companies as at Mon May 16 17:01:04 EST 2016
Company name         ASX code  GICS industry group
1-PAGE LIMITED       1PG       Software & Services
1300 SMILES LIMITED  ONT       Health Care Equipment & Services
1ST AVAILABLE LTD    1ST       Health Care Equipment & Services
My Code
import csv
import urllib.request
from itertools import islice

local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)

with open(temp_filename, 'r', newline='') as inf, \
        open(local_filename, 'w', newline='') as outf:
    # reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    reader = csv.reader(inf)
    fieldnames = ['ASX code', 'Company name', 'GICS industry group']
    writer = csv.DictWriter(outf, fieldnames=fieldnames)
    # 1. remove the top 2 rows
    next(islice(reader, 2, 2), None)
    # 2. reorder the columns
    writer.writeheader()
    for row in csv.DictReader(inf):
        writer.writerow(row)
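For the record, the DictReader route from the question can work too; the trick is to skip the unwanted rows (including the original header) before DictReader sees the file, so the supplied fieldnames are never compared against a data row. A sketch, assuming the tab-separated layout above and the temp_filename/local_filename variables from the question:

import csv
from itertools import islice

with open(temp_filename, 'r', newline='') as inf, \
        open(local_filename, 'w', newline='') as outf:
    # skip the title line, the blank line and the original header
    # (3 rows here; adjust the count to your file), then let DictReader
    # apply the new names in the source column order
    reader = csv.DictReader(islice(inf, 3, None), delimiter='\t',
                            fieldnames=('CompanyName', 'ASXCode', 'GICS'))
    # DictWriter emits the columns in its own fieldnames order
    writer = csv.DictWriter(outf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    writer.writeheader()
    for row in reader:
        writer.writerow(row)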
IIUC, here is a solution using pandas and its read_csv function:
import pandas as pd

# Considering that you have your data in a file called 'stock.txt'
# and it is tab separated; blank lines are not read by read_csv,
# hence set header=1 to skip the title line above the header
df = pd.read_csv('stock.txt', sep='\t', header=1)
# rename the columns as required
df.columns = ['CompanyName', 'ASXCode', 'GICS']
# reorder the columns as required
df = df[['ASXCode', 'CompanyName', 'GICS']]
Based on your tips I got it working in the end. I hadn't used pandas before, so I had to read up a little first.
I eventually worked out that pandas uses a DataFrame, so I had to do a few things differently with the to_csv function, eventually adding the index=False parameter to remove the DataFrame index.
All great now, thank you.
import os
import urllib.request
import pandas as pd

local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)

# using a pandas DataFrame
df = pd.read_csv(temp_filename, sep=',', header=1)  # skip the title line above the header
df.columns = ['CompanyName', 'ASXCode', 'GICS']     # rename columns
df = df[['ASXCode', 'CompanyName', 'GICS']]         # reorder columns
df.to_csv(local_filename, sep=',', index=False)
os.remove(temp_filename)  # clean up

Merge a csv and txt file, then alphabetize and eliminate duplicates using python and pandas

I am trying to combine two csv files (items.csv and prices.csv) to create combined_list.txt. The result should be a list sorted in alphabetical order, in the format item (quantity): $total_price_for_item, plus 2 additional lines: a separator line of equal signs and a line with the total amount for the list:
bread (10.0): $3.0
cheese (0.4): $4.0
eggs (11.0): $2.2
ham (0.6): $9.0
milk (2.0): $6.5
peanut butter (4.0): $12.0
tuna (4.0): $8.0
====================
Total: $44.7
items.csv looks like
eggs,6
milk,1
cheese,0.250
ham,0.250
etc...
and prices.txt looks like
eggs,$0.2
milk,$3.25
etc...
I have to do a version with plain Python and another with pandas, but nothing I've found online hits the mark in a way I can work with. I started with:
import csv

with open('items.csv', 'r') as inputFile:
    new_file = csv.reader(inputFile, delimiter=',')
    for row in new_file:
        print(', '.join(row))
But I am having trouble putting everything together. Some of the solutions I found are a little too complex for me or don't work with my files, which have no column headers. I'm still trying to figure it out, but I know that for some of you this is super easy, so I am turning to the collective wisdom instead of hitting my head against the wall alone.
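For the plain-Python version, a sketch covering the whole task, assuming comma-separated files with no headers, prices prefixed with '$' as shown, quantities summed across repeated items, and every item present in prices:

import csv

# read the quantities, summing across repeated items
quantities = {}
with open('items.csv', newline='') as f:
    for item, qty in csv.reader(f):
        quantities[item] = quantities.get(item, 0) + float(qty)

# read the unit prices, stripping the leading '$'
with open('prices.csv', newline='') as f:
    prices = {item: float(price.lstrip('$')) for item, price in csv.reader(f)}

# write the alphabetized list with per-item and grand totals
grand_total = 0
with open('combined_list.txt', 'w') as out:
    for item in sorted(quantities):
        total = round(quantities[item] * prices[item], 2)
        grand_total += total
        out.write(f"{item} ({quantities[item]}): ${total}\n")
    out.write('=' * 20 + '\n')
    out.write(f"Total: ${round(grand_total, 2)}\n")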
Pandas has a built-in method for reading csv files. Here is code to get both sets of data into one DataFrame:
import pandas as pd

# the files have no header row; use the item name as the index
items = pd.read_csv('items.csv', index_col=0, header=None)
items.columns = ['QTY']
prices = pd.read_csv('prices.csv', index_col=0, header=None)
prices.columns = ['Price']
df = items.combine_first(prices)
To sort and drop duplicates (the item name is the index here):
df = df.sort_index()
df = df[~df.index.duplicated()]
df.to_csv('combined.txt')
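To get from df to the combined_list.txt format the question asks for, a sketch building on the frame above (assuming the QTY/Price column names used there and '$'-prefixed prices):

# strip the '$' from the prices and compute per-item totals
df['Price'] = df['Price'].str.lstrip('$').astype(float)
df['Total'] = (df['QTY'] * df['Price']).round(2)

with open('combined_list.txt', 'w') as out:
    for item, row in df.iterrows():
        out.write(f"{item} ({row['QTY']}): ${row['Total']}\n")
    out.write('=' * 20 + '\n')
    out.write(f"Total: ${round(df['Total'].sum(), 2)}\n")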
