Replacing data in one file with data from another - Python

First of all, we have two files:
file01.txt
101|10075.0|12|24/12/2015
102|1083.33|12|24/12/2015
The second file has only one line!
file02.txt
101|False|Section06
The first parameter is the same in both files (unique).
I must replace data in file01 with data from file02. The match criterion is the first parameter (the code).
I have one input (a request for the code) and readlines for both files. What do I need to do next? Also, I'm working with lists.
Expected result:
input = 101
The output should be:
101|False|Section06
102|1083.33|12|24/12/2015

You could use csv.reader() to read the files, put the rows in a dict keyed by the first field, and let the second file's rows replace the matching entries, like this:
import csv

with open('file1') as f:
    d = {i[0]: i[1:] for i in csv.reader(f, delimiter='|')}

with open('file2') as f:
    d.update({i[0]: i[1:] for i in csv.reader(f, delimiter='|')})
And d looks like:
{'101': ['False', 'Section06'], '102': ['1083.33', '12', '24/12/2015']}
To get the expected output:
>>> ['|'.join([i[0]]+i[1]) for i in d.items()]
['101|False|Section06', '102|1083.33|12|24/12/2015']
And if you want to write them into a file:
with open('file1', 'w') as f:
    for i in d.items():
        f.write('|'.join([i[0]] + i[1]) + '\n')  # newline after each record

Solution
This works for the given example:
with open('file01.txt') as fobj1, open('file02.txt') as fobj2:
    data1 = fobj1.readlines()
    data2 = fobj2.readline()

code = data2.split('|', 1)[0]

with open('file01.txt', 'w') as fobj_out:
    for line in data1:
        if line.split('|', 1)[0] == code:
            fobj_out.write(data2 + '\n')
        else:
            fobj_out.write(line)
Step by step
We open both files for reading:
with open('file01.txt') as fobj1, open('file02.txt') as fobj2:
    data1 = fobj1.readlines()
    data2 = fobj2.readline()
The read data looks like this:
>>> data1
['101|10075.0|12|24/12/2015\n', '102|1083.33|12|24/12/2015']
>>> data2
'101|False|Section06'
We only need the code from file02.txt:
>>> code = data2.split('|', 1)[0]
>>> code
'101'
data2.split('|', 1) splits at |. Since we only need the part before the first |, we can limit the number of splits to 1.
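For example:
>>> '101|False|Section06'.split('|', 1)
['101', 'False|Section06']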
Now we open file01.txt again. This time for writing:
with open('file01.txt', 'w') as fobj_out:
    for line in data1:
        if line.split('|', 1)[0] == code:
            fobj_out.write(data2 + '\n')
        else:
            fobj_out.write(line)
The line if line.split('|', 1)[0] == code: does the same split as above, but for every line of file01.txt. If the code equals the one from file02.txt, we write the line from file02.txt; otherwise we just write the line from file01.txt back.

You can simply concatenate the two sets of data into a single pandas DataFrame, as follows:
import pandas as pd
df1 = pd.DataFrame([[10075.0, 12,'24/12/2015'], [1083.33, 12, '24/12/2015']], index=[101,102], columns=['prc', 'code', 'date'])
'''
101|10075.0|12|24/12/2015
102|1083.33|12|24/12/2015
'''
df2 = pd.DataFrame([[False, 'Section06'], [True, 'Section07']], index=[101,102], columns=['Bool', 'Section'])
'''
101|False|Section06
102|True|Section07
'''
pd.concat([df1,df2], axis=1, join='outer')
Which gives:
          prc  code        date   Bool    Section
101  10075.00    12  24/12/2015  False  Section06
102   1083.33    12  24/12/2015   True  Section07
Now you can get rid of the columns you don't need (e.g. using DataFrame.drop()).
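For instance, a short sketch dropping the date column from the combined frame (column names as defined above):
combined = pd.concat([df1, df2], axis=1, join='outer')
combined = combined.drop(columns=['date'])  # keep only the columns you need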

Related

Reading a txt file and saving individual columns as lists

I am trying to read a .txt file and save the data in each column as a list. Each column in the file contains a variable which I will later use to plot a graph. I have looked up the best method to do this, and most answers recommend opening the file, reading it, and then either splitting or saving the columns as lists. The data in the .txt file is as follows -
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try and append the columns to my variables -
import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)

print(time, rim)
However, when I try to print the lists, time and rim, using print(time, rim), I get the following error message -
r = line[1]
IndexError: list index out of range
I am, however, able to print only the 'time' if I comment out the r=line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd
df = pd.read_csv('./rvt.txt', sep='\t', header=None, names=[a list with your column names])
Then you can use list(your_column) to work with your columns as lists
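A minimal sketch of that idea, assuming the two columns are named time and rim (the file has no header row, and sep=r'\s+' handles both runs of spaces and tabs):
import pandas as pd

# no header row in rvt.txt; r'\s+' matches any run of spaces or tabs
df = pd.read_csv('./rvt.txt', sep=r'\s+', header=None, names=['time', 'rim'])
time = list(df['time'])
rim = list(df['rim'])
print(time, rim)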
The problem is with the delimiter. The dataset contains multiple spaces ' '. When you use '\t' and print each line, you can see the line is not being split at the delimiter, e.g.:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result you can use ' ' (space) as the delimiter and filter out the empty values:
readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))  # drop empty strings from repeated spaces
    t = line[0]
    r = line[1]
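For example, the filter drops the empty strings that consecutive spaces produce:
>>> list(filter(lambda x: len(x), ['0.00025', '', '', '1.651333945']))
['0.00025', '1.651333945']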
Here is the full code to do this:
import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        line = list(filter(lambda x: len(x), line))  # remove empty strings
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)

print(time, rim)

Using Pandas to read data and skip metadata

Background
I have data files which consist of two parts: data in CSV format, and metadata. I can use the methods given here [1] and here [2] to manually skip the metadata portion by specifying the location/line number of the beginning of the metadata.
Following is a sample of the data file: [sample not shown]
Here, you can see that I can specify the line number (420) manually and use the following code to skip the Metadata:
with open('data.csv', 'r') as f:
    metadata_location = [i for i, x in enumerate(f.readlines()) if 'Metadata' in x]

with open('data.csv', 'r') as f:
    flat_data = pd.read_csv(f, index_col=False, skiprows=lambda x: x >= metadata_location[0])

Alternatively:
with open('data.csv') as f:
    df = pd.read_csv(f, index_col=False)
df = df[:420]
Question
How can I scan the file to capture the Metadata and then skip reading it? (I will need to process multiple such files, hence I want to write code that does this automatically.)
IIUC, you can pass a callable to the skiprows argument; it will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. Use:
df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= 420)
UPDATE: To find the metadata location:
import re

md_loc = 0
with open("data.csv") as f:
    for idx, line in enumerate(f):
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx
            break  # stop at the first occurrence of the marker
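md_loc can then be fed to the same skiprows callable as above (assuming the marker line really matches the regex):
df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= md_loc)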
Your question is not clear.
If I got you right, you are looking for a way to scan all the lines and run the above code on each?
EDIT 1:
for index, row in All_Patients_Chosen_Visit.iterrows():
    df = row[:420]
See the above code and check if it works.

How to combine certain fields of different rows of csv file into one row

For my research I have a csv file in which each row stores a userId, a message, and a label (about the userId), as follows:
UserId  txt        label
1       This is a  true
1       part of    true
1       the whole  true
1       message    true
2       more       false
2       text       false
What I would want to achieve is to combine, for example, two entries of every user into one row. For the above sample that would mean I want to get the following output in a csv:
UserId  txt                label
1       This is a part of  true
2       more text          false
I don't know how to achieve this effectively (with Python?), because the file contains 3 million rows with 20 thousand users. So I would like to end up with a file that has only 20 thousand rows.
Here is a method using pandas' groupby in combination with join:
import pandas as pd
df = pd.read_csv(r'C:\YourDir\YourFile.csv',sep=',')
df = df.groupby(['UserId','label'])['txt'].apply(' '.join).reset_index()
print(df)
Result:
   UserId  label                                  txt
0       1   True  This is a part of the whole message
1       2  False                            more text
Note: Use the appropriate separator for the sep parameter. I have used a comma.
You can write this back (overwrite) to csv like:
df.to_csv(r'C:\YourDir\YourFile.csv', sep=',', index=False)
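If you really want only the first two messages per user, as in the expected output above, one possible tweak (a sketch, not part of the original answer) is to truncate each group before joining:
df = pd.read_csv(r'C:\YourDir\YourFile.csv', sep=',')
# join only the first two txt values of each (UserId, label) group
df = df.groupby(['UserId', 'label'])['txt'].apply(lambda s: ' '.join(s.head(2))).reset_index()
print(df)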
Your file doesn't seem to be comma-separated; if that's the case, the following can help you:
import re

user_dict = {}
with open("csv_merge.csv") as f:
    for l in f:
        for m in re.finditer(r"^(\d+)\s*(.*?)\s*(true|false)\s*$", l, re.IGNORECASE):
            user, txt, label = m.group(1), m.group(2), m.group(3)
            if user not in user_dict:
                user_dict[user] = {"txt": txt, "label": label}
            else:
                user_dict[user]["txt"] += " " + txt
                # as far as I could understand, label doesn't change

with open("csv_merge_new.csv", "w") as f:
    f.write("UserId,txt,label\n")  # comma separated
    for k, v in user_dict.items():
        f.write(f"{k},{v['txt']},{v['label']}\n")

The resulting csv_merge_new.csv:
UserId,txt,label
1,This is a part of the whole message,true
2,more text,false
Try this (assuming the file is delimited by ",", i.e. it is a CSV):
di = {}
with open("file.txt", "r") as fi:
    fi.readline()  # skip the header line
    for line in fi:
        l = [' '.join(i.split()) for i in line.split(',')]
        if l[0] in di:
            di[l[0]][0] += " " + l[1]
        else:
            di[l[0]] = [l[1], l[2]]

print(di)

with open("out.txt", "w") as fi:
    fi.write("UserId, txt, label\n")
    for k, v in di.items():
        fi.write("{},{},{}\n".format(k, v[0], v[1]))
Outputs:
{'1': ['This is a part of the whole message', 'true'], '2': ['more text', 'false']}
File: out.txt
UserId, txt, label
1,This is a part of the whole message,true
2,more text,false
File: file.txt:
UserId, txt, label
1, This is a, true
1, part of, true
1, the whole, true
1, message, true
2, more, false
2, text, false
Here is an SQLite solution, which should be very fast.
import pandas as pd
import sqlite3 as db

path = 'path/to/some.csv'
df = pd.read_csv(path)

conn = db.connect('my_solution.db')
df.to_sql('table_from_df', conn, if_exists='replace', index=False)

sql_query = '''
select
    userid,
    group_concat(txt, ' ') as txt
from table_from_df
group by 1
order by 1
'''

out_df = pd.read_sql_query(sql_query, conn)
conn.close()
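If you want the combined result back on disk, a to_csv call at the end works (the filename here is just an example):
out_df.to_csv('combined.csv', index=False)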

Combining two scripts into one code for csv file data verification

Hello everyone. Currently I have two scripts that I would like to combine into one. The first script finds missing timestamps in a set of data, fills in a blank row with NaN values, and then saves to an output file. The second script compares different rows in a set of data and creates a new column with True/False values based on the test condition.
If I run each script as a function and then call both with another function, I get two separate output files. How can I make this run with only one saved output file?
First Code
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
Second Code
with open('data5.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]

def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])

header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))

with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')
Let's put your code into functions of the filenames:
def first_code(file_in, file_out):
    df = pd.read_csv(file_in, ...)
    ...
    df.to_csv(file_out, ...)

def second_code(file_in, file_out):
    with open(file_in, 'r') as f:
        ...
    ...
    with open(file_out, 'w') as f:
        ...
Your solution can then be:
first_code('data5.csv', 'output.csv')
second_code('output.csv', 'output.csv')
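If you prefer a single entry point, a small wrapper does the job (a sketch built on the two functions above):
def combined(file_in, file_out):
    # run the two steps back to back; the intermediate file doubles as the final one
    first_code(file_in, file_out)
    second_code(file_out, file_out)

combined('data5.csv', 'output.csv')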
Hope it helps
Note that there is no problem reading and writing the same file. Just be sure that the file was previously closed, to avoid side effects; this is done implicitly by using with, which is good practice.
In the second code, change data5.csv (the first input to the second script) to output.csv, and make sure that file1.py and file2.py are in the same directory. Your modified code in a single file will then be as follows:
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
with open('output.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]

def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])

header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))

with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')

Extract one column of csv into a comma separated list python

I have a CSV file that I read like below:
with open ("ann.csv", "rb") as annotate:
for col in annotate:
ann = col.lower().split(",")
print ann[0]
My CSV file looks like below:
H1,H2,H3
da,ta,one
dat,a,two
My output looks like this:
da
dat
but I want a comma separated output like (da,dat). How can I do that?
First, in Python you have the csv module - use that.
Second, you're iterating through rows, so using col as a variable name is a bit confusing.
Third, just collect the items in a list and print that using .join():
import csv

with open("ann.csv", "rb") as csvfile:
    reader = csv.reader(csvfile)
    reader.next()  # Skip the header row
    collected = []
    for row in reader:
        collected.append(row[0])
    print ",".join(collected)
Try like this:
with open("ann.csv", "rb") as annotate:
    output = []
    next(annotate)  # next() advances the file pointer to the next line, skipping the header
    for col in annotate:
        output.append(col.lower().split(",")[0])
print ",".join(output)
Then try this:
result = ''
with open("ann.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().split(",")
        # add the first element of every line to one string, separated by commas
        result = result + ann[0] + ','
print result
Try this
>>> with open("ann.csv", "rb") as annotate:
...     for col in annotate:
...         ann = col.lower().split(",")
...         print ann[0]+',',
...
Instead of printing it on the spot, build up a string and print it at the end.
s = ''
with open("ann.csv", "rb") as annotate:
    for col in annotate:
        ann = col.lower().split(",")
        s += ann[0] + ','
s = s[:-1]  # Remove the last comma
print(s)
I would also suggest changing the variable name col; it is looping over lines, not over columns.
Using numpy.loadtxt might be a bit easier:
In [23]: import numpy as np
...: fn = 'a.csv'
...: m = np.loadtxt(fn, dtype=str, delimiter=',')
...: print m
[['H1' 'H2' 'H3']
['da' 'ta' 'one']
['dat' 'a' 'two']]
In [24]: m[:,0][1:]
Out[24]:
array(['da', 'dat'],
      dtype='|S3')
In [25]: print ','.join(m[:,0][1:])
da,dat
m[:,0] gets the first column of matrix m, and [1:] skips the first element 'H1'.
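If only the first column is needed, usecols can pull it in directly (assuming a numpy version where usecols accepts a scalar; skiprows=1 drops the header):
In [26]: col = np.loadtxt(fn, dtype=str, delimiter=',', skiprows=1, usecols=0)
In [27]: print ','.join(col)
da,dat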
