I have a text file with data like the following:
"id":0
"value1":"234w-76-54"
"id":1
"value1":"2354w44-7-54"
I want to have this data in a CSV file. I tried the code below, but it writes each id and value1 as a list in the CSV file.
with open("log.txt", "r") as file:
    file2 = csv.writer(open("file.csv", "w", newline=""), delimiter=",")
    file2.writerow(["id", "value1"])
    for lines in file:
        if "id" in lines:
            ids = re.findall(r'(\d+)', lines)
        if "value1" in lines:
            value1 = re.findall(r'2[\w\.:-]+', lines)
            file2.writerow([ids, value1])
Output I am getting:
id value1
['0'] ['234w-76-54']
['1'] ['2354w44-7-54']
Desired output:
id value1
0 234w-76-54
1 2354w44-7-54
The simplest way to do it, in my opinion, is to read in the .txt file using pandas's read_csv() function and write it out using the DataFrame.to_csv() method.
Below I've created a fully reproducible example that recreates the OP's .txt file, reads it in, and then writes out a new .csv file.
import pandas as pd

# Step 0: create the .txt file
text = '''"id":0
"value1":"234w-76-54"
"id":1
"value1":"2354w44-7-54"'''
with open("file.txt", "w") as file:
    file.write(text)

# Step 1: read in the .txt file and reshape it into one row per id
df = pd.read_csv('file.txt', delimiter=':', header=None)
df_out = pd.DataFrame({'id': df.loc[df.iloc[:, 0] == 'id'][1].tolist(),
                       'value1': df.loc[df.iloc[:, 0] == 'value1'][1].tolist()})
print(df_out)

# Step 2: create the .csv file
df_out.to_csv('out.csv', index=False)
Expected contents of out.csv:
id,value1
0,234w-76-54
1,2354w44-7-54
findall returns a list. You probably want to use re.search or re.match, depending on your use case.
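As a minimal sketch of that suggestion: re.search returns a single match object, and .group(1) gives a plain string rather than a list. The helper name parse_pairs and the two patterns are assumptions made for illustration; they are not the asker's original regular expressions.

```python
import re

def parse_pairs(lines):
    """Collect [id, value1] rows from alternating "id"/"value1" lines."""
    rows = []
    current_id = None
    for line in lines:
        id_match = re.search(r'"id":(\d+)', line)
        value_match = re.search(r'"value1":"([^"]*)"', line)
        if id_match:
            # group(1) is a plain string, not a list
            current_id = id_match.group(1)
        elif value_match:
            rows.append([current_id, value_match.group(1)])
    return rows

# Sample lines copied from the question
sample = ['"id":0', '"value1":"234w-76-54"', '"id":1', '"value1":"2354w44-7-54"']
rows = parse_pairs(sample)
print(rows)
```

Each row can then be passed to csv.writer.writerow() as is, so the brackets and quotes no longer end up in the file.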
If your log.txt really has that simple structure you can work with split():
import csv

with open("log.txt", "r") as file, open("file.csv", "w", newline="") as out:
    file2 = csv.writer(out, delimiter=",")
    file2.writerow(["id", "value1"])
    for line in file:
        if "id" in line:
            ids = int(line.split(":")[-1])
        else:
            value = line.split(":")[-1].split('"')[1]
            file2.writerow([ids, value])
The resulting csv will contain the following:
id,value1
0,234w-76-54
1,2354w44-7-54
The comma separator is set by the delimiter argument in the csv.writer call.
First look for a line with "id" in it. This line can easily be split at the :. This results in a list with two elements. Take the last part and cast it to an integer.
If there is no "id" in the line, it is a "value1" line. First split the line at the :. Again take the last part of the resulting list and split it at ". This results in a list with three elements; we need the second one.
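The two split steps described above can be traced on a single pair of lines. This is only an illustration using the sample data from the question; the variable names are made up for the walk-through.

```python
id_line = '"id":0\n'
value_line = '"value1":"234w-76-54"\n'

# Step 1: split the id line at ":" and convert the last part
id_parts = id_line.split(":")         # ['"id"', '0\n']
record_id = int(id_parts[-1])         # int() tolerates the trailing newline

# Step 2: split the value line at ":" and then at the double quote
value_parts = value_line.split(":")   # ['"value1"', '"234w-76-54"\n']
quoted = value_parts[-1].split('"')   # ['', '234w-76-54', '\n']
value = quoted[1]                     # the middle element is the value

print(record_id, value)
```

Note that this only works while the value itself contains no ":" character, since split(":")[-1] would then return just the part after the last colon.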
I would just like to create a csv file and at the same time add my data row by row with a for loop.
for x in y:
    newRow = "\n%s,%s\n" % (sentence1, sentence2)
    with open('Mydata.csv', "a") as f:
        f.write(newRow)
After the above process, I tried to read the csv file but I can't separate the columns. It seems that there is only one column, maybe I did something wrong in the csv creation process?
colnames = ['A_sentence', 'B_sentence']
Mydata = pd.read_csv(Mydata, names=colnames, delimiter=";")
print(Mydata['A_sntence']) #output Nan
When you are writing the file, it looks like you are using commas as separators, but when reading the file you are using semicolons (probably just a typo). Change delimiter=";" to delimiter="," and it should work.
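To see why the delimiter matters, here is a small sketch using only the standard csv module and an in-memory buffer in place of Mydata.csv. The sentence pairs are hypothetical stand-ins for the asker's data.

```python
import csv
import io

# Hypothetical sentence pairs standing in for the asker's loop data
pairs = [("first sentence", "second sentence"), ("hello", "world")]

buffer = io.StringIO()        # in-memory stand-in for Mydata.csv
writer = csv.writer(buffer)   # the default delimiter is ","
for sentence1, sentence2 in pairs:
    writer.writerow([sentence1, sentence2])

text = buffer.getvalue()

# Reading with the same delimiter gives two columns per row
right = list(csv.reader(io.StringIO(text, newline=""), delimiter=","))
# Reading with ";" finds no separator, so each row is a single column
wrong = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))

print(right)
print(wrong)
```

Using csv.writer instead of hand-built strings also takes care of quoting when a sentence itself contains a comma.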
The code works properly up to entering the for loop, and the date values are fetched. After that it returns an empty list for the rest of the variables, such as time, ref1, seriel and so on.
import pandas as pd
import re

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('C:/Users/WI/Desktop/file.csv')
# Drop rows with any empty cells
my_dataframe.dropna(axis=0, how='any', thresh=None, subset=['date'], inplace=False)
with open("C:/Users/WDSI/Desktop/OutputFile.txt", "w") as F:
    F.write("%s" %my_dataframe)
fin = open("C:/Users/WDSI/Desktop/OutputFile.txt", "r")
# print("Input file is taken")
fout = open("C:/Users/WDSI/Desktop/OutputFile1.txt", "w")
# print("Output file is taken")
for line in fin:
    date = re.findall(r'(\d{4}-\d{2}-\d{2})', fin.read())
    time = re.findall(r'(\s\d{2}:\d{2}:\d{2})',fin.read())
    seriel=re.findall(r'(\s[A-Z][A-Z][A-Z][0-9])',fin.read())
    part=re.findall(r'(\s[0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9])',fin.read())
    ref1=re.findall(r'(\s\d{16})',fin.read())
    ref3=re.findall(r'(\d{9})+$',fin.read())
    #print(date)
    #print(time)
    #print(seriel)
    #print(part)
    #print(ref1)
    #print(ref3)
    fout.write("%10s,%8s" %((date,time)))
fout.close()
When we run this code, only the date variable gets a value; the other variables, such as time and ref1, come back empty. Also, please help me write date, time, serial, part, ref1 and ref3 from each row of the CSV file. The output file should be written in this format.
You are iterating line by line with for line in fin, but every one of your findall calls reads the whole file content with fin.read(). The first fin.read() consumes the rest of the file, so each later call gets an empty string and findall returns an empty list.
You can either process the file line by line (replace each fin.read() with line):
for line in fin:
    date = re.findall(r'(\d{4}-\d{2}-\d{2})', line)
    ...
Or read the whole file once and remove the for loop:
content = fin.read()
date = re.findall(r'(\d{4}-\d{2}-\d{2})', content)
...
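A short sketch of both points, under stated assumptions: the sample rows are hypothetical (the real file layout is not shown in the question), the patterns are simplified versions of the asker's, and extract is a made-up helper name.

```python
import io
import re

# A second read() on the same file object returns an empty string
f = io.StringIO("2019-08-05 10:15:30\n")
first = f.read()
second = f.read()   # nothing left to read

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')
TIME_RE = re.compile(r'\d{2}:\d{2}:\d{2}')

def extract(lines):
    """Return one "date,time" string per line that contains both fields."""
    out = []
    for line in lines:
        date = DATE_RE.search(line)
        time = TIME_RE.search(line)
        if date and time:
            out.append("%s,%s" % (date.group(), time.group()))
    return out

# Hypothetical rows standing in for the dumped DataFrame text
sample = ["0 2019-08-05 10:15:30 ABC1", "header without a timestamp"]
result = extract(sample)
print(result)
```

The same loop can be extended with one compiled pattern per remaining field (seriel, part, ref1, ref3) and the joined string written to the output file instead of collected in a list.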
This is not an exact replica of your solution, but it shows how you can open a file, take whatever you need from each line, and write the new data to a new file.
I have created a CSV file with the following lines:
This is a date 2019-08-05, 2019-09-03
This is a email asdfasdf#abc.com
Solution 1:
import re

with open("./Datalake/output.txt", "w+") as wf:
    with open("./Datalake/test.csv") as f:
        for line in f:
            dates = re.findall(r"\d{4}-\d{1,2}-\d{1,2}", line)
            dates = "|".join(dates)
            emails = re.findall(r'[\w\.-]+#[\w\.-]+', line)
            emails = "|".join(emails)
            extracted_line = "{}, {}\n".format(dates, emails)
            wf.write(extracted_line)
            print(extracted_line)
Solution 2:
You can extract directly from a DataFrame. Apply the same search using a lambda function, which is executed for each row. Be careful, though: you might need some error handling, because the lambda will throw an error if the column contains a None value. Convert the column to str before applying the lambda.
df = pd.read_csv("./Datalake/test.csv", sep='\n', header=None, names=["string_col"])
df['dates'] = df["string_col"].apply(lambda x: re.findall(r"\d{4}-\d{1,2}-\d{1,2}", x))
df['emails'] = df["string_col"].apply(lambda x: re.findall(r"[\w\.-]+#[\w\.-]+", x))
In that case the calculated column will hold Python lists, so you might consider using ''.join() in the lambda to turn them into text.
I am really new to Python and I need to change new artikel IDs to the old ones. The IDs are mapped inside a dict. The file I need to edit is a normal txt file where every column is separated by tabs. The problem is not replacing the values, but rather replacing only the occurrences in the desired column, which is set by pos.
I really would appreciate some help.
def replaceArtCol(filename, pos):
    with open(filename) as input_file, open('test.txt','w') as output_file:
        for each_line in input_file:
            val = each_line.split("\t")[pos]
            for row in artikel_ID:
                if each_line[pos] == pos:
                    line = each_line.replace(val, artikel_ID[val])
                    output_file.write(line)
This code just replaces every occurrence of the string in the text file.
Supposing your ID mapping dict looks like ID_mapping = {'old_id': 'new_id'}, I think your code is not far from working correctly. A modified version could look like this:
with open(filename) as input_file, open('test.txt', 'w') as output_file:
    for each_line in input_file:
        # Strip the newline so the last column also compares cleanly
        line = each_line.rstrip("\n").split("\t")
        if line[pos] in ID_mapping:
            line[pos] = ID_mapping[line[pos]]
        output_file.write('\t'.join(line) + "\n")
If you're not working in pandas anyway, this can save a lot of overhead.
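The same idea can be written as a small, testable generator. This is a sketch only: replace_column, the sample IDs and the sample rows are all hypothetical, chosen to show that the ID is changed in column pos and left alone in the other columns.

```python
def replace_column(lines, pos, id_mapping):
    """Yield tab-separated lines with column `pos` translated via id_mapping."""
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if pos < len(fields):
            # Fall back to the original value when the ID is not in the mapping
            fields[pos] = id_mapping.get(fields[pos], fields[pos])
        yield "\t".join(fields) + "\n"

# Hypothetical mapping and rows; "N100" also appears in column 0 on purpose
id_mapping = {"N100": "A1"}
sample = ["N100\tN100\tbolt\n", "N200\tN100\tnut\n"]
result = list(replace_column(sample, 1, id_mapping))
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the results written with output_file.writelines().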
If your data is tab separated, you should load it into a DataFrame. That way you get a column and row structure. What you are doing right now will not let you do what you want without some complex and buggy logic. You may try these steps:
import pandas as pd
df = pd.read_csv("dummy.txt", sep="\t", encoding="latin-1")
df['desired_column_name'] = df['desired_column_name'].replace({"value_to_be_changed": "newvalue"})
print(df.head())
I am extremely new to Python (and to coding, for that matter).
Could I please get some help with how to achieve this? I have gone through numerous threads, but nothing helped.
My input file looks like this:
I want my output file to look like this:
I just want the first column replicated twice in the second Excel sheet, with a blank line after every 5 rows.
A .csv file can be opened with a normal text editor. Do this and you'll see that the entries for each column are comma-separated (csv = comma separated values). In your file the separator is most likely a semicolon (;), though.
Since you're new to coding, I recommend trying it manually with a text editor first until you have the desired output, and then try to replicate it with python.
Also, you should post code examples here and ask specific questions about why it doesn't work like you expected it to work.
Below is the solution. Don't forget to configure input/output files and the delimiter:
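# Raw strings keep the backslashes in the Windows paths literal
input_file = r'c:\Temp\input.csv'
output_file = r'c:\Temp\output.csv'
delimiter = ';'

i = 0
output_data = ''
with open(input_file) as f:
    for line in f:
        i += 1
        # Stripped copy, delimiter, then the original line (which keeps its newline)
        output_data += line.strip() + delimiter + line
        if i == 5:
            # Blank line after every 5 rows
            output_data += '\n'
            i = 0

with open(output_file, 'w') as file_:
    file_.write(output_data)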
Python has a csv module for doing this. It is able to automatically read each row into a list of columns. It is then possible to simply take the first element and replicate it into the second column in an output file.
import csv

# Python 3: open csv files in text mode with newline='' instead of 'rb'/'wb'
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    input_rows = list(csv_input)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for line, row in enumerate(input_rows, start=1):
        csv_output.writerow([row[0], row[0]])
        if line % 5 == 0:
            csv_output.writerow([])
Note that it is not advisable to write the updated data directly over the input file: if there was a problem, you would lose your original file.
If your input file has multiple columns, this script will drop them and simply duplicate the first column.
By default, the csv format separates columns with a comma. This can be changed by specifying the desired delimiter as follows:
csv_output = csv.writer(f_output, delimiter=';')
I have a variable that contains a string of:
fruit_wanted = 'banana,apple'
I also have a csv file
fruit,'orange','grape','banana','mango','apple','strawberry'
number,1,2,3,4,5,6
value,3,2,2,4,2,1
price,3,2,1,2,3,4
Now how do I delete the columns whose 'fruit' is not listed in the 'fruit_wanted' variable?
So that the outfile would look like
fruit,'banana','apple'
number,3,5
value,2,2
price,1,3
Thank you.
Read the csv file using the DictReader() class, and ignore the columns you don't want:
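import csv

fieldnames = ['fruit'] + ["'%s'" % f for f in fruit_wanted.split(',')]
wanted = set(fieldnames)
# Python 3: text mode with newline='' instead of 'rb'/'wb'
with open(inputfile, newline='') as f_in, open(outputfile, 'w', newline='') as f_out:
    outfile = csv.DictWriter(f_out, fieldnames=fieldnames)
    outfile.writeheader()  # the header row is not written automatically
    for row in csv.DictReader(f_in):
        row = {k: row[k] for k in row if k in wanted}
        outfile.writerow(row)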
Here's some pseudocode:
open the original CSV for input, and the new one for output
read the first row of the original CSV and figure out which columns you want to delete
write the modified first row to the output CSV
for each row in the input CSV:
    delete the columns you figured out before
    write the modified row to the output CSV
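One possible way to turn the pseudocode into Python, using csv.reader and csv.writer on in-memory buffers. The function name keep_columns is hypothetical, and stripping the single quotes from the header names is an assumption based on the sample file in the question.

```python
import csv
import io

def keep_columns(src, dst, wanted):
    """Copy CSV rows from src to dst, keeping column 0 and the wanted columns."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Header names are wrapped in single quotes in the sample, so strip them
    keep = [0] + [i for i, name in enumerate(header[1:], start=1)
                  if name.strip("'") in wanted]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])

fruit_wanted = 'banana,apple'
src = io.StringIO(
    "fruit,'orange','grape','banana','mango','apple','strawberry'\n"
    "number,1,2,3,4,5,6\n"
    "value,3,2,2,4,2,1\n"
    "price,3,2,1,2,3,4\n"
)
dst = io.StringIO()
keep_columns(src, dst, set(fruit_wanted.split(',')))
print(dst.getvalue())
```

The kept columns stay in the order they have in the input file, which for this sample matches the order in fruit_wanted. With real files, open both with newline='' and pass the file objects in place of the StringIO buffers.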