Set header of csv using separate text file - python

I'm trying to read data from a text file consisting of newline-separated words, which I intend to use as the header for a separate csv file that has no header.
I've loaded the textfile and dataset in via pandas but don't really know where to go from here.
names = pandas.read_csv('names.txt', header = None)
dataset = pandas.read_csv('dataset.csv', header = None)
The contents of the text file look like this:
dog
cat
sheep
...

You could read your .txt file in a different way, for example using splitlines():
with open('names.txt') as f:
    header_names = f.read().splitlines()
header_names is now a list, and you can use it to define the header (column names) of your dataframe:
dataset = pandas.read_csv('dataset.csv', header = None)
dataset.columns = header_names
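Alternatively, read_csv can apply the header in one step via its names parameter, so no separate column assignment is needed. A minimal sketch that creates small stand-in files for the demo (the sample contents are illustrative):

```python
import pandas as pd

# Create small sample inputs for the demo (stand-ins for the real files)
with open('names.txt', 'w') as f:
    f.write('dog\ncat\nsheep\n')
with open('dataset.csv', 'w') as f:
    f.write('1,2,3\n4,5,6\n')

# splitlines() turns the newline-separated words into a flat list
with open('names.txt') as f:
    header_names = f.read().splitlines()

# names= applies the header while loading the headerless csv
dataset = pd.read_csv('dataset.csv', header=None, names=header_names)
print(list(dataset.columns))
```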

Related

Python - use txt file to fetch data from excel and extra values

I have a text file which consists of data including some random data, among which there are "names" that also exist in a separate Excel file as rows in a column. What I need to do is compare the strings from the txt file and the Excel file and output those that match, along with some extra data corresponding to that row from other columns. I'd be thankful for an example of how to go about it, maybe using pandas?
You could open the text file and a csv export of the Excel file like so (note that plain open() cannot parse a real .xlsx file, so save the sheet as .csv first, or use pandas.read_excel):
textdata = open(path_to_text_file, "r")
exceldata = open(path_to_csv_file, "r")
Then put the data into sets of stripped lines (the lists produced by split(',') are unhashable, so they cannot be used in a set intersection):
textdataset = {line.strip() for line in textdata}
exceldataset = {line.strip() for line in exceldata}
And then compare the two sets with:
print(exceldataset.intersection(textdataset))
All together:
textdata = open(path_to_text_file, "r")
exceldata = open(path_to_csv_file, "r")
textdataset = {line.strip() for line in textdata}
exceldataset = {line.strip() for line in exceldata}
print(exceldataset.intersection(textdataset))
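Since the question mentions pandas: with pandas the comparison becomes a one-liner with isin(), and the matching rows keep their extra columns automatically. A sketch with a small in-memory frame standing in for the spreadsheet (the column name 'name' and the sample values are assumptions); in practice you would load the real sheet with pd.read_excel('data.xlsx'):

```python
import pandas as pd

# In practice: df = pd.read_excel('data.xlsx')  (plain open() cannot parse .xlsx)
# Here a small in-memory frame stands in for the spreadsheet
df = pd.DataFrame({'name': ['dog', 'cat', 'sheep'],
                   'extra': [1, 2, 3]})

# Names collected from the text file (one per line in the real case)
names_in_text = {'cat', 'sheep', 'horse'}

# Keep rows whose "name" matches; the extra columns come along automatically
matches = df[df['name'].isin(names_in_text)]
print(matches)
```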

How to write to csv file from a text file

I have text file with data like below:
"id":0
"value1":"234w-76-54"
"id":1
"value1":"2354w44-7-54"
I want to have these data in a csv file. I tried the code below, but it writes each id and value1 as a list in the csv file.
with open("log.txt", "r") as file:
file2 = csv.writer (open("file.csv", "w", newline=""), delimiter=",")
file2.writerow(["id", "value1"])
for lines in file:
if "id" in lines:
ids = re.findall(r'(\d+)', lines)
if "value1" in lines:
value1 = re.findall(r'2[\w\.:-]+', lines)
file2.writerow([ids, value1])
getting output-
id value1
['0'] ['234w-76-54']
['1'] ['2354w44-7-54']
Desired output-
id value1
0 234w-76-54
1 2354w44-7-54
The simplest way to do it, in my opinion, is to read in the .txt file using pandas's read_csv() method and write out using the DataFrame.to_csv() method.
Below I've created a fully reproducible example recreating the OP's .txt file, reading it in and then writing out a new .csv file.
import pandas as pd

# Step 0: create the .txt file
sample = '''"id":0
"value1":"234w-76-54"
"id":1
"value1":"2354w44-7-54"'''
with open("file.txt", "w") as f:
    f.write(sample)
# Step 1: read in the .txt file and create a dataframe, with some manipulation
# to get the desired shape
df = pd.read_csv('file.txt', delimiter=':', header=None)
df_out = pd.DataFrame({'id': df.loc[df.iloc[:, 0] == 'id'][1].tolist(),
                       'value1': df.loc[df.iloc[:, 0] == 'value1'][1].tolist()})
print(df_out)

# Step 2: create a .csv file
df_out.to_csv('out.csv', index=False)
Expected output .csv file:
id,value1
0,234w-76-54
1,2354w44-7-54
findall returns a list. You probably want to use re.search or re.match, depending on your use case.
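For example, re.search with the question's patterns yields plain strings instead of one-element lists (a sketch using two sample lines from the question):

```python
import re

# Two sample lines from the question's log file
line_id = '"id":0'
line_val = '"value1":"234w-76-54"'

# re.search returns a single match object (or None), not a list like findall
m = re.search(r'(\d+)', line_id)
ids = m.group(1) if m else None

m = re.search(r'2[\w\.:-]+', line_val)
value1 = m.group(0) if m else None

print(ids, value1)
```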
If your log.txt really has that simple structure you can work with split():
import csv

with open("log.txt", "r") as file:
    file2 = csv.writer(open("file.csv", "w", newline=""), delimiter=",")
    file2.writerow(["id", "value1"])
    for line in file:
        if "id" in line:
            ids = int(line.split(":")[-1])
        else:
            value = line.split(":")[-1].split('"')[1]
            file2.writerow([ids, value])
The resulting csv will contain the following:
id,value1
0,234w-76-54
1,2354w44-7-54
The comma as separator is set by the delimiter argument in the csv.writer call.
First look for a line with "id" in it. This line can easily be split at the :, which results in a list with two elements. Take the last part and cast it to an integer.
If there is no "id" in the line, it is a "value1" line. First split the line at the :. Again take the last part of the resulting list and split it at ". This results in a list with three elements, of which we need the second.

Python: Replace string in a txt file but not on every occurrence

I am really new to Python and I need to change new artikel IDs to the old ones. The IDs are mapped inside a dict. The file I need to edit is a normal txt file where every column is separated by tabs. The problem is not replacing the values, but rather replacing only the occurrences in the desired column, which is set by pos.
I really would appreciate some help.
def replaceArtCol(filename, pos):
    with open(filename) as input_file, open('test.txt', 'w') as output_file:
        for each_line in input_file:
            val = each_line.split("\t")[pos]
            for row in artikel_ID:
                if each_line[pos] == pos:
                    line = each_line.replace(val, artikel_ID[val])
                    output_file.write(line)
This Code just replaces any occurance of the string in the text file.
Supposing your ID mapping dict looks like ID_mapping = {'old_id': 'new_id'}, I think your code is not far from working correctly. A modified version could look like:
with open(filename) as input_file, open('test.txt', 'w') as output_file:
    for each_line in input_file:
        line = each_line.split("\t")
        if line[pos] in ID_mapping:
            line[pos] = ID_mapping[line[pos]]
        line = '\t'.join(line)
        output_file.write(line)
If you're not working in pandas anyway, this can save a lot of overhead.
If your data is tab-separated, then you should load it into a dataframe; this way you get a columns-and-rows structure. What you are doing right now will not let you do what you want without some complex and buggy logic. You may try these steps:
import pandas as pd
df = pd.read_csv("dummy.txt", sep="\t", encoding="latin-1")
df['desired_column_name'] = df['desired_column_name'].replace({"value_to_be_changed": "newvalue"})
print(df.head())
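After editing, the table can be written back out as a tab-separated file with to_csv(). A minimal sketch with a made-up artikel_id column and mapping (the names and values are illustrative, not from the original file):

```python
import pandas as pd

# Stand-in for the tab-separated file read with pd.read_csv(..., sep="\t")
df = pd.DataFrame({'artikel_id': ['old1', 'old2'], 'qty': [3, 5]})

# replace() only touches the chosen column, leaving the other columns alone
df['artikel_id'] = df['artikel_id'].replace({'old1': 'new1'})

# Write the edited table back out as tabs, without the index column
df.to_csv('test.txt', sep='\t', index=False)
print(open('test.txt').read())
```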

Converting CSV data from file to JSON

I have a csv file that contains data separated by ','. I am trying to convert it into a JSON format. For this I am trying to extract the headers first, but I am not able to differentiate between the headers and the next row.
Here is the data in csv file:
Start Date ,Start Time,End Date,End Time,Event Title
9/5/2011,3:00:00 PM,9/5/2011,,Social Studies Dept. Meeting
9/5/2011,6:00:00 PM,9/5/2011,8:00:00 PM,Curriculum Meeting
I have tried csvreader as well but I got stuck at the same issue.
Basically Event Title and the date on the next line is not being distinguished.
with open(file_path, 'r') as f:
    first_line = re.sub(r'\s+', '', f.read())
    arr = []
    headers = []
    for header in f.readline().split(','):
        headers.append(header)
    for line in f.readlines():
        lineItems = {}
        for i, item in enumerate(line.split(',')):
            lineItems[headers[i]] = item
        arr.append(lineItems)
print(arr)
print(headers)
jsonText = json.dumps(arr)
print(jsonText)
All three print statements give empty result below.
[]
['']
[]
I expect jsonText to be a json of key value pairs.
Use csv.DictReader to get a list of dicts (each row is a dict) then serialize it.
import json
import csv
with open(csvfilepath) as f, open(jsonfilepath, 'w') as out:
    json.dump(list(csv.DictReader(f)), out)
In Python, each file has a marker that keeps track of where you are in the file. Once you call read(), you have read through the entire file, and all future read or readline calls will return nothing.
So, just delete the line involving first_line.
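Alternatively, if you do need the raw contents first, you can rewind the file marker with f.seek(0) before reading the header. A minimal sketch with a tiny sample file (the filename and contents are illustrative):

```python
# Tiny sample file standing in for the question's csv
with open('events.csv', 'w') as f:
    f.write('a,b\n1,2\n')

with open('events.csv') as f:
    contents = f.read()  # the marker is now at end-of-file
    f.seek(0)            # rewind so later reads see the data again
    headers = f.readline().strip().split(',')

print(headers)
```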

Parsing multiple json objects from a text file using Python

I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing using ijson lib as follows:
with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get the error:
JSONError: Additional data
It seems this error is due to the multiple objects.
What's the recommended way to analyze such a JSON file in Jupyter?
Thank you in advance.
The file format is not correct if this is the complete file. Between the curly brackets there must be a comma, and the file should start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code how to clean your file:
lastline = None
with open("yourfile.json", "r") as f:
    lineList = f.readlines()
lastline = lineList[-1]
with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    for i, line in enumerate(f, 0):
        if i == 0:
            line = "[" + str(line) + ","
            g.write(line)
        elif line == lastline:
            g.write(line)
            g.write("]")
        else:
            line = str(line) + ","
            g.write(line)
To read a json file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
#get a pandas dataframe object from json file
df = pd.read_json("path/to/your/filename.json")
If you are not familiar with pandas, here's a quick head start on how to work with a dataframe object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2
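For newline-delimited files like the one in the question, read_json can also parse the lines directly with lines=True, so no cleaning pass is needed. A sketch with two minimal records (the filename and field values are illustrative):

```python
import pandas as pd

# Two minimal records mimicking the question's file, one JSON object per line
with open('reviews.json', 'w') as f:
    f.write('{"review_id": "x7mD", "user_id": "msQe"}\n'
            '{"review_id": "dDl8", "user_id": "msQe"}\n')

# lines=True tells pandas each line is a separate JSON object
df = pd.read_json('reviews.json', lines=True)
print(df['review_id'].tolist())
```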
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over each line and parse it into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json
with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))
# object_list will contain all of your file's data
You could do it as a list comprehension to make it a little more pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f.readlines()]
# object_list will contain all of your file's data
You have multiple JSON objects on separate lines in your file; that's why it's throwing errors.
import json
with open(filename, 'r') as f:
    lines = f.readlines()
first = json.loads(lines[0])
second = json.loads(lines[1])
That should catch both lines and load them in properly
