I have a text file that contains miscellaneous data, among which are "names" that also appear as rows in one column of a separate Excel file. I need to compare the strings from the text file with that Excel column and output the ones that match, along with extra data from other columns of the matching row. I'd be thankful for an example of how to go about it, maybe using pandas?
You should read the Excel file with pandas (a workbook is not plain text, so open() won't work on it) and open the text file as usual:
import pandas as pd
exceldata = pd.read_excel(path_to_excel_file)
textdata = open(path_to_text_file, "r")
Then put the text data in a flat list of strings (a list of lists can't be turned into a set):
textdatalist = [item.strip() for line in textdata for item in line.split(',')]
And then keep the Excel rows whose name occurs in the text file. Here "name" is a placeholder for your actual column header:
print(exceldata[exceldata["name"].isin(textdatalist)])
All together:
import pandas as pd
exceldata = pd.read_excel(path_to_excel_file)
with open(path_to_text_file, "r") as textdata:
    textdatalist = [item.strip() for line in textdata for item in line.split(',')]
print(exceldata[exceldata["name"].isin(textdatalist)])
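To pull the extra columns for each match when the names are buried in free text, one option is to extract the words from the text and filter the DataFrame by them. The sketch below is only an illustration: the column names "name", "city" and "age", the sample rows, and the helper matching_rows are hypothetical, and the in-memory DataFrame stands in for what pd.read_excel would return. It only handles single-word names.

```python
import re
import pandas as pd

def matching_rows(text, frame, name_column):
    """Return the rows of frame whose name_column value occurs as a word in text."""
    # Collect every run of letters, digits and underscores found in the text
    words = set(re.findall(r"\w+", text))
    return frame[frame[name_column].isin(words)]

# Stand-in for: frame = pd.read_excel("names.xlsx")  (hypothetical file name)
frame = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "city": ["Oslo", "Lima", "Pune"],
    "age": [31, 45, 27],
})
text = "x9 alice 42 zz, carol; unrelated noise"
result = matching_rows(text, frame, "name")
print(result)
```

The returned frame keeps all columns of the matching rows, so the extra data comes along without a separate lookup.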
I'm trying to read data from a textfile which consists of newline separated words I intend to use as the header for a separate csv file with no header.
I've loaded the textfile and dataset in via pandas but don't really know where to go from here.
names = pandas.read_csv('names.txt', header = None)
dataset = pandas.read_csv('dataset.csv', header = None)
The contents of the textfile look like this
dog
cat
sheep
...
You could read your .txt file in a different way, for example with splitlines():
with open('names.txt') as f:
    header_names = f.read().splitlines()
header_names is now a list, and you could use it to define the header (column names) of your dataframe:
dataset = pandas.read_csv('dataset.csv', header = None)
dataset.columns = header_names
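Alternatively, read_csv accepts the column labels directly through its names parameter. A minimal sketch, assuming the list has exactly one entry per column; the sample contents are made up and io.StringIO stands in for the real files.

```python
import io
import pandas

# Stand-ins for names.txt and dataset.csv (sample contents are invented)
names_file = io.StringIO("dog\ncat\nsheep\n")
dataset_file = io.StringIO("1,2,3\n4,5,6\n")

header_names = names_file.read().splitlines()

# header=None says the file has no header row; names supplies the labels
dataset = pandas.read_csv(dataset_file, header=None, names=header_names)
print(dataset)
```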
I have data in a CSV file as follows:
2021-03-26
2021-03-27
2021-04-03
I want the output as {2021-03-26,2021-03-27,2021-04-03}
You can read the file and split its contents on whitespace.
Because split() with no argument splits on spaces and newlines alike, this also works when the CSV holds one date per line, as in your example.
Check the code below:
with open("csv file path", 'r') as fp:
    data = fp.read()
data = data.strip()
dates = data.split()
print(dates)
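To get the exact {…} form asked for, the dates can be joined with commas and wrapped in braces. This is a small sketch; the helper name format_dates is made up, and the sample string stands in for the contents of the file.

```python
def format_dates(text):
    """Join whitespace-separated dates into a single {a,b,c} string."""
    dates = text.split()  # splits on spaces and newlines alike
    return "{" + ",".join(dates) + "}"

# Stand-in for the text read from the CSV file
sample = "2021-03-26\n2021-03-27\n2021-04-03\n"
formatted = format_dates(sample)
print(formatted)
```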
I am trying to read a CSV file in Python. I want to read the whole file except the first two columns. The file has no column names, so I can't drop or skip them by name.
What code do I need to read the file without reading first two columns?
I have tried the code below:
with open("data2.csv", "r") as file:
    lines = [line.split() for line in file]
for i, x in enumerate(lines):
    print("line {0} = {1}".format(i,x))
With the code above I am just reading the file line by line. How do I skip the first two columns while reading? I don't have the column names.
You should use the csv module in the standard library. You might need to pass additional kwargs (keyword arguments) depending on the format of your csv file.
import csv
with open('my_csv_file', 'r') as fin:
    reader = csv.reader(fin)
    for line in reader:
        print(line[2:])
        # do something with rest of columns...
If the lines list already holds the data you want, you can use slicing on each row to get rid of the columns you don't want.
Getting rid of the first two columns of a row:
row[2:]
Getting rid of the last two:
row[:-2]
Applied to your code (use line.split(',') instead if the fields are comma-separated):
with open("data2.csv", "r") as file:
    lines = [line.split()[2:] for line in file]
for i, x in enumerate(lines):
    print("line {0} = {1}".format(i,x))
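The two ideas above can be combined: let the csv module do the parsing and slice each row. A minimal sketch, in which the helper name rows_without_leading_columns and the sample data are invented and io.StringIO stands in for the opened file. Unlike a plain split, csv.reader keeps a quoted field that contains a comma in one piece.

```python
import csv
import io

def rows_without_leading_columns(lines, skip=2):
    """Yield each CSV row with its first `skip` fields removed."""
    for row in csv.reader(lines):
        yield row[skip:]

# Stand-in for open("data2.csv", newline="")
sample = io.StringIO('id,ts,"a, quoted",x\n1,2,3,4\n')
trimmed = list(rows_without_leading_columns(sample))
print(trimmed)
```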
I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing it with the ijson library as follows:
with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get this error:
JSONError: Additional data
It seems the error is caused by the file containing multiple objects.
What's the recommended way to analyze such a JSON file in Jupyter?
Thank you in advance.
If this is the complete file, it is not a single valid JSON document. The objects must be separated by commas, and the whole file must start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code to clean your file:
with open("yourfile.json", "r") as f:
    # one JSON object per line; drop blank lines and trailing newlines
    lines = [line.strip() for line in f if line.strip()]
with open("cleanfile.json", "w") as g:
    # join the objects with commas and wrap them in square brackets
    g.write("[" + ",\n".join(lines) + "]")
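A variation on the cleaning step is to parse every line with the json module and write the resulting list back out, which also checks each line along the way. This is a sketch under assumptions: the helper name jsonl_to_array and the sample records are invented, and the function works on text rather than on files to keep it short.

```python
import json

def jsonl_to_array(jsonl_text):
    """Convert one-object-per-line text into the text of a single JSON array."""
    objects = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return json.dumps(objects)

# Invented sample in the same one-object-per-line layout
sample = '{"review_id": "a1", "stars": 5}\n{"review_id": "b2", "stars": 3}\n'
array_text = jsonl_to_array(sample)
print(array_text)
```

A malformed line raises json.JSONDecodeError here instead of ending up in the cleaned file.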
To read a JSON file you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
#get a pandas dataframe object from the cleaned json file
#(for the original one-object-per-line file, pass lines=True instead)
df = pd.read_json("path/to/your/filename.json")
If you are not familiar with pandas, here is a quick head start on working with a dataframe object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2
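pandas can also read the one-object-per-line layout directly through the lines parameter of read_json, without cleaning the file first. A minimal sketch with invented sample records; io.StringIO stands in for the path to the real file.

```python
import io
import pandas as pd

# Invented sample in the same one-object-per-line layout
sample = io.StringIO(
    '{"review_id": "a1", "user_id": "u1", "stars": 5}\n'
    '{"review_id": "b2", "user_id": "u1", "stars": 3}\n'
)
# lines=True tells pandas to treat every line as a separate JSON object
df = pd.read_json(sample, lines=True)
print(df.head())
```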
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over the lines and parse each one into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json
with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))
# object_list will contain all of your file's data
You could do it as a list comprehension to make it a little more Pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line)
                   for line in f.readlines()]
# object_list will contain all of your file's data
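Real files often end with a blank line or contain the odd broken record, and json.loads fails on both. The sketch below is one way to handle that, with an invented helper name load_json_lines and invented sample lines: it skips blank lines and reports the number of the line that could not be parsed.

```python
import json

def load_json_lines(lines):
    """Parse an iterable of lines, one JSON object per line, into a list."""
    objects = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        try:
            objects.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err
    return objects

# A list of strings stands in for an open file, which also yields lines
sample = ['{"review_id": "a1"}\n', '\n', '{"review_id": "b2"}\n']
object_list = load_json_lines(sample)
print(object_list)
```

Passing the open file object itself (load_json_lines(f)) reads it line by line rather than loading everything with readlines() first.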
You have multiple lines in your file, so that's why it's throwing errors.
import json
with open(filename, 'r') as f:
    lines = f.readlines()
first = json.loads(lines[0])
second = json.loads(lines[1])
That should catch both lines and load them in properly.
I have two csv files with a single column of data. How can I remove data in the second csv file in-place by comparing it with the data in the first csv file? For example:
import csv
reader1 = csv.reader(open("file1.csv", newline=""))
reader = csv.reader(open("file2.csv", newline=""))
for line in reader:
    if line in reader1:
        print(line)
If both files are just single columns, you could use a set to compute the difference. However, this presumes that duplicate entries within a file and the order of the entries don't matter.
#since each file is a column, unroll each file into a single list:
dat1 = [x[0] for x in reader1]
dat2 = [y[0] for y in reader]
#take the set difference: the entries of file2 that are not in file1
dat2_without_dat1 = set(dat2).difference(dat1)
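If order and duplicates in the second file do matter, the set can be used only for the lookup while the rows of the second file are filtered in their original order and written back. This is a sketch under assumptions: the helper name remove_known_rows and the demo contents are invented, empty rows are dropped, and the "in-place" update is done by writing a temporary file and swapping it in.

```python
import csv
import os
import tempfile

def remove_known_rows(reference_path, target_path):
    """Rewrite target_path without the rows whose first field is in reference_path."""
    with open(reference_path, newline="") as ref:
        known = {row[0] for row in csv.reader(ref) if row}
    with open(target_path, newline="") as src:
        kept = [row for row in csv.reader(src) if row and row[0] not in known]
    # Write to a temporary file first, then swap it in, so a crash
    # part-way through cannot leave a half-written target file
    directory = os.path.dirname(os.path.abspath(target_path))
    with tempfile.NamedTemporaryFile("w", newline="", dir=directory, delete=False) as tmp:
        csv.writer(tmp).writerows(kept)
    os.replace(tmp.name, target_path)
    return kept

# Demo with throwaway files (the names and contents are invented)
with tempfile.TemporaryDirectory() as folder:
    file1 = os.path.join(folder, "file1.csv")
    file2 = os.path.join(folder, "file2.csv")
    with open(file1, "w", newline="") as f:
        f.write("b\nd\n")
    with open(file2, "w", newline="") as f:
        f.write("a\nb\nc\nb\nd\n")
    remaining = remove_known_rows(file1, file2)
    with open(file2, newline="") as f:
        rewritten = f.read()
print(remaining)
```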