I have the datafile below and I want to delete every line that contains the number "30" in the first column. This number is always at the same position.
What I have thought of is to read the file and create a list from the first column,
check whether the number "30" appears in each item of the list, and then delete the corresponding line by its index.
However, I am not sure how to proceed.
Please let me know your thoughts.
Here is what I have tried up to this point:
f = open("file.txt","r")
lines = f.readlines()
f.close()
f = open("file.txt","w")
for line in lines:
if line!="30"+"\n":
f.write(line)
f.close()
f = open("file.txt", "r")
lines = f.readlines()
f.close()
f = open("file.txt", "w")
for line in lines:
if '30' not in line[4:6]:
f.write(line)
f.close()
Try this
If you're willing to use pandas, you could do it in three lines:
import pandas as pd
# Read in file
df = pd.read_csv("file.txt", header=None, delim_whitespace=True)
# Remove rows where the first column contains '30'
# (cast to str first, in case pandas parsed the column as numeric)
df = df[~df[0].astype(str).str.contains('30')]
# Save the result
df.to_csv("cleaned.txt", sep='\t', index=False, header=False)
This approach can easily be extended to perform other types of filtering or manipulating your data.
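For instance, here is a small sketch of such an extension (assuming the same whitespace-delimited file, and that the second column is numeric) that combines several conditions in one boolean mask:

import pandas as pd

df = pd.read_csv("file.txt", header=None, delim_whitespace=True)
# keep rows whose first column is not exactly "30" and whose
# second column (assumed numeric here) is non-negative
mask = (df[0].astype(str) != "30") & (df[1] >= 0)
df[mask].to_csv("cleaned.txt", sep='\t', index=False, header=False)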
One way is to use a regular expression that matches 30 at the beginning of the line, and skip those lines when rewriting the file:
import re
f = open("file.txt", "r")
lines = f.readlines()
f.close()
f = open("file.txt", "w")
for line in lines:
if re.search(r'^\d*30',line):
f.write(line)
f.close()
Hope it works well.
I am aggregating data in a CSV file with this code:
import pandas
df = pandas.read_csv("./input.csv", delimiter=";", low_memory=False)
df.head()
count_severity = df.groupby("PHONE")["IMEI"].unique()
has_multiple_elements = count_severity.apply(lambda x: len(x)>1)
result = count_severity[has_multiple_elements]
result.to_csv("./output.csv", sep=";")
and in some lines of the resulting file, I get the following:
It turns out that the second column, the one after the ; sign, gets split across two rows.
Could you please tell me how to get rid of this line break? I tried adding the parameter line_terminator=None in result.to_csv, but it didn't help.
Any method is acceptable, even if I have to overwrite the file and save a new one. I also tried this:
import pandas as pd
output_file = open("./output.csv", "r")
output_file = ''.join([i for i in output_file]).replace("\n", "")
output_file_new = open("./output_new.csv", "w")
output_file_new.writelines(output_file)
output_file_new.close()
But then everything ends up as one solid line, which is not good at all.
To summarize, I should get the format of this file:
Thank You!
If the broken lines always start with a comma, you can just replace the sequence "\n," with ",":
with open("./output.csv", "r") as file:
content = file.read()
new_content = content.replace("\n,", ",")
with open("./new_output.csv", "w") as file:
file.write(new_content)
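Alternatively, you can avoid the stray line breaks at the source. Here is a sketch, assuming the breaks come from writing the numpy arrays that .unique() returns, whose string representation wraps across lines: join each group's IMEIs into a single string before calling to_csv.

import pandas as pd

df = pd.read_csv("./input.csv", delimiter=";", low_memory=False)
count_severity = df.groupby("PHONE")["IMEI"].unique()
result = count_severity[count_severity.apply(lambda x: len(x) > 1)]
# join each array of IMEIs into one comma-separated string, so no
# multi-line array representation ends up in the output file
result = result.apply(lambda imeis: ",".join(map(str, imeis)))
result.to_csv("./output.csv", sep=";")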
I have this code that reads through my CSV files (p01_results, p02_results, ...) and removes some unwanted rows based on their row numbers, and it works. Now I am trying to add two columns, participantID and session. For participantID, I tried to read the name of the CSV file, save the ID number (01, 02, ...) and fill the column with it. For session, I tried to fill every 18 rows with 1s, 2s, 3s and 4s.
I tried to incorporate this code into mine, but it didn't work:
test4 = ['test4', 4, 7, 10]
with open('data.csv', 'r') as ifile, open('adjusted.csv', 'w') as ofile:
    for line, new in zip(ifile, test4):
        new_line = line.rstrip('\n') + ',' + str(new) + '\n'
        ofile.write(new_line)
import os

base_directory = 'C:\\Users\\yosal\\Desktop\\results'

for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:
        # If this is not a CSV file
        if not file_name.endswith('results.csv'):
            # Skip it
            continue
        file_path = os.path.join(dir_path, file_name)
        with open(file_path, 'r') as ifile:
            line_list = ifile.readlines()
        with open(file_path, 'w') as ofile:
            # only write these rows to the new file
            ofile.writelines(line_list[0])
            ofile.writelines(line_list[2:20])
            ofile.writelines(line_list[21:39])
            ofile.writelines(line_list[40:58])
            ofile.writelines(line_list[59:77])
Try reading the CSV into a list. Then loop through each element of the list (each element being a row in the CSV) and append the delimiter plus the desired string. Then write a new CSV, either named differently or replacing the old one, using your list as the input.
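A minimal sketch of that approach (the file names and the appended values here are hypothetical):

import csv

rows = []
with open('data.csv', 'r', newline='') as ifile:
    for row in csv.reader(ifile):
        rows.append(row)

with open('adjusted.csv', 'w', newline='') as ofile:
    writer = csv.writer(ofile)
    for i, row in enumerate(rows):
        # append one new value to each row; here simply the row index
        writer.writerow(row + [str(i)])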
I tried adding a column to my CSV file using pandas, so you can try something like this. First install pandas by running "pip install pandas".
import pandas as pd

df = pd.read_csv('data.csv')  # read the csv file
# set the index to any column that already exists in your csv;
# in my case it is the "S/N" column
df.set_index('S/N', inplace=True)
# the list must contain one value per row of the dataframe
df["test"] = ["values", "you want", "add"]
df.to_csv('data.csv')
Took me some time, but I did it.
import os

base_directory = 'C:\\Users\\yosal\\Desktop\\results'

for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:
        # If this is not a CSV file
        if not file_name.endswith('results.csv'):
            # Skip it
            continue
        file_path = os.path.join(dir_path, file_name)
        with open(file_path, 'r') as ifile:
            line_list = ifile.readlines()
        # the two digits of the participant ID, taken from the file name
        participant = file_path[len(base_directory) + 2:len(base_directory) + 4]
        with open(file_path, 'w') as ofile:
            ofile.writelines(line_list[0].rstrip() + ",participant,session\n")
            for x in range(2, 20):
                ofile.writelines(line_list[x].rstrip() + "," + participant + ",1\n")
            for y in range(21, 39):
                ofile.writelines(line_list[y].rstrip() + "," + participant + ",2\n")
            for h in range(40, 58):
                ofile.writelines(line_list[h].rstrip() + "," + participant + ",3\n")
            for z in range(59, 77):
                ofile.writelines(line_list[z].rstrip() + "," + participant + ",4\n")
I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing it with the ijson lib as follows:
import ijson

with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get the error:
JSONError: Additional data
It seems I am getting this error because the file contains multiple objects.
What is the recommended way to analyze such a JSON file in Jupyter?
Thank you in advance.
The file format is not correct if this is the complete file. There must be a comma between the curly brackets, and the file should start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code to clean up your file:
lastline = None
with open("yourfile.json", "r") as f:
    lineList = f.readlines()
lastline = lineList[-1]

with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    for i, line in enumerate(f, 0):
        if i == 0:
            # opening bracket before the first object
            line = "[" + str(line) + ","
            g.write(line)
        elif line == lastline:
            # closing bracket after the last object
            g.write(line)
            g.write("]")
        else:
            line = str(line) + ","
            g.write(line)
To read a json file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
#get a pandas dataframe object from json file
df = pd.read_json("path/to/your/filename.json")
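Note that for a file with one JSON object per line, as in the question, pandas can also read it directly without any cleaning, via the lines parameter:

import pandas as pd

# reads newline-delimited JSON: one object per line
df = pd.read_json("path/to/your/filename.json", lines=True)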
If you are not familiar with pandas, here is a quick head start on how to work with a dataframe object:
df.head()        # gives you the first rows of the dataframe
df["review_id"]  # gives you the column review_id as a vector
df.iloc[1, :]    # gives you the complete row with index 1
df.iloc[1, 2]    # gives you the item in the row with index 1 and the column with index 2
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over each line and parse it into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json

with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))

# object_list will contain all of your file's data
You could use a list comprehension to make it a little more pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f]

# object_list will contain all of your file's data
Your file contains multiple JSON objects, one per line; that's why the parser is throwing the error.
import json

with open(filename, 'r') as f:
    lines = f.readlines()

first = json.loads(lines[0])
second = json.loads(lines[1])
That should catch both lines and load them in properly
I have a file.dat which looks like:
id | user_id | venue_id | latitude | longitude | created_at
---------+---------+----------+-----------+-----------+-----------------
984301 |2041916 |5222 | | |2012-04-21 17:39:01
984222 |15824 |5222 |38.8951118 |-77.0363658|2012-04-21 17:43:47
984315 |1764391 |5222 | | |2012-04-21 17:37:18
984234 |44652 |5222 |33.800745 |-84.41052 | 2012-04-21 17:43:43
I need to get a csv file with the rows that have empty latitude and longitude removed, like:
id,user_id,venue_id,latitude,longitude,created_at
984222,15824,5222,38.8951118,-77.0363658,2012-04-21T17:43:47
984234,44652,5222,33.800745,-84.41052,2012-04-21T17:43:43
984291,105054,5222,45.5234515,-122.6762071,2012-04-21T17:39:22
I tried to do that using the following code:
import csv

with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
But I still get a csv file with "|" symbols and the empty latitude/longitude rows.
Where is the mistake?
In general I need to use the resulting csv file in a DataFrame, so maybe there is some way to reduce the number of steps.
str.strip() removes leading and trailing characters from a string.
You want to split the lines on "|", then strip each element of the resulting list:
import csv
with open('file.dat') as dat_file, open('file.csv', 'w') as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in dat_file:
        row = [field.strip() for field in line.split('|')]
        if len(row) == 6 and row[3] and row[4]:
            csv_writer.writerow(row)
Use this:
import pandas as pd

data = pd.read_csv('file.dat', sep='|', header=0, skipinitialspace=True)
data.dropna(inplace=True)
I used standard pandas features without pre-processing the data. I got the idea from one of the previous answers and improved it. If the data headers contain spaces (which is often the situation in CSV files), we should set the column names ourselves and skip line 1, which holds the headers.
After that, we can remove NaN values based only on specific columns:
data = pd.read_csv("checkins.dat", sep='|', header=None, skiprows=1,
low_memory = False, skipinitialspace=True,
names=['id','user_id','venue_id','latitude','longitude','created_at'])
data.dropna(subset=['latitude', 'longitude'], inplace = True)
Using split() without parameters splits on whitespace.
For example, "test1 test2".split() results in ["test1", "test2"].
instead, try this:
newLine = line.split("|")
It may be better to use the map() function instead of a list comprehension here, and it may also be a little faster. Also, writing a csv file is easy with the csv module.
import csv

with open('file.dat', 'r') as fin:
    with open('file.csv', 'w') as fout:
        csv_writer = csv.writer(fout)
        for line in fin:
            # wrap map() in list() so len() and indexing work on Python 3
            newline = list(map(str.strip, line.split('|')))
            if len(newline) == 6 and newline[3] and newline[4]:
                csv_writer.writerow(newline)
with open("filename.dat") as f:
with open("filename.csv", "w") as f1:
for line in f:
f1.write(line)
This can be used to turn a .dat file into a .csv file, although note that it only copies the contents unchanged; it does not convert the delimiters.
Combining the previous answers, I wrote my code for Python 2.7:
import csv

lat_index = 3
lon_index = 4
fields_num = 6
csv_counter = 0

with open("checkins.dat") as dat_file:
    with open("checkins.csv", "w") as csv_file:
        csv_writer = csv.writer(csv_file)
        for dat_line in dat_file:
            new_line = map(str.strip, dat_line.split('|'))
            if len(new_line) == fields_num and new_line[lat_index] and new_line[lon_index]:
                csv_writer.writerow(new_line)
                csv_counter += 1

print("Done. Total rows written: {:,}".format(csv_counter))
This has worked for me:
import pandas as pd

# a multi-character separator such as '::' requires the python parser engine
data = pd.read_csv('file.dat', sep='::', engine='python',
                   names=list_for_names_of_columns)
I am trying to read a file with the data below:
Et1, Arista2, Ethernet1
Et2, Arista2, Ethernet2
Ma1, Arista2, Management1
I need to read the file and replace Et with Ethernet and Ma with Management. The digit at the end should stay the same. The actual output should be as follows:
Ethernet1, Arista2, Ethernet1
Ethernet2, Arista2, Ethernet2
Management1, Arista2, Management1
I tried code with regular expressions, and I can get to the point where I parse all of Et1, Et2 and Ma1, but I am unable to replace them.
import re
with open('test.txt', 'r') as fin:
    for line in fin:
        data = re.findall(r'\A[A-Z][a-z]\Z\d[0-9]*', line)
        print(data)
The output looks like this:
['Et1']
['Et2']
['Ma1']
import re

# compile once, to avoid compiling in each iteration
re_et = re.compile(r'^Et(\d+),')
re_ma = re.compile(r'^Ma(\d+),')

with open('test.txt') as fin:
    for line in fin:
        data = re_et.sub(r'Ethernet\g<1>,', line.strip())
        data = re_ma.sub(r'Management\g<1>,', data)
        print(data)
This example follows Joseph Farah's suggestion:
import csv

file_name = 'data.csv'
output_file_name = "corrected_data.csv"

data = []
with open(file_name, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        data.append(row)

corrected_data = []
for row in data:
    tmp_row = []
    for col in row:
        if 'Et' in col and not "Ethernet" in col:
            col = col.replace("Et", "Ethernet")
        elif 'Ma' in col and not "Management" in col:
            col = col.replace("Ma", "Management")
        tmp_row.append(col)
    corrected_data.append(tmp_row)

with open(output_file_name, "w", newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for row in corrected_data:
        writer.writerow(row)

print(data)
Here are the steps you should take:
Read each line in the file.
Separate each line into smaller list items using the commas as delimiters.
Use str.replace() to replace the characters with the words you want; keep in mind that any occurrence of "Et" (including the one at the beginning of the word "Ethernet") will be replaced, so remember to account for that. The same goes for "Ma" and "Management". A word-boundary-safe variant is sketched after this list.
Roll it back into one big list and put it back in the file with file.write(). You may have to overwrite the original file.
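If you want to avoid that partial-match pitfall, here is a small sketch using re.sub with word boundaries (the file names are placeholders):

import re

# only replace "Et"/"Ma" when they are followed by digits and form a
# whole word, so a field like "Ethernet1" is left untouched
def fix_field(field):
    field = re.sub(r'\bEt(\d+)\b', r'Ethernet\1', field)
    field = re.sub(r'\bMa(\d+)\b', r'Management\1', field)
    return field

with open('test.txt') as fin, open('fixed.txt', 'w') as fout:
    for line in fin:
        fields = [fix_field(f.strip()) for f in line.split(',')]
        fout.write(', '.join(fields) + '\n')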