Here is my data sample in a txt file:
1322484979.322313000 QQlb-j7itDQ
1322484981.070116000 Ne8Bb1d5oyc
1322484981.128791000 Ne8Bb1d5oyc
1322484981.431075000 Ne8Bb1d5oyc
1322484985.210652000 QWUiCAE4E7U
The first column is the timestamp, the second column is the IP address, and the third one is a hash value.
I want to check whether two or more successive rows have the same IP address and hash value. If they do, I need to subtract the first timestamp of the duplicated rows from the last one; in this case that is 1322484981.431075000 - 1322484981.070116000.
If the result is less than 5, I will keep only the first (earliest) row in the file.
If the result is more than 5, I will keep the first and the last duplicated rows and delete the rows between them.
Since I'm pretty new to Python, this problem is a bit complicated for me. I don't know what kind of function is needed. Can anyone help a little bit?
In a basic form, it could look like this:
last_time = 0.0
last_ip = None
last_hash = None
with open("data.txt", "r") as data:
    for line in data:
        timestamp, ip, hash_value = line.split()
        # Same IP and hash as the previous row, and less than 5 seconds later
        if ip == last_ip and hash_value == last_hash and float(timestamp) - float(last_time) < 5.0:
            print("Remove", line)
        else:
            print("Keep", line)
        last_time, last_ip, last_hash = timestamp, ip, hash_value
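The loop above compares each row only with the row directly before it. A sketch that applies the full rule from the question (keep the first row of each run, and also the last one when the run spans more than 5 seconds) could look like the following. The function name `collapse_runs` and the `threshold` parameter are made up for illustration, and the code assumes every line holds a timestamp followed by at least one more field. A difference of exactly 5 is not covered by the question; here it is treated like "less than 5".

```python
from itertools import groupby


def collapse_runs(lines, threshold=5.0):
    """Collapse runs of consecutive rows whose non-timestamp fields are identical."""
    # Split each non-blank line into [timestamp, rest of the line]
    rows = [line.split(None, 1) for line in lines if line.strip()]
    kept = []
    # groupby only groups *adjacent* rows with an equal key, which is what we want
    for _, run in groupby(rows, key=lambda row: row[1].strip()):
        run = list(run)
        first, last = run[0], run[-1]
        kept.append(first)
        # Keep the last row too when the run spans more than `threshold` seconds
        if len(run) > 1 and float(last[0]) - float(first[0]) > threshold:
            kept.append(last)
    return [f"{ts} {rest.strip()}" for ts, rest in kept]


sample = [
    "1322484979.322313000 QQlb-j7itDQ",
    "1322484981.070116000 Ne8Bb1d5oyc",
    "1322484981.128791000 Ne8Bb1d5oyc",
    "1322484981.431075000 Ne8Bb1d5oyc",
    "1322484985.210652000 QWUiCAE4E7U",
]
for row in collapse_runs(sample):
    print(row)
```

For a real file, the lines could come from `open("data.txt")` and the returned list could be written back out with one row per line.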
I have a for loop of 8 million iterations. Each iteration takes 2 sample values from a column of a 1-million-record dataframe (df_original_nodes) and queries another dataframe (df_original_rel) for that pair. If the pair does not exist there, it is added to df_original_rel as a new row. Finally, df_original_rel is written to a CSV.
This loop takes roughly 24+ hours to complete. How can it be made faster? I would be happy if it took 8 hours rather than 12+ hours.
Here is the piece of code:
for j in range(1, n_8000000):
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    df_ran_rel = df_original_nodes["UID"].sample(2, ignore_index=True)
    FROM = df_ran_rel[0]
    TO = df_ran_rel[1]
    if df_original_rel.query("@FROM == FROM and @TO == TO").empty:
        k += 1
        new_row = {"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]}
        df_original_rel = df_original_rel.append(new_row, ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
My assumption is that querying df_original_rel is the expensive part, especially since that dataframe keeps growing as new rows are added.
In my view, lists are faster to traverse and maybe to query, but that would mean another layer of conversion from dataframe to lists and back, which could add further complexity.
Some things that should probably help – most of them around "do less Pandas".
Since I don't have your original data or anything like it, I can't test this.
import random
import pandas as pd

# Grab a regular list of UIDs that we can use with `random.sample`
original_nodes_uid_list = df_original_nodes["UID"].tolist()
# Make a regular set of FROM-TO tuples
rel_from_to_pairs = set(df_original_rel[["FROM", "TO"]].apply(tuple, axis=1).tolist())
# Store new rows here instead of putting them in the dataframe; we'll also update rel_from_to_pairs as we go.
new_rows = []
for j in range(1, 8_000_000):
    # These two lines could probably also be a `random.choice`
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    # Grab a from-to pair from the UID list
    FROM, TO = random.sample(original_nodes_uid_list, 2)
    # If this pair isn't in the set of known pairs...
    if (FROM, TO) not in rel_from_to_pairs:
        # ... prepare a new row to be added later
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]})
        # ... and since this from-to pair _would_ exist had df_original_rel
        # been updated, update the pairs set.
        rel_from_to_pairs.add((FROM, TO))
# Finally, make a dataframe of the new rows, concatenate it with the old, and output.
df_new_rel = pd.DataFrame(new_rows)
df_original_rel = pd.concat([df_original_rel, df_new_rel], ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
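The part that replaces the repeated dataframe query is ordinary set membership, which can be tried in isolation without pandas. The following is only a minimal sketch of that idea: `sample_new_pairs` and its parameters are hypothetical names, the UIDs are made-up integers, and the fixed seed is there only to make the run repeatable.

```python
import random


def sample_new_pairs(uids, known_pairs, n_iter, seed=None):
    """Draw random (FROM, TO) pairs and return those not seen before."""
    rng = random.Random(seed)
    known = set(known_pairs)  # copy, so the caller's set is left untouched
    new_pairs = []
    for _ in range(n_iter):
        # random.sample never returns the same element twice, so FROM != TO
        pair = tuple(rng.sample(uids, 2))
        # Set lookup stays fast no matter how many pairs are already known
        if pair not in known:
            known.add(pair)
            new_pairs.append(pair)
    return new_pairs


uids = list(range(100))          # stand-in for the UID column
existing = {(1, 2), (3, 4)}      # stand-in for the existing FROM-TO pairs
pairs = sample_new_pairs(uids, existing, n_iter=1000, seed=0)
print(len(pairs), "new pairs")
```

Unlike a dataframe query, the membership check does not slow down as the set of known pairs grows, which is why the loop no longer gets slower with every added row.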
Here is the code I am working with.
dfs=dfs[['Reserved']] #the column that I need to insert
dfs=dfs.applymap(str) #json did not accept the nan so needed to convert
sh=gc.open_by_key('KEY') #would open the google sheet
sh_dfs=sh.get_worksheet(0) #getting the worksheet
sh_dfs.insert_rows(dfs.values.tolist()) #inserts the dfs into the new worksheet
Running this code inserts the rows at the first column of the worksheet, but what I am trying to accomplish is adding the column at the very end, in column P.
In your situation, how about the following modification? It first retrieves the maximum column number, converts the number of the column after it to a column letter, and then puts the values into that column, i.e. the one right after the last used column.
# Ref:
def colnum_string(n):
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string
values = sh_dfs.get_all_values()
col = colnum_string(max([len(r) for r in values]) + 1)
sh_dfs.update(col + '1', dfs.values.tolist(), value_input_option='USER_ENTERED')
If an error such as "exceeds grid limits" occurs, please insert a blank column first.
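To see what the column-letter conversion does without touching a real sheet, it can be run on plain lists. This is a small sketch: `colnum_string` is repeated so the block runs on its own, while `next_free_column` and the sample rows are made up for illustration and assume at least one row is present.

```python
def colnum_string(n):
    """Convert a 1-based column number to a spreadsheet column letter."""
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string


def next_free_column(values):
    """Letter of the column right after the widest row in `values`."""
    return colnum_string(max(len(row) for row in values) + 1)


# The widest row has 15 cells (columns A to O), so the next free column is the 16th
rows = [["a"] * 15, ["b"] * 3]
print(next_free_column(rows))
```

Because the conversion subtracts 1 before each division, it also handles the jump from Z to AA correctly.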
I'm trying to automate Google Sheets through Python so that every time my DF query runs, it inserts the data along with the current date.
To put it simply, when a cell in the date column is empty, it has to be filled with the date on which the program runs.
I was trying to do something like this:
ws = client.open("automation").worksheet('sheet2')  # client: the authorized gspread client (variable name assumed)
I'm not able to fill just the empty cells; it seems that either the whole column or all the rows get replaced.
Solved it through another account:
from datetime import datetime

ws_date_pipe = client.open("automation").worksheet('sheet2')  # client: the authorized gspread client (variable name assumed)
# Range of date column (targeted one, which is the min range)
next_row_min = str(len(list(filter(None, ws_date_pipe.col_values(8))))+1)
# Range of first column (which is the max range)
next_row_max = str(len(list(filter(None, ws_date_pipe.col_values(1)))))
cell_list = ws_date_pipe.range(f"H{next_row_min}:H{next_row_max}")
cell_values = []
# Today's date as a string; the use of datetime.now() is an assumption here
today = datetime.now().strftime("%Y-%m-%d")
# Difference between max-min ranges, space that needs to be filled
for x in range(0, ((int(next_row_max)+1)-int(next_row_min)), 1):
    cell_values.append(today)
for i, val in enumerate(cell_values):
    cell_list[i].value = val
# If date range len "next_row_min" is lower than the first column, then fill.
if int(next_row_min) < int(next_row_max)+1:
    # Send the changed cells back to the sheet in one call
    ws_date_pipe.update_cells(cell_list)
    print(f'Saved to csv file. {today}')
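The row arithmetic in this solution can be checked on plain lists before running it against a sheet. The sketch below mirrors the two `col_values` calls with ordinary Python lists; `date_range_to_fill` is a hypothetical helper name, the sample values are made up, and it assumes neither column has gaps above its last filled cell.

```python
def date_range_to_fill(first_col, date_col, col_letter="H"):
    """Return the A1 range of date cells that are still empty, or None."""
    next_row_min = len(list(filter(None, date_col))) + 1   # first row without a date
    next_row_max = len(list(filter(None, first_col)))      # last row that holds data
    if next_row_min > next_row_max:
        return None  # every data row already has a date
    return f"{col_letter}{next_row_min}:{col_letter}{next_row_max}"


# Five filled rows in the first column, but only three filled date cells
first_col = ["id", "a", "b", "c", "d"]
date_col = ["date", "2024-01-01", "2024-01-01"]
print(date_range_to_fill(first_col, date_col))
```

Returning None when nothing is missing gives the caller an easy way to skip the update entirely.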
I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows, I cannot figure out how to compare, for example:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
The only thing I am able to do is read the entire row. But I want to get the exact row and column of my 2 variables, category and label, and compare them.
How do I work with specific rows/columns for an entire Excel sheet?
Convert both to pandas dataframes and compare them in a similar way to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes would be the first step to working with it, imo.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in his example, so that's good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations, it can be used for your specific data columns, objects and purposes.
import re
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
filename = 'seq_match_compare2.csv'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
with open(filename, 'a') as f: #in his eg it was 'w'
    f.write(headers)
    for ID, seq in Unknown_dict.items():
        for species, seq1 in Ref_dict.items():
            # Look for the query sequence inside the reference sequence
            m = re.search(seq, seq1)
            if m:
                match = m.group()
                pos = m.start() + 1
                f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
I also did it myself, assuming your columns contain integers and following your specification as best I can at the moment. It's my first attempt at something like this without web scraping, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it gives you the skeleton of what you want: it imports the CSV with the pandas module, converts it to a dataframe, works only on specific columns of that dataframe, makes a new column for the results, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works! Personally (and professionally) speaking it's a milestone for me, and I hope to work on it at a later date to improve its readability, scope and functionality.
# This is a work in progress (although it does work and does the job). There are redundant lines of code in it, even among the lines that aren't commented out, because I'm a self-taught newbie working on my weekends. You can see how you could convert your columns and rows into lists with pandas dataframes, start doing calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"] = A["Category"].fillna("empty data - missing value")
#A["Blank1"] = A["Blank1"].fillna("empty data - missing value")
# ...etc
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
# Lookup tables in both directions; not needed for the subtraction below, but handy for comparing whole columns
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
import re #this is for string matching & comparison. redundant in my example here but you'll need it to compare tables if strings.
filename = 'CATS&LABS64.csv' #output file for the results
print("Given Dataframe :\n", A)
# New column: row-by-row difference between the two columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)
# Save the original data plus the new column to a new CSV
A.to_csv(filename, index=False)
Yes, I know, it's by no means perfect at all, but I wanted to give you a heads-up about pandas and dataframes for doing what you want going forward.
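For the percentage the question asks about, the csv module alone is also enough: each row is a list, so two columns can be compared by index. This is only a sketch under assumptions. The sample data and the name `accuracy_percent` are made up, and the two columns sit at indexes 0 and 1 here; for the real file the indexes would be those of the category and label columns (counting from 0).

```python
import csv
import io

# Made-up stand-in for the real file; only the two compared columns are shown
sample_csv = io.StringIO(
    "category,label\n"
    "cat,cat\n"
    "dog,cat\n"
    "bird,bird\n"
    "fish,fish\n"
)


def accuracy_percent(rows, col_a, col_b):
    """Percentage of rows in which column `col_a` equals column `col_b`."""
    total = matches = 0
    for row in rows:
        total += 1
        if row[col_a].strip() == row[col_b].strip():
            matches += 1
    return 100.0 * matches / total if total else 0.0


reader = csv.reader(sample_csv)
next(reader)  # skip the header row
result = accuracy_percent(reader, 0, 1)
print(f"{result:.1f}% of the rows match")
```

With a real file, `sample_csv` would be replaced by `open("yourfile.csv", newline="")`, where the file name is a placeholder.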
I have some data in a txt file and I would like to load it into a list of dicts. I would normally use csv.DictReader(open('file')), however this data does not have the key values in the first row. Instead it has a number of commented-out rows before the data actually begins. Also, the commented rows will not always be at the beginning of the file; they could also be at the end of the file.
However, all lines should always have the same fields, and I guess I could hard-code these field names (or key values) as they shouldn't change.
Sample Data
# twitter data
# retrieved at: 07.08.2014
# total number of records: 5
# exported by: userXYZ
# fields: date, time, username, source
10.12.2013; 02:00; tweeterA; web
10.12.2013; 02:01; tweeterB; iPhone
10.13.2013; 02:04; tweeterC; android
10.13.2013; 02:08; tweeterC; web
10.13.2013; 02:10; tweeterD; iPhone
Below is what I've been able to figure out so far, but I need some help getting it worked out.
My Code
header = ['date', 'time', 'username', 'source']
data = []
for line in open('data.txt'):
    if not line.startswith('#'):
Desired Format
[{'date':'10.12.2013', 'time':'02:00', 'username':'tweeterA', 'source':'web'},
{'date':'10.12.2013', 'time':'02:01', 'username':'tweeterB', 'source':'iPhone'},
{'date':'10.13.2013', 'time':'02:04', 'username':'tweeterC', 'source':'android'},
{'date':'10.13.2013', 'time':'02:08', 'username':'tweeterC', 'source':'web'},
{'date':'10.13.2013', 'time':'02:10', 'username':'tweeterD', 'source':'iPhone'}]
If you want a list of dicts where each dict corresponds to a row, try this:
list_of_dicts = [{key: value for (key, value) in zip(header, line.strip().split('; '))} for line in open('abcd.txt') if not line.strip().startswith('#')]
data = []
for line in open('data.txt'):
    if not line.startswith('#'):
        # strip() removes the trailing newline so it doesn't end up in the last field
        data.append(line.strip().split("; "))
At least, assuming I understand you correctly.
Or, more succinctly:
data = [line.strip().split("; ") for line in open("data.txt") if not line.strip().startswith("#")]
list_of_dicts = map(lambda row:dict(zip(header,row)),data)
Depending on your version of Python, you may get an iterator back from map, in which case just do:
list_of_dicts = list(map(lambda row:dict(zip(header,row)),data))
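Since the question mentions the csv module, another option is to filter out the comment lines first and hand the rest to `csv.DictReader` with hard-coded field names. This is a minimal sketch rather than a tested solution for the real file: `load_rows` is a made-up name, and the in-memory sample stands in for `data.txt`.

```python
import csv
import io

header = ['date', 'time', 'username', 'source']


def load_rows(lines, fieldnames):
    """Return a list of dicts, skipping blank lines and lines starting with '#'."""
    data_lines = (ln for ln in lines
                  if ln.strip() and not ln.lstrip().startswith('#'))
    # skipinitialspace drops the blank that follows each ';' delimiter
    reader = csv.DictReader(data_lines, fieldnames=fieldnames,
                            delimiter=';', skipinitialspace=True)
    return list(reader)


# Comment lines may appear before, between or after the data rows
sample = io.StringIO(
    "# twitter data\n"
    "# fields: date, time, username, source\n"
    "10.12.2013; 02:00; tweeterA; web\n"
    "10.12.2013; 02:01; tweeterB; iPhone\n"
    "# exported by: userXYZ\n"
)
rows = load_rows(sample, header)
print(rows)
```

Because the filter looks at every line, it does not matter whether the commented rows are at the beginning or the end of the file.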