How to count Excel rows with the same values in Python

I have an Excel file containing 3 columns (source, destination and time) and 140400 rows, and I want to count rows that have the same source, destination and time values, i.e. rows recording packets from the same source to the same destination at the same time (e.g. row 1: 0,1,3 and row 102: 0,1,3 gives 2 identical rows). All the values are integers. I tried to use df.iloc but it just returns zero, and I tried to use a dictionary but couldn't make it work. I would appreciate it if someone could help me find a solution.
This is one way I tried, but it didn't work:
for t in timestamps:
    for x in range(120):
        for y in range(120):
            while i < 140400 and df.iloc[i, 0] <= t:
                # if df.iloc[i, 0] <= t:
                if df.iloc[i, 0] == t and df.iloc[i, 1] == y and df.iloc[i, 2] == x:
                    TotalArp[x][y] += 1
                i = i + 1
This is the file format:

If I understood correctly, you just want to count rows that all have the same value, right? This should work, though it's probably not the most efficient way:
counter = 0
for index, row in df.iterrows():
    if row[0] == row[1] == row[2]:
        counter += 1
Edit:
OK, since I don't have enough reputation to comment, I'll just edit it here:
duplicate_count_df = df.groupby(df.columns.tolist(), as_index=False).size().drop_duplicates(subset=list(df.columns))
This should point you in the right direction.
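Applied to the question's own three columns, a minimal sketch of the same groupby idea (the file name and column names here are assumptions based on the description):
import pandas as pd

# Assumed file and column names, taken from the question's description
df = pd.read_excel("packets.xlsx")  # columns: source, destination, time

# One row per (source, destination, time) combination, with its frequency
counts = df.groupby(["source", "destination", "time"]).size().reset_index(name="count")

# Combinations that occur more than once, e.g. (0, 1, 3) appearing twice
print(counts[counts["count"] > 1])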

Suppose you have these columns in your DataFrame:
["Col1", "Col2", "Col3", "Col4"]
Now you want to count the rows whose values are equal across every column of your DataFrame:
len(df[(df['Col1'] == df['Col2']) & (df['Col2'] == df['Col3']) & (df['Col3'] == df['Col4'])])
As easy as that.
Update:
If you would like to get the count for each element specifically:
# Create a dictionary with a zero count for each distinct element
Properties = {k: 0 for k in set([item for elem in df.columns for item in df[elem]])}
# Then count the rows whose values are all equal
for item in range(len(df)):
    if df.iloc[item, 0] == df.iloc[item, 1] == df.iloc[item, 2] == df.iloc[item, 3]:
        Properties[df.iloc[item, 0]] += 1
print(Properties)
Let's see an example:
# Here I have a DataFrame with 2 columns and 3 rows
df = pd.DataFrame({'1': [1, 2, 3], '2': [1, 1, '-']})
df
Output:
   1  2
0  1  1
1  2  1
2  3  -
And then:
Properties = {k: 0 for k in set([item for elem in df.columns for item in df[elem]])}
for item in range(len(df)):
    if df.iloc[item, 0] == df.iloc[item, 1]:
        Properties[df.iloc[item, 0]] += 1
Properties
Output:
{1: 1, 2: 0, 3: 0, '-': 0}

Related

Recursively merging rows of a pandas dataframe based on a condition

Community,
I have a sorted pandas dataframe that looks like the following:
I want to merge rows that have overlapping values in the start and end columns: if the end value of a row is bigger than the start value of the next one (or any later one), they should be merged into one row. Examples are rows 3, 4 and 5. The output I would expect is:
To do so, I am trying to implement a recursive function that loops over the dataframe until the condition holds, and then returns a number used to locate the end row.
However, the function I am trying to implement returns an empty dataframe. Could you help me figure out where to look, or what alternative I can build if recursion is not a solution?
def row_merger(pd_df):
    counter = 0
    new_df = pd.DataFrame(columns=pd_df.columns)
    for i in range(len(pd_df) - 1):
        def recursion_inside(pd_df, counter=0):
            counter = 0
            if pd_df.iloc[i + 1 + counter]["q.start"] <= pd_df.iloc[i]["q.end"]:
                counter = counter + 1
                recursion_inside(pd_df, counter)
            else:
                return counter
        new_row = {"name": pd_df["name"][i],
                   "q.start": pd_df.iloc[i]["q.start"],
                   "q.end": pd_df.iloc[i + counter]["q.start"]}
        new_df.append(new_row, ignore_index=True)
    return new_df
I don't see the benefit of using recursion here, so I would just iterate over the rows instead, building up the rows for the output dataframe one by one, e.g. like this:
def row_merger(df_in):
    if len(df_in) <= 1:
        return df_in
    rows_out = []
    current_row = df_in.iloc[0].values
    for next_row in df_in.iloc[1:].values:
        if next_row[1] > current_row[2]:
            rows_out.append(current_row)
            current_row = next_row
        else:
            current_row[2] = max(current_row[2], next_row[2])
    rows_out.append(current_row)
    return pd.DataFrame(rows_out, columns=df_in.columns)
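A quick sanity check of that function on a toy frame shaped like the question's data (the name/q.start/q.end column names come from the question's code; the values are made up):
import pandas as pd

df = pd.DataFrame({"name":    ["a", "b", "c", "d"],
                   "q.start": [1, 5, 12, 14],
                   "q.end":   [6, 10, 15, 20]})
print(row_merger(df))
# Rows 0-1 overlap (5 <= 6) and rows 2-3 overlap (14 <= 15), so they collapse:
#   name  q.start  q.end
# 0    a        1     10
# 1    c       12     20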

Iterating over elements in a column and adding a new value - Pandas

When iterating through the column elements (Y, Y, nan, Y in my case), for some reason I can't add a new element when a condition is met (two consecutive Y values encountered). I want to replace the last Y with "encountered", or simply add or overwrite it, since I keep track of the index number.
I have a dataframe
  col0 col1
1    A    Y
2    B    Y
3    B  nan
4    C    Y
code:
count = 0
for i, e in enumerate(df['col1']):
    if 'Y' in e:
        count += 1
    else:
        count = 0
    if count == 2:
        df['col1'][i] = 'encountered'  # IndexError: list index out of range
error message:
IndexError: list index out of range
Even if I try to specify the index of the column cell I would like to add the message to, I get the same error:
code:
df['col1'][1] = 'or this'
Main idea, direct example:
df['col1'][2] = 'under index 2 in column1 add this msg'
Is it because PyPDF2/utils is interfering?
warning:
File "C:\Users\path\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
error:
IndexError: list index out of range
last_index = df[df['col1'] == 'Y'].index[-1]
df.loc[last_index, 'col1'] = 'encountered'
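For reference, a small sketch of that two-liner on the example frame from the question; note it tags the last 'Y' in the column, wherever it occurs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col0': ['A', 'B', 'B', 'C'],
                   'col1': ['Y', 'Y', np.nan, 'Y']},
                  index=[1, 2, 3, 4])
last_index = df[df['col1'] == 'Y'].index[-1]
df.loc[last_index, 'col1'] = 'encountered'
print(df)
#   col0         col1
# 1    A            Y
# 2    B            Y
# 3    B          NaN
# 4    C  encountered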
Here's how I would go about solving this:
prev_val = None
# Iterate through rows to utilize the index
for idx, row in df[['col1']].iterrows():
    # unpack your row, a bit more overhead but highly readable
    val = row['col1']
    # Use the previous value instead of a counter – easier to read and more accurate
    if val == 'Y' and val == prev_val:
        df.loc[idx, 'col1'] = 'encountered'
    # now set the prev value to the current one:
    prev_val = val
What could possibly be the issue with your code is the way you are iterating over your dataframe, and also the indexing. Another issue is that you are trying to overwrite the value you are iterating over; that is bound to give you issues later.
Does this work for you?
>>> count = 0
>>> df['encounter'] = np.nan
>>> for i in df.itertuples():
...     if getattr(i, 'col1') == 'Y':
...         count += 1
...     else:
...         count = 0
...     if count == 2:
...         df.loc[i[0], 'encounter'] = 'encountered'
>>> print(df)
  col0 col1    encounter
0    A    Y          NaN
1    B    Y  encountered
2    B  NaN          NaN
3    C    Y          NaN

Pandas loop exact column iteration

I have this kind of code, which checks for a value in column A. If the condition is met, the code checks the value in another column of the same row and copies that value over the value in column A:
counter = 0
list_of_winners = []
for each in data.iterrows():
    winner = data.iloc[counter, 5]
    if winner == 'Red':
        vitazr = data.iloc[counter, 0]
        list_of_winners.append(vitazr)
    elif winner == 'Blue':
        vitazb = data.iloc[counter, 1]
        list_of_winners.append(vitazb)
    elif winner == 'Draw':
        draw = str('Draw')
        list_of_winners.append(draw)
    else:
        pass
    counter += 1
The solution works for me: I am able to build the list, put it into the original DataFrame, and replace the values I looped through.
What I want to ask: isn't there some more elegant and shorter way to address this problem?
You can use np.select:
list_of_winners = np.select([data.iloc[:, 5] == 'Red',
                             data.iloc[:, 5] == 'Blue',
                             data.iloc[:, 5] == 'Draw'],
                            [data.iloc[:, 0], data.iloc[:, 1], 'Draw'],
                            default=None)
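As a sanity check, a sketch on a small hypothetical frame with the winner colour in column 5 and the two names in columns 0 and 1, matching the positions used in the question:
import numpy as np
import pandas as pd

# Hypothetical fight data; only the column positions matter here
data = pd.DataFrame({"red": ["A", "C", "E"], "blue": ["B", "D", "F"],
                     "x2": 0, "x3": 0, "x4": 0,
                     "winner": ["Red", "Blue", "Draw"]})
list_of_winners = np.select([data.iloc[:, 5] == "Red",
                             data.iloc[:, 5] == "Blue",
                             data.iloc[:, 5] == "Draw"],
                            [data.iloc[:, 0], data.iloc[:, 1], "Draw"],
                            default=None)
print(list_of_winners)  # ['A' 'D' 'Draw']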

Concatenate columns of dataframe in array

I'm trying to make a data visualization app which loads a CSV file and then represents selected columns (not all columns are represented). I already have the function to select only a few variables, but now I need to join those columns into a single data frame to work with. I tried this:
for i in range(0, len(data1.columns)):
    i = 0
    df = np.array(data1[data1.columns[i]])
    i += 1
    print(df)
But I only got the same column repeated numb_selection = numb_columns_dataframe times (i.e. if I select 5 columns, the same column is returned 5 times).
How do I ensure that each iteration picks a different column rather than always the same one?
The problem of the repeated column is that i is overwritten.
# For example `data1.columns` is ["a", "b", "c", "d", "e"]
# Your code:
for i in range(0, len(data1.columns)):
    i = 0  # Here, in every iteration, i is reset to 0
    print(i, data1.columns[i], sep=": ")
    i += 1
# Output:
# 0: a
# 0: a
# 0: a
# 0: a
# 0: a
i = 0 and i += 1 are useless because you already get i from range, ranging from 0 to len(data1.columns) - 1.
Fixed version:
for i in range(0, len(data1.columns)):
    print(i, data1.columns[i], sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Versions using a manual increment of i plus iteration over the elements:
# First step, iterate over the columns
for col in data1.columns:
    print(col)
# Output:
# a
# b
# c
# d
# e

# Step two, manual increment to obtain the list (array) index
i = 0
for col in data1.columns:
    print(i, col, sep=": ")
    i += 1
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Helpful to know: enumerate.
The function enumerate(iterable) is nice for obtaining both the index and the value itself.
print(list(enumerate(["Hello", "world"])))
# Output:
# [(0, 'Hello'), (1, 'world')]
Usage:
for i, col in enumerate(data1.columns):
    print(i, col, sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
In the end I solved it by declaring an empty list before the loop, iterating over the selected variables, and saving the indexes in that list. That way I get a list of the indexes I should use for my visualization.
def get_index(name):
    '''
    Return the index of a column name
    '''
    for column in df.columns:
        if column == name:
            index = df.columns.get_loc(column)
            return index

result = []
for i in range(len(selected)):
    X = get_index(selected[i])
    result.append(X)

df = df[df.columns[result]]
x = df.values
Here 'selected' is the list of selected variables (filter first by column name, then get its index number). I don't know if it's the most elegant way to do this, but it works well.
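For what it's worth, pandas also accepts a list of column names directly, so the index round-trip can be skipped; a minimal sketch, assuming selected holds valid column names:
# Equivalent shortcut: select the chosen columns by name in one step
df = df[selected]
x = df.values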

How to find duplicates in a python list that are adjacent to each other and list them with respect to their indices?

I have a program that reads a .csv file and checks for mismatches in column length (by comparing each row to the header fields), then returns everything it found as a list (and writes it to a file). What I want to do with this list is to present the results as follows:
row numbers where the same mismatch is found : the amount of columns in that row
e.g.
rows: n-m : y
where n and m are the numbers of the rows that share the same mismatching column count relative to the header.
I have looked into these topics, and while the information is useful, they do not answer the question:
Find and list duplicates in a list?
Identify duplicate values in a list in Python
This is where I am right now:
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # add this row's column count to a list
    colm = len(row)
    columns.append(colm)

b = len(columns)
for a in range(b):
    # check if the current member matches the header's column count
    if columns[a] != columns[0]:
        # if it doesn't, write the row number and its column count to a file
        file.write("row " + str(a + 1) + ": " + str(columns[a]) + " \n")
the file output looks like this:
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
row 7224: 0
row 7225: 1
row 7226: 1
when the desired end result is
rows 7220 - 7224 : 0
rows 7225 - 7226 : 1
So what I essentially need, the way I see it, is a dictionary where the key is the range of rows sharing the same mismatch and the value is the column count of that mismatch. What I think I need (in horribly written pseudocode that doesn't make much sense now that I'm reading it years after writing this question) is here:
def pseudoList():
    i = 1
    ListOfLists = []
    while i < len(originalList):
        duplicateList = []
        if originalList[i] == originalList[i - 1]:
            duplicateList.append(originalList[i])
        i += 1
        ListOfLists.append(duplicateList)

def PseudocreateDict(ListOfLists):
    pseudoDict = {}
    for x in ListOfLists:
        a = ListOfLists[x][0]   # this is the first node in the duplicate list created
        i = len(ListOfLists) - 1
        b = listOfLists[x][i]   # this is the last node of the duplicate list created
        pseudoDict.update({'key': '{} - {}'.format(a, b)})
This, however, seems a very convoluted way of doing what I want, so I was wondering if there is a) a more efficient way, or b) an easier way to do this?
You can use a list comprehension to return a list of elements in the columns list that differ from adjacent elements, which will be the end-points of your ranges. Then enumerate these ranges and print/write out those that differ from the first (header) element. An extra element is added to the list of ranges to specify the end index of the list, to avoid out of range indexing.
columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
ranges = [[i + 1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i + 1]]
ranges.append([len(columns), 0])  # special case for the last element
for i, v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        print("rows", v[0] + 1, "-", ranges[i + 1][0], ":", v[1])
output:
rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
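Another way to get the same grouping, for comparison, is itertools.groupby, which collapses consecutive equal counts for you; a sketch over the same columns list:
from itertools import groupby

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
row = 2  # first data row; row 1 holds the header's count
for value, group in groupby(columns[1:]):
    n = len(list(group))  # length of this run of equal counts
    if value != columns[0]:
        print("rows", row, "-", row + n - 1, ":", value)
    row += n
# rows 2 - 5 : 1
# rows 6 - 9 : 0
# rows 10 - 11 : 1
# rows 13 - 13 : 1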
You can also try the following code:
b = len(columns)
check = 0
for a in range(b):
    # checks if the current member matches the header length of columns
    if check != 0 and columns[a] == check:
        continue
    elif check != 0 and columns[a] != check:
        check = 0
        if start != a:
            file.write("row " + str(start) + " - " + str(a) + ": " + str(columns[a]) + " \n")
        else:
            file.write("row " + str(start) + ": " + str(columns[a]) + " \n")
    if columns[a] != columns[0]:
        # if it doesn't, remember where this mismatch run starts
        start = a + 1
        check = columns[a]
What you want to do is a map/reduce operation, but without the sorting that is normally done between the mapping and the reducing.
If you output
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
to stdout, you can pipe this data to another Python program that generates the groups you want.
The second python program could look something like this:
import sys
import re

line = sys.stdin.readline()
last_rowid, last_diff = re.findall(r'(\d+)', line)
for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    if diff != last_diff:
        print("rows", last_rowid, rowid, last_diff)
        last_diff = diff
        last_rowid = rowid
print("rows", last_rowid, rowid, last_diff)
You would execute them like this in a unix environment to get the output into a file:
python yourprogram.py | python myprogram.py > youroutputfile.dat
If you cannot run this on a unix environment, you can still use the algorithm I wrote in your program with a few modifications.
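If a shell pipeline isn't an option, here is a sketch of the same boundary-tracking idea inlined into the original program, under the assumption that the columns list and the open file handle from the question's code are available:
# Emit a row range whenever the column count changes; `columns` and `file`
# are assumed from the question's code (columns[0] is the header's count)
last_row, last_count = 1, columns[0]
for row, count in enumerate(columns[1:], start=2):
    if count != last_count:
        if last_count != columns[0]:
            file.write("rows %d - %d : %d \n" % (last_row, row - 1, last_count))
        last_row, last_count = row, count
if last_count != columns[0]:
    file.write("rows %d - %d : %d \n" % (last_row, len(columns), last_count))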
