I have a list of numbers in a Python DataFrame and want to group these numbers by a specific range and count them. The numbers range from 0 to 20, but let's say there is no number 6; in that case I want it to show 0.
The dataframe column looks like this:
|points|
5
1
7
3
2
2
1
18
15
4
5
I want it to look like the below
range | count
1 2
2 2
3 1
4 1
5 2
6 0
7 ...
8
9...
I would iterate through the input lines and fill up a dict with the values.
All you have to do then is count...
import collections

# read your input and store the numbers in a list
lines = []
with open('input.txt') as f:
    lines = [int(line.rstrip()) for line in f]

# pre-fill the dictionary with 0s from 0 to the highest occurring number in your input
values = {}
for i in range(max(lines) + 1):
    values[i] = 0

# increment the occurrence by 1 for any found value
for val in lines:
    values[val] += 1

# order the dict:
values = collections.OrderedDict(sorted(values.items()))

print("range\t|\tcount")
for k in values:
    print(str(k) + "\t\t\t" + str(values[k]))
repl: https://repl.it/repls/DesertedDeafeningCgibin
Edit:
a slightly more elegant version using dict comprehension:
# read input as in the first example
values = {i: 0 for i in range(max(lines) + 1)}
for val in lines:
    values[val] += 1
# order and print as in the first example
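Since the question mentions a pandas DataFrame, the same grouping can also be done directly with value_counts plus reindex. A minimal sketch, assuming the column is named points and the range of interest is 0 to 20:

```python
import pandas as pd

df = pd.DataFrame({"points": [5, 1, 7, 3, 2, 2, 1, 18, 15, 4, 5]})

# count each value, then reindex over the full range so that values
# which never occur (like 6) show up with a count of 0
counts = df["points"].value_counts().reindex(range(21), fill_value=0)
print(counts.loc[6])  # 0
```

reindex takes care of the "show 0 for missing numbers" requirement without pre-filling a dict.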
Related
I am trying to display each of the characters with their quantity
Input Specification
The first line of input contains the number N, which is the number of lines that follow. The next
N lines will contain at least one and at most 80 characters, none of which are spaces.
Output Specification
Output will be N lines. Line i of the output will be the encoding of the line i + 1 of the input.
The encoding of a line will be a sequence of pairs, separated by a space, where each pair is an
integer (representing the number of times the character appears consecutively) followed by a space,
followed by the character.
Sample Input
4
+++===!!!!
777777......TTTTTTTTTTTT
(AABBC)
3.1415555
Output for Sample Input
3 + 3 = 4 !
6 7 6 . 12 T
1 ( 2 A 2 B 1 C 1 )
1 3 1 . 1 1 1 4 1 1 4 5
Just use itertools.groupby and format each group as its length followed by its value, then join the formatted pairs:
import itertools
s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
result = " ".join("{} {}".format(sum(1 for _ in group), value) for value, group in itertools.groupby(s))
result:
3 + 3 = 4 ! 1 6 7 6 . 12 T 1 1 ( 2 A 2 B 1 C 1 ) 1 1 3 1 . 1 1 1 4 1 1 4 5
Without a key parameter, itertools.groupby groups consecutive identical items, preserving order; then you just count them. Here I chose not to build a list to consume each group (len(list(group))) but to use sum(1 for _ in group) instead.
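Applied per input line, rather than to one space-joined string (where the separator spaces get counted as groups too), the same idea reproduces the sample output exactly. A small sketch:

```python
from itertools import groupby

def encode(line):
    # one "count char" pair per run of consecutive identical characters
    return " ".join("{} {}".format(sum(1 for _ in g), ch) for ch, g in groupby(line))

print(encode("+++===!!!!"))  # 3 + 3 = 4 !
print(encode("(AABBC)"))     # 1 ( 2 A 2 B 1 C 1 )
```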
I'd do something like this:
s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
d = {char: 0 for char in s}
for char in s:
    d[char] += 1
output = "".join([" {} {}".format(value, key) for key, value in d.items()])
# outputs: '3 + 3 = 4 ! 3 6 7 7 . 1 2 T 1 ( 2 A 2 B 1 C 1 ) 1 3 2 1 1 4 4 5'
Since it looks like you aren't after total character counts but consecutive runs, I would suggest reading the string backward and counting how many times each character repeats as you iterate; once you hit a different character, you use the current count for the output. In fact, you can generate the output as you iterate through the string backward.
It might look something like this:
# read the string backward (input_string holds the original text)
reverse = input_string[::-1]
output = ''
count = 1
letter = reverse[0]
for k in range(1, len(reverse)):
    if reverse[k] == letter:
        count += 1
    else:
        output = str(count) + ' ' + letter + ' ' + output
        letter = reverse[k]
        count = 1
# flush the final run
output = str(count) + ' ' + letter + ' ' + output
I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G H I J
=====================
1 3 4
5 6 7
---------------------
=====================
K L M N O
=====================
1 2 3
4 5 6
7 8 9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools

df = pd.read_csv("xxx.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index) - 2):
    # find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))

        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))

        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of elements in the table header and in the table rows does not match for the second and third tables. I am struggling to get around this, since the number of whitespace characters signifying a missing value differs from table to table.
My question is: is there a way to account for missing values in some of the columns, so that I can get a DataFrame as output where there are either null or NaN for missing values as appropriate?
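If the tables really are fixed-width (values aligned under their headers), one alternative to manual splitting is pandas.read_fwf, which infers column boundaries and leaves NaN for blank fields. A minimal sketch on made-up data, not the real logs:

```python
import io
import pandas as pd

# hypothetical fixed-width table with an entirely blank H column
table_text = (
    "G    H    I    J\n"
    "1         3    4\n"
    "5         6    7\n"
)
df = pd.read_fwf(io.StringIO(table_text))
print(df)
```

Whether this works on the real files depends on the columns being truly fixed-width in every table.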
Using Victor Ruiz's method, I added if conditions to handle different header sizes.
=^..^=
Description in code:
import re
import pandas as pd
import itertools

df = pd.read_csv("stack.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index) - 2):
    # find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))

        # get header string
        head = df.iloc[i+1].to_string()

        # get space distance in header
        space_range = 0
        for result in re.findall('([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)

        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip line
            line = df.iloc[i+x].to_string()[5:]
            # collect items based on element distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))

        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
A B C D E F
0 1 2 3 4 5 6
1 7 8 9 1 2 3
2 4 5 6 7 8 9
3 1 2 3 4 5 6
G H I J
0 1 NaN 3 4
1 5 NaN 6 7
K L M N O
0 1 NaN NaN 2 3
1 4 5 NaN NaN 6
2 7 8 NaN 9 None
Maybe this can help you.
Suppose we have the following line of text:
1         3    4
The problem is to identify how many spaces delimit two consecutive items, versus how many indicate a missing value between them.
Let's say that up to 5 spaces is a delimiter, and more than 5 means a missing value.
You can use regex to parse the items:
from re import finditer
from math import nan

line = '1         3    4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation would be if two or more consecutive missing values can appear (the code above will inject only one nan).
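One possible generalization, a sketch only: assume each missing value adds roughly one column width of extra spaces (here 5), and inject as many nans as the delimiter length implies.

```python
from math import nan
from re import finditer

col_width = 5  # assumed width of one column, including its trailing spaces
line = '1' + ' ' * 14 + '3' + ' ' * 4 + '4'  # two values missing after the 1

items = []
for m in finditer(r'(\d+)([ ]*)', line):
    item, delim = m.groups()
    items.append(item)
    # every extra column width of spaces means one more missing value
    items.extend([nan] * (len(delim) // col_width))

print(items)  # ['1', nan, nan, '3', '4']
```

The col_width value here is an assumption and would have to be derived from the actual header spacing.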
I have a text file with two columns and 135001 rows. The first column is amplitude and the second column is the related time. I need to go through the first column, detect where the amplitude increases and then decreases again, and extract the related time. Essentially I should take a derivative of the first column. When the amplitude increases I should count one, then wait until the amplitude reaches zero, and then repeat the process. As mentioned, I need the related time as well. Below is a very raw sketch of what I am thinking of; I know it is not right, but I do not know how to complete it. As a first step I have a problem reading the rows of the first column, and I get this error: "str could not convert to float".
n = 0
with open('39-1+2.txt', "r") as f:
    for line in f:
        data = line.split(' ')[0]
        time = line.split(' ')[1]
with open('grad-time.txt', 'w') as s:
    for i in range(0, 135001):
        if
            d = float(data[i+1]) - float(data[i]) > 0
            n = n + 1
            s.write("{}\n".format(d))
        wait
            float(data[i] = 0.0)
        continue
For example, I have this file:
0 11
2 12
3 13
1 14
0 15
1 16
0 17
0 18
The output should be:
2 12
1 16
Since you want to use the value of a previous row to make a decision about the current row, you can make use of pandas' shift. This lets you create a new column that holds the value of the previous row.
Using that logic, you just need to check that the previous row is 0 and that the current value is higher.
>>> import pandas as pd
>>> df = pd.DataFrame([[0,11],[2,12],[3,13],[1,14],[0,15],[1,16],[0,17],[0,18]])
>>> df['shift'] = df[0].shift(1)
>>> df
0 1 shift
0 0 11 NaN
1 2 12 0.0
2 3 13 2.0
3 1 14 3.0
4 0 15 1.0
5 1 16 0.0
6 0 17 1.0
7 0 18 0.0
>>> df[(df['shift']==0) & (df[0] > df['shift'])].drop(columns=['shift'])
0 1
1 2 12
5 1 16
I haven't tested it, but the following functions should work for your problem:
def prepare_data(filename):
    with open(filename) as f:
        data = f.readlines()
    prepared_data = []
    for line in data:
        # A line "1 16" becomes [1, 16] in prepared_data
        prepared_data.append(
            [int(item) for item in line.split()]
        )
    return prepared_data

def find_increases_in_amplitude(prepared_data):
    # get the first data point
    last_data_point = prepared_data[0]
    increases = []
    # loop over data and find increases
    for data_point in prepared_data:
        # if the last data point had amplitude 0, and the current has amplitude
        # greater than zero: store the data_point in "increases"
        if last_data_point[0] == 0 and data_point[0] > 0:
            increases.append(data_point)
        # update last_data_point
        last_data_point = data_point
    return increases
Use the first function to open and prepare your data (making it a list like [[1, 12], ...]), then run that list through the second function.
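On the sample rows from the question, the same rising-edge check can also be written as a single comprehension with zip over consecutive pairs; a sketch using the sample values rather than the file:

```python
data = [[0, 11], [2, 12], [3, 13], [1, 14], [0, 15], [1, 16], [0, 17], [0, 18]]

# keep a row when the previous amplitude was 0 and its own amplitude is > 0
increases = [cur for prev, cur in zip(data, data[1:]) if prev[0] == 0 and cur[0] > 0]
print(increases)  # [[2, 12], [1, 16]]
```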
I want to print the following sequence of integers in a pyramid (odd rows sorted ascending, even rows sorted descending). If S=4, it must print four rows and so on.
Expected output:
1
3 2
4 5 6
10 9 8 7
I tried out the following code but it produced the wrong output.
S = int(input())
for i in range(1, S+1):
    y = i + (i-1)
    if i % 2 != 0:
        print(*range(i, y+1))
    elif i % 2 == 0:
        print(*range(y, i-1, -1))
# Output:
# 1
# 3 2
# 3 4 5
# 7 6 5 4
You need some way of either keeping track of where you are in the sequence when printing each row, generating the entire sequence and then chunking it into rows, or... (the list of possible approaches goes on and on).
Below is a fairly simple approach that just keeps track of a range start value, calculates the range stop value based on the row number, and reverses even rows.
rows = int(input())
start = 1
for n in range(1, rows + 1):
    stop = int((n * (n + 1)) / 2) + 1
    row = range(start, stop) if n % 2 else reversed(range(start, stop))
    start = stop
    print(*row)
# If rows input is 4, then output:
# 1
# 3 2
# 4 5 6
# 10 9 8 7
Using itertools.count and just reversing the sublist before printing on even rows:
from itertools import count

s = 4
l = count(1)
for i in range(1, s+1):
    temp = []
    for j in range(i):
        temp.append(next(l))
    if i % 2:
        print(' '.join(map(str, temp)))
    else:
        print(' '.join(map(str, temp[::-1])))
1
3 2
4 5 6
10 9 8 7
I would like to create a triangle from user input. I have already created the function for creating triangles.
Function:
def triangle(rows):
    PrintingList = list()
    for rownum in range(rows):
        PrintingList.append([])
        for iteration in range(rownum):
            newValue = raw_input()
            PrintingList[rownum].append(newValue)
But this takes the input one value per line, like this:
3
7
4
2
4
6
8
5
9
3
I need it to take input like this:
3
7 4
2 4 6
8 5 9 3
How do I change it to take input in this way? I need some guidance on this.
for rownum in range(rows):
    PrintingList.append([])
    newValues = raw_input().strip().split()
    PrintingList[rownum] += newValues
I can't tell from here whether you need to convert the input from strings to ints. If you do, it will look like this:
for rownum in range(rows):
    PrintingList.append([])
    newValues = map(int, raw_input().strip().split())
    PrintingList[rownum] += newValues
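Note that raw_input exists only in Python 2; in Python 3 it was renamed input. A Python 3 sketch of the same row-wise parsing, using hard-coded sample lines in place of live input() calls:

```python
lines = ["3", "7 4", "2 4 6", "8 5 9 3"]  # stand-ins for one input() call per row

PrintingList = [[int(item) for item in line.split()] for line in lines]
print(PrintingList)  # [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
```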