Pythonic equivalent to Matlab's textscan

There are some similar questions to this, but nothing exact that I can find.
I have a very odd text-file with lines like the following:
field1=1; field2=2; field3=3;
field1=4; field2=5; field3=6;
Matlab's textscan() function deals with this very neatly, as you can do this:
array = textscan(fid, 'field1=%d; field2=%d; field3=%d;');
and you will get back a cell-array where each column contains the respective field, and the text is simply ignored.
I'd like to rewrite the code that deals with this file in Python, but Numpy's loadtxt() and genfromtxt() don't seem to be able to ignore text interspersed with the desired numbers.
What are some Python ways to strip out the text and only get back the fields? I'm happy to use pandas or another library if required. Thanks!
EDIT: This question was suggested as an answer, but it only gives equivalents to the basic usage of textscan that does not deal with unwanted text in the input. The answer below with fromregex is what I needed.

Numpy's fromregex function is basically the same as textscan. It lets you read in based on a regular expression, with groups (parts surrounded by ()) as the values. This works for your example:
import numpy as np

data = np.fromregex('temp.txt', r'field1=(\d+); field2=(\d+); field3=(\d+);', dtype='int')
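fromregex can also give you named fields, closer to textscan's one-column-per-field result, if you pass a structured dtype (a variant sketch; the field names here are my own choice):
import numpy as np

dt = [('field1', int), ('field2', int), ('field3', int)]
data = np.fromregex('temp.txt',
                    r'field1=(\d+); field2=(\d+); field3=(\d+);',
                    dtype=dt)
print(data['field2'])  # -> [2 5]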
You can also use loadtxt. There is an argument, converters, that lets you provide functions that do the actual conversion from text to a number; you just need to provide a function that strips out the unneeded text.
So in my tests this works:
import numpy as np

# split on '=' and convert what follows to an integer
myconv = lambda x: int(x.split(b'=')[-1])
mycols = [0, 1, 2]
convdict = {i: myconv for i in mycols}
data = np.loadtxt('temp.txt', delimiter=';', usecols=mycols, converters=convdict)
myconv is an anonymous function that takes a value (say 'field1=1'), splits it on the '=' symbol (making ['field1', '1']), takes the last element ('1'), then converts that to an integer (1).
mycols is just the numbers of the columns you want to keep. Since there is a delimiter at the end of each line, the trailing ';' creates an empty column, so we exclude it.
convdict is a dictionary where each key is a column number and each value is the function to convert that column to a number. In this case they are all the same, but you can customize them however you want.

Python has no exact equivalent of Matlab's textscan (edit: but numpy has fromregex; see @TheBlackCat's answer for more).
With more complicated formats, regular expressions may get the job done.
import re

line_pat = re.compile(r'field1=(\d+); field2=(\d+); field3=(\d+);')
with open(filepath, 'r') as f:
    array = [[int(n) for n in line_pat.match(line).groups()] for line in f]
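If some lines might not match the pattern, match() returns None; here is a guarded variant of the same idea (my own sketch):
with open(filepath, 'r') as f:
    matches = (line_pat.match(line) for line in f)
    array = [[int(n) for n in m.groups()] for m in matches if m]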

Related

How to round numbers in place in a string in python

I'd like to take some numbers that are in a string in python, round them to 2 decimal spots in place and return them. So for example if there is:
"The values in this string are 245.783634 and the other value is: 25.21694"
I'd like to have the string read:
"The values in this string are 245.78 and the other value is: 25.22"
What you'd have to do is find the numbers, round them, then replace them. You can use regular expressions to find them, and if we use re.sub(), it can take a function as its "replacement" argument, which can do the rounding:
import re
s = "The values in this string are 245.783634 and the other value is: 25.21694"
n = 2
result = re.sub(r'\d+\.\d+', lambda m: format(float(m.group(0)), f'.{n}f'), s)
Output:
The values in this string are 245.78 and the other value is: 25.22
Here I'm using the most basic regex and rounding code I could think of. You can vary it to fit your needs, for example by checking whether the numbers have a sign (regex: [-+]?) and/or by using something like the decimal module for more precise rounding.
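For example, a sketch of that variant with an optional sign and decimal-based rounding (the input string here is invented):
import re
from decimal import Decimal, ROUND_HALF_UP

s = "Offsets: -245.783634 and +25.21694"
result = re.sub(
    r'[-+]?\d+\.\d+',
    # Decimal avoids binary-float rounding artifacts; note that str() drops a leading '+'
    lambda m: str(Decimal(m.group(0)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)),
    s,
)
print(result)  # Offsets: -245.78 and 25.22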
Another alternative using regex for what it is worth:
import re
def rounder(string, decimal_points):
fmt = f".{decimal_points}f"
return re.sub(r'\d+\.\d+', lambda x: f"{float(x.group()):{fmt}}", string)
text = "The values in this string are 245.783634 and the other value is: 25.21694"
print(rounder(text, 2))
Output:
The values in this string are 245.78 and the other value is: 25.22
I'm not sure quite what you are trying to do. "Round them in place and return them" -- do you need the values saved as variables that you will use later? If so, you might look into using a regular expression (as noted above) to extract the numbers from your string and assign them to variables.
But if you just want to be able to format numbers on the fly, have you looked at f-strings?
print(f"The values in this string are {245.783634:.2f} and the other value is: {25.21694:.2f}.")
output:
The values in this string are 245.78 and the other value is: 25.22.
You can simply use format strings:
link = f'{23.02313:.2f}'
print(link)
This is one hacky way, but many other solutions exist. I did this in one of my recent projects.

Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame (where x and y are integers)?
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc.
I know that by looping through lines and using regular expressions we can do that, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(Side remark: your example does not make entirely clear whether you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used.)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    lst = s.split(',')
    newlst = []
    for i in lst:
        if "-" in i:
            start, end = (int(j) for j in i.split("-"))
            newlst.extend(range(start, end + 1))  # +1 so the range includes the upper bound
        else:
            newlst.append(int(i))
    return newlst
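Applied with apply, for example (a sketch with made-up data):
import pandas as pd

s = pd.Series(["1-3,7", "1,4,6-9,11-13,5"])
expanded = s.apply(lambda v: ",".join(str(n) for n in make_list(v)))
print(expanded.tolist())
# ['1,2,3,7', '1,4,6,7,8,9,11,12,13,5']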

Accessing Data using df['foo'] missing data for pattern searching python

So I have this function which takes in one row from a dataframe, matches a pattern,
and adds it to the data. Since the pattern search needs its input to be a string, I am forcing it with str(). However, if I do that it cuts off my url after a certain point.
I figured out that if I force it using the ix function
str(data.ix[0,'url'])
it does not cut anything off and gets me what I want. Also, if I use str(data.ix[:'url']),
it also cuts off after some point.
The problem is that I cannot specify the index position inside the ix function, as I plan to iterate by row using the apply function. Any suggestions?
import re

def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want. Check out the Series.str.extract method, added in version 0.13 of pandas.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
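Putting it together on a toy frame (the urls and values here are invented to match the pattern):
import pandas as pd

data = pd.DataFrame({'url': [
    'http://example.com/?model=camry&id=42&make=toyota',
    'http://example.com/?model=civic&id=7&make=honda',
]})
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
data = data.join(data['url'].str.extract(pattern))
print(data[['make', 'model']])
#      make  model
# 0  toyota  camry
# 1   honda  civic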

how to only show int in a sorted list from csv file

I have a huge CSV file where I'm supposed to show only the columns "name" and "runtime".
My problem is that I have to sort the file and print the 10 smallest and 10 largest values from the
runtime column.
But the column 'runtime' contains text like this:
['http://dbpedia.org/ontology/runtime',
'XMLSchema#double',
'http://www.w3.org/2001/XMLSchema#double',
'4140.0',
'5040.0',
'5700.0',
'{5940.0|6600.0}',
'NULL',
'6480.0',....n]
How do I sort the list showing only numbers?
My code so far:
import csv
import urllib

run = []
fp = urllib.urlopen('Film.csv')
reader = csv.DictReader(fp, delimiter=',')
for line in reader:
    if line:
        run.append(line)

name = []
for row in run:
    name.append(row['name'])

runtime = []
for row in run:
    runtime.append(row['runtime'])
The csv file contains NULL values and values looking like this: {5940.0|6600.0}
Expected output:
'4140.0',
'5040.0',
'5700.0',
'6600.0',
'6800.0',....n]
not containing the NULL values, and only the highest value from the ones looking like this:
{5940.0|6600.0}
You could filter it like this, but you should probably wait for better answers.
>>> l = [1, 1.3, 7, 'text']
>>> [i for i in l if type(i) in (int, float)]  # only ints and floats allowed
[1, 1.3, 7]
This should do though.
My workflow probably would be: use str.isdigit() as a filter, convert to a number with the built-in int() or float(), and then use sort() or sorted().
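A minimal sketch of that idea, adapted to this data (my own interpretation: str.isdigit() rejects the decimal point, so a try/float() conversion serves as the filter, and the {a|b} entries keep their highest alternative, per the expected output):
def parse_runtime(value):
    # '{5940.0|6600.0}' -> keep the highest of the alternatives
    if value.startswith('{') and value.endswith('}'):
        return max(float(v) for v in value[1:-1].split('|'))
    try:
        return float(value)
    except ValueError:
        return None  # 'NULL' and the metadata entries are dropped

numbers = sorted(n for n in (parse_runtime(v) for v in runtime) if n is not None)
print(numbers[:10])   # 10 smallest runtimes
print(numbers[-10:])  # 10 largest runtimes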
While you could use one of the many answers that will show up here, I personally would exploit some domain knowledge of your csv file:
runtime = runtime[3:]
Based on your example values for the runtime column, the first three entries contain metadata. So you know more about the structure of your input file than just "it is a csv file".
Then, all you need to do is sort:
runtime = sorted(runtime)
max_10 = runtime[-10:]
min_10 = runtime[:10]
The syntax I'm using here is called a "slice", which lets you access a range of a sequence by specifying the start index and the "up-to-but-not-including" index in square brackets, separated by a colon. Neat trick: negative indexes are counted from the end of the sequence.
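For illustration:
>>> seq = [10, 20, 30, 40, 50, 60]
>>> seq[:3]   # first three items
[10, 20, 30]
>>> seq[-3:]  # last three items
[40, 50, 60]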

How to a turn a list of strings into complex numbers in python?

I'm trying to write code which imports and exports lists of complex numbers in Python. So far I'm attempting this using the csv module. I've exported the data to a file using:
spamWriter = csv.writer(open('data.csv', 'wb'))
spamWriter.writerow(complex_data)
Where complex_data is a list of numbers generated by the complex(re, im) function. Ex:
print complex_data
[(37470-880j),(35093-791j),(33920-981j),(28579-789j),(48002-574j),(46607-2317j),(42353-1557j),(45166-2520j),(45594-232j),(41149+561j)]
To then import this at a later time, I try the following:
mycsv = csv.reader(open('data.csv', 'rb'))
out = list(mycsv)
print out
[['(37470-880j)','(35093-791j)','(33920-981j)','(28579-789j)','(48002-574j)','(46607-2317j)','(42353-1557j)','(45166-2520j)','(45594-232j)','(41149+561j)']]
(Note that this is a list of lists, I just happened to use only one row for the example.)
I now need to turn this into complex numbers rather than strings. I think there should be a way to do this with mapping as in this question, but I couldn't figure out how to make it work. Any help would be appreciated!
Alternatively, if there's any easier way to import/export complex-valued data that I don't know of, I'd be happy to try something else entirely.
Just pass the string to complex():
>>> complex('(37470-880j)')
(37470-880j)
Like int(), it takes a string representation of a complex number and parses that. You can use map() to do so for a list:
map(complex, row)
>>> c = ['(37470-880j)','(35093-791j)','(33920-981j)']
>>> map(complex, c)
[(37470-880j), (35093-791j), (33920-981j)]
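In Python 3, map() returns a lazy iterator, so wrap it in list() to get the same result:
>>> list(map(complex, c))
[(37470-880j), (35093-791j), (33920-981j)]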
complex_out = []
for row in out:
    comp_row = [complex(x) for x in row]
    complex_out.append(comp_row)
CSV docs say:
Note that complex numbers are written out surrounded by parens. This may cause some problems for other programs which read CSV files (assuming they support complex numbers at all).
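For reference, a complete round-trip sketch in Python 3 (the question's code is Python 2; in Python 3, newline='' replaces the 'b' file modes for csv):
import csv

complex_data = [complex(37470, -880), complex(35093, -791), complex(33920, -981)]

with open('data.csv', 'w', newline='') as f:
    csv.writer(f).writerow(complex_data)

with open('data.csv', newline='') as f:
    out = [[complex(x) for x in row] for row in csv.reader(f)]

print(out)  # [[(37470-880j), (35093-791j), (33920-981j)]]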
This converts the elements in out from strings to complex numbers; it is the simplest solution given your existing code, and it handles non-complex entries gracefully.
for i, row in enumerate(out):
    for j, entry in enumerate(row):
        try:
            out[i][j] = complex(entry)
        except ValueError:
            # Print here if you want to know something bad happened
            pass
Otherwise using map(complex, row) on each row takes fewer lines.
for i, row in enumerate(out):
    out[i] = map(complex, row)
I think each method above is a bit complex.
The easiest way is this:
In [1]: complex_num = '-2+3j'
In [2]: complex(complex_num)
Out[2]: (-2+3j)
