I extract data with Scrapy.
One of the values is a string representing a float: ' 0,18'.
What is the most efficient way to convert this string into a float?
Right now I convert it like this: the space characters are removed and the comma is replaced by a dot.
>>> num = ' 0,18'
>>> float(num.replace(' ','').replace(',','.'))
0.18
I believe my method is far from efficient in time complexity when dealing with tons of data.
You may drop the whitespace stripping. float ignores leading and trailing whitespace:
>>> float(' 0.18')
0.18
This is okay, but if you look at how it is processed at a high level, there are three function calls every time:
1. Replace the space with nothing
2. Replace the comma with a dot
3. Convert the string to a float
To reduce the code, you can drop step 1: just replace the comma with a dot and then convert the string to a float.
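As a minimal sketch of the shortened conversion described above (the helper name parse_decimal_comma is hypothetical, and it assumes each value contains at most one decimal comma and no thousands separators):

```python
def parse_decimal_comma(text):
    """Convert a string such as ' 0,18' to a float (hypothetical helper)."""
    # float() ignores leading and trailing whitespace, so one replace() is enough.
    return float(text.replace(',', '.', 1))


values = [' 0,18', '3,5 ', '42']
print([parse_decimal_comma(v) for v in values])  # [0.18, 3.5, 42.0]
```

Whether this is measurably faster than the original two replace calls depends on the data, so it is worth timing on a real sample before relying on it.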
I am trying to convert a string column of a CSV file into an integer or float type using PySpark. Every time I convert it, the output of the conversion is "null".
When I try to check whether the string contains a number, it says "false".
How can I convert the string?
This is my attempt to solve this:
from pyspark.sql.functions import col
w = weather.withColumn("Temperature", col("Temperature").cast('int'))
w.printSchema()
The issue is that the values in your "Temperature" column look like 26 °F. Such a value cannot be cast to an int, because only the first two characters of the string are digits. You need a way to retrieve only the leading integer characters of the string. I'd say there are two options: split the value on " " and take the first item, or use a regex.
For this example I'm going to use splitting. It is not the most robust approach, so you can try the regex yourself.
from pyspark.sql import functions as F
w = weather.withColumn("Temperature", F.split(F.col("Temperature"), " ").getItem(0).cast('int'))
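For the regex option, here is a rough sketch in plain Python of the kind of pattern that could be used. The pattern and the helper name leading_int are assumptions for illustration; in PySpark, a similar pattern could presumably be passed to F.regexp_extract(column, pattern, 1) before the cast, but that is not tested here.

```python
import re

# Hypothetical pattern: optional leading spaces, optional sign, then digits.
LEADING_INT = r'^\s*(-?\d+)'


def leading_int(value):
    """Return the leading integer of value, or None when there is none."""
    match = re.match(LEADING_INT, value)
    return int(match.group(1)) if match else None


print(leading_int('26 °F'))  # 26
print(leading_int('-4 °F'))  # -4
print(leading_int('n/a'))    # None
```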
I have a complex number in the form of a string:
x = '1+3j'
Using the split() method of strings, I want to break it into the real and imaginary parts.
What I tried:
I used + as the separator of the string and got the real and imaginary values.
Problem: the complex number can also be 3-7j. In that case, split() fails because the string does not contain a +.
What I want is for split() to split the string when it encounters either + or -.
You can try this:
import numpy as np
y=complex(x)
xr,xi=np.real(y),np.imag(y)
I don't know if you want to keep them as strings, but if you do, just add str() around np.real(y) and np.imag(y).
Note that it doesn't work if you have spaces within your string.
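The same idea also works without NumPy, since Python's built-in complex objects expose real and imag attributes. The sketch below (split_complex is a hypothetical helper name) additionally removes spaces first, on the assumption that spaces are the only stray characters in the input:

```python
def split_complex(text):
    """Return (real, imag) parsed from a string such as '1+3j' or '3-7j'."""
    # complex() rejects whitespace inside the number, so remove spaces first.
    z = complex(text.replace(' ', ''))
    return z.real, z.imag


print(split_complex('1+3j'))    # (1.0, 3.0)
print(split_complex('3 - 7j'))  # (3.0, -7.0)
```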
I'm currently using Python to search through a .config file and look for an integer in a line such as locationId="225".
It replaces the integer such as 225 with another number of my choosing.
This works fine. However, I'm not sure how to enter my own number if the original .config file is missing a number. Example:
locationID=""
So if the original locationId is missing an integer, I still want to replace it with my new integer.
I have used:
import re
sys.stdout.write(re.sub(r'(locationid=")', r'\1 ' + newtext, line))
but this causes it to output something such as
locationId=" 33"
with a space before the 33. How do I remove the space before the 33 and make it output
locationId="33"
?
I basically just want to know how to remove the space before the number.
The space is coming from your replacement string, r'\1 '. However, simply removing that space causes a problem when the number is concatenated with it: if newtext is 1, the replacement string becomes r'\11', which is read as a reference to group 11 rather than group 1 followed by the digit 1.
Remove the double quote from the capturing group and add it to the replacement string:
re.sub(r'(locationid=)"', r'\1"' + newtext, line)
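An alternative sketch uses the \g<1> form of the group reference, which cannot be confused with a following digit. It assumes the attribute is spelled locationId exactly as in the sample line and that the old value, if present, contains no double quote; set_location_id is a hypothetical helper name.

```python
import re


def set_location_id(line, new_id):
    """Replace whatever is between the quotes of locationId="..." with new_id."""
    # \g<1> is unambiguous: a digit that follows cannot extend the group number.
    return re.sub(r'(locationId=")[^"]*', r'\g<1>' + str(new_id), line)


print(set_location_id('locationId=""', 33))     # locationId="33"
print(set_location_id('locationId="225"', 33))  # locationId="33"
```

Because [^"]* also matches an empty value, the same call covers both the missing and the existing number.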
I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
What is the fastest way to do this kind of split?
The only idea for me right now is to make a loop and perform checking but it is a little time consuming.
I should mention that the number of zeros in input is not fixed.
Each string can be converted to an integer using int() with a base of 16. Then convert back to a string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
    # Parse as base-16, then format back to hex without prefix or leading zeros
    print('%x' % int(s, 16))
Output
7c9226fc
7c90e8ab
7c9220fc
input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then the lstrip('0') removes all the zeros from the left side.
In fact, we can use lstrip's ability to remove any of several leading characters to simplify:
input.lstrip('x0')
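One edge case worth keeping in mind: if the value is all zeros, lstrip leaves an empty string. A small sketch that keeps a single '0' in that case (strip_hex_prefix is a hypothetical helper, and it assumes the prefix, when present, is exactly 0x or 0X):

```python
def strip_hex_prefix(text):
    """Return the hex digits of text without the '0x' prefix and leading zeros."""
    digits = text[2:] if text.lower().startswith('0x') else text
    # lstrip('0') yields '' for an all-zero value, so fall back to '0'.
    return digits.lstrip('0') or '0'


print(strip_hex_prefix('0x000000007c9226fc'))  # 7c9226fc
print(strip_hex_prefix('0x0000'))              # 0
```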
format is handy for this:
>>> print('{:x}'.format(0x000000007c90e8ab))
7c90e8ab
>>> print('{:x}'.format(0x000000007c9220fc))
7c9220fc
In this particular case you can just do
your_input[10:]
You'll most likely want to properly parse this; your idea of splitting on separation of non-zero does not seem safe at all.
Seems to be the XY problem.
If the number of characters in a string is constant then you can use
the following code.
input = "0x000000007c9226fc"
output = input[10:]
Also, note that you are using rpartition, which is defined as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can contain multiple 0s and rpartition only splits at the last occurrence, your code does not produce the result you want.
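To make this concrete, here is a short illustration using two of the inputs from the question. The last '0' in each string sits inside the significant digits, which is why only the tail after it is returned:

```python
for s in ('0x000000007c90e8ab', '0x000000007c9220fc'):
    # rpartition splits at the LAST '0', which is inside the value itself
    head, sep, tail = s.rpartition('0')
    print(head, sep, tail)

# 0x000000007c9 0 e8ab
# 0x000000007c922 0 fc
```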
A regular expression for a prefix such as 0x00000 is (0x[0]+); replace the match with an empty string.
import re
st="0x000007c922433434000fc"
reg='(0x[0]+)'
rep=re.sub(reg, '',st)
print(rep)
So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string once we find ACTGAC then I need to analyze the next 10 characters for the string repeats which go by some rules. I have the rules coded but can anyone show me how once I find the string that I need, I can make a substring for the next ten characters to analyze. I know that str.partition function can do that once I find the string, and then the [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero, the start of the string), and also the end-index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
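With partition, a missing marker simply yields an empty tail, which can be hard to tell apart from a marker at the very end of the data. If that distinction matters, a sketch based on str.find could look like this (after_marker is a hypothetical helper name, and returning None for "not found" is an assumption about what the caller wants):

```python
def after_marker(data, marker, length=10):
    """Return up to `length` characters following the first marker, or None."""
    start = data.find(marker)
    if start == -1:
        return None  # marker not present at all
    begin = start + len(marker)
    return data[begin:begin + length]


print(after_marker('AAACTGACACCATCGATCAGAACCTGA', 'ACTGAC'))  # ACCATCGATC
print(after_marker('AAAA', 'ACTGAC'))                         # None
```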
So, based on marcog's answer in "Find all occurrences of a substring in Python", I propose the following, which also handles overlapping matches:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']