I am trying to convert a string column of a CSV file into an integer or float type using PySpark. Every time I convert it, the output of the conversion is "null".
When I try to check whether the string contains a number, it says "false".
How can I convert the string?
This is my attempt to solve this:
`from pyspark.sql.functions import col
w = weather.withColumn("Temperature", col("Temperature").cast('int'))
w.printSchema()
`
The issue is that the values in your "Temperature" column look like 26 °F. Such a value cannot be cast to an int, because only the first two characters of the string are digits. You need a way to retrieve only the leading digits of the string. I'd say there are two options: split the value on " " and take the first item, or use a regex.
For this example I'm going to use splitting. It is not the most robust approach, so you can try the regex out yourself.
from pyspark.sql import functions as F
w = weather.withColumn("Temperature", F.split(F.col("Temperature"), " ").getItem(0).cast('int'))
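The regex option mentioned above could look roughly like the sketch below. It assumes the values start with an optional minus sign followed by digits (as in 26 °F); the names LEADING_INT, add_int_temperature and leading_int are made up for illustration. The PySpark import sits inside the function so that the pattern itself can be tried out with plain Python first.

```python
import re

# Optional leading whitespace, optional minus sign, then the leading digits.
LEADING_INT = r"^\s*(-?\d+)"


def add_int_temperature(df, column="Temperature"):
    """Return df with `column` replaced by its leading integer part."""
    # Imported lazily so the rest of this sketch runs without Spark installed.
    from pyspark.sql import functions as F

    # regexp_extract returns the first capture group, or "" when nothing matches.
    digits = F.regexp_extract(F.col(column), LEADING_INT, 1)
    return df.withColumn(column, digits.cast("int"))


def leading_int(text):
    """Plain-Python check of the same pattern; returns None when nothing matches."""
    match = re.match(LEADING_INT, text)
    return int(match.group(1)) if match else None
```

With the DataFrame from the question this would be called as w = add_int_temperature(weather). How rows without a leading number end up (null or an error) depends on how your Spark version casts an empty string, so check that on your own data.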
I have a complex number in the form of a string:
x = "1+3j"
Using the split() method of strings, I want to break it into the real and imaginary parts.
What I tried:
I used + as the separator and got the real and imaginary values.
Problem: the complex number can also be 3-7j. In that case split() fails, as the string does not contain a +.
What I want is for split() to split the string when it encounters either + or -.
You can try this:
import numpy as np
y = complex(x)
xr, xi = np.real(y), np.imag(y)
I don't know if you want to keep them as strings, but if you do, just wrap np.real(y) and np.imag(y) in str().
Note that this doesn't work if you have spaces within your string.
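If spaces inside the string are a concern, one possible workaround is to remove them before calling complex(). The sketch below uses only built-ins (the .real and .imag attributes instead of NumPy); split_complex is a hypothetical helper name.

```python
def split_complex(text):
    """Parse a string such as '1+3j' or ' 3 - 7j ' into (real, imaginary) floats."""
    # complex() rejects spaces around the sign, so drop all spaces first.
    value = complex(text.replace(" ", ""))
    return value.real, value.imag


real_part, imag_part = split_complex("3 - 7j")
```

Because complex() does the parsing, the sign of the imaginary part is handled without any splitting on + or -.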
I extract data with Scrapy.
One of the values is a string representing a float: ' 0,18'.
What is the most efficient way to convert such a string into a float?
Right now I convert it like this: the space characters are removed and the comma is replaced by a dot.
>>> num = ' 0,18'
>>> float(num.replace(' ','').replace(',','.'))
0.18
I believe my method is far from efficient in time complexity when dealing with tons of data.
You may drop the whitespace stripping. float() ignores leading and trailing whitespace:
>>> float(' 0.18')
0.18
This is okay, but if you look at how it is processed, at a high level there are three function calls every time:
1. Remove the spaces
2. Replace the comma with a dot
3. Convert the string to a float
To reduce the code, you can drop step 1: just replace the comma with a dot and then convert the string to a float.
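The shortened version could be sketched as follows. parse_decimal_comma is a made-up helper name, and the sketch assumes the only formatting quirks are surrounding whitespace and a decimal comma (no thousands separators).

```python
def parse_decimal_comma(text):
    """Convert a string like ' 0,18' to a float."""
    # float() tolerates leading/trailing whitespace, so only the comma needs fixing.
    return float(text.replace(",", "."))


raw_values = [" 0,18", "1,5 ", "42"]
numbers = [parse_decimal_comma(value) for value in raw_values]
```

Whether this is measurably faster than the original on your data is something to check with timeit rather than assume.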
I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
What is the fastest way to do this kind of split?
The only idea I have right now is to write a loop that checks each character, but that is a little time consuming.
I should mention that the number of zeros in input is not fixed.
Each string can be converted to an integer using int() with a base of 16, then converted back to a hex string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
    print('%x' % int(s, 16))
Output
7c9226fc
7c90e8ab
7c9220fc
input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then lstrip('0') removes all the zeros from the left side.
In fact, since lstrip accepts a set of characters to strip, we can simplify:
input.lstrip('x0')
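One edge case worth keeping in mind with lstrip: an input that is all zeros, such as 0x0000, is stripped down to an empty string. A small sketch that guards against this, assuming the 0x prefix may or may not be present (strip_hex_prefix is a hypothetical name):

```python
def strip_hex_prefix(text):
    """Drop a leading '0x' and any leading zeros, keeping at least one digit."""
    digits = text[2:] if text[:2].lower() == "0x" else text
    # lstrip('0') returns '' for an all-zero value, so fall back to '0'.
    return digits.lstrip("0") or "0"
```

The variable is named text rather than input here to avoid shadowing the built-in input() function.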
format is handy for this (note that it takes an integer, so a string input must first go through int(s, 16)):
>>> print('{:x}'.format(0x000000007c90e8ab))
7c90e8ab
>>> print('{:x}'.format(0x000000007c9220fc))
7c9220fc
In this particular case you can just do
your_input[10:]
You'll most likely want to parse this properly; your idea of splitting at the first non-zero character does not seem safe at all.
This seems to be an XY problem.
If the number of characters in a string is constant then you can use the following code.
input = "0x000000007c9226fc"
output = input[10:]
Documentation
Also, you are using rpartition, which is defined as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can contain multiple 0s and rpartition splits only at the last occurrence, your code splits in the wrong place.
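To see where the split actually lands, here is rpartition applied to one of the inputs from the question. The last '0' in the string sits inside the part you want to keep, which is why only the tail comes back.

```python
value = "0x000000007c90e8ab"

# rpartition splits at the LAST '0', which here is the one in '...7c90e8ab'.
before, separator, after = value.rpartition("0")

print(before)  # everything up to (not including) that last '0'
print(after)   # e8ab
```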
A regular expression for 0x followed by zeros is (0x[0]+); replace the match with an empty string.
import re
st = "0x000007c922433434000fc"
reg = '(0x[0]+)'
rep = re.sub(reg, '', st)
print(rep)
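A slightly stricter variant anchors the pattern to the start of the string, so a later 0x0 sequence inside the value is never touched. This is a sketch; LEADING_PREFIX and strip_prefix are illustrative names, and the pattern uses 0* so that a bare 0x prefix is removed as well.

```python
import re

# '^' anchors the match to the start; only the first 0x and its zeros are removed.
LEADING_PREFIX = re.compile(r"^0x0*", re.IGNORECASE)


def strip_prefix(text):
    """Remove a leading '0x' plus any zeros directly after it."""
    return LEADING_PREFIX.sub("", text)
```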
I am working on a program that reads an RFID card, and then pulls information about that card from a database. I am using Python with MySQL for this but in order for it to work I need to convert a string, e.g. "2345d566k", to an int. I don't need the letters to be in there, just the numbers.
When I do the following:
test = "2345d566k"
test2 = int(test)
it raises: ValueError: invalid literal for int() with base 10: '2345d566k'
How could I convert this string to an int?
Assuming you want to ignore characters other than digits, a simple solution would be
test = "ab23cd56e3f"
test2 = int("".join(filter(str.isdigit, test)))
# test2 is now 23563
filter() applies isdigit() to every character of the string, keeping only those that are digits. In Python 3 it returns an iterator, so join() turns the result back into a string that int() can safely convert.
As pointed out in the comments, this will only work if every line of text you want to convert contains at least 1 digit; otherwise int() receives an empty string and raises a ValueError.
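To cover the case with no digits at all, the conversion can be wrapped in a small helper that returns a fallback value instead of raising. digits_to_int and its default parameter are hypothetical names used for illustration.

```python
def digits_to_int(text, default=None):
    """Keep only the digit characters of text and convert them to an int."""
    digits = "".join(ch for ch in text if ch.isdigit())
    # int('') would raise ValueError, so return the fallback instead.
    return int(digits) if digits else default


card_id = digits_to_int("2345d566k")
```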
You can easily use a regex for this:
import re
"".join(re.findall(r'\d', "ab23cd56e3f"))
This picks out every digit in the given string. If you put \D in place of \d, it returns every non-digit character instead.
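The same idea can be written with re.sub, deleting everything that is not a digit in one pass. A short sketch showing both character classes side by side:

```python
import re

text = "ab23cd56e3f"

# \D matches any non-digit character; replacing those with '' leaves the digits.
digits_only = re.sub(r"\D", "", text)

# \d matches digits, so findall with \D collects the remaining characters.
non_digits = re.findall(r"\D", text)
```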
I'm trying to extract a piece of information about a certain file. The file name is extracted from an xml file.
The information I want is stored in the name of the file, I want to know how to extract the letters between the 2nd and 3rd period in the string.
Eg. name is extracted from the xml, it is stored as a string that looks something like this "aa.bb.cccc.dd.ee" and I need to find what "cccc" actually is in each of the strings I extract (~50 of them).
I've done some searching and some playing around with slicing etc. but I can't get even close.
I can't just specify the letter in the range [6:11] because the length of the string varies as does the number of characters before the part I want to find.
UPDATE: Solution Added.
Because the data I was trying to split came from an XML file, each item was stored as an element rather than a string.
I iterated through the list of estate names and stored the EstateName attribute of each one in a variable:
for element in EstateList:
    EstateStr = element.getAttribute('EstateName')
I then called split on this new variable, which holds a string rather than an element, and wrote the result to the desired text file:
asset = EstateStr.split('.', 3)[2]
z.write(asset + "\n")
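Put together, the approach from the update might look like the sketch below. It assumes the XML is read with xml.dom.minidom (which provides getAttribute); the sample document, the Estate tag name and the third_fields helper are invented for illustration, since the original file layout is not shown.

```python
from xml.dom import minidom

# Hypothetical sample; the real file's structure is an assumption.
SAMPLE_XML = (
    '<Estates>'
    '<Estate EstateName="aa.bb.cccc.dd.ee"/>'
    '<Estate EstateName="a.b.c.d.e"/>'
    '</Estates>'
)


def third_fields(xml_text, tag="Estate", attribute="EstateName"):
    """Return the third dot-separated field of each element's attribute."""
    document = minidom.parseString(xml_text)
    fields = []
    for element in document.getElementsByTagName(tag):
        name = element.getAttribute(attribute)  # a plain string, not an element
        fields.append(name.split(".", 3)[2])
    return fields


assets = third_fields(SAMPLE_XML)
```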
If you are certain it will always have this format (5 blocks of characters, separated by 4 periods), you can split on '.' and then index the third element with [2].
>>> 'aa.bb.cccc.dd.ee'.split('.')[2]
'cccc'
This works for strings of any length, so you don't have to worry about absolute positions as you did with slicing in your first approach.
>>> 'a.b.c.d.e'.split('.')[2]
'c'
>>> 'eeee.ddddd.ccccc.bbbbb.aaaa'.split('.')[2]
'ccccc'
Split the string on the period:
third_part = inputstring.split('.', 3)[2]
I've used str.split() with a limit here for efficiency; no point in splitting the dd.ee part here, for example.
The [2] index then picks out the third result from the split, your cccc string:
>>> "aa.bb.cccc.dd.ee".split('.', 3)[2]
'cccc'
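If some names might have fewer than three parts, indexing with [2] raises an IndexError. A small sketch that checks the number of parts first and fails with a clearer message (third_field is a hypothetical helper name):

```python
def third_field(name):
    """Return the text between the 2nd and 3rd period of name."""
    parts = name.split(".", 3)  # stop splitting after the third field
    if len(parts) < 3:
        raise ValueError("expected at least three dot-separated parts: %r" % name)
    return parts[2]
```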
You could use the re module to extract the string between the second and third dots.
>>> import re
>>> re.search(r'^[^.]*\.[^.]*\.([^.]*)\..*', "aa.bb.cccc.dd.ee").group(1)
'cccc'