All that I tried was:
df['buyer_zip']=df['buyer_zip'].replace('-', 0)
df['buyer_zip']=df['buyer_zip'].replace('', 0)
df['buyer_zip']=df['buyer_zip'].str[:5]
df["buyer_zip"].fillna( method ='ffill', inplace = True)
df["buyer_zip"].apply(int)
I have two columns in a pandas dataframe called Buyer_zip and Item_zip, which hold the zip codes of buyers and items respectively. These zip codes come in four formats: a 5-digit zip code (e.g. 12345), a 5+4-digit zip code (12345-1234), a 9-digit zip code (123456789), and an alphanumeric one like 'EC180'. There are 15 million records in total. I am stuck at the point where I have to convert all the alphanumeric values to numeric. When I try, I get the error: invalid literal for int() with base 10: 'EC180'. Could someone help me find all the alphabetic values in that column and replace them with 00000? None of the attempts above answered that. Appreciate any help.
Sample data:
buyer_zip
97219
11415-3528
EC180
907031234
Expected output
buyer_zip
0 97219
1 114153528
2 0
3 907031234
Pandas has several different "replace" methods. On a DataFrame or a Series, replace is meant to match and replace entire values: for instance, df['buyer_zip'].replace('-', 0) looks for a cell whose value is literally the single character "-" and replaces it with the integer 0. That's not what you want for stripping the hyphen. The Series also has a .str attribute that holds string functions, and its replace, which works on substrings, is the one you want there.
Whole-value replacement is, however, exactly what you want for a string that starts with a non-digit letter: that value should be replaced entirely with "00000".
Finally, astype is a faster way to convert the column to int.
import pandas as pd
df = pd.DataFrame({"buyer_zip": ["12345", "12345-1234", "123456789", "EC180"]})
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "")
df["buyer_zip"] = df["buyer_zip"].replace(r"[^\d].*$", "00000", regex=True)
df["buyer_zip"] = df["buyer_zip"].astype(int)
The operations can be chained. Apply the second operation to the result of the first, etc, and you can condense the conversion
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "").replace(r"[^\d].*$", "00000", regex=True).astype(int)
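Putting the pieces together on the sample data from the question (a minimal sketch, assuming the column starts out as strings):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"buyer_zip": ["97219", "11415-3528", "EC180", "907031234"]})

# Strip hyphens, blank out any value containing a non-digit, then convert
df["buyer_zip"] = (
    df["buyer_zip"]
    .str.replace("-", "", regex=False)
    .replace(r"[^\d].*$", "00000", regex=True)
    .astype(int)
)
print(df["buyer_zip"].tolist())  # [97219, 114153528, 0, 907031234]
```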
I have a dataframe like this
Index  Identifier
0      10769289.0
1      1082471174.0
The "Identifier" column is a string column, and I need to remove the ".0".
I'm using the following code:
Dataframe["Identifier"] = Dataframe["Identifier"].replace(regex=['.0'],value='')
But I got this:
IndexIdentifier0769289182471174
As you can see it removed more than just the ".0". I also tried to use
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(".0", "")
but I got the same result.
The dot (.) in a regex can match any character, so you have to escape the decimal point; otherwise the pattern replaces any character followed by a zero. In your case that means it replaces the 10 at the beginning of 10769289.0 and 1082471174.0, as well as the .0 at the end of each number. By escaping the decimal point, the pattern only matches the literal .0, which is what you intended.
import pandas as pd
# Create the dataframe as per the example
Dataframe = pd.DataFrame({"Index": [0,1], "Identifier": ['10769289.0', '1082471174.0']})
# Replace the decimal and the zero at the end of each Identifier.
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(r"\.0$", "", regex=True)
# Print the dataframe
print(Dataframe)
OUTPUT:
Index Identifier
0 0 10769289
1 1 1082471174
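To see the difference the escape makes, here is a standalone comparison with the re module:

```python
import re

s = "10769289.0"
# Unescaped: "." matches any character, so "10" at the start is removed too
print(re.sub(".0", "", s))    # '769289'
# Escaped: only the literal ".0" is removed
print(re.sub(r"\.0", "", s))  # '10769289'
```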
I have a big dataset and I cannot convert the dtype from object to int because of the error "invalid literal for int() with base 10:" I did some research and it is because there are some strings within the column.
How can I find those strings and replace them with numeric values?
You might be looking for .str.isnumeric(), which lets you filter for these numbers-in-strings and act on them independently. But you'll need to decide what those values should be:
converted (maybe they're money and you want to truncate €, or another date format that's not a UNIX epoch, or any number of possibilities)
dropped (just throw them away)
something else
>>> df = pd.DataFrame({"a":["1", "2", "x"]})
>>> df
a
0 1
1 2
2 x
>>> df[df["a"].str.isnumeric()]
a
0 1
1 2
>>> df[~df["a"].str.isnumeric()]
a
2 x
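If you decide the non-numeric entries should simply become 0 (as in the zip-code question above), one way is Series.where, which keeps values where the condition holds and substitutes elsewhere (a sketch under that assumption):

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2", "x"]})
# Keep numeric strings, replace everything else with "0", then convert
df["a"] = df["a"].where(df["a"].str.isnumeric(), "0").astype(int)
print(df["a"].tolist())  # [1, 2, 0]
```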
Assuming 'col' is the column name, just force-convert to numeric, producing NaN on error:
df['col_num'] = pd.to_numeric(df['col'], errors='coerce')
If needed you can check which original values gave NaNs using:
df.loc[df['col'].notna() & df['col_num'].isna(), 'col']
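For example, a small sketch with one bad value:

```python
import pandas as pd

df = pd.DataFrame({"col": ["10", "EC180", "30"]})
df["col_num"] = pd.to_numeric(df["col"], errors="coerce")  # 'EC180' becomes NaN
# Which original values failed to parse?
print(df.loc[df["col"].notna() & df["col_num"].isna(), "col"].tolist())  # ['EC180']
```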
"Base 10" in that error just means int() expected a plain decimal integer string; it does not necessarily mean the value is a float. But if your string does look like a float, in Python you would do
int(float(____))
Since you used int(), I'm guessing you needed an integer value.
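For a string that really holds a float, the two-step conversion works; note it still fails for a value like 'EC180':

```python
s = "3014.0"
# int(s) would raise ValueError; going through float() first succeeds
print(int(float(s)))  # 3014
```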
I have a data frame where one of the columns (UID) contains either 7-digit or 10-digit numbers.
I have written regexes to identify 7 or 10 digits (thanks to a very similar question on Stack Overflow). These seem to work well on a text file.
no_7 = re.compile('(?<![0-9])[0-9]{7}(?![0-9])')
no_10 = re.compile('(?<![0-9])[0-9]{10}(?![0-9])')
Again, thanks to stackoverflow, I have written the following.
If the value has 7 digits, it is copied to the second-to-last column.
df['column8']=df['UID'].apply(lambda x: x if(x == re.findall(no_7, x)) else 'NaN')
If the value has 10 digits, it is copied to the last column.
df['column9']=df['UID'].apply(lambda x: x if(x == re.findall(no_10, x)) else 'NaN')
While debugging the problem I found that the regex functions cannot operate on the column while it holds numbers rather than strings.
Regex complains:
TypeError: expected string or bytes-like object
I have tried setting column "UID" pd.to_numeric
I have tried setting column "UID" df["UID"].astype(int)
I have tried setting column "UID" df["UID"].apply(np.int64)
All assuming that the problem is that the column is incorrectly formatted, which I now think is not the case.
You are obviously using the int type in your column and need str to apply string operations. You can convert using:
df['UID'] = df['UID'].astype(str)
However, there are probably much better ways to do what you want, please improve your question as requested for a better response.
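One such alternative, a sketch assuming pandas' str.fullmatch (available in pandas >= 1.1) and the hypothetical column names from the question:

```python
import pandas as pd

df = pd.DataFrame({"UID": [1234567, 1234567890]})
df["UID"] = df["UID"].astype(str)  # string operations need str, not int

# Copy 7-digit UIDs to one column and 10-digit UIDs to another;
# non-matching rows become NaN
df["column8"] = df["UID"].where(df["UID"].str.fullmatch(r"\d{7}"))
df["column9"] = df["UID"].where(df["UID"].str.fullmatch(r"\d{10}"))
```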
I have some data that I put into a pandas dataframe. Inside cell [0,5] I have a list of times that I want to retrieve and print.
Dataframe:
GAME_A PROCESSING_SPEED
yellow_selected 19
red_selected 0
yellow_total 19
red_total 60
counters [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]
Code:
import pandas as pd
df = pd.read_csv('data.csv', sep = '>')
print(df.iloc[0])
proc_speed = df.iat[0,5]
print(proc_speed[2])
When I try to print the 3rd time in the list, I get a single character instead. I also tried a for loop to print the times, but I get the output below. How can I access the individual values of the list? How would I print the 3rd time, 0.883263?
[
0
.
8
4
9
9
9
8
,
1
.
0
6
6
...
This happens because with the way you are loading the data, the column 'PROCESSING_SPEED' is read as an object type, therefore, all elements of that series are considered strings (i.e., in this case proc_speed = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]", which is exactly the string the loop is printing character by character).
Before printing the values you desire to display (from that cell), one should convert the string to a list of numbers, for example:
proc_speed = df.iat[4, 1]
proc_speed = [float(s) for s in proc_speed[1:-1].split(',')]
for num in proc_speed:
    print(num)
Where proc_speed[1:-1].split(',') takes the string containing the list, except for the brackets at the beginning and end, and splits it according to the commas separating values.
In general, we have to be careful when loading columns with varying or ambiguous data types, as Pandas could have trouble parsing them correctly or in the way we want/expect it to be.
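An alternative to manual splitting is ast.literal_eval from the standard library, which safely parses the bracketed string into a real list:

```python
import ast

# The cell's value as it comes out of the CSV: a string, not a list
cell = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]"
times = ast.literal_eval(cell)  # now a Python list of floats
print(times[2])  # 0.883263
```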
You can simply call proc_speed[index] once you have set this variable to a real list. Here is a working example; note that my call to df.iat uses different indexes:
import pandas as pd
d = {'GAME_A': ['yellow_selected', 'red_selected', 'yellow_total', 'red_total', 'counters'],
     'PROCESSING_SPEED': [19, 0, 19, 60, [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]]}
df = pd.DataFrame(d)
proc_speed = df.iat[4, 1]
for i in proc_speed:
    print(i)
0.849998
1.066601
0.883263
0.91658
0.96668
proc_speed[1]
1.066601
proc_speed[3]
0.91658
You can also convert with apply, which is easier than splitting and keeps your ints as ints:
pd.read_clipboard(sep="(?!\s+(?<=,\s))\s+")['PROCESSING_SPEED'].apply(eval)[4][2]
# 0.883263
I have a dataframe where, instead of the expected numerical values, data of type "object" was stored. The values look like 3 014.0, i.e. '3\xa0014.0' instead of 3014.0 - the non-breaking spaces ('\xa0') create a problem for conversion.
Question: Is there some way to convert it to numeric?
Strange thing: it appears that I can convert a single element:
float( df.iloc[0,0].replace('\xa0', '') ) # - works
but the same does NOT work for the whole series. I tried:
pd.to_numeric - gives: Unable to parse string
converting to string and then using replace:
df['p1'].astype('str').replace('\xa0','')
- does nothing
Data example:
df.iloc[0:3,0]
2017-10-10 11:32:49.895023 3 014.0
2017-10-10 11:33:11.612169 3 013.5
2017-10-10 11:33:22.488124 3 013.0
Name: p1, dtype: object
df.iloc[0,0]:
'3\xa0014.0'
Use this instead: df['p1'] = df['p1'].apply(lambda x: float(x.replace('\xa0','')))
df.iloc[0,0] is a string while df['p1'] is a pandas series. The replace method associated with a string and with a series is different. When you call replace on a series, pandas will attempt to replace elements.
For example,
df = pd.DataFrame({'name': ['alexander']})
df['name'].replace('a', 'x')  # does nothing
df['name'].replace('alexander', 'x')  # replaces the value alexander with x
df['p1'].apply(lambda x: float(x.replace('\xa0',''))) applies the replace method to each element (which happens to be a string) in the column p1. You can read more about the method here.
Hope this makes things clearer :)
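An equivalent, vectorized alternative (a sketch) uses the Series' .str accessor, which applies the string replace element-wise:

```python
import pandas as pd

s = pd.Series(["3\xa0014.0", "3\xa0013.5", "3\xa0013.0"])
# Remove the non-breaking spaces in every element, then convert
nums = s.str.replace("\xa0", "", regex=False).astype(float)
print(nums.tolist())  # [3014.0, 3013.5, 3013.0]
```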