Map string values in DataFrame to integers using Python

So I have a dataset which has a column containing names of colors (red, blue, green), and I want to convert these string values into ints/floats to fit into a classifier. I was planning on using a Python dictionary with the color names as keys and numbers as values.
This was my code:
color_dict = {'red':1, 'blue':2, 'green':3}
for i in train['column_name']:
    train['column_name'][i] = color_dict[i]
print(train['column_name'])
Sadly, this did not work.
What should I do differently to make it work?

The answer is in the question :)
train["column_name"] = train["column_name"].map(color_dict)
See the docs for map.
The reason your solution didn't work is a bit tricky. When you access a value like you did (using chained brackets), you're working on a copy of the DataFrame object, so the assignment doesn't reach the original. Instead, use train.loc[i, "column_name"] = color_dict[i] to set a single value in a column. (Note also that your loop iterates over the column's values, so i holds a color name rather than a row label.) See here for more details.
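A minimal sketch contrasting the two working approaches (the frame here is invented for illustration):

```python
import pandas as pd

train = pd.DataFrame({'column_name': ['red', 'blue', 'green', 'red']})
color_dict = {'red': 1, 'blue': 2, 'green': 3}

# Vectorised fix: map every string through the dictionary at once.
mapped = train['column_name'].map(color_dict)

# Loop fix, mirroring the original attempt but writing via .loc
# with the row label (not the cell value) as the index.
looped = train['column_name'].copy()
for i, value in train['column_name'].items():
    looped.loc[i] = color_dict[value]

print(mapped.tolist())  # [1, 2, 3, 1]
```

The .map version is both shorter and far faster on large columns, since it avoids a Python-level loop.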

Related

Altering python code to get rid of exec() and eval()

I am new to Python and am trying to improve a code's readability as well as speed by removing the recurrent use of exec() and eval(). However, it is not obvious to me how I need to alter the code to achieve this.
I want the program to make dataframes and arrays with names based on input. Let's say that the input is like this:
A=[Red, Blue]
B=[Banana, Apple]
C=[Pie, Cake]
The code should then make a dataframe named after each combination of inputs:
Red_Banana_Pie, Red_Banana_Cake, Red_Apple_Pie, Red_Apple_Cake, etc. by looping through the three lists.
for color in A[0:len(A)]:
    for fruit in B[0:len(B)]:
        for type in C[0:len(C)]:
And then in each loop:
exec('DataFr_'+color+'_'+fruit+'_'+type+'=pd.DataFrame((Data),columns=[\'Title1\',\'Title2\'])')
How can I do this without the exec command?
When you run exec('DataFr_'+color+'_'+fruit+'_'+type+'=pd.DataFrame((Data),columns=[\'Title1\',\'Title2\'])'), you get 8 DataFrames with different names. But I don't recommend this, because you then have to use eval() every time you want to access a DataFrame (otherwise you have to hardcode the names, which is a really bad thing).
I think you need a multi-dimensional dictionary for the dataframes.
When the input is
A=["Red", "Blue"]
B=["Banana", "Apple"]
C=["Pie", "Cake"]
[+] Additionally, user input basically arrives as strings in Python (like "hello, world!").
data_set = {}
for color in A:
    data_set.update({color: {}})
    for fruit in B:
        data_set[color].update({fruit: {}})
        for type in C:
            data_set[color][fruit].update({type: pd.DataFrame((Data), columns=['Title1', 'Title2'])})
            # I think you have some Data in some other place, right?
[+] Moreover, you can iterate a list without [0:len(A)] in Python: for color in A: is enough.
Then you can use each DataFrame as data_set['Red']['Banana']['Cake'] (with your inputs, that is data_set[A[0]][B[0]][C[1]]).
This way you dynamically create a DataFrame for each color, fruit, and type without eval, and access them all without hardcoded names.
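A runnable version of the idea; Data is a placeholder here since the question doesn't show it, and a flat dict keyed by (color, fruit, type) tuples works just as well as the nested one:

```python
import pandas as pd

A = ["Red", "Blue"]
B = ["Banana", "Apple"]
C = ["Pie", "Cake"]
Data = [[1, 2], [3, 4]]  # placeholder for the real data

# One DataFrame per combination, keyed by a (color, fruit, type) tuple.
data_set = {
    (color, fruit, kind): pd.DataFrame(Data, columns=['Title1', 'Title2'])
    for color in A
    for fruit in B
    for kind in C
}

print(len(data_set))  # 8, one frame per combination
print(data_set[('Red', 'Banana', 'Cake')].columns.tolist())  # ['Title1', 'Title2']
```

The tuple-keyed dict avoids two levels of nesting while still letting you look frames up by all three inputs.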

Is there an equivalent for DataFrame.iteritems() that does not include the name?

I have some legacy code with multiple instances like this...
result = function(df['a'], df['b'], df['z'])
The function accepts *args so I wondered if I could "tidy" the code by doing the following...
result = function(df[['a','b','z']].iteritems())
But iteritems() returns an iterator of (name, Series) pairs, so it doesn't work.
Is there a "tidy" way to get access to the list of Series only? (no pairs, no name)
(Changing the function is not ideal; it's designed to work with Scalars and Arrays, and as a Series is ArrayLike they work too. So I just would "like" a list of the Series on their own...)
My best attempt is just to get the Series as Arrays instead, but I "dis-like" it due to multiple instances of boiler-plate code, it feels like there "should" be a direct way to iterate on the Series?
result = function(*(df[['a','b','z']].to_numpy().T))
Side note: iteritems() has been renamed items() in newer pandas versions. Iterating over a DataFrame yields its column names, so you can use a list comprehension:
function(*[df[i] for i in df[["a","b","z"]]])
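For instance, with a toy frame and a stand-in function (both invented here):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'z': [5, 6]})

def function(*args):
    # Toy stand-in: totals the sums of every Series passed in.
    total = 0
    for s in args:
        total += s.sum()
    return total

# Iterating over a DataFrame yields its column labels, so the
# comprehension produces the bare Series, name-free.
result = function(*[df[col] for col in df[['a', 'b', 'z']]])
print(result)  # 21
```

The * unpacking is what turns the list of Series back into positional arguments, matching the original df['a'], df['b'], df['z'] call.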

How do I use multiple data points from one Excel column in Python?

I am doing some image processing in Python and need to crop an area of the image. However, my pixel coordinate data is arranged as three values in one Excel column, separated by commas, as follows:
[1345.83,1738,44.26] (i.e. [x,y,r]) - this is exactly how it appears in the Excel cell, square brackets and all.
Any idea how I can read this into my script and start cropping images according to the pixel coordinate values? Is there a function that can separate them and treat them as three independent values?
Thanks, Rhod
My understanding is that if you use pandas.read_excel(), you will get a column of strings in this situation. There are lots of options, but here's what I would do, assuming your column name is xyr:
# clean up strings to remove the brackets on either side
data['xyr_clean'] = data['xyr'].str.lstrip('[').str.rstrip(']')
# split on ',' (your sample has no space after the commas)
data[['x', 'y', 'r']] = (
    data['xyr_clean'].str.split(',', expand=True).astype(float)
)
The key thing to know is that pandas string columns have a .str attribute that contains adapted versions of all or most of Python's built-in string methods. Then you can search for "pandas convert string column to float" to get the last bit!
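A small end-to-end sketch (the frame is built inline here instead of via pandas.read_excel(), and the column name xyr is the one assumed above):

```python
import pandas as pd

# Stand-in for the column read from Excel: strings, brackets included.
data = pd.DataFrame({'xyr': ['[1345.83,1738,44.26]', '[10.5,20,3.2]']})

# Strip the brackets, split on commas, convert to floats.
clean = data['xyr'].str.strip('[]')
data[['x', 'y', 'r']] = clean.str.split(',', expand=True).astype(float)

print(data[['x', 'y', 'r']])
```

From there x, y, and r are plain floats and can be passed straight to the cropping code.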

Pandas add column to new data frame at associated string value?

I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column to df, matched on the associated street. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note it doesn't need to be 100% accurate, just get most, and it can be completely different from the approach tried above.)
EDIT Thank you to all who have tried so far I have not resolved the issues yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this does: it looks for equal values in the ST_NAME and STREET columns and fills each matching row with the values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
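A toy illustration of the merge (the frames, street names, and IDs are invented here):

```python
import pandas as pd

df = pd.DataFrame({'STREET': ['MAIN', 'OAK', 'ELM'], 'crime': [3, 1, 2]})
street_map2 = pd.DataFrame({'ST_NAME': ['MAIN', 'ELM'], 'STREET_ID': [101, 103]})

# Default inner merge: keeps only streets present in both frames,
# attaching STREET_ID wherever STREET == ST_NAME.
merged = df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
print(merged[['STREET', 'STREET_ID']])
```

Note that 'OAK' drops out because it has no match; pass how='left' if you'd rather keep unmatched rows with NaN IDs.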
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get the rows of street_map2 where the ST_NAME column equals the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first of those rows with iloc[0] and return its STREET_ID value.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st")
    == street_name.lower().replace("street", "st")
]
...
...which lowercases both values, converts, for example, "street" to "st" (so the mappings are more likely to overlap), and then checks for equality.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation to the strings while doing the merge (I am still not sure whether there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the third, new data frame. After this I had the first and second both with a ["STREET_ID"] column. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus I got the desired final data frame with the associated street IDs.

Data presentation difference in python

Hopefully a fairly simple answer to my issue.
When I run the following code:
print (data_1.iloc[1])
I get a nice, vertical presentation of the data, with each column value header, and its value presented on separate rows. This is very useful when looking at 2 sets of data, and trying to find discrepancies.
However, when I write the code as:
print (data_1.loc[data_1["Name"].isin(["John"])])
I get all the information arrayed across the screen, with the column header in 1 row, and the values in another row.
My question is:
Is there any way of using the second code, and getting the same vertical presentation of the data?
The difference is that data_1.iloc[1] returns a pandas Series whereas data_1.loc[data_1["Name"].isin(["John"])] returns a DataFrame. Pandas has different representations for these two data types (i.e. they print differently).
The reason iloc[1] gives you a Series is that you indexed with a scalar; if you do data_1.iloc[[1]] you'll see you get a DataFrame instead. Conversely, data_1["Name"].isin(["John"]) returns a boolean mask, so .loc selects a DataFrame of all the matching rows. If you want a Series instead, you might try something like
print(data_1.loc[data_1["Name"].isin(["John"])].iloc[0])
but only if you're sure you're getting exactly one row back.
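A toy demonstration of the two representations (the frame is invented here):

```python
import pandas as pd

data_1 = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [30, 25]})

row_series = data_1.iloc[1]                            # Series: prints vertically
row_frame = data_1.loc[data_1['Name'].isin(['John'])]  # DataFrame: prints as a table

# Taking the first matching row converts the DataFrame back into
# a Series, restoring the vertical layout.
vertical = row_frame.iloc[0]
print(vertical)
```

Printing row_series and row_frame side by side makes the difference obvious: one lays fields out vertically, the other as a one-row table.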
