Pandas add column to new data frame at associated string value? - python

I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the column STREET_ID at the associated street to df. I have tried a few approaches but my inexperience with pandas and the comparison of strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises a "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note: it doesn't need to be 100% accurate, just get most of them, and it can be completely different from the approach tried above.)
EDIT: Thank you to all who have tried so far. I have not resolved the issue yet. Here is some more data:
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.

For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is: it will look for equal values in ST_NAME and STREET columns and fill the rows with values from the other columns from both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
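Since the match has to be exact, it can help to normalise both columns into a temporary join key before merging. The following is a minimal sketch, assuming the names differ only in case and surrounding whitespace; the sample rows and the helper column name key are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real frames come from the question.
df = pd.DataFrame({"STREET": ["Main ", "oak", "Elm"]})
street_map2 = pd.DataFrame({
    "ST_NAME": ["MAIN", "OAK", "PINE"],
    "STREET_ID": [101, 102, 103],
})

# Normalise both sides into a temporary join key so case and stray
# whitespace do not prevent a match.
left = df.assign(key=df["STREET"].str.strip().str.upper())
right = street_map2.assign(key=street_map2["ST_NAME"].str.strip().str.upper())

# how="left" keeps unmatched rows of df; their STREET_ID becomes NaN.
merged = left.merge(right[["key", "STREET_ID"]], on="key", how="left").drop(columns="key")
print(merged)
```

If street_map2 contains the same normalised name more than once, the merge will repeat the matching df row once per duplicate, so it may be worth dropping duplicates on the key first.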

You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
    # Return the STREET_ID of the first row whose ST_NAME equals street_name.
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered by where the ST_NAME column is the same as street_name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first row of that with iloc[0], and return its STREET_ID value. Note that iloc[0] raises an IndexError when no row matches.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st").str.startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the names are more likely to overlap) and then check whether ST_NAME starts with the given street name.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
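One possible shape for such a "fuzzy" match is sketched below using difflib from the standard library. This is only an illustration under assumptions: the sample rows, the normalise helper, and the 0.8 cutoff are all hypothetical and would need tuning against the real street names.

```python
import difflib
import pandas as pd

# Hypothetical sample data standing in for the frames in the question.
df = pd.DataFrame({"STREET": ["Main Street", "OAK AV", "Nowhere"]})
street_map2 = pd.DataFrame({
    "ST_NAME": ["MAIN ST", "OAK AVE", "PINE RD"],
    "STREET_ID": [101, 102, 103],
})

def normalise(name):
    # Lowercase, trim, and shorten "street" so both spellings compare equal.
    return name.lower().strip().replace("street", "st")

# Lookup table from normalised street name to its ID.
id_by_name = {
    normalise(name): street_id
    for name, street_id in zip(street_map2["ST_NAME"], street_map2["STREET_ID"])
}

def get_street_id(street_name, cutoff=0.8):
    # Return the ID of the closest known name, or None if nothing is close enough.
    matches = difflib.get_close_matches(
        normalise(street_name), list(id_by_name), n=1, cutoff=cutoff
    )
    return id_by_name[matches[0]] if matches else None

df["STREET_ID"] = df["STREET"].map(get_street_id)
print(df)
```

Returning None instead of indexing with iloc[0] means unmatched streets end up as missing values rather than raising an error, which fits the "doesn't need to be 100% accurate" requirement.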

Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation on the strings while doing the merge (I am still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with this third data frame. After this, the first and second both had a "STREET_ID" column. Then I finally managed to merge the second one with the combined one by using:
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus getting the desired final data frame with the associated street IDs.

Related

Is there a Python pandas function for retrieving a specific value of a dataframe based on its content?

I've got multiple excels and I need a specific value but in each excel, the cell with the value changes position slightly. However, this value is always preceded by a generic description of it which remains constant in all excels.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I guess you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index it has:
df.index[df['columnX']=="xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that occurs more than once, you get all occurrences in a list.
The last step would be to take the index+1 to get the value you want.
Hope it was helpful.
In general I would highly suggest to be more specific in your questions and provide code / examples.
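If neither the row nor the column of the label is fixed, another option is to scan every cell and take the neighbour to its right. The sketch below is only an illustration: value_right_of is a hypothetical helper, the two small frames stand in for sheets that would normally be loaded with something like pd.read_excel(path, header=None), and "Total cost" stands in for the constant description.

```python
import pandas as pd

def value_right_of(df, label):
    # Scan every cell; return the value in the next column of the first
    # row where a cell's text contains `label`. Returns None if not found.
    for row in range(df.shape[0]):
        for col in range(df.shape[1] - 1):
            if label in str(df.iat[row, col]):
                return df.iat[row, col + 1]
    return None

# Two hypothetical sheets where the labelled cell sits in different places.
sheet_a = pd.DataFrame([["", "", ""], ["Total cost", 42, ""]])
sheet_b = pd.DataFrame([["x", "Total cost", 17], ["", "", ""]])

values = [value_right_of(sheet, "Total cost") for sheet in (sheet_a, sheet_b)]
print(values)
```

The nested loop is slow for very large sheets, but for typical report-style Excel files it keeps the logic easy to follow.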

How to replace only zeroes with some conditions on dataframe

I have searched many places but still can't come up with my own logic or find a solution on the internet.
Problem:
I have a student performance dataset. While performing EDA, I came across a small problem:
why do students with zero 'absences' have zeroes in their final grades?
It is practically impossible for a student to be present the whole year and still get a zero in their finals.
So I decided to filter out all the rows with zeroes in those two columns using
dataset[(dataset['G3']==0)&(dataset['absences']==0)]
but this returned a dataframe
So i tried
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']]
which returned the two columns for the rows satisfying the condition. What I want is to replace the zeroes in the 'G3' and 'absences' columns with their respective means, without disturbing the rest of the dataframe.
I tried to replace them with
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']].replace(0,np.mean[dataset[['G3','absences']]])
which threw me error
function object cannot be subscriptable
I don't know what to do
I have tried many things but still can't get through this problem any solution may help
thanks in advance
In case you want to replace with the mean of the subset of values != 0, then you can use
import numpy as np
import pandas as pd
dataset = pd.DataFrame({'G3': np.random.randint(0,3,100),
                        'absences' : np.random.randint(0,3,100)})
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3', 'absences']] = [dataset.loc[(dataset['G3']!=0)]['G3'].mean(), dataset.loc[(dataset['absences']!=0)]['absences'].mean()]
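The same idea can be written step by step on a small, fixed dataset so the result is easy to check by hand. This is a sketch under assumptions: the five rows are invented, and the columns are cast to float up front because the means are generally fractional.

```python
import pandas as pd

# Small hypothetical dataset; rows 0 and 3 have both G3 == 0 and absences == 0.
dataset = pd.DataFrame({
    "G3": [0, 10, 14, 0, 0],
    "absences": [0, 4, 0, 0, 2],
}).astype(float)  # float columns so fractional means can be stored

# Boolean mask of the rows to repair.
both_zero = (dataset["G3"] == 0) & (dataset["absences"] == 0)

# Means over the non-zero entries of each column.
g3_mean = dataset.loc[dataset["G3"] != 0, "G3"].mean()
abs_mean = dataset.loc[dataset["absences"] != 0, "absences"].mean()

# Assign in place with .loc; only the rows selected by the mask change.
dataset.loc[both_zero, "G3"] = g3_mean
dataset.loc[both_zero, "absences"] = abs_mean
print(dataset)
```

Assigning through .loc on the original frame is what makes the change stick; calling .replace on a .loc selection, as in the question, only modifies a temporary copy.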

I want to subtract one column from another in pandas, but I keep getting a copy error. Is there a better way to do this operation?

I have a data frame TB_greater_2018 that has 3 columns: country, e_inc_100k_2000 and e_inc_100k_2018. I would like to subtract e_inc_100k_2000 from e_inc_100k_2018, use the resulting values to create a new column of the differences, and then sort by the countries with the largest difference. My current code is:
case_increase_per_100k = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018["case_increase_per_100k"] = case_increase_per_100k
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
When I run this, I get a SettingWithCopyWarning. Is there a way to do this without getting this warning? Or just overall a better way of accomplishing the task?
You can do
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
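Combining the two statements is only a tidier way to write it; it does not by itself affect the warning. SettingWithCopyWarning usually means TB_greater_2018 was created as a slice of another DataFrame, in which case creating it with .copy() avoids the warning.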
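A minimal sketch of the whole operation, assuming TB_greater_2018 was produced by filtering a larger table. The source frame tb, its three rows, and the filter condition are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical source table; the column names follow the question.
tb = pd.DataFrame({
    "country": ["A", "B", "C"],
    "e_inc_100k_2000": [50, 120, 30],
    "e_inc_100k_2018": [80, 100, 90],
})

# Taking an explicit copy of the filtered rows makes the subset an
# independent DataFrame, so adding a column to it is unambiguous.
TB_greater_2018 = tb[tb["e_inc_100k_2018"] > 60].copy()

TB_greater_2018["case_increase_per_100k"] = (
    TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
)

# Largest increase first.
top = TB_greater_2018.sort_values("case_increase_per_100k", ascending=False)
print(top.head())
```

Because of the copy, the new column is added only to TB_greater_2018 and the original table is left untouched.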

Data presentation difference in python

Hopefully a fairly simple answer to my issue.
When I run the following code:
print (data_1.iloc[1])
I get a nice, vertical presentation of the data, with each column value header, and its value presented on separate rows. This is very useful when looking at 2 sets of data, and trying to find discrepancies.
However, when I write the code as:
print (data_1.loc[data_1["Name"].isin(["John"])])
I get all the information arrayed across the screen, with the column header in 1 row, and the values in another row.
My question is:
Is there any way of using the second code, and getting the same vertical presentation of the data?
The difference is that data_1.iloc[1] returns a pandas Series whereas data_1.loc[data_1["Name"].isin(["John"])] returns a DataFrame. Pandas has different representations for these two data types (i.e. they print differently).
The reason iloc[1] gives you a Series is that you indexed it using a scalar. If you do data_1.iloc[[1]] you'll see you get a DataFrame instead. Conversely, data_1["Name"].isin(["John"]) returns a boolean Series, and indexing .loc with it returns a DataFrame of the matching rows. If you want a Series instead, you might try something like
print(data_1.loc[data_1["Name"].isin(["John"])].iloc[0])
but only if you're sure you're getting one element back.
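Two ways of getting the vertical layout from a filtered selection are sketched below. The three-column data_1 frame is invented for illustration; transposing is shown as an alternative that also copes with more than one matching row.

```python
import pandas as pd

# Hypothetical data with a single row matching "John".
data_1 = pd.DataFrame({
    "Name": ["Ann", "John"],
    "Age": [31, 45],
    "City": ["Oslo", "Rome"],
})

# Boolean-mask selection returns a DataFrame, which prints horizontally.
matches = data_1.loc[data_1["Name"].isin(["John"])]

# Option 1: take the first matching row as a Series (vertical layout).
print(matches.iloc[0])

# Option 2: transpose; each matching row becomes one column, so this
# also works when several rows match.
print(matches.T)
```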

Using Pandas, Could i detect wrong element in a fixed column and return that value?

I am new to Pandas. My goal is to detect the wrong element in a fixed column and return that row value
Here is the sample scenario
45 dollar is the wrong element in the country column, so I want to detect this value and return the row number (if possible) in my program. My first thought was to create a list and match against it, or do I need to look for an NLP solution here? Kindly help me solve this.
Some of the answer depends on how you want to validate going forward. Are you looking for any value containing a number or any value that is not an expected country?
Install pycountry and import it, then execute the code below:
[i.name for i in list(pycountry.countries)]
This gives you a list of all the countries.
Then check which values fall in the list and negate the result to get the rows that do not fall under the countries list.
import pycountry
df.Country[~df.Country.isin([i.name for i in list(pycountry.countries)])]
Note: This might not work if the country names are not written in a standard form in the column.
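To also get the row numbers asked for in the question, the same mask can be applied to the index. In this sketch the sample column and the small valid_countries set are made up; the set is a stand-in for the list built from pycountry above.

```python
import pandas as pd

# Hypothetical column containing one invalid entry.
df = pd.DataFrame({"Country": ["India", "France", "45 dollar", "Japan"]})

# Stand-in for the pycountry list; replace with the real set of names.
valid_countries = {"India", "France", "Japan", "Brazil"}

# True for every row whose value is not a known country name.
invalid_mask = ~df["Country"].isin(valid_countries)

bad_rows = df.index[invalid_mask].tolist()             # row labels of invalid entries
bad_values = df.loc[invalid_mask, "Country"].tolist()  # the offending values
print(bad_rows, bad_values)
```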
