I have a column with a name and number, and I would like to extract the information and create 3 separate columns with the info in pandas using python. I'd like to also drop the original column. What is the most efficient code to do it? Is it feasible with a single line?
The number has brackets [] around it, which I also want to drop.
Thanks so much!
I'm a noob and don't have much experience with string splitting/slicing and lambda functions in pandas.
We can use str.extract here with three capture groups, one for each component:
df[["last", "first", "number"]] = df["last_first_number"].str.extract(r'(\w+), (\w+) \[(\d+)\]')
I already found a roundabout solution to the problem, and I am sure there is a simple way.
I got two data frames with one column in each. DF1 and DF2 contains strings.
Now I try to match using .str.contains in Python; limited by my knowledge, I am forced to manually enter the substrings that I am looking for.
contain_values = df[df['month'].str.contains('Ju|Ma')]
This highlighted way is how I am able to solve the problem of matching substrings within DF1 from DF2.
The current scenario pushes me to add 100 words joined with the vertical bar right here, in str.contains('Ju|Ma').
Can anyone kindly share some wisdom on how to link in the second data frame, which contains one column with 100+ words?
Here is one way to do it. If you post an MRE, I would be able to test and share the result, but the below should work:
import re

# create a list of words to search for
w = ['ju', 'ma']

# join them into a single regex alternation pattern
s = '|'.join(w)

# search case-insensitively (flags must be passed by keyword, not positionally)
df[df['month'].str.contains(s, flags=re.IGNORECASE)]
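To pull the 100+ search terms from the second data frame instead of typing them by hand, build the pattern from its column. A sketch, assuming DF2's column is named 'word' (adjust to your actual column name):

import re

# collect the search terms from DF2's column (hypothetical column name 'word')
w = df2['word'].dropna().unique().tolist()

# escape each term so regex metacharacters are treated literally
s = '|'.join(re.escape(term) for term in w)

df[df['month'].str.contains(s, flags=re.IGNORECASE, na=False)]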
I'm trying to obtain the "DESIRED OUTCOME" shown in my image below. I have a somewhat messy way to do it that I came up with, but I was hoping there is a more efficient way this could be done using pandas? Please advise, and thank you in advance!
The problem is pretty standard, and so is its solution: group by the first column and join the data in the second column. Note that the function join is not called but passed to apply as a parameter.
df.groupby('Name')['Food'].apply(';'.join)
#Name
#Gary Oranges;Pizza
#John Tacos
#Matt Chicken;Steak
You can group by the Name column, then aggregate with the ';'.join function:
df.groupby('Name').agg({'Food': ';'.join})
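Either way the result is a Series indexed by Name; if you'd rather get a flat two-column DataFrame back, pass as_index=False (or call reset_index afterwards):

df.groupby('Name', as_index=False).agg({'Food': ';'.join})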
I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group with expressions and alias as needed. Let's try:
from pyspark.sql.functions import expr

df.groupBy("ip", "url").agg(
    expr("approx_count_distinct(ip)").alias("ip_count"),
    expr("approx_count_distinct(url)").alias("url_count"),
).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 to every group, since you are counting distinct values of one of the grouping columns, url.
If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine them into an array and then apply the function. It would be something like this:
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
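The same idea expressed with the DataFrame API rather than a SQL expression string (a sketch; it assumes sdf has the ip and url columns as described):

import pyspark.sql.functions as F

# build an array from each pair and approximate the distinct count of those arrays
sdf.agg(F.approx_count_distinct(F.array("ip", "url")).alias("distinct_count")).show()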
I'm working with a dataframe in Python using pandas. I need to use the .max() method to find the largest number in a particular column and then find its corresponding name in another column. I then need to print out a sentence that shows the largest number in the column with the name associated with it. I cannot seem to figure out how to find the corresponding name in the second column after getting the correct largest number. I believe it has something to do with the iloc or loc functions, but I'm not entirely sure, as I am still new to Python. Thanks, cheers!
You can use loc in this way to solve your problem:
df.loc[df["Num"] == max(df.Num)].color
df.loc[df["Num"] == max(df.Num)].color.values
df.loc[df["Num"] == max(df.Num)].color.values[0]
Here Num is the column that contains the numbers, and color is the column from which you want to get the value.
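Alternatively, idxmax gives you the index label of the largest value directly, so you can fetch the matching name in one step (a sketch with the same hypothetical Num/color column names):

# row label of the largest Num, then look up its color
largest = df["Num"].max()
name = df.loc[df["Num"].idxmax(), "color"]
print(f"The largest number is {largest}, which belongs to {name}.")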
I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column to df for the associated street. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note it doesn't need to be 100% accurate, just match most of them, and it can be completely different from the approach tried above.)
EDIT: Thank you to all who have tried so far; I have not resolved the issue yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is look for equal values in the STREET and ST_NAME columns and fill the rows with values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
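If case (or a "Street" vs "St" suffix) is the only mismatch, you could normalize both key columns before merging. A sketch, untested against your data, using a hypothetical temporary _key column:

# normalized helper keys so the merge is case-insensitive; dropped afterwards
df['_key'] = df['STREET'].str.upper()
street_map2['_key'] = street_map2['ST_NAME'].str.upper()

df = df.merge(street_map2[['_key', 'STREET_ID']], on='_key', how='left').drop(columns='_key')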
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer: currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered to the rows where the ST_NAME column equals the given street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0], and return the STREET_ID value.
We can then add the error tolerance you mentioned in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st", regex=False).str.startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mappings are more likely to overlap), and then check whether one starts with the other.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
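As a starting point for such a fuzzy match, the standard library's difflib can rank close candidates. A sketch under the same column-name assumptions, with an arbitrary 0.8 similarity cutoff:

import difflib

def get_street_id(street_name):
    # closest ST_NAME above the cutoff, or None if nothing comes close
    matches = difflib.get_close_matches(street_name, street_map2["ST_NAME"].tolist(), n=1, cutoff=0.8)
    if not matches:
        return None
    return street_map2.loc[street_map2["ST_NAME"] == matches[0], "STREET_ID"].iloc[0]

df["STREET_ID"] = df["STREET"].map(get_street_id)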
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except that I was unable to apply an operation to the strings while doing the merge (I am still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the third, new data frame. After this, the first and second both had a ["STREET_ID"] column. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus I got the desired final data frame with the associated street IDs.