Problem of regex to separate country names with numbers that follow them - python

I have to clean a column "Country" of a DataFrame where, sometimes, the country names are followed by numbers (for example we will see "France6" instead of France). I would like to separate the country name from the number that follows it.
I coded this function to solve the problem:
def new_name2(row):
    for item in re.finditer("([a-zA-Z]*)(\d*)",row.Country):
        row.Country=item.group(1)
    return row
We can see that I created two groups, the first one to catch the country name, and the other to separate the number. Following that, I should get (France)(6).
Unfortunately, when I run it, my Country column turns empty. This means that the first group that I get is not "France" but "" and I don't understand why, because on a regex website, I can see that my expression ([a-zA-Z]*)(\d*) is working.

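Your loop overwrites row.Country on every match, including the zero-length match that the pattern produces at the end of the string, so the last value assigned is "".
Instead, you could strip the trailing digits directly: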
df["Country"] = df["Country"].str.rstrip("0123456789")
Using a dedicated pandas string method will almost certainly be much faster than a plain Python loop over the rows, because it is vectorized.
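To see why the column ends up empty, it can help to list every match that finditer produces. This is a minimal sketch using only the re module; the sample value "France6" comes from the question, and strip_trailing_digits is a hypothetical helper name, not part of any library.

```python
import re

pattern = re.compile(r"([a-zA-Z]*)(\d*)")

# Both groups are optional, so the pattern also matches the empty string
# at the very end of the input. That final match overwrites the first one.
groups = [m.group(1) for m in pattern.finditer("France6")]
print(groups)  # ['France', '']


def strip_trailing_digits(name):
    """Hypothetical helper: drop any digits at the end of a country name."""
    return re.sub(r"\d+$", "", name)


print(strip_trailing_digits("France6"))  # France
```

For a whole column, the str.rstrip call above does the same job without a Python-level loop.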

Add a beginning and ending match like this:
^([a-zA-Z]*)(\d*)$
This forces the pattern to match the entire string, which rules out the zero-length match at the end of the string. Perhaps that was the problem.
If that doesn't work, try printing the regex result for each row. Maybe your inputs are not what you expect.

Related

Remove spaces from strings in pandas DataFrame not working

I am trying to remove spaces from a column of strings in a pandas DataFrame. I successfully did it using this same method in another section of the code.
for index, row in summ.iterrows():
    row['TeamName'] = row['TeamName'].replace(" ", "")
summ.head() shows no change to the column of strings after this operation, but no error is raised either.
I have no idea why this issue is happening considering I used this exact same method later in the code and accomplished the task successfully.
Why not use str.replace:
df["TeamName"] = df["TeamName"].str.replace(r' ', '', regex=False)
I may be proven wrong here, but I am wondering if it's because you are iterating over the DataFrame and working on a copy, so the changes never reach the original data. This is what I found in the pandas.DataFrame.iterrows documentation:
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
just a thought... hth
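The copy behaviour quoted above can be checked with a small experiment. This is a sketch on made-up data (the team names and the Wins column are assumptions); the mixed column types mean iterrows() has to build a new Series for every row, so writing to it cannot reach the DataFrame.

```python
import pandas as pd

# Hypothetical sample data with mixed column types (strings and integers).
summ = pd.DataFrame({"TeamName": ["Red Sox", "Blue Jays"], "Wins": [92, 89]})

for index, row in summ.iterrows():
    # This only changes the temporary row Series, not summ itself.
    row["TeamName"] = row["TeamName"].replace(" ", "")

unchanged = summ["TeamName"].tolist()
print(unchanged)  # ['Red Sox', 'Blue Jays']

# Assigning the result of a vectorized string method back to the column works.
summ["TeamName"] = summ["TeamName"].str.replace(" ", "", regex=False)
cleaned = summ["TeamName"].tolist()
print(cleaned)  # ['RedSox', 'BlueJays']
```

This would also explain why the same loop appeared to work elsewhere in the code: whether the row is a view or a copy depends on the column types of the DataFrame being iterated.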

Unable to get data from string in python

I am trying to get the 'id' of LinkedIn profiles using Python.
By ID, I mean from https://www.linkedin.com/in/adigup21/, it should get adigup21.
I am using this trick ID = (link.lstrip("https://www.linkedin.com/in/").rstrip('/'))
But for some cases, it misses out on characters or is blank (I always make sure the format is same and good)
Is there any accurate alternative present for this?
link.rstrip('/').split('/').pop()
rstrip removes the (optional) final slash, split makes a list out of the slash-separated parts, and pop extracts the last element.
BTW, this is just a hack. Manipulating URL elements is best done with URL parsing, along the lines of
pth=urllib.parse.urlparse(link).path
One can then do the rstrip/split/pop thing on pth.
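The reason the original trick loses characters is that lstrip treats its argument as a set of characters to remove, not as a prefix. The sketch below shows the failure and the URL-parsing alternative; the profile name "smith" and the function name profile_id are made up for illustration.

```python
from urllib.parse import urlparse

prefix = "https://www.linkedin.com/in/"
link = "https://www.linkedin.com/in/smith/"

# Every letter of "smith" also occurs somewhere in the prefix,
# so lstrip removes the whole ID and leaves an empty string.
broken = link.lstrip(prefix).rstrip("/")
print(repr(broken))  # ''


def profile_id(url):
    """Return the last path segment of a profile URL (hypothetical helper)."""
    path = urlparse(url).path  # e.g. "/in/smith/"
    return path.rstrip("/").split("/")[-1]


print(profile_id(link))  # smith
print(profile_id("https://www.linkedin.com/in/adigup21/"))  # adigup21
```

Because urlparse separates the path from any query string, a link with trailing parameters still yields only the ID.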

How can I remove certain strings from a list based on the strings in another list, if those strings differ slightly? More info below

This seems like a pretty rudimentary question, but I'm wondering because the items in these lists change every so often when a website is scraped...
employees = ['leadership(x)', 'drivers(y)', 'trainers(z)']
Where x,y,z are the number of employees in those specific roles, and are the values that change every so often.
If I know that the strings will always be 'leadership' 'drivers' and 'trainers', just with a difference in what's in between the parentheses, how can I dynamically remove these strings without having to hardcode it every week that I run the program?
The obvious but not so successful solution is...
employees = ['leadership(x)', 'drivers(y)', 'trainers(z)']
unwanted = ['leadership(x)', 'drivers(y)', 'trainers(z)']
for i in unwanted:
    if i in employees:
        employees.remove(i)
This of course fails because the values are hardcoded and are bound to change. Any help with this would be greatly appreciated!
You could do something like
unwanted_prefixes = ['leadership', 'drivers', 'trainers']
unwanted = [s for s in employees if s.split('(')[0] in unwanted_prefixes]
This collects every string that is exactly one of those three prefixes, or one of them immediately followed by an opening parenthesis.
A more complicated solution, in case that one deletes strings that you want to keep, follows roughly the same idea but uses a regex:
import re
unwanted_re = re.compile(r'(leadership|drivers|trainers)\(\d+\)')
unwanted = [x for x in employees if unwanted_re.fullmatch(x)]
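Both snippets only build the list of unwanted strings; the removal itself can be done in the same pass by keeping everything that does not match. A small sketch, where the counts and the extra 'dispatch(3)' entry are invented sample data:

```python
import re

# Hypothetical scraped values; the numbers change from week to week.
employees = ["leadership(12)", "drivers(40)", "trainers(7)", "dispatch(3)"]

unwanted_re = re.compile(r"(leadership|drivers|trainers)\(\d+\)")

# Keep only the entries that are not one of the known roles.
employees = [s for s in employees if not unwanted_re.fullmatch(s)]
print(employees)  # ['dispatch(3)']
```

Building a new list avoids calling remove() on a list while looping over it, which can skip elements.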

Pandas add column to new data frame at associated string value?

I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the column STREET_ID at the associated street to df. I have tried a few approaches but my inexperience with pandas and the comparison of strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note: it doesn't need to be 100% accurate, it just needs to match most streets, and the approach can be completely different from the one tried above.)
EDIT: Thank you to all who have tried so far. I have not resolved the issue yet. Here is some more data:
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
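What this does: it looks for equal values in the STREET and ST_NAME columns and, for each matching pair of rows, produces one row with the columns of both dataframes. By default this is an inner join, so rows of df with no matching ST_NAME are dropped; pass how='left' to keep them.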
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
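Since the strings have to match exactly, one option is to build a normalized key column on both sides before merging. The following is a sketch on invented data (the street names, IDs, and the street_key column name are all assumptions), using a left join so that unmatched rows of df are kept with a missing STREET_ID.

```python
import pandas as pd

# Hypothetical sample data standing in for the two dataframes.
df = pd.DataFrame({"STREET": ["Main St", "OAK AVE ", "Unknown Rd"]})
street_map2 = pd.DataFrame(
    {"ST_NAME": ["MAIN ST", "OAK AVE"], "STREET_ID": [101, 102]}
)

# Normalize case and surrounding whitespace into a temporary key column.
df["street_key"] = df["STREET"].str.upper().str.strip()
lookup = street_map2.assign(
    street_key=street_map2["ST_NAME"].str.upper().str.strip()
)[["street_key", "STREET_ID"]].drop_duplicates(subset="street_key")

# A left join keeps every row of df; unmatched streets get NaN as STREET_ID.
merged = df.merge(lookup, on="street_key", how="left").drop(columns="street_key")
print(merged)
```

Dropping duplicate keys from the lookup first keeps the merge from multiplying rows when a street name appears more than once in street_map2.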
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered by where the ST_NAME column is the same as the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first row of that with iloc[0], and return its STREET_ID value. Note that iloc[0] raises an IndexError when no row matches, so unmatched street names need separate handling.
We can then add the error tolerance that you asked for in your question by changing the filter:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st", regex=False).str.startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the two spellings are more likely to agree), and then check whether the ST_NAME value starts with the street name.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
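As a starting point for such a "fuzzy" match, the standard library's difflib can pick the closest known street name. This is only a sketch: the street names are invented, closest_street is a hypothetical helper, and the 0.6 cutoff is difflib's default rather than a tuned value.

```python
import difflib

# Hypothetical list of known names, e.g. taken from street_map2["ST_NAME"].
st_names = ["MAIN ST", "OAK AVE", "ELM BLVD"]


def closest_street(name, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_street("Main Street", st_names))  # MAIN ST
print(closest_street("Zzzzz", st_names))        # None
```

Returning None for poor matches lets the caller decide how to treat streets that have no reasonable counterpart, instead of forcing a wrong ID onto them.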
Alright, I managed to figure it out, but the solution probably won't be too helpful unless you are in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except that I was unable to apply an operation to the strings while doing the merge (I am still not sure whether there is a way to do it). I found another dataset whose street names were formatted similarly to the first. I then merged the first data frame with this third one. After this, both the first and the second had columns ["STREET_ID"]. Finally, I managed to merge the second one with the combined one using:
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
This gives the desired final data frame with the associated street IDs.

python parsing file into data structure

So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
    name1 = String, datenum, number, list, array
    name2 = String, datenum, number, list, array
    // . . .
    name# = String, datenum, number, list, array
    GROUP = mySubGroup
        name1 = String, datenum, number, list, array
    END_GROUP = mySubGroup
    // More names could go here
END_GROUP = myGroup
GROUP = myGroup2
    // etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (ie "myString")
Numbers are written as a raw ascii encoded number. They also use the E format if they are large or small (ie 5.023E-6)
Lists are comma separated and enclosed in parentheses (ie (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc. values within the data structure that are numbers
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
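One possible shape for this, following the idea from the initial attempt, is nested dictionaries filled by a parser that keeps a stack of the currently open groups. The sketch below uses only the standard library and rests on several assumptions that are not in the format description: lines starting with // are treated as comments, lists are flat, list items contain no commas, and group names are unique within their parent. The function names parse_value and parse_groups are made up.

```python
import re

NUMBER_RE = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?")
INT_RE = re.compile(r"[-+]?\d+")


def parse_value(text):
    """Turn the text to the right of '=' into a Python value."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]  # quoted string or date, kept as str
    if text.startswith("(") and text.endswith(")"):
        inner = text[1:-1].strip()
        # Simplification: no nested lists, no commas inside quoted items.
        return [parse_value(part) for part in inner.split(",")] if inner else []
    if INT_RE.fullmatch(text):
        return int(text)
    if NUMBER_RE.fullmatch(text):
        return float(text)  # also handles E notation such as 5.023E-6
    return text  # unrecognized: keep the raw text


def parse_groups(lines):
    """Build nested dicts, using a stack to track the currently open groups."""
    root = {}
    stack = [(None, root)]  # (group name, dict being filled)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # assumption: blank lines and // lines carry no data
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got {line!r}")
        key, value = key.strip(), value.strip()
        if key == "GROUP":
            child = {}
            stack[-1][1][value] = child
            stack.append((value, child))
        elif key == "END_GROUP":
            if len(stack) == 1 or stack[-1][0] != value:
                raise ValueError(f"unexpected END_GROUP = {value}")
            stack.pop()
        else:
            stack[-1][1][key] = parse_value(value)
    if len(stack) != 1:
        raise ValueError(f"group {stack[-1][0]!r} was never closed")
    return root


sample = """
GROUP = myGroup
    name1 = "hello"
    scale = 5.023E-6
    GROUP = mySubGroup
        ids = (1,2,3,4)
    END_GROUP = mySubGroup
END_GROUP = myGroup
""".splitlines()

data = parse_groups(sample)
data["myGroup"]["scale"] *= 2  # numbers are real floats, so arithmetic works
print(data)
```

Plain dictionaries keep the sketch short and easy to write back to a file; dot-style access such as dataStructure.groupName.varName could be layered on top later, for example with a small wrapper class.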
