I am trying to get the 'id' of LinkedIn profiles using Python.
By ID, I mean from https://www.linkedin.com/in/adigup21/, it should get adigup21.
I am using this trick ID = (link.lstrip("https://www.linkedin.com/in/").rstrip('/'))
But in some cases it drops characters or returns an empty string (I always make sure the format is the same and valid).
Is there any accurate alternative present for this?
link.rstrip('/').split('/').pop()
rstrip removes the (optional) final slash, split makes an array out of the slash-separated parts, pop extracts the last element.
BTW, this is just a hack. Manipulating URL elements is best done with URL parsing, along the lines of
import urllib.parse
pth = urllib.parse.urlparse(link).path
One can then do the rstrip/split/pop thing on pth.
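For what it's worth, the reason the original trick loses characters is that lstrip() treats its argument as a set of characters to strip, not as a prefix. Here is a minimal sketch of both the failure and the URL-parsing approach. The helper name linkedin_id and the profile names "mike" and "linda" are made up for illustration.

```python
from urllib.parse import urlparse


def linkedin_id(link):
    """Return the last path segment of a profile URL (hypothetical helper)."""
    path = urlparse(link).path               # e.g. "/in/adigup21/"
    return path.rstrip("/").split("/").pop()


prefix = "https://www.linkedin.com/in/"

# lstrip() removes any leading characters that occur anywhere in `prefix`,
# so IDs made of letters such as m, i, k, e, l, n, d get eaten as well.
for name in ["adigup21", "mike", "linda"]:   # "mike" and "linda" are invented
    link = prefix + name + "/"
    broken = link.lstrip(prefix).rstrip("/")
    print(repr(broken), repr(linkedin_id(link)))
# 'adigup21' 'adigup21'
# '' 'mike'
# 'a' 'linda'
```

On Python 3.9 or later, link.removeprefix(prefix) would also strip a literal prefix, but parsing the URL copes better with variations in scheme or host.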
I have to clean a column "Country" of a DataFrame where, sometimes, the country names are followed by numbers (for example we will see "France6" instead of France). I would like to separate the country name from the number that follows it.
I coded this function to solve the problem:
def new_name2(row):
    for item in re.finditer("([a-zA-Z]*)(\d*)",row.Country):
        row.Country=item.group(1)
    return row
We can see that I created two groups, the first one to catch the country name, and the other to separate the number. Following that, I should get (France)(6).
Unfortunately, when I run it, my Country column turns empty. This means that the first group that I get is not "France" but "" and I don't understand why, because on a regex website, I can see that my expression ([a-zA-Z]*)(\d*) is working.
Your loop rewrites row.Country each time even with a zero-length match!
Instead, you could strip off the numbers directly
df["Country"] = df["Country"].str.rstrip("0123456789")
Using a dedicated Pandas string method will almost certainly be faster than a row-by-row Python loop, because the work is done column-wise rather than through a per-row Python function call.
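To see the zero-length match for yourself, here is a small sketch using only the standard library, with "France6" from the question as input. It lists every match that finditer produces and then applies the same rstrip idea to a plain string.

```python
import re

pattern = r"([a-zA-Z]*)(\d*)"

# Every match finditer yields for "France6", as (group 1, group 2) pairs.
matches = [(m.group(1), m.group(2)) for m in re.finditer(pattern, "France6")]
print(matches)  # [('France', '6'), ('', '')]

# The loop in the question keeps only the last assignment, which is the
# empty group 1 of the final zero-length match at the end of the string.
# Stripping trailing digits avoids the regex altogether.
cleaned = "France6".rstrip("0123456789")
print(cleaned)  # France
```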
Add a beginning and ending match like this:
^([a-zA-Z]*)(\d*)$
This will force it to match the entire string. Perhaps that was the problem.
If that doesn't work, try logging the regex result. Maybe your inputs are faulty.
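As a rough sketch of the anchored version: with ^ and $ the pattern can match only once, at the start of the string, so there is no trailing empty match to overwrite the result. The helper name split_country is hypothetical, and returning unexpected values unchanged is my own assumption about what you would want.

```python
import re

anchored = re.compile(r"^([a-zA-Z]*)(\d*)$")


def split_country(value):
    """Return (name, digits) for values such as "France6" (hypothetical helper)."""
    m = anchored.match(value)
    if m is None:
        return value, ""   # assumption: leave unexpected values untouched
    return m.group(1), m.group(2)


print(split_country("France6"))  # ('France', '6')
print(split_country("France"))   # ('France', '')
```

Note that a name containing a space or punctuation will not match [a-zA-Z]* and falls through to the untouched branch.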
I'm trying to modularize a type of report from the API. This is my query for the request:
content = ['CampaignId', 'AdvertisingChannelType', ...]
report_query = (adwords.ReportQueryBuilder()
                .Select(content)
                .From('CAMPAIGN_PERFORMANCE_REPORT')
                .During(start_date=since, end_date=until)
                .Build())
However, I'm having a problem with the .Select() statement, since its common usage is .Select('CampaignId', 'AdvertisingChannelType', ...) (like the list but without the brackets []), and in my query I'm passing the arguments as a single list, which of course returns an error.
My question is: how can I pass the elements of content as required? I've tried turning the list into a string, but that doesn't work because the whole list becomes a single element. I can't assign the elements by hand since their number may vary (this will be used for more than one client).
Any help will be appreciated. Thanks!
I'm not sure exactly if this is helpful, but maybe try looking into python maps.
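Another thing that may be worth trying is Python's argument unpacking: putting * in front of a list in a call passes each element as a separate positional argument. The sketch below uses a made-up QueryBuilder class as a stand-in, since I can't check the real ReportQueryBuilder here. It assumes Select() accepts a variable number of field names, as the usage described in the question suggests.

```python
class QueryBuilder:
    """Hypothetical stand-in for a builder whose Select() takes *fields."""

    def __init__(self):
        self.fields = []

    def Select(self, *fields):
        # Each positional argument arrives as one element of the tuple.
        self.fields.extend(fields)
        return self


content = ['CampaignId', 'AdvertisingChannelType']

# Passing the list itself gives Select() a single argument (the whole list).
wrong = QueryBuilder().Select(content)
# The * operator unpacks the list into separate positional arguments.
right = QueryBuilder().Select(*content)

print(wrong.fields)  # [['CampaignId', 'AdvertisingChannelType']]
print(right.fields)  # ['CampaignId', 'AdvertisingChannelType']
```

If that assumption holds, the query in the question would only need .Select(*content) in place of .Select(content), and the list can be any length.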
I am relatively new to Python and very new to NLP (and nltk). I have searched the net for guidance but have not found a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I could first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
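This will create a list of every whitespace-separated token in the string that consists only of digits (a number with no space before the following text is skipped):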
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can use the list to assign the values to the new column.
For example, write the first number in the list to the 5th column (list indexes start at 0, so the first number is new_[0]):
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])
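Since the question mentions that the number is sometimes glued to the keyword, a regular expression may be a better fit than split(). The following is only a sketch: the helper name number_before and the sample sentence are invented, and misspelled keywords are not handled.

```python
import re


def number_before(text, keyword):
    """Return the number right before keyword, or None (hypothetical helper)."""
    # \d+(?:\.\d+)? matches an integer or a decimal; \s* allows zero or more
    # spaces between the number and the keyword ("3 pallets" or "3pallets").
    pattern = r"(\d+(?:\.\d+)?)\s*" + re.escape(keyword)
    m = re.search(pattern, text, flags=re.IGNORECASE)
    if m is None:
        return None
    number = m.group(1)
    return float(number) if "." in number else int(number)


story = "The crew found 12 crates and 3pallets near dock 7."  # invented text
print(number_before(story, "crates"))   # 12
print(number_before(story, "pallets"))  # 3
print(number_before(story, "barrels"))  # None
```

The returned value could then be written to the new column for that row, leaving the cell empty when the result is None.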
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!
So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
    name1 = String, datenum, number, list, array
    name2 = String, datenum, number, list, array
    // . . .
    name# = String, datenum, number, list, array
    GROUP = mySubGroup
        name1 = String, datenum, number, list, array
    END_GROUP = mySubGroup
    // More names could go here
END_GROUP = myGroup
GROUP = myGroup2
    // etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (i.e. "myString")
Numbers are written as raw ASCII-encoded numbers. They also use the E format if they are large or small (i.e. 5.023E-6)
Lists are comma separated and enclosed in parentheses (i.e. (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc. values within the data structure that are numbers
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
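One possible starting point for both questions, offered as a sketch rather than a definitive design: hold the data in nested dictionaries and parse with a stack of currently open groups, which is a common way to handle begin/end markers. It uses only the standard library. The function names, the treatment of // lines as comments, and the variable names in the sample (title, scale, ids, count) are all assumptions of mine.

```python
def parse_value(raw):
    """Convert one right-hand side into a Python value (assumed rules)."""
    raw = raw.strip()
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]                  # quoted string or date, kept as text
    if raw.startswith("(") and raw.endswith(")"):
        inner = raw[1:-1].strip()
        # Assumes list items contain no commas or nested parentheses.
        return [parse_value(item) for item in inner.split(",")] if inner else []
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)                 # also accepts E notation (5.023E-6)
    except ValueError:
        return raw                        # unknown type: keep the raw text


def parse_groups(lines):
    """Parse GROUP/END_GROUP blocks into nested dictionaries."""
    root = {}
    stack = [("<root>", root)]            # (group name, dict) per open group
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("//"):
            continue                      # skip blank and comment lines
        name, sep, raw = line.partition("=")
        if not sep:
            raise ValueError("line %d: expected 'name = value'" % number)
        name, raw = name.strip(), raw.strip()
        if name == "GROUP":
            child = {}
            stack[-1][1][raw] = child     # attach to the innermost open group
            stack.append((raw, child))
        elif name == "END_GROUP":
            if len(stack) == 1 or stack[-1][0] != raw:
                raise ValueError("line %d: unexpected END_GROUP %s" % (number, raw))
            stack.pop()
        else:
            stack[-1][1][name] = parse_value(raw)
    if len(stack) != 1:
        raise ValueError("missing END_GROUP for %s" % stack[-1][0])
    return root


sample = '''
GROUP = myGroup
    title = "myString"
    scale = 5.023E-6
    ids = (1,2,3,4)
    GROUP = mySubGroup
        count = 3
    END_GROUP = mySubGroup
END_GROUP = myGroup
'''
data = parse_groups(sample.splitlines())
data["myGroup"]["scale"] *= 2                      # numbers behave as numbers
print(data["myGroup"]["mySubGroup"]["count"] + 1)  # 4
print(data["myGroup"]["ids"])                      # [1, 2, 3, 4]
```

Writing a file back out would be the reverse walk over the dictionaries, and attribute-style access (dataStructure.groupName.varName) could be layered on top with a small wrapper class once the dictionary version works.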
I have written a Python script in which I have collected some values in a list. I need to pass these values to a URL in a loop, picking up a different value each time.
i.e., I want to achieve this:
http://www.abc.com/xyz/pqr/symbol=something[i].
Here "something" is a list and I have verified that it contains the proper values. However when I pass the values to the URL, I am not getting the desired results. I have tried with URL encoding for something[i] but still it is not giving me proper results. Can someone help me?
EDIT: My example script at the moment is:
import json
script=["Linux","Windows"]
for i in xrange(len(script)):
    site="abc.com/pqr/xyz/symbol=json.dumps(script[i])";
    print site
I think the problem is your approach to formatting. You don't really need json if you have a list already and are just trying to modify a URL...
script = ["Linux", "Windows"]
something = ["first", "second"]
# Pair each script name with the value to insert into the URL.
for i, j in zip(script, something):
    site = "http://abc.com/pqr/xyz/symbol={0}".format(j)
    print("{0} {1}".format(i, site))
This uses the .format() method, which inserts the values in parentheses into the string at the positions marked with {}. You could just concatenate the strings if the value is always at the end. You could also use the older % operator instead. It does pretty much the same thing, but in this case it inserts the string j at the position marked by %s:
site = "http://abc.com/pqr/xyz/symbol=%s" % (j)
Side note: I slightly prefer % because once you learn it, it can also be used in other programming languages, but .format() has more options and has been the recommended way to do it since Python 2.6.
Output:
Linux http://abc.com/pqr/xyz/symbol=first
Windows http://abc.com/pqr/xyz/symbol=second
You should be able to get close to what you want from this starting point, but if this is nothing like your desired output, then you need to clarify in your question...
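Since the question also mentions URL encoding, here is a short Python 3 sketch that combines string building with urllib.parse.quote. The base URL follows the shape given in the question, and "Mac OS" is an invented value included only to show what the encoding does.

```python
from urllib.parse import quote

base = "http://www.abc.com/xyz/pqr/symbol="   # URL shape taken from the question
symbols = ["Linux", "Windows", "Mac OS"]      # "Mac OS" is an invented value

# quote() percent-encodes characters that are not safe in a URL, such as spaces.
urls = [base + quote(symbol) for symbol in symbols]

for url in urls:
    print(url)
# http://www.abc.com/xyz/pqr/symbol=Linux
# http://www.abc.com/xyz/pqr/symbol=Windows
# http://www.abc.com/xyz/pqr/symbol=Mac%20OS
```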