Basically, I need to find strings in a dataframe (loaded from a CSV) that have one part in common, namely the last two characters. I am using Python.
I need some help with that.
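A minimal sketch of one way to do this in pandas, grouping strings by their last two characters (the column name and sample data here are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical data; in practice this would come from pd.read_csv("file.csv")
df = pd.DataFrame({"name": ["alpha_01", "beta_01", "gamma_02", "delta_02"]})

# Key each string by its last two characters, then group on that key
df["suffix"] = df["name"].str[-2:]
groups = {suffix: list(grp["name"]) for suffix, grp in df.groupby("suffix")}
print(groups)  # strings sharing the same last two characters end up together
```

`str[-2:]` on a pandas string accessor slices every value at once, so no explicit loop is needed.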
I have some problems with encoding in Python. I've searched for an answer for a couple of hours now and still no luck.
I am currently working on Jupyter notebook with Python dataframes (pandas).
Long story short: in a dataframe column I have different strings, single letters from the alphabet. I wanted to apply a function to this column that converts letters to numbers based on a specific key, but I got an error every time I tried. When I dug for the reason behind this I realised that:
I have two strings 'T'. But they are not equal.
string1.encode() = b'T'
string2.encode() = b'\xd0\xa2'
How can I standardize/encode/decode/modify all strings to have the same coding/basis so I can compare them and make operations on them? What is the easiest way to achieve that?
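The bytes `b'\xd0\xa2'` are the UTF-8 encoding of Cyrillic Т (U+0422), which looks identical to Latin T but is a different character. One sketch of a fix: translate common Cyrillic homoglyphs to their Latin look-alikes before comparing (the mapping table below covers only the usual look-alike pairs and is an assumption, not an exhaustive list):

```python
# Map the visually identical Cyrillic letters to their Latin counterparts
HOMOGLYPHS = str.maketrans("АВЕКМНОРСТХасеорух", "ABEKMHOPCTXaceopyx")

def normalize(s: str) -> str:
    return s.translate(HOMOGLYPHS)

string1 = "T"        # Latin T (U+0054)
string2 = "\u0422"   # Cyrillic Т (U+0422)
print(string1 == string2)                        # False: different code points
print(normalize(string1) == normalize(string2))  # True after normalization
```

Applying `normalize` to the whole column (e.g. `df["col"].map(normalize)`) before the letter-to-number lookup should make the two T's compare equal.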
I just started learning Python and came across slicing in Python 3. I'm working on a problem where I need to collect the alternating characters of a given string, collect the characters that remain after that removal, and finally combine the two parts.
I tried it this way. Is there a way to make this code faster?
Given "1n2a3m4e5i$s", the expected output is "nameis,12345$".
s = "1n2a3m4e5i$s"  # this is given to you; avoid naming it `str`, which shadows the built-in
evens = s[0::2]  # characters at even indices: "12345$"
odds = s[1::2]   # characters at odd indices: "nameis"
print(odds + "," + evens)  # this is required at the end: "nameis,12345$"
I have two tasks to do.
1) I have to extract the headers of any CSV file containing invoice data.
Specifically: invoice number, address, location, physical good.
I have been asked to create a text classifier for this task, so the classifier will go over any CSV file and identify those 4 headers.
2) After the classifier identifies the 4 words, I have to attach the data of that column and create a class.
I researched the matter, and the three methodologies I thought might be appropriate are:
1) bag of words
2) word embeddings
3) k-means clustering
Bag of words can identify the word, but it does not give me the location of the word itself, so I can't go and grab the column and create the class.
Word embeddings are overcomplicated for this task, I believe, and even if they gave me the position of the word in the file, they would be too time-consuming for this.
K-means seems simple and effective, and it tells me where the word is.
My question before I start coding:
Did I miss something? Is my reasoning correct?
And, most important, the second question:
Once the position of the word is identified in the CSV file, how do I translate that into code so I can attach the data in that column?
I would simply:
look at the first line of the file (the header);
filter out the column names you are looking for, using enumerate so the result contains the column indices;
retrieve the column indices from the filtered result;
iterate over the rest of the file;
use those indices to extract the specific columns' data from each line/row;
put that data in a container for later use (maybe a list).
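The steps above can be sketched with the standard `csv` module (the header names, the sample rows, and the `io.StringIO` stand-in for a real file are all assumptions for illustration):

```python
import csv
import io

WANTED = {"invoice number", "address", "location", "physical good"}

# Stand-in for an open file; a real script would use open("invoices.csv")
data = io.StringIO(
    "invoice number,date,address,location,physical good\n"
    "INV-1,2020-01-01,1 Main St,Springfield,widget\n"
    "INV-2,2020-01-02,2 Oak Ave,Shelbyville,gear\n"
)

reader = csv.reader(data)
header = next(reader)  # look at the first line: the header

# enumerate gives (index, name) pairs; keep the indices of the wanted columns
indices = [i for i, name in enumerate(header) if name.lower() in WANTED]

# iterate over the rest of the file, extracting only those columns per row
rows = [[row[i] for i in indices] for row in reader]
print(rows)
```

If the headers can be misspelled or phrased differently across files, the exact-match test `name.lower() in WANTED` would be replaced by whatever classifier or fuzzy match is chosen; the index-based extraction stays the same.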
I am currently working on a project about satellites and visualizing particular data about satellites from each country. For the data I am using Microsoft Excel. When loading the data, everything is fine, except that 4 of my columns (which contain only numeric data) are loaded as strings rather than numbers. I checked each cell of those columns to see if they contain any stray characters, but I couldn't find anything. Below are the columns that are not read as numeric. Any solutions?
Use Excel to re-save your data as comma-separated values file (CSV). Ensure that numeric fields use period instead of comma as decimals separator (e.g. your Inclination and Period columns) as some programs might have trouble with that. Ensure there are no spaces in numeric values (your Launch mass column) and don't use thousands separators.
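Alternatively, the columns can be cleaned after loading. A sketch in pandas, assuming the cleanups described above (the column names match the answer's examples, but the sample values are invented):

```python
import pandas as pd

# Hypothetical raw values as Excel might export them
df = pd.DataFrame({
    "Launch mass": ["1 500", "2 300"],   # spaces used as thousands separators
    "Inclination": ["51,6", "97,4"],     # comma as decimal separator
})

# Strip spaces, swap decimal commas for periods, then convert to numbers
df["Launch mass"] = pd.to_numeric(df["Launch mass"].str.replace(" ", ""))
df["Inclination"] = pd.to_numeric(df["Inclination"].str.replace(",", "."))
print(df.dtypes)  # both columns are now numeric
```

Passing `errors="coerce"` to `pd.to_numeric` would turn any remaining unparseable cells into `NaN` instead of raising, which helps locate the offending values.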
I have a column in a CSV file that has names such that each cell in that column could be the same as a slightly misspelled cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from
each cell if that value occurs more than 2 times in the column as prefixes or
suffixes. I then compared one row to the next after sorting alphabetically in
this column.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they
are not flagged as such. These cases are:
a) not exact matches (and not off just by capitalization);
b) not caught by comparing the stem (without prefix and without suffix) to
its alphabetical neighbor.
My Current Questions
1) Does anyone have experience mapping IDs to names from all over the world
(so accents, Unicode and all that are an issue here too, although I managed
to solve most of the Unicode issues)
and good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I can see why they
are duplicates. In one instance there is a period in the middle of a line that
is not present in its period-free sibling cell. Is one good strategy simply to
create an extra column and output the cell values that I suspect of being
duplicates, based on the few instances where I understand the cause?
3) How do I check myself? One way is to flag the maximum number of potential
duplicates and look over all of these manually. Unfortunately, the size of our
dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove the international symbols, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance; e.g. the Fuzzy package, http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
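A minimal sketch using only the standard library: `difflib.SequenceMatcher` as a stand-in for Levenshtein-style similarity (the names, the threshold value, and the pairwise scan are assumptions to illustrate the approach, and the threshold would need tuning against known duplicates):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; case-folded so "Inc"/"inc" don't count as differences
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

names = ["Nike", "Nike inc.", "Nike Inc", "Adidas"]
THRESHOLD = 0.6  # assumed cutoff; tune it on pairs you know are duplicates

# Flag candidate duplicate pairs for manual review
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if similarity(a, b) >= THRESHOLD]
print(pairs)
```

For the dataset sizes described above, an all-pairs scan is quadratic; combining it with the existing alphabetical-neighbor pass (only scoring nearby names after sorting) keeps it tractable.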