Arabic words in Python - python

I have a problem printing Arabic text in Python. I wrote code that converts English characters into Arabic ones (so-called chat language, or Franco Arabic) and then builds combinations of the different results to get suggestions based on user input.
def transliterate(francosentence, verbose=False):
    francowords = francosentence.split()
    arabicconvertedwords = []
    for i in francowords:
        rankeddata=[]
        rankeddata=transliterate_word(i)
        arabicconvertedwords.append(rankeddata)
        for index in range(len(rankeddata)):
            print rankeddata[index]
    ran=list(itertools.product(*arabicconvertedwords))
    for I in range(len(ran)):
        print ran[I]
The first print (print rankeddata[index]) gives Arabic words, but after the combination process is executed the second print (print ran[I]) gives something like this: (u'\u0627\u0646\u0647', u'\u0631\u0627\u064a\u062d', u'\u0627\u0644\u062c\u0627\u0645\u0639\u0647')
How can I print Arabic words?

Your second loop is operating over tuples of unicode (product yields a single product at a time as a tuple), not individual unicode values.
While print uses the str form of the object printed, tuple's str form uses the repr of the contained objects; it doesn't propagate "str-iness" (technically, tuple lacks __str__ entirely, so it falls back to __repr__).
If you want to see the Arabic, you need to print the elements individually or concatenate them so you're printing strings, not a tuple. For example, you could change:
print ran[I]
to something like:
print u', '.join(ran[I])
which will convert to a single comma-separated unicode value that print will format as expected (the str form), rather than using the repr form with escapes for non-ASCII values.
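For readers on Python 3, the same distinction can be checked directly. The sketch below is an illustration rather than part of the original answer; the tuple words stands in for one item produced by itertools.product, and its contents are copied from the escaped output in the question.

```python
# One combination, as itertools.product would yield it (a tuple of strings).
words = (
    "\u0627\u0646\u0647",
    "\u0631\u0627\u064a\u062d",
    "\u0627\u0644\u062c\u0627\u0645\u0639\u0647",
)

# tuple defines no __str__, so str() falls back to repr(), which in turn
# uses the repr() of every element (quotes included).
tuple_text = str(words)
assert tuple_text == repr(words)

# Joining produces a single string, so print shows the text itself.
joined = ", ".join(words)
print(joined)
```

In Python 3 the repr of a string keeps printable non-ASCII characters, so the \u escapes seen in the question are specific to Python 2; the parentheses and quotes around each element still appear unless the elements are joined.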
Side-note: As a point of style (and memory use), use the iterator protocol directly, don't listify everything then use C-style indexing loops. The following code has to store a ton of stuff in memory if the inputs are large (the total size of the output is the multiplicative product of the lengths of each input):
ran=list(itertools.product(*arabicconvertedwords))
for I in range(len(ran)):
    print u', '.join(ran[I])
where it could easily produce just one item at a time on demand, producing results faster with no memory overhead:
# Don't listify...
ran = itertools.product(*arabicconvertedwords)
for r in ran:  # Iterate items directly, no need for list or indexing
    print u', '.join(r)
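A Python 3 sketch of the same lazy approach, wrapped in a generator so callers can take only as many combinations as they need. The candidates data and the name joined_combinations are made up for illustration; they stand in for whatever transliterate_word returns in the question.

```python
import itertools

# Hypothetical candidate lists, one list per input word.
candidates = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]

def joined_combinations(word_candidates):
    """Lazily yield each combination as one comma-separated string."""
    for combo in itertools.product(*word_candidates):
        yield ", ".join(combo)

# Only the first two combinations are ever built here.
first_two = list(itertools.islice(joined_combinations(candidates), 2))
print(first_two)
```

Because nothing is collected into a list up front, stopping early (as islice does here) avoids producing the remaining combinations at all.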

Related

In Cpython implementation, are strings an array of characters or array of references/pointers like python lists?

I went through the post https://rushter.com/blog/python-strings-and-memory/
Based on that article,
Depending on the type of characters in a string, each character in that string is represented using 1, 2, or 4 bytes.
Since the width of each character is fixed (1, 2, or 4 bytes), we can find the address of index i using starting_pos_address + no_of_bytes*index.
But the code below kinda contradicts this model of a string being stored as a contiguous block of characters; it looks more like an array of references/pointers to individual characters/strings, since the o in both strings points to the same object:
>>> s1 = "hello"
>>> s2 = "world"
>>> id(s1[4])
140195535215024
>>> id(s2[1])
140195535215024
So, should I see string as an array of characters or array of references to character objects?
The key piece of information can be read in this answer to a similar question: "Indexing into a string creates a new string". This means both s1[4] and s2[1] create a new string, "o". Because CPython caches single-character strings like this one, both expressions end up referring to the same object in memory, which is not necessarily the character that was part of either original string.
So yes, strings are stored as arrays of characters.
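The fixed-width storage can be observed indirectly by measuring how much a string grows when one more character is appended. This is a rough sketch that assumes CPython 3.3 or later (the flexible string representation); the helper name bytes_per_char is invented for the example, and other implementations may report different sizes.

```python
import sys

def bytes_per_char(ch):
    """Measure how much one extra copy of ch adds to a string's size."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

print(bytes_per_char("a"))           # ASCII letter
print(bytes_per_char("\u0627"))      # Arabic letter alef
print(bytes_per_char("\U0001F600"))  # emoji outside the Basic Multilingual Plane
```

On a CPython 3 build these should come out as 1, 2 and 4, which matches the contiguous array model: the per-character cost depends only on the widest character in the string, not on any per-character object.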

python string to list (special list)

I'm trying to get this string into a list. How can I do that, please?
My string :
x = "[(['xyz1'], 'COM95'), (['xyz2'], 'COM96'), (['xyz3'], 'COM97'), (['xyz4'], 'COM98'), (['xyz5'], 'COM99'), (['xyz6'], 'COM100')]"
I want to convert it to a list, so that:
print(list[0])
Output : (['xyz1'], 'COM95')
If you have this string instead of a list, that presumes it is coming from somewhere outside your control (otherwise you'd just make a proper list). If the string is coming from a source outside your program, eval() is dangerous: it will gladly run any code passed to it. In this case you can use ast.literal_eval(), which is safer (but make sure you understand the warning in the docs):
import ast
x = "[(['xyz1'], 'COM95'), (['xyz2'], 'COM96'), (['xyz3'], 'COM97'), (['xyz4'], 'COM98'), (['xyz5'], 'COM99'), (['xyz6'], 'COM100')]"
l = ast.literal_eval(x)
Which gives an l of:
[(['xyz1'], 'COM95'),
(['xyz2'], 'COM96'),
(['xyz3'], 'COM97'),
(['xyz4'], 'COM98'),
(['xyz5'], 'COM99'),
(['xyz6'], 'COM100')]
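A small sketch of the difference in practice: ast.literal_eval accepts the literal structure but refuses input that contains a function call. The helper name is_rejected and the sample hostile string are made up for this illustration.

```python
import ast

x = "[(['xyz1'], 'COM95'), (['xyz2'], 'COM96')]"
parsed = ast.literal_eval(x)
print(parsed[0])

def is_rejected(text):
    """Return True if ast.literal_eval refuses to evaluate text."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return True
    return False

# A function call is not a literal, so it is never executed.
print(is_rejected("__import__('os').getcwd()"))
```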
If the structure is uniformly a list of tuples with a one-element list of strings and an individual string, you can manually parse it using the single quote as a separator. This will give you one string value every other component of the split (which you can access using a striding subscript). You can then build the actual tuple from pairing of two values:
tuples = [([a],s) for a,s in zip(*[iter(x.split("'")[1::2])]*2)]
print(tuples[0])
(['xyz1'], 'COM95')
Note that this does not cover the case where an individual string contains a single quote that needed escaping.
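The one-liner packs several steps together. An unpacked sketch of the same idea, using slicing instead of the zip(*[iter(...)]*2) idiom, might look like this (the intermediate name parts is invented here):

```python
x = "[(['xyz1'], 'COM95'), (['xyz2'], 'COM96'), (['xyz3'], 'COM97')]"

# Splitting on the single quote puts every quoted value at an odd index.
parts = x.split("'")[1::2]

# Values alternate: list element, port name, list element, port name, ...
tuples = [([a], s) for a, s in zip(parts[0::2], parts[1::2])]
print(tuples[0])
```

The same limitation applies: a value containing an escaped single quote would shift the positions and break the pairing.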
You mean convert a list-like string into a list? Maybe you can use eval().
For example:
a="[1,2,3,4]"
a=eval(a)
Then a becomes a list.
To convert it to a list, use x = eval(x).
print(list[0]) will not give you what you want, because list is a Python built-in type, not your variable.
You should do print(x[0]) to get what you want.

Trying to print keys based on their value type (int or str) from a dictionary of lists

I'm learning to access dictionary keys-values and work with list comprehensions. My assignment asks me to:
"Use a while loop that prints only variant names located in chromosomes that do not have numbers (e.g., X)."
And I'm working with this dictionary of lists, where the keys are variant names. In the zeroth element of each list value ([0]), the characters to the left of the colon are the chromosome name and the characters to the right of the colon are the chromosome location; the next element ([1]) is the gene name.
cancer_variations={"rs13283416": ["9:116539328-116539328+","ASTN2"],\
"rs17610181":["17:61590592-61590592+","NACA2"],\
"rs1569113445":["X:12906527-12906527+","TLR8TLR8-AS1"],\
"rs143083812":["7:129203569-129203569+","SMO"],\
"rs5009270":["7:112519123-112519123+","IFRD1"],\
"rs12901372":["15:67078168-67078168+","SMAD3"],\
"rs4765540":["12:124315096-124315096+","FAM101A"],\
"rs3815148":["CHR_HG2266_PATCH:107297975-107297975+","COG5"],\
"rs12982744":["19:2177194-2177194+","DOT1L"],\
"rs11842874":["13:113040195-113040195+","MCF2L"]}
I have found how to print the variant names based on the length of the zeroth element in the lists (the chromosome names):
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if (len(tmp_info[0])>3):
        print(rs)
But I'm having trouble printing the keys (the variant names) based on the TYPE of the chromosome name, the zeroth element in the list values. To that end, I've devised this code, but I'm not sure how to phrase the Boolean condition so it prints only if the chromosome name is one particular type (str or int).
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if tmp_info[0] = type.str
        print(rs)
I am not sure exactly what I'm not seeing here with my syntax.
Any help will be greatly appreciated.
If I understand you right, you want to check if the first part before : contains a number or not.
You can iterate the string character-by-character and use str.isnumeric() to check if the character is number or not. If any character is a number, continue to next item:
cancer_variations = {
    "rs13283416": ["9:116539328-116539328+", "ASTN2"],
    "rs17610181": ["17:61590592-61590592+", "NACA2"],
    "rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
    "rs143083812": ["7:129203569-129203569+", "SMO"],
    "rs5009270": ["7:112519123-112519123+", "IFRD1"],
    "rs12901372": ["15:67078168-67078168+", "SMAD3"],
    "rs4765540": ["12:124315096-124315096+", "FAM101A"],
    "rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
    "rs12982744": ["19:2177194-2177194+", "DOT1L"],
    "rs11842874": ["13:113040195-113040195+", "MCF2L"],
}
for k, (v, *_) in cancer_variations.items():
    if not any(ch.isnumeric() for ch in v.split(":")[0]):
        print(k)
Prints:
rs1569113445
You need to look up how to determine your desired classification of the data. In this case, all you need is to differentiate alphabetic data from numeric:
if tmp_info[0].isalpha():
    print(rs)
Should get you on your way.
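The checks suggested in these answers do not classify every chromosome name the same way. A small comparison sketch, using three names taken from the question's dictionary (the helper classify is invented for illustration):

```python
names = ["9", "X", "CHR_HG2266_PATCH"]

def classify(name):
    """Apply each candidate test to one chromosome name."""
    return {
        "isnumeric": name.isnumeric(),   # every character is numeric
        "isalpha": name.isalpha(),       # every character is a letter
        "has_digit": any(ch.isdigit() for ch in name),
    }

for name in names:
    print(name, classify(name))
```

The patch name contains digits and underscores, so it is neither numeric nor alphabetic; whether it should be printed depends on how "do not have numbers" in the assignment is read.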
First you need to make sure what you want to do.
If what you want is to distinguish a numeric string from a normal string, then you may want to know that a numeric string consists strictly of numeric characters; if it contains any other character, Python does not consider it numeric. You can verify this with a quick experiment:
print('23123'.isnumeric())
print('2312ds3'.isnumeric())
Results in:
True
False
Numeric strings are what you are looking to exclude, and any other string will fit, if I'm understanding correctly.
So, in that manner, we are going to iterate over the dict, using the loop you've made:
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if not tmp_info[0].isnumeric():
        print(rs)
Which results in:
rs1569113445
rs3815148
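The assignment asks specifically for a while loop, which none of the snippets here use. One possible sketch, assuming "do not have numbers" means the chromosome name contains no digits at all (so the patch name is excluded); the function name and the shortened dictionary are for illustration only:

```python
cancer_variations = {
    "rs13283416": ["9:116539328-116539328+", "ASTN2"],
    "rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
    "rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
}

def non_numeric_variants(variations):
    """Return variant names whose chromosome name contains no digits."""
    items = list(variations.items())
    names = []
    i = 0
    while i < len(items):
        rs, info = items[i]
        chromosome = info[0].split(":")[0]
        if not any(ch.isdigit() for ch in chromosome):
            names.append(rs)
        i += 1
    return names

print(non_numeric_variants(cancer_variations))
```

Swapping the condition for not chromosome.isnumeric() would also keep rs3815148, matching the output of the last answer.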

Generate wordlist with known characters

I'm looking to write a piece of code in Javascript or Python that generates a wordlist file out of a pre-defined combination of characters.
E.g.
input = abc
output =
ABC
abc
Abc
aBc
abC
AbC
ABc
aBC
I have very basic knowledge of either so all help is appreciated.
Thank you
I'll assume that you're able to import Python packages. Therefore, take a look at itertools.product:
This tool computes the cartesian product of input iterables.
For example, product(A, B) returns the same as ((x,y) for x in A for y in B).
It looks quite like what you're looking for, right? That's every possible combination from two different lists.
Since you're new to Python, I'll assume you don't know what map is. Nothing too hard to understand (note that in Python 3 it returns a lazy iterator rather than a list):
Returns a list of the results after applying the given function to each item of a given iterable (list, tuple etc.)
That's easy! So the first parameter is the function you want to apply and the second one is your iterable.
The function I applied in the map is as follows:
''.join
This way you set '' as your separator (basically no separator at all) and put together every character with .join.
Why would you want to put together the characters? Well, you'll have a list (a lot of them in fact) and you want a string, so you better put those chars together in each list.
Now here comes the hard part, the iterable inside the map:
itertools.product(*((char.upper(), char.lower()) for char in string))
First of all notice that * is the so-called splat operator in this situation. It splits the sequence into separate arguments for the function call.
Now that you know that, let's dive into the code.
Your (A, B) for itertools.product(A, B) are now (char.upper(), char.lower()). That's both versions of char, upper and lowercase. And what's char? It's an auxiliary variable that will take the value of each and every character in the given string, one at a time.
Therefore for input 'abc' char will take values a, b and c while in the loop, but since you're asking for every possible combination of uppercase and lowercase char you'll get exactly what you asked for.
I hope I made everything clear enough. :)
Let me know if you need any further clarification in the comments. Here's a working function based on my previous explanation:
import itertools
def func():
    string = input("Introduce some characters: ")
    output = map(''.join, itertools.product(*((char.upper(), char.lower()) for char in string)))
    print(list(output))
As an additional note, if you printed output directly you wouldn't get your desired output; you have to turn the map object into a list for it to be printable.
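A variant of the same idea that takes the text as a parameter and returns the list, which makes it easier to reuse or test than a function that reads from input(). The name case_variants is made up for this sketch.

```python
import itertools

def case_variants(text):
    """Return every upper/lower-case combination of the characters in text."""
    pairs = ((ch.upper(), ch.lower()) for ch in text)
    return ["".join(combo) for combo in itertools.product(*pairs)]

variants = case_variants("abc")
print(variants)
```

For a string of n letters this yields 2**n entries. Characters without distinct cases (digits, punctuation) would produce duplicate entries, so deduplicating the result may be worthwhile for such inputs.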
A simple approach using generators, and no library code. It returns a generator (iterator-like object), but can be converted to a list easily.
def lU(s):
    if not s:
        yield ''
    else:
        for sfx in lU(s[1:]):
            yield s[0].upper() + sfx
            yield s[0].lower() + sfx
print(list(lU("abc")))
Note that all the sub-lists of suffixes are not fully expanded, but the number of generator objects (each a constant size) that get generated is proportional to the length of the string.

Python 3 - Comparing two lists, finding out if el is in list, based on "starts with"

I have two lists:
items_on_queue = ['The rose is red and blue', 'The sun is yellow and round']
things_to_tweet = ['The rose is red','The sun is yellow','Playmobil is a toy']
I want to find out if an element is present on both lists based on the FEW CHARACTERS AT THE BEGINNING, and delete the element from things_to_tweet if a match is found.
The final output should be things_to_tweet = ['Playmobil is a toy']
Any idea how I can do this?
Thank you
PS/ I tried, but I cannot do an "==" comparison because each el is different in every list, even if they start the same, so they're not seen as equal by Python.
I also tried a loop inside a loop but I don't know how to compare one element with ALL the elements of another list only IF the strings start in the same manner.
I also checked other SO threads but they seem to refer to comparisons between lists when elements are exactly the same, which is not what I need.
Condition with String startswith(..)
[s for s in things_to_tweet if not any(i.startswith(s) for i in items_on_queue)]
#Output:
#['Playmobil is a toy']
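Since the question asks to delete the matches from things_to_tweet itself, one option is to assign the filtered result back through a slice, which updates the existing list object. A sketch with the data from the question; the function name drop_queued is invented here.

```python
items_on_queue = ['The rose is red and blue', 'The sun is yellow and round']
things_to_tweet = ['The rose is red', 'The sun is yellow', 'Playmobil is a toy']

def drop_queued(candidates, queued):
    """Remove, in place, every candidate that some queued item starts with."""
    candidates[:] = [
        c for c in candidates
        if not any(q.startswith(c) for q in queued)
    ]

drop_queued(things_to_tweet, items_on_queue)
print(things_to_tweet)
```

The new list is built completely before the slice assignment happens, so this avoids the usual problems of removing items from a list while iterating over it.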
To keep things simple and readable, I would make use of a helper function (I named it is_prefix_of_any). Without this function we would have two nested loops, which is needlessly confusing. Checking whether a string is a prefix of another string is done with the str.startswith function.
I also opted to create a new list instead of removing strings from things_to_tweet, because removing things from a list you're iterating over will often cause unexpected results.
# define a helper function that checks if any string in a list
# starts with another string
# we will use this to check if any string in items_on_queue starts
# with a string from things_to_tweet
def is_prefix_of_any(prefix, strings):
    for string in strings:
        if string.startswith(prefix):
            return True
    return False
# build a new list containing only the strings we want
things = []
for thing in things_to_tweet:
    if not is_prefix_of_any(thing, items_on_queue):
        things.append(thing)
print(things) # output: ['Playmobil is a toy']
A veteran would do this with much less code, but this should be a lot easier to understand.
