I would like to randomly select a string of characters, without replacement, from this list of symbols: '#','+','?','!','$','*','%','#','}','>','&','^'.
The length of the generated string should equal the length of the word in another column of the CSV.
Example of an existing csv:
Word    Length
dog     3
wolf    4
cactus  6
bus     3
I would like the code to append a third column to the existing CSV file, containing a generated string equal in length to each word. This is an example of the result I want:
Word    Length  String
dog     3       #!#
wolf    4       &*%!
cactus  6       ^?!##%
bus     3       }&^
This is the code I tried, but I do not think it is right.
import random
import pandas as pd
import os
cwd = os.getcwd()
cwd
os.chdir("/Users/etcetc") #change directory
df = pd.read_csv('generatingstring.csv')
list1 = ['#','+','?','!','$','*','%','#','}','>','&','^']
list2 = df['String'] #creating a new column for the generated string
for row in df['Length']: #hope this reads each row in that column
    for n in range(1, row): #hope this reads the length value within cell
        s = random.choice(list1)
        list1.remove(s) #to ensure random selection without replacement
        list2.append(s)
I was hoping to make it read each row within the Length column, and within each row take note of how many symbols to randomly select.
Thank you!
You can try
import numpy as np
df.Word.map(lambda x : ''.join(np.random.choice(list1,len(x),replace = False)))
Out[145]:
0 &$!
1 >^$!
2 #}%?$>
3 #+!
Name: Word, dtype: object
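Note that replace=False draws each symbol at most once per word, so np.random.choice raises a ValueError if any word is longer than list1 (12 symbols here). A minimal sketch of appending the third column and writing it back, assuming the file loads into Word and Length columns as in your example:

import numpy as np
import pandas as pd

df = pd.read_csv('generatingstring.csv')
list1 = ['#','+','?','!','$','*','%','#','}','>','&','^']

# one random, no-replacement string per word (fails if a word has
# more than len(list1) characters)
df['String'] = df['Word'].map(
    lambda x: ''.join(np.random.choice(list1, len(x), replace=False)))

# overwrite the original file with the third column appended
df.to_csv('generatingstring.csv', index=False)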
Related
I have to work on a flat file (size > 500 MB) and I need to split it into two files based on one criterion.
My original file has this structure (simplified):
JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|
I need to create the two files depending on the first digit of 'CompteNum'.
Here is how I started my code:
import sys
import pandas as pd
import numpy as np
import datetime
C_FILE_SEP = "|"
def main(fic):
    pd.options.display.float_format = '{:,.2f}'.format
    FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')
It seems OK; my concern is creating my two files based on the criterion. I have tried without success:
TargetFec = 'Target_'+fic+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+'.txt'
target = open(TargetFec, 'w')
FileFec = FileFec.astype(convert_dict)  # convert_dict is defined elsewhere in my script
for row in FileFec.iterrows():
    Fec_Cpt = str(FileFec['CompteNum'])
    nb = len(Fec_Cpt)
    if (nb > 7):
        target.write(str(row))
target.close()
The result in my target file is not what I expected:
(0, JournalCode OUVERT
JournalLib JOURNAL D'OUVERTURE
EcritureNum XXXXXXXXXX
EcritureDate 20190101
CompteNum 101300
CompteLib CAPITAL SOUSCRIT
CompAuxNum
CompAuxLib
PieceRef XXXXXXXXXX
PieceDate 20190101
EcritureLib A NOUVEAU
Debit 000000000000,00
Credit 000038188458,00
EcritureLet NaN
DateLet NaN
ValidDate 20190101
Montantdevise
Idevise
CodeEtbt 100
Unnamed: 19 NaN
I expected to obtain a line in my target file when the first digit of CompteNum (CompteNum[0:1]) is greater than 7.
I have been reading posts for two days; any help would be perfect.
There is a sample of my data available here
Philippe
Following the rules and the desired format, you can use logic like:
# criteria:
verify = df['CompteNum'].apply(lambda number: str(number)[0] in ('8', '9'))
# saving the dataframes:
df[verify].to_csv('c:/users/jack/desktop/meets-criterios.csv', sep = '|', index = False)
Original comment:
As I understand it, you want to filter the imported dataframe according to some criteria. You can work directly on the DataFrame you imported. Look:
# criteria:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criteria:
df[verify] # meets the criteria
df[~verify] # does not meet the criteria
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criterios.csv')
df[~verify].to_csv('<your path>/not-meets-criterios.csv')
Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.
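If you also want the timestamped output names from your attempt, here is a sketch combining the two ideas (this assumes every CompteNum starts with a digit; 'Other_' is a hypothetical prefix for the second file):

import datetime
import pandas as pd

C_FILE_SEP = "|"
fic = 'fec.txt'  # hypothetical input file name

FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')

# compare the first digit of CompteNum as an integer
verify = FileFec['CompteNum'].astype(str).str[0].astype(int) > 7

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
FileFec[verify].to_csv('Target_' + fic + stamp + '.txt', sep=C_FILE_SEP, index=False)
FileFec[~verify].to_csv('Other_' + fic + stamp + '.txt', sep=C_FILE_SEP, index=False)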
I am trying to categorize a dataset based on the string containing the name of each object in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category']; the Category and Sub_Category columns are empty.
For each row I would like to check, in different lists of words, whether the name of the object contains at least one word from one of the lists. Based on this first check I would like to assign a value to the Category column. If it finds words in 2 different lists, I would like to assign 2 values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched in which list, in order to assign a value to the Sub_Category column.
Until now, I have been able to do it with only one list, but I am not able to identify which word was matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()  # built here so the example runs; the real dataset is loaded elsewhere
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put the categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
                      {"category":"vehicle","values":vehicle_check},
                      {"category":"art","values":art_check}])
# turn the space-delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
    # explode the list so it can be used as a join key; reset_index() keeps a copy of the original DF's index
    .explode("name").reset_index()
    # merge the exploded names on both sides
    .merge(dfcat.explode("values"), left_on="name", right_on="values")
    # where there are multiple categories, make them a list
    .groupby("index", as_index=False).agg({"category":lambda s: list(s)})
    # put the original index back...
    .set_index("index")
)
# a simple join gives the names and the list of associated categories
df.join(dfcatlist)
# simple join and have names and list of associated categories
df.join(dfcatlist)
                         name      category
0      vitrine murale vintage           NaN
1            commode ancienne  ['furniture']
2              lustre antique           NaN
3                       solex    ['vehicle']
4         sculpture médievale           NaN
5               jante voiture    ['vehicle']
6              lit et matelas  ['furniture']
7              turbine moteur           NaN
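The question also asks which word matched, for the Sub_Category column. The same merge can collect the matched words alongside the categories; a sketch, where sub_category is a hypothetical column name:

dfboth = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
    .explode("name").reset_index()
    .merge(dfcat.explode("values"), left_on="name", right_on="values")
    # keep both the category and the word that triggered the match
    .groupby("index", as_index=False).agg({"category": list, "values": list})
    .set_index("index")
    .rename(columns={"values": "sub_category"})
)
df.join(dfboth)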
I am struggling a little to do something like that:
to get this output:
The purpose is to separate a sentence into 3 parts to make some manipulations afterwards.
Any help is welcome
Select from the dataframe only the second line of each pair, which is the line
containing the separator, then use astype(str).apply(''.join, ...) to collapse the
separator word, which can be in any value column of the original dataframe, into a single string.
Iterate over each row, using split with the word[i] of the respective row; after the split,
reinsert the separator into the list, and build the desired dataframe from the
recently created lists.
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd
df = pd.read_csv("data.csv")
# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)
# get separators
word = d1.iloc[:,1:].astype(str).apply(''.join, axis=1)
strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)
dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
                        title          0          1                2
0     Very nice blue car haha  Very nice       blue         car haha
1  A beautiful green building          A  beautiful   green building
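For this shape, Python's built-in str.partition may be simpler: it splits on the first occurrence of the separator and always returns exactly three parts, so no reinsertion is needed. A sketch reusing the d1 and word variables above:

# partition returns (before, separator, after) for each title
parts = pd.DataFrame(
    [d1.iloc[i, 0].partition(word[i]) for i in range(len(d1.index))])
parts.insert(0, "title", d1["title"])
print(parts)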
I have two sets of descriptions: one in a dataframe and another that is a list of words. I need to compute the Levenshtein distance of each word in a description against each word in the list, and return the count of results where the Levenshtein distance equals 0.
import pandas as pd
definitions=['very','similarity','seem','scott','hello','names']
# initialize list of lists
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Descriptions'])
# print dataframe.
df
I want a column counting the number of words in each row for which the Levenshtein distance against some word in the dictionary is 0:
df['lev_count_0'] = # count of words in the row with a zero Lev distance to any dictionary word
So for example, the first case will be
edit_distance("hello","very") # This will be equal to 4
edit_distance("hello","similarity") # this will be equal to 9
edit_distance("hello","seem") # This will be equal to 4
edit_distance("hello","scott") # This will be equal to 5
edit_distance("hello","hello")# This will be equal to 0
edit_distance("hello","names") # this will be equal to 5
So for the first row, the result in df['lev_count_0'] should be 1, since there is just one 0 when comparing all the words in Descriptions against the list of definitions:
Description | lev_count_0
hello my name is Scott | 1
My solution
from nltk import edit_distance
import pandas as pd
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Descriptions'])
dictionary=['Hello', 'my']
def lev_dist(column):
    count = 0
    dataset = list(column.split(" "))
    for word in dataset:
        for dic in dictionary:
            result = edit_distance(word, dic)
            if result == 0:
                count = count + 1
    return count
df['count_lev_0'] = df.Descriptions.apply(lev_dist)
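As a side note, an edit distance of 0 means the two strings are identical, so the zero-distance count reduces to a case-sensitive membership test, avoiding the nested loops entirely. A sketch over the same df and dictionary:

# count words that appear verbatim in the dictionary (edit distance 0)
dict_set = set(dictionary)
df['count_lev_0'] = df['Descriptions'].apply(
    lambda s: sum(word in dict_set for word in s.split(" ")))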
I have a dataset of genes and drugs all in 1 column, looks like this:
Molecules
3-nitrotyrosine
4-phenylbutyric acid
5-fluorouracil/leucovorin/oxaliplatin
5-hydroxytryptamine
ABCB4
ABCC8
ABCC9
ABCF2
ABHD4
The dispersal of genes and drugs in the column is random, so there is no precise partitioning I can do.
I am looking to remove the genes and put them into a new column. I am wondering if I can use isupper() to select the genes and move them into a new column, although I know this only works on strings. Is there some way to select the rows with uppercase letters and put them into a new column? Any guidance would be appreciated.
Expected Output:
Column 1                                Column 2
3-nitrotyrosine                         ABCB4
4-phenylbutyric acid                    ABCC8
5-fluorouracil/leucovorin/oxaliplatin   ABCC9
5-hydroxytryptamine                     ABCF2
Read your file into a list:
with open('test.txt', 'r') as f:
lines = [line.strip() for line in f]
Separate out the uppercase entries like so:
mols = [x for x in lines if x.upper() != x]
genes = [x for x in lines if x.upper() == x]
Result:
mols
['3-nitrotyrosine', '4-phenylbutyric acid',
'5-fluorouracil/leucovorin/oxaliplatin', '5-hydroxytryptamine']
genes
['ABCB4', 'ABCC8', 'ABCC9', 'ABCF2', 'ABHD4']
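To pair the two lists side by side as in the expected output, one option is to build a dataframe from them; rows pair purely by position, and the shorter column is padded with NaN (a sketch, with hypothetical column and file names):

import pandas as pd

out = pd.DataFrame({'Column 1': pd.Series(mols), 'Column 2': pd.Series(genes)})
out.to_csv('molecules_split.csv', index=False)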
As mentioned, separating the upper case is simple:
df.loc[df['Molecules'].str.isupper()]
Molecules
5 ABCB4
6 ABCC8
7 ABCC9
8 ABCF2
9 ABHD4
df.loc[~df['Molecules'].str.isupper()]
Molecules
0 3-nitrotyrosine
1 4-phenylbutyric
2 acid
3 5-fluorouracil/leucovorin/oxaliplatin
4 5-hydroxytryptamine
However, how you want to match up the rows is unclear until you can provide additional details.
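If purely positional pairing is acceptable, here is a sketch that aligns the two filtered frames side by side (Column 1 and Column 2 are hypothetical names):

# align the filtered frames positionally; the shorter side pads with NaN
lower = df.loc[~df['Molecules'].str.isupper()].reset_index(drop=True)
upper = df.loc[df['Molecules'].str.isupper()].reset_index(drop=True)
result = pd.concat([lower.rename(columns={'Molecules': 'Column 1'}),
                    upper.rename(columns={'Molecules': 'Column 2'})], axis=1)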