I need to create a DataFrame from a loop. The idea is that the loop reads a DataFrame of texts (train_vs), searches for specific keywords ['govern', 'data'], and then calculates their frequency (term frequency, TF). What I want is an outcome of two columns, one per keyword, containing the TF of that word for each text. The code I am using is the following:
from collections import Counter
import pandas as pd

d = pd.DataFrame()
key = ['govern', 'data']
for k in key:
    for w in range(0, len(train_vs)):
        wordcount = Counter(train_vs['doc_text'].iloc[w])
        a_vs = wordcount[k] / len(train_v.iloc[w])
        temp = pd.DataFrame([{k: a_vs}])
        d = pd.concat([d, temp])
However, I am getting two columns, but with values for the first keyword and NaN for the second across the whole text column, and then NaN for the first and values for the second, again across the whole text column. So the number of rows in the resulting DataFrame is doubled.
I want to have both values next to each other.
Your help is highly appreciated.
Thanks.
From the pandas.concat documentation:
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.
What you are doing, when the loop moves on to the next key, is to try to concatenate a new df (temp) that has a single column ('data') to the old df that has a different single column ('govern'), and that's why you get the half-columns of NaNs.
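For illustration, a minimal example with made-up numbers:

import pandas as pd

a = pd.DataFrame([{'govern': 0.1}])
b = pd.DataFrame([{'data': 0.2}])

# the columns do not overlap, so each frame's missing column is filled with NaN
print(pd.concat([a, b]))
#    govern  data
# 0     0.1   NaN
# 0     NaN   0.2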
What you could do instead of concatenating millions of dataframes is to build only one dataframe, by building the columns.
d = pd.DataFrame()
key = ['govern', 'data']
for k in key:
    column = []
    for w in range(0, len(train_vs)):
        wordcount = Counter(train_vs['doc_text'].iloc[w])
        a_vs = wordcount[k] / len(train_v.iloc[w])
        column.append(a_vs)
    d[k] = column
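If doc_text already holds tokenized texts (lists of words), the explicit inner loop can also be dropped. A sketch under that assumption (note the original divides by len(train_v.iloc[w]); here the document's own token count is used as the denominator, so adjust it if train_v is something else):

from collections import Counter
import pandas as pd

key = ['govern', 'data']

# term frequency of each keyword per tokenized document
d = pd.DataFrame({
    k: train_vs['doc_text'].apply(lambda tokens, k=k: Counter(tokens)[k] / len(tokens))
    for k in key
})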
I have two CSV files, and the two files have exactly the same number of rows and columns, containing only numerical values. I want to compare the columns separately.
The idea is to compare the column 1 value of file "a" to the column 1 value of file "b", check the difference, and do the same for all the numbers in the column (there are 100 rows), then report in how many cases the difference was more than 0. So, for example, if for column 1 there were 55 numbers that didn't match between file "a" and file "b", I want to get back a value of 55 for column 1, and so on.
I would like to repeat the same for all the columns. I know it should be a double for loop, but I don't know exactly how.
Thanks in advance!
import pandas as pd

dk = pd.read_csv('C:/Users/D/1_top_a.csv', sep=',', header=None)
dk = dk.dropna(how='all')
dk = dk.dropna(how='all', axis=1)
print(dk)

dl = pd.read_csv('C:/Users/D/1_top_b.csv', sep=',', header=None)
dl = dl.dropna(how='all')
dl = dl.dropna(how='all', axis=1)
#print(dl)

rows = dk.shape[0]
print(rows)

for row in range(len(dl)):
    for col in range(len(dl.columns)):
        if dl.iloc[row, col] != dk.iloc[row, col]:
            pass  # this is where I get stuck: how do I count these per column?
I find the recordlinkage package very useful for comparing values from two datasets. You can define which columns to compare, and it returns a 0 or 1 for each pair depending on whether they match. Next, you can filter for all matching values.
https://recordlinkage.readthedocs.io/en/latest/about.html
Code looks like this:
import recordlinkage as rl
from recordlinkage.index import Block

# create pairs of records to compare
indexer = rl.Index()
indexer.add(Block('row_identifier1', 'row_identifier2'))
datasets = indexer.index(dataset1, dataset2)

# initialise comparison class
comparer = rl.Compare()

# initialise similarity measurement algorithms
comparer.string('string_value1', 'string_value2', method='jarowinkler', threshold=0.95, label='string_matching')
comparer.exact('value3', 'value4', label='integer_matching')

# the method .compute() returns the DataFrame with the feature vectors
results = comparer.compute(datasets, dataset1, dataset2)
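As a follow-up on the filtering step mentioned above, something along these lines should work with the labels defined in the snippet (hedged, since the exact columns depend on your Compare setup):

# keep only the pairs where every comparison returned 1 (a full match)
matches = results[results.sum(axis=1) == len(results.columns)]

# or, per compared column, count how many pairs did not match
mismatch_counts = (results == 0).sum()
print(mismatch_counts)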
I am not sure how to build a DataFrame here, but I am looking for a way to take the data from multiple columns and combine them into one column, not as a sum but as a joined value.
Ex. MB|Val|34567|W123 -> MB|Val|34567|W123|MB_Val_34567_W123.
What I have tried so far is creating a conditions variable that checks a particular column against the value it contains:
conditions = [(Groupings_df['GroupingCriteria1'] == 'MB')]
then a values variable that would include what I want in the new column
values = ['MB_Val_34567_W123']
and lastly grouping it
Groupings_df['GroupingColumn'] = np.select(conditions,values)
This works for 1 row but it would be inefficient to keep manually changing the number in the values variable (34567) over a df with thousands of rows
IIUC, you want to create a new column as a concatenation of each row:
df = pd.DataFrame({'GC1': ['MB'], 'GC2': ['Val'], 'GC3': [34567], 'GC4': ['W123'],
'Dummy': [10], 'Other': ['Hello']})
df['GC'] = df.filter(like='GC').astype(str).apply(lambda x: '_'.join(x), axis=1)
print(df)
# Output
GC1 GC2 GC3 GC4 Dummy Other GC
0 MB Val 34567 W123 10 Hello MB_Val_34567_W123
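A slightly shorter variant of the same idea, if you prefer agg over apply:

df['GC'] = df.filter(like='GC').astype(str).agg('_'.join, axis=1)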
My input DataFrame has a single Name column.
Below is my code for creating multiple columns from that single column: if a Name value contains 'reporting', there should be a column named reporting with a 1 for that row, and so on for the other keywords.
I am getting the correct output, but I want to write this code in a dynamic way. Are there any other ways?
import numpy as np

df['reporting'] = np.where(df['Name'].str.contains('reporting', regex=False), 1, 0)
df['update'] = np.where(df['Name'].str.contains('update', regex=False), 1, 0)
df['offer'] = np.where(df['Name'].str.contains('offer', regex=False), 1, 0)
df['line'] = np.where(df['Name'].str.contains('line', regex=False), 1, 0)
Use Series.str.findall to get all values of the list, with \b...\b for word boundaries; join them by | and pass the result to Series.str.get_dummies:
L = ["reporting","update","offer","line"]
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df = df.join(df['Name'].str.findall(pat).str.join('|').str.get_dummies())
Or process each column separately; here np.where is not necessary, convert True/False to 1/0 with Series.astype or Series.view:
for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).astype(int)

for c in L:
    df[c] = df['Name'].str.contains(c, regex=False).view('i1')
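A quick check of the first approach on a small, made-up Name column (the data below is purely illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['weekly reporting line', 'new offer', 'status update']})

L = ["reporting", "update", "offer", "line"]
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df = df.join(df['Name'].str.findall(pat).str.join('|').str.get_dummies())
print(df)  # one 0/1 column per keyword found in Name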
Make a list of keywords, iterate the list and create new columns?
import numpy as np

keywords = ["reporting", "update", "offer", "line"]
for word in keywords:
    df[word] = np.where(df['Name'].str.contains(word, regex=False), 1, 0)
I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers to that, but they still don't give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for single column but not for multiple.
Answer 2 :
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those specific columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
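A related variation that avoids the _x/_y cleanup altogether is merge's indicator flag; a sketch, assuming the same A, B and columns list as above:

import pandas as pd

# merge B against only the key columns of A, so no _x/_y suffixes appear
merged = B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)

# rows of B whose key columns never appear in A, with all of B's columns kept
only_in_b = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')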
I have a list of indices, label_index. I want to extract the corresponding values from a DataFrame, label_file, based on those indices. The values of label_index appear in the image_num column of the DataFrame, and the goal is to get a list of the corresponding values in the Thermal conductivity(W/mK) column.
label_file = pd.read_excel("/Users/yixuansun/Documents/Research/ThermalConductiviy/Anisotropic/anisotropic_porous_media/data.xlsx",
                           sheet_name="total")

label = []
for i in label_index:
    for j in range(len(label_file)):
        if i == label_file.iloc[j]["image_num"]:
            label.append(label_file.iloc[j]["Thermal conductivity(W/mK)"])
I used brute force to find the matches (two for loops), and it takes a very long time to get through. I am wondering if there is a more efficient way to do this.
Get column "Thermal conductivity(W/mK)" where "image_num" column has one of the values specified in label_index list:
series = label_file.loc[
label_file['image_num'].isin(label_index),
'Thermal conductivity(W/mK)']
EDIT 1:
For sorting by label_index you can use an auxiliary column as follows:

df = label_file.loc[
    label_file['image_num'].isin(label_index),
    ['Thermal conductivity(W/mK)', 'image_num']]

# create aux. column to sort by
df['sortbyme'] = df['image_num'].apply(lambda x: label_index.index(x))

# sort by aux. column and keep only the 'Thermal conductivity(W/mK)' column
series = df.sort_values('sortbyme').reset_index()['Thermal conductivity(W/mK)']
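Another way to keep the label_index order, assuming every value in label_index is present in image_num and the image_num values are unique, is to index by image_num directly:

series = (label_file.set_index('image_num')
          .loc[label_index, 'Thermal conductivity(W/mK)']
          .reset_index(drop=True))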
I actually found a faster and cleaner way myself.
ther = []
for i in label_index:
    ther.append(label_file.loc[i]["Thermal conductivity(W/mK)"])
This will do the work.