I'm attempting to merge multiple sets of word data. Each CSV file that is read in (there are 4 files) contains a column for each unique word in a book and a column for how many times that word shows up. The word columns of all of these CSV files are supposed to merge into one in a new matrix file I'm trying to create, but when I attempt to merge each CSV file and its data, an empty data frame is returned.
The csv files are like:
Word    Count
Thou    100
O       20
Hither  8
and I want them to merge like this:
Word   Book1  Book2  Book3
Thou   50     0      88
Hello  32     35     27
No     89     38     0
Yes    80     99     0
import os
from os import listdir
from os.path import isfile, join
import pandas as pd

dataPath = 'data/'
fileNames = [f for f in listdir(dataPath) if isfile(join(dataPath, f))]
columns = [os.path.splitext(x)[0] for x in fileNames]
columns.remove('rows')
columns.remove('cols')
columns.remove('matrix')
columns.insert(0, "Word")
wordData = []
matrix = pd.DataFrame(columns=columns)

for file in fileNames:
    if '.txt' in file:
        continue
    elif 'matrix' in file:
        continue
    else:
        myFile = open(f"./data/{file}", "r")
        readFile = myFile.read()
        dataVector = pd.read_csv(f"./data/{file}", sep=",")
        #print(dataVector)
        matrix.merge(dataVector, how="outer", on=["Word"])
        print(matrix)
        myFile.close()

pd.set_option("display.max_rows", None, "display.max_columns", None)
matrix = matrix.fillna(0)
matrix.to_csv(path_or_buf="./data/matrix.csv")
I think this may be what you need. One thing to note first: merge returns a new DataFrame instead of modifying matrix in place, so its result has to be assigned back; that is why your matrix stays empty.
Data:
import pandas as pd
book_list = []
book_list.append(pd.DataFrame({'Word': ['a', 'b'], 'Count': [1, 2]}))
book_list.append(pd.DataFrame({'Word': ['b', 'c'], 'Count': [3, 4]}))
book_list.append(pd.DataFrame({'Word': ['d', 'e', 'f'], 'Count': [5, 6, 7]}))
book_list.append(pd.DataFrame({'Word': ['c', 'e'], 'Count': [8, 9]}))
Code:
result = None
for idx_book, book in enumerate(book_list):
    if result is None:
        result = book
    else:
        result = result.merge(book, how="outer", on=["Word"],
                              suffixes=(str(idx_book - 1), str(idx_book)))
Result:
  Word  Count0  Count1  Count2  Count3
0    a     1.0     NaN     NaN     NaN
1    b     2.0     3.0     NaN     NaN
2    c     NaN     4.0     NaN     8.0
3    d     NaN     NaN     5.0     NaN
4    e     NaN     NaN     6.0     9.0
5    f     NaN     NaN     7.0     NaN
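A variant that avoids relying on suffix collisions renames each book's Count column up front, so the column names are predictable (a sketch using the book_list above):

result = None
for idx_book, book in enumerate(book_list):
    # Give each book its own Count column before merging
    book = book.rename(columns={'Count': f'Count{idx_book}'})
    result = book if result is None else result.merge(book, how='outer', on='Word')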
Ended up solving it by using this lambda function with reduce (imported from functools):

from functools import reduce

matrix = reduce(lambda left, right: pd.merge(left, right, on=['Word'], how='outer'), wordData).fillna(0)
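For reference, a minimal end-to-end sketch of that approach, assuming each book's CSV sits under data/ with Word and Count columns (the glob pattern and per-book renaming are illustrative, not part of the original code):

from functools import reduce
from pathlib import Path
import pandas as pd

dataPath = Path('data')

# Read each per-book CSV, renaming Count to the book's name so the
# merged columns stay distinguishable.
wordData = []
for csvFile in sorted(dataPath.glob('*.csv')):
    if csvFile.stem == 'matrix':  # skip a previously written output file
        continue
    wordData.append(pd.read_csv(csvFile).rename(columns={'Count': csvFile.stem}))

# Outer-merge everything on Word and replace missing counts with 0.
matrix = reduce(lambda left, right: pd.merge(left, right, on=['Word'], how='outer'), wordData).fillna(0)
matrix.to_csv(dataPath / 'matrix.csv', index=False)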
Related
I have a .txt file of this sort
12
21
23
1
23
42
12
0
In which <12,21,23> are features and <1> is a label.
Again <23,42,12> are features and <0> is the label and so on.
I want to create a pandas dataframe from the above text file, splitting its single column into multiple columns.
The desired dataframe has four columns {column1, column2, column3, column4}, and there are no column names in it.
Can someone please help me out with this?
Thanks
import pandas as pd

df = dict()
features = list()
label = ''
filename = '.txt'

with open(filename) as fd:
    i = 0
    for line in fd:
        if i != 3:
            features.append(line.strip())
            i += 1
        else:
            label = line.strip()
            i = 0
            df[label] = features
            features = list()

df = pd.DataFrame(df)
df
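For the sample file above, this should display something like the following (the values stay strings, since the lines are read as text):

    1   0
0  12  23
1  21  42
2  23  12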
import pandas as pd

with open(<FILEPATH>, "r") as f:
    lines = f.readlines()

formatted = [int(line.strip()) for line in lines]  # Strip whitespace/\n and convert to int
labels = formatted[3::4]
features = list(zip(formatted[::4], formatted[1::4], formatted[2::4]))  # You can modify this if there are more than three feature rows

data = {}
for i, label in enumerate(labels):
    data[label] = list(features[i])

df = pd.DataFrame(data)
Comment if you have any questions or found any errors, and I will make amendments.
You can use numpy; first, you need to ensure that the number of values in the file is a multiple of 4.

Each record as a column, with the label as header:

import numpy as np
import pandas as pd

a = np.loadtxt('file.txt').reshape((4, -1), order='F')
df = pd.DataFrame(a[:-1], columns=a[-1])
Output:

    1.0   0.0
0  12.0  23.0
1  21.0  42.0
2  23.0  12.0

Each record as a new row:

a = np.loadtxt('file.txt').reshape((-1, 4))
df = pd.DataFrame(a)
Output:

      0     1     2    3
0  12.0  21.0  23.0  1.0
1  23.0  42.0  12.0  0.0
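If the goal is separate feature and label arrays downstream, a small follow-up split on that row-wise layout might look like this (a sketch; the column positions are taken from the output above):

X = df.iloc[:, :3]  # first three columns hold the features
y = df.iloc[:, 3]   # fourth column holds the label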
import pandas as pd

row = []
data = []
i = 0

with open('a.txt') as f:
    for line in f:
        i += 1
        row.append(int(line.strip()))
        if i % 4 == 0:
            # Every 4 lines (3 features + 1 label) form one record
            data.append(row)
            row = []

df = pd.DataFrame(data)
This results in df being:

    0   1   2  3
0  12  21  23  1
1  23  42  12  0
I have a DataFrame with the following columns:
import pandas as pd
data = {'Country': ['A', 'A', 'A' ,'B', 'B'],'Capital': ['CC', 'CD','CE','CF','CG'],'Population': [5, 35, 20,34,65]}
df = pd.DataFrame(data,columns=['Country', 'Capital', 'Population'])
I want to compare each row with all the others, and if two rows have the same Country, I would like to concatenate the pair into a new data frame (and transform it into a new csv).
new_data = {'Country': ['A', 'A','B'],'Capital': ['CC', 'CD','CF'],'Population': [5, 35,34],'Country_2': ['A', 'A' ,'B'],'Capital_2': ['CD','CE','CG'],'Population_2': [35, 20,65]}
df_new = pd.DataFrame(new_data,columns=['Country', 'Capital', 'Population','Country_2','Capital_2','Population_2'])
NOTE: This is a simplification of my data; I have more than 5000 rows and I would like to do it automatically.
I tried comparing dictionaries, and also comparing one row at a time, but I couldn't do it.
Thank you for your attention.
>>> df.join(df.groupby('Country').shift(-1), rsuffix='_2')\
... .dropna(how='any')
Country Capital Population Capital_2 Population_2
0 A CC 5 CD 35.0
1 A CD 35 CE 20.0
3 B CF 34 CG 65.0
This pairs every row with the next one using join + shift, but we restrict shifting to within the same country using groupby. See what groupby + shift does on its own:
>>> df.groupby('Country').shift(-1)
Capital Population
0 CD 35.0
1 CE 20.0
2 NaN NaN
3 CG 65.0
4 NaN NaN
Then once these values are added to the right of your data with the _2 suffix, the rows that have NaNs are dropped with dropna().
Finally, note that Country_2 is not repeated, as it's the same as Country, but it would be very easy to add back, as shown below.
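For instance (a sketch building on the join above):

out = df.join(df.groupby('Country').shift(-1), rsuffix='_2').dropna(how='any')
out['Country_2'] = out['Country']  # duplicate the key to match the requested layout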
To get all combinations you can try:
from itertools import combinations, chain
import numpy as np

df = (
    pd.concat(
        [pd.DataFrame(
            np.array(list(chain(*combinations(k.values, 2)))).reshape(-1, len(df.columns) * 2),
            columns=df.columns.append(df.columns.map(lambda x: x + '_2')))
         for g, k in df.groupby('Country')]
    )
)
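With the sample df above, this should produce something like the following (the index repeats per group because of the concat; note it also includes non-consecutive pairs such as CC/CE):

  Country Capital Population Country_2 Capital_2 Population_2
0       A      CC          5         A        CD           35
1       A      CC          5         A        CE           20
2       A      CD         35         A        CE           20
0       B      CF         34         B        CG           65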
I have a txt file that look like this:
a,b,c
a,b,c,d
a,b
a,b,c,d,e
a,b,c,d
with each line having possibly different items.
I tried the:
df = pd.read_csv('text.txt', sep = ',', header = None)
but it gave me the error 'Error tokenizing data'.
Does anyone know how to solve it, i.e. how to split a txt file on ',' regardless of the number of elements in each line? Much appreciated!
Just provide names for all of your columns:
import pandas as pd
print(pd.read_csv('text.txt', header=None, names=[0, 1, 2, 3, 4]))
Output:
0 1 2 3 4
0 a b c NaN NaN
1 a b c d NaN
2 a b NaN NaN NaN
3 a b c d e
4 a b c d NaN
Generated values are fine too:
print(pd.read_csv('text.txt', header=None, names=range(0, 5)))
text.txt
a,b,c
a,b,c,d
a,b
a,b,c,d,e
a,b,c,d
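If the maximum number of fields isn't known ahead of time, one alternative (a sketch, not part of the original answer) is to measure it first:

import pandas as pd

with open('text.txt') as f:
    max_cols = max(len(line.rstrip('\n').split(',')) for line in f)

df = pd.read_csv('text.txt', header=None, names=range(max_cols))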
Pass in the names parameter:
pd.read_csv('text.txt', sep=',', header=None, names=['a', 'b', 'c', 'd', 'e'])
I have 2 data frames with the same column names. Old data frame old_df and the new data frame is new_df with 1 column as a key.
I am trying to merge the 2 data frames into a single data frame, subject to the following conditions:
If the key is missing in the new table, then the data from old_df should be taken.
If the key is missing in the old table, then the data from new_df should be added.
If the key is present in both tables, then the data from new_df should overwrite the data from old_df.
Below is my code snippet that I am trying to play with.
new_data = pd.read_csv(filepath)
new_data.set_index(['Name'])
old_data = pd.read_sql_query("select * from dbo.Details", con=engine)
old_data.set_index(['Name'])
merged_result = pd.merge(new_data[['Name','RIC','Volatility','Sector']],
old_data,
on='Name',
how='outer')
I am thinking of using np.where from this point onwards but am not sure how to proceed. Please advise.
I believe you need DataFrame.combine_first with DataFrame.set_index for matching by the Name column:
merged_result = (new_data.set_index('Name')[['RIC','Volatility','Sector']]
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())
Sample data:
old_data = pd.DataFrame({'RIC':range(6),
                         'Volatility':[5,3,6,9,2,4],
                         'Name':list('abcdef')})
print (old_data)
RIC Volatility Name
0 0 5 a
1 1 3 b
2 2 6 c
3 3 9 d
4 4 2 e
5 5 4 f
new_data = pd.DataFrame({'RIC':range(4),
                         'Volatility':[10,20,30, 40],
                         'Name': list('abhi')})
print (new_data)
RIC Volatility Name
0 0 10 a
1 1 20 b
2 2 30 h
3 3 40 i
merged_result = (new_data.set_index('Name')
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())
print (merged_result)
Name RIC Volatility
0 a 0.0 10.0
1 b 1.0 20.0
2 c 2.0 6.0
3 d 3.0 9.0
4 e 4.0 2.0
5 f 5.0 4.0
6 h 2.0 30.0
7 i 3.0 40.0
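Note the integer columns come back as floats because of the NaNs introduced by the outer alignment. If that matters, a follow-up such as convert_dtypes can restore nullable integer dtypes (a sketch):

merged_result = merged_result.convert_dtypes()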
#jezrael's answer looks good. You may also try splitting the dataset on conditions and concatenating the old and new dataframes.
In the following example, I'm using col1 as the key and producing results that comply with your question's rules for combination.
import pandas as pd
old_data = {'col1': ['a', 'b', 'c', 'd', 'e'], 'col2': ['A', 'B', 'C', 'D', 'E']}
new_data = {'col1': ['a', 'b', 'e', 'f', 'g'], 'col2': ['V', 'W', 'X', 'Y', 'Z']}
old_df = pd.DataFrame(old_data)
new_df = pd.DataFrame(new_data)
old_df:

  col1 col2
0    a    A
1    b    B
2    c    C
3    d    D
4    e    E

new_df:

  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
Now,
df = pd.concat([new_df, old_df[~old_df['col1'].isin(new_df['col1'])]], axis=0).reset_index(drop=True)
Which gives us df:

  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
5    c    C
6    d    D
Hope this helps.
I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.
For example: say I have the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
Group ID Value
0 A 1 66
1 C 2 2
2 B 3 98
3 A 4 90
4 C 5 85
5 C 6 38
>>>
I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.
Currently I am able to do this as follows, but I am looking for a cleaner methodology:
First, I create a dataframe with a list of the IDs in each group.
>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group
A         [1, 4]    156
B            [3]     98
C      [2, 5, 6]    125
>>>
And then I use pd.Series to split those up into columns, rename them, and then join it back.
>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>>
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group
A        1    4  NaN    156
B        3  NaN  NaN     98
C        2    5    6    125
>>>
Is there a way to do this without first having to create the list of values?
You could use
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
to create id_df without the intermediate result DataFrame.
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
yields
ID1 ID2 ID3 Value
Group
A 1 4 NaN 77
B 3 NaN NaN 84
C 2 5 6 86
Another way of doing this is to first add a "helper" column to your data, then pivot your dataframe using that helper column, in the case below "ID_Count":
Using #unutbu's setup:
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')
#Create helper column
df['ID_Count'] = grp['ID'].cumcount() + 1
#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot(index='Group', columns='ID_Count', values='ID').add_prefix('ID').assign(Value=grp['Value'].sum())
Output:
ID_Count ID1 ID2 ID3 Value
Group
A 1.0 4.0 NaN 77
B 3.0 NaN NaN 84
C 2.0 5.0 6.0 86
Using get_dummies and MultiLabelBinarizer (scikit-learn):

import pandas as pd
import numpy as np
from sklearn import preprocessing

df = pd.DataFrame()
df['Group'] = ['A', 'C', 'B', 'A', 'C', 'C']
df['ID'] = [1, 2, 3, 4, 5, 6]
df['Value'] = np.random.randint(1, 100, 6)

classes = df['ID'].unique()  # the full set of ID labels, needed by the binarizer
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])
df2 = pd.get_dummies(df, '', '', columns=['ID']).groupby(by='Group').sum()
# get_dummies produces string column names, so select them as strings
df3 = pd.DataFrame(mlb.inverse_transform(df2[classes.astype(str)].values), index=df2.index)
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[1])]
pd.concat([df3, df2['Value']], axis=1)
ID1 ID2 ID3 Value
Group
A 1 4 NaN 63
B 3 NaN NaN 59
C 2 5 6 230