I have a txt file that looks like this:
a,b,c
a,b,c,d
a,b
a,b,c,d,e
a,b,c,d
with each line possibly having a different number of items.
I tried:
df = pd.read_csv('text.txt', sep=',', header=None)
but it gave me the error 'Error tokenizing data'.
Does anyone know how to solve this, i.e. how to split a txt file on ',' regardless of the number of elements in each line? Much appreciated!
Just provide names for all of your columns:
import pandas as pd
print(pd.read_csv('text.txt', header=None, names=[0, 1, 2, 3, 4]))
Output:
0 1 2 3 4
0 a b c NaN NaN
1 a b c d NaN
2 a b NaN NaN NaN
3 a b c d e
4 a b c d NaN
Generated names are fine too:
print(pd.read_csv('text.txt', header=None, names=range(0, 5)))
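If the maximum number of fields isn't known up front, a small sketch (assuming the file is small enough to scan twice) is to find the widest line first and build the names from that:
import pandas as pd

# Count the widest row, then let read_csv pad the shorter ones with NaN.
with open('text.txt') as fh:
    width = max(line.count(',') for line in fh) + 1

print(pd.read_csv('text.txt', header=None, names=range(width)))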
text.txt
a,b,c
a,b,c,d
a,b
a,b,c,d,e
a,b,c,d
Pass in the names parameter:
pd.read_csv('text.txt', sep=',', header=None, names=['a','b','c','d','e'])
I'm attempting to merge multiple sets of word data. Each csv file that is read in (there are 4 files) contains a column of the unique words in a book and a column for how many times each word shows up. The word columns of all of these csv files are supposed to merge into one in the new matrix file I'm trying to create, but when I attempt to merge each csv file and its data, an empty data frame is returned.
The csv files are like:
Word Count
Thou 100
O 20
Hither 8
and I want them to merge like this:
Word Book1 Book2 Book3
Thou 50 0 88
Hello 32 35 27
No 89 38 0
Yes 80 99 0
import os
from os import listdir
from os.path import isfile, join
import pandas as pd

dataPath = 'data/'
fileNames = [f for f in listdir(dataPath) if isfile(join(dataPath, f))]
columns = [os.path.splitext(x)[0] for x in fileNames]
columns.remove('rows')
columns.remove('cols')
columns.remove('matrix')
columns.insert(0, "Word")

wordData = []
matrix = pd.DataFrame(columns=columns)

for file in fileNames:
    if '.txt' in file:
        continue
    elif 'matrix' in file:
        continue
    else:
        myFile = open(f"./data/{file}", "r")
        readFile = myFile.read()
        dataVector = pd.read_csv(f"./data/{file}", sep=",")
        #print(dataVector)
        matrix.merge(dataVector, how="outer", on=["Word"])
        print(matrix)
        myFile.close()

pd.set_option("display.max_rows", None, "display.max_columns", None)
matrix = matrix.fillna(0)
matrix.to_csv(path_or_buf="./data/matrix.csv")
I think this may be what you need.
Data:
import pandas as pd
book_list = []
book_list.append(pd.DataFrame({'Word': ['a', 'b'], 'Count': [1, 2]}))
book_list.append(pd.DataFrame({'Word': ['b', 'c'], 'Count': [3, 4]}))
book_list.append(pd.DataFrame({'Word': ['d', 'e', 'f'], 'Count': [5, 6, 7]}))
book_list.append(pd.DataFrame({'Word': ['c', 'e'], 'Count': [8, 9]}))
Code:
result = None
for idx_book, book in enumerate(book_list):
    if result is None:
        result = book
    else:
        result = result.merge(book, how="outer", on=["Word"],
                              suffixes=(idx_book - 1, idx_book))
Result:
Word Count0 Count1 Count2 Count3
0 a 1.0 NaN NaN NaN
1 b 2.0 3.0 NaN NaN
2 c NaN 4.0 NaN 8.0
3 d NaN NaN 5.0 NaN
4 e NaN NaN 6.0 9.0
5 f NaN NaN 7.0 NaN
Ended up solving it by using functools.reduce with this lambda function:
from functools import reduce

matrix = reduce(lambda left, right: pd.merge(left, right, on=['Word'], how='outer'), wordData).fillna(0)
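For completeness, here is a minimal self-contained sketch of how the pieces fit together (the per-book renaming of the Count column is an assumption so that the merged columns stay distinct, and the file filtering mirrors the loop above):
from functools import reduce
from os import listdir
from os.path import isfile, join, splitext
import pandas as pd

dataPath = 'data/'
fileNames = [f for f in listdir(dataPath) if isfile(join(dataPath, f))]

wordData = []
for file in fileNames:
    if '.txt' in file or 'matrix' in file:
        continue
    dataVector = pd.read_csv(join(dataPath, file), sep=",")
    # Rename Count to the book name so each book keeps its own column after merging.
    wordData.append(dataVector.rename(columns={'Count': splitext(file)[0]}))

matrix = reduce(lambda left, right: pd.merge(left, right, on=['Word'], how='outer'),
                wordData).fillna(0)
matrix.to_csv(join(dataPath, 'matrix.csv'), index=False)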
Say I have a dataframe like below:
df = pd.DataFrame({0:['Hello World!']}) # here df could have more than one column of data as shown below
df = pd.DataFrame({0:['Hello World!'], 1:['Hello Mars!']}) # or df could have more than one row of data as shown below
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
and I also have a list of column names like below:
new_col_names = ['a','b','c','d'] # here, len(new_col_names) might vary like below
new_col_names = ['a','b','c','d','e'] # but we can always be sure that the len(new_col_names) >= len(df.columns)
Given that, how could I replace the column names in df such that the result looks like below:
df = pd.DataFrame({0:['Hello World!']})
new_col_names = ['a','b','c','d']
# result would be like this
a b c d
Hello World! (empty string) (empty string) (empty string)
df = pd.DataFrame({0:['Hello World!'], 1:['Hello Mars!']})
new_col_names = ['a','b','c','d']
# result would be like this
a b c d
Hello World! Hello Mars! (empty string) (empty string)
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
new_col_names = ['a','b','c','d','e']
a b c d e
Hello World! (empty string) (empty string) (empty string) (empty string)
Hello Mars! (empty string) (empty string) (empty string) (empty string)
From reading around StackOverflow answers such as this, I have a vague idea that it could be something like below:
df[new_col_names] = '' # but this returns KeyError
# or this
df.columns=new_col_names # but this returns ValueError: Length mismatch (of course)
If someone could show me a way to overwrite the existing dataframe column names and at the same time add new columns with empty string values in the rows, I'd greatly appreciate the help.
The idea is to create a dictionary from the existing column names with zip, rename only the existing columns, and then add all the new ones with DataFrame.reindex:
df = pd.DataFrame({0:['Hello World!', 'Hello Mars!']})
new_col_names = ['a','b','c','d','e']
df1 = (df.rename(columns=dict(zip(df.columns, new_col_names)))
.reindex(new_col_names, axis=1, fill_value=''))
print (df1)
a b c d e
0 Hello World!
1 Hello Mars!
If NaN values are acceptable, the same works without fill_value:
df1 = (df.rename(columns=dict(zip(df.columns, new_col_names)))
.reindex(new_col_names, axis=1))
print (df1)
a b c d e
0 Hello World! NaN NaN NaN NaN
1 Hello Mars! NaN NaN NaN NaN
Here is a function that will do what you want
I couldn't find a 1-liner, but jezrael did: his answer
import pandas as pd
# function
def rename_add_col(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    c_len = len(df.columns)
    if c_len == len(cols):
        df.columns = cols
    else:
        df.columns = cols[:c_len]
        df = pd.concat([df, pd.DataFrame(columns=cols[c_len:])])
    return df
# create dataframe
t1 = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', '5', '6'], 'c': ['7', '8', '9']})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
# call function
cols = ['d', 'e', 'f']
t1 = rename_add_col(t1, cols)
d e f
0 1 4 7
1 2 5 8
2 3 6 9
# call function
cols = ['g', 'h', 'i', 'new1', 'new2']
t1 = rename_add_col(t1, cols)
g h i new1 new2
0 1 4 7 NaN NaN
1 2 5 8 NaN NaN
2 3 6 9 NaN NaN
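Since the question asked for empty strings rather than NaN in the newly added columns, a follow-up fillna takes care of that:
t1 = t1.fillna('')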
This might help you do it all at once.
Use your old DataFrame to recreate another dataframe with the pd.DataFrame() method and then add new columns in the columns parameter by list addition.
Note: this adds the new columns as per the index length, but with NaN values; the workaround is to follow up with df.fillna(' ').
pd.DataFrame(df.to_dict(), columns=list(df.columns) + ['b', 'c'])
Hope this helps! Cheers!
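A minimal sketch applying this to the question's data (picking the extra names by slicing new_col_names is an assumption about how you would choose them):
import pandas as pd

df = pd.DataFrame({0: ['Hello World!', 'Hello Mars!']})
new_col_names = ['a', 'b', 'c', 'd', 'e']

# Rebuild with extra columns, then rename everything and blank out the NaNs.
out = pd.DataFrame(df.to_dict(),
                   columns=list(df.columns) + new_col_names[len(df.columns):])
out.columns = new_col_names
out = out.fillna('')
print(out)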
I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like this for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2, using columns A and B as keys.
I tried to do this with isin():
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before, without matching the common row 5, 1, 1 for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why isin() in my code above does not work, since it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way, but it does not preserve the order of the rows in the way I want, and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that with my actual dataframes far more than 2 columns (e.g. 15) will be used as keys for the matching, so it is better to come up with something concise that also works for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
and now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)
What it does is check whether the values in the current row also appear together in some row of the relevant columns of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
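Note that the lambda above writes a constant 1 into C rather than copying it from df_1; it matches the expected output here only because the matching df_1 row happens to have C equal to 1. A small variant that looks up the actual value could look like this (a sketch, assuming at most one df_1 row matches each A/B pair):
import numpy as np

key_cols = ['A', 'B']

def lookup_c(row):
    # Rows of df_1 whose key columns equal this row's key values.
    match = df_1[(df_1[key_cols] == row[key_cols].values).all(axis=1)]
    return match['C'].iloc[0] if len(match) else np.nan

df_2['C'] = df_2.apply(lookup_c, axis=1)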
Someone (I do not remember his username) suggested the following (which I think works) and then he deleted his post for some reason (??!):
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
for row in df_2.iterrows():
    for row2 in df_1.iterrows():
        if [row[1]['A'], row[1]['B']] == [row2[1]['A'], row2[1]['B']]:
            df_2.loc[row[0], 'C'] = row2[1]['C']
Just modify your line below:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works for this example, though note that per-column isin() checks each key independently, so it can also match A/B combinations that never occur together in the same row of df_2.
I'm trying to change the structure of my data from a text file (.txt) which looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into the format below (like a pivot table in Excel, where the column names are the values between the ":" characters and each group always starts with :1:):
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Does anyone have any idea? Thanks in advance.
First create the DataFrame with read_csv and header=None, because there is no header in the file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, then remove the leading and trailing : with Series.str.strip and split the values into 2 new columns with Series.str.split. Then create group numbers by comparing column a with the string '1' using Series.eq and Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a 1 2 3 4
a
1 A B C
2 D E F G
3 H I J
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
#read the file
with open("t.txt") as f:
content = f.readlines()
#Create a dictionary and read each line from file to keep the column names (ex, :1:) as keys and rows(ex, A) as values in dictionary.
my_dict={}
for v in content:
key = v.rstrip(':')[0:3] # take the value ':1:'
value = v.rstrip(':')[3] # take value 'A'
my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will look like this:
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None
I have a large space separated input file input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
A B
1 3 4
How do I get it to print this output (C = A*d[B]):
A B C
1 3 4 9
You can use a generator to work on the chunks one at a time:
Code:
def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        rows_idx = chunk['A'] > 2
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
import pandas as pd
df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
A B C
1 3 4 9.0
2 4 4 12.0
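Since the point is not holding everything in memory at once, a follow-up sketch (assuming even the filtered result may be too large to concatenate) appends each processed chunk to an output file instead:
import pandas as pd

first = True
for chunk in on_the_fly('input.csv'):
    # Write the header only with the first chunk, then append.
    chunk.to_csv('output.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False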