Reading a txt file with uneven dimensions in Python

I have a text file containing some data for a correlation function. It is structured as follows:
The first two rows are bin numbers and contain 45 entries each, while the remaining rows are the values at the given locations and contain 46 entries each. In these remaining rows, i.e. from the third row onwards, the first column is the order of the values.
I want to read this as a pandas DataFrame. Since there is a mismatch in dimensions, pandas shows an error:
ParserError: Error tokenizing data. C error: Expected 45 fields in line 9, saw 46
To fix this error, I modified the txt file by adding 'r1' and 'r2' in place of the blank space in the first two rows.
This is a workable solution if there are only a few files, but unfortunately I have hundreds of files structured the same way. Is there a way to read the data as-is? It would be fine for me to skip the first column entirely from the third row onwards.

It seems like the first two rows are some kind of MultiIndex columns, which is why the row-index entries are missing there (45 instead of 46 columns).
If my guess is correct, you can extend Oivalf's answer with a pandas MultiIndex:
import pandas as pd
rows_uneven_dimensions = 2
df_first_two_rows = pd.read_csv("test.txt", header=None, nrows=rows_uneven_dimensions, sep='\t')
df_all_other_rows = pd.read_csv("test.txt", header=None, skiprows=rows_uneven_dimensions, sep='\t', index_col=0) # defining the first column as the index
df_all_other_rows.index.name = 'Index' # optional: set the index name
cols = pd.MultiIndex.from_arrays([df_first_two_rows.iloc[idx] for idx in df_first_two_rows.index], names=("First Level", "Second Level")) # define the MultiIndex based on the rows of df_first_two_rows
# the level names are just for illustration purposes
df_all_other_rows.columns = cols # replace the old column index with the new MultiIndex
print(df_all_other_rows)
Given the test.txt from Oivalf's answer, the result will look like this:
First Level 0
Second Level 1 2 2 3 4 5 6
Index
0 A B C D E F G
1 H I J K L M N
2 O P Q R S T U
3 V W X Y Z A B

I would first read all rows except the first two, using the skiprows argument of the read_csv() function. Afterwards, read the first two rows with nrows=2, and in the end combine the resulting dataframes.
That's only a workaround, though. Maybe there are better solutions.
Example:
import pandas as pd
rows_uneven_dimensions = 2
df_first_two_rows = pd.read_csv("test.txt", header=None, nrows=rows_uneven_dimensions, sep='\t')
df_all_other_rows = pd.read_csv("test.txt", header=None, skiprows=rows_uneven_dimensions, sep='\t')
frames = [df_first_two_rows, df_all_other_rows]
result = pd.concat(frames, ignore_index=True, axis=0)
test.txt
0 0 0 0 0 0 0
1 2 2 3 4 5 6
0 A B C D E F G
1 H I J K L M N
2 O P Q R S T U
3 V W X Y Z A B
The result dataframe's values:
[[0 0 0 0 0 0 0 nan]
[1 2 2 3 4 5 6 nan]
[0 'A' 'B' 'C' 'D' 'E' 'F' 'G']
[1 'H' 'I' 'J' 'K' 'L' 'M' 'N']
[2 'O' 'P' 'Q' 'R' 'S' 'T' 'U']
[3 'V' 'W' 'X' 'Y' 'Z' 'A' 'B']]
The missing last value in each of the first two rows gets filled with NaN by default by the pd.concat function.
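For completeness, the ParserError can also be avoided in a single pass by handing read_csv an explicit column schema as wide as the widest row; a rough sketch for the test.txt above (note that the two short header rows then end up left-aligned, with NaN in their last column):
import pandas as pd
# An explicit names list stops pandas from inferring the width from the
# first row, so the 45-field rows no longer raise a tokenizing error.
df = pd.read_csv("test.txt", header=None, sep='\t', names=range(8)) # use range(46) for the real files
print(df)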

Related

Set specific rows values to column in pandas dataframe using list comprehension

I would like to change the values in a column for specific rows. The rows are defined in a vector containing the row indexes. I thought this could be done with a list comprehension (similar to the apply function in R). I tried this:
[(dataframe.loc[(dataframe['id']== x), 'column_to_be_changed'] = 1) (x) for x in indexes]
The error message is "SyntaxError: invalid syntax" with pointer to " = 1 "
This part works for a single value (e.g. x = 5):
dataframe.loc[(dataframe['id'] == x), 'column_to_be_changed'] = 1
Since a list comprehension gives back a list and not a pandas DataFrame, I am missing something, I guess. Help would be much appreciated. Thanks.
I think you are just looking for mask or where. See the example below:
df=pd.DataFrame({'id': [1,2,3,4], 'some_column': ['a','b','c','d']})
print(df)
# id some_column
# 0 1 a
# 1 2 b
# 2 3 c
# 3 4 d
li = [1,2] #indexes 1 and 2, so b and c
mask = df.index.isin(li)
df['some_column'].mask(mask, 'z', inplace=True) # 'z' is the value that will be set if the index is in 'li'
print(df)
# id some_column
# 0 1 a
# 1 2 z
# 2 3 z
# 3 4 d
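Since the question matches on the values of the id column rather than on index positions, a plain .loc assignment with isin works just as well; a small sketch reusing the df above (the list name ids is my own):
ids = [2, 3] # values of the 'id' column to match (again rows b and c)
df.loc[df['id'].isin(ids), 'some_column'] = 'z'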

How can you do a comparison between two string columns by position in Python?

I want to create two binary indicators by checking whether the characters in the first and third positions of column 'A' match the characters in the first and third positions of column 'B'.
Here is a sample data frame:
df = pd.DataFrame({'A': ['a%d', 'a%', 'i%'],
                   'B': ['and', 'as', 'if']})
A B
0 a%d and
1 a% as
2 i% if
I would like the data frame to look like below:
A B Match_1 Match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
I tried using the following string comparison, but the match_1 column just returns '0' values.
df['match_1'] = np.where(df['A'][0] == df['B'][0], 1, 0)
I am wondering if there is a function that is similar to the substr function found in SQL.
You could use the pandas str accessor, which can slice the elements:
df['match_1'] = df['A'].str[0].eq(df['B'].str[0]).astype(int)
df['match_3'] = df['A'].str[2].eq(df['B'].str[2]).astype(int)
output:
A B match_1 match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
If you have many positions to test, you can use a loop:
for pos in (1, 3):
    df['match_%d' % pos] = df['A'].str[pos-1].eq(df['B'].str[pos-1]).astype(int)
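As an aside, the closest pandas analogue to SQL's substr is Series.str.slice(start, stop); the single-character comparisons above can equally be written with it, for example:
# str.slice(0, 1) takes the substring from position 0 up to (but not including) position 1
df['match_1'] = df['A'].str.slice(0, 1).eq(df['B'].str.slice(0, 1)).astype(int)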

How to find strings with delimiter and replace them with new lines in pandas dataframe Python

I am trying to figure out how to solve the following problem:
I have a pandas DataFrame that contains some strings delimited with ','. My goal is to find these and replace them with new rows so that there are no more delimiters within the DataFrame. For example, if a cell contains 'hi,there', I would like it to become 'hi' and 'there', so there will be two rows instead of one at the end.
This should be applied until there are no delimiters left in the original DataFrame, so in case there are two delimited words ('hi,there' and 'whats,up,there') in one row in two different columns, that row becomes 6 rows instead of one (the cartesian product). The same should be applied to all rows of the DataFrame.
Here below is code demonstrating the original dataframe (a) and the result I would like to end with:
a = pd.DataFrame([['Hi,there', 'fv', 'whats,up,there'],['dfasd', 'vfgfh', 'kutfddx'],['fdfa', 'uyg', 'iutyfrd']], columns = ['a', 'b', 'c'])
Output:
          a      b               c
0  Hi,there     fv  whats,up,there
1     dfasd  vfgfh         kutfddx
2      fdfa    uyg         iutyfrd
Desired output (the per-row cartesian product):
       a      b        c
0     Hi     fv    whats
1     Hi     fv       up
2     Hi     fv    there
3  there     fv    whats
4  there     fv       up
5  there     fv    there
6  dfasd  vfgfh  kutfddx
7   fdfa    uyg  iutyfrd
So far I have managed to copy the rows as many times as needed for this purpose, but I cannot figure out how to replace the delimited words with what I want:
ndf = pd.DataFrame([])
for i in a.values:
    n = 1
    for j in i:
        if ',' in j:
            n = n * len(j.split(','))
    ndf = ndf.append([i] * n, ignore_index=False)
This duplicates the rows the right number of times, but the delimited strings are still intact. Any idea how to proceed? I can only use pandas and numpy for this, but I am convinced that should suffice.
First I split the words by comma, then use the stack() function:
a_list = a.apply(lambda x: x.str.split(','))
for i in a_list:
    tmp = pd.DataFrame.from_records(a_list[i].tolist()).stack().reset_index(level=1, drop=True).rename('new_{}'.format(i))
    a = a.drop(i, axis=1).join(tmp)
a = a.reset_index(drop=True)
Result:
>>> a
new_a new_c new_b
0 Hi whats fv
1 Hi up fv
2 Hi there fv
3 there whats fv
4 there up fv
5 there there fv
6 dfasd kutfddx vfgfh
7 fdfa iutyfrd uyg
Update
To handle missing values (np.nan and None), first convert them to a placeholder string, then proceed exactly as for normal data, and finally replace the placeholder with np.nan again.
Let's insert some missing values:
import numpy as np
a['a'].loc[0] = np.nan
a['b'].loc[1] = None
# a b c
# 0 NaN fv whats,up,there
# 1 dfasd None kutfddx
# 2 fdfa uyg iutyfrd
a.fillna('NaN', inplace=True) # some string
#
# insert the code above (with for loop)
#
a.replace('NaN', np.nan, inplace=True)
# new_a new_b new_c
# 0 NaN fv whats
# 1 NaN fv up
# 2 NaN fv there
# 3 dfasd NaN kutfddx
# 4 fdfa uyg iutyfrd
IIUC, you can agg with itertools.product
import itertools
(df.agg(lambda r: pd.Series(list(itertools.product(r.a.split(','),
                                                   r.b.split(','),
                                                   r.c.split(',')))), axis=1)
   .stack()
   .apply(pd.Series)
   .reset_index(drop=True))
0 1 2
0 Hi fv whats
1 Hi fv up
2 Hi fv there
3 there fv whats
4 there fv up
5 there fv there
6 dfasd vfgfh kutfddx
7 fdfa uyg iutyfrd
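On pandas 0.25 or newer, DataFrame.explode offers a much shorter route to the same cartesian product; a minimal sketch using the question's frame a:
out = a.copy()
for col in out.columns:
    out[col] = out[col].str.split(',') # each cell becomes a list
    out = out.explode(col)             # one row per list element
out = out.reset_index(drop=True)
print(out)
Exploding the columns one after another multiplies the rows out, which is exactly the per-row cartesian product.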

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multiindex, where one of the columns represents multiple values, separated by a "|", like this:
value
left right
x a|b 2
y b|c|d -1
I want to duplicate the rows based on the "right" column, to get something like this:
value
left right
x a 2
x b 2
y b -1
y c -1
y d -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
Pandas Series have a dedicated str.split method to perform this operation.
split works only on a Series, so isolate the column you want (for the question's data the values live in the index, so you would first pull them out with df.index.get_level_values('right').to_series(), and split on '|' instead of ','):
SO = df['right']
Now three steps at once: split returns a Series of lists, apply(pd.Series, 1) converts the lists into columns, and stack stacks the columns into a single column:
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
The only issue is that you now have a MultiIndex, so just drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0 0 a
1 b
1 0 b
1 c
2 d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
Building upon the answer of @xNoK, I am adding here the additional step needed to include the result back in the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
value
left right
x a|b 2
y b|c|d -1
First, let's generate the values for the right index as @xNoK suggested: take the index level we want to work on with index.levels[1], convert it to a Series so that we can use str.split(), and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b 0 a
1 b
b|c|d 0 b
1 c
2 d
dtype: object
Now we want to put these values back into the original DataFrame df. To do that, let's change its shape so that the result generated in the previous step can be copied over.
We can repeat the rows (including the indexes) by the number of |-separated values present in the right level of the multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row (including its index) should be repeated. We pass this to index.repeat() and fetch the values at those indexes to create a new DataFrame df_repeated.
df_repeated = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeated
Out[19]:
value
left right
x a|b 2
a|b 2
y b|c|d -1
b|c|d -1
b|c|d -1
Now the df_repeated DataFrame is in a shape where we can change the index to get the answer we want.
Replace the index of df_repeated with the desired values as follows:
df_repeated.index = [df_repeated.index.droplevel(1), new_multi_idx_val]
df_repeated.index.rename(names=['left', 'right'], inplace=True)
df_repeated
Out[20]:
value
left right
x a 2
b 2
y b -1
c -1
d -1
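For what it's worth, on pandas 0.25+ the whole reshaping can also be done with DataFrame.explode; a sketch assuming the same df as above:
df2 = df.reset_index()                     # the 'left'/'right' index levels become columns
df2['right'] = df2['right'].str.split('|') # '|'-separated string -> list
df2 = df2.explode('right').set_index(['left', 'right'])
print(df2)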

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n×k, and in the process I will need to add columns of dimension m×1, where m ∈ [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different lengths?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example, and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like the column names ('Name column') are meaningful to the Original Poster / original question.
To keep the column names, use pandas.concat, but don't ignore_index (the default value of ignore_index is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all lists to the determined max length, padding the shorter ones with empty strings:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists have the same length, so create the DataFrame:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue: two different dataframes without a common column, which I just needed to put beside each other in a csv file.
Merge: in this case, merge does not work, even when adding a temporary column to both dfs and then dropping it, because this method forces both dfs to the same length and hence repeats the rows of the shorter dataframe to match the longer one.
Concat: The Red Pea's idea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index(drop=True) # drop=True avoids carrying the old index along as an extra column
df2 = df2.reset_index(drop=True)
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
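A minimal illustration of why the reset matters, with two made-up frames whose indexes don't line up:
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2, 3]}, index=[10, 11, 12])
df2 = pd.DataFrame({'y': ['a', 'b']}, index=[0, 1])
# Without resetting, concat aligns on the index labels and scatters NaNs:
print(pd.concat([df1, df2], axis=1))
# After resetting, rows are matched purely by position:
print(pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1))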
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
import pandas as pd

def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # build a dict {row position: value} from the list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict of dicts
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T  # build a new DataFrame from the dict
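A quick usage sketch (the frame and values are made up for illustration):
df = pd.DataFrame({'A': [1, 2, 3, 4]})
df = fill_column(df, [9, 8], 'B') # 'B' is shorter than the frame; rows 2 and 3 get NaN
print(df)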
