I'm trying to restructure data from a text file (.txt) that looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
I would like to transform it into the following format (like a pivot table in Excel, where each column name is the text between the ":" characters and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Does anyone have any idea? Thanks in advance.
First create a DataFrame with read_csv, passing header=None because the file has no header row:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
# after testing, replace StringIO(temp) with the file name, e.g. 'filename.csv'
from io import StringIO  # pd.compat.StringIO was removed in pandas 0.25
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, remove the leading : with Series.str.strip, and split the values into 2 new columns with Series.str.split. Then create the group ids by comparing column a with the string '1' using Series.eq (the method form of ==) and taking Series.cumsum, build a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
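To see why the cumsum step produces the group ids, here is a minimal sketch on the same sample data. It assumes the lines are already held in a list (the name lines is only for illustration) and uses pivot rather than unstack for the final reshape, which should give the same layout:

```python
import pandas as pd

# Sample lines copied from the question; `lines` is an illustrative name.
lines = [":1:A", ":2:B", ":3:C", ":1:D", ":2:E",
         ":3:F", ":4:G", ":1:H", ":3:I", ":4:J"]

# Split ':1:A' into the column label '1' and the value 'A'.
parts = pd.Series(lines).str.strip(":").str.split(":", expand=True)
parts.columns = ["a", "b"]

# Every '1' starts a new group, so the running count of '1's is the group id.
group = parts["a"].eq("1").cumsum()
print(group.tolist())

# Reshape: one row per group, one column per label, blanks where missing.
wide = (parts.assign(Group=group)
             .pivot(index="Group", columns="a", values="b")
             .fillna(""))
print(wide)
```

Printing group first makes it easy to check that each row was assigned to the intended block before reshaping.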
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'  # set the name after replacing the index, otherwise it is lost
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
import pandas as pd
# read the file
with open("t.txt") as f:
    content = f.readlines()
# Build a dictionary with the column names (e.g. ':1:') as keys and the lists of values (e.g. 'A') as values.
my_dict = {}
for v in content:
    key = v[0:3]   # the column label, e.g. ':1:'
    value = v[3]   # the value, e.g. 'A'
    my_dict.setdefault(key, []).append(value)
# convert the dictionary to a DataFrame and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
The output looks like this (values are stacked per column, so the rows do not correspond to the original groups):
  :1:   :2: :3:   :4:
0   A     B   C     G
1   D     E   F     J
2   H  None   I  None
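If the rows need to follow the original groups, a plain-Python variant can start a new row whenever a :1: line appears. This is only a sketch: the list lines stands in for the file contents, and it assumes every line has the form :label:value.

```python
import pandas as pd

# Stand-in for the file contents; in practice these would come from f.readlines().
lines = [":1:A", ":2:B", ":3:C", ":1:D", ":2:E",
         ":3:F", ":4:G", ":1:H", ":3:I", ":4:J"]

rows = []
for line in lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # ':1:A' -> ['', '1', 'A']
    _, label, value = line.split(":", 2)
    if label == "1" or not rows:
        rows.append({})  # a ':1:' line opens a new group
    rows[-1][f":{label}:"] = value

# One dict per group becomes one row; missing labels are shown as blanks.
df = pd.DataFrame(rows).fillna("")
df.index = pd.RangeIndex(1, len(df) + 1, name="Group")
print(df)
```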
Suppose I have 3 dataframes that are wrapped in a list. The dataframes are:
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
The list of the dfs is:
df_list = [df_1, df_2, df_3]
Now I want to write a for loop that goes over df_list, takes the text column of each df, and merges them into a new dataframe with a new column header called topic. Since each text column comes from a different dataframe, I want to name the headers topic_1, topic_2, etc. The desired outcome is as follows:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
I can easily extract the text columns as:
lst = []
for i in range(len(df_list)):
lst.append(df_list[i]['text'].tolist())
It is just that I am stuck on the last part, namely bringing the columns into 1 df without using brute force.
You can extract the wanted columns with a list comprehension and concat them:
pd.concat([d['text'].rename(f'topic_{i}')
           for i, d in enumerate(df_list, start=1)],
          axis=1)
output:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
Generally speaking you want to avoid looping anything on a pandas DataFrame. However, in this solution I do use a loop to rename your columns. This should work assuming you just have these 3 dataframes:
import pandas as pd
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
df_list = [df_1.text, df_2.text, df_3.text]
df_combined = pd.concat(df_list,axis=1)
df_combined.columns = [f"topic_{i+1}" for i in range(len(df_combined.columns))]
>>> df_combined
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
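Another way to name the columns is the keys argument of pd.concat, which labels each concatenated Series when axis=1. A minimal sketch, using the same three frames as the question:

```python
import pandas as pd

df_1 = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'], 'num': [2, 1, 3, 4, 3]})
df_2 = pd.DataFrame({'text': ['f', 'g', 'h', 'i', 'j'], 'num': [1, 2, 3, 4, 3]})
df_3 = pd.DataFrame({'text': ['k', 'l', 'm', 'n', 'o'], 'num': [6, 5, 3, 1, 2]})
df_list = [df_1, df_2, df_3]

# One label per frame: topic_1, topic_2, ...
labels = [f"topic_{i}" for i in range(1, len(df_list) + 1)]

# With axis=1 and Series inputs, the keys become the column names.
out = pd.concat([d['text'] for d in df_list], axis=1, keys=labels)
print(out)
```

This keeps the naming separate from the column extraction, so it works for any number of frames in df_list.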
I want to replace some rows of some columns in a larger pandas DataFrame with data from a smaller one. The column names are the same in both.
I tried combine_first, but it only updates null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
The condition for replacing values in df1 is df1.A == df2.A and df1.B == df2.B.
How can it be achieved in the most pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
You could use a MultiIndex.
First, let us create dataframes like the ones you are working with:
from string import ascii_uppercase
import numpy as np
import pandas as pd
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then turn A and B into the index of both frames and assign the smaller frame into the larger one with .loc:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
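A related option is DataFrame.update, which overwrites values with the non-NA values of another frame after aligning on index and columns. The sketch below is an illustration on made-up data (big, small and keys are hypothetical names), not the asker's real frames:

```python
import pandas as pd

# Made-up data: `big` has extra rows and an extra column compared to `small`.
big = pd.DataFrame({"A": [1, 2, 3],
                    "B": ["abc", "def", "ghi"],
                    "C": [10.2, 20.5, 30.1],
                    "D": [0.0, 0.0, 0.0],
                    "F": ["x", "y", "z"]})
small = pd.DataFrame({"A": [1], "B": ["abc"], "C": [15.2], "D": [1.0]})

keys = ["A", "B"]  # rows match when both A and B are equal

# Align both frames on the key columns, then overwrite in place.
big = big.set_index(keys)
big.update(small.set_index(keys))
big = big.reset_index()
print(big)
```

Unlike combine_first, update replaces existing non-null values, but it skips NaN values in the smaller frame, so it cannot be used to blank out cells.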
I was trying to clean up the column names in a dataframe, but only for some of the columns.
For some reason, replacing the column names on a slice of the dataframe doesn't work. Why is that?
Let's say we have the following dataframe
(copyable code to reproduce the data is at the bottom):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})
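This is because df.iloc[:, 1:] returns a new DataFrame, so assigning to its columns attribute only changes that temporary object, not df. In addition, a pandas Index is immutable, so the existing labels cannot be edited in place. The documentation for class pandas.Index defines it as:
Immutable ndarray implementing an ordered, sliceable set
So to change the names you have to assign a complete new list of column names to df itself, for instance with: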
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace (rename returns a new DataFrame, so assign the result back):
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
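The difference between the two behaviours can be checked with a short sketch on the question's dataframe. The names sliced and to_fix are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# Renaming on the slice only affects the new object returned by iloc.
sliced = df.iloc[:, 1:]
sliced.columns = sliced.columns.str[:4]
print(list(sliced.columns))  # the slice has the short names
print(list(df.columns))      # df still has the original names

# Renaming a subset on df itself: map old names to their first 4 characters.
to_fix = df.columns[1:]
df = df.rename(columns=dict(zip(to_fix, to_fix.str[:4])))
print(list(df.columns))
```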
To overwrite column names you can use the .rename() method, like this:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename:
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just those columns:
df.columns = [x if x not in mask else x[:4] for x in df.columns]
I have a dataframe like this :
A B C E D
---------------
0 a r g g
1 x
2 x f f r
3 t
3 y
I am trying to forward fill using ffill, but it is not working:
cols = df.columns[:4].tolist()
df[cols] = df[cols].ffill()
I also tried :
df[cols] = df[cols].fillna(method='ffill')
But the values are not getting filled.
Are the empty columns in the data causing this issue?
The data is mocked; the real data is different (it contains strings, numbers and empty columns).
Desired output:
A B C E D
---------------
0 a r g g
1 a r g x
2 x f f r
3 x f f t
3 x f f y
Replace the empty strings in the subset of columns with NaN, then forward fill:
df[cols] = df[cols].replace('', np.nan).ffill()
You should replace the empty strings with np.nan first (the np.NaN alias was removed in NumPy 2.0):
df = df.replace('', np.nan)
df[cols] = df[cols].ffill()
Replace '' with np.nan first:
df[df == ''] = np.nan
df[cols] = df[cols].ffill()
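ffill only fills missing values (NaN/None), and an empty string does not count as missing, which is why the answers above convert first. Here is a small self-contained sketch on made-up data; it assumes that cells containing only whitespace should also be treated as empty, which the regex handles:

```python
import numpy as np
import pandas as pd

# Made-up data: empty strings and a whitespace-only cell stand in for blanks.
df = pd.DataFrame({"A": ["a", "", "x", ""],
                   "B": ["r", " ", "f", ""],
                   "D": ["g", "x", "r", "t"]})
cols = ["A", "B"]

# Turn blank cells into NaN so that ffill can see them, then fill downwards.
df[cols] = df[cols].replace(r"^\s*$", np.nan, regex=True).ffill()
print(df)
```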
I have a large space separated input file input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
A B
1 3 4
How do I get it to print this output (C = A*d[B]):
A B C
1 3 4 9
You can use a generator to work on the chunks one at a time:
Code:
import pandas as pd
def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        # keep only the rows where A > 2 and compute C = A * d[B] for them
        rows_idx = chunk['A'] > 2
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
import pandas as pd
df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
A B C
1 3 4 9.0
2 4 4 12.0
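The same idea can be sketched with a vectorised lookup instead of a row-wise apply, using Series.map to translate column B through the dictionary. The function name filtered_chunks and the chunksize value are assumptions for illustration; chunksize is passed so that the file is read in pieces of that many rows:

```python
from io import StringIO

import pandas as pd

d = {2: 2, 4: 3}

# In-memory stand-in for input.csv, including the comment header lines.
data = StringIO("## Header\n# More header here\nA B\n1 2\n3 4\n4 4\n")

def filtered_chunks(source, mapping, chunksize=2):
    """Yield filtered chunks with an extra column C = A * mapping[B]."""
    reader = pd.read_csv(source, sep=r"\s+", comment="#", chunksize=chunksize)
    for chunk in reader:
        kept = chunk[chunk["A"] > 2].copy()  # filter on the fly
        kept["C"] = kept["A"] * kept["B"].map(mapping)  # look up B in the dict
        yield kept

df = pd.concat(filtered_chunks(data, d))
print(df)
```

Filtering before computing C means the lookup only runs on the rows that are kept. Values of B that are missing from the dictionary would give NaN in C rather than raising a KeyError.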