Parsing array from txt file to Pandas dataframe in Python

Hi, I have the following array in my .txt file:
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
n|vechicle.car.characteristics[1].weight|4
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].color|white
c|vechicle.car.characteristics[2].fuel|95
And I'd like to parse it into the following DataFrame:
  speed  weight  color  fuel
0   180       3  black    95
1   160       4  green    92
2   200       5  white    95
This is how I solved it:
import re
import pandas as pd

df_output_list = {}
df_output_dict = []
match_counter = 1
with open('sample_car.txt', encoding='utf-8') as file:
    line = file.readline()
    while line:
        result = re.split(r'\|', line.rstrip())
        result2 = re.findall(r'.(?<=\[)(\d+)(?=\])', result[1])
        regex = re.compile('vechicle.car.characteristics.')
        match = re.search(regex, result[1])
        if match:
            if match_counter == 1:
                ArrInd = 0
                match_counter += 1
            #print(df_output_list)
            if ArrInd == int(result2[0]):
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])
            else:
                df_output_dict.append(df_output_list)
                df_output_list = {}
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])
        line = file.readline()
df_output_dict.append(df_output_list)
#print(df_output_dict)
df_output = pd.DataFrame(df_output_dict)
print(df_output)
And I find it overly complicated. Is it possible to simplify it?
Column names should be parsed automatically.

Read it as a CSV file with sep='|', take the last column (which contains the values), and then reshape it into the appropriate shape.
>>> columns=['speed','weight','color','fuel']
>>> s = pd.read_csv('filename.txt', sep='|', header=None).iloc[:,-1]
>>> df = pd.DataFrame(s.to_numpy().reshape(-1,4), columns=columns)
>>> df
  speed weight  color fuel
0   180      3  black   95
1   160      4  green   92
2   200      5  white   95
If the rows always follow a fixed format like n|vechicle.car.characteristics[0].speed|180, then we can derive the column names automatically:
>>> df = pd.read_csv('d.csv', sep='|', header=None)
>>> columns = df.iloc[:,1].str.split('.').str[-1].unique()
>>> df_out = pd.DataFrame(df.iloc[:,-1].to_numpy().reshape(-1,len(columns)), columns=columns)
>>> df_out
  speed weight  color fuel
0   180      3  black   95
1   160      4  green   92
2   200      5  white   95
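If the number of fields per record ever varies, a pivot-based variant is more robust than a fixed-width reshape. Here is a minimal sketch, assuming the same 'filename.txt' (note that pivot orders the columns alphabetically rather than in file order):
import pandas as pd

df = pd.read_csv('filename.txt', sep='|', header=None,
                 names=['kind', 'path', 'value'])
# Pull the [n] record index and the trailing attribute name out of the path
df['idx'] = df['path'].str.extract(r'\[(\d+)\]', expand=False).astype(int)
df['attr'] = df['path'].str.split('.').str[-1]
out = df.pivot(index='idx', columns='attr', values='value')
print(out)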

Related

Converting the contents of a txt file to columns of a pandas dataframe

I have a .txt file of this sort
12
21
23
1
23
42
12
0
In which <12,21,23> are features and <1> is a label.
Again, <23,42,12> are features and <0> is the label, and so on.
I want to create a pandas DataFrame from the above text file, which contains only a single column, by splitting it into multiple columns.
The format of the dataframe is {column1, column2, column3, column4}, and there are no column names in it.
Can someone please help me out with this?
Thanks
import pandas as pd

df = dict()
features = list()
label = ''
filename = '.txt'
with open(filename) as fd:
    i = 0
    for line in fd:
        if i != 3:
            features.append(line.strip())
            i += 1
        else:
            label = line.strip()
            i = 0
            df[label] = features
            features = list()
df = pd.DataFrame(df)
df
import pandas as pd

with open(<FILEPATH>, "r") as f:
    lines = f.readlines()
formatted = [int(line[:-1]) for line in lines]  # Remove \n and convert to int
labels = formatted[3::4]
features = list(zip(formatted[::4], formatted[1::4], formatted[2::4]))  # You can modify this if there are more than three features per record
data = {}
for i, label in enumerate(labels):
    data[label] = list(features[i])
df = pd.DataFrame(data)
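For the eight-line sample in the question, a quick check of the result (the labels 1 and 0 become the column headers):
print(df)
#     1   0
# 0  12  23
# 1  21  42
# 2  23  12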
Comment if you have any questions or found any errors, and I will make amendments.
You can use numpy. First, you need to ensure that the number of values is a multiple of 4.
Each record as a column, with the label as header:
import numpy as np
import pandas as pd

a = np.loadtxt('file.txt').reshape((4, -1), order='F')
df = pd.DataFrame(a[:-1], columns=a[-1])
Output:
    1.0   0.0
0  12.0  23.0
1  21.0  42.0
2  23.0  12.0
Each record as a new row:
a = np.loadtxt('file.txt').reshape((-1,4))
df = pd.DataFrame(a)
Output:
      0     1     2    3
0  12.0  21.0  23.0  1.0
1  23.0  42.0  12.0  0.0
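If integer output is preferred, np.loadtxt accepts a dtype argument; a minimal tweak of the snippet above:
import numpy as np
import pandas as pd

# dtype=int keeps the values as integers instead of loadtxt's float default
a = np.loadtxt('file.txt', dtype=int).reshape((-1, 4))
df = pd.DataFrame(a)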
import pandas as pd

row = []
data = []
with open('a.txt') as f:
    for i, line in enumerate(f, start=1):
        row.append(int(line.strip()))
        if i % 4 == 0:  # every fourth value closes a record
            data.append(row)
            row = []
df = pd.DataFrame(data)
which results in df being:
    0   1   2  3
0  12  21  23  1
1  23  42  12  0

Looping over data frame to cap and sum another data frame

I am trying to use entries from df1 to limit the amounts in df2, then add them up based on their type and summarize them in df3. I'm not sure how to get there; a for loop using iterrows would be my best guess, but it's not complete.
Code:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Caps': ['25', '50', '100']})
df2 = pd.DataFrame({'Amounts': ['45', '25', '65', '35', '85', '105', '80'],
                    'Type':    ['a',  'b',  'b',  'c',  'a',  'b',   'd']})
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd']})
df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)
for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']) + 'limit'] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts'] <= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd'],
                    'Total': ['130', '195', '35', '80'],
                    '25limit': ['50', '75', '25', '25'],
                    '50limit': ['95', '125', '35', '50'],
                    '100limit': ['130', '190', '35', '80'],
                    })
Output:
>>> df3
  Type Total 25limit 50limit 100limit
0    a   130      50      95      130
1    b   195      75     125      190
2    c    35      25      35       35
3    d    80      25      50       80
Use numpy to compare all Amounts values with the Caps by broadcasting into a 2d array a, then create a DataFrame with the constructor, sum per column, transpose with DataFrame.T and add a suffix with DataFrame.add_suffix.
For the aggregated column, use DataFrame.insert for the first column with GroupBy.sum:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
df1 = (pd.DataFrame(a,columns=df2['Type'],index=df1['Caps'])
.sum(axis=1, level=0).T.add_suffix('limit'))
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print(df1)
  Type  Total  25limit  50limit  100limit
0    a    130       50       95       130
1    b    195       75      125       190
2    c     35       25       35        35
3    d     80       25       50        80
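To see the capping/broadcasting step in isolation, here is a minimal sketch with the question's numbers (the array a holds one row per cap):
import numpy as np

am = np.array([45, 25, 65, 35, 85, 105, 80])  # Amounts
ca = np.array([25, 50, 100])                  # Caps
# Row i contains the Amounts, each capped at ca[i]
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
print(a)
# [[ 25  25  25  25  25  25  25]
#  [ 45  25  50  35  50  50  50]
#  [ 45  25  65  35  85 100  80]]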
Here is my solution without numpy; however, it is two times slower than @jezrael's solution, 10.5 ms vs. 5.07 ms.
limcols = df1.Caps.to_list()
df2 = df2.reindex(columns=["Amounts", "Type"] + limcols)
df2[limcols] = df2[limcols].transform(
    lambda sc: np.where(df2.Amounts.le(sc.name), df2.Amounts, sc.name))
# Summations:
g = df2.groupby("Type")
df3 = g[limcols].sum()
df3.insert(0, "Total", g.Amounts.sum())
# Renaming columns:
c_dic = {lim: f"{lim:.0f}limit" for lim in limcols}
df3 = df3.rename(columns=c_dic).reset_index()
# Cleanup:
#df2 = df2.drop(columns=limcols)

Modifying strings in a pandas DataFrame by row

I have the following strings in a pandas DataFrame in Python3, in columns string1 and string2:
import pandas as pd
datainput = [
{ 'string1': 'TTTABCDABCDTTTTT', 'string2': 'ABABABABABABABAA' },
{ 'string1': 'AAAAAAAA', 'string2': 'TTAAAATT' },
{ 'string1': 'TTABCDTTTTT', 'string2': 'ABABABABABA' }
]
df = pd.DataFrame(datainput)
df
            string1           string2
0  TTTABCDABCDTTTTT  ABABABABABABABAA
1          AAAAAAAA          TTAAAATT
2       TTABCDTTTTT       ABABABABABA
For each row, strings in columns string1 and string2 are defined to be the same length.
For each row of the DataFrame, the strings may need to be "cleaned" of beginning/trailing letters 'T'. However, for each row, the strings need to both be stripped of the same number of characters, so as the strings remain the same length.
The correct output is as follows:
df
    string1   string2
0  ABCDABCD  BABABABA
1      AAAA      AAAA
2      ABCD      ABAB
If these were two variables, it would be straightforward to calculate this with strip(), e.g.
string1 = "TTTABCDABCDTTTTT"
string2 = "ABABABABABABABAA"
length_original = len(string1)
num_left_chars = len(string1) - len(string1.lstrip('T'))
num_right_chars = len(string1.rstrip('T'))
edited = string1[num_left_chars:num_right_chars]
## print(edited)
## 'ABCDABCD'
However, in this case, one needs to iterate through all rows and redefine the two strings at once. How could one modify these strings row by row?
EDIT: My main confusion is: given that both columns could contain leading/trailing 'T's, how do I re-define them both?
A bit lengthy but gets the job done..
import re

def count_head(s):
    head = re.findall('^T+', s)
    if head:
        return len(head[0])
    return 0

def count_tail(s):
    tail = re.findall('T+$', s)
    if tail:
        return len(tail[0])
    return 0

df1 = df.copy()
df1['st1_head'] = df1['string1'].apply(count_head)
df1['st2_head'] = df1['string2'].apply(count_head)
df1['st1_tail'] = df1['string1'].apply(count_tail)
df1['st2_tail'] = df1['string2'].apply(count_tail)
df1['length'] = df1['string1'].str.len()

def trim_strings(row):
    head = max(row['st1_head'], row['st2_head'])
    tail = max(row['st1_tail'], row['st2_tail'])
    l = row['length']
    return {'string1': row['string1'][head:(l-tail)],
            'string2': row['string2'][head:(l-tail)]}

new_df = pd.DataFrame(list(df1.apply(trim_strings, axis=1)))
print(new_df)
output:
    string1   string2
0  ABCDABCD  BABABABA
1      AAAA      AAAA
2      ABCD      ABAB
A more compact version:
def trim(st1, st2):
    l = len(st1)
    head = max(len(st1) - len(st1.lstrip('T')),
               len(st2) - len(st2.lstrip('T')))
    tail = max(len(st1) - len(st1.rstrip('T')),
               len(st2) - len(st2.rstrip('T')))
    return (st1[head:(l-tail)],
            st2[head:(l-tail)])

new_df = pd.DataFrame(list(
    df.apply(lambda r: trim(r['string1'], r['string2']),
             axis=1)), columns=['string1', 'string2'])
print(new_df)
The main thing to notice is the df.apply(<your function>, axis=1), which lets you apply any function (in this case, one acting on both columns at once) to each row.
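A fully vectorized variant is also possible; here is a minimal sketch, assuming the same df as above (the final per-row slice still needs a comprehension, since string slicing with per-row bounds has no vectorized form):
import pandas as pd

cols = ['string1', 'string2']
# Leading/trailing 'T' counts per column, then the per-row maximum,
# so both strings get trimmed by the same amounts
head = pd.DataFrame({c: df[c].str.len() - df[c].str.lstrip('T').str.len()
                     for c in cols}).max(axis=1)
tail = pd.DataFrame({c: df[c].str.len() - df[c].str.rstrip('T').str.len()
                     for c in cols}).max(axis=1)
stop = df['string1'].str.len() - tail
for c in cols:
    df[c] = [s[h:t] for s, h, t in zip(df[c], head, stop)]
print(df)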
raw_data = {'name': ['TWillard MorrisT', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
            'age': [20, 19, 22, 21],
            'color': ['TblueT', 'redT', 'yellow', 'green'],
            'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, columns=['name', 'age', 'color', 'grade'])
print(df)
cols = ['name', 'color']
print("new df")
# following line does the magic
df[cols] = df[cols].apply(lambda row: row.str.lstrip('T').str.rstrip('T'), axis=1)
print(df)
Will print:
               name  age   color  grade
0  TWillard MorrisT   20  TblueT     88
1       Al Jennings   19    redT     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70
new df
               name  age   color  grade
0    Willard Morris   20    blue     88
1       Al Jennings   19     red     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70

Remove column index from dataframe

I extracted multiple dataframes from an excel sheet by passing coordinates (start & end).
I use the below function to extract according to the coordinates, but when I try to
convert the result into a dataframe, I'm not sure where the index values showing up as columns come from.
I want to remove these index columns and make the 2nd row the column headers; this is my dataframe:
0 1 2 3 4 5 6
Cols/Rows A A2 B B2 C C2
0 A 50 50 150 150 200 200
1 B 200 200 250 300 300 300
2 C 350 500 400 400 450 450
def extract_dataframes(sheet):
    ws = sheet['pivots']
    cordinates = [('A1', 'M8'), ('A10', 'Q17'), ('A19', 'M34'), ('A36', 'Q51')]
    multi_dfs_list = []
    for i in cordinates:
        data_rows = []
        for row in ws[i[0]:i[1]]:
            data_cols = []
            for cell in row:
                data_cols.append(cell.value)
            data_rows.append(data_cols)
        multi_dfs_list.append(data_rows)
    multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
    return multi_dfs
I tried to delete the index but it's not working.
Note: when I say
>>> multi_dfs[0].columns # first dataframe
RangeIndex(start=0, stop=13, step=1)
Change
multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
to
multi_dfs = {i: pd.DataFrame(df[1:], columns=df[0]) for i, df in enumerate(multi_dfs_list)}
From the Docs,
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
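A minimal sketch of the effect, with a small hypothetical data_rows list (the first row holds the intended header):
import pandas as pd

data_rows = [['Cols/Rows', 'A', 'A2'],
             ['A', 50, 50],
             ['B', 200, 200]]
# Use the first row as the column labels and the rest as the data
df = pd.DataFrame(data_rows[1:], columns=data_rows[0])
print(df)
#   Cols/Rows    A   A2
# 0         A   50   50
# 1         B  200  200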
I think you need:
df = pd.read_excel(file, skiprows=1)

Read all lines of csv file using .read_csv

I am trying to read a simple csv file using pandas but I can't figure out how to avoid "losing" the first row.
For example:
my_file.csv
Looks like this:
45
34
77
But when I try to read it:
In [18]: import pandas as pd
In [19]: df = pd.read_csv('my_file.csv', header=False)
In [20]: df
Out[20]:
   45
0  34
1  77
[2 rows x 1 columns]
This is not what I am after, I want to have 3 rows. I want my DataFrame to look exactly like this:
In [26]: my_list = [45,34,77]
In [27]: df = pd.DataFrame(my_list)
In [28]: df
Out[28]:
    0
0  45
1  34
2  77
[3 rows x 1 columns]
How can I use .read_csv to get the result I am looking for?
Yeah, this is a bit of a UI problem. We should handle False; right now it thinks you want the header on row 0 (== False). Use None instead:
>>> df = pd.read_csv("my_file.csv", header=False)
>>> df
   45
0  34
1  77
>>> df = pd.read_csv("my_file.csv", header=None)
>>> df
    0
0  45
1  34
2  77
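And if you also want a meaningful column name, passing names explicitly behaves like header=None, so the first row is kept as data (a small sketch, assuming the same file):
>>> df = pd.read_csv("my_file.csv", names=["value"])
>>> df
   value
0     45
1     34
2     77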
