Sample file:
03|02|2|02|F|3|47|P| |AG|AFL|24|20201016| 1 |West |CH|India - LA |CNDO
Code:
df1 = pd.read_csv("GM3.txt",sep="|",dtype=object)
df1.to_csv('file_validation.csv',index=None)
Output in the CSV:
3 2 2 2 F 3 47 P AG AFL 24 20201016 1 West CH India - LA CNDO 302
When I print df1.to_csv() it gives me the output below:
0 03 02 2 CH India - LA CNDO
I want the CSV to store the values as strings, i.e. 03,02 instead of the integers 3,2.
Your code works for me:
import pandas as pd
df1 = pd.read_csv("GM3.txt",sep="|",dtype=object)
df1.to_csv('file_validation.csv',index=None)
produces a CSV with the leading zeros preserved.
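If the zeros still look stripped on your end, check how you are viewing file_validation.csv: Excel re-infers column types when opening a CSV and will display 03 as 3 even though the file itself contains the string. Reading the file back as plain text avoids that. A minimal check (a sketch, assuming the sample line above is saved as GM3.txt; header=None is an assumption because the sample has no header row):
import pandas as pd
df1 = pd.read_csv("GM3.txt", sep="|", dtype=str, header=None)
df1.to_csv('file_validation.csv', index=None)
# Inspect the raw text that was actually written
with open('file_validation.csv') as f:
    print(f.read())  # the leading zeros in 03 and 02 are still there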
I have a string:
str="Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
I want to convert this string to a pandas DataFrame with KEY and AGE as the column names.
The key and age values come in pairs.
How can I achieve this conversion?
You can try a regex:
import re
import pandas as pd
s = "Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
df = pd.DataFrame(
    zip(re.findall(r'Key=([^,\s]+)', s, re.IGNORECASE),
        re.findall(r'age=([^,\s]+)', s, re.IGNORECASE)),
    columns=['key', 'age'])
df
key age
0 xxxx 11
1 yyyy 22
2 zzzz 01
3 qqqq 21
4 wwwww 91
5 pppp 22
Use a regex that finds all key/age pairs, "key=(\w+)\s*,\s*age=(\w+)", then use the matches to build the DataFrame:
import re
import pandas as pd
content = "Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
pat = re.compile(r"key=(\w+)\s*,\s*age=(\w+)", flags=re.IGNORECASE)
values = pat.findall(content)
df = pd.DataFrame(values, columns=['key', 'age'])
print(df)
# - - - - -
key age
0 xxxx 11
1 yyyy 22
2 zzzz 01
3 qqqq 21
4 wwwww 91
5 pppp 22
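Either way, findall returns strings; if age should be numeric, cast it afterwards (a sketch, noting that '01' becomes 1):
df['age'] = df['age'].astype(int)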
I want to convert a text file to an Excel file without deleting the spaces on each line.
Note that the number of columns is the same for every line of the file.
The text file follows this format:
First row:
05100079 0000001502 5 01 2 070 1924 02 06 1994 C508 2 8500 3 8500 3 3 1 1 012 10 0 98 00 4 8 8 9 0 40 01 2 15 26000 1748 C508 116 102 3 09 98 013 1 1 0 1 10 10 0 09003 50060 50060 0 0 369 99 9 1 4 4 5 8 0 0181 1 80 00 01 0 9 9 8 1 0 00 00 020 0
Second row:
05100095 0000001502 2 01 2 059 1917 02 03 1977 C504 2 8500 3 8500 3 9 1 1 54-11-0999-00 2 9 0 90 01 2 12 26000 1744 C504 116 102 3 09 98 013 1 1 0 2 0 09011 50060 50060 0 36 9 9 1 9 9 5 8 0 3161 9 9 8 020 0
How do I edit the code so that it converts the text file to an Excel file without deleting the spaces between the data?
The code below collapses the spaces in each line.
I mean to convert the file to an Excel sheet without any modification to the original data:
the spaces stay spaces and all other data keeps the same format.
import xlwt

book = xlwt.Workbook()
ws = book.add_sheet('First Sheet')  # Add a sheet
f = open('testval.txt', 'r+')
data = f.readlines()  # read all lines at once
for i in range(len(data)):
    row = data[i].split()  # split() collapses the whitespace between fields
    for j in range(len(row)):
        ws.write(i, j, row[j])  # Write to cell i, j
book.save('testval' + '.xls')
f.close()
Expected output:
An Excel file in the same format as the original text file.
If you have fixed-length fields, you need to split each line using index intervals.
For instance, you can do:
import io
import xlwt

book = xlwt.Workbook()
ws = book.add_sheet('First Sheet')  # Add a sheet
with io.open("testval.txt", mode="r", encoding="utf-8") as f:
    for row_idx, row in enumerate(f):
        row = row.rstrip()
        ws.write(row_idx, 0, row[0:8])
        ws.write(row_idx, 1, row[9:19])
        ws.write(row_idx, 2, row[20:21])
        ws.write(row_idx, 3, row[22:24])
        # and so on...
book.save("sample.xls")  # xlwt writes the legacy .xls format, not .xlsx
You get each field in its own cell, with the spacing-based layout preserved as columns.
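If the intervals are regular, an alternative worth considering (a sketch, not part of the original answer; the column boundaries below are illustrative and must be adapted to the real layout) is pandas.read_fwf, which parses fixed-width files directly:
import pandas as pd

# Illustrative colspecs matching the intervals used above; extend as needed
colspecs = [(0, 8), (9, 19), (20, 21), (22, 24)]
df = pd.read_fwf("testval.txt", colspecs=colspecs, header=None, dtype=str)
# Writing .xlsx requires an Excel engine such as openpyxl
df.to_excel("sample.xlsx", index=False, header=False)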
I would like to replace the null values of stadium attendance (affluence in French) with their means, so I compute the mean by season/team:
test = data.groupby(['season','domicile']).agg({'affluence':'mean'})
This code works and gives me what I want (data is a DataFrame):
affluence
season domicile
1999 AS Monaco 10258.647059
AS Saint-Etienne 27583.375000
FC Nantes 28334.705882
Girondins de Bordeaux 30084.941176
Montpellier Hérault SC 13869.312500
Olympique Lyonnais 35453.941176
Olympique de Marseille 51686.176471
Paris Saint-Germain 42792.647059
RC Strasbourg Alsace 19845.058824
Stade Rennais FC 13196.812500
2000 AS Monaco 8917.937500
AS Saint-Etienne 26508.750000
EA Guingamp 13056.058824
FC Nantes 31913.235294
Girondins de Bordeaux 29371.588235
LOSC 16793.411765
Olympique Lyonnais 34564.529412
Olympique de Marseille 50755.176471
Paris Saint-Germain 42716.823529
RC Strasbourg Alsace 13664.875000
Stade Rennais FC 19264.062500
Toulouse FC 19926.294118
....
So now I would like to filter on the season and the team, for example test[test.season == 1999]. However, this doesn't work, because the only column is 'affluence'. It gives me the error:
'DataFrame' object has no attribute 'season'
I tried:
test = data[['season','domicile','affluence']].groupby(['season','domicile']).agg({'affluence':'mean'})
which gives the same result as above. So I thought of maybe indexing by season/team, but how? And once it is indexed, how do I access it?
Thanks
Doing test = data.groupby(['season','domicile'], as_index=False).agg({'affluence':'mean'}) should do the trick.
The parameter as_index=False is particularly useful when you do not want to deal with MultiIndexes.
Example:
import pandas as pd
data = {
'A' : [0, 0, 0, 1, 1, 1, 2, 2, 2],
'B' : list('abcdefghi')
}
df = pd.DataFrame(data)
print(df)
# A B
# 0 0 a
# 1 0 b
# 2 0 c
# 3 1 d
# 4 1 e
# 5 1 f
# 6 2 g
# 7 2 h
# 8 2 i
grp_1 = df.groupby('A').count()
print(grp_1)
# B
# A
# 0 3
# 1 3
# 2 3
grp_2 = df.groupby('A', as_index=False).count()
print(grp_2)
# A B
# 0 0 3
# 1 1 3
# 2 2 3
After the groupby operation, the columns you grouped by become the index; you can access it via df.index (test.index in your case). Since you grouped by two columns, you created a MultiIndex. A detailed description of how to work with a MultiIndex can be found in the pandas documentation.
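For example, with the MultiIndex in place you can select directly with .loc (a short sketch using the names from the question):
test.loc[1999]                    # every team for season 1999
test.loc[(1999, 'AS Monaco')]     # a single season/team pair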
However, you can turn it back into a standard DataFrame, either with test.reset_index() or by building one explicitly (note that MultiIndex levels are read with get_level_values, not attribute access):
df = pd.DataFrame({
    'season': test.index.get_level_values('season'),
    'domicile': test.index.get_level_values('domicile'),
    'affluence': test['affluence'].values
})
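For the original goal of replacing missing attendance values with the per-season/per-team mean, a groupby/transform sketch (assuming data has the season, domicile and affluence columns used above):
means = data.groupby(['season', 'domicile'])['affluence'].transform('mean')
data['affluence'] = data['affluence'].fillna(means)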
I read in a pipe-separated CSV like this:
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv")
test.head()
This returns
SK_Country|"Number"|"Alpha2Code"|"Alpha3Code"|"CountryName"|"TopLevelDomain"
0 1|20|"ad"|"and"|"Andorra"|".ad"
1 2|4|"af"|"afg"|"Afghanistan"|".af"
2 3|28|"ag"|"atg"|"Antigua and Barbuda"|".ag"
3 4|660|"ai"|"aia"|"Anguilla"|".ai"
4 5|8|"al"|"alb"|"Albania"|".al"
When I try to extract specific columns from it, like below:
df = test[["Alpha3Code"]]
I get the following error:
KeyError: ['Alpha3Code'] not in index
I don't understand what goes wrong: I can see the value is in the CSV when I print the head, and likewise everything looks fine when I open the CSV.
I've googled around and read several posts on the issue here on Stack Overflow and tried different approaches, but nothing fixes this annoying problem.
Notice how everything is crammed into one string column? That's because you didn't pass the column delimiter to pd.read_csv, which in this case has to be '|':
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",
sep='|')
test.head()
# SK_Country Number Alpha2Code Alpha3Code CountryName \
# 0 1 20 ad and Andorra
# 1 2 4 af afg Afghanistan
# 2 3 28 ag atg Antigua and Barbuda
# 3 4 660 ai aia Anguilla
# 4 5 8 al alb Albania
#
# TopLevelDomain
# 0 .ad
# 1 .af
# 2 .ag
# 3 .ai
# 4 .al
As pointed out in the comment by @chrisz, you have to specify the delimiter:
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",delimiter='|')
test.head()
SK_Country Number Alpha2Code Alpha3Code CountryName \
0 1 20 ad and Andorra
1 2 4 af afg Afghanistan
2 3 28 ag atg Antigua and Barbuda
3 4 660 ai aia Anguilla
4 5 8 al alb Albania
TopLevelDomain
0 .ad
1 .af
2 .ag
3 .ai
4 .al
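With the delimiter in place, the selection from the question works as expected:
df = test[["Alpha3Code"]]
df.head()
#   Alpha3Code
# 0        and
# 1        afg
# 2        atg
# 3        aia
# 4        alb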
I have a rather large text file with multiple columns that I must convert to a 15-column .csv file to be read in Excel. The logic for parsing the fields I need is written out below, but I am having trouble writing the result to .csv.
columns = ['TRANSACTN_NBR', 'RECORD_NBR',
           'SEQUENCE_OR_PIC_NBR', 'CR_DB', 'RT_NBR', 'ACCOUNT_NBR',
           'RSN_COD', 'ITEM_AMOUNT', 'ITEM_SERIAL', 'CHN_IND',
           'REASON_DESCR', 'SEQ2', 'ARCHIVE_DATE', 'ARCHIVE_TIME', 'ON_US_IND']
for line in in_file:  # in_file, a, b, c, rtnbr and lines are defined earlier
    values = line.split()
    if 'PRINT DATE:' in line:
        dtevalue = line.split(a, 1)[-1].split(b)[0]
        lines.append(dtevalue)
    elif 'PRINT TIME:' in line:
        timevalue = line.split(c, 1)[-1].split(b)[0]
        lines.append(timevalue)
    elif (len(values) >= 4 and values[3] == 'C'
          and len(values[2]) >= 2 and values[2][:2] == '41'):
        print(values)
    elif (len(values) >= 5 and values[3] == 'D'
          and values[4] in rtnbr):
        on_us = '1'
    else:
        on_us = '0'
print(lines[0])
print(lines[1])
I originally tried the csv module, but the parsed rows were written out in 12 columns and I could not find a way to append the date and time (parsed separately) as extra columns on each row.
I also looked at the pandas package, but I have only seen ways to extract patterns, which wouldn't work with the parsing criteria established above.
Is there a way to write to CSV using the above criteria? Or do I have to scrap it and rewrite the code around a specific package?
Any help is appreciated.
EDIT: Text file sample:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
Desired output: keeping only the lines whose sequence number starts with 41 and whose CR/DB column is C, with the parsed date and time appended:
1293 83834 4100225908 C 538196620 9860890913 10 161.5 0 CREDIT 41 3-Aug-17 11:15:51
1294 83838 4100225911 C 538196620 25715845 10 138 0 CREDIT 41 3-Aug-17 11:15:51
Look at the pandas package, more specifically the DataFrame class. With a little cleverness you ought to be able to read your table using pandas.read_table(), which returns a DataFrame that you can write out with to_csv(): effectively a two-line solution. You'll need to look at the docs to find the parameters required to read your table format properly, but it should be a little easier than doing it manually.
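A sketch of that two-line idea (the whitespace separator and the file names are assumptions, and the report's banner and header lines would still have to be skipped or filtered out first):
import pandas as pd

# Column names from the question; 'report.txt' is a placeholder filename
columns = ['TRANSACTN_NBR', 'RECORD_NBR', 'SEQUENCE_OR_PIC_NBR', 'CR_DB',
           'RT_NBR', 'ACCOUNT_NBR', 'RSN_COD', 'ITEM_AMOUNT', 'ITEM_SERIAL',
           'CHN_IND', 'REASON_DESCR', 'SEQ2', 'ARCHIVE_DATE', 'ARCHIVE_TIME',
           'ON_US_IND']

# sep=r'\s+' splits on runs of whitespace
df = pd.read_table('report.txt', sep=r'\s+', names=columns)
df.to_csv('output.csv', index=False)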