The printed result has a serial number (index) column; how do I remove it?
import pandas as pd
data = pd.read_csv("G:/jeri/1.csv", usecols=['Age'])
print(data)
f = open(r'G:/hello.txt', 'w')
print(data, file=f)
f.close()  # close the file so the output is flushed to disk
Opening the output .txt file gives this result:
Age
0 24
1 29
2 32
3 23
4 58
5 42
6 37
7 42
8 51
I want to get rid of the 0, 1, 2, ..., 8 column. I'm a beginner; how do I delete it?
If you just want the single column, you could also use to_csv() to write the file. For example:
import pandas as pd
df = pd.read_csv("1.csv", usecols=['Age'])
df.to_csv("hello.txt", index=False)
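If you would rather keep writing a .txt file with print, here is a minimal sketch (reusing the data variable and path from the question) that drops the index from the text output:
with open(r'G:/hello.txt', 'w') as f:
    # to_string(index=False) omits the 0..8 row labels from the plain-text output
    print(data.to_string(index=False), file=f)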
I have a text file called Orbit 1 and I need help opening it and then creating three separate arrays. I'm new to Python and have been having difficulty with this aspect. Here are the first few rows of my text file. There are 1112 rows including the header.
Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude
2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898
2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197
2019 3 17 5 56 5 0 109.1078294 50.8478274 412.850563
2019 3 17 5 56 10 0 109.2280101 51.15904424 412.9640113
2019 3 17 5 56 15 0 109.3500969 51.47006828 413.0772319
2019 3 17 5 56 20 0 109.4741362 51.78089533 413.1901358
2019 3 17 5 56 25 0 109.6001758 52.09152105 413.3025291
2019 3 17 5 56 30 0 109.728265 52.40194099 413.414457
2019 3 17 5 56 35 0 109.8584548 52.71215052 413.5259984
2019 3 17 5 56 40 0 109.9907976 53.02214489 413.6371791
I want to open this text file and create three arrays called lat[N], long[N], and time[N], where N is the number of rows in the file. I ultimately want to be able to look up the latitude, longitude, and time at any point. For example, lat[0] should return 50.22483151 if everything works properly. In addition, the time needs to be converted to decimal hours before building its array.
Essentially I need help with opening this text file I have and then creating the three arrays.
I've tried the method below for opening the file, but I get stuck when trying to build the arrays, and I think I may not be reading the file correctly.
import numpy as np
file_name = 'C:\\Users\\Saman\\OneDrive\\Documents\\Orbit 1.txt'
data = []
with open(file_name) as file:
    next(file)
    for line in file:
        row = line.split()
        row = [float(x) for x in row]
        data.append(row)
The easiest way to solve your problem is to use Pandas:
import pandas as pd
df = pd.read_table('Orbit 1.txt', sep=r'\s+')
df['Longitude']
#0 108.873007
#1 108.989510
#2 109.107829
#3 109.228010
#4 109.350097
#5 109.474136
#6 109.600176
#7 109.728265
#8 109.858455
#9 109.990798
Once you get a Pandas DataFrame, you may want to use it for the rest of the data processing, too.
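For example, here is a minimal sketch (assuming the column names shown above) that builds the three arrays the question asks for, with the time converted to decimal hours:
import pandas as pd
df = pd.read_table('Orbit 1.txt', sep=r'\s+')
lat = df['Latitude'].to_numpy()
long = df['Longitude'].to_numpy()
# decimal hours from the Hour/Minute/Second/Millisecond columns
time = (df['Hour'] + df['Minute'] / 60 + (df['Second'] + df['Millisecond'] / 1000) / 3600).to_numpy()
print(lat[0])  # 50.22483151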
file_name = 'info.txt'
Lat = []
Long = []
Time = []
left_justified = lambda x: x + " " * (19 - len(x))
right_justified = lambda x: " " * (19 - len(x)) + x
with open(file_name) as file:
    next(file)
    for line in file:
        data = line.split()
        Lat.append(data[8])
        Long.append(data[7])
        hrs = int(data[3])
        minutes = int(data[4])
        secs = int(data[5])
        total_secs = secs + minutes * 60 + hrs * 3600
        Time.append(total_secs / 3600)
print(left_justified("Time"), left_justified("Lat"), left_justified("Long"))
for i in range(len(Lat)):
    print(left_justified(str(Time[i])), left_justified(Lat[i]), left_justified(Long[i]))
Try this
I scraped data and exported it as CSV files.
For simplicity, the data look like the example below (I intentionally used arbitrary variables just for illustration):
id var1 var2 var3 ...
A 10 14 355 ...
B 35 56 22 ...
C 95 22 222 ...
D 44 55 222 ...
Since I collected the data daily, I saved my file name as city_20180814_result.csv
For example, if I collected the data in NYC at Aug 14th 2018, the corresponding file name is NYC_20180814_result.csv
Here, I want to add a new column, the date variable, into each csv file.
The desired result looks like the example below. To be specific, I want to add a date column (in YYYYMMDD format) to each csv file, with its values set to the date the data were collected. For example, if the csv file was generated on Aug 14th 2018, the updated data would look like this:
id date var1 var2 var3 ...
A 20180814 10 14 355 ...
B 20180814 35 56 22 ...
C 20180814 95 22 222 ...
D 20180814 44 55 222 ...
The manual way to do this is to open every csv file, add the new column, assign the corresponding date to all rows, and repeat for every file. But there are too many files for that. Is there any way to do this efficiently? Since the file names include the date, it would be good to make use of that if possible. Any help or code (in Python again, or an Excel macro) would be appreciated.
My solution using python's pandas package:
import os
import re
import pandas as pd
FILE_PATTERN = re.compile(r'(.*)_(\d{8})_result\.csv')
def addDate(file_dir):
    csv_list = [csvfile for csvfile in os.listdir(file_dir) if re.fullmatch(FILE_PATTERN, csvfile)]
    for csvname in csv_list:
        date = re.fullmatch(FILE_PATTERN, csvname).group(2)
        df = pd.read_csv(os.path.join(file_dir, csvname))
        df.insert(loc=1, column='date', value=[date] * len(df))
        df.to_csv(os.path.join(file_dir, csvname), index=False)
Sample input: NYC_20180814_result.csv in some_path:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Same csv after executing addDate(some_path):
A date B C
0 0 20180814 1 2
1 3 20180814 4 5
2 6 20180814 7 8
P.S. You won't see the index column in your csv file, since index=False is passed to to_csv.
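A hypothetical call, assuming the scraped files sit in a folder named data:
addDate('data')  # rewrites every city_YYYYMMDD_result.csv in place, adding the date column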
I work with Series and DataFrames on the terminal a lot. The default __repr__ for a Series returns a reduced sample, with some head and tail values, but the rest missing.
Is there a builtin way to pretty-print the entire Series / DataFrame? Ideally, it would support proper alignment, perhaps borders between columns, and maybe even color-coding for the different columns.
You can also use the option_context, with one or more options:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df)
This will automatically return the options to their previous values.
If you are working in a Jupyter notebook, using display(df) instead of print(df) will use Jupyter's rich display logic.
No need to hack settings. There is a simple way:
print(df.to_string())
Sure, if this comes up a lot, make a function like this one. You can even configure it to load every time you start IPython: https://ipython.org/ipython-doc/1/config/overview.html
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
As for coloring, getting too elaborate with colors sounds counterproductive to me, but I agree something like bootstrap's .table-striped would be nice. You could always create an issue to suggest this feature.
After importing pandas, as an alternative to using the context manager, set such options for displaying entire dataframes:
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # or 199
For the full list of useful options, see:
pd.describe_option('display')
Use the tabulate package:
pip install tabulate
And consider the following example usage:
import pandas as pd
from io import StringIO
from tabulate import tabulate
c = """Chromosome Start End
chr1 3 6
chr1 5 7
chr1 8 9"""
df = pd.read_table(StringIO(c), sep=r"\s+", header=0)
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------------+---------+-------+
| | Chromosome | Start | End |
|----+--------------+---------+-------|
| 0 | chr1 | 3 | 6 |
| 1 | chr1 | 5 | 7 |
| 2 | chr1 | 8 | 9 |
+----+--------------+---------+-------+
Using pd.options.display
This answer is a variation of the prior answer by lucidyan. It makes the code more readable by avoiding the use of set_option.
After importing pandas, as an alternative to using the context manager, set such options for displaying large dataframes:
def set_pandas_display_options() -> None:
    """Set pandas display options."""
    # Ref: https://stackoverflow.com/a/52432757/
    display = pd.options.display
    display.max_columns = 1000
    display.max_rows = 1000
    display.max_colwidth = 199
    display.width = 1000
    # display.precision = 2  # set as needed
set_pandas_display_options()
After this, you can use either display(df) or just df if using a notebook, otherwise print(df).
Using to_string
Pandas 0.25.3 has DataFrame.to_string and Series.to_string methods, which accept formatting options.
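For example, a minimal sketch (assuming a DataFrame df with float columns):
# index=False hides the row labels; float_format controls how floats are rendered
print(df.to_string(index=False, float_format='{:.2f}'.format))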
Using to_markdown
If what you need is markdown output, Pandas 1.0.0 has DataFrame.to_markdown and Series.to_markdown methods.
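A one-line sketch (to_markdown depends on the optional tabulate package mentioned above):
print(df.to_markdown())  # Series.to_markdown() works the same way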
Using to_html
If what you need is HTML output, Pandas 0.25.3 does have a DataFrame.to_html method but not a Series.to_html. Note that a Series can be converted to a DataFrame.
If you are using IPython Notebook (Jupyter), you can use HTML:
from IPython.core.display import HTML
display(HTML(df.to_html()))
Try this
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
datascroller was created in part to solve this problem.
pip install datascroller
It loads the dataframe into a terminal view you can "scroll" with your mouse or arrow keys, kind of like an Excel workbook at the terminal that supports querying, highlighting, etc.
import pandas as pd
from datascroller import scroll
# Call `scroll` with a Pandas DataFrame as the sole argument:
my_df = pd.read_csv('<path to your csv>')
scroll(my_df)
Disclosure: I am one of the authors of datascroller
Scripts
Nobody has proposed this simple plain-text solution:
from pprint import pprint
pprint(s.to_dict())
which produces results like the following:
{'% Diabetes': 0.06365372374283895,
'% Obesity': 0.06365372374283895,
'% Bachelors': 0.0,
'% Poverty': 0.09548058561425843,
'% Driving Deaths': 1.1775938892425206,
'% Excessive Drinking': 0.06365372374283895}
Jupyter Notebooks
Additionally, when using Jupyter notebooks, this is a great solution.
Note: pd.Series has no .to_html(), so it must first be converted to a pd.DataFrame:
from IPython.display import display, HTML
display(HTML(s.to_frame().to_html()))
which renders the Series as an HTML table in the notebook.
You can set expand_frame_repr to False:
display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames
across multiple lines, max_columns is still respected, but the output
will wrap-around across multiple “pages” if its width exceeds
display.width.
[default: True]
pd.set_option('expand_frame_repr', False)
For more details read How to Pretty-Print Pandas DataFrames and Series
This may help you. Just run this:
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
Output:
Column
0 row 0
1 row 1
2 row 2
3 row 3
4 row 4
5 row 5
6 row 6
7 row 7
8 row 8
9 row 9
10 row 10
11 row 11
12 row 12
13 row 13
14 row 14
15 row 15
16 row 16
17 row 17
18 row 18
19 row 19
20 row 20
21 row 21
22 row 22
23 row 23
24 row 24
25 row 25
26 row 26
27 row 27
28 row 28
29 row 29
30 row 30
31 row 31
32 row 32
33 row 33
34 row 34
35 row 35
36 row 36
37 row 37
38 row 38
39 row 39
40 row 40
41 row 41
42 row 42
43 row 43
44 row 44
45 row 45
46 row 46
47 row 47
48 row 48
49 row 49
50 row 50
51 row 51
52 row 52
53 row 53
54 row 54
55 row 55
56 row 56
57 row 57
58 row 58
59 row 59
60 row 60
61 row 61
62 row 62
63 row 63
64 row 64
65 row 65
66 row 66
67 row 67
68 row 68
69 row 69
You can achieve this using the method below. Just pass the total number of columns in the DataFrame as the argument to 'display.max_columns'.
For example:
df = pd.DataFrame(...)  # your DataFrame
with pd.option_context('display.max_rows', None, 'display.max_columns', df.shape[1]):
    print(df)
Try using the display() function. It automatically adds horizontal and vertical scroll bars, so you can display different datasets easily instead of using print().
display(dataframe)
display() also supports proper alignment.
However, if you want to make the output nicer, you can check pd.option_context(); it has a lot of options for displaying the dataframe clearly.
Note - I am using Jupyter Notebooks.
Below is the CSV File that I have:
Record Time Value 1 Value 2 Value 3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13
I would like to use pandas in order to parse through the 'Record' Column until I find the first FFC and then print that entire row. Additionally, I would like to print the row that is two above the first found FFC. Any suggestions on how to approach this?
My reasoning for wanting to use Pandas is that I am going to need to call upon specific values within the two printed rows and plot them.
To start I have:
csvfile = pd.read_csv('Test.csv')
print(csvfile)
Thank you very much for your assistance, it is greatly appreciated!
This is one way.
import pandas as pd
from io import StringIO
mystr = StringIO("""Record Time Value1 Value2 Value3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr, delim_whitespace=True)
# get index of condition
idx = df[df['Record'] == 'FFC'].index[0]
# filter for appropriate indices
res1 = df.loc[idx]
res2 = df.loc[idx-2]
To output a dataframe:
print(res1.to_frame().T)
# Record Time Value1 Value2 Value3
# 3 FFC 4 12 89 94
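The row two above the first FFC is already in res2, and individual values can then be pulled out by label for plotting, for example:
print(res2.to_frame().T)
#   Record  Time  Value1  Value2  Value3
# 1  Event     2      48      43      56
t = res1['Time']    # 4
v = res1['Value1']  # 12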
I have a csv file
(I am showing the first three rows here)
HEIGHT,WEIGHT,AGE,GENDER,SMOKES,ALCOHOL,EXERCISE,TRT,PULSE1,PULSE2,YEAR
173,57,18,2,2,1,2,2,86,88,93
179,58,19,2,2,1,2,1,82,150,93
I am using pandas read_csv to read the file and put the values into columns.
Here is my code:
import pandas as pd
import os
path='~/Desktop/pulse.csv'
path=os.path.expanduser(path)
my_data = pd.read_csv(path, index_col=False, header=None, quoting=3, delimiter=',')
print(my_data)
The problem is the first and last columns have " before and after the values.
Additionally, I can't get rid of the indexes. I might be making some silly mistake, but thank you in advance for your help.
Final solution: use replace with conversion to ints, and use str.strip to remove the " characters from the column names:
df = pd.read_csv('pulse.csv', quoting=3)
df = df.replace('"','', regex=True).astype(int)
df.columns = df.columns.str.strip('"')
print(df.head())
HEIGHT WEIGHT AGE GENDER SMOKES ALCOHOL EXERCISE TRT PULSE1 \
0 173 57 18 2 2 1 2 2 86
1 179 58 19 2 2 1 2 1 82
2 167 62 18 2 2 1 1 1 96
3 195 84 18 1 2 1 1 2 71
4 173 64 18 2 2 1 3 2 90
PULSE2 YEAR
0 88 93
1 150 93
2 176 93
3 73 93
4 88 93
index_col=False forces pandas not to read the first column as the index, but a DataFrame always needs an index, so a default one (0, 1, 2, ...) is added. This parameter can therefore be omitted here.
header=None should be removed, because it stops pandas from using the first row (the csv header) as the column names. The header row is then read as data, and the numeric values in those columns are converted to strings.
delimiter=',' should be removed too, because it is the same as sep=',', which is the default.
@jezrael is right: a pandas dataframe will always add an index. It's necessary.
Try something like df[0] = df[0].str.strip('"'), replacing 0 with the last column, to remove the stray quote characters.
Before you do so, read your csv into a dataframe, e.g. with pd.read_csv(path) (pd.DataFrame.from_csv is deprecated).
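A hypothetical sketch of that suggestion, assuming only the HEIGHT and YEAR columns carry the stray quote characters:
import pandas as pd
df = pd.read_csv('pulse.csv', quoting=3)  # csv.QUOTE_NONE keeps the literal " characters
df.columns = df.columns.str.strip('"')    # clean the quotes from the header names
for col in ('HEIGHT', 'YEAR'):            # hypothetical: the quoted first and last columns
    df[col] = df[col].str.strip('"').astype(int)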