I want to read a csv file into a Jupyter Notebook with Python's pandas library.
I have uploaded the .csv file into the notebook and written some code, but I don't think my dataframe displays correctly.
This is the code for reading the file:
df = pd.read_csv('text analysis.csv')
print(df)
And my output, when I print that dataframe looks like this:
avg title word len. avg text word len. avg sent. len. \
0 5.20 4.27 11.00
1 4.69 4.98 26.20
2 5.50 4.53 21.62
3 4.82 4.42 15.10
4 6.40 5.07 36.50
... ... ... ...
34205 4.29 4.96 24.60
34206 4.67 4.58 13.00
34207 4.92 5.08 26.79
34208 4.09 4.72 22.23
34209 4.75 5.76 18.38
I have seen much better representations in Jupyter Notebook, with all cells shown. This looks worse than when I print the dataframe in IDLE.
Try using display() instead of print() and check the result.
Also add this to your code:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # use None; -1 is no longer accepted in newer pandas
This sets it to display the entire dataframe.
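Putting both suggestions together, a minimal sketch for the notebook (assuming the same file name as above):
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # None means "no limit"

df = pd.read_csv('text analysis.csv')
display(df)  # display() is built into Jupyter/IPython, no import needed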
How can I combine multiple txt files into one merged file, where each file contains a different number of columns (usually with float values)? I need to get one merged file with all the columns, as follows:
EDIT:
There is one rule: in case there is a non-numeric value ("Nan", for example), I need to pad it with the last numeric value that came before it.
file1.txt
1.04
2.26
3.87
file2.txt
5.44 4.65 9.86
8.67 Nan 7.45
8.41 6.54 6.21
file3.txt
6.98 6.52
4.45 8.74
0.58 4.12
merged.txt
1.04 5.44 4.65 9.86 6.98 6.52
2.26 8.67 8.67 7.45 4.45 8.74
3.87 8.41 6.54 6.21 0.58 4.12
I saw an answer here for the case of one column in each file.
How can I do this for multiple columns?
The simplest way is probably using numpy:
import numpy as np

filenames = ["file1.txt", "file2.txt", "file3.txt"]
fmt = '%.2f'  # assuming format is known in advance

all_columns = []
for filename in filenames:
    all_columns.append(np.genfromtxt(filename))
arr_out = np.column_stack(tuple(all_columns))  # Stack columns

# Fill NaN elements with the last numeric value
arr_1d = np.ravel(arr_out)  # "flat reference" to arr_out
nan_indices = np.where(np.isnan(arr_1d))
while len(nan_indices[0]):
    new_indices = tuple([i - 1 for i in nan_indices])
    arr_1d[nan_indices] = arr_1d[new_indices]
    nan_indices = np.where(np.isnan(arr_1d))

np.savetxt("merged.txt", arr_out, fmt=fmt)
One problem (if it is one for you) that might occur is that the very first, i.e. the upper-left element, is non-numeric. In that case, the last (lower-right) value or the last numeric value before that would be used.
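For comparison, a rough pandas-based sketch of the same idea; it assumes the files are whitespace-separated, that "Nan" marks the missing entries, and that all files have the same number of rows (as in the example):
import pandas as pd

filenames = ["file1.txt", "file2.txt", "file3.txt"]

# Read each whitespace-separated file; "Nan" is not a default NA marker, so list it explicitly
frames = [pd.read_csv(fn, sep=r'\s+', header=None, na_values=['Nan']) for fn in filenames]
merged = pd.concat(frames, axis=1)  # stack the columns side by side

# Pad NaNs with the previous numeric value in row-major order, matching the flattened-array loop above
flat = pd.Series(merged.to_numpy().ravel()).ffill()
out = pd.DataFrame(flat.to_numpy().reshape(merged.shape))
out.to_csv('merged.txt', sep=' ', header=False, index=False, float_format='%.2f')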
I have a tab-delimited csv file with a dataset of 2 columns (time and value), both of type float. I have hundreds of these files from a piece of lab equipment. An example set is shown below.
3.64 1.22e-11
4.14 2.44e-11
4.64 1.22e-11
5.13 2.44e-11
5.66 1.22e-11
6.17 1.22e-11
6.67 2.44e-11
7.17 2.44e-11
7.69 1.22e-11
8.20 2.44e-11
8.70 1.22e-11
9.20 2.44e-11
9.72 2.44e-11
10.22 1.22e-11
10.72 1.22e-11
11.22 1.22e-11
11.72 1.22e-11
12.22 1.22e-11
12.70 -1.95e-10
13.22 -1.57e-09
13.73 -3.04e-09
14.25 -4.39e-09
14.77 -5.73e-09
15.28 -7.02e-09
15.80 -8.26e-09
16.28 -8.61e-09
16.83 -8.70e-09
17.31 -8.76e-09
17.81 -8.80e-09
18.31 -8.83e-09
18.83 -8.91e-09
19.33 -8.98e-09
19.84 -9.02e-09
20.34 -9.05e-09
20.84 -9.06e-09
21.34 -9.07e-09
21.88 -9.08e-09
22.39 -9.08e-09
22.89 -9.09e-09
23.39 -9.09e-09
23.89 -9.09e-09
24.41 -9.09e-09
I want to trim the data so that time (x, 1st column) resets to 0 when the value (y, 2nd column) starts to change, and also trim the rows after the value plateaus.
For the 1st derivative, if I use numpy.gradient I can see where the data changes, but I couldn't find a similar function in pandas.
Any suggestions?
Added: The output (done manually in Excel) will look like below, where (in this case) the first 18 rows and last 3 are removed. Both columns are reset by subtracting the first retained row's values from every row.
0.00 0.000000000000
0.52 -0.000000001375
1.03 -0.000000002845
1.55 -0.000000004195
2.07 -0.000000005535
2.58 -0.000000006825
3.10 -0.000000008065
3.58 -0.000000008415
4.13 -0.000000008505
4.61 -0.000000008565
5.11 -0.000000008605
5.61 -0.000000008635
6.13 -0.000000008715
6.63 -0.000000008785
7.14 -0.000000008825
7.64 -0.000000008855
8.14 -0.000000008865
8.64 -0.000000008875
9.18 -0.000000008885
9.69 -0.000000008885
10.19 -0.000000008895
What I have tried is using Python and pandas to differentiate and then remove rows where the derivative is 0, but that also removes data points that belong in the output I want.
dfT = df1[df1.dB != 0]
dfT = dfT[df1.dB >= 0]
dfT = dfT.dropna()
dfT = dfT.reset_index(drop=True)
dfT
Why not use what is already working, i.e. np.gradient, and put the result back into your dataframe? I am not able to create your final desired output, however, since it looks like you rely on more than just filtering out gradient == 0. I'm open to fixing it once the logic is a bit clearer to me.
### imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
### data
time = [3.64,4.14,4.64,5.13,5.66,6.17,6.67,7.17,7.69,8.2,8.7,9.2,9.72,10.22,10.72,11.22,11.72,12.22,12.7,13.22,13.73,14.25,14.77,15.28,15.8,16.28,16.83,17.31,17.81,18.31,18.83,19.33,19.84,20.34,20.84,21.34,21.88,22.39,22.89,23.39,23.89,24.41,21.88,22.39,22.89,23.39,23.89,24.41]
value = [0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000122,0.0000000000244,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000244,0.0000000000122,0.0000000000122,0.0000000000122,0.0000000000122,0.0000000000122,-0.000000000195,-0.00000000157,-0.00000000304,-0.00000000439,-0.00000000573,-0.00000000702,-0.00000000826,-0.00000000861,-0.0000000087,-0.00000000876,-0.0000000088,-0.00000000883,-0.00000000891,-0.00000000898,-0.00000000902,-0.00000000905,-0.00000000906,-0.00000000907,-0.00000000908,-0.00000000908,-0.00000000909,-0.00000000909,-0.00000000909,-0.00000000909,-0.00000000908,-0.00000000908,-0.00000000909,-0.00000000909,-0.00000000909,-0.00000000909]
### dataframe creation
# df = pd.read_csv('test.csv', names=["time", "value"])
df = pd.DataFrame({'time':time, 'value':value})
plt.plot(df.time,df.value)
Outputs: a line plot of value against time.
Next you can differentiate, and as you can see, within the first 18 rows you mentioned there are multiple points where the gradient is greater than 0:
df['gradient'] = np.gradient(df.value.values)
df
plt.plot(df.time,df.gradient)
Outputs: the dataframe with a gradient column, and a plot of the gradient against time.
Next, filter out the rows where there is no change and add a new time column:
### filter data where gradient is not 0 and add new time
df_filtered = df[df.gradient != 0].copy()  # .copy() avoids a SettingWithCopyWarning on the assignments below
df_filtered['time_difference'] = df_filtered.time.diff().fillna(0)
df_filtered['new_time'] = df_filtered['time_difference'].cumsum()
df_filtered.reset_index(drop=True,inplace=True)
df_filtered
Outputs: the filtered dataframe with the new_time column.
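I'm not sure this fully matches your trimming logic, but as a rough sketch of the extra step (assuming "starts to change" and "plateaus" both mean the absolute gradient falls below some tolerance, which is a guess and would need tuning for your data):
# Keep only the span where the gradient is non-negligible, then reset time to start at 0
tol = 1e-10  # hypothetical tolerance
changing = df['gradient'].abs() > tol
first, last = changing.idxmax(), changing[::-1].idxmax()  # first and last "changing" rows
trimmed = df.loc[first:last].copy()
trimmed['time'] = trimmed['time'] - trimmed['time'].iloc[0]
trimmed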
I have a dataframe that contains data collected every 0.01m down into the earth. Due to its high resolution the resulting size of the dataset is very large. Is there a way in pandas to downsample to 5m intervals thus reducing the size of the dataset?
RESULT (every 0.01 m)
Depth_m    value
1.34       31.7
1.35       31.7
1.36       31.7
1.37       31.9
1.38       31.9
1.39       31.9
1.40       31.9
...        ...
44.35      32.9
44.36      32.9
44.37      32.9
OUTCOME I WANT (every 5 m)
Depth_m    value
5.47       31.7
10.49      31.7
15.51      31.7
20.53      31.9
25.55      31.9
30.57      31.9
35.59      31.9
40.61      31.9
45.63      31.9
I have tried to use pandas.resample, but that seems to only work with time-series data. I think I understand what I must do, but I'm not sure how to do it in pandas. Basically, I need to work out the current sampling rate (in this case 0.01 m) and how many observations that gives every 5 m; then I can average the values over each group of observations, drop the extra rows, and repeat every 5 m.
You can use pandas' .iloc for selection by position, coupled with a slice object, to downsample. Some care must be taken to ensure you have integer step sizes rather than floats when converting from non-integer sample intervals (hence the use of astype("int")).
import numpy as np
import pandas as pd

sequence_interval = 0.01
downsampled_interval = 5
step_size = np.round(downsampled_interval / sequence_interval).astype("int")

df = pd.DataFrame(
    {
        "Depth_m": np.arange(131, 4438) / 100,
        "value": np.random.random(size=4307),
    }
)
downsampled_df = df.iloc[::step_size, :]
print(downsampled_df)
The result is
Depth_m value
0 1.31 0.357536
500 6.31 0.384327
1000 11.31 0.302109
1500 16.31 0.200971
2000 21.31 0.689973
2500 26.31 0.712869
3000 31.31 0.776306
3500 36.31 0.221901
4000 41.31 0.661378
There is no resample for integer values. As a workaround, you could round the Depth to the nearest 5 and use groupby to get the average Value for every 5m depth:
>>> df.groupby(df["Depth_m"].apply(lambda x: 5*round(x/5)))["Value"].mean()
Depth_m
0 34.256410
5 34.274549
10 34.564870
15 34.653307
20 34.630739
25 34.517034
30 34.584830
35 34.581162
40 34.620758
45 34.390374
Name: Value, dtype: float64
Input df:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame({"Depth_m": [i/100 for i in range(134, 4438)],
                   "Value": np.random.randint(30, 40, size=4304)})
I have a problem I've been struggling with for a few days, and I can't find my way out!
I have a folder with many CSVs, each containing two columns: "date" (YYYY-MM-DD) and "value" (a float). The dates are usually a range of consecutive days (but some days might be missing).
Now each of these CSVs starts from a different date.
I need to merge them into a single pandas DataFrame with "date" as the index and then several columns like "csv1_value", "csv2_value", "csv3_value", etc. I've done it with the 'merge' command on 'date', which means I have a DataFrame that contains only the rows where the same "date" was found across all the CSVs.
This is useful because some 'dates' in the range might actually be missing from a file, and in that case I need that date to be deleted from the DataFrame even if it's present in the other files.
BUT I would need the start of the range in the DataFrame to be the oldest date I have, and then, if that date is missing in the others (because they start later), have the value for that file be 0.
AND any date that is missing from one file's range should be filled with the latest value (useful to have 0.00 for any file starting later, until there's actually some value).
It's a bit complex, so I'll try an example:
csv1:
"2020-01-01","1.01"
"2020-01-02","2.01"
"2020-01-03","3.01"
"2020-01-04","4.01"
"2020-01-05","5.01"
"2020-01-06","6.01"
"2020-01-07","7.01"
"2020-01-08","8.01"
"2020-01-09","9.01"
"2020-01-10","10.01"
csv2:
"2020-01-04","4.02"
"2020-01-05","5.02"
"2020-01-06","6.02"
"2020-01-08","8.02"
"2020-01-09","9.02"
"2020-01-10","10.02"
csv3:
"2020-01-03","3.03"
"2020-01-04","4.03"
"2020-01-05","5.03"
"2020-01-06","6.03"
"2020-01-07","7.03"
"2020-01-09","9.03"
"2020-01-10","10.03"
The resulting DataFrame should be:
"2020-01-01","1.01","0.00","0.00"
"2020-01-02","2.01","0.00","0.00"
"2020-01-03","3.01","0.00","3.03"
"2020-01-04","4.01","4.02","4.03"
"2020-01-05","5.01","5.02","5.03"
"2020-01-06","6.01","6.02","6.03"
"2020-01-07","7.01","6.02","7.03"
"2020-01-08","8.01","8.02","7.03"
"2020-01-09","9.01","9.02","9.03"
"2020-01-10","10.01","10.02","10.03"
Does anyone have an idea how I could achieve all this? My head is exploding...
You can do this using two outer joins, then fill the NAs with zeros:
df1 = pd.read_csv('csv1')
df2 = pd.read_csv('csv2')
df3 = pd.read_csv('csv3')
DF = pd.merge(df1, df2, how='outer', on='date')
DF = pd.merge(DF, df3, how='outer', on='date')
DF.fillna(0, inplace=True)
My solution is designed to cope with an arbitrary number of input files (not only 3, as in the other solution).
Start by reading your input files, creating a list of DataFrames with proper names for the second column:
import glob
import pandas as pd

frames = []
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    frames.append(pd.read_csv(fn, parse_dates=[0], names=['Date', f'csv{i}_value']))
Then join them into a single DataFrame:
df = frames.pop(0)
while len(frames) > 0:
    df2 = frames.pop(0)
    df = df.join(df2.set_index('Date'), on='Date')
For now, from your sample files, you have:
Date csv1_value csv2_value csv3_value
0 2020-01-01 1.01 NaN NaN
1 2020-01-02 2.01 NaN NaN
2 2020-01-03 3.01 NaN 3.03
3 2020-01-04 4.01 4.02 4.03
4 2020-01-05 5.01 5.02 5.03
5 2020-01-06 6.01 6.02 6.03
6 2020-01-07 7.01 NaN 7.03
7 2020-01-08 8.01 8.02 NaN
8 2020-01-09 9.01 9.02 9.03
9 2020-01-10 10.01 10.02 10.03
And to get the result, run:
df = df.ffill().fillna(0.0)
The result is:
Date csv1_value csv2_value csv3_value
0 2020-01-01 1.01 0.00 0.00
1 2020-01-02 2.01 0.00 0.00
2 2020-01-03 3.01 0.00 3.03
3 2020-01-04 4.01 4.02 4.03
4 2020-01-05 5.01 5.02 5.03
5 2020-01-06 6.01 6.02 6.03
6 2020-01-07 7.01 6.02 7.03
7 2020-01-08 8.01 8.02 7.03
8 2020-01-09 9.01 9.02 9.03
9 2020-01-10 10.01 10.02 10.03
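As a side note, a more compact variant (a sketch, starting from the frames list before the while loop consumes it): index every frame by Date and concatenate side by side. Note that, unlike the join loop above (which keeps only dates present in the first file), this keeps dates found in any of the files.
df = pd.concat([f.set_index('Date') for f in frames], axis=1)
df = df.sort_index().ffill().fillna(0.0).reset_index()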
How to find possible errors
One of the things to check is whether the program finds the expected CSV files.
To check it, run:
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    print(i, fn)
and you should get a list of files found.
Another detail to check is whether your file names start with Input and have the csv extension. Maybe you should change Input*.csv to some other pattern?
Also try running my code in parts:
first the loop creating the list of DataFrames,
then check the size of this list, print some of the DataFrames
and invoke info() on them (make test printouts),
and after that run the second part of my code (the while loop).
If some error occurs, state in which instruction it occurred.
I have the following csv file:
RUN YR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Here, the column 'YR' has the values 2008, 2009...2013. However, there is no space between the values for YR and values for RUN. Because of this, when I try to read in the dataframe, it does not read the YR column correctly.
pandas.read_csv('file.csv', skipinitialspace=True, usecols=['YR','PMTE'], sep=' ')
The line above reads in the AP15 column instead of YR. How do I fix this?
It seems like your 'csv' is really a fixed-width format file. Sometimes these are accompanied by another file listing the size of each column, but maybe you aren't that lucky and have to count the column widths manually. You can read this file with pandas' fixed-width reading function:
df = pd.read_fwf('fixed_width.txt', widths=[4, 4, 8, 8])
In [7]: df
Out[7]:
RUN YR AP15 PMTE
0 1 2008 4.53 0.04
1 1 2009 3.17 0.26
2 1 2010 6.20 1.38
3 1 2011 5.38 3.55
4 1 2012 7.32 6.13
5 1 2013 4.39 9.40
In [8]: df.columns
Out[8]: Index(['RUN', 'YR', 'AP15', 'PMTE'], dtype='object')
There is an option to find the widths automatically but it probably requires at least a space between each column, as it doesn't seem to work here.
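For reference, that automatic detection is the default colspecs='infer' behaviour, so the call would simply be the following (and, as noted, it would likely mis-split the fused RUN/YR columns here):
df_auto = pd.read_fwf('fixed_width.txt')  # colspecs='infer' is the default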
One workaround for this would be to first make the RUN and YR columns into a single column in your csv. Example -
RUNYR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Then read the csv into a dataframe with RUNYR as a string column, and then slice the RUNYR column into two different columns using the pandas.Series.str.slice method. Example -
df = pd.read_csv('file.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
df['YR'] = df['RUNYR'].str.slice(1).astype(int)
df = df.drop('RUNYR',axis=1)
Demo -
In [21]: df = pd.read_csv('a.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
In [22]: df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
In [23]: df['YR'] = df['RUNYR'].str.slice(1).astype(int)
In [24]: df = df.drop('RUNYR',axis=1)
In [25]: df
Out[25]:
AP15 PMTE RUN YR
0 4.53 0.04 1 2008
1 3.17 0.26 1 2009
2 6.20 1.38 1 2010
3 5.38 3.55 1 2011
4 7.32 6.13 1 2012
5 4.39 9.40 1 2013
And then write this back to your csv using the .to_csv method (to fix your csv permanently).
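A minimal sketch of that final write-back step; the output file name and the space separator are assumptions:
df = df[['RUN', 'YR', 'AP15', 'PMTE']]  # restore the original column order (optional)
df.to_csv('file_fixed.csv', sep=' ', index=False)  # hypothetical output file name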