I was reading a very large CSV file (10 GB) with pandas, and read_csv(filename, chunksize=chunksize) returned an iterator (let's call it 'reader'). Now I want to jump to an exact chunk, because I only need a specific range of lines (for example, the CSV file has 1,000,000,000 lines and I want line 50,000,000 and the 1,000 lines after it). What can I do other than traversing the iterator until it reaches the chunk I want?
Here is my former code:
import pandas as pd

def get_lines_by_chunk(file_name, line_beg, line_end, chunk_size=-1):
    func_name = 'get_lines_by_chunk'
    line_no = get_file_line_no(file_name)  # helper: total number of lines in the file
    if chunk_size < 0:
        chunk_size = get_chunk_size(line_no, line_beg, line_end)  # helper: choose a chunk size
    reader = pd.read_csv(file_name, chunksize=chunk_size)
    data = pd.DataFrame({})
    flag = 0
    for chunk in reader:
        line_before = flag * chunk_size
        flag = flag + 1
        line_after = flag * chunk_size
        if line_beg >= line_before and line_beg <= line_after:
            if line_end >= line_after:
                temp = chunk[line_beg - line_before : chunk_size]
                data = pd.concat([data, temp], ignore_index=True)
            else:
                temp = chunk[line_beg - line_before : line_end - line_before]
                data = pd.concat([data, temp], ignore_index=True)
                return data
        elif line_end <= line_after and line_end >= line_before:
            temp = chunk[0 : line_end - line_before]
            data = pd.concat([data, temp], ignore_index=True)
            return data
        elif line_beg < line_before and line_end > line_after:
            temp = chunk[0 : chunk_size]
            data = pd.concat([data, temp], ignore_index=True)
    return data
If you need to read your CSV file with differently-sized chunks you can use iterator=True:
Assuming we have a 1000-row DataFrame (see the setup section at the end for how it was generated):
In [103]: reader = pd.read_csv(fn, iterator=True)
In [104]: reader.get_chunk(5)
Out[104]:
a b
0 1 8
1 2 28
2 3 85
3 4 56
4 5 29
In [105]: reader.get_chunk(3)
Out[105]:
a b
5 6 55
6 7 16
7 8 96
NOTE: get_chunk() can't skip data; it always reads forward from the current position, in whatever chunk sizes you request.
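So, to land on a later region with get_chunk() alone, you would have to read and throw away the earlier rows first. A minimal sketch, using the same 1000-row test file as in the setup below:

reader = pd.read_csv(fn, iterator=True)
reader.get_chunk(100)                  # read and discard the first 100 rows
rows_100_109 = reader.get_chunk(10)    # the next 10 rows, i.e. rows 100-109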
If you want to read only rows 100-110:
In [106]: cols = pd.read_csv(fn, nrows=1).columns.tolist()
In [107]: cols
Out[107]: ['a', 'b']
In [109]: pd.read_csv(fn, header=None, skiprows=100, nrows=10, names=cols)
Out[109]:
a b
0 100 52
1 101 15
2 102 74
3 103 10
4 104 35
5 105 73
6 106 48
7 107 49
8 108 1
9 109 56
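Applied to the numbers in the original question, a hedged sketch (here 'big.csv' is a placeholder file name, and the file is assumed to have a header row):

import pandas as pd

cols = pd.read_csv('big.csv', nrows=1).columns.tolist()    # grab the column names first
wanted = pd.read_csv('big.csv', header=None, names=cols,
                     skiprows=50_000_000 + 1,  # +1 so the header line is skipped as well
                     nrows=1001)               # line 50,000,000 plus the 1,000 lines after it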
But if you can use the HDF5 format, it will be much easier and faster:
let's save it as HDF5 first :
In [110]: df.to_hdf('c:/temp/test.h5', 'mydf', format='t', data_columns=True, complib='blosc', complevel=9)
now we can read it by index positions as follows:
In [113]: pd.read_hdf('c:/temp/test.h5', 'mydf', start=99, stop=109)
Out[113]:
a b
99 100 52
100 101 15
101 102 74
102 103 10
103 104 35
104 105 73
105 106 48
106 107 49
107 108 1
108 109 56
or querying (SQL like):
In [115]: pd.read_hdf('c:/temp/test.h5', 'mydf', where="a >= 100 and a <= 110")
Out[115]:
a b
99 100 52
100 101 15
101 102 74
102 103 10
103 104 35
104 105 73
105 106 48
106 107 49
107 108 1
108 109 56
109 110 23
Setup:
In [99]: df = pd.DataFrame({'a':np.arange(1, 1001), 'b':np.random.randint(0, 100, 1000)})
In [100]: fn = r'C:\Temp\test.csv'
In [101]: df.to_csv(fn, index=False)
In [102]: df.shape
Out[102]: (1000, 2)
Related
High D_HIGH D_HIGH_H
33 46.57 0 0L
0 69.93 42 42H
1 86.44 68 68H
34 56.58 83 83L
35 67.12 125 125L
2 117.91 158 158H
36 94.51 186 186L
3 120.45 245 245H
4 123.28 254 254H
37 83.20 286 286L
Each value in column D_HIGH_H ends in either L or H.
If there are two consecutive H rows, the one with the highest value in the High column has to be kept and the other ignored (deleted).
If there are two consecutive L rows, the one with the lowest value in the High column has to be kept and the other ignored (deleted).
If the sequence is H, L, H, L then no changes are to be made.
Output I want is as follows:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
I tried various options using list and map but it did not work out. I also tried groupby but reached no logical conclusion.
Here's one way:
# label consecutive runs of the same trailing letter (L/H)
g = ((l := df['D_HIGH_H'].str[-1]) != l.shift()).cumsum()

def f(x):
    # within a run, keep the row with the largest D_HIGH for 'H' and the smallest for 'L'
    if (x['D_HIGH_H'].str[-1] == 'H').any():
        return x.nlargest(1, 'D_HIGH')
    return x.nsmallest(1, 'D_HIGH')

df.groupby(g, as_index=False).apply(f)
Output:
High D_HIGH D_HIGH_H
0 33 46.57 0 0L
1 1 86.44 68 68H
2 34 56.58 83 83L
3 2 117.91 158 158H
4 36 94.51 186 186L
5 4 123.28 254 254H
6 37 83.20 286 286L
You can use extract to get the letter, then compute a custom group and groupby.apply with a function that depends on the letter:
# extract the trailing letter (L/H)
s = df['D_HIGH_H'].str.extract(r'(\D)$', expand=False)

# group by successive letters, then take the idxmin/idxmax depending on the letter
keep = (df['High']
        .groupby([s, s.ne(s.shift()).cumsum()], sort=False)
        .apply(lambda x: x.idxmin() if x.name[0] == 'L' else x.idxmax())
        .tolist()
       )
out = df.loc[keep]
Output:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
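Both answers rely on the same consecutive-run grouper: comparing each letter with its neighbour and taking a cumulative sum of the changes labels each run. A tiny illustration on a made-up series:

import pandas as pd

s = pd.Series(list('LHHLLH'))
print(s.ne(s.shift()).cumsum().tolist())   # [1, 2, 2, 3, 3, 4]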
In my first table I have columns: indeks, il, start and stop. The last two define a range. I need to list (in a new table) all numbers in the range from start to stop, but also save indeks and the other values belonging to the range.
This table shows what kind of data I have (sample):
ID  Indeks  Start  Stop  il
0   A1      1      3     25
1   B1      31     55    5
2   C1      36     900   865
3   D1      900    2500  20
...
And this is the table I want to get:
Indeks  Start  Stop  il   kod
A1      1      3     25   1
A1      1      3     25   2
A1      1      3     25   3
B1      31     55    5    31
B1      31     55    5    32
B1      31     55    5    33
...
B1      31     55    5    53
B1      31     55    5    54
B1      31     55    5    55
C1      36     900   865  36
C1      36     900   865  37
C1      36     900   865  38
...
C1      36     900   865  898
C1      36     900   865  899
C1      36     900   865  900
...
EDITED:
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
output = []
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        output.append(y)
print(output)
OR
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        print(y)
Two options:
(1 - preferred) Use pandas (with openpyxl as the engine): The Excel file I'm using is named data.xlsx, and sheet Sheet1 contains your data. Then this
import pandas as pd

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df["kod"] = df[["Start", "Stop"]].apply(
    lambda row: range(row.iat[0], row.iat[1] + 1), axis=1
)
df = df.iloc[:, 1:].explode("kod", ignore_index=True)
with pd.ExcelWriter("data.xlsx", mode="a", if_sheet_exists="replace") as writer:
    df.to_excel(writer, sheet_name="Sheet2", index=False)
should produce the required output in sheet Sheet2. The work is done by putting the required range()s in the new column kod, and then .explode()-ing it.
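To see the core idea in isolation, here is a minimal sketch on an in-memory frame (no Excel involved), using the first two sample rows from the question:

import pandas as pd

lid = pd.DataFrame({'Indeks': ['A1', 'B1'], 'Start': [1, 31], 'Stop': [3, 55], 'il': [25, 5]})
lid['kod'] = lid.apply(lambda r: range(r['Start'], r['Stop'] + 1), axis=1)
print(lid.explode('kod', ignore_index=True).head(6))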
(2) Use only openpyxl:
from openpyxl import load_workbook

wb = load_workbook(filename="data.xlsx")
ws = wb["Sheet1"]

rows = ws.iter_rows(values_only=True)
# Read the required column names
data = [list(next(rows)[1:]) + ["kod"]]
for row in rows:
    # Read the input data (a row)
    base = list(row[1:])
    # Create the new data by iterating over the given range
    data.extend(base + [n] for n in range(base[1], base[2] + 1))

if "Sheet2" in wb.sheetnames:
    del wb["Sheet2"]
ws_new = wb.create_sheet(title="Sheet2")
for row in data:
    ws_new.append(row)

wb.save("data.xlsx")
I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this:
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the value that corresponds to that Department.
So if Department is 52 and Item is empty, it should be filled with 2515;
if Department is 7 and Item is empty, fill it with 45;
and the result should look like this:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi, I have used the below code and it worked:
b = [52]
df.Item = np.where(df.Department.isin(b), df.Item.fillna(2515), df.Item)
a = [7]
df.Item = np.where(df.Department.isin(a), df.Item.fillna(45), df.Item)
Hope it helps someone who faces the same issue.
The following solution first creates a map of each department and its maximum corresponding item (assuming there is one), and then matches that item to any department with a blank item. Note that in your data frame, the empty items are an empty string ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
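For reference, a shorter sketch of the same fill, reusing the Department-to-Item map built above as an alternative to the apply/str.replace steps (this assumes the blanks really are empty strings; departments with no known item are left as NaN):

import numpy as np

mapping = values.set_index('Department')['Item']
df['Item'] = df['Item'].replace("", np.nan).fillna(df['Department'].map(mapping))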
How do I get the max value from the second column and the min value from the third column of a CSV file with no header row (data shown below), by defining a function?
My code is:
import pandas as pd

def minmaxvalue(filename):
    # some code

minmaxvalue("my_data.csv")
How do I compute the max and min values inside the function?
i a b
1 33 99
2 35 100
3 37 101
4 39 102
5 41 103
6 43 104
7 45 105
8 47 106
9 49 107
10 51 108
11 53 109
12 55 110
13 57 111
14 59 112
15 61 113
import pandas as pd

def minmaxvalue(filename):
    # reading from file
    df = pd.read_csv(filename, names=['a', 'b'])
    # returning max and min
    return df['a'].max(), df['b'].min()

minmaxvalue("my_data.csv")
One way is this:
def minmaxvalue(df):
    # expects the already-loaded DataFrame; track the max of 'a' and the min of 'b'
    maxim = df['a'][0]
    minim = df['b'][0]
    for i in range(0, len(df)):
        if maxim < df['a'][i]:
            maxim = df['a'][i]
        if minim > df['b'][i]:
            minim = df['b'][i]
    return maxim, minim
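A usage sketch for this variant; note it expects an already-loaded DataFrame rather than a file name:

import pandas as pd

df = pd.read_csv("my_data.csv", names=['a', 'b'])
print(minmaxvalue(df))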
I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df_orig
accuracy time_a time_b memory_a memory_b
0 6 118 170 102 239
1 241 9 166 159 162
2 164 70 76 228 121
3 228 121 135 128 92
I wish to unpivot both the memory and time columns, obtaining this dataset as a result:
df
accuracy memory category time
0 6 102 a 118
1 241 159 a 9
2 164 228 a 70
3 228 128 a 121
12 6 239 b 170
13 241 162 b 166
14 164 121 b 76
15 228 92 b 135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()

# Unpivot memory columns
df = df.melt(id_vars=['accuracy', 'time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')

# Unpivot time columns
df = df.melt(id_vars=['accuracy', 'memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')

# Keep only the 'a'/'b' suffix as the category
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]

# Keep only the rows whose categories match (DIRTY!)
df = df[df.mem_cat == df.time_cat]

# Remove the duplicated category column
df = df.drop(columns='time_cat').rename(columns={'mem_cat': 'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])

df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time', 'memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))

print(df)
category accuracy time memory
0 a 254 109 66
1 a 98 230 83
2 a 123 57 225
3 a 113 126 73
4 b 254 126 220
5 b 98 17 106
6 b 123 214 96
7 b 113 47 32
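For comparison, the same reshape can be sketched without wide_to_long by splitting the column names into a two-level MultiIndex and stacking the a/b level (tmp and out are just illustrative names):

# alternative sketch: MultiIndex columns + stack
tmp = df_orig.set_index('accuracy')
tmp.columns = pd.MultiIndex.from_tuples([tuple(c.rsplit('_', 1)) for c in tmp.columns])
out = (tmp.stack(level=1)
          .rename_axis(['accuracy', 'category'])
          .reset_index())
print(out)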