Python: MemoryError when removing duplicate lines from a huge file [duplicate]
This question already has an answer here:
Memory error due to the huge input file size (2 answers)
Guys, I have a 38 GB text file and my system has 64 GB of RAM. I run this code to remove duplicate lines, but it raises MemoryError:
lines = open('file.txt', 'r').readlines()
lines_set = set(lines)
out = open('b.txt', 'w')
for line in lines_set:
    out.write(line)
Tags: python, memory

asked Nov 12 '18 at 14:08 by aynaz hamidi, edited Nov 12 '18 at 14:10 by Mehrdad Pedramfar

marked as duplicate by Mehrdad Pedramfar, Jonah Bishop, Klaus D., Azat Ibrakov, Jean-Paul Calderone Nov 12 '18 at 18:04
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
use with open instead of open – Rahul Agarwal, Nov 12 '18 at 14:09
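For reference, a minimal sketch of what that comment suggests, reusing the file names from the question: with open closes the files automatically when the block exits (even on an error), and iterating over the file object reads one line at a time instead of the whole file. Deduplication itself is covered in the answer below.

with open('file.txt', 'r') as infile, open('b.txt', 'w') as outfile:
    for line in infile:        # lazy iteration: one line in memory at a time
        outfile.write(line)    # this sketch only copies the file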
1 Answer
Your code loads the whole file into memory:
lines = open('file.txt', 'r').readlines()
Then it allocates more memory, scaled to the size of the file:
lines_set = set(lines)
If you want to be able to operate on files of size approaching or exceeding the amount of memory you have, you need to avoid loading the whole thing into memory at once.
One option would be to write as you read, keeping no line in memory except the one you're currently operating on, and to deduplicate using fixed-size hashes of the lines instead of the lines themselves.
For example:
from hashlib import sha256

seen = set()
with open('file.txt', 'r') as infile:
    with open('b.txt', 'w') as outfile:
        for line in infile:
            # sha256 needs bytes, and we store the 32-byte digest
            # (hash objects themselves compare by identity, not value)
            h = sha256(line.encode('utf-8')).digest()
            if h in seen:
                continue
            seen.add(h)
            outfile.write(line)
This still requires that the hashes of all unique lines fit in memory; however, that is closer to 32 bytes per line, since each sha256 digest is 32 bytes. Depending on the length of the lines in your file, this may or may not be good enough. If it is not, you can move the seen set to secondary storage, i.e. disk. You may want to keep a preliminary filter in main memory (i.e. RAM) for performance reasons: for example, keep a set of the first 4 or 8 bytes of each sha256 digest in memory, and consult the on-disk seen set only when a prefix is found in the in-memory set.
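A rough sketch of that two-level scheme; the answer doesn't name an on-disk store, so the stdlib dbm module here is an illustrative choice:

from hashlib import sha256
import dbm

prefixes = set()  # in-memory filter: first 8 bytes of each digest

with dbm.open('seen.db', 'n') as seen, \
        open('file.txt', 'r') as infile, \
        open('b.txt', 'w') as outfile:
    for line in infile:
        h = sha256(line.encode('utf-8')).digest()
        prefix = h[:8]
        # Only pay for a disk lookup when the cheap in-memory
        # filter says this prefix has been seen before.
        if prefix in prefixes and h in seen:
            continue
        prefixes.add(prefix)
        seen[h] = b''
        outfile.write(line)

The in-memory set still grows with the number of unique lines, but at 8 bytes of payload per line instead of 32, and the full digests only cost a disk lookup on the rare prefix hit.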
answered Nov 12 '18 at 14:11, edited Nov 12 '18 at 14:17 – Jean-Paul Calderone

NameError: name 'sha256' is not defined – aynaz hamidi, Nov 12 '18 at 14:21
after 512MG it raises MemoryError – aynaz hamidi, Nov 12 '18 at 14:31
I don't know what a "MG" is. – Jean-Paul Calderone, Nov 12 '18 at 18:04