Python: MemoryError when removing duplicate lines from a large file [duplicate]












This question already has an answer here:




  • Memory error due to the huge input file size

    2 answers




I have a text file that is 38 GB in size, and my system has 64 GB of RAM.
I run this code to remove duplicate lines, but it raises a MemoryError:



lines = open('file.txt', 'r').readlines()
lines_set = set(lines)
out = open('b.txt', 'w')
for line in lines_set:
    out.write(line)









marked as duplicate by Mehrdad Pedramfar, Jonah Bishop, Klaus D., Azat Ibrakov, Jean-Paul Calderone Nov 12 '18 at 18:04


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
















  • use with open instead of open

    – Rahul Agarwal
    Nov 12 '18 at 14:09
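
As an aside, the comment's suggestion looks like the sketch below. It improves resource handling (the file is closed even if an error occurs), but on its own it does not address the MemoryError, since readlines() still loads the whole file:

# Context-manager form of the OP's first line; closes the file reliably.
# Note: readlines() still reads the entire 38 GB file into memory.
with open('file.txt', 'r') as f:
    lines = f.readlines()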
















python memory









asked Nov 12 '18 at 14:08 by aynaz hamidi; edited Nov 12 '18 at 14:10 by Mehrdad Pedramfar


1 Answer
Your code loads the whole file into memory:




lines = open('file.txt', 'r').readlines()




Then it allocates more memory, scaled to the size of the file:




lines_set = set(lines)




If you want to be able to operate on files of size approaching or exceeding the amount of memory you have, you need to avoid loading the whole thing into memory at once.



One option would be to write as you read, avoiding storing any line except the one you're operating on in memory, and perform deduplication using hashes instead of exact equality testing.



For example:



from hashlib import sha256

seen = set()
with open('file.txt', 'r') as infile:
    with open('b.txt', 'w') as outfile:
        for line in infile:
            # sha256 needs bytes, and .digest() gives a compact 32-byte key.
            h = sha256(line.encode('utf-8')).digest()
            if h in seen:
                continue
            seen.add(h)
            outfile.write(line)


This still requires that the hashes of all unique lines fit in memory; however, a SHA-256 digest is only 32 bytes per line. Depending on the length of the lines in your file, this may or may not be good enough. If it is not, you can move the seen set to secondary storage, i.e., disk. For performance, you may then want to keep a preliminary filter in main memory (RAM): for example, keep a set of the first 4 or 8 bytes of each SHA-256 digest in memory, and only consult the on-disk seen set when a prefix is already present in the in-memory set.
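
To make that last suggestion concrete, here is a minimal sketch of the two-level scheme. The answer does not prescribe a particular on-disk store; the stdlib dbm module, the 'seen.db' filename, and the 8-byte prefix are assumptions made for illustration:

# Two-level "seen" structure: an in-memory prefix filter in front of an
# on-disk set of full digests. dbm here is an assumption; any disk-backed
# key-value store (sqlite3, LMDB, ...) would serve.
import dbm
from hashlib import sha256

with dbm.open('seen.db', 'n') as seen_on_disk, \
        open('file.txt', 'r') as infile, \
        open('b.txt', 'w') as outfile:
    prefixes = set()  # first 8 bytes of every digest written so far
    for line in infile:
        digest = sha256(line.encode('utf-8')).digest()
        prefix = digest[:8]
        if prefix in prefixes and digest in seen_on_disk:
            continue  # full digest already recorded: duplicate line
        prefixes.add(prefix)
        seen_on_disk[digest] = b'1'  # membership marker; value is unused
        outfile.write(line)

With 8-byte prefixes the in-memory set costs on the order of 8 bytes per unique line (plus set overhead), and the on-disk store is consulted only when a prefix has already been seen.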






answered Nov 12 '18 at 14:11 by Jean-Paul Calderone (edited Nov 12 '18 at 14:17)
  • NameError: name 'sha256' is not defined

    – aynaz hamidi
    Nov 12 '18 at 14:21











  • after 512MG type MemoryError

    – aynaz hamidi
    Nov 12 '18 at 14:31











  • I don't know what a "MG" is.

    – Jean-Paul Calderone
    Nov 12 '18 at 18:04

















