Python: removing duplicate lines raises MemoryError [duplicate]

-4

This question already has an answer here:

  • Memory error due to the huge input file size

    2 answers

I have a text file that is 38 GB, and my system has 64 GB of RAM. When I run this code to remove duplicate lines, it raises a MemoryError:

lines = open('file.txt', 'r').readlines()
lines_set = set(lines)
out = open('b.txt', 'w')
for line in lines_set:
    out.write(line)

marked as duplicate by Mehrdad Pedramfar, Jonah Bishop, Klaus D., Azat Ibrakov, Jean-Paul Calderone Nov 12 '18 at 18:04


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

  • use with open instead of open

    – Rahul Agarwal
    Nov 12 '18 at 14:09
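
A minimal sketch of that suggestion applied to the code above (with guarantees that the files are closed, but by itself it does not fix the MemoryError, since every line is still read into memory):

with open('file.txt', 'r') as f:
    lines_set = set(f.readlines())  # still loads the entire file into memory

with open('b.txt', 'w') as out:
    for line in lines_set:
        out.write(line)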

python memory






edited Nov 12 '18 at 14:10

Mehrdad Pedramfar

asked Nov 12 '18 at 14:08

aynaz hamidi




1 Answer
1

Your code loads the whole file into memory:




lines = open('file.txt', 'r').readlines()




Then it allocates more memory, scaled to the size of the file:




lines_set = set(lines)
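
As a rough illustration of the scale involved (assuming CPython 3; the exact numbers vary by version and platform), each line carries substantial per-object overhead on top of its text, so a 38 GB file can need well over 64 GB once loaded this way:

import sys

line = 'x' * 40 + '\n'      # a typical ~41-character line
print(sys.getsizeof(line))  # about 90 bytes: ~49 bytes of str overhead + the text

# readlines() pays that overhead for every line in the file, and set(lines)
# then adds a hash-table entry per unique line on top of it.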




If you want to be able to operate on files of size approaching or exceeding the amount of memory you have, you need to avoid loading the whole thing into memory at once.



One option would be to write as you read, keeping no line in memory except the one currently being processed, and performing the deduplication on hashes of the lines rather than on the lines themselves.



For example:



from hashlib import sha256

seen = set()
with open('file.txt', 'r') as infile:
    with open('b.txt', 'w') as outfile:
        for line in infile:
            # Hash the line rather than storing it; encode to bytes and
            # take the digest so the set holds comparable 32-byte values.
            h = sha256(line.encode()).digest()
            if h in seen:
                continue
            seen.add(h)
            outfile.write(line)


This still requires that the hashes of all unique lines fit in memory, but now at roughly 32 bytes per unique line rather than the full line. Depending on how long the lines in your file are, this may or may not be enough. If it is not, you can move the seen set to secondary storage, i.e. disk. For performance reasons you will probably want to keep a preliminary filter in main memory (i.e. RAM): for example, keep a set of only the first 4 or 8 bytes of each sha256 digest in memory, and consult the on-disk seen set only when that prefix is already present in the in-memory set.
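
A rough sketch of that two-level scheme, using the standard library's dbm module as a stand-in for the on-disk set (the database file name and the 8-byte prefix length here are illustrative choices, not requirements):

import dbm
from hashlib import sha256

prefixes = set()  # in-memory preliminary filter: first 8 bytes of each digest

# dbm provides a persistent on-disk key-value store; only the keys matter here.
with dbm.open('seen.db', 'c') as seen, \
        open('file.txt', 'r') as infile, \
        open('b.txt', 'w') as outfile:
    for line in infile:
        digest = sha256(line.encode()).digest()
        prefix = digest[:8]
        # Pay for the slower on-disk lookup only when the cheap in-memory
        # filter says this prefix may have been seen before.
        if prefix in prefixes and digest in seen:
            continue
        prefixes.add(prefix)
        seen[digest] = b''
        outfile.write(line)

The in-memory prefix set answers most never-seen-before lookups without touching the disk; only true duplicates and the occasional prefix collision pay the cost of the on-disk check.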






edited Nov 12 '18 at 14:17

answered Nov 12 '18 at 14:11

Jean-Paul Calderone

  • NameError: name 'sha256' is not defined

    – aynaz hamidi
    Nov 12 '18 at 14:21











  • after 512MG type MemoryError

    – aynaz hamidi
    Nov 12 '18 at 14:31











  • I don't know what a "MG" is.

    – Jean-Paul Calderone
    Nov 12 '18 at 18:04