Problem description
With multiprocessing, multiple processes are started to write multiple files. After the run finishes, the *.gz files exist on disk, but their content is empty.
Background and what I have tried
- When the input gzip files are small, the output gzip files are empty.
- When the input gzip files are large, the output gzip files are normal.
Related code
I want to distribute the results randomly across 10 gzip files, so I first create a list to hold the file handles:
```python
import gzip
from multiprocessing import Pool
from random import randint
from Bio import SeqIO

gzip_files = []
for i in range(10):
    gzip_files.append(gzip.open(str(i) + ".gz", "wt"))
```
Pick suitable reads from an input file and write each one to a randomly chosen gzip file:
```python
def choose_read_gz(file, gzip_files, cutoff, read_format):
    with gzip.open(file, "rt") as handle:
        for read in SeqIO.parse(handle, read_format):
            if len(read.seq) > cutoff:
                gzip_files[randint(0, 9)].write(read.format(read_format))
```
There are many input files, so I start multiple processes:
```python
with Pool(16) as pool:
    for file in files:
        pool.apply_async(choose_read_gz, args=(file, gzip_files, cutoff, read_format))
    pool.close()
    pool.join()
```
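For reference, here is a minimal standalone sketch of the same pattern (not my original code; it assumes the `fork` start method on Linux, where child processes inherit the parent's open handles). The write happens in the child's copy of the buffered handle, so the parent's later close() never sees it:

```python
import gzip
import multiprocessing as mp

f = None  # set in the parent, inherited by the child via fork

def child_write():
    # This write lands in the child's copy of the buffered handle,
    # never in the parent's buffer.
    f.write("written in the child\n")

if __name__ == "__main__":
    mp.set_start_method("fork")  # handles are only inherited with fork
    f = gzip.open("demo.gz", "wt")
    p = mp.Process(target=child_write)
    p.start()
    p.join()
    f.close()  # flushes the parent's buffer, which never saw the child's write
```

On my machine, demo.gz decompresses to an empty file, just like the outputs described below.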
Finally, close the files:
```python
for gzip_file in gzip_files:
    gzip_file.close()
```
What result do you expect? What error do you actually see?
- When the input gzip files are small, each output gzip is empty and only 27 B in size.
- When the input gzip files are large, the output gzip content is normal.
- When flushing after every write, the output gzip files and their content are normal:
```python
def choose_read_gz(file, gzip_files, cutoff, read_format):
    with gzip.open(file, "rt") as handle:
        for read in SeqIO.parse(handle, read_format):
            if len(read.seq) > cutoff:
                filehandle = gzip_files[randint(0, 9)]
                filehandle.write(read.format(read_format))
                filehandle.flush()
```
- When all handles are only flushed (and closed) at the end in the parent, the output gzip files are still empty:
```python
for gzip_file in gzip_files:
    gzip_file.flush()
    gzip_file.close()
```
Why does every write have to be force-flushed to reach the hard disk, and why doesn't closing the file handles at the end write the data to the hard disk?
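For comparison, a sketch of a workaround I am considering (hypothetical, not the code above): each worker opens, writes, and closes its own output files, so every handle is owned by the process that actually writes to it:

```python
import gzip
import uuid
from random import randint
from Bio import SeqIO

def choose_read_gz_local(file, cutoff, read_format):
    # Hypothetical variant: open the outputs inside the worker, so the
    # same process writes, flushes, and closes them.
    tag = uuid.uuid4().hex  # unique prefix per task to avoid name collisions
    outs = [gzip.open("{}_{}.gz".format(tag, i), "wt") for i in range(10)]
    with gzip.open(file, "rt") as handle:
        for read in SeqIO.parse(handle, read_format):
            if len(read.seq) > cutoff:
                outs[randint(0, 9)].write(read.format(read_format))
    for out in outs:
        out.close()  # closed by the writing process, so data reaches disk
```

The per-task files for each index could then be concatenated afterwards (e.g. `cat *_0.gz > 0.gz`), since concatenated gzip streams are themselves a valid gzip file.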