
When you work with text data, you often need to split it into training and test sets. This is standard practice in machine learning, and in particular in NLP with deep learning. The catch is that the split has to happen at the sample level: every sample must land whole in either the training set or the test set, so that your data stays consistent.
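For flat datasets with one sample per row, a library call is usually enough. Here is a minimal sketch using scikit-learn (the sample list is made up for illustration):

from sklearn.model_selection import train_test_split

samples = ["phrase one", "phrase two", "phrase three", "phrase four", "phrase five"]
# train_size=0.8 keeps roughly 80% of the samples for training
train, test = train_test_split(samples, train_size=0.8, shuffle=True, random_state=42)

With an IOB file this does not work directly, because a single sample spans several lines; you first have to group the lines into phrases, which is exactly what the script at the end of this post does.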

[figure: iob_train_test_split]

When working with Named Entity Recognition (NER) you typically deal with the IOB format, a very common scheme for tagging the words in text data. It comes with a catch, though: the data is stored in a rather rudimentary, old-fashioned way:

IOB – Inside-Outside-Beginning

  • B-tag -> the token is the beginning of a chunk
  • I-tag -> the token is inside a chunk
  • O-tag -> the token belongs to no chunk

So you basically have a single line per word, and each sample (a phrase, in our case) is delimited by an empty line (“\n”).

Last B-DATE
year I-DATE
, O
8.3 B-QUANTITY
million I-QUANTITY
passengers O
flew O
the O
airline O
, O
down O
4 B-PERCENT
percent I-PERCENT
from O
2007 B-DATE
. O

Everyone O
knows O
about O
ADT, B-ORG
longtime O
home O
security O
system O
provider O
in O
the B-GPE
U.S. I-GPE

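To make that structure concrete, here is a minimal sketch (not part of the script further down; the function name is mine) of how such a file can be parsed into a list of phrases, each one a list of (token, tag) pairs:

def read_iob_phrases(path):
    """Parse an IOB file into a list of phrases of (token, tag) pairs."""
    phrases, phrase = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == '':  # a blank line ends the current phrase
                if phrase:
                    phrases.append(phrase)
                    phrase = []
            else:
                # e.g. 'Last B-DATE' -> ('Last', 'B-DATE')
                token, tag = line.rsplit(maxsplit=1)
                phrase.append((token, tag))
    if phrase:  # keep the last phrase if there is no trailing blank line
        phrases.append(phrase)
    return phrases

# e.g. read_iob_phrases('data.iob')[0][0] == ('Last', 'B-DATE')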
The full helper script below shuffles an IOB file and splits it into training and test sets:

"""
Contact me:
e-mail: enrique@enriquecatala.com
Linkedin: https://www.linkedin.com/in/enriquecatala/
Web: https://enriquecatala.com
Twitter: https://twitter.com/enriquecatala
Support: https://github.com/sponsors/enriquecatala
Youtube: https://www.youtube.com/enriquecatala
Description
----
Helper script to shuffle a file in [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) and randomly split its phrases into train and test sets.
This is typically needed when working with NLP tasks such as Named Entity Recognition.
NOTE: Please be aware that the split is done in memory, so if the file does not fit in memory the script will be slow or crash.
Input
----
input_file
File with the IOB tagged phrases
percentage_train
Percentage of the data to be used for training
output_folder
Folder where the output files will be saved
Output
----
Two files inside the output folder:
train.iob
test.iob
Phrases will be randomly split between the train and test sets.
"""
import sys
import argparse
import random

if __name__ == '__main__':
    argparser = argparse.ArgumentParser()
    argparser.add_argument('input_file', type=str, help='File with the IOB-tagged phrases')
    # nargs='?' makes these positionals optional so that the defaults actually apply
    argparser.add_argument('percentage_train', type=float, nargs='?', default=0.8, help='Percentage of the data used for training')
    argparser.add_argument('output_folder', type=str, nargs='?', default='./', help='Folder where the output files will be saved')

    # check if we are running under a debugger
    gettrace = getattr(sys, 'gettrace', None)
    if gettrace is not None and gettrace():
        # simulate the command-line arguments we expect
        argv = ["/home/enrique/git/roberta-finance-NER/data/ECB/training/checkpointLabelstudio.iob",
                "0.8",
                "/home/enrique/git/roberta-finance-NER/data/ECB/training/"]
        args = argparser.parse_args(argv)
    else:
        args = argparser.parse_args()

    input_file = args.input_file
    percentage_train = args.percentage_train
    output_folder = args.output_folder

    # read the input file and group its lines into phrases (a blank line closes a phrase)
    with open(input_file, 'r') as f:
        lines = f.readlines()

    phrases = []
    phrase = []
    num_phrases = 0
    for line in lines:
        phrase.append(line)
        if line.strip() == '':
            num_phrases += 1
            phrases.append(phrase)
            phrase = []

    # keep the final phrase if the file does not end with a blank line
    if phrase:
        phrases.append(phrase + ['\n'])
        num_phrases += 1

    print(f'Number of phrases detected: {num_phrases}')

    # how many phrases go to the training set
    train_phrases = int(num_phrases * percentage_train)
    print(f"Train phrases: {train_phrases}")

    test_phrases = num_phrases - train_phrases
    print(f"Test phrases: {test_phrases}")

    print("Shuffling phrases randomly...")
    random.shuffle(phrases)

    print(f"Writing data to folder: {output_folder}")

    # the first train_phrases phrases go to train.iob, the rest to test.iob
    train_phrases_written = 0
    with open(f'{output_folder}/train.iob', 'w') as f_train, \
         open(f'{output_folder}/test.iob', 'w') as f_test:
        for phrase in phrases:
            if train_phrases_written < train_phrases:
                f_train.write(''.join(phrase))
                train_phrases_written += 1
            else:
                f_test.write(''.join(phrase))

    print("Done!")
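If you save the script as, say, iob_train_test_split.py (the filename and paths below are just examples), a typical run looks like this:

python iob_train_test_split.py data/all_phrases.iob 0.8 data/splits/

This shuffles the phrases and writes roughly 80% of them to data/splits/train.iob and the remaining 20% to data/splits/test.iob. Because the shuffle happens at the phrase level, every phrase ends up intact in exactly one of the two files.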

Enjoy!