Help needed for Python NLP Named Entity Recognition

Nipponho

Senior Member
Joined
Nov 18, 2011
Messages
647
Reaction score
10
Help! I need to come up with Python code to extract the name, designation, company, address, zip code, telephone number, country, city and email address from any input corpus of text strings.

I managed to figure out the code to extract phone numbers, email addresses and names, and will figure out the rest (country, city, zip code, etc.) later. Unfortunately I am stuck now, as my Python skills are very elementary. Can anyone help me amend my code so that I can input any corpus and get the phone numbers, email addresses and names as outputs?

Thanks!
 

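import re
import nltk
from nltk.corpus import stopwords
from nltk import sent_tokenize
# Resources for the tokenizer, POS tagger, stop-word list and NE chunker.
# Newer NLTK releases may ask for other names; use the one in the LookupError message.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')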


stop = stopwords.words('english')


document = """
Hey,
This week has been crazy. Attached is my report on IBM. Can you give it a quick read and provide some feedback.
Also, make sure you reach out to Claire (claire@xyz.com).
You're the best.
Cheers,
George W.
212-555-1234
"""



# Remove stop words
document = ' '.join([i for i in document.split() if i not in stop])

# Segment the corpus into sentences.
sentences = sent_tokenize(document)

# Tokenize each sentence into an array of words.
sentences = [nltk.word_tokenize(sent) for sent in sentences]

# POS Tag each word in each sentence
sentences = [nltk.pos_tag(sent) for sent in sentences]


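def extract_phone_numbers(text):
    # Expects a plain string. Matches 10-digit, (xxx) xxx-xxxx and 7-digit numbers.
    r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
    phone_numbers = r.findall(text)
    # Strip everything except the digits from each match.
    return [re.sub(r'\D', '', number) for number in phone_numbers]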

def extract_email_addresses(text):
    # Expects a plain string, not the POS-tagged sentences.
    r = re.compile(r'[\w\.-]+@[\w\.-]+')
    return r.findall(text)

def extract_names(sentences):
    # Expects a list of POS-tagged sentences, as produced by nltk.pos_tag.
    names = []
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            if isinstance(chunk, nltk.Tree):
                if chunk.label() == 'PERSON':
                    names.append(' '.join([c[0] for c in chunk]))
    return names

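if __name__ == '__main__':
    # The regex extractors need the text as a string; only the name
    # extractor works on the POS-tagged sentences.
    numbers = extract_phone_numbers(document)
    emails = extract_email_addresses(document)
    names = extract_names(sentences)
    print(numbers, emails, names)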
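
One possible way to make the regex parts accept any input text is to wrap them in a single function that takes a string. This is only a rough sketch: it reuses the two patterns from the code above, and the names extract_contacts, PHONE_RE, EMAIL_RE and sample are made up for illustration. Name extraction is left out here because it needs the NLTK pipeline and its downloaded resources.

```python
import re

# Same patterns as in the post; the constant and function names are hypothetical.
PHONE_RE = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def extract_contacts(text):
    """Return phone numbers (digits only) and email addresses found in text."""
    phones = [re.sub(r'\D', '', match) for match in PHONE_RE.findall(text)]
    emails = EMAIL_RE.findall(text)
    return {'phones': phones, 'emails': emails}


# Any string can be passed in, for example the contents of a file read with open().
sample = "Reach Claire (claire@xyz.com) or call 212-555-1234."
result = extract_contacts(sample)
print(result)
```

Note that the email pattern also accepts dots, so an address that ends a sentence would be returned with the trailing full stop attached; stripping trailing punctuation from each match may be needed on real text.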
 