Python - Remove Wikipedia synonym brackets
I want to remove Wikipedia synonym brackets from text.
Here is an easy one to do:
he is [[korean]].
I can simply remove the brackets.
Here is a difficult one:
he lives in [[gimhae city|gimhae]].
The first part (gimhae city) is the Wikipedia document title,
so I want to keep only the second part inside the brackets.
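To make the two cases concrete, here is a minimal sketch of the simple bracket removal described above. The helper name strip_brackets_naively is only for illustration; it handles the easy case but leaves the document title and the pipe behind in the difficult one.

```python
def strip_brackets_naively(text):
    # Delete every "[[" and "]]". This is enough for [[korean]],
    # but the "title|" part of a piped link survives.
    return text.replace("[[", "").replace("]]", "")


print(strip_brackets_naively("he is [[korean]]."))
# he is korean.
print(strip_brackets_naively("he lives in [[gimhae city|gimhae]]."))
# he lives in gimhae city|gimhae.
```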
Any suggestions are welcome.
You can use the following regex:
\[{2}(?:[^|\]]*\|)?([^]]*)]{2}
and replace with \1.
Here the regex matches:
\[{2} - 2 opening square brackets
(?:[^|\]]*\|)? - 0 or 1 sequence of characters other than | and ] (with [^|\]]*), followed by a literal | (\|; note it is escaped outside of a character class)
([^]]*) - matches and captures into group 1 (which we'll reference later with \1) 0 or more characters other than a closing square bracket
]{2} - 2 closing square brackets (note that we do not have to escape them here, since a ] outside of a character class is a literal and the first [ is escaped).
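The breakdown above can also be written into the pattern itself. The following is a sketch using Python's re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments; the name WIKI_LINK is an assumption made for this example.

```python
import re

# Same pattern as above, spread over several lines with comments.
WIKI_LINK = re.compile(r"""
    \[{2}              # two opening square brackets
    (?:[^|\]]*\|)?     # optional title part: anything but | and ], then a literal |
    ([^]]*)            # group 1: the text to keep, anything but ]
    ]{2}               # two closing square brackets
""", re.VERBOSE)

# Group 1 holds the part after the pipe when a pipe is present...
print(WIKI_LINK.search("[[gimhae city|gimhae]]").group(1))
# gimhae

# ...and the whole content when there is no pipe.
print(WIKI_LINK.search("[[korean]]").group(1))
# korean
```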
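The Python snippet:

    import re

    # Compile the pattern once; group 1 is the text to keep.
    p = re.compile(r'\[{2}(?:[^|\]]*\|)?([^]]*)]{2}')
    test_str = "he lives in [[gimhae city|gimhae]]. lives in [[gimhae]]. "
    # Replace each bracketed link with the contents of group 1.
    result = re.sub(p, r"\1", test_str)
    print(result)
    # => he lives in gimhae. lives in gimhae.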
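If this is applied to many strings, it may help to wrap the substitution in a function and check a few inputs the answer does not discuss. The sketch below assumes a helper named unlink (a name chosen for this example only). Note that with more than one pipe inside the brackets, only the part before the first pipe is dropped.

```python
import re

WIKI_LINK_RE = re.compile(r'\[{2}(?:[^|\]]*\|)?([^]]*)]{2}')


def unlink(text):
    # Replace every [[title|label]] or [[label]] with the captured label.
    return WIKI_LINK_RE.sub(r"\1", text)


print(unlink("no links here"))   # text without links is left unchanged
# no links here
print(unlink("[[a]][[b|c]]"))    # adjacent links are handled one by one
# ac
print(unlink("[[x|y|z]]"))       # only the part before the first | is removed
# y|z
```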